Lecture 2: Forecasting [FINISHED]
Machine Learning is generally about prediction (i.e., we want good values for $\hat{y}$) though later in the semester we will see how ML tools can be used to address casaulity as well ($\hat{\beta}$). This notebook is a brief introduction to the time series models, identification, and methods. Often forecasting comes down to being able to estimate time-series models well.
The agenda for today's lecture is as follows:
import pandas as pd                    # for data handling
import numpy as np                     # for numerical methods and data structures
import matplotlib.pyplot as plt        # for plotting
from matplotlib.ticker import FormatStrFormatter
import seaborn as sea                  # advanced plotting
1. Static Models (top)¶
Suppose that we have time series data available on two variables, say y and z, where $y_t$ and $z_t$ mean the data are contemporaneous (eg, same year). A static model relating y to z is
$$ y_t = \beta_0 + \beta_1z_t + \epsilon_{t} $$where $t=1,...,T$. We call this kind of model "static" since we think a change in z at time t is believed to have an immediate effect on y. Static regression models are also used when we are interested in knowing the tradeoff between y and z. While the models in the previous lecture were based on the cross-section (ie, we look at two different people which are observationally equivalent but one has more education and we look to see if one sleeps more than the other), data variation here is based on time.
Just as in the cross-sectional case, the usual inference procedures are only as good as the underlying assumptions. The classical linear model assumptions for time series data are much more restrictive than those for the cross-sectional data—in particular, the strict exogeneity and no serial correlation assumptions can be unrealistic. Nevertheless, this is a good framework to start.
Example: 3-month Treasuries¶
The data in INTDEF.dta come from the 1997 Economic Report of the President and span
the years 1948 through 1996. The variable i3 is the three-month T-bill rate, inf is the annual
inflation rate based on the consumer price index (CPI), and def is the federal budget deficit
as a percentage of GDP. We're interested in understanding the connection between inflation and treasuries.
Let's load the data and clean things up a bit:
df = pd.read_stata('./Data/INTDEF.dta')
df.rename(columns={'i3':'treasuries','inf':'inflation','def':'deficit','def_1':'deficit_lag'}, inplace=True)
df.head()
| year | treasuries | inflation | rec | out | deficit | i3_1 | inf_1 | deficit_lag | ci3 | cinf | cdef | y77 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1948 | 1.04 | 8.1 | 16.4 | 11.7 | -4.700000 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 
| 1 | 1949 | 1.10 | -1.2 | 14.6 | 14.4 | -0.200001 | 1.04 | 8.1 | -4.700000 | 0.06 | -9.3 | 4.499999 | 0 | 
| 2 | 1950 | 1.22 | 1.3 | 14.5 | 15.6 | 1.100000 | 1.10 | -1.2 | -0.200001 | 0.12 | 2.5 | 1.300001 | 0 | 
| 3 | 1951 | 1.55 | 7.9 | 16.1 | 14.2 | -1.900001 | 1.22 | 1.3 | 1.100000 | 0.33 | 6.6 | -3.000001 | 0 | 
| 4 | 1952 | 1.77 | 1.9 | 18.9 | 19.4 | 0.500000 | 1.55 | 7.9 | -1.900001 | 0.22 | -6.0 | 2.400001 | 0 | 
We're interested in the connection between inflation, treasuries, and the deficit so we estimate the following model:
$$ \text{inflation}_t = \beta_0 + \beta_1\text{treasuries}_t + \beta_2\text{deficit}_t + \epsilon_t $$We can estimate this equation using a standard OLS estimator:
import statsmodels.formula.api as smf
print(smf.ols('inflation ~ treasuries + deficit', data=df).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              inflation   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.570
Method:                 Least Squares   F-statistic:                     32.87
Date:                Fri, 14 Oct 2022   Prob (F-statistic):           1.36e-09
Time:                        07:06:10   Log-Likelihood:                -104.01
No. Observations:                  49   AIC:                             214.0
Df Residuals:                      46   BIC:                             219.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4382      0.595      0.737      0.465      -0.759       1.635
treasuries     0.9579      0.118      8.092      0.000       0.720       1.196
deficit       -0.6399      0.172     -3.723      0.001      -0.986      -0.294
==============================================================================
Omnibus:                        6.677   Durbin-Watson:                   0.962
Prob(Omnibus):                  0.035   Jarque-Bera (JB):                6.664
Skew:                           0.899   Prob(JB):                       0.0357
Kurtosis:                       2.833   Cond. No.                         12.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
These estimates show that increases in inflation and the relative size of the deficit work together to increase short-term interest rates, both of which are expected from basic economics. For example, a ceteris paribus one percentage point increase in the treasury rate is correlated with a 0.95 point increase in inflation while a ceteris paribus increase in the deficit is correlated with a 0.63 point decrease in inflation. Both are very statistically significant but the coefficient on deficit is counter-intuitive.
2. Lagged Independent Variables (top)¶
It's not hard to write down a model (or use basic intuition) to see that inflation is a function past variables since people make purchase decisions basd on their expectations of the future economy. In a finite distributed lag (FDL) model, we allow one or more variables to affect y with a time lag.
df['treasuries_lag'] = df['treasuries'].shift(1)  # create lagged treasuries
print(smf.ols('inflation ~ treasuries + treasuries_lag + deficit + deficit_lag', data=df).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              inflation   R-squared:                       0.614
Model:                            OLS   Adj. R-squared:                  0.578
Method:                 Least Squares   F-statistic:                     17.07
Date:                Fri, 14 Oct 2022   Prob (F-statistic):           1.87e-08
Time:                        07:06:10   Log-Likelihood:                -100.04
No. Observations:                  48   AIC:                             210.1
Df Residuals:                      43   BIC:                             219.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.0900      0.612      0.147      0.884      -1.144       1.324
treasuries         0.8456      0.243      3.476      0.001       0.355       1.336
treasuries_lag     0.1175      0.268      0.438      0.664      -0.424       0.659
deficit           -0.5750      0.285     -2.021      0.050      -1.149      -0.001
deficit_lag        0.0677      0.216      0.314      0.755      -0.367       0.502
==============================================================================
Omnibus:                        8.592   Durbin-Watson:                   0.759
Prob(Omnibus):                  0.014   Jarque-Bera (JB):                8.300
Skew:                           1.010   Prob(JB):                       0.0158
Kurtosis:                       3.266   Cond. No.                         18.6
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Hmmm... lags don't seem to matter. I think we need a better theory to motivate this regression!
Example: Fertility and Economic Conditions¶
People write papers documenting that economic cycles are connected to fertility: When times are good, people have kids. I wonder how policy might explicitly be connectd to fertility. Let's find out. Consider the model
$$ \text{gfr}_t=\beta_0 + \beta_1\text{pe}_t + \beta_2\text{pe}_{t-1} + \beta_3\text{pe}_{t-2} + \epsilon_t $$where $\text{gfr}_t$ is the general fertility rate (children born per 1,000 women of childbearing age) and $\text{pe}_t$ is the real dollar value of the personal tax exemption. The idea is to see whether, in the aggregate, the decision to have children is linked to the tax value of having a child. The above equation recognizes that, for both biological and behavioral reasons, decisions to have children would not immediately result from changes in the personal tax exemption (or a change in disposable income more generally).
from statsmodels.iolib.summary2 import summary_col  # Import a tool to make regression tables
df = pd.read_stata("./Data/FERTIL3.dta")
regf = smf.ols('gfr ~ pe + ww2 + pill', data=df).fit()
tsregf = smf.ols('gfr ~ pe + pe_1 + pe_2 + ww2 + pill', data=df).fit()
print(summary_col([regf, tsregf],stars=True,float_format='%0.3f',
                  model_names=['Static\n(b/se)','Lags\n(b/se)'],
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared), 
                           'Adj.R2':lambda x: "{:.3f}".format(x.rsquared_adj)}))
====================================
                 Static      Lags   
                 (b/se)     (b/se)  
------------------------------------
Intercept      98.682***  95.870*** 
               (3.208)    (3.282)   
R-squared      0.473      0.499     
R-squared Adj. 0.450      0.459     
pe             0.083***   0.073     
               (0.030)    (0.126)   
pe_1                      -0.006    
                          (0.156)   
pe_2                      0.034     
                          (0.126)   
pill           -31.594*** -31.305***
               (4.081)    (3.982)   
ww2            -24.238*** -22.126** 
               (7.458)    (10.732)  
N              72         70        
R2             0.473      0.499     
Adj.R2         0.450      0.459     
====================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
Static Model¶
Let's take a look at Static model. Each variable is statistically significant at the 1% level against a two-sided alternative. We see that the fertility rate was lower during World War II: i.e., given pe, there were about 24
fewer births for every 1,000 women of childbearing age, which is a large reduction since From 1913 through 1984, gfr ranged from about 65 to 127. Similarly, the fertility rate has been substantially lower since the introduction of the birth control pill (i.e., negative sign and statistically signifcant).
The variable of economic interest is pe. The average pe over this time period is \$100.40 and ranges from zero to \\$243.83. The coefficient on pe implies that a 12-dollar increase in pe increases gfr by about one birth per 1,000 women of childbearing age -- that's pretty big!
Lags Model¶
Now look at the Lags model. In this regression, we only have 70 observations because we lose two when we lag pe twice. The coefficients on the pe variables are estimated very imprecisely -- each one is individually insignificant. It turns out that there is substantial correlation between the lags (why?) this multicollinearity makes it difficult to estimate the effect at each lag.
hypotheses = '(pe = pe_1  = pe_2 = 0)'
f_test = tsregf.f_test(hypotheses)
print(f_test)
<F test: F=3.972964046978317, p=0.011652005303129494, df_denom=64, df_num=3>
We get an F statistic for joint-significant indicating with p-value of 1.2% so there's something going on but our crude lagged interpretation is not quite capturing it. Maybe we can model this better. Perhaps include a linear or exponential time trend.
3. Time Trends and Seasonality (top)¶
Many economic time series have a common tendency of growing over time. We must therefore recognize that some series contain a time trend in order to draw causal inference using time series data. Ignoring the fact that two sequences are trending in the same or opposite directions can lead us to falsely conclude that changes in one variable are actually caused by changes in another variable. In many cases, two time series processes appear to be correlated only because they are both trending over time for reasons related to other unobserved factors.
Let's load some data to illustrate this point:
df = pd.read_stata('./Data/EARNS.dta')
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(df.year, df['outphr'], color = 'red')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('')
ax.set_title('Labor Productivity (Output per Hour Worked)',size=28)
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
The above presents labor productivity (output per hour of work) in the US for the years 1947 through 1987. This series displays a clear upward trend reflecting the fact that workers have become more productive over time. What kind of statistical models adequately capture trending behavior?
$$ y_t = \beta_0 + \beta_1y_{t-1} + \epsilon_t $$The intrepretation is straightforward: If $\beta_1>0$, then, on average, $y_t$ is growing over time and therefore has an upward trend. Of course, the opposite is true when $\beta_1<0$. When $\beta_1=1$, we say this process exhibits a unit root -- the best prediction for the future is today's value.
Seasonality¶
If a time series is observed at monthly or quarterly intervals (or even weekly or daily), it may exhibit seasonality. Some examples:
- Monthly housing starts in the Midwest are strongly influenced by weather. While weather patterns are somewhat random, we can be sure that the weather during January will usually be more inclement than in June, and so housing starts are generally higher in June than in January.
- Retail sales in the fourth quarter are typically higher than in the previous three quarters because of the Christmas holiday. Again, this can be captured by allowing the average retail sales to differ over the course of a year. This is in addition to possibly allowing for a trending mean. That is, retail sales in the most recent first quarter were higher than retail sales in the fourth quarter from 30 years ago, because retail sales have been steadily growing. Nevertheless, if we compare average sales within a typical year, the seasonal holiday factor tends to make sales larger in the fourth quarter.
We can model these issues by including season fixed effects so to allow the expected value of the series, $y_t$ , to vary by month, quarter, etc.
An Example: Antidumping¶
Krupp and Pollard (1996) analyzed the effects of antidumping filings by U.S. chemical industries on imports of various chemicals. We focus here on one industrial chemical, barium chloride, a cleaning agent used in various chemical processes and in gasoline production. In the early 1980s, U.S. barium chloride producers believed that China was offering its U.S. imports at an unfairly low price (an action known as dumping), and the barium chloride industry filed a complaint with the U.S. International Trade Commission (ITC) in October 1983. The ITC ruled in favor of the U.S. barium chloride industry in October 1984.
There are several questions of interest in this case:
- Are imports unusually high in the period immediately preceding the initial filing? 
- Do imports change noticeably after an antidumping filing? 
- What is the reduction in imports after a decision in favor of the U.S. industry? 
df = pd.read_stata('./Data/BARIUM.dta')
antid_month = smf.ols('lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + feb + mar + apr + may + jun + jul + aug + sep + oct + nov + dec + 1', data=df).fit()
print(antid_month.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                lchnimp   R-squared:                       0.358
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     3.712
Date:                Fri, 14 Oct 2022   Prob (F-statistic):           1.28e-05
Time:                        07:06:13   Log-Likelihood:                -109.54
No. Observations:                 131   AIC:                             255.1
Df Residuals:                     113   BIC:                             306.8
Df Model:                          17                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     16.7788     32.429      0.517      0.606     -47.468      81.026
lchempi        3.2651      0.493      6.624      0.000       2.288       4.242
lgas          -1.2781      1.389     -0.920      0.359      -4.030       1.474
lrtwex         0.6630      0.471      1.407      0.162      -0.271       1.597
befile6        0.1397      0.267      0.524      0.602      -0.389       0.668
affile6        0.0126      0.279      0.045      0.964      -0.539       0.565
afdec6        -0.5213      0.302     -1.726      0.087      -1.120       0.077
feb           -0.4177      0.304     -1.372      0.173      -1.021       0.185
mar            0.0591      0.265      0.223      0.824      -0.465       0.584
apr           -0.4515      0.268     -1.682      0.095      -0.983       0.080
may            0.0333      0.269      0.124      0.902      -0.500       0.567
jun           -0.2063      0.269     -0.766      0.445      -0.740       0.327
jul            0.0038      0.279      0.014      0.989      -0.548       0.556
aug           -0.1571      0.278     -0.565      0.573      -0.708       0.394
sep           -0.1342      0.268     -0.501      0.617      -0.664       0.396
oct            0.0517      0.267      0.194      0.847      -0.477       0.580
nov           -0.2463      0.263     -0.937      0.351      -0.767       0.274
dec            0.1328      0.271      0.489      0.626      -0.405       0.671
==============================================================================
Omnibus:                        9.169   Durbin-Watson:                   1.325
Prob(Omnibus):                  0.010   Jarque-Bera (JB):                9.324
Skew:                          -0.540   Prob(JB):                      0.00945
Kurtosis:                       3.736   Cond. No.                     1.47e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.47e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Some comments:
- The dependent variable is the (logged) volume of imports of barium chloride from China chnimp.
- The equation shows that befile6(ie, a dummy variable equal to one during the six months before filing) is statistically insignificant so there is no evidence that Chinese imports were unusually high during the six months before the suit was filed.
- Although the estimate on affile6(ie, a dummy variable equal to one during the six after filing) is negative, the coefficient is small (indicating about a 1.3% fall in Chinese imports), and it is statistically very insignificant.
- The coefficient on afdec6shows a substantial fall in Chinese imports of barium chloride after the decision in favor of the U.S. industry. How much?
- Are there seasonality concerns? It could be that the months just before the suit was filed are months where imports are higher or lower, on average, than in other months.
hypotheses = '(feb = mar = apr = may = jun = jul = aug = sep = oct = nov = dec= 0)'
f_test = antid_month.f_test(hypotheses)
# Print the results with an f-string
print(f'The F-statistic for all months equal to zero is {f_test.fvalue:.2f} implying we can reject the null hypothesis that they are\njointly zero with probability {1-f_test.pvalue:.2f}.')
The F-statistic for all months equal to zero is 0.86 implying we can reject the null hypothesis that they are jointly zero with probability 0.41.
A test of the joint significance of the 11 month dummies generates a low F-statistic (0.86) with a p-value 0.59 and so the seasonal dummies are jointly insignificant.
Practice¶
- Re-estimate the above model with seasonal dummies (Spring, Summer, Fall, Winter). Hint: Look at the data.
df.dtypes
chnimp float32 bchlimp float32 befile6 int8 affile6 int8 afdec6 int8 befile12 int8 affile12 int8 afdec12 int8 chempi float32 gas float32 rtwex float32 spr int8 sum int8 fall int8 lchnimp float32 lgas float32 lrtwex float32 lchempi float32 t int16 feb int8 mar int8 apr int8 may int8 jun int8 jul int8 aug int8 sep int8 oct int8 nov int8 dec int8 percchn float32 dtype: object
antid_season = smf.ols('lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + spr + sum + fall', data=df).fit()
print(antid_season.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                lchnimp   R-squared:                       0.310
Model:                            OLS   Adj. R-squared:                  0.258
Method:                 Least Squares   F-statistic:                     6.032
Date:                Fri, 14 Oct 2022   Prob (F-statistic):           5.79e-07
Time:                        07:06:13   Log-Likelihood:                -114.33
No. Observations:                 131   AIC:                             248.7
Df Residuals:                     121   BIC:                             277.4
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -26.5219     23.297     -1.138      0.257     -72.645      19.602
lchempi        3.0779      0.486      6.331      0.000       2.116       4.040
lgas           0.5651      1.000      0.565      0.573      -1.415       2.545
lrtwex         1.1015      0.425      2.594      0.011       0.261       1.942
befile6        0.0767      0.265      0.289      0.773      -0.448       0.601
affile6       -0.0833      0.273     -0.305      0.761      -0.623       0.457
afdec6        -0.6212      0.295     -2.103      0.038      -1.206      -0.036
spr           -0.0412      0.151     -0.273      0.786      -0.341       0.258
sum           -0.1519      0.169     -0.897      0.371      -0.487       0.183
fall          -0.0673      0.154     -0.436      0.664      -0.373       0.239
==============================================================================
Omnibus:                        8.751   Durbin-Watson:                   1.439
Prob(Omnibus):                  0.013   Jarque-Bera (JB):                9.596
Skew:                          -0.466   Prob(JB):                      0.00825
Kurtosis:                       3.943   Cond. No.                     1.06e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.06e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
- Test whether these seasonal variables are jointly-significant.
hypotheses = '(spr = sum = fall= 0)'
f_test = antid_season.f_test(hypotheses)
print(f_test)
<F test: F=0.28224471031401044, p=0.8381333029196874, df_denom=121, df_num=3>
- Report the reduction in Chinese imports after the decision in favor of US firms.
print(f'Chinese imports fell by {100*(np.exp(0.6212)-1):4.2f} percent in the six months after the US ITC October 1984 decision.')
Chinese imports fell by 86.12 percent in the six months after the US ITC October 1984 decision.
4. Stationarity and Autoregressive (AR) Models (top)¶
Let's expand our analysis of the linear trend models to look at autoregressive models. Here, we'll take advantage of statsmodels.tsa which contains model classes and functions useful for time series analysis. Basic models include univariate autoregressive models (AR), vector autoregressive models (VAR) and univariate autoregressive moving average models (ARMA). We are only going to touch on autoregressive models here. The package also includes filtering methods and time series related plotting functions.
We'll focuson on autoregressive models (AR) which amount to modelling a specific type of linear trend. Such models are particularly useful in looking at economic data, such as those data available from FRED. You may have to install the pandas API interface via
pip install pandas_datareader
import statsmodels.graphics.tsaplots as tsaplots  # Gives the the autocorrelation plot
import statsmodels.tsa as tsa                     # The time series models
from statsmodels.tsa.ar_model import AutoReg
from pandas_datareader import data, wb    # we are grabbing the data and wb functions from the package
import datetime as dt                     # for time and date
codes = ['GDPC1']  # The code for real GDP at FRED. You can go to the website and look it up.
start = dt.datetime(1947, 1, 1)
usa = data.DataReader(codes, 'fred', start)
usa.rename(columns={'GDPC1':'gdp'}, inplace=True)
It's always a good idea to plot your data before doing any serious stuff. Look for variations in the data which you can use to build a better model.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(usa.index, usa['gdp'], color = 'red')
ax.set_ylabel('bil 2012 dollars')
# Make it look nice
sea.despine(ax=ax)
ax.set_title('US GDP',size=28)
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
A different version/ source of labor productivity and again we observe a linear trend.
Stationarity¶
The GDP data are not stationary --- there is clearly a trend. Autoregressive models are usually used on stationary data. We can filter the data to remove the trend and recover a stationary series (tsa has methods to to this), but GDP is usually stationary in growth rates.
To see this, let's compute and plot the growth rate of the GDP data.
# The log-difference is a close approximation of the growth rate when growth rates are small. 
# log-differences are often used in practice because they are symmetric. 
usa['gdp_diff'] = np.log(usa['gdp']) - np.log(usa['gdp'].shift(1))
usa['gdp_pct'] = usa['gdp'].pct_change()
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(usa.index, usa['gdp_diff'], color = 'red')
ax.plot(usa.index, usa['gdp_pct'], color = 'blue')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
Note that the growth percent and log difference plots lie on-top of eachother. This illustrates the idea that logs can be used as a close approximating to growth rates and percentage changes.
The figure looks stationary; i.e., no trend. Even the COVID19 period seems to revolve around a trend. There are things like Dickey-Fuller tests for stationarity (again, these can be done in statsmodels) but they are outside of the scope of this notebook.
Autoregressive (AR) models¶
An autoregressive model is a model where the outcome variable is a function of its past values. The general model is
$$ y_t = \varphi_0 + \varphi_1y_{t-1} + \varphi_2 y_{t-2} + \cdots + \varphi_p y_{t-p}+\epsilon_t.$$This model has $p$ lags. We refer to it as an AR(p) model or an autoregressive model of order $p$.
Determining the number of lags to include is part of specifying the model.
Let's start by looking at the data to see if it has an autoregressive property. Is there a relationship between the current value of GDP growth and its past value?
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(usa['gdp_diff'], usa['gdp_diff'].shift(1), color = 'red')
# plot the 45 degree line
ax.plot([-0.02, 0.045], [-0.02, 0.045], color='black')
ax.text(0.04, 0.037, '45-degree line')
ax.set_ylabel('gdp(t)')
ax.set_xlabel('gdp(t-1)')
sea.despine(ax=ax, trim=True)
plt.show()
There appears to be a positive relationship, but it is noisy.
The autocorrelation function¶
Our plot above tells us about the $t$ and $t-1$ relationship. How about $t$ and $t-2$? $t-5$? We could continue to make the plots above, but there are better ways.
The autocorrelation function plots $\text{corr}(y_t, y_{t-k})$ for many values of $k$. statsmodels automates this for us with the .plot_acf() function (docs).
import statsmodels.graphics.tsaplots as tsaplots  # Gives the the autocorrelation plot
import statsmodels.tsa as tsa                     # The time series models
from statsmodels.tsa.ar_model import AutoReg
fig, ax = plt.subplots(figsize=(10,6))
# plot_acf is picky about nans, so drop them
# The 'lags' parameter determines how many lags to show
tsaplots.plot_acf(usa['gdp_diff'].dropna(), lags = 8, ax = ax)
ax.set_xlabel('lag = k', fontsize=14)
ax.set_ylabel(r'corr(gdp$_t$, gdp$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
Estimating the AR(p) model¶
The AR(p) model in statsmodels works like the other models. You create a model object, then  you call the .fit( ) method to estimate the parameters. We'll first estimate the model with two lags using the standard statsmodels approach:
print(smf.ols('gdp_diff ~ usa["gdp_diff"].shift(1) + usa["gdp_diff"].shift(2)', data=usa).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               gdp_diff   R-squared:                       0.026
Model:                            OLS   Adj. R-squared:                  0.019
Method:                 Least Squares   F-statistic:                     3.946
Date:                Fri, 14 Oct 2022   Prob (F-statistic):             0.0204
Time:                        07:06:30   Log-Likelihood:                 914.88
No. Observations:                 299   AIC:                            -1824.
Df Residuals:                     296   BIC:                            -1813.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                    0.0060      0.001      6.787      0.000       0.004       0.008
usa["gdp_diff"].shift(1)     0.1010      0.058      1.749      0.081      -0.013       0.215
usa["gdp_diff"].shift(2)     0.1146      0.058      1.985      0.048       0.001       0.228
==============================================================================
Omnibus:                      118.366   Durbin-Watson:                   1.984
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6450.531
Skew:                          -0.731   Prob(JB):                         0.00
Kurtosis:                      25.708   Cond. No.                         92.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now, let's shift and use the AutoReg() method. Note that we imported the method above. The only real difference here is that you do not need to write out a string with the model specification. The AutoReg( ) method knows that the model is AR(p). NB, you can include other regressors using the exog argument.
# Need to set the data frequency.
usa = usa.resample('q').mean()
# Estimate the model. 
mod = AutoReg(usa['gdp_diff'].dropna(), 2)
res = mod.fit()
print(res.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:               gdp_diff   No. Observations:                  301
Model:                     AutoReg(2)   Log Likelihood                 914.883
Method:               Conditional MLE   S.D. of innovations              0.011
Date:                Fri, 14 Oct 2022   AIC                          -1821.765
Time:                        07:06:30   BIC                          -1806.964
Sample:                    12-31-1947   HQIC                         -1815.841
                         - 06-30-2022                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const           0.0060      0.001      6.821      0.000       0.004       0.008
gdp_diff.L1     0.1010      0.057      1.758      0.079      -0.012       0.214
gdp_diff.L2     0.1146      0.057      1.995      0.046       0.002       0.227
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            2.5458           +0.0000j            2.5458            0.0000
AR.2           -3.4267           +0.0000j            3.4267            0.5000
-----------------------------------------------------------------------------
We see the same coefficient estimates as the standard statsmodels approach.
Comments: The stuff at the bottom is outside the scope of this class but in case you're curious... the modulus refers to the eigenvalues of the stochastic process we're modeling. This speaks to whether or not a shock would result in the process returning to an equilibrium. Eigenvalues / modulus outside the unit circle (ie, absolute value greater than one) imply such stability. Since we observe business cycles in the US economy revolving around a trend, a "smell" test for whether a model makes sense is whether the eigenvalues imply stability. Of course, since the data do reflect this stability, such an an outcome should be expected.
The results object has lots of stuff in it. Try tab completion to see what's in there. Let's plot the fitted values:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(res.fittedvalues, color='blue', label = 'AR(2)')
ax.plot(usa.index, usa.gdp_diff, color='black', alpha = 0.5, label = 'Data')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
ax.legend(frameon=False,fontsize=18)
plt.show()
The fitted values track the data, but do not generate enough volatility.
Forecasting¶
We can use the estimated model to develop a forecast simply by iterating through the future and applying our fitted model.
# To make plotting easier, initialize forecast_list to current value of gdp.
# This will simply connect the lines for the historical data and the forecast.
forecast_list = []
forecast_list.append(float(usa.loc['2022-06-30','gdp_diff']))
usa['L1gdp_diff'] = usa["gdp_diff"].shift(1)
usa['L2gdp_diff'] = usa["gdp_diff"].shift(2)
Xvars = ['L1gdp_diff','L2gdp_diff']
Xvars_string = ' + '.join(Xvars)
# Series containing current values of the explanatory variables. Used to make forecast
current_values = usa.loc['2022-06-30',Xvars]
# Create new columns to contain forward values of the GDP growth rate:
for h in range(1,10):
    var = 'F' + str(h) + 'gdp_diff' # Create string containing the variable name, like "F1gdp"
    usa[var] = usa['gdp_diff'].shift(-h) # Create column in data frame
    
# Loop through each forecast horizon:
for h in range(1,10):
    
    # Define dependant variable
    var = 'F' + str(h) + 'gdp_diff' 
    
    # Use same trick as above to conveniently write the expression for the model:
    res = smf.ols(var + ' ~ ' + Xvars_string, data=usa).fit()
    
    # Print results (if necessary)
    if h == 1:
        print(res.summary()) # Print results summary for first regression only
    
    # Record forecast
    pt_forecast = float(res.predict(current_values)) # The .predict() method creates the forecast of GDP.
    forecast_list.append(pt_forecast) # Append the forecast for the current horizon to the list of forecasts
    
print('')    
print(forecast_list)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             F1gdp_diff   R-squared:                       0.017
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     2.522
Date:                Fri, 14 Oct 2022   Prob (F-statistic):             0.0820
Time:                        07:06:33   Log-Likelihood:                 910.16
No. Observations:                 298   AIC:                            -1814.
Df Residuals:                     295   BIC:                            -1803.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0067      0.001      7.589      0.000       0.005       0.008
L1gdp_diff     0.1306      0.058      2.246      0.025       0.016       0.245
L2gdp_diff    -0.0179      0.058     -0.308      0.758      -0.132       0.097
==============================================================================
Omnibus:                      141.061   Durbin-Watson:                   1.791
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5274.083
Skew:                          -1.209   Prob(JB):                         0.00
Kurtosis:                      23.467   Cond. No.                         92.5
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[-0.0014473866859976425, 0.005910242683759876, 0.006888434724432077, 0.007716174337970049, 0.007819723553806616, 0.008022337924499738, 0.00832140520106706, 0.008631598715385247, 0.007503914856584633, 0.006786637591904043]
forecast_dates = pd.date_range('2022-06-30', periods=10, freq='Q')
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(res.fittedvalues, color='blue', label = 'AR(2)')
ax.plot(usa.index, usa.gdp_diff, color='black', alpha = 0.5, label = 'Data')
ax.plot(forecast_dates, forecast_list, color = 'red', linestyle = '-', label = 'Forecast')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
ax.legend(frameon=False,fontsize=18)
plt.show()
5. Autoregressive Integrated Moving-Average (ARIMA) (top)¶
Let's get serious about modeling GDP growth and fit an Autoregressive Integrated Moving-Average (ARMA) model. The "integrated" part refers to the fact that the model provides a way to choose the number of differences required to make the data stationary but we've already done that so what we'll be modelling is a Autoregressive Moving-Average (ARMA) model defined as
$$ Y_t = \varepsilon_t + \underbrace{\Sigma_{i=1}^p \varphi_i Y_{i-1}}_{\text{Auto-regressive}} + \underbrace{\Sigma_{i=1}^q \theta_{i}\varepsilon_{i-1}}_{\text{Moving-Average}} $$We need to choose the variables $p$ and $q$ and then estimate (fit) ${\varphi_i}_{i=1}^p$ and ${\theta_i}_{i=1}^q$. Fortunately, there's a recipe to chose (p,q):
- Plot the autocorrelation functions for an estimate of q. We've already done this above and found the autocorrelation was significant for a lag of two so q=2. We repeat the process and graph below.
fig, ax = plt.subplots(figsize=(10,6))
# plot_acf is also picky about nans, so drop them
# The 'lags' parameter determines how many lags to show
tsaplots.plot_acf(usa['gdp_diff'].dropna(), lags = 10, ax = ax)
ax.set_xlabel('lag number', fontsize=14)
plt.xticks(np.arange(0, 11, 1.0))
ax.set_ylabel(r'corr(gdp$_t$, gdp$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
- Plot the partial autocorrelation functions for an estimate of p. This is beyond what you need to know for this class but in case you're interested:
Given a time series $\{z_{t}\}$, the partial autocorrelation of lag k, denoted $\phi _{k,k}$, is the autocorrelation between $z_{t}$ and $z_{t+k}$ with the linear dependence of $z_{t}$ on $z_{t+1}$ through $z_{t+k-1}$ removed. In a slightly less mathy tone, the partial autocorrelation of lag k, denoted $\phi _{k,k}$, is the autocorrelation between $z_{t}$ and $z_{{t+k}}$ not accounted for by lags $1$ through (and including) $k-1$. For us, it will serve as a tool to identify how mant lags to include for the AR component.
We haven't done this yet so let's do it.
fig, ax = plt.subplots(figsize=(10,6))
# plot_pcf is also picky about nans, so drop them
# The 'lags' parameter determines how many lags to show
tsaplots.plot_pacf(usa['gdp_diff'].dropna(), lags = 10, ax = ax,method='ywm')
ax.set_xlabel('lag number', fontsize=14)
plt.xticks(np.arange(0, 11, 1.0))
ax.set_ylabel(r'corr(gdp$_t$, gdp$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
Hmm... very similar to the autoregressive model and the number of significant lags is equal to two so set p=2. And now we can fit the model:
import statsmodels.api as sm
arima_model = sm.tsa.arima.ARIMA(usa['gdp_diff'], order=(2,0,2))  # The "zero" in "order" is for the number of differences to 
                                                                  # but we've already done that so I'm telling the algorithm to 
                                                                  # not apply a difference.
res = arima_model.fit()
A nice feature of statsmodels is that it's easy to use the fitted model to make a forecaste so let's do that:
# Forecast
fc = res.forecast(15, alpha=0.05)  # 95% conf
# Make as pandas series
fc_series = pd.DataFrame(fc)
And of course it would be useful to plot the results:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(res.fittedvalues, color='blue', label = 'ARIMA')
ax.plot(usa.index, usa.gdp_diff, color='black', alpha = 0.5, label = 'Data')
ax.plot(fc_series.index, fc_series.predicted_mean, color='red', label = 'Forecast')
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('US GDP growth rates',size=28)
ax.yaxis.set_major_formatter(FormatStrFormatter('%0.3f'))
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
ax.legend(frameon=False,fontsize=18)
plt.show()
Still not enough volatility captured in our model.
Practice¶
- Collect historical data on the S\&P 500 price index from FRED using the pandas-datareader API. The FRED variable code is 'SP500'. See the API notebook for how to do this. Use November 23, 2013 as the start date.
codes = ['SP500']  # The code for real GDP at FRED.
start = start = dt.datetime(2013, 11, 23)
stocks = data.DataReader(codes, 'fred', start)
- Resample the data to weekly frequency using mean().
stocks = stocks.resample('w').mean()
- Plot the weekly average price. Are the data stationary?
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stocks.index, stocks['SP500'], color = 'blue')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('')
ax.set_title('S&P 500 Index Price',size=28)
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
- Compute the log difference of the weekly price data.
#stocks['sp500_diff'] = np.log(stocks['SP500']) - (np.log(stocks['SP500'].shift(1))) # Approximation using logs
stocks['sp500_diff'] = stocks['SP500'].pct_change()  # Calculated directly
stocks.head()
| SP500 | sp500_diff | |
|---|---|---|
| DATE | ||
| 2013-12-01 | 1804.5675 | NaN | 
| 2013-12-08 | 1795.7960 | -0.004861 | 
| 2013-12-15 | 1788.8060 | -0.003892 | 
| 2013-12-22 | 1801.2220 | 0.006941 | 
| 2013-12-29 | 1836.1825 | 0.019409 | 
- Plot the growth rates. Are the data stationary?
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stocks.index, stocks['sp500_diff'], color = 'blue')
# Make it look nice
sea.despine(ax=ax)
ax.set_ylabel('')
ax.set_title('S&P 500 Index Growth Rate',size=28)
ax.xaxis.set_ticks_position('none')   # Remove the small tick marks
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
- Plot the autocorrelation function for 20 lags. What lags look meaningful?
fig, ax = plt.subplots(figsize=(10,6))
# plot_acf is picky about nas, so drop them
tsaplots.plot_acf(stocks['sp500_diff'].dropna(), lags = 20, ax = ax)
ax.set_xlabel('lag = k', fontsize=14)
ax.set_ylabel(r'corr(sp500$_t$, sp500$_{t-k})$', fontsize=14)
sea.despine(ax=ax)
plt.show()
- Estimate an AR(p) model of the price data. Include 1 lag.
stock_res_1 = AutoReg(stocks['sp500_diff'].dropna(),1).fit()
print(stock_res_1.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:             sp500_diff   No. Observations:                  463
Model:                     AutoReg(1)   Log Likelihood                1206.886
Method:               Conditional MLE   S.D. of innovations              0.018
Date:                Fri, 14 Oct 2022   AIC                          -2407.773
Time:                        07:06:46   BIC                          -2395.366
Sample:                    12-15-2013   HQIC                         -2402.888
                         - 10-16-2022                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0013      0.001      1.531      0.126      -0.000       0.003
sp500_diff.L1     0.2377      0.045      5.241      0.000       0.149       0.327
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            4.2072           +0.0000j            4.2072            0.0000
-----------------------------------------------------------------------------
- Plot the data and the fitted values.
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stock_res_1.fittedvalues, color='blue', label = 'fitted values')
ax.plot(stocks.index, stocks['sp500_diff'], color='black', alpha = 0.5, label = 'data')
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('S&P500 growth rates',size=28)
ax.legend(frameon=False,fontsize=18)
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()
- Plot the data and the fitted values from the 1-lag model and the 2-lag model. Plot just for the period 1/1/2015 through 1/1/2017.
stock_res_2 = AutoReg(stocks['sp500_diff'].dropna(),2).fit()
print(stock_res_2.summary())
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:             sp500_diff   No. Observations:                  463
Model:                     AutoReg(2)   Log Likelihood                1205.415
Method:               Conditional MLE   S.D. of innovations              0.018
Date:                Fri, 14 Oct 2022   AIC                          -2402.830
Time:                        07:06:47   BIC                          -2386.297
Sample:                    12-22-2013   HQIC                         -2396.320
                         - 10-16-2022                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0014      0.001      1.669      0.095      -0.000       0.003
sp500_diff.L1     0.2578      0.047      5.529      0.000       0.166       0.349
sp500_diff.L2    -0.0841      0.047     -1.800      0.072      -0.176       0.007
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.5339           -3.0894j            3.4493           -0.1767
AR.2            1.5339           +3.0894j            3.4493            0.1767
-----------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(15,6))
ax.plot(stock_res_2.fittedvalues, color='blue', label = 'fitted values, p = 2')
ax.plot(stock_res_1.fittedvalues, color='red', label = 'fitted values, p = 1')
ax.plot(stocks.index, stocks['sp500_diff'], color='black', alpha = 0.5, label = 'data')
sea.despine(ax=ax)
ax.set_ylabel('growth rate',size=18)
ax.set_title('S&P500 growth rates (2015-2017)',size=28)
ax.legend(frameon=False,fontsize=18)
ax.set_xlim([dt.datetime(2015, 1, 1), dt.datetime(2017, 1, 1)])
plt.yticks(size=18)
plt.xticks(size=18)
plt.show()