Lecture 20: Linear Regression via Ordinary Least Squares (OLS)
[MY SOLUTIONS]
This semester we've spent time adding tools to your toolbox. While we have spent a lot of time looking at economic data, we haven't spent much time doing "economics." Well, all of this work was a big wind-up to start using economic theory to make sense of data. There are two ways in which we'll do this:
Econometrics
Machine Learning
This notebook introduces us to Econometrics and the statsmodels package (docs), which provides functions for formulating and estimating statistical models. Econometrics is not a prerequisite for this course, so we will spend some time motivating and proving the tools. If you have had econometrics, we will apply what you learned in that class here in Python.
The agenda for today's lecture is as follows:
Class Announcements
Exam 2 is due to eLC by end of day Thursday.
No class on Thursday.
Inference (Econometrics) vs. Prediction (Machine Learning)
Suppose we have data on a variable of interest $y$ and variables we think are important for determining $y$, $x_1, x_2,...x_p$. We assume that $y$ is related to the $x$ variables according to
$$y = f(X)+\epsilon$$where $X$ is the matrix of $x$ variables. The $f()$ function represents the systematic or structural relationship between $y$ and $X$ and $\epsilon$ is the noise or error term.
An important part of both machine learning and econometrics is estimating $f()$. Why we care about $f()$ is where machine learning and econometrics often (but not always) diverge.
- **Econometrics** prioritizes inference: The estimate of $f()$ is important in that it tells us about how $y$ and $X$ are related. An economist may be interested in understanding the (causal) relationship between interest rates (influenced directly by the FOMC) and the S&P 500 index (as a proxy for the economy). Inference is often associated with some kind of theoretical model in mind that lays out relationships between the $y$ and $X$ variables. The estimate of $f()$ helps us quantify and better understand these models.
- **Machine learning** prioritizes prediction: The estimate of $f()$ is only important in that it is useful for out-of-sample prediction. For example, a 'trained' ML model will be capable of predicting the value of the S&P 500 index tomorrow but the user is not inherently interested in why the S&P 500 index will change.
Recent work has focused on leveraging machine learning tools to better infer causality (econometrics) so the distinction between the two is becoming blurry.
Specifying f( )
How do we know which variables to include in $X$, or what is $f(\; )$? In econometrics, we use a model to guide our choice of $f()$. This reflects the researcher's goal of inference. For example, if we're interested in understanding the causal relationship between interest rates and the S&P 500, interest rates had better be in the regression on the right-hand side. Plus, we'll want our estimates to reflect some deeper economic idea such as an elasticity. Doing so will enable us to port our results to other settings, including the larger and more complicated equilibrium models of the economy used by the Federal Reserve to inform policy.
For this class, we will assume that $f(x)$ is linear so our assumed "data generating process" takes the following form:
$$y = X\beta+\epsilon$$where $X$ is some data and $\beta$ is unknown to us; we want to use the data to estimate it.
Identifying $\beta$: A Motivating Simple Example
Suppose we have $j=1,...,N$ observations of some outcome variable we're interested in ($y$) and $j=1,...,N$ observations of some variable $X$ which we think influences $y$. We want to know exactly how; ie, we want to find $\beta$. The world is a complicated place, so we also think there's some random noise out there and include an error term ($\epsilon$). Our specification is therefore
$$y_j = X_j\beta+\epsilon_j$$How do we find $\beta$?
Monte Carlo Simulation
Normally, we are presented with data, we assume a functional form for $f()$, and we solve for the $\beta$ such that our specification best represents the observed data. Our results could therefore be wrong if we specified the wrong $f()$. Let's ignore that possibility and generate our own data so we know the true Data Generating Process (DGP) and therefore the true value of $\beta$. This kind of approach is called Monte Carlo Simulation. There is an optional lecture on the course website about how to properly construct random draws.
# Initialize packages
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
import seaborn as sea
import pandas as pd
# Set seed for random number generator to ensure the same "random" numbers are generated each time we run this code.
np.random.seed(10)
# Set values
beta = 1
N = 100
# Generate N random draws from an exponential distribution
X = npr.exponential(1,size=N)
# Generate N random draws from a standard normal distribution
eps = npr.normal(0,1,size=N)
# Determine Y
Y = X*beta + eps
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.set_xlabel('Independent Variable (X)',size=18)
ax.set_ylabel('Dependent Variable (Y)',size=18)
ax.set_title(r'$Y_j = \beta X_j + \epsilon_j,\;\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Let's pretend that we don't know the true $\beta$ is equal to one.
We want to find the $\beta$ which best fits the data. Since our assumed $f()$ is linear, this amounts to drawing a line through the above dots, beginning at the origin. We want this line to go through as many of the dots as possible. In elementary school you would have just positioned your ruler at (0,0) and rotated it until you found a good enough line. We're going to do the same thing here but with math.
Finding the 'best' line amounts to finding the line which minimizes the (squared) distance between the line and the red dots. Mathematically, this amounts to solving
$$\text{min}_{\beta}\sum_j \bigg[Y_j - \underbrace{\big(X_j\beta+\epsilon_j\big)}_{\text{Model}}\bigg]^2$$We square the distance because we're indifferent as to whether the line misses a red dot from above or from below. Also note that the above specification lends itself to the name of this regression: "Least Squares". The "ordinary" refers to this approach being the starting point for regression models.
We can solve for $\beta$ using calculus; i.e., by taking the derivative with respect to $\beta$ and setting it equal to zero:
$$ \sum_j \bigg[Y_j - \big(X_j\beta+\epsilon_j\big)\bigg]\times X_j=0$$which is equivalent to
$$ \sum_j Y_jX_j - X_jX_j\beta -X_j\epsilon_j=0$$$$ \hookrightarrow \sum_j Y_jX_j -\sum_jX_j\epsilon_j=\beta\sum_jX_jX_j$$But we can do better yet. Remember, we've assumed in our specification that $\epsilon$ is random noise. Since it's random noise, it cannot be correlated with $X$, so we know the interaction between the two ($\sum_jX_j\epsilon_j$) is equal to zero in expectation (or, equivalently, approximately zero when the number of observations is sufficiently large).
We therefore can derive the value of $\beta$ directly from the First-Order Condition:
$$\hat\beta=\frac{\sum_j Y_jX_j}{\sum_jX_jX_j}$$We can check this with our graph:
numerator = np.sum(Y*X)
denominator = np.sum(X*X)
bhat = numerator/denominator
Yhat = X*bhat
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.plot(X, Yhat, color = 'blue')
ax.set_xlabel('Independent Variable (X)')
ax.set_ylabel('Dependent Variable (Y)')
ax.set_title(r'Proposed DGP with $\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.annotate(fr'$\hat\beta=${bhat:.2f}', (2.2,3.5),size=16, color='blue')
plt.show()
Well, we did pretty well at recovering the true $\beta$ but didn't get it exactly. Why not? Note that above I said "in expectation" regarding the interaction between $X$ and $\epsilon$ and set it to zero. If that interaction isn't exactly zero in our sample, the calculated $\hat\beta$ will be off. Hence, our result is just an estimate of the coefficient, and we would therefore like to understand the variance of this estimate to quantify our uncertainty and possibly to perform significance tests.
We can actually derive an explicit function to represent the variance of our estimate -- our uncertainty will stem from how noisy the errors are and how much variation in $X$ we have. Mathematically, define the variance as
$$V(\hat\beta)= \frac{1}{N}\sum_j \bigg[\big(\beta-\hat\beta\big)X_j\bigg]^2$$$$\hookrightarrow V(\hat\beta)= \frac{\sigma^2}{\sum_j X_j^2} $$where $\sigma^2 = \frac{\sum_j\epsilon_j^2}{N-p}$ and $p$ is the number of parameters estimated. We refer to $N-p$ as the 'degrees of freedom.' We don't know the true $\epsilon_j$, but our estimated equation gives us residuals $\hat\epsilon_j$, so we use those to approximate the true values. Hence, our estimated variance of the estimate is
$$\hat V(\hat\beta)= \frac{\sum_j\hat\epsilon_j^2/(N-p)}{\sum_j X_j^2} $$We define the standard error as the square root of $\hat V(\hat\beta)$, which we can easily compute given our estimation. Moreover, we can test whether our estimate of $\beta$ is significantly different from a 'null hypothesis' (typically set to zero, meaning no relationship exists) using a t-test:
$$t = \frac{\hat\beta-\beta^0}{\text{se}(\hat\beta)}\sim T_{N-p}$$where $T_{N-p}$ represents the t-distribution with $N-p$ degrees of freedom. BTW, the story behind Student's t-test is a worthwhile read. A good rule of thumb is that a t-statistic greater than 2 in absolute value lets us reject the null hypothesis at the 95\% confidence level. In practice, we only report the estimated standard errors since that enables the reader to compute t-statistics for a variety of null hypotheses.
resid = Y - X*bhat              # estimated residuals
s = np.sum(resid**2)/(N-1)      # estimate of sigma^2 with N-p = N-1 degrees of freedom
var = s/(np.sum(X**2))          # estimated variance of bhat
se = np.sqrt(var)               # standard error of bhat
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.plot(X, Yhat, color = 'blue')
ax.set_xlabel('Independent Variable (X)')
ax.set_ylabel('Dependent Variable (Y)')
ax.set_title(r'Proposed DGP with $\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.annotate(fr'$\hat\beta=${bhat:.2f} ({se:.4f})', (1.6,3.5),size=16, color='blue')
plt.show()
From the graph, we can see that our simple linear model replicates the data well and -- since we know the true DGP -- also recovers the true value of $\beta$.
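We can also plug these numbers into the t-statistic formula from above to test the null hypothesis $\beta^0=0$. A minimal sketch using the bhat and se computed above (a value comfortably above 2 in absolute value rejects the null):
# t-statistic for the null hypothesis that beta equals zero
t_stat = (bhat - 0)/se
print(f't-statistic against the null of beta=0: {t_stat:.2f}')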
In our simple example we only included one variable and therefore only solved for one coefficient. Going forward, our data will be much more complicated (e.g., more independent X variables) and we'll want to account for the possibility that our dependent variable attains some value independent of the observed X variables in our data -- we'll want to include a constant in our $f()$ specification. It turns out that doing all of this is pretty straightforward and the intuition gained from our simple example extends easily.
Practice
- Increase N to $\{200,500,1000\}$. What happens to the estimated coefficient $(\hat\beta)$ and its estimated standard error $\big(\text{se}(\hat\beta)\big)$?
# Set values
beta = 1
Nval = [100, 200, 500, 1000, 100000, 100000]
for n in Nval:
    # Generate n random draws from an exponential distribution
    X = npr.exponential(1,size=n)
    # Generate n random draws from a standard normal distribution
    eps = npr.normal(0,1,size=n)
    # Determine Y
    Y = X*beta + eps
    numerator = np.sum(Y*X)
    denominator = np.sum(X*X)
    print(f'Given N of {n:d}, the value of bhat is: {numerator/denominator:.4f}')
Given N of 100, the value of bhat is: 1.0222
Given N of 200, the value of bhat is: 0.9870
Given N of 500, the value of bhat is: 0.9854
Given N of 1000, the value of bhat is: 0.9786
Given N of 100000, the value of bhat is: 0.9995
Given N of 100000, the value of bhat is: 1.0013
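The question also asks about the estimated standard errors, which the loop above does not report. Here is a sketch extending it with the standard-error formula from earlier (reusing the same DGP; the standard error should shrink as $N$ grows):
# Re-run the Monte Carlo, this time also reporting the estimated standard error
for n in Nval:
    X = npr.exponential(1,size=n)
    eps = npr.normal(0,1,size=n)
    Y = X*beta + eps
    bhat = np.sum(Y*X)/np.sum(X*X)                  # point estimate
    resid = Y - X*bhat                              # estimated residuals
    var = (np.sum(resid**2)/(n-1))/np.sum(X**2)     # estimated variance of bhat
    print(f'Given N of {n:d}, bhat is {bhat:.4f} with standard error {np.sqrt(var):.4f}')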
- Change the specification by changing $\beta$ to a value of your choice and re-run the code to find $\hat\beta$.
Nval = [100, 200, 500, 1000, 100000, 100000]
for n in Nval:
    # Generate n random draws from an exponential distribution
    X = npr.exponential(1,size=n)
    # Generate n random draws from a standard normal distribution
    eps = npr.normal(0,1,size=n)
    # Determine Y with a true beta of 1.5
    Y = X*1.5 + eps
    numerator = np.sum(Y*X)
    denominator = np.sum(X*X)
    print(f'Given N of {n:d}, the value of bhat is: {numerator/denominator:.4f}')
Given N of 100, the value of bhat is: 1.4653
Given N of 200, the value of bhat is: 1.4602
Given N of 500, the value of bhat is: 1.4470
Given N of 1000, the value of bhat is: 1.4435
Given N of 100000, the value of bhat is: 1.4994
Given N of 100000, the value of bhat is: 1.4965
Reading Stata data files
If you took econometrics, you probably used one of Wooldridge's textbooks. I figure you would like to relive those happy times, so we'll use some examples from his textbook.
On the plus side, the data that correspond to the Wooldridge Introductory Econometrics: A Modern Approach problems are available to download and they are ALREADY CLEANED. [I contemplated adding some junk to the files to make it more interesting...]
On the minus side, the files are in STATA's .dta format.
Lucky for us, pandas has a method that reads Stata files. It also has methods for csv, excel, json, SQL, SAS, ...
We'll be looking at the connection between labor wages and sleep using a 1990 paper by Biddle and Hamermesh (link). The authors use time-use data from 12 countries to assess whether an increase in wage causes individuals to sacrifice sleep.
Take note of a couple things:
The "economic mechanism" is that people are time-constrained, so when their wages go up, they sacrifice sleep to earn more. If we were to write down a consumer optimization problem, utility would be increasing in sleep and consumption, where sleep is the time $T-n$ left over after working $n$ hours out of total time $T$; ie, $$ \max_{n,c} \,U(\underbrace{T-n}_{\text{sleep}},c)\;\;\text{s.t.}\;\; \underbrace{w\times n}_{\text{income}}\geq \underbrace{p\times c}_{\substack{\text{Money spent}\\ \text{on cons.}}} $$
They want to show "causation" but all we'll be able to show with OLS is correlation.
Let's load the underlying data:
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
sleep = pd.read_stata('./Data/sleep75.dta')
# Take a look...so clean!
sleep.head()
| age | black | case | clerical | construc | educ | earns74 | gdhlth | inlf | leis1 | ... | spwrk75 | totwrk | union | worknrm | workscnd | exper | yngkid | yrsmarr | hrwage | agesq | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 1.0 | 3529.0 | ... | 0.0 | 3438.0 | 0.0 | 3438.0 | 0.0 | 14.0 | 0.0 | 13.0 | 7.070004 | 1024.0 |
| 1 | 31.0 | 0.0 | 2.0 | 0.0 | 0.0 | 14.0 | 9500.0 | 1.0 | 1.0 | 2140.0 | ... | 0.0 | 5020.0 | 0.0 | 5020.0 | 0.0 | 11.0 | 0.0 | 0.0 | 1.429999 | 961.0 |
| 2 | 44.0 | 0.0 | 3.0 | 0.0 | 0.0 | 17.0 | 42500.0 | 1.0 | 1.0 | 4595.0 | ... | 1.0 | 2815.0 | 0.0 | 2815.0 | 0.0 | 21.0 | 0.0 | 0.0 | 20.530001 | 1936.0 |
| 3 | 30.0 | 0.0 | 4.0 | 0.0 | 0.0 | 12.0 | 42500.0 | 1.0 | 1.0 | 3211.0 | ... | 1.0 | 3786.0 | 0.0 | 3786.0 | 0.0 | 12.0 | 0.0 | 12.0 | 9.619998 | 900.0 |
| 4 | 64.0 | 0.0 | 5.0 | 0.0 | 0.0 | 14.0 | 2500.0 | 1.0 | 1.0 | 4052.0 | ... | 1.0 | 2580.0 | 0.0 | 2580.0 | 0.0 | 44.0 | 0.0 | 33.0 | 2.750000 | 4096.0 |
5 rows × 34 columns
Here is some additional information about the variables:
| Variable | Description | Variable | Description | Variable | Description | Variable | Description |
|---|---|---|---|---|---|---|---|
| age | in years | leis1 | sleep - totwrk | rlxall | slpnaps + personal activs | worknrm | mins work main job |
| black | =1 if black | leis2 | slpnaps - totwrk | selfe | =1 if self employed | workscnd | mins work second job |
| case | identifier | leis3 | rlxall - totwrk | sleep | mins sleep at night, per wk | exper | age - educ - 6 |
| clerical | =1 if clerical worker | smsa | =1 if live in smsa | slpnaps | minutes sleep, inc. naps | yngkid | =1 if children < 3 present |
| construc | =1 if construction worker | lhrwage | log hourly wage | south | =1 if live in south | yrsmarr | years married |
| educ | years of schooling | lothinc | log othinc, unless othinc < 0 | spsepay | spousal wage income | hrwage | hourly wage |
| earns74 | total earnings, 1974 | male | =1 if male | spwrk75 | =1 if spouse works | agesq | age^2 |
| gdhlth | =1 if in good or excel. health | marr | =1 if married | totwrk | mins worked per week | | |
| inlf | =1 if in labor force | prot | =1 if Protestant | union | =1 if belong to union | | |
Describing models with patsy
The patsy package provides a formula syntax, written as strings, for defining models. The basic syntax is
y ~ x1 + x2
which describes the model
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon$.
Notice that I did not specify the constant. Patsy takes care of that automatically. Let's start slow and build up the regression, then we will see how to do it all in one shot. You'll have to install patsy with pip:
pip install patsy
We start by passing our model, specified as a patsy string, to patsy.dmatrices() to create the design matrices. Our model is
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \epsilon. $$
NB, this is Wooldridge problem 3, chapter 3.
import patsy # provides a syntax for specifying models
# Pass the model formula and the associated data to create design matrices
y, X = patsy.dmatrices('sleep ~ totwrk + educ + age', sleep)
# What do we have?
print('X and y are of type:' , type(X), type(y))
X and y are of type: <class 'patsy.design_info.DesignMatrix'> <class 'patsy.design_info.DesignMatrix'>
X
DesignMatrix with shape (706, 4)
Intercept totwrk educ age
1 3438 12 32
1 5020 14 31
1 2815 17 44
1 3786 12 30
1 2580 14 64
1 1205 12 41
1 2113 12 35
1 3608 13 47
1 2353 17 32
1 2851 15 30
1 6415 8 43
1 370 16 23
1 2438 16 24
1 2693 5 48
1 2526 12 33
1 2950 12 23
1 3003 17 46
1 4011 14 37
1 2300 12 53
1 1543 17 45
1 3473 17 46
1 3276 13 40
1 2506 12 53
1 2651 13 29
1 4580 12 29
1 3588 12 53
1 3418 13 28
1 2250 12 35
1 2638 12 36
1 3173 12 59
[676 rows omitted]
Terms:
'Intercept' (column 0)
'totwrk' (column 1)
'educ' (column 2)
'age' (column 3)
(to view full data, use np.asarray(this_obj))
y
DesignMatrix with shape (706, 1)
sleep
3113
2920
2670
3083
3448
4063
3180
2928
3368
3018
1575
3295
3798
3008
3248
3683
3201
2580
3420
3090
2760
2880
3470
2673
2820
2873
1905
2926
2603
3238
[676 rows omitted]
Terms:
'sleep' (column 0)
(to view full data, use np.asarray(this_obj))
So X holds the independent variables and y holds the dependent variable. Note the addition of the intercept (the column of ones) to the X matrix.
Building and estimating the model in statsmodels
With the design matrices in hand, we can build an ordinary least squares (OLS) model in the Python package statsmodels, which you can install via
pip install statsmodels
import statsmodels.api as sm # provides statistical models like ols, gmm, anova, etc...
# Pass the design matrices to OLS to specify an OLS model
sleep_model = sm.OLS(y, X)
type(sleep_model)
statsmodels.regression.linear_model.OLS
We can now estimate the model using the .fit( ) method of the statsmodels model object.
res = sleep_model.fit() # Estimate the model and store the results in res
# To see the summary report
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.113
Model: OLS Adj. R-squared: 0.110
Method: Least Squares F-statistic: 29.92
Date: Tue, 01 Nov 2022 Prob (F-statistic): 3.28e-18
Time: 15:13:19 Log-Likelihood: -5263.1
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 702 BIC: 1.055e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3638.2453 112.275 32.405 0.000 3417.810 3858.681
totwrk -0.1484 0.017 -8.888 0.000 -0.181 -0.116
educ -11.1338 5.885 -1.892 0.059 -22.687 0.420
age 2.1999 1.446 1.522 0.129 -0.639 5.038
==============================================================================
Omnibus: 68.731 Durbin-Watson: 1.943
Prob(Omnibus): 0.000 Jarque-Bera (JB): 185.551
Skew: -0.496 Prob(JB): 5.11e-41
Kurtosis: 5.308 Cond. No. 1.66e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
The more you work, the less you sleep. Or is it the other way around? I feel better already.
We can retrieve other results too.
print('The estimated coefficients are:', res.params, '\n')
print('The confidence intervals are:\n', res.conf_int(), '\n')
print(f'The r-squared is: {res.rsquared:.4f}')
The estimated coefficients are: [ 3.63824531e+03 -1.48373438e-01 -1.11338131e+01  2.19988481e+00] 

The confidence intervals are:
 [[ 3.41781008e+03  3.85868055e+03]
 [-1.81148684e-01 -1.15598192e-01]
 [-2.26872877e+01  4.19661463e-01]
 [-6.38561308e-01  5.03833093e+00]] 

The r-squared is: 0.1134
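In the multivariate case, the closed-form estimator generalizes our simple-example formula to $\hat\beta = (X'X)^{-1}X'y$. As a sanity check, here is a sketch that computes it directly from the patsy design matrices still in memory; it should reproduce res.params:
# Closed-form OLS using the design matrices; should match res.params above
Xmat = np.asarray(X)
yvec = np.asarray(y)
beta_manual = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ yvec)
print(beta_manual.flatten())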
Directly specifying and estimating models with the formula.api
We have built our model from the ground up
- Create the design matrices with patsy
- Create the model with statsmodels
- Fit the model and obtain results
That was helpful to get a sense of what is going on, but we can do all those steps in one line of code. We just pass the patsy string and the data directly to statsmodels and call fit.
To do this, we use the statsmodels.formula.api module. The formula api interprets the patsy formula for us and creates the design matrices.
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
res = smf.ols('sleep ~ totwrk + educ + age', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.113
Model: OLS Adj. R-squared: 0.110
Method: Least Squares F-statistic: 29.92
Date: Tue, 01 Nov 2022 Prob (F-statistic): 3.28e-18
Time: 15:13:19 Log-Likelihood: -5263.1
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 702 BIC: 1.055e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3638.2453 112.275 32.405 0.000 3417.810 3858.681
totwrk -0.1484 0.017 -8.888 0.000 -0.181 -0.116
educ -11.1338 5.885 -1.892 0.059 -22.687 0.420
age 2.1999 1.446 1.522 0.129 -0.639 5.038
==============================================================================
Omnibus: 68.731 Durbin-Watson: 1.943
Prob(Omnibus): 0.000 Jarque-Bera (JB): 185.551
Skew: -0.496 Prob(JB): 5.11e-41
Kurtosis: 5.308 Cond. No. 1.66e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Data transformations
Patsy can handle common (and less common) regression tasks. Here are a few
Interacting variables
Use '*' to interact two variables. Patsy will also include the two variables individually. Below, we interact education and age:
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 age\times educ + \epsilon $$.
res = smf.ols('sleep ~ totwrk + educ*age', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.119
Model: OLS Adj. R-squared: 0.114
Method: Least Squares F-statistic: 23.67
Date: Tue, 01 Nov 2022 Prob (F-statistic): 2.24e-18
Time: 15:13:19 Log-Likelihood: -5260.9
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 701 BIC: 1.055e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3081.4242 286.426 10.758 0.000 2519.069 3643.780
totwrk -0.1444 0.017 -8.618 0.000 -0.177 -0.112
educ 31.5246 21.032 1.499 0.134 -9.769 72.818
age 14.9293 6.197 2.409 0.016 2.763 27.096
educ:age -1.0065 0.477 -2.112 0.035 -1.942 -0.071
==============================================================================
Omnibus: 66.086 Durbin-Watson: 1.933
Prob(Omnibus): 0.000 Jarque-Bera (JB): 178.298
Skew: -0.475 Prob(JB): 1.92e-39
Kurtosis: 5.271 Cond. No. 4.32e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.32e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Using logs
We can use the np.log( ) function directly in the patsy syntax. Note that we loaded the numpy package above as np. Below, we use the logarithm of age in the model
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 \log(age) + \epsilon. $$
res = smf.ols('sleep ~ totwrk + educ + np.log(age)', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.113
Model: OLS Adj. R-squared: 0.109
Method: Least Squares F-statistic: 29.79
Date: Tue, 01 Nov 2022 Prob (F-statistic): 3.91e-18
Time: 15:13:19 Log-Likelihood: -5263.3
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 702 BIC: 1.055e+04
Df Model: 3
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 3440.9308 239.448 14.370 0.000 2970.811 3911.050
totwrk -0.1489 0.017 -8.924 0.000 -0.182 -0.116
educ -11.3493 5.881 -1.930 0.054 -22.896 0.197
np.log(age) 79.2414 56.612 1.400 0.162 -31.908 190.391
==============================================================================
Omnibus: 68.734 Durbin-Watson: 1.944
Prob(Omnibus): 0.000 Jarque-Bera (JB): 184.854
Skew: -0.497 Prob(JB): 7.24e-41
Kurtosis: 5.301 Cond. No. 3.61e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.61e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Fixed effects
When I was a kid, we called these dummy variables. Gender is coded {0,1} in the variable male. Let's suppose it was coded as a string.
sleep['gender'] = sleep['male'].replace({1.0:'male', 0.0:'female'})
res = smf.ols('sleep ~ totwrk + educ + age + C(gender)', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.122
Model: OLS Adj. R-squared: 0.117
Method: Least Squares F-statistic: 24.26
Date: Tue, 01 Nov 2022 Prob (F-statistic): 8.02e-19
Time: 15:13:19 Log-Likelihood: -5259.8
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 701 BIC: 1.055e+04
Df Model: 4
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 3642.4666 111.844 32.567 0.000 3422.877 3862.056
C(gender)[T.male] 87.9933 34.323 2.564 0.011 20.604 155.382
totwrk -0.1658 0.018 -9.230 0.000 -0.201 -0.131
educ -11.7561 5.866 -2.004 0.045 -23.274 -0.238
age 1.9643 1.443 1.361 0.174 -0.869 4.797
==============================================================================
Omnibus: 65.308 Durbin-Watson: 1.942
Prob(Omnibus): 0.000 Jarque-Bera (JB): 174.107
Skew: -0.473 Prob(JB): 1.56e-38
Kurtosis: 5.241 Cond. No. 1.66e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Including a fixed effect for gender means that we interpret the coefficient on the dummy as relative to the reference group (here, women); ie, since the coefficient is positive, men sleep more than women, ceteris paribus. Equivalently, the intercept is the value of the female fixed effect.
No intercept
Use -1 to suppress the automatic intercept (this is what forced the regression through the origin in our simple example). With the gender categorical in the model, patsy instead includes a separate dummy, and hence a separate group intercept, for each gender.
res = smf.ols('sleep ~ totwrk + educ + age + C(gender) -1', data=sleep).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sleep R-squared: 0.122
Model: OLS Adj. R-squared: 0.117
Method: Least Squares F-statistic: 24.26
Date: Tue, 01 Nov 2022 Prob (F-statistic): 8.02e-19
Time: 15:13:19 Log-Likelihood: -5259.8
No. Observations: 706 AIC: 1.053e+04
Df Residuals: 701 BIC: 1.055e+04
Df Model: 4
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
C(gender)[female] 3642.4666 111.844 32.567 0.000 3422.877 3862.056
C(gender)[male] 3730.4598 117.475 31.755 0.000 3499.816 3961.104
totwrk -0.1658 0.018 -9.230 0.000 -0.201 -0.131
educ -11.7561 5.866 -2.004 0.045 -23.274 -0.238
age 1.9643 1.443 1.361 0.174 -0.869 4.797
==============================================================================
Omnibus: 65.308 Durbin-Watson: 1.942
Prob(Omnibus): 0.000 Jarque-Bera (JB): 174.107
Skew: -0.473 Prob(JB): 1.56e-38
Kurtosis: 5.241 Cond. No. 2.37e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.37e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
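Note how the two parameterizations line up: the female group intercept equals the intercept from the previous regression, and the male group intercept equals that intercept plus the male dummy, $3642.47 + 87.99 \approx 3730.46$. The slope coefficients are unchanged; dropping the intercept just re-parameterizes the fixed effects.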
Practice
Wooldridge problem C2 in chapter 6.
- Load wage1.dta
- Generate a scatter plot of log(wage) on education.
wage1 = pd.read_stata('./Data/wage1.dta')
# There is already a log(wage) variable in the data set (lwage) but we can always use some practice with transforms
wage1['lnwage'] = wage1['wage'].apply(np.log)
wage1.head()
| wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | ... | trade | services | profserv | profocc | clerocc | servocc | lwage | expersq | tenursq | lnwage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.10 | 11.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.131402 | 4.0 | 0.0 | 1.131402 |
| 1 | 3.24 | 12.0 | 22.0 | 2.0 | 0.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.175573 | 484.0 | 4.0 | 1.175573 |
| 2 | 3.00 | 11.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.098612 | 4.0 | 0.0 | 1.098612 |
| 3 | 6.00 | 8.0 | 44.0 | 28.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.791759 | 1936.0 | 784.0 | 1.791759 |
| 4 | 5.30 | 12.0 | 7.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.667707 | 49.0 | 4.0 | 1.667707 |
5 rows × 25 columns
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], wage1['lnwage'], marker='o', alpha = 0.5 )
ax.set_xlabel('education')
ax.set_ylabel('log wage')
sea.despine(ax=ax, trim=True)
- What's the relationship between education and log-wage? Estimate
$$ \log(wage) = \beta_0 + \beta_1 educ + \epsilon$$
Print your estimated coefficient on education. Is it statistically different from zero? What's the interpretation?
res = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.186
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 119.6
Date: Tue, 01 Nov 2022 Prob (F-statistic): 3.27e-25
Time: 15:13:19 Log-Likelihood: -359.38
No. Observations: 526 AIC: 722.8
Df Residuals: 524 BIC: 731.3
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5838 0.097 5.998 0.000 0.393 0.775
educ 0.0827 0.008 10.935 0.000 0.068 0.098
==============================================================================
Omnibus: 11.804 Durbin-Watson: 1.801
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.811
Skew: 0.268 Prob(JB): 0.00100
Kurtosis: 3.586 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- Scatter plot the residuals against education. The residuals are in the results object from the fit method.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], res.resid, marker='o', alpha = 0.5 )
ax.set_xlabel('education')
ax.set_ylabel('residual')
sea.despine(ax=ax, trim=True)
Looks heteroskedastic. That's a word to impress your friends. It means the errors don't all look like they're drawn from the same distribution: the error variance may differ depending on whether someone has high or low education. The errors are still mean-zero noise, so our math and $\beta$ point-estimates are valid. The consequence of heteroskedasticity is that our standard errors and the corresponding hypothesis tests are wrong.
This is actually an easy fix (mathematically and programmatically) and we can change the covariance matrix type (which will correct the standard error calculation) using the cov_type parameter (docs). The available types of covariance matrices are described in the docs.
- Redo the estimation allowing for a more flexible variance-covariance matrix. Try 'HC3' for your covariance matrix type.
res = smf.ols('np.log(wage) ~ educ', data=wage1).fit(cov_type='HC3')
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.186
Model: OLS Adj. R-squared: 0.184
Method: Least Squares F-statistic: 111.7
Date: Tue, 01 Nov 2022 Prob (F-statistic): 8.48e-24
Time: 15:13:19 Log-Likelihood: -359.38
No. Observations: 526 AIC: 722.8
Df Residuals: 524 BIC: 731.3
Df Model: 1
Covariance Type: HC3
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5838 0.099 5.872 0.000 0.389 0.779
educ 0.0827 0.008 10.569 0.000 0.067 0.098
==============================================================================
Omnibus: 11.804 Durbin-Watson: 1.801
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.811
Skew: 0.268 Prob(JB): 0.00100
Kurtosis: 3.586 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
- Scatter plot the data and add the regression line.
To plot the regression line you will need to create some x data and then apply the parameters. I used something like this
y = [p.Intercept + p.educ*i for i in x]
where p holds the parameters from my results and x holds the relevant independent variables from the specification. If you've had linear algebra, this is matrix multiplication between a coefficient vector and the independent X matrix.
fig, ax = plt.subplots(figsize=(15,10))
# Plot the data
ax.scatter(wage1['educ'], wage1['lnwage'], marker='o', alpha = 0.5 )
# Create the line of best fit to plot
p = res.params # params from the model fit
x = range(0,20) # some x data
y = [p.Intercept + p.educ*i for i in x] # apply the coefficients
ax.plot(x,y, color='black')
# build the string
text = 'regression line: \nlog(w) = {0:.2} + {1:.2} educ'.format(p.Intercept, p.educ)
# place the annotation
ax.text(0.5, 2.5, text, fontsize=14)
ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)
sea.despine(ax=ax, trim=True)
plt.show()
- Go back and add the text 'log(w) = a + b*educ' to the northwest corner of your plot. Replace a and b with the estimated parameter values.
# See above.
- Let's modify the regression to include education in logs.
Estimate the model and interpret your results. You may have to drop a couple of observations where education is zero (or add one to all education values in the data) in order to generate nice output.
Hint: Totally differentiate our specification to get $\frac{dw}{w}=\beta_1\times\frac{de}{e}$. Rearrange such that $\beta_1 = \frac{dw}{de}\times \frac{e}{w}.$ What is the economic term for the RHS of that expression?
wage1= wage1[wage1['educ']>0]
res = smf.ols('np.log(wage) ~ np.log(educ)', data=wage1,missing='drop').fit(cov_type='HC3')
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.149
Model: OLS Adj. R-squared: 0.147
Method: Least Squares F-statistic: 53.97
Date: Tue, 01 Nov 2022 Prob (F-statistic): 7.89e-13
Time: 15:13:20 Log-Likelihood: -370.08
No. Observations: 524 AIC: 744.2
Df Residuals: 522 BIC: 752.7
Df Model: 1
Covariance Type: HC3
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -0.4447 0.283 -1.572 0.116 -0.999 0.110
np.log(educ) 0.8252 0.112 7.346 0.000 0.605 1.045
==============================================================================
Omnibus: 11.725 Durbin-Watson: 1.796
Prob(Omnibus): 0.003 Jarque-Bera (JB): 13.046
Skew: 0.287 Prob(JB): 0.00147
Kurtosis: 3.517 Cond. No. 29.6
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
The education **elasticity** of wage is 0.8252.
- Let's add some additional regressors (aka RHS variables, independent variables, covariates). Estimate $$ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4I_m + \epsilon$$
where $I_m$ is a variable equal to 1 if the worker is married. Interpret your results.
res = smf.ols('np.log(wage) ~ educ + exper + np.power(exper,2) + C(married)', data=wage1).fit(cov_type='HC3')
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(wage) R-squared: 0.314
Model: OLS Adj. R-squared: 0.309
Method: Least Squares F-statistic: 57.67
Date: Tue, 01 Nov 2022 Prob (F-statistic): 2.92e-40
Time: 15:13:20 Log-Likelihood: -313.47
No. Observations: 524 AIC: 636.9
Df Residuals: 519 BIC: 658.2
Df Model: 4
Covariance Type: HC3
======================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------
Intercept 0.0869 0.108 0.807 0.420 -0.124 0.298
C(married)[T.1.0] 0.1217 0.044 2.789 0.005 0.036 0.207
educ 0.0921 0.008 11.693 0.000 0.077 0.108
exper 0.0343 0.005 6.354 0.000 0.024 0.045
np.power(exper, 2) -0.0006 0.000 -5.166 0.000 -0.001 -0.000
==============================================================================
Omnibus: 5.733 Durbin-Watson: 1.793
Prob(Omnibus): 0.057 Jarque-Bera (JB): 7.698
Skew: 0.046 Prob(JB): 0.0213
Kurtosis: 3.587 Cond. No. 4.36e+03
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 4.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Wage is increasing in marriage, education, and experience. The marginal benefit of experience is decreasing, however.
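For instance, the quadratic in experience implies the return to experience eventually flattens out. A quick sketch backing that turning point out of the estimates (assuming the parameter labels match the summary table above):
# Experience level at which the estimated marginal return to experience is zero
p = res.params
peak_exper = -p['exper']/(2*p['np.power(exper, 2)'])
print(f'The return to experience peaks at roughly {peak_exper:.1f} years')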