Lecture 21: OLS Regression with Discrete Dependent Variables
[MY SOLUTIONS]
We continue to learn about the statsmodels package (docs), which provides functions for formulating and estimating statistical models. In this notebook we take on models in which the dependent variable is discrete. In the examples below, the dependent variable is binary (which makes it easier to visualize). At the end of the lecture, we extend the analysis to dependent variables with many discrete values.
Here is a nice overview of the discrete choice models in statsmodels.
The agenda for today's lecture is as follows:
Class Announcements
None.
1. Math Primer (top)
So far we've been dealing with continuous dependent (i.e., LHS) variables such as hours worked. Many outcomes we observe and are interested in are not continuous, however. For example, labor force participation in the United States is roughly 65%, so the choice of whether or not to work appears to be a significant one.
Suppose our dependent variable Y is binary (i.e., zero or one). For example, Y may represent presence/absence of a certain condition, success/failure of some device, a yes/no answer on a survey, etc. We also have a vector of regressors X which we think influence Y. As before, suppose we also have an error term $\epsilon$ drawn from some distribution. Define $Y^*$ as a latent (i.e., unobserved) variable where
$$ Y^*= X\beta + \epsilon $$
and we think $Y=1$ whenever $Y^*>0$, or equivalently whenever $X\beta + \epsilon>0$. Define $P(Y=1|X)$ as the 'probability Y is equal to one conditional on the variables X.' It follows then that
$$ P(Y=1|X)= P(Y^*>0) = P(X\beta + \epsilon > 0) = P(\epsilon > -X\beta), $$
which, for an error distribution that is symmetric around zero (like the Normal we use next), is the same as $P(\epsilon < X\beta)$.

2. Probit Regression (top)
Note that $P(\epsilon < X\beta)$ is the definition of a CDF. Suppose we specify that $\epsilon$ is drawn iid from a standard Normal distribution. With this added assumption, we can do a lot more:
$$ P(Y=1|X)= \Phi(X\beta) $$
where $\Phi()$ is the CDF for the standard Normal distribution. The likelihood we observe a single observation ($Y_j=1$ or $Y_j=0$) is therefore
$$ \mathcal{L}(\beta;y_j,x_j)= \Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}. $$
The first part ($\Phi(x_j\beta)^{y_j}$) gets turned on when $y_j=1$ while the second part gets turned on when $y_j=0$. We can therefore solve for the $\beta$ vector that best matches the data by maximizing the 'likelihood' function; i.e.,
$$ \mathcal{L}(\beta;Y,X)= \prod_{j=1}^J\Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}. $$
This looks complicated, but it's really just a simple maximization problem, much like OLS.
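Before turning to the statsmodels routines, here is a minimal sketch of what 'maximizing the likelihood' means in practice. It simulates data from the latent-variable model above (the parameter values 0.25 and 1.5 are made up for illustration) and recovers $\beta$ by maximizing the log of the likelihood (same maximizer, easier numerically) with scipy.optimize.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
rng = np.random.default_rng(0)                 # seeded so the example is reproducible
n = 5000
x = rng.normal(size=n)                         # a single made-up regressor
X = np.column_stack([np.ones(n), x])           # add a constant
beta_true = np.array([0.25, 1.5])              # made-up 'true' parameters
y_star = X @ beta_true + rng.normal(size=n)    # latent variable (unobserved in real data)
y = (y_star > 0).astype(int)                   # we only observe whether Y* > 0
def neg_loglike(beta, y, X):
    # negative probit log-likelihood (minimizing the negative = maximizing the likelihood)
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)   # clip to avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
res = minimize(neg_loglike, x0=np.zeros(2), args=(y, X), method='BFGS')
print(res.x)                                   # should be close to beta_true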
An Example: Gambling
When we're talking probability, there is no better example than gambling. Actually, gambling is the source (inspiration?) for a lot of the probability theory we have today. Since relativity and quantum mechanics use probability heavily, let's attribute that to gambling too.
The file 'pntsprd.dta' contains data about Vegas betting. The complete variable list is here. We will use favwin, which is equal to 1 if the favored team won and zero otherwise, and spread, which holds the betting spread. In this context, the spread is the number of points by which the favored team is expected to beat the underdog; a bet on the favorite pays off only if it wins by more than the spread (it 'covers the spread').
import pandas as pd # for data handling
import numpy as np # for numerical methods and data structures
import matplotlib.pyplot as plt # for plotting
import seaborn as sea # advanced plotting
import statsmodels.formula.api as smf # provides a way to specify models directly from formulas
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
vegas = pd.read_stata('./Data/pntsprd.dta')
# Take a look...so clean!
vegas.head()
| | favscr | undscr | spread | favhome | neutral | fav25 | und25 | fregion | uregion | scrdiff | sprdcvr | favwin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72.0 | 61.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 4.0 | 11.0 | 1.0 | 1.0 |
| 1 | 82.0 | 74.0 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 8.0 | 1.0 | 1.0 |
| 2 | 87.0 | 57.0 | 17.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 30.0 | 1.0 | 1.0 |
| 3 | 69.0 | 70.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | -1.0 | 0.0 | 0.0 |
| 4 | 77.0 | 79.0 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | -2.0 | 0.0 | 0.0 |
vegas.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 553 entries, 0 to 552
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   favscr   553 non-null    float32
 1   undscr   553 non-null    float32
 2   spread   553 non-null    float32
 3   favhome  553 non-null    float32
 4   neutral  553 non-null    float32
 5   fav25    553 non-null    float32
 6   und25    553 non-null    float32
 7   fregion  553 non-null    float32
 8   uregion  553 non-null    float32
 9   scrdiff  553 non-null    float32
 10  sprdcvr  553 non-null    float32
 11  favwin   553 non-null    float32
dtypes: float32(12)
memory usage: 30.2 KB
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='red')
ax.set_ylabel('favored team outcome (win = 1, loss = 0)')
ax.set_xlabel('point spread')
ax.set_title('The data from the point spread dataset')
sea.despine(ax=ax)
Estimation
We begin with the linear probability model. The model is
$$ favwin = \beta_0 + \beta_1\, spread + \epsilon, \qquad \text{so that} \qquad \text{Pr}(favwin=1 \mid spread) = \beta_0 + \beta_1\, spread. $$
There is nothing new here technique-wise. Let's start with OLS, which is like pretending the Y variable is continuous.
# statsmodels adds a constant for us...
res_ols = smf.ols('favwin ~ spread', data=vegas).fit()
print(res_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: favwin R-squared: 0.111
Model: OLS Adj. R-squared: 0.109
Method: Least Squares F-statistic: 68.57
Date: Tue, 08 Nov 2022 Prob (F-statistic): 9.32e-16
Time: 09:07:16 Log-Likelihood: -279.29
No. Observations: 553 AIC: 562.6
Df Residuals: 551 BIC: 571.2
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5769 0.028 20.434 0.000 0.521 0.632
spread 0.0194 0.002 8.281 0.000 0.015 0.024
==============================================================================
Omnibus: 86.055 Durbin-Watson: 2.112
Prob(Omnibus): 0.000 Jarque-Bera (JB): 94.402
Skew: -0.956 Prob(JB): 3.17e-21
Kurtosis: 2.336 Cond. No. 20.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Hypothesis testing with a t-test
If bookies were all-knowing, the spread would exactly account for the predictable part of the winning probability and all we would be left with is noise --- the intercept should be one-half. Is this true in the data? We can use the t_test() method of the results object to perform t-tests.
The null hypothesis is $H_0: \beta_0 = 0.5$ and the alternative hypothesis is $H_1: \beta_0 \neq 0.5$.
t_test = res_ols.t_test('Intercept = 0.5')
print(t_test)
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 0.5769 0.028 2.725 0.007 0.521 0.632
==============================================================================
The test rejects the null at the 1% level (p = 0.007), so the intercept is statistically different from one-half.

Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one!
We can use the fittedvalues attribute of the results object to recover the fitted values of the dependent variable. Let's plot them and take a look.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], res_ols.fittedvalues, facecolors='none', edgecolors='red')
ax.axhline(y=1.0, color='grey', linestyle='--')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from an OLS model')
sea.despine(ax=ax, trim=True)
Now, let's account for the discreteness and estimate with probit.
res_probit = smf.probit('favwin ~ spread', data=vegas).fit()
print(res_probit.summary())
Optimization terminated successfully.
Current function value: 0.476604
Iterations 6
Probit Regression Results
==============================================================================
Dep. Variable: favwin No. Observations: 553
Model: Probit Df Residuals: 551
Method: MLE Df Model: 1
Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.1294
Time: 09:07:16 Log-Likelihood: -263.56
converged: True LL-Null: -302.75
Covariance Type: nonrobust LLR p-value: 8.521e-19
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.0106 0.104 -0.102 0.919 -0.214 0.193
spread 0.0925 0.012 7.591 0.000 0.069 0.116
==============================================================================
Notice the top: "Optimization terminated successfully..." That's because with probit there is no analytical solution like there is with OLS. Instead, the computer has to maximize the likelihood function numerically: it takes an initial guess for $\beta$ and then iterates, using derivative information to make smart updates.
The coefficients are very different. Just look at the intercept! That's in large part because the coefficients have a different meaning in a probabilistic model. In order to determine the effect on Y, we have to run the coefficients through the distributional assumption, here the Normal. When we do this, we call the results 'marginal effects.' The math is pretty straightforward -- but then again, recovering marginal effects is standard stuff, so there's a method for that:
margeff = res_probit.get_margeff('mean')
print(margeff.summary())
Probit Marginal Effects
=====================================
Dep. Variable: favwin
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
spread 0.0251 0.003 8.661 0.000 0.019 0.031
==============================================================================
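Under the hood, the probit marginal effect of a regressor $x_k$ is the coefficient scaled by the standard Normal pdf $\phi()$ evaluated at the index,
$$ \frac{\partial P(Y=1\mid X)}{\partial x_k} = \phi(X\beta)\,\beta_k, $$
which is what get_margeff('mean') reports, with $X\beta$ evaluated at the sample averages.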
Okay, so a unit increase in the spread is associated with a (statistically significant) 2.5 percentage point increase in the probability that the favored team wins. Makes sense -- otherwise those bright, shiny Vegas lights wouldn't be so shiny.
Note that the marginal effect calculation required us to take a stand on where we evaluate the derivative. In a linear model like OLS, the derivatives are just the coefficients, and those are constant. Here, the model is non-linear (because of the Normal CDF), so the derivative changes depending on where we evaluate it. Evaluating at the average is the standard choice, though with skewed data the median might be more reasonable.
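If you want to see how much this choice matters, get_margeff() accepts other evaluation points; for example, 'overall' averages the marginal effect across all observations while 'median' evaluates at the sample medians. A quick comparison (a sketch, reusing the res_probit results estimated above):
# Compare marginal effects of spread evaluated at different points in the data
for at in ['mean', 'overall', 'median']:
    me = res_probit.get_margeff(at=at)                          # recompute at a different point
    print(f"{at:8s} marginal effect of spread = {me.margeff[0]:.4f}")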
Let's take a look at the marginal effects at different points in the data. Note that the reported marginal effect above is located at the intersection of the marginal effects plot and the vertical dashed line indicating the average spread.
from scipy.stats import norm # import functions related to the normal distribution
y = norm.pdf(res_probit.fittedvalues,0,1)*res_probit.params.spread
fig, ax = plt.subplots(figsize=(15,6))
avg_spread = np.mean(vegas['spread'])
# Create the marginal effects
ax.scatter(vegas['spread'],y, color='black', label = 'marg. effects')
ax.set_ylabel('estimated marginal effect')
ax.set_xlabel('point spread')
ax.set_title('plotting marginal effects')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(x=avg_spread, color='red', linestyle='--')
ax.text(avg_spread+.5,0.035,'Average Spread',fontsize=14)
ax.set_ylim([-1e-3,0.04])
plt.show()
Let's look at the predicted values. In OLS this was easy. Here, things are (for some bizarre reason) more complicated -- for discrete models, fittedvalues returns the linear index $X\hat\beta$, so we have to run it through the standard Normal CDF ourselves.
pred_probit = norm.cdf(res_probit.fittedvalues,0,1) # Standard Normal (ie, mean = 0, stdev = 1)
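As a sanity check, the results object's predict() method applies the Normal CDF for us, so it should give the same numbers (a quick check, using the objects defined above):
pred_check = res_probit.predict(vegas)             # predicted probabilities straight from statsmodels
print(np.allclose(pred_check, pred_probit))        # expect True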
Plot the estimated probability of the favored team winning and the actual data.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], pred_probit, facecolors='none', edgecolors='red', label='predicted')
ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')
# Create the line of best fit to plot
p = res_ols.params # params from the OLS model linear probability model
x = range(0,35) # some x data
y = [p.Intercept + p.spread*i for i in x] # apply the coefficients
ax.plot(x,y, color='black', label = 'linear prob.')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from a probit model')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)
3. Logistic Regression (aka Logit) (top)
Our framework is actually pretty flexible so we can use different distributions. The other popular distributional assumption is to assume the $\epsilon$ errors come from a Logistic distribution. Why Logistic? Because the result is a nice simple function for the probability:
$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$
and we predict that a team wins whenever $\text{prob} \ge 0.5$. We estimate the logit model with the logit() method from smf, in a way similar to probit.
res_logit = smf.logit('favwin ~ spread', data=vegas).fit()
print(res_logit.summary())
Optimization terminated successfully.
Current function value: 0.477218
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: favwin No. Observations: 553
Model: Logit Df Residuals: 551
Method: MLE Df Model: 1
Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.1283
Time: 09:07:17 Log-Likelihood: -263.90
converged: True LL-Null: -302.75
Covariance Type: nonrobust LLR p-value: 1.201e-18
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.0712 0.173 -0.411 0.681 -0.411 0.268
spread 0.1632 0.023 7.236 0.000 0.119 0.207
==============================================================================
Again, interpreting logit coefficients is a bit more complicated. The probability that a team wins is given by the expression
$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$
so to get the marginal effects we hammer $X\hat\beta$ through this non-linear function. Let's take a look:
margeff = res_logit.get_margeff('mean')
print(margeff.summary())
Logit Marginal Effects
=====================================
Dep. Variable: favwin
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
spread 0.0244 0.003 9.059 0.000 0.019 0.030
==============================================================================
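For reference, the logit marginal effect behind the number above is the coefficient scaled by the logistic density at the index,
$$ \frac{\partial\, \text{prob}}{\partial x_k} = \Lambda(X\beta)\,\bigl(1-\Lambda(X\beta)\bigr)\,\beta_k, \qquad \Lambda(z)=\frac{\exp(z)}{1+\exp(z)}, $$
evaluated here at the average spread.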
Let's again plot the estimated probability of the favored team winning along with the actual data, but now let's compare the implications of our distributional assumptions. First, generate predicted values using numpy and the above expression for the probability.
pred_logit = np.exp(res_logit.fittedvalues) /( 1+np.exp(res_logit.fittedvalues) )
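If scipy is available, the same calculation can be done with the logistic sigmoid expit(), which is a bit more numerically stable for large indices:
from scipy.special import expit                  # expit(z) = exp(z) / (1 + exp(z))
pred_logit_alt = expit(res_logit.fittedvalues)
print(np.allclose(pred_logit, pred_logit_alt))   # expect True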
Now, plot probit vs logit:
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], pred_logit, facecolors='none', edgecolors='red', label='predicted-logit')
ax.scatter(vegas['spread'], pred_probit, facecolors='none', edgecolors='black', label='predicted-probit')
ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')
# Create the line of best fit to plot
p = res_ols.params # params from the OLS model linear probability model
x = range(0,35) # some x data
y = [p.Intercept + p.spread*i for i in x] # apply the coefficients
ax.plot(x,y, color='black', label = 'linear prob.')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from logit and probit models')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)
We observe that the probit and logit predictions lie nearly on top of each other. That's a common occurrence. In practice, the models are often interchangeable, and the practitioner will choose one over the other because, in their setting, one may have slightly better properties (e.g., a more intuitive interpretation of the marginal effects).
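A handy rule of thumb: because the logistic distribution has a larger variance than the standard Normal, logit slope coefficients are typically about 1.6 to 1.8 times the corresponding probit coefficients. A quick check with our estimates:
# Ratio of the logit slope to the probit slope (intercepts are both near zero, so skip them)
print(res_logit.params['spread'] / res_probit.params['spread'])   # roughly 1.76 here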
apples = pd.read_stata('./Data/apple.dta')
apples.head()
| | id | educ | date | state | regprc | ecoprc | inseason | hhsize | male | faminc | age | reglbs | ecolbs | numlt5 | num5_17 | num18_64 | numgt64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10002 | 16 | 111597 | SD | 1.19 | 1.19 | 1 | 4 | 0 | 45 | 43 | 2.0 | 2.000000 | 0 | 1 | 3 | 0 |
| 1 | 10004 | 16 | 121897 | KS | 0.59 | 0.79 | 0 | 1 | 0 | 65 | 37 | 0.0 | 2.000000 | 0 | 0 | 1 | 0 |
| 2 | 10034 | 18 | 111097 | MI | 0.59 | 0.99 | 1 | 3 | 0 | 65 | 44 | 0.0 | 2.666667 | 0 | 2 | 1 | 0 |
| 3 | 10035 | 12 | 111597 | TN | 0.89 | 1.09 | 1 | 2 | 1 | 55 | 55 | 3.0 | 0.000000 | 0 | 0 | 2 | 0 |
| 4 | 10039 | 15 | 122997 | NY | 0.89 | 1.09 | 0 | 1 | 1 | 25 | 22 | 0.0 | 3.000000 | 0 | 0 | 1 | 0 |
- Create a variable named ecobuy that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs > 0).
# this is only one way to do this...
apples['ecobuy'] = 0 # create the variable and default it to zero
apples.loc[apples['ecolbs']>0, 'ecobuy'] = 1 # set the variable = 1 when positive ecolbs
apples['ecobuy'].describe()
count    660.000000
mean       0.624242
std        0.484685
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: ecobuy, dtype: float64
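Another common idiom is to cast the boolean comparison directly to an integer; the column name ecobuy_alt below is hypothetical and only there to verify the two approaches agree.
apples['ecobuy_alt'] = (apples['ecolbs'] > 0).astype(int)    # True/False -> 1/0
print((apples['ecobuy_alt'] == apples['ecobuy']).all())      # expect True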
- Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics.
apple_res = smf.ols('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ecobuy R-squared: 0.110
Model: OLS Adj. R-squared: 0.102
Method: Least Squares F-statistic: 13.43
Date: Tue, 08 Nov 2022 Prob (F-statistic): 2.18e-14
Time: 09:07:17 Log-Likelihood: -419.60
No. Observations: 660 AIC: 853.2
Df Residuals: 653 BIC: 884.6
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.4237 0.165 2.568 0.010 0.100 0.748
ecoprc -0.8026 0.109 -7.336 0.000 -1.017 -0.588
regprc 0.7193 0.132 5.464 0.000 0.461 0.978
faminc 0.0006 0.001 1.042 0.298 -0.000 0.002
hhsize 0.0238 0.013 1.902 0.058 -0.001 0.048
educ 0.0248 0.008 2.960 0.003 0.008 0.041
age -0.0005 0.001 -0.401 0.689 -0.003 0.002
==============================================================================
Omnibus: 4015.360 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.344
Skew: -0.411 Prob(JB): 8.75e-16
Kurtosis: 1.641 Cond. No. 724.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- How many estimated probabilities are negative? How many are greater than one?
fitted = apple_res.fittedvalues # store the fitted values
fitted[(fitted>1) | (fitted<0)] # greater than 1 or less than zero
167    1.070860
493    1.054372
dtype: float64
val = ((fitted>1) | (fitted<0)).astype(float).mean()*100
print(f'Answer: {val:4.2f} percent of predicted probabilities are less than 0 or greater than 1.')
Answer: 0.30 percent of predicted probabilities are less than 0 or greater than 1.
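To answer the 'how many' part in counts rather than percentages (using the fitted values computed above):
n_above = (fitted > 1).sum()     # predictions above one
n_below = (fitted < 0).sum()     # negative predictions
print(n_above, n_below)          # 2 above one, none negative in this sample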
- Now estimate the model as a probit; i.e.,
$$ P(ecobuy=1 \mid X) = \Phi(X\beta), $$
where $\Phi( )$ is the CDF of the normal distribution.
apple_pres = smf.probit('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_pres.summary())
Optimization terminated successfully.
Current function value: 0.604599
Iterations 5
Probit Regression Results
==============================================================================
Dep. Variable: ecobuy No. Observations: 660
Model: Probit Df Residuals: 653
Method: MLE Df Model: 6
Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.08664
Time: 09:11:43 Log-Likelihood: -399.04
converged: True LL-Null: -436.89
Covariance Type: nonrobust LLR p-value: 2.751e-14
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.2438 0.474 -0.514 0.607 -1.173 0.685
ecoprc -2.2669 0.321 -7.052 0.000 -2.897 -1.637
regprc 2.0302 0.382 5.318 0.000 1.282 2.778
faminc 0.0014 0.002 0.932 0.351 -0.002 0.004
hhsize 0.0691 0.037 1.893 0.058 -0.002 0.141
educ 0.0714 0.024 2.939 0.003 0.024 0.119
age -0.0012 0.004 -0.340 0.734 -0.008 0.006
==============================================================================
apple_pres.get_margeff?
- Compute the marginal effects of the coefficients at the means and print them out using summary(). Interpret the results.
probit_marg = apple_pres.get_margeff(at='mean')
print(probit_marg.summary())
Probit Marginal Effects
=====================================
Dep. Variable: ecobuy
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ecoprc -0.8508 0.120 -7.087 0.000 -1.086 -0.615
regprc 0.7619 0.143 5.334 0.000 0.482 1.042
faminc 0.0005 0.001 0.932 0.351 -0.001 0.002
hhsize 0.0259 0.014 1.894 0.058 -0.001 0.053
educ 0.0268 0.009 2.941 0.003 0.009 0.045
age -0.0005 0.001 -0.340 0.734 -0.003 0.002
==============================================================================
- Re-estimate the model as a logit model.
apple_lres = smf.logit('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_lres.summary())
Optimization terminated successfully.
Current function value: 0.604746
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: ecobuy No. Observations: 660
Model: Logit Df Residuals: 653
Method: MLE Df Model: 6
Date: Sun, 06 Nov 2022 Pseudo R-squ.: 0.08642
Time: 09:34:06 Log-Likelihood: -399.13
converged: True LL-Null: -436.89
Covariance Type: nonrobust LLR p-value: 3.017e-14
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.4278 0.786 -0.544 0.586 -1.968 1.112
ecoprc -3.6773 0.533 -6.898 0.000 -4.722 -2.632
regprc 3.2742 0.630 5.196 0.000 2.039 4.509
faminc 0.0026 0.003 1.012 0.311 -0.002 0.008
hhsize 0.1145 0.061 1.878 0.060 -0.005 0.234
educ 0.1186 0.041 2.925 0.003 0.039 0.198
age -0.0022 0.006 -0.372 0.710 -0.014 0.009
==============================================================================
- Compute the marginal effects of the logit coefficients at the averages in the data.
logit_marg = apple_lres.get_margeff(at='mean')
print(logit_marg.summary())
Logit Marginal Effects
=====================================
Dep. Variable: ecobuy
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ecoprc -0.8480 0.122 -6.972 0.000 -1.086 -0.610
regprc 0.7551 0.144 5.227 0.000 0.472 1.038
faminc 0.0006 0.001 1.012 0.311 -0.001 0.002
hhsize 0.0264 0.014 1.880 0.060 -0.001 0.054
educ 0.0273 0.009 2.931 0.003 0.009 0.046
age -0.0005 0.001 -0.372 0.710 -0.003 0.002
==============================================================================
We haven't done much data wrangling lately. I'm feeling a bit sad; I miss shaping data.
- Create a pandas DataFrame with the row index 'ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', and 'age'. The columns should be labeled 'logit', 'probit', and 'ols'. The columns should contain the marginal effects for the logit and probit models and the coefficients from the ols model.
params = pd.DataFrame({'logit':logit_marg.margeff, 'probit':probit_marg.margeff, 'ols':apple_res.params[1:]},
index = ['ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', 'age'],
)
params
| | logit | probit | ols |
|---|---|---|---|
| ecoprc | -0.848039 | -0.850754 | -0.802622 |
| regprc | 0.755077 | 0.761893 | 0.719268 |
| faminc | 0.000610 | 0.000543 | 0.000552 |
| hhsize | 0.026416 | 0.025947 | 0.023823 |
| educ | 0.027349 | 0.026784 | 0.024785 |
| age | -0.000503 | -0.000455 | -0.000501 |
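As a final sanity check, the probit and logit marginal effects sit close to the OLS coefficients for every regressor; dividing each column by the OLS column makes that easy to see.
# Each model's effect relative to the OLS coefficient (values near 1 mean close agreement)
print(params.div(params['ols'], axis=0).round(2))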