Lecture 21: OLS Regression with Discrete Dependent Variables
[MY SOLUTIONS]
We continue to learn about the statsmodels package (docs), which provides functions for formulating and estimating statistical models. In this notebook we take on models in which the dependent variable is discrete. In the examples below, the dependent variable is binary (which makes it easier to visualize). At the end of the lecture, we extend the analysis to dependent variables with many discrete values.
Here is a nice overview of the discrete choice models in statsmodels.
The agenda for today's lecture is as follows:
Class Announcements
None.
1. Math Primer (top)
So far we've been dealing with continuous dependent (i.e., LHS) variables such as hours worked. Many outcomes we observe and are interested in are not continuous, however. For example, labor force participation in the United States is roughly 65%, so the choice of whether or not to work appears to be a significant one.
Suppose our dependent variable Y is binary (i.e., zero or one). For example, Y may represent presence/absence of a certain condition, success/failure of some device, a yes/no answer on a survey, etc. We also have a vector of regressors X which we think influence Y. As before, suppose we also have an error term $\epsilon$ drawn from some distribution. Define $Y^*$ as a latent (i.e., unobserved) variable where
$$ Y^*= X\beta + \epsilon $$
and we think $Y=1$ whenever $Y^*>0$, or equivalently whenever $X\beta + \epsilon>0$. Define $P(Y=1|X)$ as the 'probability Y is equal to one conditional on the variables X.' It follows then that
$$ P(Y=1|X)= P(Y^*>0) = P(X\beta + \epsilon > 0) = P(\epsilon > -X\beta), $$
which, for an error distribution that is symmetric around zero (like the Normal we use next), is the same as $P(\epsilon < X\beta)$.

2. Probit Regression (top)
Note that $P(\epsilon < X\beta)$ is the definition of a CDF. Suppose we specify that $\epsilon$ is drawn iid from a standard Normal distribution. With this added assumption, we can do a lot more:
$$ P(Y=1|X)= \Phi(X\beta) $$
where $\Phi()$ is the CDF for the standard Normal distribution. The likelihood we observe a single observation ($Y_j=1$ or $Y_j=0$) is therefore
$$ \mathcal{L}(\beta;y_j,x_j)= \Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}. $$
The first part ($\Phi(x_j\beta)^{y_j}$) gets turned on when $y_j=1$ while the second part gets turned on when $y_j=0$. We can therefore solve for the $\beta$ vector that best matches the data by maximizing the 'likelihood' function; i.e.,
$$ \mathcal{L}(\beta;Y,X)= \prod_{j=1}^J\Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}. $$
This looks complicated, but it's really just a simple maximization problem, much like OLS.
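Before turning to the statsmodels routines, here is a minimal sketch of what 'maximizing the likelihood' means in practice. It simulates data from the latent-variable model above (the parameter values 0.25 and 1.5 are made up for illustration) and recovers $\beta$ by maximizing the log of the likelihood (same maximizer, easier numerically) with scipy.optimize.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
rng = np.random.default_rng(0)                 # seeded so the example is reproducible
n = 5000
x = rng.normal(size=n)                         # a single made-up regressor
X = np.column_stack([np.ones(n), x])           # add a constant
beta_true = np.array([0.25, 1.5])              # made-up 'true' parameters
y_star = X @ beta_true + rng.normal(size=n)    # latent variable (unobserved in real data)
y = (y_star > 0).astype(int)                   # we only observe whether Y* > 0
def neg_loglike(beta, y, X):
    # negative probit log-likelihood (minimizing the negative = maximizing the likelihood)
    p = np.clip(norm.cdf(X @ beta), 1e-10, 1 - 1e-10)   # clip to avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
res = minimize(neg_loglike, x0=np.zeros(2), args=(y, X), method='BFGS')
print(res.x)                                   # should be close to beta_true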
An Example: Gambling
When we're talking probability, there is no better example than gambling. Actually, gambling is the source (inspiration?) for a lot of the probability theory we have today. Since relativity and quantum mechanics use probability heavily, let's attribute that to gambling too.
The file 'pntsprd.dta' contains data about Vegas betting. The complete variable list is here. We will use favwin, which is equal to 1 if the favored team won and zero otherwise, and spread, which holds the betting spread. In this context, the spread is the number of points by which the favored team is expected to beat the underdog; a bet on the favorite pays off only if it wins by more than the spread (it 'covers the spread').
import pandas as pd # for data handling
import numpy as np # for numerical methods and data structures
import matplotlib.pyplot as plt # for plotting
import seaborn as sea # advanced plotting
import statsmodels.formula.api as smf # provides a way to specify models directly from formulas
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
vegas = pd.read_stata('./Data/pntsprd.dta')
# Take a look...so clean!
vegas.head()
| | favscr | undscr | spread | favhome | neutral | fav25 | und25 | fregion | uregion | scrdiff | sprdcvr | favwin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72.0 | 61.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 4.0 | 11.0 | 1.0 | 1.0 |
| 1 | 82.0 | 74.0 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 8.0 | 1.0 | 1.0 |
| 2 | 87.0 | 57.0 | 17.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 30.0 | 1.0 | 1.0 |
| 3 | 69.0 | 70.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | -1.0 | 0.0 | 0.0 |
| 4 | 77.0 | 79.0 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | -2.0 | 0.0 | 0.0 |
vegas.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 553 entries, 0 to 552
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   favscr   553 non-null    float32
 1   undscr   553 non-null    float32
 2   spread   553 non-null    float32
 3   favhome  553 non-null    float32
 4   neutral  553 non-null    float32
 5   fav25    553 non-null    float32
 6   und25    553 non-null    float32
 7   fregion  553 non-null    float32
 8   uregion  553 non-null    float32
 9   scrdiff  553 non-null    float32
 10  sprdcvr  553 non-null    float32
 11  favwin   553 non-null    float32
dtypes: float32(12)
memory usage: 30.2 KB
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='red')
ax.set_ylabel('favored team outcome (win = 1, loss = 0)')
ax.set_xlabel('point spread')
ax.set_title('The data from the point spread dataset')
sea.despine(ax=ax)
Estimation
We begin with the linear probability model. The model is
$$ favwin = \beta_0 + \beta_1\, spread + \epsilon, \qquad \text{so that} \qquad \text{Pr}(favwin=1 \mid spread) = \beta_0 + \beta_1\, spread. $$
There is nothing new here technique-wise. Let's start with OLS, which is like pretending the Y variable is continuous.
# statsmodels adds a constant for us...
res_ols = smf.ols('favwin ~ spread', data=vegas).fit()
print(res_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: favwin R-squared: 0.111
Model: OLS Adj. R-squared: 0.109
Method: Least Squares F-statistic: 68.57
Date: Tue, 08 Nov 2022 Prob (F-statistic): 9.32e-16
Time: 09:07:16 Log-Likelihood: -279.29
No. Observations: 553 AIC: 562.6
Df Residuals: 551 BIC: 571.2
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5769 0.028 20.434 0.000 0.521 0.632
spread 0.0194 0.002 8.281 0.000 0.015 0.024
==============================================================================
Omnibus: 86.055 Durbin-Watson: 2.112
Prob(Omnibus): 0.000 Jarque-Bera (JB): 94.402
Skew: -0.956 Prob(JB): 3.17e-21
Kurtosis: 2.336 Cond. No. 20.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Hypothesis testing with a t-test
If bookies were all-knowing, the spread would exactly account for the predictable part of the winning probability and all we would be left with is noise --- the intercept should be one-half. Is this true in the data? We can use the t_test() method of the results object to perform t-tests.
The null hypothesis is $H_0: \beta_0 = 0.5$ and the alternative hypothesis is $H_1: \beta_0 \neq 0.5$.
t_test = res_ols.t_test('Intercept = 0.5')
print(t_test)
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 0.5769 0.028 2.725 0.007 0.521 0.632
==============================================================================
The test rejects the null at the 1% level (p = 0.007), so the intercept is statistically different from one-half.

Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one!
We can use the fittedvalues attribute of the results object to recover the fitted values of the dependent variable. Let's plot them and take a look.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], res_ols.fittedvalues, facecolors='none', edgecolors='red')
ax.axhline(y=1.0, color='grey', linestyle='--')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from an OLS model')
sea.despine(ax=ax, trim=True)
Now, let's account for the discreteness and estimate with probit.
res_probit = smf.probit('favwin ~ spread', data=vegas).fit()
print(res_probit.summary())
Optimization terminated successfully.
Current function value: 0.476604
Iterations 6
Probit Regression Results
==============================================================================
Dep. Variable: favwin No. Observations: 553
Model: Probit Df Residuals: 551
Method: MLE Df Model: 1
Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.1294
Time: 09:07:16 Log-Likelihood: -263.56
converged: True LL-Null: -302.75
Covariance Type: nonrobust LLR p-value: 8.521e-19
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.0106 0.104 -0.102 0.919 -0.214 0.193
spread 0.0925 0.012 7.591 0.000 0.069 0.116
==============================================================================
Notice the top: "Optimization terminated successfully..." That's because with probit there is no analytical solution like there is with OLS. Instead, the computer has to maximize the likelihood function numerically: it takes an initial guess for $\beta$ and then iterates, using derivative information to make smart updates.
The coefficients are very different. Just look at the intercept! That's in large part because the coefficients have a different meaning in a probabilistic model. In order to determine the effect on Y, we have to run the coefficients through the distributional assumption, here the Normal. When we do this, we call the results 'marginal effects.' The math is pretty straightforward -- but then again, recovering marginal effects is standard stuff, so there's a method for that:
margeff = res_probit.get_margeff('mean')
print(margeff.summary())
Probit Marginal Effects
=====================================
Dep. Variable: favwin
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
spread 0.0251 0.003 8.661 0.000 0.019 0.031
==============================================================================
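Under the hood, the probit marginal effect of a regressor $x_k$ is the coefficient scaled by the standard Normal pdf $\phi()$ evaluated at the index,
$$ \frac{\partial P(Y=1\mid X)}{\partial x_k} = \phi(X\beta)\,\beta_k, $$
which is what get_margeff('mean') reports, with $X\beta$ evaluated at the sample averages.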
Okay, so a unit increase in the spread is associated with a (statistically significant) 2.5 percentage point increase in the probability that the favored team wins. Makes sense -- otherwise those bright, shiny Vegas lights wouldn't be so shiny.
Note that the marginal effect calculation required us to take a stand on where we evaluate the derivative. In a linear model like OLS, the derivatives are just the coefficients, and those are constant. Here, the model is non-linear (because of the Normal CDF), so the derivative changes depending on where we evaluate it. Evaluating at the average is the standard choice, though with skewed data the median might be more reasonable.
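If you want to see how much this choice matters, get_margeff() accepts other evaluation points; for example, 'overall' averages the marginal effect across all observations while 'median' evaluates at the sample medians. A quick comparison (a sketch, reusing the res_probit results estimated above):
# Compare marginal effects of spread evaluated at different points in the data
for at in ['mean', 'overall', 'median']:
    me = res_probit.get_margeff(at=at)                          # recompute at a different point
    print(f"{at:8s} marginal effect of spread = {me.margeff[0]:.4f}")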
Let's take a look at the marginal effects at different points in the data. Note that the reported marginal effect above is located at the intersection of the marginal effects plot and the vertical dashed line indicating the average spread.
from scipy.stats import norm # import functions related to the normal distribution
y = norm.pdf(res_probit.fittedvalues,0,1)*res_probit.params.spread
fig, ax = plt.subplots(figsize=(15,6))
avg_spread = np.mean(vegas['spread'])
# Create the marginal effects
ax.scatter(vegas['spread'],y, color='black', label = 'marg. effects')
ax.set_ylabel('estimated marginal effect')
ax.set_xlabel('point spread')
ax.set_title('plotting marginal effects')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(x=avg_spread, color='red', linestyle='--')
ax.text(avg_spread+.5,0.035,'Average Spread',fontsize=14)
ax.set_ylim([-1e-3,0.04])
plt.show()
Let's look at the predicted values. In OLS this was easy. Here, things are (for some bizarre reason) more complicated -- for discrete models, fittedvalues returns the linear index $X\hat\beta$, so we have to run it through the standard Normal CDF ourselves.
pred_probit = norm.cdf(res_probit.fittedvalues,0,1) # Standard Normal (ie, mean = 0, stdev = 1)
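As a sanity check, the results object's predict() method applies the Normal CDF for us, so it should give the same numbers (a quick check, using the objects defined above):
pred_check = res_probit.predict(vegas)             # predicted probabilities straight from statsmodels
print(np.allclose(pred_check, pred_probit))        # expect True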
Plot the estimated probability of the favored team winning and the actual data.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], pred_probit, facecolors='none', edgecolors='red', label='predicted')
ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')
# Create the line of best fit to plot
p = res_ols.params # params from the OLS model linear probability model
x = range(0,35) # some x data
y = [p.Intercept + p.spread*i for i in x] # apply the coefficients
ax.plot(x,y, color='black', label = 'linear prob.')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from a probit model')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)
3. Logistic Regression (aka Logit) (top)
Our framework is actually pretty flexible so we can use different distributions. The other popular distributional assumption is to assume the $\epsilon$ errors come from a Logistic distribution. Why Logistic? Because the result is a nice simple function for the probability:
$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$
and we predict that a team wins whenever $\text{prob} \ge 0.5$. We estimate the logit model with the logit() method from smf, in a way similar to probit.
res_logit = smf.logit('favwin ~ spread', data=vegas).fit()
print(res_logit.summary())
Optimization terminated successfully.
Current function value: 0.477218
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: favwin No. Observations: 553
Model: Logit Df Residuals: 551
Method: MLE Df Model: 1
Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.1283
Time: 09:07:17 Log-Likelihood: -263.90
converged: True LL-Null: -302.75
Covariance Type: nonrobust LLR p-value: 1.201e-18
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.0712 0.173 -0.411 0.681 -0.411 0.268
spread 0.1632 0.023 7.236 0.000 0.119 0.207
==============================================================================
Again, interpreting logit coefficients is a bit more complicated. The probability that a team wins is given by the expression
$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$
so to get the marginal effects we hammer $X\hat\beta$ through this non-linear function. Let's take a look:
margeff = res_logit.get_margeff('mean')
print(margeff.summary())
Logit Marginal Effects
=====================================
Dep. Variable: favwin
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
spread 0.0244 0.003 9.059 0.000 0.019 0.030
==============================================================================
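For reference, the logit marginal effect behind the number above is the coefficient scaled by the logistic density at the index,
$$ \frac{\partial\, \text{prob}}{\partial x_k} = \Lambda(X\beta)\,\bigl(1-\Lambda(X\beta)\bigr)\,\beta_k, \qquad \Lambda(z)=\frac{\exp(z)}{1+\exp(z)}, $$
evaluated here at the average spread.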
Let's again plot the estimated probability of the favored team winning along with the actual data, but now let's compare the implications of our distributional assumptions. First, generate predicted values using numpy and the above expression for the probability.
pred_logit = np.exp(res_logit.fittedvalues) /( 1+np.exp(res_logit.fittedvalues) )
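If scipy is available, the same calculation can be done with the logistic sigmoid expit(), which is a bit more numerically stable for large indices:
from scipy.special import expit                  # expit(z) = exp(z) / (1 + exp(z))
pred_logit_alt = expit(res_logit.fittedvalues)
print(np.allclose(pred_logit, pred_logit_alt))   # expect True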
Now, plot probit vs logit:
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], pred_logit, facecolors='none', edgecolors='red', label='predicted-logit')
ax.scatter(vegas['spread'], pred_probit, facecolors='none', edgecolors='black', label='predicted-probit')
ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')
# Create the line of best fit to plot
p = res_ols.params # params from the OLS model linear probability model
x = range(0,35) # some x data
y = [p.Intercept + p.spread*i for i in x] # apply the coefficients
ax.plot(x,y, color='black', label = 'linear prob.')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from logit and probit models')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)
We observe that the probit and logit predictions lie nearly on top of each other. That's a common occurrence. In practice, the models are often interchangeable, and the practitioner will choose one over the other because, in their setting, one may have slightly better properties (e.g., a more intuitive interpretation of the marginal effects).
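A handy rule of thumb: because the logistic distribution has a larger variance than the standard Normal, logit slope coefficients are typically about 1.6 to 1.8 times the corresponding probit coefficients. A quick check with our estimates:
# Ratio of the logit slope to the probit slope (intercepts are both near zero, so skip them)
print(res_logit.params['spread'] / res_probit.params['spread'])   # roughly 1.76 here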
apples = pd.read_stata('./Data/apple.dta')
apples.head()
| | id | educ | date | state | regprc | ecoprc | inseason | hhsize | male | faminc | age | reglbs | ecolbs | numlt5 | num5_17 | num18_64 | numgt64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10002 | 16 | 111597 | SD | 1.19 | 1.19 | 1 | 4 | 0 | 45 | 43 | 2.0 | 2.000000 | 0 | 1 | 3 | 0 |
| 1 | 10004 | 16 | 121897 | KS | 0.59 | 0.79 | 0 | 1 | 0 | 65 | 37 | 0.0 | 2.000000 | 0 | 0 | 1 | 0 |
| 2 | 10034 | 18 | 111097 | MI | 0.59 | 0.99 | 1 | 3 | 0 | 65 | 44 | 0.0 | 2.666667 | 0 | 2 | 1 | 0 |
| 3 | 10035 | 12 | 111597 | TN | 0.89 | 1.09 | 1 | 2 | 1 | 55 | 55 | 3.0 | 0.000000 | 0 | 0 | 2 | 0 |
| 4 | 10039 | 15 | 122997 | NY | 0.89 | 1.09 | 0 | 1 | 1 | 25 | 22 | 0.0 | 3.000000 | 0 | 0 | 1 | 0 |
- Create a variable named ecobuy that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs > 0).
# this is only one way to do this...
apples['ecobuy'] = 0 # create the variable and default it to zero
apples.loc[apples['ecolbs']>0, 'ecobuy'] = 1 # set the variable = 1 when positive ecolbs
apples['ecobuy'].describe()
count    660.000000
mean       0.624242
std        0.484685
min        0.000000
25%        0.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: ecobuy, dtype: float64
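Another common idiom is to cast the boolean comparison directly to an integer; the column name ecobuy_alt below is hypothetical and only there to verify the two approaches agree.
apples['ecobuy_alt'] = (apples['ecolbs'] > 0).astype(int)    # True/False -> 1/0
print((apples['ecobuy_alt'] == apples['ecobuy']).all())      # expect True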
- Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics.
apple_res = smf.ols('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ecobuy R-squared: 0.110
Model: OLS Adj. R-squared: 0.102
Method: Least Squares F-statistic: 13.43
Date: Tue, 08 Nov 2022 Prob (F-statistic): 2.18e-14
Time: 09:07:17 Log-Likelihood: -419.60
No. Observations: 660 AIC: 853.2
Df Residuals: 653 BIC: 884.6
Df Model: 6
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.4237 0.165 2.568 0.010 0.100 0.748
ecoprc -0.8026 0.109 -7.336 0.000 -1.017 -0.588
regprc 0.7193 0.132 5.464 0.000 0.461 0.978
faminc 0.0006 0.001 1.042 0.298 -0.000 0.002
hhsize 0.0238 0.013 1.902 0.058 -0.001 0.048
educ 0.0248 0.008 2.960 0.003 0.008 0.041
age -0.0005 0.001 -0.401 0.689 -0.003 0.002
==============================================================================
Omnibus: 4015.360 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.344
Skew: -0.411 Prob(JB): 8.75e-16
Kurtosis: 1.641 Cond. No. 724.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- How many estimated probabilities are negative? How many are greater than one?
fitted = apple_res.fittedvalues # store the fitted values
fitted[(fitted>1) | (fitted<0)] # greater than 1 or less than zero
167    1.070860
493    1.054372
dtype: float64
val = ((fitted>1) | (fitted<0)).astype(float).mean()*100
print(f'Answer: {val:4.2f} percent of predicted probabilities are less than 0 or greater than 1.')
Answer: 0.30 percent of predicted probabilities are less than 0 or greater than 1.
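To answer the 'how many' part in counts rather than percentages (using the fitted values computed above):
n_above = (fitted > 1).sum()     # predictions above one
n_below = (fitted < 0).sum()     # negative predictions
print(n_above, n_below)          # 2 above one, none negative in this sample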
- Now estimate the model as a probit; i.e.,
$$ P(ecobuy=1 \mid X) = \Phi(X\beta), $$
where $\Phi( )$ is the CDF of the normal distribution.
apple_pres = smf.probit('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_pres.summary())
Optimization terminated successfully.
Current function value: 0.604599
Iterations 5
Probit Regression Results
==============================================================================
Dep. Variable: ecobuy No. Observations: 660
Model: Probit Df Residuals: 653
Method: MLE Df Model: 6
Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.08664
Time: 09:11:43 Log-Likelihood: -399.04
converged: True LL-Null: -436.89
Covariance Type: nonrobust LLR p-value: 2.751e-14
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.2438 0.474 -0.514 0.607 -1.173 0.685
ecoprc -2.2669 0.321 -7.052 0.000 -2.897 -1.637
regprc 2.0302 0.382 5.318 0.000 1.282 2.778
faminc 0.0014 0.002 0.932 0.351 -0.002 0.004
hhsize 0.0691 0.037 1.893 0.058 -0.002 0.141
educ 0.0714 0.024 2.939 0.003 0.024 0.119
age -0.0012 0.004 -0.340 0.734 -0.008 0.006
==============================================================================
apple_pres.get_margeff?
- Compute the marginal effects of the coefficients at the means and print them out using summary(). Interpret the results.
probit_marg = apple_pres.get_margeff(at='mean')
print(probit_marg.summary())
Probit Marginal Effects
=====================================
Dep. Variable: ecobuy
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ecoprc -0.8508 0.120 -7.087 0.000 -1.086 -0.615
regprc 0.7619 0.143 5.334 0.000 0.482 1.042
faminc 0.0005 0.001 0.932 0.351 -0.001 0.002
hhsize 0.0259 0.014 1.894 0.058 -0.001 0.053
educ 0.0268 0.009 2.941 0.003 0.009 0.045
age -0.0005 0.001 -0.340 0.734 -0.003 0.002
==============================================================================
- Re-estimate the model as a logit model.
apple_lres = smf.logit('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_lres.summary())
Optimization terminated successfully.
Current function value: 0.604746
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: ecobuy No. Observations: 660
Model: Logit Df Residuals: 653
Method: MLE Df Model: 6
Date: Sun, 06 Nov 2022 Pseudo R-squ.: 0.08642
Time: 09:34:06 Log-Likelihood: -399.13
converged: True LL-Null: -436.89
Covariance Type: nonrobust LLR p-value: 3.017e-14
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.4278 0.786 -0.544 0.586 -1.968 1.112
ecoprc -3.6773 0.533 -6.898 0.000 -4.722 -2.632
regprc 3.2742 0.630 5.196 0.000 2.039 4.509
faminc 0.0026 0.003 1.012 0.311 -0.002 0.008
hhsize 0.1145 0.061 1.878 0.060 -0.005 0.234
educ 0.1186 0.041 2.925 0.003 0.039 0.198
age -0.0022 0.006 -0.372 0.710 -0.014 0.009
==============================================================================
- Compute the marginal effects of the logit coefficients at the averages in the data.
logit_marg = apple_lres.get_margeff(at='mean')
print(logit_marg.summary())
Logit Marginal Effects
=====================================
Dep. Variable: ecobuy
Method: dydx
At: mean
==============================================================================
dy/dx std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ecoprc -0.8480 0.122 -6.972 0.000 -1.086 -0.610
regprc 0.7551 0.144 5.227 0.000 0.472 1.038
faminc 0.0006 0.001 1.012 0.311 -0.001 0.002
hhsize 0.0264 0.014 1.880 0.060 -0.001 0.054
educ 0.0273 0.009 2.931 0.003 0.009 0.046
age -0.0005 0.001 -0.372 0.710 -0.003 0.002
==============================================================================
We haven't done much data wrangling lately. I'm feeling a bit sad; I miss shaping data.
- Create a pandas DataFrame with the row index 'ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', and 'age'. The columns should be labeled 'logit', 'probit', and 'ols'. The columns should contain the marginal effects for the logit and probit models and the coefficients from the ols model.
params = pd.DataFrame({'logit':logit_marg.margeff, 'probit':probit_marg.margeff, 'ols':apple_res.params[1:]},
index = ['ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', 'age'],
)
params
| | logit | probit | ols |
|---|---|---|---|
| ecoprc | -0.848039 | -0.850754 | -0.802622 |
| regprc | 0.755077 | 0.761893 | 0.719268 |
| faminc | 0.000610 | 0.000543 | 0.000552 |
| hhsize | 0.026416 | 0.025947 | 0.023823 |
| educ | 0.027349 | 0.026784 | 0.024785 |
| age | -0.000503 | -0.000455 | -0.000501 |
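As a final sanity check, the probit and logit marginal effects sit close to the OLS coefficients for every regressor; dividing each column by the OLS column makes that easy to see.
# Each model's effect relative to the OLS coefficient (values near 1 mean close agreement)
print(params.div(params['ols'], axis=0).round(2))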