{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using `statsmodels` for Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "from datascience import *\n",
    "import numpy as np\n",
    "import statsmodels.api as sm\n",
    "\n",
    "cps = Table.read_table('cps.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous section, we used functions in NumPy and concepts taught in Data 8 to perform single variable regressions. It turns out that there are (several) Python packages that can perform these regressions for us and which extend nicely into the types of regressions we will cover in the next few sections. In this section, we introduce `statsmodels` for performing single variable regressions, a foundation upon which we will build our discussion of multivariable regression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`statsmodels` is a popular Python package used to create and analyze various statistical models. To create a linear regression model in `statsmodels`, which is generally import as `sm`, we use the following skeleton code:\n",
    "\n",
    "```python\n",
    "x = data.select(features).values            # Separate features (independent variables) \n",
    "y = data.select(target).values              # Separate target (outcome variable)\n",
    "model = sm.OLS(y, sm.add_constant(x))       # Initialize the OLS regression model\n",
    "result = model.fit()                        # Fit the regression model and save it to a variable\n",
    "result.summary()                            # Display a summary of results\n",
    "```\n",
    "\n",
    "*You must manually add a constant column of all 1's to your independent features. `statsmodels` will not do this for you and if you fail to do this you will perform a regression without an intercept $\\alpha$ term. This is performed in the third line by calling `sm.add_constant` on `x`.* Also note that we call `.values` after we select the columns in `x` and `y`; this gives us `NumPy` arrays containing the corresponding values, since `statsmodels` can't process `Table`s.\n",
    "\n",
    "Recall the `cps` dataset we used in the previous section:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "    <thead>\n",
       "        <tr>\n",
       "            <th>state</th> <th>age</th> <th>wagesal</th> <th>imm</th> <th>hispanic</th> <th>black</th> <th>asian</th> <th>educ</th> <th>wage</th> <th>logwage</th> <th>female</th> <th>fedwkr</th> <th>statewkr</th> <th>localwkr</th>\n",
       "        </tr>\n",
       "    </thead>\n",
       "    <tbody>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>44  </td> <td>18000  </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>14  </td> <td>9.10931</td> <td>2.2093 </td> <td>1     </td> <td>1     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>39  </td> <td>18000  </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>14  </td> <td>18     </td> <td>2.89037</td> <td>0     </td> <td>0     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>39  </td> <td>35600  </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>12  </td> <td>17.1154</td> <td>2.83998</td> <td>0     </td> <td>0     </td> <td>0       </td> <td>1       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>39  </td> <td>8000   </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>14  </td> <td>5.12821</td> <td>1.63476</td> <td>1     </td> <td>0     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>39  </td> <td>100000 </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>16  </td> <td>38.4615</td> <td>3.64966</td> <td>0     </td> <td>1     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>43  </td> <td>25000  </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>12  </td> <td>10     </td> <td>2.30259</td> <td>0     </td> <td>0     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>38  </td> <td>25000  </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>16  </td> <td>27.1739</td> <td>3.30226</td> <td>1     </td> <td>0     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>39  </td> <td>26000  </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>13  </td> <td>16.6667</td> <td>2.81341</td> <td>1     </td> <td>0     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>39  </td> <td>52000  </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>16  </td> <td>16.6667</td> <td>2.81341</td> <td>0     </td> <td>0     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "        <tr>\n",
       "            <td>11   </td> <td>37  </td> <td>4500   </td> <td>0   </td> <td>0       </td> <td>0    </td> <td>0    </td> <td>13  </td> <td>4      </td> <td>1.38629</td> <td>1     </td> <td>0     </td> <td>0       </td> <td>0       </td>\n",
       "        </tr>\n",
       "    </tbody>\n",
       "</table>\n",
       "<p>... (21897 rows omitted)</p>"
      ],
      "text/plain": [
       "state | age  | wagesal | imm  | hispanic | black | asian | educ | wage    | logwage | female | fedwkr | statewkr | localwkr\n",
       "11    | 44   | 18000   | 0    | 0        | 0     | 0     | 14   | 9.10931 | 2.2093  | 1      | 1      | 0        | 0\n",
       "11    | 39   | 18000   | 0    | 0        | 0     | 0     | 14   | 18      | 2.89037 | 0      | 0      | 0        | 0\n",
       "11    | 39   | 35600   | 0    | 0        | 0     | 0     | 12   | 17.1154 | 2.83998 | 0      | 0      | 0        | 1\n",
       "11    | 39   | 8000    | 0    | 0        | 0     | 0     | 14   | 5.12821 | 1.63476 | 1      | 0      | 0        | 0\n",
       "11    | 39   | 100000  | 0    | 0        | 0     | 0     | 16   | 38.4615 | 3.64966 | 0      | 1      | 0        | 0\n",
       "11    | 43   | 25000   | 0    | 0        | 0     | 0     | 12   | 10      | 2.30259 | 0      | 0      | 0        | 0\n",
       "11    | 38   | 25000   | 0    | 0        | 0     | 0     | 16   | 27.1739 | 3.30226 | 1      | 0      | 0        | 0\n",
       "11    | 39   | 26000   | 0    | 0        | 0     | 0     | 13   | 16.6667 | 2.81341 | 1      | 0      | 0        | 0\n",
       "11    | 39   | 52000   | 0    | 0        | 0     | 0     | 16   | 16.6667 | 2.81341 | 0      | 0      | 0        | 0\n",
       "11    | 37   | 4500    | 0    | 0        | 0     | 0     | 13   | 4       | 1.38629 | 1      | 0      | 0        | 0\n",
       "... (21897 rows omitted)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `statsmodels` to perform our regression of `logwage` on `educ` again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>            <td>y</td>        <th>  R-squared:         </th> <td>   0.204</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.204</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   5600.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Wed, 10 Jun 2020</td> <th>  Prob (F-statistic):</th>  <td>  0.00</td>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>10:43:19</td>     <th>  Log-Likelihood:    </th> <td> -20513.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td> 21907</td>      <th>  AIC:               </th> <td>4.103e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td> 21905</td>      <th>  BIC:               </th> <td>4.105e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "    <td></td>       <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>const</th> <td>    1.4723</td> <td>    0.021</td> <td>   71.483</td> <td> 0.000</td> <td>    1.432</td> <td>    1.513</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>x1</th>    <td>    0.1078</td> <td>    0.001</td> <td>   74.831</td> <td> 0.000</td> <td>    0.105</td> <td>    0.111</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>989.972</td> <th>  Durbin-Watson:     </th> <td>   1.873</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>2802.765</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.201</td>  <th>  Prob(JB):          </th> <td>    0.00</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 4.706</td>  <th>  Cond. No.          </th> <td>    70.9</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:                      y   R-squared:                       0.204\n",
       "Model:                            OLS   Adj. R-squared:                  0.204\n",
       "Method:                 Least Squares   F-statistic:                     5600.\n",
       "Date:                Wed, 10 Jun 2020   Prob (F-statistic):               0.00\n",
       "Time:                        10:43:19   Log-Likelihood:                -20513.\n",
       "No. Observations:               21907   AIC:                         4.103e+04\n",
       "Df Residuals:                   21905   BIC:                         4.105e+04\n",
       "Df Model:                           1                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "const          1.4723      0.021     71.483      0.000       1.432       1.513\n",
       "x1             0.1078      0.001     74.831      0.000       0.105       0.111\n",
       "==============================================================================\n",
       "Omnibus:                      989.972   Durbin-Watson:                   1.873\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2802.765\n",
       "Skew:                           0.201   Prob(JB):                         0.00\n",
       "Kurtosis:                       4.706   Cond. No.                         70.9\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = cps.select(\"educ\").values\n",
    "y = cps.select(\"logwage\").values\n",
    "\n",
    "model = sm.OLS(y, sm.add_constant(x))\n",
    "results = model.fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The summary above provides us with a lot of information. Let's start with the most important pieces: the values of $\\hat{\\alpha}$ and $\\hat{\\beta}$. The middle table contains these values for us as `const` and `x1`'s `coef` values: we have $\\hat{\\alpha}$ is 1.4723 and $\\hat{\\beta}$ is 0.1078.\n",
    "\n",
    "Recall also our discussion of uncertainty in $\\hat{\\beta}$. `statsmodels` provides us with our calculated standard error in the `std err` column, and we see that the standard error of $\\hat{\\beta}$ is 0.001, matching our empirical estimate via bootstrapping from the last section. We can also see the 95% confidence interval that we calculated in the two rightmost columns.\n",
    "\n",
    "![](statsmodels-coeffs.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Earlier we said that $\\hat{\\beta}$ has some normal distribution with mean $\\beta$ if certain assumptions are satisfied. We now can see that the standard deviation of that normal distribution is the standard error of $\\hat{\\beta}$. We can also use this to test a null hypothesis that $\\beta = 0$. To do so, construct a [t-statistic](https://en.wikipedia.org/wiki/T-statistic) (which `statsmodels` does for you) that indicates how many standard deviations away $\\hat{\\beta}$ is from 0, assuming that the distribution of $\\hat{\\beta}$ is in fact centered at 0.\n",
    "\n",
    "We can see that $\\hat{\\beta}$ is 74 standard deviations away from the null hypothesis mean of 0, which is an enormous number. How likely do you think it is to draw a random number roughly 74 standard deviations away from the mean, assuming a standard normal distribution? Essentially 0. This is strong evidence that the mean of the distribution (the mean of $\\hat{\\beta}$ is the true value $\\beta$) is not 0. Accompanying the t-statistic is a p-value that indicates the statistical significance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " "
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}