Single Variable Regression

In this section, we will review single variable regression as taught in Data 8 and consider ways of quantifying the error of our line of best fit. We will also use data from the March 2012 Current Population Survey and the NLSY79 dataset to practice fitting some single variable regressions.

Regression in Data 8

Suppose we have some data that look like this:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise

plt.figure(figsize=(8,6))
plt.scatter(x, y)
plt.xlabel('$x$')
plt.ylabel('$y$');
[Figure: scatter plot of the simulated data, y against x]

In Data 8 we learned that \(x\) and \(y\) above have some correlation coefficient \(r\), which is a measure of the strength of the linear relationship between the two variables. The correlation coefficient is calculated as

\[ r = \frac{1}{n}\sum^n_{i=1} \left ( \frac{x_i - \hat{\mu}_x}{\hat{\sigma}_x} \right ) \left ( \frac{y_i - \hat{\mu}_y}{\hat{\sigma}_y} \right ) \]

where \(n\) is the number of observations and \(\hat{\mu}_x\) and \(\hat{\sigma}_x\) are the observed mean and standard deviation of a variable \(x\).

It looks like there is some positive linear association between \(x\) and \(y\) such that larger values of \(x\) correspond to larger values of \(y\). We therefore expect \(r\) to be some positive number between 0 and 1, but not exactly 0 or exactly 1.

First, let’s convert the data (stored in the arrays x and y) to standard units. To convert a set of data points to standard units, we subtract out the mean of the data and scale by the standard deviation. This has the effect of changing the data so that the data in standard units have mean 0 and standard deviation 1. Below we construct a function that does this.

\[ x_{su} = \dfrac{x - \hat{\mu}_x}{\hat{\sigma}_x} \]

def standard_units(array):
    # Subtract the mean, then divide by the standard deviation
    return (array - np.mean(array)) / np.std(array)

x_standard = standard_units(x)
y_standard = standard_units(y)
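A minimal sketch of the plotting code for the standardized data (reusing matplotlib as above):

plt.figure(figsize=(8,6))
plt.scatter(x_standard, y_standard)
plt.xlabel('$x$ (standard units)')
plt.ylabel('$y$ (standard units)');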
[Figure: scatter plot of y against x, both in standard units]

The plot looks the same as before, except now the axes are scaled such that we measure \(x\) and \(y\) in standard units. Now recall that \(r\) is calculated as the average of the product of two variables, when the variables are measured in standard units. Below we define a function that calculates \(r\), assuming that the inputs have already been converted to standard units.

def correlation(x, y):
    # Assumes x and y have already been converted to standard units
    return np.mean(x * y)

What is the correlation between these two variables?

correlation(x_standard, y_standard)
0.7351364076052979
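As a quick sanity check, numpy's built-in np.corrcoef computes the same quantity; it returns a 2-by-2 correlation matrix whose off-diagonal entry should match the value above.

np.corrcoef(x, y)[0, 1]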

Recall from Data 8 that we use \(r\) to form a line called the regression line, which makes predictions for \(y\) given some \(x\). Our prediction for \(y\) in standard units is \(r \cdot x\). If we want to fit this regression line in the original units, recall that the slope of this line is given by

\[ \text{slope} = r \cdot \dfrac{\hat{\sigma}_y}{\hat{\sigma}_x} \]

and the intercept is given by

\[ \text{intercept} = \hat{\mu}_y - \text{slope} \cdot \hat{\mu}_x \]

Our regression line will have the form

\[ y = \hat{\alpha} + \hat{\beta} x \]

where \(\hat{\alpha}\) is the intercept from above and \(\hat{\beta}\) the slope.

Below we plot this line.

r = correlation(x_standard, y_standard)
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
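The plot itself can be produced with a sketch like the following, overlaying the fitted line on the original scatter:

plt.figure(figsize=(8,6))
plt.scatter(x, y)
xs = np.array([x.min(), x.max()])
plt.plot(xs, intercept + slope * xs, color='tab:red', label='regression line')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend();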
[Figure: scatter plot with the fitted regression line overlaid]

Let’s take a closer look at the slope we found.

Slope:  1.3160907091507876

To generate the data above, we started with some range of \(x\) values, and generated \(y\) as a linear function of \(x\) with some random noise added in. Take a look:

np.random.seed(42)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise

Notice how we defined \(y\):

\[ y = 1.5 \cdot x + u \]

where \(u\) is some normally-distributed random noise with mean 0. So, while there is some randomness in the data, the “true” slope of the relationship is 1.5. Yet we estimated it to be roughly 1.3!

This highlights the following fact: suppose we have some random data that we believe have a linear relationship. The least-squares slope we compute from the data is an estimate of the “true” slope of that relationship. Because of this, the estimated slope is a random variable that depends on the data we happen to have.

To highlight this fact, let’s repeat the procedure above but with a different random seed, in order to get data with the same underlying relationship but different values.

np.random.seed(189)
x = np.random.uniform(0, 10, 100)
noise = np.random.randn(100) * 4
y = 1.5 * x + noise

# Recompute standard units for the new sample before computing r
x_standard = standard_units(x)
y_standard = standard_units(y)

r = correlation(x_standard, y_standard)
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

print('Slope: ', slope)
Slope:  1.5590151265266388
[Figure: new sample of data with its fitted regression line]

Now the estimated slope is roughly 1.6, even though the underlying data was still generated using a slope of 1.5. This is a very important concept that we will revisit soon.
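To see this sampling variability more systematically, we can simulate many datasets from the same model and record the fitted slope each time. The sketch below uses 1,000 replications (an arbitrary choice); the estimates should center near the true slope of 1.5, with some spread.

estimated_slopes = np.array([])

for _ in np.arange(1000):
    x_sim = np.random.uniform(0, 10, 100)
    y_sim = 1.5 * x_sim + np.random.randn(100) * 4
    r_sim = correlation(standard_units(x_sim), standard_units(y_sim))
    estimated_slopes = np.append(estimated_slopes, r_sim * np.std(y_sim) / np.std(x_sim))

# The average of the estimates should be close to 1.5
print('Mean of estimated slopes: ', np.mean(estimated_slopes))
print('SD of estimated slopes:   ', np.std(estimated_slopes))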

Keep in mind, however, that correlation does not imply causation. In this example we know the true causal relationship between \(x\) and \(y\) because we defined it ourselves. With real data, however, you never observe the “true” relationship and thus cannot conclude causality from correlation alone. It could simply be that both of your variables depend on an unseen third variable and have no causal effect on one another. Or, even worse (though unlikely), a slight linear trend between two variables could be a complete coincidence.

Root-Mean-Squared Error

While we could pick any values for \(\hat{\alpha}\) and \(\hat{\beta}\), we want the values whose predictions \(\hat{y}\) are as close as possible to the actual \(y\) values. To achieve this, we minimize a loss function that quantifies how far our predictions \(\hat{y}\) are from \(y\) for a set of known data points. One of the most common loss functions is the root-mean-squared error.

Definition

The root-mean-squared error is a common loss function used to quantify the error of a prediction. It is defined as

\[ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - \hat{y}_i \right ) ^2 } \]

where \(n\) is the number of observations. The effect is to average the squared differences between each prediction \(\hat{y}_i\) and its corresponding observed value \(y_i\); squaring keeps the errors positive so that they do not cancel out, and taking the square root puts the error back into the units of \(y\).

Plugging the formula for \(\hat{y}_i\), namely \(\hat{\alpha} + \hat{\beta} x_i\), into the RMSE, we get

\[ \text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left ( y_i - (\hat{\alpha} + \hat{\beta}x_i) \right ) ^2 } \]

Minimizing this expression with a bit of calculus, we get the following formulas for \(\hat{\alpha}\) and \(\hat{\beta}\):

\[ \hat{\beta} = r\frac {\hat{\sigma}_y} {\hat{\sigma}_x} \qquad \qquad \hat{\alpha} = \hat{\mu}_y - \hat{\beta}\hat{\mu}_x \]

where \(r\) is the correlation between \(x\) and \(y\), \(\hat{\sigma}_y\) is the standard deviation of \(y\), \(\hat{\sigma}_x\) is the standard deviation of \(x\), \(\hat{\mu}_y\) is the average of all our \(y\) values, and \(\hat{\mu}_x\) is the average of all our \(x\) values. (As an aside, note the hats on our \(\sigma\)’s and \(\mu\)’s; this is because these are empirical estimates of the parameters of these distributions, rather than the true values.) These are the same values we had above!

To calculate the RMSE, we will write an rmse function that makes use of sklearn’s mean_squared_error function.

from sklearn.metrics import mean_squared_error

def rmse(target, pred):
    # Square root of the average squared difference between targets and predictions
    return np.sqrt(mean_squared_error(target, pred))

To get a better idea of what the RMSE represents, the figures below show a small dataset, a proposed regression line, and the squared error that we are summing in the RMSE. The data points are

\(x\)    \(y\)
0        1
1        0.5
2        -1
3        2
4        -3

Here are the proposed regression lines and their errors:
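For a concrete check, we can compute the RMSE of two candidate lines on these five points using the rmse function above. The first line (intercept 1, slope -0.5) is an arbitrary illustrative choice; the second is the least-squares line for these points, which achieves the smallest possible RMSE.

small_x = np.array([0, 1, 2, 3, 4])
small_y = np.array([1, 0.5, -1, 2, -3])

# An arbitrary candidate line: intercept 1, slope -0.5
print(rmse(small_y, 1 - 0.5 * small_x))      # 1.5

# The least-squares line for these points: intercept 1.2, slope -0.65
print(rmse(small_y, 1.2 - 0.65 * small_x))   # roughly 1.48, lower than any other line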

Econometric Single Variable Regression

The regression line can have two purposes:

  • Of particular interest to data scientists is the line’s ability to predict values of \(y\) for new values of \(x\) that we didn’t see before.

  • Of particular interest to economists is the line’s ability to estimate the “true” underlying slope of the data via its slope.

This is why regression is such a powerful tool and forms the backbone of econometrics. If we believe that our data satisfy certain assumptions (which we won’t explore too much this lecture), then we can use the slope of the regression line to estimate the “true” relation between the variables in question and learn more about the world we live in.

In econometrics, we usually write the “true” underlying linear relationship as follows:

\[ y = \alpha + \beta \cdot x + \varepsilon \]

where \(y\) and \(x\) are values for any arbitrary point, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\varepsilon\) is some noise. This is entirely analogous to the code from earlier that determined the true linear relationship in the simulated data:

y = 1.5 * x + noise

Here, \(\beta = 1.5\), \(\alpha = 0\), and \(\varepsilon = \text{noise}\).

When we fit a regression line onto the data, we express the line as:

\[ \hat{y} = \hat{\alpha} + \hat{\beta} \cdot x \]

Here, we put hats over the slope and intercept terms because they are estimates of the true slope and intercept terms. Similarly, we put a hat over \(y\) because this is the \(y\) value that the regression line predicts.

Notice how the noise term \(\varepsilon\) does not appear in the expression for the regression line. This is because the noise term is a random variable that has no relation with \(x\), and is thus impossible to predict from the data. Furthermore, the noise term has a mean value of 0, so on average we actually don’t expect the noise term to have any impact on the underlying trends of the data.
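We can see this with the simulated data from above. Although we never observe \(\varepsilon\) itself, the residuals \(y - \hat{y}\) act as estimates of the noise, and for a least-squares fit they average to exactly zero by construction of the intercept. The sketch below reuses the x, y, slope, and intercept from the second simulated sample.

residuals = y - (intercept + slope * x)

# The least-squares intercept forces the residuals to have mean zero
print('Mean of residuals: ', np.mean(residuals))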

For the Data 8 demonstration above, we forced these conditions to be true. With real data, however, these are assumptions that we have to make, and they are something econometricians spend a lot of time thinking about.

Years of Schooling and Earnings

Consider a case where we want to study how years of schooling relate to a person’s earnings. This should be of particular interest to college students. Below we import a dataset that has the hourly wage, years of schooling, and other information on thousands of people sampled in the March 2012 Current Population Survey.

from datascience import Table

cps = Table.read_table('cps.csv')
cps
state age wagesal imm hispanic black asian educ wage logwage female fedwkr statewkr localwkr
11 44 18000 0 0 0 0 14 9.10931 2.2093 1 1 0 0
11 39 18000 0 0 0 0 14 18 2.89037 0 0 0 0
11 39 35600 0 0 0 0 12 17.1154 2.83998 0 0 0 1
11 39 8000 0 0 0 0 14 5.12821 1.63476 1 0 0 0
11 39 100000 0 0 0 0 16 38.4615 3.64966 0 1 0 0
11 43 25000 0 0 0 0 12 10 2.30259 0 0 0 0
11 38 25000 0 0 0 0 16 27.1739 3.30226 1 0 0 0
11 39 26000 0 0 0 0 13 16.6667 2.81341 1 0 0 0
11 39 52000 0 0 0 0 16 16.6667 2.81341 0 0 0 0
11 37 4500 0 0 0 0 13 4 1.38629 1 0 0 0

... (21897 rows omitted)

We want to consider a person’s wage and years of schooling. But first…

Wage and Similar Variables

Wage is an interesting variable because it is something that tends to vary by a proportion rather than an absolute amount. To highlight this fact, let’s actually consider how the variable GDP behaves. GDP tends to grow by a certain percent each year; no one is particularly interested in how much GDP changes from one year to the next, or from one country to another, but rather by what percent it changes.

If you recall from the finance chapter, this behaves similarly to compound interest. If you were to plot GDP over time for a country, it might look something like this:

from datascience import make_array

GDPs = make_array(100)   # start GDP at 100

# Grow GDP by 5 percent each year for 99 more years
for _ in np.arange(99):
    GDPs = np.append(GDPs, GDPs.item(-1) * 1.05)

plt.figure(figsize=(8,6))
plt.plot(np.arange(100), GDPs)
plt.xlabel('Years')
plt.ylabel('GDP')
plt.title('GDP Over Time');
[Figure: GDP Over Time — exponential growth curve]

The starting value for GDP is 100, and GDP grows by 5 percent each year. Look at how much GDP grows in the later years!

While this phenomenon is impressive, it is misleading. On the surface, it seems to imply that something different happened around year 50 that caused GDP to increase considerably in subsequent years. We know that this isn’t true, and that this is just a consequence of exponential growth.

To counter this effect, we take the natural log of variables that tend to vary from one observation to the next by proportions rather than by absolute amounts. Let’s do that for GDP:

ln_GDPs = np.log(GDPs)

plt.figure(figsize=(8,6))
plt.plot(np.arange(100), ln_GDPs)
plt.xlabel('Years')
plt.ylabel('Log GDP')
plt.title('Log GDP Over Time');
[Figure: Log GDP Over Time — a straight line]

We’ve now uncovered a linear relationship between years and log GDP! You can interpret the slope of this line as the approximate percent change in GDP for a one-year increase. To verify:

print('Slope between years 0 and 1: ', ln_GDPs[1] - ln_GDPs[0])
Slope between years 0 and 1:  0.04879016416943127
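Indeed, because consecutive GDP values differ by a factor of exactly 1.05, each one-year difference in log GDP equals \(\ln(1.05) \approx 0.0488\), which is approximately the 5 percent growth rate:

print('ln(1.05): ', np.log(1.05))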

It turns out that wage is another one of these variables, and so when we do studies on wage we usually take the natural log of wage instead. Below we plot log wage and years of schooling for the CPS data.

[Figure: scatter plot of log wage against years of schooling]

Now let’s fit a least-squares regression line onto this data.

# Pull the relevant columns out of the CPS table as arrays
educ = cps.column('educ')
logwage = cps.column('logwage')

educ_standard = standard_units(educ)
logwage_standard = standard_units(logwage)

r = correlation(logwage_standard, educ_standard)
slope = r * np.std(logwage) / np.std(educ)
intercept = np.mean(logwage) - slope * np.mean(educ)

print('Slope: ', slope)
print('Intercept: ', intercept)
Slope:  0.10781133924799272
Intercept:  1.472287673006932
[Figure: log wage vs. years of schooling with the fitted regression line]
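As a cross-check, numpy's polyfit performs the same least-squares fit directly and should return essentially the same estimates (slope first, then intercept):

np.polyfit(educ, logwage, 1)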

So from the very simple and straightforward model above, we estimate a slope of roughly 0.1, meaning we might expect a one-year increase in schooling to be associated with roughly a 10% increase in wage, on average.

We can also see that we have a non-zero intercept term. We should be careful about how we interpret this term: from a strictly mathematical point of view, the intercept represents the expected value of \(y\) (in this case log wage) when \(x = 0\). However, in economics it sometimes makes no sense for \(x\) to be 0, in which case we cannot use that interpretation. We won’t go into detail this lecture, but regardless of whether the intercept is interpretable, we almost always want to include it.

Uncertainty in \(\hat{\beta}\)

We mentioned earlier that the slope we estimate from regression is exactly that: an estimate of the “true” underlying slope. Because of this, the estimate \(\hat{\beta}\) is a random variable that depends on the underlying data.

Let’s assume there is the following true linear relation between log wage and years of schooling,

\[ \text{log wage} = \alpha + \beta \cdot \text{years of schooling} + \varepsilon \]

and we try to estimate \(\alpha\) and \(\beta\).

If our data are “well-behaved”, then even though there is uncertainty in our estimate \(\hat{\beta}\), on average \(\hat{\beta}\) will be \(\beta\); that is to say that the expectation of \(\hat{\beta}\) is \(\beta\). Additionally, if our data are “well-behaved”, then \(\hat{\beta}\) has some normal distribution with mean \(\beta\). We won’t worry too much about what assumptions need to be satisfied to make the data “well-behaved”.

You can think of each person as an observation of these variables, and using a sample of people we can estimate the relationship between the two variables. However, due to the noise term and the fact that we only have a finite sample of people, the true relationship is always hidden from us, and we can only hope to get better estimates by designing better experiments and sampling more people.

Let’s try to get an idea of how “certain” we can be of our estimate \(\hat{\beta}\). We’ll do this in classic Data 8 style: bootstrapping. Using our existing sample data, we’ll create new samples by bootstrapping from the existing data. Then, for each sample, we’ll fit a line, and keep the slope of that line in a list with all of the other slopes. Then, we’ll find the standard deviation of that list of slopes.
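A minimal sketch of this bootstrap, written against the educ and logwage arrays from above, might look like the following. The number of resamples is an arbitrary choice, and the resulting array slopes is reused in the confidence-interval code below.

n = len(educ)
slopes = np.array([])

for _ in np.arange(1000):
    # Resample rows with replacement
    idx = np.random.choice(np.arange(n), size=n, replace=True)
    educ_boot = educ[idx]
    logwage_boot = logwage[idx]
    r_boot = correlation(standard_units(logwage_boot), standard_units(educ_boot))
    slopes = np.append(slopes, r_boot * np.std(logwage_boot) / np.std(educ_boot))

print('Standard dev. of bootstrapped slopes: ', np.std(slopes))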

Standard dev. of bootstrapped slopes:  0.0015948419317821595
[Figure: histogram of the bootstrapped slopes]

Our bootstrapped approximation of the standard error, 0.00159, is pretty close to the true standard error of 0.00144. statsmodels, the package we will be using to perform regressions, uses an exact mathematical formula to find the standard error, whereas we approximated it through simulation, but the idea behind the standard error is the same.

Armed with a standard error, we can now form a 95% confidence interval and perform a test of significance to see if \(\hat{\beta}\) is significantly different from 0.

from datascience import percentile

# Using our resampled slopes
lower_bound = percentile(2.5, slopes)
upper_bound = percentile(97.5, slopes)
print('95% confidence interval: [{}, {}]'.format(lower_bound, upper_bound))
95% confidence interval: [0.10456031293884147, 0.11093735589830406]

The 95% confidence interval does not contain 0, and so \(\beta\) is unlikely to be 0.
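Equivalently, we can form an approximate test statistic by dividing the estimated slope by its bootstrapped standard error; a value much larger than about 2 in absolute value indicates that the estimate is significantly different from 0 at the 5% level. This is a rough sketch reusing slope and slopes from above.

# Estimate divided by its (bootstrapped) standard error
z = slope / np.std(slopes)
print('Approximate test statistic: ', z)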

Regression with a Binary Variable

A binary variable is a variable that takes on the value of 1 if some condition is true, and 0 otherwise. These are also called dummy variables or indicator variables. It might sound strange at first, but you can actually perform regression of a variable like log earnings onto a binary variable.

Let’s import a different dataset, drawn from the NLSY79, with the following features. Some will be useful to us later.

nlsy = Table.read_table('nlsy_cleaned_small.csv')
nlsy
male AFQT real_earnings_1999 weeks_worked_1999 log_earn_1999 college mother_college father_college
1 99.393 52354.4 52 10.8658 1 0 0
1 47.412 32721.5 52 10.3958 0 0 0
0 44.022 35862.8 52 10.4875 0 0 0
1 59.683 68060.7 52 11.1282 0 0 0
1 72.313 78531.6 52 11.2713 1 0 1
0 98.798 62825.3 52 11.0481 1 0 0
0 50.283 49736.7 52 10.8145 0 0 0
1 89.669 62825.3 52 11.0481 0 0 0
1 95.977 259311 52 12.4658 0 0 1
0 67.021 68060.7 52 11.1282 1 0 1

... (5403 rows omitted)

Now let’s visualize log earnings vs. the binary variable corresponding to whether or not an observation went to college.

[Figures: log earnings plotted against the college indicator]

Now let’s fit a regression model:

# Pull the relevant columns out of the NLSY table as arrays
coll = nlsy.column('college')
logearn = nlsy.column('log_earn_1999')

coll_standard = standard_units(coll)
logearn_standard = standard_units(logearn)

r = correlation(logearn_standard, coll_standard)
slope = r * np.std(logearn) / np.std(coll)
intercept = np.mean(logearn) - slope * np.mean(coll)

print("y = {:.5f} * x + {:.5f}".format(slope, intercept))
y = 0.72389 * x + 10.27883

Wow! This regression implies that, on average, observations who went to college have log earnings about 0.72 higher than those who did not, which we can loosely read as roughly 70% higher earnings (keep in mind that the percent reading of log differences is only an approximation, and a rough one for differences this large). Let’s now plot this line on the data:
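A sketch of how this line might be drawn over the data, reusing the slope and intercept just computed (the plotting choices here are ours):

plt.figure(figsize=(8,6))
plt.scatter(coll, logearn, alpha=0.1)
xs = np.array([0, 1])
plt.plot(xs, intercept + slope * xs, color='tab:red', label='regression line')
plt.xlabel('College')
plt.ylabel('Log earnings')
plt.legend();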

[Figures: log earnings vs. the college indicator with the fitted regression line overlaid]

When we perform a simple regression onto just a dummy variable, it is an important fact that \(\hat{\alpha}\) is the mean value of \(y\) for all observations in the sample where \(x = 0\), and \(\hat{\beta}\) is the difference between the mean value of \(y\) for observations in the sample where \(x = 1\) and observations where \(x = 0\). Proving this claim is beyond our scope this week, but let’s verify it with our data:

avg_logearn_coll = np.mean(logearn[coll == 1])
avg_logearn_nocoll = np.mean(logearn[coll == 0])

print('Avg logearn for coll = 1: ', avg_logearn_coll)
print('Avg logearn for coll = 0: ', avg_logearn_nocoll)
print('Difference between the two: ', avg_logearn_coll - avg_logearn_nocoll)
Avg logearn for coll = 1:  11.002722859240283
Avg logearn for coll = 0:  10.278831734712744
Difference between the two:  0.7238911245275386
print('Intercept: ', intercept)
print('Slope: ', slope)
Intercept:  10.278831734712744
Slope:  0.72389112452754