# Instrumental Variables

Instrumental variables are an incredibly powerful for dealing with unobserved heterogenity within the context of regression but the language used to define them is mind bending. Typically, you hear something along the lines of “an instrumental variable is a variable that is correlated with x but uncorrelated with the outcome except through x.” At this point examples are listed — taxes on smoking likely effect health only through their actions on smoking — or the author drops right into the math stats.

I like math stats (when I am not getting a grade for it at least!) and will work through it. But at some point, I want to play with a simulation of the process. However, books like Mostly Harmless Econometerics just throw a bunch of nonsense in terms of emperical studies at you. Sure, the OLS and IV estimates are different… but I don't know which is right for sure in an empirical setting. Simulations are great for this (the lack of simulations is one of my biggest issues with that book. Right after the use of Stata.).

I turned to Google and did several searches and the only simple simulation that I could find was done using Stata. At least it was a simulation… but I wanted more. Since I couldn't find any examples using R, I play around until I got it and then write up the simulation to see if it helps anyone.

## Overview

Suppose that you have a continuous variable $$y$$ with the known mean response function

$E(y) = \beta_0 + \beta_1 x + \beta_2 c$

and further that $$x$$ and $$c$$ are correlated with each other. If you knew $$c$$, estimating $$\beta_1$$ would be easy by simply running lm(y ~ x + c). You would complete the regression, throw together a few diagnostics and call it a day.

But often we don't observe $$c$$ in our data. $$c$$ could be any number of things, such as treatment practices at a hospital or unmeasured differences between patients, but it is in the direct casual path of $$y$$ and you don't know it. If you instead do the regression lm(y ~ x), you will not get an accurate estimate of $$\beta_0$$. If you think about the model fit in that call it looks like

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ but in this case $\epsilon_i = f(c_i, \eta_i)$ where $$\eta_i$$ is white noise centered on zero. As long as that function is linear, the expectation of $$\epsilon$$ will be some multiple of $$c$$. The result is that $$x$$ and $$\epsilon_i$$ are correlated. The estimates of $$\beta_1$$ will nessecarly be wrong. Interested parties can read more here and see just how wrong it will be.

Suppose there is a third variable $$z$$, $$x$$ can be expressed as some function of $$z$$, $$z$$ has no effect on $$y$$ except through $$x$$ (therefore $$z$$ is independent of any other variable that effects $$y$$). Then we have the equations

$E(x) = \alpha_0 + \alpha_1 x^* + \alpha_2 z$ and $E(y) = \beta_0 + beta_1 x + f(c)$

where $$x^*$$ is some latent part of $$x$$ and $$c$$ is still unobserved. Looking at the first equation and realizing we don't know what $$x^*$$ is the resulting model will have two regression coefficients. The intercept will be $$\alpha_0 + \alpha_1 E(x^*)$$ which is not correlated with $$f(c)$$ as and $$z$$ is independent of $$f(c)$$ by assumption.

If we take the fitted values of $$x$$ from the first equation and plug it into the second equation, the $$x$$ term is no independent of $$f(c)$$. This independence means that we can produce consistent estimates of $$\beta_1$$ when we replace $$x$$ with the fitted values of $$x$$ from equation one. When the estimator used is OLS, this method is called two-stage least squares (2SLS).

## Simulations

We need to start by generating values of $$x$$ and $$c$$ that are correlated. The easiest way to do this is to use the function mvrnorm() from MASS. Lets start by generating a 1000x2 random matrix with $$\rho = 0.5$$.

library(MASS)
# we are really generating x* and c and using a common variance
xStarAndC <- mvrnorm(1000, c(20, 15), matrix(c(1, 0.5, 0.5, 1), 2, 2))
xStar <- xStarAndC[, 1]
c <- xStarAndC[, 2]


and lets generate $$z$$ and make the observed $$x$$

z <- rnorm(1000)
x <- xStar + z


And now generate the response variable

# using 1 makes it easy to estimate how 'wrong' an estimator is and toss
# some noise on y
y <- 1 + x + c + rnorm(1000, 0, 0.5)


Lets check to make sure that everything worked so far:

cor(x, c)

## [1] 0.3027

cor(z, c)

## [1] -0.02993


We have a moderate correlation between $$x$$ and $$c$$ and it appears that $$z$$ and $$c$$ are uncorrelated. So far, so good.

Lets estimate the full model (the model that we would estimate if we knew $$c$$)

lm(y ~ x + c)

##
## Call:
## lm(formula = y ~ x + c)
##
## Coefficients:
## (Intercept)            x            c
##       1.716        0.987        0.970


Not bad 0.9868 is basically 1. Now what would happen if we ignored the fact that $$c$$ was unknown and estimated anyways?

lm(y ~ x)

##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x
##        12.0          1.2


Yikes! The complete model definitely works but when we ignore the heterogeneity caused by $$c$$ we get terrible estimates. The estimate was 120% of the true effect of $$x$$. That isn't good.

What happens when we break the correlation between the “x part” and the unobserved $$c$$ by using the instrument and 2SLS?

xHat <- lm(x ~ z)\$fitted.values
lm(y ~ xHat)

##
## Call:
## lm(formula = y ~ xHat)
##
## Coefficients:
## (Intercept)         xHat
##      16.895        0.954


That is a pretty good estimate, only off by 4.6% of the true value of 1. Pretty nifty that we are able to accurately estimate the effect of $$x$$ using the instrument despite the unobserved value of $$c$$.

Its worth noting now that if you want to do this in R, you probably do not want to do 2SLS as I have done it here. The parameter estimates are correct but the errors are not. Using the method I have done above, you get standard errors from the second equation residuals whereas you want them from the first. The package AER includes a function ivreg() which will take care of this for you.

If you look at the equations, two things probably jump out at you. First, if $$x$$ does a poor job predicting $$x$$ (i.e., $$z$$ is a weak instrument), then it is unlikely that IV and 2SLS will provide you with meaningful results. In fact, it is possible that the IV estimates will be more inaccurate than the naive OLS estimator. A rule of thumb is the first stage (i.e., the regression of x on the instrument and any covariates) should have an F statistic of at least 10.

Second, if $$x$$ and $$c$$ are only weakly correlated, the OLS and IV estimates should converge. As $$x$$ and $$c$$ become more correlated, however, the estimates move apart quickly and with even mild correlation ($$\rho \sim 0.2$$) there can be significant differences in the estimates.

The dynamic graph below allows you to explore the effects of altering these parameters on the resulting OLS and IV estimates. The solid black line is the true value, the green line is the estimate using the full model (if you knew $$c$$), red is the IV estimate and blue is using the OLS estimator.