Parallel Simulation of Heckman Selection Model

Parallel Simulation of Heckman Selection Model

One of the, if not the, fundamental problems in observational data analysis is the estimation of the value of the unobserved choice. If the \(i^{\text{th}}\) unit chooses the value of \(t\) on the basis of some factors \(\mathbf{x_i}\), which may include the \(E(u_i(t))\) for that unit, comparing the outcome \(\mathbf{y}\) on a set where \(t = 1\) and a set where \(t = 0\) is unlikely to provide valid inference.

Randomized controlled trials get rid of this problem by setting the \(Pr(t_i = 1)\) to be equal and independent of the values of \(x_i\) for all \(i\). This provides a good estimation of the value of \(t\).

But randomized controlled trials just aren’t reasonable for any questions, if they are even possible. And so observational data analysis is critical but it must first find some way to control for everything in \(\mathbf{x_i}\) while to compute an unbiased and valid estimate of \(E(u_i(t == 1)) - E(u_i(t == 0))\).

This can be straight-froward. It is easy to deal with outcome modifiers or selection modifiers with propensity scores or regression. However, both of these methods breakdown when faced with correlation between the error in the selection equation and the outcomes equation.

Heckman Model

The Heckman Model (aka Heckman correction/Heckit) treats this correlation between the errors as a form of mis-specification and omitted variable bias. If we could observe the behavioral factors that feed into the error terms, this would not be a problem. Alas, that is generally not possible.

What the Heckman Model suggests is that for each person we have

\[t_i^* = \mathbf{z_i} \alpha + \epsilon_i\]


\[y_{u, i}^* = \mathbf{x_i} \beta_u + \eta_{u, i}\] \[y_{t, i}^* = \mathbf{x_i} \beta_t + \eta_{t, i}\]

The first equation gives the likelihood of being given treatment \(t\), this should similar similar to propensity score models. The second set of equations denote the outcomes that would be realized if a patient were given \(t\). However, we can’t observe both outcomes on the same unit (what Rubin and Holland call the fundamental problem of casual inference).

Instead, unless we have randomized \(t\), we observe

\[t_i = \begin{cases} 0 \text{ if } \mathbf{z_i} \alpha + \epsilon_i < 0 \\ 1 \text{ if } \mathbf{z_i} \alpha + \epsilon_i \geq 0 \end{cases}\]

and from \(t_i\),

\[y_i = \begin{cases} y_{u, i}^* \text{ if } t_i = 0 \\ y_{t, u}^* \text{ if } t_i = 1 \end{cases}\]

The value \(y\) is the realized value and its dependence on \(t\) is clear.

Still, this isn’t a real problem. It can all be solved with regression or propensity scores or similar analysis. The problem comes in the error terms. Basically, often times problems present with a correlation between the error terms in the selection and outcome equations. We could write

\[\left(\begin{array}{c} \epsilon \\ \eta_u \\ \eta_t \\ \end{array} \right) = N\left( \left(\begin{array}{c}0\\ 0\\ 0\\ \end{array}\right), \left(\begin{array}{ccc} 1 & \rho_1 \sigma_1 & \rho_2 \sigma_2 \\ \rho_1 \sigma_1 & \sigma_2^2 &\rho_{12}\sigma_1 \sigma_2 \\ \rho_2 \sigma_2 & \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{array} \right) \right)\]

and whenever both \(\rho_1\) and \(\rho_2\) are not 0, there will be poor estimation of \(E(y_t^*) - E(y_u^*)\).

What the Heckman model advances in the idea that \(E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0)\) can be broken down into two parts. Rather simply, the expectation breaks down into the true expectation of the unobserved value of \(y_u^*\) plus the expected value of \(\eta_u\) conditional on \(t_i = 0\):

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = E(y_u^*) + E(\eta_u | t_i = 0)\]

Since \(t_i = 0\) if \(\mathbf{z_i} \alpha + \epsilon_i\), we can expand this to

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) + E(\eta_u | \mathbf{z_i} \alpha + \epsilon < 0)\]

Looking back towards the top, we see that \(E(y_u^*) = \mathbf{x_i} \beta_u\). Replacing that expectation in the expression yields

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = \mathbf{x_i} \beta_u + E(\eta_u | \mathbf{z_i} \alpha + \epsilon < 0)\] \[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = \mathbf{x_i} \beta_u + E(\eta_u | \epsilon < -\mathbf{z_i} \alpha)\]

It can be shown that because \(\eta_u\) and \(\epsilon\) are distributed according to a multivariate truncated normal distribution that

\[E(\eta_u | \epsilon < -\mathbf{z_i} \alpha) = \rho_1 \sigma_1 \lambda(-\mathbf{z_i} \alpha) + \epsilon_u\]

The expression \(\lambda(-\mathbf{x_i} \alpha)\) says to evaluate the inverse Mill’s ratio at the value of \(-\mathbf{x_i} \alpha\). The inverse Mill’s ratio for \(x\) is given by \(\frac{\phi(x)}{\Phi(x)}\) where \(\phi()\) and \(\Phi()\) are the standard normal pdf and cdf functions, respectively.

The proof of this is not overly complicated. The simplest way is to consider a recursive definition of the moments for a truncated normal and then the consideration that the truncated normal can be generalized to the bi variate truncated normal with a multiplication by \(\rho_1\). Fully detailing the proof would bog down in integration by parts and algebra nobody cares to see.

The exact same process can be repeated for the case of \(y_t\) leading to the equations:

\[\begin{eqnarray} E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) &=& \mathbf{x_i} \beta_u - \rho_1 \sigma_1 \lambda(-\mathbf{z_i} \alpha) + \epsilon_u \\ E(y_t | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 1) &=& \mathbf{x_i} \beta_t + \rho_2 \sigma_2 \lambda(\mathbf{z_i} \alpha) + \epsilon_t \end{eqnarray}\]

where \(\epsilon_u\) and \(\epsilon_t\) are random error uncorrelated with the selection model and a mean of zero.

This equation can be estimated in two steps. The first bit, \(\lambda(-\mathbf{z_i}\alpha)\) and \(\lambda(\mathbf{z_i} \alpha)\) can both be estimated using a probit first stage:

\[E(Pr(t_i = 1)) = \Phi\left(\mathbf{z_i} \alpha \right)\]

and the resulting Inverse Mill’s Ratio can be computed using these values and a sign depending on the observed value of \(t\).

Once this variable is estimated, the complete model can be estimated to find the effects of \(\mathbf{x_i}\) as part of a second stage using least squares.

Additionally (and perhaps better), the whole model can be estimated in a single stage using maximum likelihood. Heckman’s recommendations have historically been to estimate the model in two stages to provide starting values for a single stage ML estimator. The MLE is more sensitive to starting values compared to the two stage model and so it isn’t always best, but should be the first estimator considered.

Data Generation Details


Let’s set up a (very) simple simulated example of this problem.

A disease of interest affects patients aged 65 to 85 uniformly and you want to estimating the effect of a new long-term treatment. The treatment’s benefit is not dependent on age but the probability of realizing the treatment is. The patient needs to be on the drug for an average of 10 years to realize a treatment effect (this is not totally unrealistic for some drugs such as statins). Formally, lets say that \(l\) is the lag of the treatment effect and \(l \sim gamma(20, 2)\). That gives \(l\) a mean of 10 and a variance of of 5. Some people respond early to the treatment and some respond late but 95% respond between 6 and 14 years after the treatment is started.

What is of interest is that the expected life remaining decreases over this interval. If we use the Social Security Tables we can compute the number of years of expected life remaining by age, \(d\).

Lets combine the remaining life and the distribution of \(l\) to estimate the probability of living long enough to benefit from the treatment effect for the average person:

actTable <- data.frame(age = seq(65, 85), 
                      yearsRemaining = c(18.94, 18.18, 17.44, 16.70, 15.98, 
                                        15.27, 14.57, 13.88, 13.21, 12.55, 
                                        11.91, 11.29, 10.67, 10.09,  9.51,
                                         8.94,  8.41,  7.88,  7.38,  6.91,
actTable$prTreatmentEffect <- pgamma(actTable$yearsRemaining, 20, 2)
ggplot(actTable, aes(x = age, y = prTreatmentEffect)) + 
  geom_line() + 
  scale_x_continuous("Age / Years") + 
  scale_y_continuous("Expected Probability of Outliving Treatment Lag")

Of course, the expected probability of outliving the lag isn’t the same as a given person’s probability. So lets add some noise. I don’t want to move it too far in either direction but I do want a fair amount of error, so lets say

\[P(r \geq l)_i = \Phi \left(\Phi^{-1} \left[P(d_i \geq l) \right] + \epsilon_i \right)\]

where \(\epsilon \sim N(0, 0.5^2)\), \(\Phi\) is the standard normal CDF and \(\Phi^{-1}\) is the inverse of the standard normal CDF (converting the probability to a Z-score). The same plot as above is made below but this time includes a shaded area denoting 95% sampling bounds about the mean.

# using a probit style model for simplicity but logit or other adjustments are
# equally valid
actTable$z <- qnorm(actTable$prTreatmentEffect)
actTable$lb <- pnorm(actTable$z - 2 * 0.5)
actTable$ub <- pnorm(actTable$z + 2 * 0.5)

ggplot(actTable, aes(x = age, y = prTreatmentEffect, ymin = lb, ymax = ub)) + 
  geom_line() + 
  geom_ribbon(alpha = 0.25) + 
  scale_x_continuous("Age / Years") + 
  scale_y_continuous("Expected Probability of Outliving Treatment Lag") 

Visually, this looks like a reasonable but not excessive amount of noise around the expected probability of success.

The meaning of \(\epsilon\) can be ascribed to random differences in willingness to accept the cost of the treatment. Even if you had a \(P(r \geq l)\) very near zero, if the treatment had no cost (in dollars or side effects) it would always make sense to take the treatment as the expected value would be strictly \(> 0\).

But that isn’t how treatments work. They cost you (and the insurer) money. They cost you effort to take (even if that effort is just taking a pill). They cost you in the risk of side effects.

So in this example, \(\epsilon\) can be thought of as measuring the sensitivity to the cost or the preference for the chance that the drug works.

It can also be thought of as patients sorting on their expected gain based on unobserved (and potentially unobservant to the researcher) factors. A patient who is really good about taking medications might take the medication because the cost of taking it is low. Same with the patient who is taking it because they are not worried about the side effects or believe they are worth the outcome.

This is where the problem arises. Suppose the outcome model is simple:

\[y_i = 1 + t_i + \eta_i\]

where \(y\) is some outcome of interest, \(t\) is treatment status and \(\eta\) is the error term in the outcomes model. Suppose \(\eta \sim N(0, 2^2)\).

As a general rule, people who take their medications regularly and who do not stop due to mild side effects are going to have the better outcomes. It seems likely then that \(\text{cor}(\epsilon, \eta) \neq 0\). Because the error term in the selection model was partly due to patients sorting on their expected ability to adhere and persist through side effects, it is not a stretch to imagine that the patients with larger values of \(\epsilon\) will also have larger values of \(\eta\) and likewise for smaller values.

Generating Test Sets

To recap, the selection model is

\[P(r \geq l)_i = \Phi \left(\Phi^{-1} \left[P(d_i \geq l) \right] + \epsilon_i \right)\]

where \(\epsilon \sim N(0, 0.05^2)\)

The outcomes are determined by

\[y = 1 + I_t + \eta\]

where \(I_t\) is an indicator for the patient having survived long enough to have treatment effects and \(\eta\) is \(N(0, 2^2)\). The \(\rho_{\epsilon, \eta}\) is set when the data is generated.

The function dataGen puts this all together.

dataGen <- function(cor, n = 1000, actuarialTable = actTable) {
  # generate uniform age values in the range of ages covered by the act tab
  data <- data.frame(
    age = round(runif(n, min(actuarialTable$age), max(actuarialTable$age))))
  # merge the data from the act tab and the generated ages
  data <- dplyr::select(actuarialTable, one_of("age", "yearsRemaining")) %>% 
    inner_join(data, by = ("age" = "age"))
  # compute the probability of living past treatment lag on average
  data$prTreatmentEffect <- pgamma(data$yearsRemaining, 20, 2)
  # generate error terms, mvrnorm using sigma^2 
  s1 <- 0.5
  s2 <- 1
  Sigma <- diag(c(s1^2, s2^2))
  # compute cov based on cor provided as input
  Sigma[lower.tri(Sigma)] <- cor * (s1) * (s2)
  Sigma[upper.tri(Sigma)] <- Sigma[lower.tri(Sigma)]
  err <- mvrnorm(n, c(0, 0), Sigma)
  # assign treatment based on prTreatmentEffect + noise
  data$treatment <- rbinom(n, 1, 
                           pnorm(qnorm(data$prTreatmentEffect) + err[, 1]))
  # assign outcome according to a very simple model
  data$outcome <- 1 + data$treatment + err[, 2]

Simulation in Parallel

I’m going to generate test sets with values of \(\rho\) ranging from -0.99 to 0.99 and I’m going to run 100 iterations on data generated at each of these levels. In each iteration, I’m going to compute the OLS estimator, the PS estimator and the Heckman MLE estimator.

However,as this takes a fair bit of time to run and in interest of computation time, I’m going to use the parallel package and parLapply to do this computation.

We need to nest dataGen inside the pfit function. Basically, a cluster is set of R jobs created by the makeCluster command. These jobs cannot “see” the same environment that you exist in but only what you pass them. In this case, it is easiest to pass them inside the function call.

The function pfit looks like:

pfit <- function(iter, r, n) {
  dataGen <- function(cor, n) {
    # from above
  fit <- function(r, n) {
    # similar to above but this time call dataGen inside fit
    data <- dataGen(r, n)
    # regression estimator
    linear <- coef(lm(outcome ~ treatment, data = data))[2]
    # propensity scores (using Matching package)
    model <- glm(treatment ~ cut(age, breaks = 20), data = data, 
                 family = binomial())$fitted.values
    ps <- Match(Y = data$outcome, Tr = data$treatment, X = model, ties = FALSE)$est
    # heckman mle
    hmle <- selection(data$t ~ cut(data$age, breaks = 20), 
                      list(data$outcome ~ 1, data$outcome ~ 1))
    hmle <- coef(hmle)[24] - coef(hmle)[21]
    # results
    return(c(linear, ps, hmle))
  results <- replicate(iter, fit(r, n))

Basically, pfit is passed arguments r and n where r is the correlation coefficient between the error in the selection model and outcome model. At this time, I’m forcing the errors for both outcome models and the correlation between those errors and the selection model to be equal, although in practice this need not be the case. As normal, n is the number of observations to generate.

fit then uses dataGen to generate n observations with r correlation and estimates a linear model of the outcome on treatment, a PS estimate of the treatment effect (using Match from the Matching package) and then estimates the Heckman selection model (aka Tobit-5) using the selection function of the sampleSelection package (as opposed to doing it as a two stage and doing it by hand).

The last line calls fit iter number of times using replicate and returns the data as a matrix.

To work with parLapply I wrote a second wrapper function parSim.

parSim <- function(cl, iter, r, n) {
  nw <- length(cl)
  results <- simplify2array(,
                                              rep(iter / nw, nw),
                                              fun = pfit,
                                              r = r, 
                                              n = n)))

This function takes a cluster cl and divides the iter iterations over the number of unique jobs in cl and runs fit the correct number of times on each job using parLapply. The simplify2array is simply to make the data easier to manipulate.

The results returned by parSim take the form of an array with each row being an estimator, each column being an iteration and each “slice” coming from one of the jobs.

I generated 100 replicates at 198 correlation coefficient values between \(-0.99\) and \(0.99\) (e.g., steps of \(0.01\)). While parSim does give about a 3-fold increase when the cluster is of size 4, it still takes about 60-90 minutes to run for the 198 values of \(\rho\) tested.

results <- NULL
s <- Sys.time()
for (r in seq(-0.99, 0.99, 0.01)) {
 run <- parSim(cl, iter = 100, r = r, n = 2000) 
 run <- data.frame(estimator = rep(c("LS", "PS", "HML"), 100),
                   estimate = as.numeric(run), 
                   r = r, n = 2000)
 results <- rbind(results, run)
e <- Sys.time()


First thing, first: see how the estimators perform as a function of the set value of \(\rho\):

ggplot(results, aes(x = r, y = estimate, color = estimator)) + 
  geom_point(alpha = 0.25) + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Estimate (True Value = 1)")

Generally, no big surprises here — the effect of the “omitted variable bias” in the regression and propensity scores models, given that neither accounts for the non-zero correlation is unsurprising and has the expected signs.

When the correlation is zero or nearly zero, all of the models have similar performance with estimates about 1. However, as the correlation moves away from zero, this is no longer the case. As the correlation between the error in the selection and outcome equations increases, the effect estimate is over-estimated (by 100% among the propensity scores and nearly 50% by the regression). The same pattern is seen with correlations going to \(-1\) with the propensity score, in extreme cases, even getting the sign wrong.

However, the red smear reflecting the Heckman style selection model estimates the treatment effect at nearly 1 with no sensitivity to the changing correlation.

The problems with the linear regression model and the PS model are even more clear when plotting the mean and the mean squared error at each correlation level:

performance <- group_by(results, estimator, r) %>% 
  summarize(mean = mean(estimate), lb = quantile(estimate, 0.025),
            ub = quantile(estimate, 0.975), mse = mean((1 - estimate)^2))
ggplot(performance, aes(x = r, y = mean, ymin = lb, ymax = ub, 
                        color = estimator)) + 
  geom_point() + 
  geom_ribbon(alpha = 0.10, aes(fill = estimator)) + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Mean Estimated Treatment Effect")

The mean estimated effect is denoted by the dots with a 95% bootstrapped confidence interval about the mean shown as the shaded region. The relative spread of the PS estimates relative to the two regression-style models is clear. The very high sensitivity to endogeneity in the treatment effect for the propensity scores models is also very visible. The linear regression model exhibits smaller CI but suffers from bias as correlation moves away from zero, if not to the same degree as the propensity scores.

The mean of the Heckman style selection estimates barely deviates from the true mean of \(1\).

The same information is visible in terms of the mean squared error.

ggplot(performance, aes(x = r, color = estimator, y = mse)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Mean Squared Error")

The MSE for the propensity score and least squares models both exhibit the expected parabolic shape with a global minimum at correlation of zero. The robustness of the Heckman style estimator to endogeneity in the treatment estimate.

Code and Data

The result of the simulations can be downloaded here (2.4MB) as a csv. The script used to generate the data can be downloaded here, although the entire script is reproduced in the above post.


Endogenous relationships create real problems for proper and unbiased estimation of treatment effects and standard methods for dealing with selection on the observables, such as propensity scores and simple regression, fails to adequately deal with this problem. Heckman style selection models provide one means for dealing with this problem. Although they are not a panacea, these frameworks are a valuable tool for estimation in the presence of endogeneity due to correlation between the selection and outcome processes.

Tags: statistics r methods heckman

Related Posts