Parallel Simulation of Heckman Selection Model

Parallel Simulation of Heckman Selection Model

One of the, if not the, fundamental problems in observational data analysis is the estimation of the value of the unobserved choice. If the \(i^{\text{th}}\) unit chooses the value of \(t\) on the basis of some factors \(\mathbf{x_i}\), which may include the \(E(u_i(t))\) for that unit, comparing the outcome \(\mathbf{y}\) on a set where \(t = 1\) and a set where \(t = 0\) is unlikely to provide valid inference.

Randomized controlled trials get rid of this problem by setting the \(Pr(t_i = 1)\) to be equal and independent of the values of \(x_i\) for all \(i\). This provides a good estimation of the value of \(t\).

But randomized controlled trials just aren’t reasonable for any questions, if they are even possible. And so observational data analysis is critical but it must first find some way to control for everything in \(\mathbf{x_i}\) while to compute an unbiased and valid estimate of \(E(u_i(t == 1)) - E(u_i(t == 0))\).

This can be straight-froward. It is easy to deal with outcome modifiers or selection modifiers with propensity scores or regression. However, both of these methods breakdown when faced with correlation between the error in the selection equation and the outcomes equation.

Heckman Model

The Heckman Model (aka Heckman correction/Heckit) treats this correlation between the errors as a form of mis-specification and omitted variable bias. If we could observe the behavioral factors that feed into the error terms, this would not be a problem. Alas, that is generally not possible.

What the Heckman Model suggests is that for each person we have

\[t_i^* = \mathbf{z_i} \alpha + \epsilon_i\]


\[y_{u, i}^* = \mathbf{x_i} \beta_u + \eta_{u, i}\] \[y_{t, i}^* = \mathbf{x_i} \beta_t + \eta_{t, i}\]

The first equation gives the likelihood of being given treatment \(t\), this should similar similar to propensity score models. The second set of equations denote the outcomes that would be realized if a patient were given \(t\). However, we can’t observe both outcomes on the same unit (what Rubin and Holland call the fundamental problem of casual inference).

Instead, unless we have randomized \(t\), we observe

\[t_i = \begin{cases} 0 \text{ if } \mathbf{z_i} \alpha + \epsilon_i < 0 \\ 1 \text{ if } \mathbf{z_i} \alpha + \epsilon_i \geq 0 \end{cases}\]

and from \(t_i\),

\[y_i = \begin{cases} y_{u, i}^* \text{ if } t_i = 0 \\ y_{t, u}^* \text{ if } t_i = 1 \end{cases}\]

The value \(y\) is the realized value and its dependence on \(t\) is clear.

Still, this isn’t a real problem. It can all be solved with regression or propensity scores or similar analysis. The problem comes in the error terms. Basically, often times problems present with a correlation between the error terms in the selection and outcome equations. We could write

\[\left(\begin{array}{c} \epsilon \\ \eta_u \\ \eta_t \\ \end{array} \right) = N\left( \left(\begin{array}{c}0\\ 0\\ 0\\ \end{array}\right), \left(\begin{array}{ccc} 1 & \rho_1 \sigma_1 & \rho_2 \sigma_2 \\ \rho_1 \sigma_1 & \sigma_2^2 &\rho_{12}\sigma_1 \sigma_2 \\ \rho_2 \sigma_2 & \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{array} \right) \right)\]

and whenever both \(\rho_1\) and \(\rho_2\) are not 0, there will be poor estimation of \(E(y_t^*) - E(y_u^*)\).

What the Heckman model advances in the idea that \(E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0)\) can be broken down into two parts. Rather simply, the expectation breaks down into the true expectation of the unobserved value of \(y_u^*\) plus the expected value of \(\eta_u\) conditional on \(t_i = 0\):

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = E(y_u^*) + E(\eta_u | t_i = 0)\]

Since \(t_i = 0\) if \(\mathbf{z_i} \alpha + \epsilon_i\), we can expand this to

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) + E(\eta_u | \mathbf{z_i} \alpha + \epsilon < 0)\]

Looking back towards the top, we see that \(E(y_u^*) = \mathbf{x_i} \beta_u\). Replacing that expectation in the expression yields

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = \mathbf{x_i} \beta_u + E(\eta_u | \mathbf{z_i} \alpha + \epsilon < 0)\] \[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = \mathbf{x_i} \beta_u + E(\eta_u | \epsilon < -\mathbf{z_i} \alpha)\]

It can be shown that because \(\eta_u\) and \(\epsilon\) are distributed according to a multivariate truncated normal distribution that

\[E(\eta_u | \epsilon < -\mathbf{z_i} \alpha) = \rho_1 \sigma_1 \lambda(-\mathbf{z_i} \alpha) + \epsilon_u\]

The expression \(\lambda(-\mathbf{x_i} \alpha)\) says to evaluate the inverse Mill’s ratio at the value of \(-\mathbf{x_i} \alpha\). The inverse Mill’s ratio for \(x\) is given by \(\frac{\phi(x)}{\Phi(x)}\) where \(\phi()\) and \(\Phi()\) are the standard normal pdf and cdf functions, respectively.

The proof of this is not overly complicated. The simplest way is to consider a recursive definition of the moments for a truncated normal and then the consideration that the truncated normal can be generalized to the bi variate truncated normal with a multiplication by \(\rho_1\). Fully detailing the proof would bog down in integration by parts and algebra nobody cares to see.

The exact same process can be repeated for the case of \(y_t\) leading to the equations:

\[\begin{eqnarray} E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) &=& \mathbf{x_i} \beta_u - \rho_1 \sigma_1 \lambda(-\mathbf{z_i} \alpha) + \epsilon_u \\ E(y_t | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 1) &=& \mathbf{x_i} \beta_t + \rho_2 \sigma_2 \lambda(\mathbf{z_i} \alpha) + \epsilon_t \end{eqnarray}\]

where \(\epsilon_u\) and \(\epsilon_t\) are random error uncorrelated with the selection model and a mean of zero.

This equation can be estimated in two steps. The first bit, \(\lambda(-\mathbf{z_i}\alpha)\) and \(\lambda(\mathbf{z_i} \alpha)\) can both be estimated using a probit first stage:

\[E(Pr(t_i = 1)) = \Phi\left(\mathbf{z_i} \alpha \right)\]

and the resulting Inverse Mill’s Ratio can be computed using these values and a sign depending on the observed value of \(t\).

Once this variable is estimated, the complete model can be estimated to find the effects of \(\mathbf{x_i}\) as part of a second stage using least squares.

Additionally (and perhaps better), the whole model can be estimated in a single stage using maximum likelihood. Heckman’s recommendations have historically been to estimate the model in two stages to provide starting values for a single stage ML estimator. The MLE is more sensitive to starting values compared to the two stage model and so it isn’t always best, but should be the first estimator considered.

Data Generation Details


Let’s set up a (very) simple simulated example of this problem.

A disease of interest affects patients aged 65 to 85 uniformly and you want to estimating the effect of a new long-term treatment. The treatment’s benefit is not dependent on age but the probability of realizing the treatment is. The patient needs to be on the drug for an average of 10 years to realize a treatment effect (this is not totally unrealistic for some drugs such as statins). Formally, lets say that \(l\) is the lag of the treatment effect and \(l \sim gamma(20, 2)\). That gives \(l\) a mean of 10 and a variance of of 5. Some people respond early to the treatment and some respond late but 95% respond between 6 and 14 years after the treatment is started.

What is of interest is that the expected life remaining decreases over this interval. If we use the Social Security Tables we can compute the number of years of expected life remaining by age, \(d\).

Lets combine the remaining life and the distribution of \(l\) to estimate the probability of living long enough to benefit from the treatment effect for the average person:

actTable <- data.frame(age = seq(65, 85), 
                      yearsRemaining = c(18.94, 18.18, 17.44, 16.70, 15.98, 
                                        15.27, 14.57, 13.88, 13.21, 12.55, 
                                        11.91, 11.29, 10.67, 10.09,  9.51,
                                         8.94,  8.41,  7.88,  7.38,  6.91,
actTable$prTreatmentEffect <- pgamma(actTable$yearsRemaining, 20, 2)
ggplot(actTable, aes(x = age, y = prTreatmentEffect)) + 
  geom_line() + 
  scale_x_continuous("Age / Years") + 
  scale_y_continuous("Expected Probability of Outliving Treatment Lag")

Of course, the expected probability of outliving the lag isn’t the same as a given person’s probability. So lets add some noise. I don’t want to move it too far in either direction but I do want a fair amount of error, so lets say

\[P(r \geq l)_i = \Phi \left(\Phi^{-1} \left[P(d_i \geq l) \right] + \epsilon_i \right)\]

where \(\epsilon \sim N(0, 0.5^2)\), \(\Phi\) is the standard normal CDF and \(\Phi^{-1}\) is the inverse of the standard normal CDF (converting the probability to a Z-score). The same plot as above is made below but this time includes a shaded area denoting 95% sampling bounds about the mean.

# using a probit style model for simplicity but logit or other adjustments are
# equally valid
actTable$z <- qnorm(actTable$prTreatmentEffect)
actTable$lb <- pnorm(actTable$z - 2 * 0.5)
actTable$ub <- pnorm(actTable$z + 2 * 0.5)

ggplot(actTable, aes(x = age, y = prTreatmentEffect, ymin = lb, ymax = ub)) + 
  geom_line() + 
  geom_ribbon(alpha = 0.25) + 
  scale_x_continuous("Age / Years") + 
  scale_y_continuous("Expected Probability of Outliving Treatment Lag") 

Visually, this looks like a reasonable but not excessive amount of noise around the expected probability of success.

The meaning of \(\epsilon\) can be ascribed to random differences in willingness to accept the cost of the treatment. Even if you had a \(P(r \geq l)\) very near zero, if the treatment had no cost (in dollars or side effects) it would always make sense to take the treatment as the expected value would be strictly \(> 0\).

But that isn’t how treatments work. They cost you (and the insurer) money. They cost you effort to take (even if that effort is just taking a pill). They cost you in the risk of side effects.

So in this example, \(\epsilon\) can be thought of as measuring the sensitivity to the cost or the preference for the chance that the drug works.

It can also be thought of as patients sorting on their expected gain based on unobserved (and potentially unobservant to the researcher) factors. A patient who is really good about taking medications might take the medication because the cost of taking it is low. Same with the patient who is taking it because they are not worried about the side effects or believe they are worth the outcome.

This is where the problem arises. Suppose the outcome model is simple:

\[y_i = 1 + t_i + \eta_i\]

where \(y\) is some outcome of interest, \(t\) is treatment status and \(\eta\) is the error term in the outcomes model. Suppose \(\eta \sim N(0, 2^2)\).

As a general rule, people who take their medications regularly and who do not stop due to mild side effects are going to have the better outcomes. It seems likely then that \(\text{cor}(\epsilon, \eta) \neq 0\). Because the error term in the selection model was partly due to patients sorting on their expected ability to adhere and persist through side effects, it is not a stretch to imagine that the patients with larger values of \(\epsilon\) will also have larger values of \(\eta\) and likewise for smaller values.

Generating Test Sets

To recap, the selection model is

\[P(r \geq l)_i = \Phi \left(\Phi^{-1} \left[P(d_i \geq l) \right] + \epsilon_i \right)\]

where \(\epsilon \sim N(0, 0.05^2)\)

The outcomes are determined by

\[y = 1 + I_t + \eta\]

where \(I_t\) is an indicator for the patient having survived long enough to have treatment effects and \(\eta\) is \(N(0, 2^2)\). The \(\rho_{\epsilon, \eta}\) is set when the data is generated.

The function dataGen puts this all together.

dataGen <- function(cor, n = 1000, actuarialTable = actTable) {
  # generate uniform age values in the range of ages covered by the act tab
  data <- data.frame(
    age = round(runif(n, min(actuarialTable$age), max(actuarialTable$age))))
  # merge the data from the act tab and the generated ages
  data <- dplyr::select(actuarialTable, one_of("age", "yearsRemaining")) %>% 
    inner_join(data, by = ("age" = "age"))
  # compute the probability of living past treatment lag on average
  data$prTreatmentEffect <- pgamma(data$yearsRemaining, 20, 2)
  # generate error terms, mvrnorm using sigma^2 
  s1 <- 0.5
  s2 <- 1
  Sigma <- diag(c(s1^2, s2^2))
  # compute cov based on cor provided as input
  Sigma[lower.tri(Sigma)] <- cor * (s1) * (s2)
  Sigma[upper.tri(Sigma)] <- Sigma[lower.tri(Sigma)]
  err <- mvrnorm(n, c(0, 0), Sigma)
  # assign treatment based on prTreatmentEffect + noise
  data$treatment <- rbinom(n, 1, 
                           pnorm(qnorm(data$prTreatmentEffect) + err[, 1]))
  # assign outcome according to a very simple model
  data$outcome <- 1 + data$treatment + err[, 2]

Simulation in Parallel

I’m going to generate test sets with values of \(\rho\) ranging from -0.99 to 0.99 and I’m going to run 100 iterations on data generated at each of these levels. In each iteration, I’m going to compute the OLS estimator, the PS estimator and the Heckman MLE estimator.

However,as this takes a fair bit of time to run and in interest of computation time, I’m going to use the parallel package and parLapply to do this computation.

We need to nest dataGen inside the pfit function. Basically, a cluster is set of R jobs created by the makeCluster command. These jobs cannot “see” the same environment that you exist in but only what you pass them. In this case, it is easiest to pass them inside the function call.

The function pfit looks like:

pfit <- function(iter, r, n) {
  dataGen <- function(cor, n) {
    # from above
  fit <- function(r, n) {
    # similar to above but this time call dataGen inside fit
    data <- dataGen(r, n)
    # regression estimator
    linear <- coef(lm(outcome ~ treatment, data = data))[2]
    # propensity scores (using Matching package)
    model <- glm(treatment ~ cut(age, breaks = 20), data = data, 
                 family = binomial())$fitted.values
    ps <- Match(Y = data$outcome, Tr = data$treatment, X = model, ties = FALSE)$est
    # heckman mle
    hmle <- selection(data$t ~ cut(data$age, breaks = 20), 
                      list(data$outcome ~ 1, data$outcome ~ 1))
    hmle <- coef(hmle)[24] - coef(hmle)[21]
    # results
    return(c(linear, ps, hmle))
  results <- replicate(iter, fit(r, n))

Basically, pfit is passed arguments r and n where r is the correlation coefficient between the error in the selection model and outcome model. At this time, I’m forcing the errors for both outcome models and the correlation between those errors and the selection model to be equal, although in practice this need not be the case. As normal, n is the number of observations to generate.

fit then uses dataGen to generate n observations with r correlation and estimates a linear model of the outcome on treatment, a PS estimate of the treatment effect (using Match from the Matching package) and then estimates the Heckman selection model (aka Tobit-5) using the selection function of the sampleSelection package (as opposed to doing it as a two stage and doing it by hand).

The last line calls fit iter number of times using replicate and returns the data as a matrix.

To work with parLapply I wrote a second wrapper function parSim.

parSim <- function(cl, iter, r, n) {
  nw <- length(cl)
  results <- simplify2array(,
                                              rep(iter / nw, nw),
                                              fun = pfit,
                                              r = r, 
                                              n = n)))

This function takes a cluster cl and divides the iter iterations over the number of unique jobs in cl and runs fit the correct number of times on each job using parLapply. The simplify2array is simply to make the data easier to manipulate.

The results returned by parSim take the form of an array with each row being an estimator, each column being an iteration and each “slice” coming from one of the jobs.

I generated 100 replicates at 198 correlation coefficient values between \(-0.99\) and \(0.99\) (e.g., steps of \(0.01\)). While parSim does give about a 3-fold increase when the cluster is of size 4, it still takes about 60-90 minutes to run for the 198 values of \(\rho\) tested.

results <- NULL
s <- Sys.time()
for (r in seq(-0.99, 0.99, 0.01)) {
 run <- parSim(cl, iter = 100, r = r, n = 2000) 
 run <- data.frame(estimator = rep(c("LS", "PS", "HML"), 100),
                   estimate = as.numeric(run), 
                   r = r, n = 2000)
 results <- rbind(results, run)
e <- Sys.time()


First thing, first: see how the estimators perform as a function of the set value of \(\rho\):

ggplot(results, aes(x = r, y = estimate, color = estimator)) + 
  geom_point(alpha = 0.25) + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Estimate (True Value = 1)")

Generally, no big surprises here — the effect of the “omitted variable bias” in the regression and propensity scores models, given that neither accounts for the non-zero correlation is unsurprising and has the expected signs.

When the correlation is zero or nearly zero, all of the models have similar performance with estimates about 1. However, as the correlation moves away from zero, this is no longer the case. As the correlation between the error in the selection and outcome equations increases, the effect estimate is over-estimated (by 100% among the propensity scores and nearly 50% by the regression). The same pattern is seen with correlations going to \(-1\) with the propensity score, in extreme cases, even getting the sign wrong.

However, the red smear reflecting the Heckman style selection model estimates the treatment effect at nearly 1 with no sensitivity to the changing correlation.

The problems with the linear regression model and the PS model are even more clear when plotting the mean and the mean squared error at each correlation level:

performance <- group_by(results, estimator, r) %>% 
  summarize(mean = mean(estimate), lb = quantile(estimate, 0.025),
            ub = quantile(estimate, 0.975), mse = mean((1 - estimate)^2))
ggplot(performance, aes(x = r, y = mean, ymin = lb, ymax = ub, 
                        color = estimator)) + 
  geom_point() + 
  geom_ribbon(alpha = 0.10, aes(fill = estimator)) + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Mean Estimated Treatment Effect")

The mean estimated effect is denoted by the dots with a 95% bootstrapped confidence interval about the mean shown as the shaded region. The relative spread of the PS estimates relative to the two regression-style models is clear. The very high sensitivity to endogeneity in the treatment effect for the propensity scores models is also very visible. The linear regression model exhibits smaller CI but suffers from bias as correlation moves away from zero, if not to the same degree as the propensity scores.

The mean of the Heckman style selection estimates barely deviates from the true mean of \(1\).

The same information is visible in terms of the mean squared error.

ggplot(performance, aes(x = r, color = estimator, y = mse)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Mean Squared Error")

The MSE for the propensity score and least squares models both exhibit the expected parabolic shape with a global minimum at correlation of zero. The robustness of the Heckman style estimator to endogeneity in the treatment estimate.

Code and Data

The result of the simulations can be downloaded here (2.4MB) as a csv. The script used to generate the data can be downloaded here, although the entire script is reproduced in the above post.


Endogenous relationships create real problems for proper and unbiased estimation of treatment effects and standard methods for dealing with selection on the observables, such as propensity scores and simple regression, fails to adequately deal with this problem. Heckman style selection models provide one means for dealing with this problem. Although they are not a panacea, these frameworks are a valuable tool for estimation in the presence of endogeneity due to correlation between the selection and outcome processes.

The Problem with Propensity Scores

Are Propensity Scores Useful?

Effect estimation for treatments using observation data isn't always straight forward. For example, it is very common that patients who are treated with a certain medication or procedure are healthier than those who are not treated. Those who aren't treated may not be treated due to a higher risk of treatment in their condition or a reduced expected value of treatment (e.g., a statin does not make sense as a new treatment for a patient who is 100 years old due to other causes of mortality). If you simply compared the the difference between the means of some response among the treated and untreated populations, the estimate would be confounded by the differences between the populations.

One method of dealing with this problem was to match on the observed characteristics thought to be relevant. Studies would find a control patient that matched the case patient on age, gender and a number of other factors. The problem with this approach is that it is expensive and impractical to match on anything more than a few factors due to the curse of dimensionality. To avoid “throwing away” unmatched cases, the number of required candidate controls to be sampled increases exponentially with the number of matched factors.

Propensity scores are a possible alternative to this problem. Instead of matching points either directly or using a nearest-neighbor method in \(k\) dimensions, a model can be constructed to express the propensity (e.g., probability) to be treated as a function of the observed variables. Then, matching can be done in 1 dimension with this propensity. By reducing the \(k\) dimensional problem to 1 dimension, it is possible to increase the number of matched factors and to also use the collected data more effectively. The analysis of the matched groups can be done using a relatively simple method, such as a t test.

While much can be said for reducing the complexity of the analysis to a simple bivariate test and for the potential relatively greater ease for modeling treatment choice as opposed to outcome, the value of a PS-adjusted bivariate test isn't clear to me. To explore the question in more detail, I did a few simulations in R to model how well a OLS estimator and a PS adjusted estimator did at estimating a treatment effect under the “best case” conditions.


Propensity scores are typically estimated by fitting a binomial GLM with treatment status (0 or 1) as the outcome and all other covariates being predictors. Lets generate a set of 10 predictors, treatment and the response variable. The predictors will all be iid normal and will be the only non-white noise elements in the determination of treatment status. I selected an error term with a standard deviation of 5 empirically.

X <- replicate(10, rnorm(1000))
error <- rnorm(1000, 0, 5)
probTreatment <- exp(cbind(1, X) %*% rep(1, 11) + error) / 
  (1 + exp(cbind(1, X) %*% rep(1, 11) + error))

The vector probTreatment gives us \(\text{Pr}(T = 1 | x_1, x_2, \ldots, x_{10})\). If we use rbinom, we can randomly assign treatment based on the different probabilities:

t <- rbinom(1000, 1, probTreatment)

Now that we have treatment assigned, lets generate the response variable.

y <- 1 + X %*% rep(1, 10) + t + rnorm(1000, 0, 5)

As mentioned above, simply comparing the values of \(y\) for the treated and untreated groups won't produce the right estimates.

t.test(y[t == 1], y[t == 0])
##  Welch Two Sample t-test
## data:  y[t == 1] and y[t == 0]
## t = 8.8739, df = 945.607, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.421779 3.797091
## sample estimates:
##  mean of x  mean of y 
##  2.7115980 -0.3978372

The estimated treatment effect is 3.11, a 310.94% over estimate of the true value of 1.

However, if we estimate a propensity score model and match on the nearest unmatched control subject to the difference between the matches being less than 20% of the standard deviation of the pooled propensity scores, we get a better estimate. First, the model:

ps <- glm(t ~ X, family = binomial())
psValues <- predict(ps)
psTreated <- psValues[t == 1]
psUntreated <- psValues[t == 0]

The next step is to match subject to the 20% of the standard deviation limit:

# find matches, require be within 20% of a sd of the untransformed pooled ps
matches <- matrix(NA, ncol = 3, nrow = NROW(y[t == 1]))
limit <- 0.2 * sd(predict(ps))
for (i in 1:NROW(y[t == 1])) {
  x <- psTreated[i]
  distance <- (x - psUntreated)^2
  if (min(distance) <= limit) {
    nn <- which.min(distance)
    psUntreated[nn] <- .Machine$integer.max
    matches[i, ] <- c(i, nn, distance[nn]) 
  } else {
    matches[i, ] <- c(i, NA, NA)

The loop takes each treated patient and searches for an unmatched untreated patient near the treated patient's propensity score and within the constraint. Once matched, the propensity score for the untreated patient is set to .Machine$integer.max to prevent further matches to the same patient (setting the value to NA introduces problems and .Machine$integer.max is safe enough). If there is no possible match subject to the constraints, the treated patient is matched to NA to reflect this fact.

Using the matches, we can construct two new vectors, yTreatedMatched and yUntreatedMatched for the t test. Based on propensity score theory, and true in this simulation, conditional on the observed characteristics the assignment to treatment or not treatment is random. As a result, these two groups are similar to what you might see in an RCT (in theory) and so the differences should be an unbiased and accurate estimate of the treatment effect.

# throw out the unmatched patients
matches <- matches[![, 2]), ]
yTreatedMatched <- y[t == 1][matches[, 1]]
yUntreatedMatched <- y[t == 0][matches[, 2]]
t.test(yTreatedMatched, yUntreatedMatched)
##  Welch Two Sample t-test
## data:  yTreatedMatched and yUntreatedMatched
## t = 2.2593, df = 654.727, p-value = 0.0242
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1224774 1.7492684
## sample estimates:
## mean of x mean of y 
## 1.4946742 0.5588013

And, as expected, the propensity score matched estimator produces an accurate measure of the treatment effect, 0.94, compared to a true effect of 1. The result is also statistically significantly different from zero.

However, it required that we threw away data. We “collected” data on 565 treated patients but with 435 untreated patients, we were only able to match 330, or 58.41%.

Additionally, had we instead used a OLS regression model, we could have produced similar estimates.

model <- lm(y ~ X + t)
## Call:
## lm(formula = y ~ X + t)
## Coefficients:
## (Intercept)           X1           X2           X3           X4  
##      1.1242       1.0259       0.7914       0.5494       0.8666  
##          X5           X6           X7           X8           X9  
##      0.7482       0.8879       0.9435       1.2965       0.8275  
##         X10            t  
##      0.6487       0.7250

Moreover, the 95% CI produced using the OLS method is smaller ranging from 0.05 to 1.4 compared to the t test CI ranging from 0.12 to 1.75.

The next step is to see how this holds up for a variety of values of \(n\) and sampling ratios (e.g., 2 controls per case vs 1 control per case). To this end, the function psSim simulates a propensity score adjusted model. It works by constructing a large population of size n with k covariates and 1 treatment variable. From this large population, sampleN cases are selected at random and ratio * sampleN controls are selected. (I'm using case and control here to mean treated vs untreated, not in the strict sense of a [case-control study] (

From these sampled cases and controls, a propensity score model is estimated and the treatment effect is estimated using the matched populations. The function returns a 2 element vector, the first element is the estimated treatment effect and the second is the p value from the t test.

The function is below:

psSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors to the propensity score model
  pErr <- rnorm(n, 0, 5)  
  pt <- exp(1 + X %*% rep(1, k) + pErr)/(1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)
  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr
  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))
  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]), 
                                           ratio * sampleN), ]
  # estimate propensity scores using all k X predictors
  sample <- rbind(treatedSample, untreatedSample)
  ps <- glm(t ~ ., data = sample[-1], 
            family = binomial())
  # find untransformed fitted values
  psTreated <- predict(ps)[sample$t == 1]
  psUntreated <- predict(ps)[sample$t == 0]
  # find matches, require be within 20% of a sd of the untransformed pooled ps
  matches <- matrix(NA, ncol = 3, nrow = nrow(treatedSample))
  limit <- 0.2 * sd(predict(ps))
  for (i in 1:nrow(treatedSample)) {
    x <- psTreated[i]
    distance <- (x - psUntreated)^2
    if (min(distance) <= limit) {
      nn <- which.min(distance)
      psUntreated[nn] <- .Machine$integer.max
      matches[i, ] <- c(i, nn, distance[nn]) 
    } else {
      matches[i, ] <- c(i, NA, NA)
  # do analysis
  treatedY <- treatedSample[, 1]
  untreatedY <- untreatedSample[, 1]
  matches <- matches[![, 2]), ]
  matchedTreatedY <- treatedY[matches[, 1]]
  matchedUntreatedY <- untreatedY[matches[, 2]]
  test <- t.test(matchedTreatedY, matchedUntreatedY)
  results <- c(mean(matchedTreatedY - matchedUntreatedY), test$p.value)

I also wrote a second function that generates the population and sample in the same way but instead of using propensity score matching, the treatment effect is estimated directly with OLS. This function is called lmSim:

lmSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors to the propensity score model
  pErr <- rnorm(n, 0, 5)  
  pt <- exp(1 + X %*% rep(1, k) + pErr)/(1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)
  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr
  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))
  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]), 
                                           ratio * sampleN), ]
  sampleData <- rbind(treatedSample, untreatedSample)
  # fit OLS model
  model <- summary(lm(y ~ ., data = sampleData))
  results <- c(model$coefficients[2, 1], model$coefficients[2, 4])

Using these two functions, I ran each estimator through 100 iterations for sample sizes of 50 to 1000 cases and control sets of 1 to 2 times larger. If you are running this at home, it takes a while. There is a second post that discusses running this code in parallel (using the parallel package) that also does a little optimization to the code as well (both technically and statistically) that is worth checking out, otherwise, be ready to wait a while (about 45-60 minutes on 3.0 GHz CPU).

psMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
psP <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsP <- matrix(NA, nrow = 100 * 3, ncol = 100)
i <- 1
for (n in seq(50, 1000, length.out = 100)) {
  for (ratio in c(1, 1.5, 2)) {
    psResults <- replicate(100, psSim(10000, 10, n, ratio))
    lmResults <- replicate(100, lmSim(10000, 10, n, ratio))
    psMean[i, ] <- psResults[1, ]
    psP[i, ] <- psResults[2, ]
    lsMean[i, ] <- lmResults[1, ]
    lsP[i, ] <- lmResults[2, ]
    i <- i + 1

And some data munging to get it into a reasonable format for plotting:

results <- data.frame(estimatedEffect = c(as.numeric(psMean), 
results$ratio <- c(1, 1.5, 2)
results$n <- rep(seq(50, 1000, length.out = 100), each = 3)
results$method <- rep(c("PS", "LS"), each = 300 * 100)
results$p <- c(as.numeric(psP), as.numeric(lsP))
results$sig <- results$p <= 0.05


If we plot the estimate against the sample size (of cases):

ggplot(results, aes(x = n, y = estimatedEffect)) + 
  geom_point(aes(color = method), alpha = 0.25) + 
  facet_grid(ratio ~ method) + 
  scale_y_continuous("Estimated Treatment Effect") + 
  geom_hline(aes(yintercept = 1))

The code produces a 3x3 grid of plots, the left plots are the estimates from OLS and the right from from the propensity score matched t test. The rows are the different ratios of control to case sampling (1, 1.5 and 2).

It is clear that both estimators produce reasonably unbiased estimates of the true treatment effect of 1. Furthermore, both appear to converge to the true value of 1 as the sample size goes to infinity.

However, it looks like the range of the estimates of the OLS estimator may be smaller at a given value of \(n\) than the propensity scores estimator. To this end, I calculate the mean squared error (MSE) of the estimate from the true value of 1 for both methods at each sample size.

# from dplyr
mse <- summarise(group_by(results, method, n, ratio), mean((estimatedEffect - 1)^2))
names(mse) <- c("method", "n", "ratio", "mse")

When plotted

ggplot(mse, aes(x = n, y = mse, color = method)) + 
  geom_point() + 
  facet_grid(ratio ~ .) + 
  scale_x_continuous("N") + 

and especially when just a smoother is shown, not the actual points,

ggplot(mse, aes(x = n, y = mse, color = method)) + 
  geom_smooth(method = "loess") + 
  facet_grid(ratio ~ .) + 
  scale_x_continuous("N") + 

it becomes clear that the MSE of a regression method is smaller than the typical MSE of the propensity score method for a fixed \(k\) as \(n\) varies.

Unsurprisingly, the higher MSE also shows up in the power testing. If we look at the p values of the tests:

ggplot(results, aes(x = n, y = p)) + 
  geom_point(aes(color = method), alpha = 0.25) + facet_grid(ratio ~ method) + 
  scale_y_continuous("Observed p value") + geom_hline(aes(yintercept = 0.05))

As we known that the true treatment effect is 1, we should expect fairly small p values. Both modeling frameworks appear to give similar results, although, as with the estimate and MSE plots above, relative differences in performance emerge as we consider the power curve.

If we formalize this with an estimated power curve:

power <- group_by(results, method, n, ratio) %>%
  summarize(power = mean(sig))
ggplot(power, aes(x = n, y = power, color = method)) + 
  geom_point(aes(color = method)) + facet_grid(ratio ~ method) + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("N") + 

And again but with just the LOESS fit and no points:

ggplot(power, aes(x = n, y = power * 100, color = method)) + 
  facet_grid(ratio ~ .) + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("N") + 

We see a relative difference between the two methods with the difference most pronounced at larger values of \(k\).

ps <- power[power$method == "PS", ]
ols <- power[power$method == "LS", ]
diffPower <- data.frame(ratio = ps$ratio,
                        diff = ols$power - ps$power,
                        n = ps$n)
diffPower$`ls more powerful` <- diffPower$diff > 0
ggplot(diffPower, aes(x = n, y = diff * 100)) +
  geom_point(aes(color = `ls more powerful`)) + 
  facet_grid(ratio ~ .) + 
  geom_smooth(method = "loess", alpha = 0) + 
  scale_x_continuous("N") + 
  scale_y_continuous("Percentage Point Difference in Power")

In general, propensity score methods for a fixed \(k\) over a varied \(n\) are less powerful than regression methods at the same values of \(n\) and \(k\). These differences become small as \(n\) gets very large, say 500-750, but for smaller values of \(n\) they can make a meaningful difference. A 10-20 percentage point increase in power by simply using regression over matching is hard to turn down.

The benefits of using linear models becomes more clear when you look at the variance of \(k\). With the same \(n\) and increasing \(k\), the linear model becomes increasingly powerful, to as much as 10 percentage points greater power. When the ratio of cases and controls gets larger, the power of regression is clearer than ever.

There are other problems with propensity scores. The propensity score models in this simulation were “perfect.” Conditional on all the observed variables, treatment choice was random, as our model was an exact restatement of the data generating process. This is unlikely to be the case in reality. Few, if any, papers evaluate the performance of their propensity score model. Many don't even provide information about what went into the model and the model selection, much less a cross-validation MSE.

They offer no improvement over regression in terms of unobserved heterogeneity. The scores are functionally identical to doing a regression and so suffer from all the relevant drawbacks.

Finally, propensity scores are computationally more expensive than regression. The file k.csv contains timings from a run of the propensity score and regression methods on similar data sets with a fixed value of \(n = 1000\) and \(k\) ranging from 10 to 100 done with 100 replications at each combination on my Linode VPS.

timings <- read.csv("/home/jacob/documents/blog/ps/k.csv")
# have to remove an outlier
timings <- timings[timings$k != 10, ]
timings$method <- ifelse(timings$m == 0, "LS", "PS")
ggplot(timings, aes(x = k, y = t, color = method)) + geom_point() + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("K") + 
  scale_y_continuous("Time / s")

Regression is nearly always much faster than propensity score matching. More importantly, as \(k\) increases with a fixed \(n\), the cost of regression increases approximately linearly. The non-linear increase with respect to \(k\) is scary — a model that estimates and then matches on 500 predictors could take hours to fix, or seconds with a OLS model.

This makes sense, lm() and similar functions hand off the computation to highly optimized numerical linear algebra routines. No matching function, even if written in a lower level language like C or C++ via Rcpp will be nowhere near as fast and optimized as LINPACK/LAPACK/BLAS.

It would be hard, if not impossible, to match a propensity score method as fast as regression for two reasons. The propensity score method already does the regression in order to estimate the propensity scores. The matching then adds on an additional and greater cost.


This added cost is just that, added cost. It doesn't provide a power benefit (if anything, it comes with a power penalty), computational benefit or reduction in complexity. While PS methods work, there is practically no circumstance in which they work and regression methods do not work at least as well.

Frequentist German Tank Problem

The German Tank Problem: The Frequentist Way

Many things are given a serial number and often that serial number, logically, starts at 1 and for each new unit is increased by 1. For example, German tanks in World War II had several parts with serial numbers. By collecting the value of these numbers, Allied statisticians could produce estimates of the total number of tanks produced by the Germans in a given time period.

The idea behind this was that the serial numbers encountered in the field were samples from a discrete uniform distribution starting at 1 and terminating at some unknown value \( N \), where \( N \) is the true number of tanks built. The problem then became to use the information from the tanks captured/destroyed in battle estimate the value of \( N \).

If you had an urn with numbered balls from 1 to \( N \) and draw \( k \) balls with replacement, the maximum value observed in the sample is a reasonable estimator of \( N \).

A simple simulation shows that this is the case:

sampleUrn <- function(k, N = 10000, r = 100) {
    urn <- 1:N
    sampleMax <- replicate(r, max(sample(urn, k, replace = TRUE)))
k <- seq(1, 10000, length.out = 100)
sampleMax <- lapply(k, sampleUrn)
sampleMax <- data.frame(maxValue =, sampleMax), k = rep(k, each = 100))
ggplot(sampleMax, aes(x = k, y = maxValue)) + geom_point() + scale_y_continuous("Sample Maximum")

Indeed, as suggested by the plot, the estimator is consistent for \( N \) (a formal treatment is in Example 5.1.1 in Hogg, McKean and Craig 6th edition).

However, try as they might, the Allies couldn't capture enough tanks to make an appeal to asymptotic theory. For fairly small sample sizes, the bias of this estimator is significant. This makes sense as the sample max cannot exceed but only equal the maximum value of the support and the probability of a draw containing a given sample depends on the size of the draw (and the sample space).

Furthermore, they weren't sampling the tanks with replacement. The information about the tank's serial numbers came when the tank was captured or disabled in combat. Therefore, once observed, the tank can never be observed again. This complicates the pmf for \( M \) somewhat but provides more accurate estimation of \( N \) for a fixed sample than without accepting this complexity.

Nevertheless, it is possible to produce an estimator of \( N \), \( \hat{N} \), that is unbiased for small sample sizes and captures the fact that the sampling is done without replacement. To start simply,

\[ E(M; k) = \sum_{m = 0}^{N} m \text{Pr}(m) \]

or the expectation of the sample maximum, \( M \), given a draw of size \( k \) is simply the sum of the products of \( m \) times \( \text{Pr}(m) \) where \( m \in {1, 2, \ldots} \).

If we recognize that we have observed \( k \) tanks, it is not possible for \( N < k \) because we are sampling without replacement and have seen \( k \) unique tanks, we can split the summation into two parts

\[ E(M; k) = \sum_{m = 0}^{k-1} m \text{Pr}(m) + \sum_{m = k}^{N} m \text{Pr}(m) \]

and since \( Pr(N = m) = 0 \) for \( m < k \), this reduces to

\[ E(M; k) = \sum_{m = k}^{N} m \text{Pr}(m) \]

Since the sampling is without replacement, the pmf of \( M \) is the number of ways to select \( k-1 \) tanks from a set of \( m-1 \) tanks divided by the total number of ways to select \( k \) tanks from the total set of \( N \) tanks. After questions involving cards, this is my least favorite part of math stats, but it appears that this expression is given by

\[ \text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

If we plug this in the the summation above for \( \text{Pr}(m) \), we get

\[ E(M; k) = \sum_{m = k}^{N} m \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

Which, even without expansion of the binomial coefficients into a factorial form, is ugly. And so expansion is exactly what we do

\[ E(M; k) = \sum_{m = k}^{N} m \frac{\frac{(m-1)!}{(k-1)!((m-1)-(k-1))!}}{\binom{N}{k}} \]

And if we bring the \( m \) into the expression

\[ E(M; k) = \sum_{m = k}^{N} \frac{\frac{m!}{(k-1)!((m - k))!}}{\binom{N}{k}} \]

The expansion would be much nicer now if we could somehow put it back into a binomial coefficient. It is close to \( \binom{m}{k} \) but we would need a \( \frac{1}{k} \) multiplier. If we multiply the top and bottom of the expansion by \( k \), we do not change the value but we arrive at

\[ E(M; k) = \sum_{m = k}^{N} \frac{\frac{km!}{k(k-1)!((m - k))!}}{\binom{N}{k}} \]

which can be rewritten as

\[ E(M; k) = \sum_{m = k}^{N} \frac{k\binom{m}{k}}{\binom{N}{k}} \]

As far as the summation is concerned, \( k \) and \( \binom{N}{k} \) are constants and we can rewrite the expression as

\[ E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k}^{N} \binom{m}{k} \]

At first glance, this doesn't look like much of an improvement. However, now we just have the summation of a single binomial coefficient without any multiplication inside the summation.

Recall from earlier, the pmf of \( M \) is

\[ \text{Pr}(M = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

By the definition of a pmf, the sum of \( \text{Pr}(M = m) \) over the support of \( m \) is 1. As the values of \( \text{Pr}(m) \) are 0 for \( m < k \), this becomes

\[ 1 = \sum_{m = k}^{n} \frac{\binom{m-1}{k-1}}{\binom{N}{k}} \]

The \( \frac{1}{\binom{N}{k}} \) can be pulled out giving

\[ 1 = \frac{1}{\binom{N}{k}} \sum_{m = k}^{n} \binom{m-1}{k-1} \]

In order for this expression to be true,

\[ \binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1} \]

Now returning to the expectation, we see something similar with \( \sum_{m = k}^{N} \binom{m}{k} \), however, we lack the “\( -1 \)” bit. If we realize that playing with the indices by adding or subtracting a constant doesn't change the value of the expression as long as we change all the indices, lets rewrite the expectation as

\[ E(M; k) = \frac{k}{\binom{N}{k}} \sum_{m = k+1}^{N+1} \binom{m - 1}{k - 1} \]

we now have an expression in the right form for this trick based on the pmf to apply. Taking relationship \( \binom{N}{k} = \sum_{m = k}^{N} \binom{m-1}{k-1} \) and substituting, we arrive at

\[ E(M; k) = \frac{k}{\binom{N}{k}} \binom{N + 1}{k + 1} \]

No more ugly summations! If we expand the binomial coefficients we get to

\[ E(M; k) = \frac{k \frac{(N + 1)!}{(k+1)!(N-k)!}} {\frac{N!}{k! (N-k)!}} \] \[ E(M; k) = \frac{k(N+1)!k!}{(k+1)!N!} \]

If we realize that \( \frac{k!}{(k+1)!} \) cancels to \( \frac{1}{k+1} \) and that \( \frac{(n+1)!}{n!} \) reduces to \( n+1 \), we can write this as

\[ E(M; k) = \frac{k (N+1)}{k+1} \]

By solving this expression for \( N \),

\[ N = \frac{M k + M}{k} - 1 \]


\[ E(N) = E\bigg(\frac{M k + M}{k} - 1\bigg) = \frac{M k + M}{k} - 1 \]

Note that this can be reduced to

\[ E(N) = M \bigg(1 + \frac{1}{k}\bigg) - 1 \]

for simplicity.

Since this is an R-centric blog after all, lets check the performance of this estimator against using just the sample maximum for a collection of fairly small samples. According to Wikipedia, a reasonable number for \( N \) under the historical context is about 300.

germanTankSim <- function(k, N) {
    tanks <- 1:N
    m <- max(sample(tanks, k))
    nhat <- m * (1 + 1/k) - 1
sampleMaxEstimator <- function(k, N) {
    tanks <- 1:N
    m <- max(sample(tanks, k))

N <- 300
k <- round(seq(0.01, 1, 0.01) * N)
k[k == 0] <- 1  # doesn't make sense with k = 0
tanks <- sapply(rep(k, each = 1000), germanTankSim, N)
tanks <- data.frame(adjustedMax = tanks, k = rep(k, each = 1000))
tanks$sampleMax <- sapply(rep(k, each = 1000), sampleMaxEstimator, N)

If you visualize the corrected estimator's estimate as a function of the number of observed tanks, you see a pretty smooth decline in the variance of the estimate and a relatively unbiased estimator of \( N \). Note that I've used both jitter and alpha blending to make the changes in distribution a little more clear.

ggplot(tanks, aes(x = k)) + geom_jitter(aes(y = adjustedMax), alpha = 0.1, position = position_jitter(width = 1, 
    height = 1)) + geom_hline(aes(yintercept = 300), lty = 2) + scale_x_continuous("Number of Observed Tanks") + 
    scale_y_continuous("Unbiased Estimate of Number of Total Tanks")

Turning towards the sample maximum, it is clearly biased towards lower values for the total number of tanks. Given that the probability for a draw of a very small size, say 10% of the population, to contain the population maximum value, this makes sense. The bias decays with increased numbers of observed tanks and converges to the true value when \( k \sim N \).

ggplot(tanks, aes(x = k)) + geom_jitter(aes(y = sampleMax), alpha = 0.1, position = position_jitter(width = 1, 
    height = 1)) + geom_hline(aes(yintercept = 300), lty = 2) + scale_x_continuous("Number of Observed Tanks") + 
    scale_y_continuous("Sample Max")

If we calculate the mean square error at each value of \( k \) for the two estimators, we see that the adjusted estimator has a slightly more accurate estimate of \( N \) than the raw sample mean.

mse <- function(value, N) {
    mean((value - N)^2)
esd <- aggregate(cbind(tanks$adjustedMax, tanks$sampleMax), by = list(tanks$k), 
    mse, N = 300)
names(esd) <- c("k", "adjustMax", "sampleMax")
esd <- data.frame(k = rep(esd$k, 2), mse = c(esd$adjustMax, esd$sampleMax), 
    estimator = rep(c("Adjusted", "Sample Max"), each = 100))
ggplot(esd, aes(x = k, y = mse, color = estimator)) + geom_line()

While it started out with tanks, this method can be useful for estimating a large number of different maximum values based on sequential serial numbers. In the absence of sales data, estimation based on serial numbers observed in the wild can often provide reasonable estimates of the total number of units. It may also spell out the end of humanity.

It also has somewhat different estimates using Bayesian approaches, specifically a non-finite mean for \( k = 1 \) or \( k = 2 \) without setting a prior limit on the number of units.

And most importantly, it is fun math, statistics and computing.

Stop using bivariate correlations for variable selection

Stop using bivariate correlations for variable selection

Something I've never understood is the widespread calculation and reporting of univariate and bivariate statistics in applied work, especially when it comes to model selection. Bivariate statistics are, at best, useless for multi-variate model selection and, at worst, harmful. Since nearly all questions of interest are highly dimensional, the use of bivariate statistics is worrisome.

What typically happens is a researcher sits down with their statistical software of choice and they compute a correlation between their response variable and their collection of possible predictors. From here, they toss out potential predictors that either have low correlation or for which the correlation is not significant.

The concern here is that it is possible for the correlation between the marginal distributions of the response and a predictor to be almost zero or non-significant and for that predictor to be an important element in the data generating pathway.

Consider the following simulation that illustrates the difference between the marginal relationship between the response and a predictor and the true multi- variable relationship. The function genY generates the response using user provided values for \( \beta \) and adds in a little white noise.

genY <- function(betas, X, stdPercent = 0.25) {
    modelMatrix <- cbind(1, X)
    yhat <- modelMatrix %*% betas
    y <- yhat + rnorm(nrow(yhat), 0, stdPercent * mean(yhat))

The function cor returns the correlation between two values and works with a vector and a matrix argument but cor.test requires two vectors. For this I wrote a simple function sigCor that returns TRUE if a correlation between the vector x and y is significant at a user provided threshold and FALSE otherwise. This function is designed to work with apply to filter our predictor set.

sigCor <- function(x, y, threshold = 0.2) {
    p <- cor.test(x, y)$p.value
    p <= threshold

The main function then generates a random covariance matrix and draws from a multivariate normal centered at 0 with that covariance matrix. The values of the matrix X are then checked for correlation with the response y. If the absolute value of the correlation exceeds some threshold or if the p value for the significance of the correlation is less than some threshold, the variable is kept. Otherwise it is dropped. The function returns the number of retained variables for each selection method.

simulateSelection <- function(nx, n, corLim = 0.2, pLim = 0.2) {
    betas <- rep(1, nx + 1)
    mu <- rep(0, nx)
    sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
    X <- mvrnorm(n, mu, sigma)
    y <- genY(betas, X)
    xcor <- cor(y, X)
    limitedX <- X[, abs(xcor) >= corLim]
    pX <- X[, apply(X, 2, sigCor, y, pLim)]
    nCoef <- c(ncol(X) + 1, ncol(limitedX) + 1, ncol(pX) + 1)

If we keep the defaults of corLim = 0.2 and pLim = 0.2 (keep any correlation where the absolute value exceeds 0.2 and any correlation with a p value less than 0.2) and use replicate to run the function 1000 times with 10 predictors and \( n = 1000 \), we can calculate the probability that the different method selects the correct model. (Note that we set all of values of betas to 1, so all 11 coefficient estimates should be non-zero and all of the predictors are correlated with the outcome).

results <- replicate(1000, (simulateSelection(10, 1000)))
correctModelFraction <- apply(results == 11, 1, mean)
## [1] 1.000 0.026 0.573

Using a filter based on the size of the correlation only yields the correct model 2.6% of the time. Using a relatively high p value as a threshold does better returning the correct model 57.3% of the time.

We can explore the effect of varying the value of pLim using a few simple modifications to the above code.

simulateSelectionP <- function(nx, n, pLim) {
    betas <- rep(1, nx + 1)
    mu <- rep(0, nx)
    sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
    X <- mvrnorm(n, mu, sigma)
    y <- genY(betas, X)
    pX <- X[, apply(X, 2, sigCor, y, pLim)]
    correct <- (ncol(pX)) == nx
varyP <- function(nx, n, p) {
    fractionCorrect <- numeric(100 * length(p))
    j <- 1
    for (i in 1:length(p)) {
        fractionCorrect[(j):(j + 99)] <- replicate(100, simulateSelectionP(nx, 
            n, p[i]))
        j <- j + 100

Using these new functions, we can calculate the fraction of the time that a threshold of p yields the correct model. If we vary p from \( 0.01--0.50 \) and plot the resulting curve,

prCorrect <- varyP(10, 1000, seq(0.01, 0.5, length.out = 20))
results <- data.frame(prCorrect = prCorrect, limit = rep(seq(0.01, 0.5, length.out = 20), 
    each = 100))
meanByLimit <- aggregate(results$prCorrect, list(results$limit), mean)
sdByLimit <- aggregate(results$prCorrect, list(results$limit), sd)
meanByLimit$sd <- sdByLimit$x
meanByLimit$upper <- meanByLimit$x + qnorm(0.975) * meanByLimit$sd/sqrt(100)
meanByLimit$lower <- meanByLimit$x - qnorm(0.975) * meanByLimit$sd/sqrt(100)
ggplot(meanByLimit, aes(x = Group.1, y = x, ymin = lower, ymax = upper)) + geom_line() + 
    geom_ribbon(alpha = 0.25) + theme_bw() + scale_x_continuous("Threshold for p value") + 
    scale_y_continuous("Fraction of models in simulation that are correct")

we see that, as expected, the probability of getting the correct model goes up as we increase the threshold value. We have about a 75% shot at getting the right model using this method when we set the limit on the p value to be 0.50 compared to under 30% at a p value of 0.01.

However, this raises an interesting question: if we increase the threshold for the p value, what is the value of using the bivariate correlation at all?

randomRate <- function(n, pLim) {
    x <- rnorm(n)
    y <- rnorm(n)
    sigCor(x, y, pLim)
p <- seq(0.01, 0.5, length.out = 20)
rate <- numeric(length(p))
for (i in 1:length(p)) {
    rate[i] <- mean(replicate(1000, randomRate(1000, p[i])))
rbind(p, rate)
##       [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
## p    0.010 0.03579 0.06158 0.08737 0.1132 0.1389 0.1647 0.1905 0.2163
## rate 0.009 0.03900 0.06400 0.10100 0.1110 0.1290 0.1660 0.1870 0.2210
##       [,10]  [,11]  [,12]  [,13]  [,14]  [,15]  [,16]  [,17]  [,18]  [,19]
## p    0.2421 0.2679 0.2937 0.3195 0.3453 0.3711 0.3968 0.4226 0.4484 0.4742
## rate 0.2250 0.2920 0.3180 0.3180 0.3210 0.4010 0.3720 0.4560 0.4130 0.4660
##      [,20]
## p      0.5
## rate   0.5

As the threshold increases, the value of the bivariate correlation as a filter goes down and goes down fast. At the limit required for there to be a 50% chance of the model containing all relevant predictors, the rate of improperly included variables would go up to nearly 20%.

The real problem comes into play here — the bivariate comparsions selects for the wrong variables by over-emphasizing the relationship between the marginal distributions. Because the selected model is incomplete and important variables are omitted, the resulting parameter estimates are biased and inaccurate.

The bivariate comparsion is a terrible way to select relevant variables for a highly dimensional model as the function of interest is relating the all of the predictors to the outcome. It can neither rule in nor rule out a predictor from a model. A better approach is to generate the model using domain knowledge and a model of the expected data generating process. If the process is to remain strictly data driven, methods like cross-validation or AIC/BIC provide better measures of model quality and predictor importance than the bivariate correlation coefficient.

Bayesian Search Models

Bayesian Search Theory

The US had a pretty big problem on their hands in 1966. Two planes had hit each other during a in-flight refueling and crashed. Normally, this would be an unfortunate thing and terrible for the families of those involved in the crash but otherwise fairly limited in importance. However, in this case, the plane being refueled was carrying four hydrogen bombs.

Of the four bombs, only three were found on land. Two had their non-nuclear payload explode and the third was intact and embedded in a river bank. As a result, the fourth bomb was thought to be in the sea. A local man reported that he saw the bomb enter the water and that was basically all of the information that was available to find the bomb.

The Navy started a search using Bayesian search theory. Bayesian search theory recognizes that there are two factors that contribute to the probability of finding a lost item in a given location: the probability that the object is in a given location and the probability of locating the object given that it is in the searched location. When looking for objects in difficult places, such as under water or in rough terrain, it may be very difficult to find an object even if it exists.

Bayesian search theory starts by by expressing the probability that an object \( O \) will be found in a location \( (x, y) \) is as by the product \( P(O) \) and \( P(D | O) \) where \( D \) is the event of detection of the object \( O \).

This makes sense intuitively. For instance, the probability of me finding a scarp of paper is smaller in my office which is overflowing with scraps of paper than in my kitchen despite the fact that my office is much more likely to contain the scrap of paper.

To use Bayesian search theory to find an object, you would calculate the probability that the object will be found in each location on the basis of the detection probability and the prior probability. You would then search first in the location most likely to result in you successfully finding the object. Assuming you don't find the object in that location, you can update the probability of finding that object by iteratively updating the probability with Bayes theorem.

To make this concrete, suppose we have a particle in a 2D box. Let the box be 61 units by 61 units wide and divided into 3721 “cells” of 1 square unit. Suppose that during each time-step, the particle draws a move for each x and y from a multivariate normal centered at zero with variance of 1 and covariate of 0. (Although the grid is divided into boxes, the particle is not limited to discrete values). After \( t \) time steps, the pdf for the location of the particle is simply \( N(\mathbf{\mu}, t\mathbf{\Sigma}) \) as it is the result of the summation of a set of independent normal RV. We will transform this to the the grid instead of using contours. If we let the particle “walk” for 100 time steps, the resulting pmf/pdf looks like

d <- data.frame(x = rep(seq(-30, 30), each = 61), y = rep(seq(-30, 30), times = 61))
d$PrP <- dnorm(d$x, 0, sqrt(100)) * dnorm(d$y, 0, sqrt(100))
ggplot(d, aes(x = x, y = y, z = PrP)) + geom_point(aes(alpha = PrP)) + stat_contour()

As you might expect as \( N(0, 100) \) distribution is fairly flat.

Now lets come up with a model for \( P(D | O) \). In the case of the Navy's search, this was a function of depth in that grid location. For simplicity, I'm going to code detection probability as a function of distance from a given point. I placed a detector station at \( (10, 10) \) and in that grid square the detector has a 97.5% chance of detecting the particle if it is there. The detection probability decays with distance. For this, I simply modeled as it \( 0.975 * 0.925^d \) where \( d \) is the shortest linear distance to the detector.

The detection power then looks something like

detectionPower <- function(x, y, dx = 10, dy = 5, p0 = 0.975, d = 0.925) {
    x2 <- x - dx
    y2 <- y - dx
    r <- sqrt(x2^2 + y2^2)
    power <- p0 * d^r
d$PrD <- detectionPower(d$x, d$y)
ggplot(d, aes(x = x, y = y, z = PrD)) + geom_point(aes(alpha = PrD)) + stat_contour(binwidth = 0.1)

The product of these two functions is proportional to the probability that a single search of an grid will find the particle. This is shown below

d$valueOfSearch <- d$PrP * d$PrD
ggplot(d, aes(x = x, y = y, z = d$valueOfSearch)) + geom_point(aes(alpha = d$valueOfSearch)) + 

Putting all three graphics together

nd <- data.frame(x = rep(d$x, 3), y = rep(d$y, 3), value = c(d$PrP, d$PrD, d$valueOfSearch), 
    metric = gl(3, nrow(d), labels = c("PDF of Object", "Detection Prob", "Value of Search")))
ggplot(nd, aes(x = x, y = y, z = value)) + stat_contour() + facet_grid(. ~ metric) + 
    scale_x_continuous(limits = c(-30, 30)) + scale_y_continuous(limits = c(-30, 

it is clear that the optimal areas to search are actually pulled slightly away from the most probable areas because those areas are less likely to be fruitful given the reduced values of \( P(D | O) \).

Suppose in a given time unit, 100 areas can be searched using the allotted budget and the particle is no longer moving. On the first day, it would be optimal to search the 100 most likely spots

If the object is not found in the searched spots, the probability of finding the object in that spot, by application of Bayes Theorem, becomes

\[ P(O) = \frac{P(O)_0 (1 - P(D | O))}{(1 - P(O)_0 + P(O)_0 (1 - P(D|O)))} \]

And for the locations not searched, their probability is also revised upwards (just like when the Monty Hall opens a door) according to

\[ P(O) = \frac{P(O)_0}{(1 - P(O)_0 + P(O)_0 (1 - P(D|O)))} \]

Suppose that we search the first 100 locations, what does the updated search value look like?

bayesUpdate <- function(searched, p0, pD) {
    (p0 * (1 - searched * pD))/(1 - p0 + p0 * (1 - pD))
d$searched <- rank(-1 * d$valueOfSearch) <= 100
d$newSearchValue <- bayesUpdate(d$searched, d$PrP, d$PrD)
nd <- data.frame(x = rep(d$x, 2), y = rep(d$y, 2), valueOfSearch = c(d$valueOfSearch, 
    d$newSearchValue), searched = rep(d$searched, 2), search = rep(c("Before Any Searching", 
    "First Wave"), each = nrow(d)))
nd$searched[nd$searched == FALSE] <- NA
ggplot(nd, aes(x = x, y = y, z = valueOfSearch)) + stat_contour() + facet_grid(. ~ 
    search) + geom_point(aes(color = searched, alpha = valueOfSearch))

The searched area becomes less probable than before, however, it does not go to zero after a single search. Additionally, points with higher prior probabilities of having the object remain relatively likely depsite having being searched. Furthermore, the whole distribution moved as a result of not finding the object in the first 100 locations.

Using these new probabilities, you could update your search path and search the next 100 locations and so on until either the object is found or the probability of ever finding it in the search area is nearly zero.

searchCount <- rep(0, nrow(d))
probInSearchArea <- numeric(1000)
probFindingInGrid <- numeric(1000)
p0 <- d$PrP * d$PrD
pD <- d$PrD
for (i in 1:1000) {
    searchLocations <- rank(-1 * p0) <= 100
    searchCount <- searchCount + searchLocations
    probInSearchArea[i] <- sum(p0[searchLocations])
    probFindingInGrid[i] <- sum(p0)
    p0 <- bayesUpdate(searchLocations, p0, pD)
nSearches <- data.frame(x = d$x, y = d$y, count = searchCount)
ggplot(nSearches, aes(x = x, y = y, z = count)) + stat_contour() + geom_point(aes(alpha = count))

This result makes sense. The areas most distant from the detection center required the most searching as the areas closer can be more quickly ruled out due to the greater values of \( P(D|O) \).

Since each grid location is 1 square unit, the sum of the probabilities for each unit is approximately equal to the integral of the pdf over the 31 x 31 search area. Exploiting this, we can easily calculate the probability of finding the object with the sum of the probabilities at each point. (This approximates the integral fairly well as the grid is 1 unit square which is fairly small compared to the variance and total grid size).

searchValue <- data.frame(searchNumber = 1:1000, marginalvalue = probInSearchArea, 
    cumulativeValue = probFindingInGrid)
ggplot(searchValue, aes(x = searchNumber)) + geom_line(aes(y = marginalvalue)) + 
    geom_line(aes(y = probFindingInGrid), lty = 2) + scale_y_continuous("Probability of Finding the Object") + 
    scale_x_continuous("Number of Unsuccessful Searches")

The probability of finding the object in the next search or within the considered grid decreases as the number of failed searches increases. This is one of the greatest features of Bayesian search theory: by estimating the probability of ever finding the object given the search area and previous efforts, you can determine when searching is no longer economically (or otherwise) viable.

Additionally, by adjusting resources according to new information, such as the failure to find the object in a searched sector, Bayesian methods reduce the time required to find an object, especially when the \( P(D | O) \) is highly variable. With some assumptions in the modeling, a Bayesian search model found the end of a random walk in fewer steps than a search model based only on the pdf of the random walk or a model based on the product of the random walk pdf and the detection probability. You can see the complete code for these simulations here.

By combining the relevant information from prior searching and the likelihood of a successful search given the object is located in the area, Bayesian search theory provides an method for the optimal deployment of search resources. Now, off to figure out where I put my iPad.

Older Newer