The Problem with Propensity Scores

Are Propensity Scores Useful?

Effect estimation for treatments using observation data isn't always straight forward. For example, it is very common that patients who are treated with a certain medication or procedure are healthier than those who are not treated. Those who aren't treated may not be treated due to a higher risk of treatment in their condition or a reduced expected value of treatment (e.g., a statin does not make sense as a new treatment for a patient who is 100 years old due to other causes of mortality). If you simply compared the the difference between the means of some response among the treated and untreated populations, the estimate would be confounded by the differences between the populations.

One method of dealing with this problem was to match on the observed characteristics thought to be relevant. Studies would find a control patient that matched the case patient on age, gender and a number of other factors. The problem with this approach is that it is expensive and impractical to match on anything more than a few factors due to the curse of dimensionality. To avoid “throwing away” unmatched cases, the number of required candidate controls to be sampled increases exponentially with the number of matched factors.

Propensity scores are a possible alternative to this problem. Instead of matching points either directly or using a nearest-neighbor method in \(k\) dimensions, a model can be constructed to express the propensity (e.g., probability) to be treated as a function of the observed variables. Then, matching can be done in 1 dimension with this propensity. By reducing the \(k\) dimensional problem to 1 dimension, it is possible to increase the number of matched factors and to also use the collected data more effectively. The analysis of the matched groups can be done using a relatively simple method, such as a t test.

While much can be said for reducing the complexity of the analysis to a simple bivariate test and for the potential relatively greater ease for modeling treatment choice as opposed to outcome, the value of a PS-adjusted bivariate test isn't clear to me. To explore the question in more detail, I did a few simulations in R to model how well a OLS estimator and a PS adjusted estimator did at estimating a treatment effect under the “best case” conditions.

Simulation

Propensity scores are typically estimated by fitting a binomial GLM with treatment status (0 or 1) as the outcome and all other covariates being predictors. Lets generate a set of 10 predictors, treatment and the response variable. The predictors will all be iid normal and will be the only non-white noise elements in the determination of treatment status. I selected an error term with a standard deviation of 5 empirically.

set.seed(222)
X <- replicate(10, rnorm(1000))
error <- rnorm(1000, 0, 5)
probTreatment <- exp(cbind(1, X) %*% rep(1, 11) + error) / 
  (1 + exp(cbind(1, X) %*% rep(1, 11) + error))

The vector probTreatment gives us \(\text{Pr}(T = 1 | x_1, x_2, \ldots, x_{10})\). If we use rbinom, we can randomly assign treatment based on the different probabilities:

t <- rbinom(1000, 1, probTreatment)

Now that we have treatment assigned, lets generate the response variable.

y <- 1 + X %*% rep(1, 10) + t + rnorm(1000, 0, 5)

As mentioned above, simply comparing the values of \(y\) for the treated and untreated groups won't produce the right estimates.

t.test(y[t == 1], y[t == 0])
## 
##  Welch Two Sample t-test
## 
## data:  y[t == 1] and y[t == 0]
## t = 8.8739, df = 945.607, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.421779 3.797091
## sample estimates:
##  mean of x  mean of y 
##  2.7115980 -0.3978372

The estimated treatment effect is 3.11, a 310.94% over estimate of the true value of 1.

However, if we estimate a propensity score model and match on the nearest unmatched control subject to the difference between the matches being less than 20% of the standard deviation of the pooled propensity scores, we get a better estimate. First, the model:

ps <- glm(t ~ X, family = binomial())
psValues <- predict(ps)
psTreated <- psValues[t == 1]
psUntreated <- psValues[t == 0]

The next step is to match subject to the 20% of the standard deviation limit:

# find matches, require be within 20% of a sd of the untransformed pooled ps
matches <- matrix(NA, ncol = 3, nrow = NROW(y[t == 1]))
limit <- 0.2 * sd(predict(ps))
for (i in 1:NROW(y[t == 1])) {
  x <- psTreated[i]
  distance <- (x - psUntreated)^2
  if (min(distance) <= limit) {
    nn <- which.min(distance)
    psUntreated[nn] <- .Machine$integer.max
    matches[i, ] <- c(i, nn, distance[nn]) 
  } else {
    matches[i, ] <- c(i, NA, NA)
  }
}

The loop takes each treated patient and searches for an unmatched untreated patient near the treated patient's propensity score and within the constraint. Once matched, the propensity score for the untreated patient is set to .Machine$integer.max to prevent further matches to the same patient (setting the value to NA introduces problems and .Machine$integer.max is safe enough). If there is no possible match subject to the constraints, the treated patient is matched to NA to reflect this fact.

Using the matches, we can construct two new vectors, yTreatedMatched and yUntreatedMatched for the t test. Based on propensity score theory, and true in this simulation, conditional on the observed characteristics the assignment to treatment or not treatment is random. As a result, these two groups are similar to what you might see in an RCT (in theory) and so the differences should be an unbiased and accurate estimate of the treatment effect.

# throw out the unmatched patients
matches <- matches[!is.na(matches[, 2]), ]
yTreatedMatched <- y[t == 1][matches[, 1]]
yUntreatedMatched <- y[t == 0][matches[, 2]]
t.test(yTreatedMatched, yUntreatedMatched)
## 
##  Welch Two Sample t-test
## 
## data:  yTreatedMatched and yUntreatedMatched
## t = 2.2593, df = 654.727, p-value = 0.0242
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1224774 1.7492684
## sample estimates:
## mean of x mean of y 
## 1.4946742 0.5588013

And, as expected, the propensity score matched estimator produces an accurate measure of the treatment effect, 0.94, compared to a true effect of 1. The result is also statistically significantly different from zero.

However, it required that we threw away data. We “collected” data on 565 treated patients but with 435 untreated patients, we were only able to match 330, or 58.41%.

Additionally, had we instead used a OLS regression model, we could have produced similar estimates.

model <- lm(y ~ X + t)
model
## 
## Call:
## lm(formula = y ~ X + t)
## 
## Coefficients:
## (Intercept)           X1           X2           X3           X4  
##      1.1242       1.0259       0.7914       0.5494       0.8666  
##          X5           X6           X7           X8           X9  
##      0.7482       0.8879       0.9435       1.2965       0.8275  
##         X10            t  
##      0.6487       0.7250

Moreover, the 95% CI produced using the OLS method is smaller ranging from 0.05 to 1.4 compared to the t test CI ranging from 0.12 to 1.75.

The next step is to see how this holds up for a variety of values of \(n\) and sampling ratios (e.g., 2 controls per case vs 1 control per case). To this end, the function psSim simulates a propensity score adjusted model. It works by constructing a large population of size n with k covariates and 1 treatment variable. From this large population, sampleN cases are selected at random and ratio * sampleN controls are selected. (I'm using case and control here to mean treated vs untreated, not in the strict sense of a [case-control study] (http://en.wikipedia.org/wiki/Case-control_study)).

From these sampled cases and controls, a propensity score model is estimated and the treatment effect is estimated using the matched populations. The function returns a 2 element vector, the first element is the estimated treatment effect and the second is the p value from the t test.

The function is below:

psSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors to the propensity score model
  pErr <- rnorm(n, 0, 5)  
  pt <- exp(1 + X %*% rep(1, k) + pErr)/(1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)
  
  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr
  
  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))
  
  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]), 
                                           ratio * sampleN), ]
  
  # estimate propensity scores using all k X predictors
  sample <- rbind(treatedSample, untreatedSample)
  ps <- glm(t ~ ., data = sample[-1], 
            family = binomial())
  # find untransformed fitted values
  psTreated <- predict(ps)[sample$t == 1]
  psUntreated <- predict(ps)[sample$t == 0]
  
  # find matches, require be within 20% of a sd of the untransformed pooled ps
  matches <- matrix(NA, ncol = 3, nrow = nrow(treatedSample))
  limit <- 0.2 * sd(predict(ps))
  for (i in 1:nrow(treatedSample)) {
    x <- psTreated[i]
    distance <- (x - psUntreated)^2
    if (min(distance) <= limit) {
      nn <- which.min(distance)
      psUntreated[nn] <- .Machine$integer.max
      matches[i, ] <- c(i, nn, distance[nn]) 
    } else {
      matches[i, ] <- c(i, NA, NA)
    }
  }
  
  # do analysis
  treatedY <- treatedSample[, 1]
  untreatedY <- untreatedSample[, 1]
  matches <- matches[!is.na(matches[, 2]), ]
  matchedTreatedY <- treatedY[matches[, 1]]
  matchedUntreatedY <- untreatedY[matches[, 2]]
  test <- t.test(matchedTreatedY, matchedUntreatedY)
  results <- c(mean(matchedTreatedY - matchedUntreatedY), test$p.value)
  results
}

I also wrote a second function that generates the population and sample in the same way but instead of using propensity score matching, the treatment effect is estimated directly with OLS. This function is called lmSim:

lmSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors to the propensity score model
  pErr <- rnorm(n, 0, 5)  
  pt <- exp(1 + X %*% rep(1, k) + pErr)/(1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)
  
  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr
  
  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))
  
  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]), 
                                           ratio * sampleN), ]
  sampleData <- rbind(treatedSample, untreatedSample)
  # fit OLS model
  model <- summary(lm(y ~ ., data = sampleData))
  results <- c(model$coefficients[2, 1], model$coefficients[2, 4])
  results
}

Using these two functions, I ran each estimator through 100 iterations for sample sizes of 50 to 1000 cases and control sets of 1 to 2 times larger. If you are running this at home, it takes a while. There is a second post that discusses running this code in parallel (using the parallel package) that also does a little optimization to the code as well (both technically and statistically) that is worth checking out, otherwise, be ready to wait a while (about 45-60 minutes on 3.0 GHz CPU).

psMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
psP <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsP <- matrix(NA, nrow = 100 * 3, ncol = 100)
i <- 1
for (n in seq(50, 1000, length.out = 100)) {
  for (ratio in c(1, 1.5, 2)) {
    psResults <- replicate(100, psSim(10000, 10, n, ratio))
    lmResults <- replicate(100, lmSim(10000, 10, n, ratio))
    psMean[i, ] <- psResults[1, ]
    psP[i, ] <- psResults[2, ]
    lsMean[i, ] <- lmResults[1, ]
    lsP[i, ] <- lmResults[2, ]
    i <- i + 1
  }
}

And some data munging to get it into a reasonable format for plotting:

results <- data.frame(estimatedEffect = c(as.numeric(psMean), 
                                          as.numeric(lsMean)))
results$ratio <- c(1, 1.5, 2)
results$n <- rep(seq(50, 1000, length.out = 100), each = 3)
results$method <- rep(c("PS", "LS"), each = 300 * 100)
results$p <- c(as.numeric(psP), as.numeric(lsP))
results$sig <- results$p <= 0.05

Results

If we plot the estimate against the sample size (of cases):

ggplot(results, aes(x = n, y = estimatedEffect)) + 
  geom_point(aes(color = method), alpha = 0.25) + 
  facet_grid(ratio ~ method) + 
  scale_y_continuous("Estimated Treatment Effect") + 
  geom_hline(aes(yintercept = 1))

The code produces a 3x3 grid of plots, the left plots are the estimates from OLS and the right from from the propensity score matched t test. The rows are the different ratios of control to case sampling (1, 1.5 and 2).

It is clear that both estimators produce reasonably unbiased estimates of the true treatment effect of 1. Furthermore, both appear to converge to the true value of 1 as the sample size goes to infinity.

However, it looks like the range of the estimates of the OLS estimator may be smaller at a given value of \(n\) than the propensity scores estimator. To this end, I calculate the mean squared error (MSE) of the estimate from the true value of 1 for both methods at each sample size.

# from dplyr
mse <- summarise(group_by(results, method, n, ratio), mean((estimatedEffect - 1)^2))
names(mse) <- c("method", "n", "ratio", "mse")

When plotted

ggplot(mse, aes(x = n, y = mse, color = method)) + 
  geom_point() + 
  facet_grid(ratio ~ .) + 
  scale_x_continuous("N") + 
  scale_y_continuous("MSE")

and especially when just a smoother is shown, not the actual points,

ggplot(mse, aes(x = n, y = mse, color = method)) + 
  geom_smooth(method = "loess") + 
  facet_grid(ratio ~ .) + 
  scale_x_continuous("N") + 
  scale_y_continuous("MSE")

it becomes clear that the MSE of a regression method is smaller than the typical MSE of the propensity score method for a fixed \(k\) as \(n\) varies.

Unsurprisingly, the higher MSE also shows up in the power testing. If we look at the p values of the tests:

ggplot(results, aes(x = n, y = p)) + 
  geom_point(aes(color = method), alpha = 0.25) + facet_grid(ratio ~ method) + 
  scale_y_continuous("Observed p value") + geom_hline(aes(yintercept = 0.05))

As we known that the true treatment effect is 1, we should expect fairly small p values. Both modeling frameworks appear to give similar results, although, as with the estimate and MSE plots above, relative differences in performance emerge as we consider the power curve.

If we formalize this with an estimated power curve:

power <- group_by(results, method, n, ratio) %>%
  summarize(power = mean(sig))
ggplot(power, aes(x = n, y = power, color = method)) + 
  geom_point(aes(color = method)) + facet_grid(ratio ~ method) + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("N") + 
  scale_y_continuous("Power")

And again but with just the LOESS fit and no points:

ggplot(power, aes(x = n, y = power * 100, color = method)) + 
  facet_grid(ratio ~ .) + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("N") + 
  scale_y_continuous("Power")

We see a relative difference between the two methods with the difference most pronounced at larger values of \(k\).

ps <- power[power$method == "PS", ]
ols <- power[power$method == "LS", ]
diffPower <- data.frame(ratio = ps$ratio,
                        diff = ols$power - ps$power,
                        n = ps$n)
diffPower$`ls more powerful` <- diffPower$diff > 0
ggplot(diffPower, aes(x = n, y = diff * 100)) +
  geom_point(aes(color = `ls more powerful`)) + 
  facet_grid(ratio ~ .) + 
  geom_smooth(method = "loess", alpha = 0) + 
  scale_x_continuous("N") + 
  scale_y_continuous("Percentage Point Difference in Power")

In general, propensity score methods for a fixed \(k\) over a varied \(n\) are less powerful than regression methods at the same values of \(n\) and \(k\). These differences become small as \(n\) gets very large, say 500-750, but for smaller values of \(n\) they can make a meaningful difference. A 10-20 percentage point increase in power by simply using regression over matching is hard to turn down.

The benefits of using linear models becomes more clear when you look at the variance of \(k\). With the same \(n\) and increasing \(k\), the linear model becomes increasingly powerful, to as much as 10 percentage points greater power. When the ratio of cases and controls gets larger, the power of regression is clearer than ever.

There are other problems with propensity scores. The propensity score models in this simulation were “perfect.” Conditional on all the observed variables, treatment choice was random, as our model was an exact restatement of the data generating process. This is unlikely to be the case in reality. Few, if any, papers evaluate the performance of their propensity score model. Many don't even provide information about what went into the model and the model selection, much less a cross-validation MSE.

They offer no improvement over regression in terms of unobserved heterogeneity. The scores are functionally identical to doing a regression and so suffer from all the relevant drawbacks.

Finally, propensity scores are computationally more expensive than regression. The file k.csv contains timings from a run of the propensity score and regression methods on similar data sets with a fixed value of \(n = 1000\) and \(k\) ranging from 10 to 100 done with 100 replications at each combination on my Linode VPS.

timings <- read.csv("/home/jacob/documents/blog/ps/k.csv")
# have to remove an outlier
timings <- timings[timings$k != 10, ]
timings$method <- ifelse(timings$m == 0, "LS", "PS")
ggplot(timings, aes(x = k, y = t, color = method)) + geom_point() + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("K") + 
  scale_y_continuous("Time / s")

Regression is nearly always much faster than propensity score matching. More importantly, as \(k\) increases with a fixed \(n\), the cost of regression increases approximately linearly. The non-linear increase with respect to \(k\) is scary — a model that estimates and then matches on 500 predictors could take hours to fix, or seconds with a OLS model.

This makes sense, lm() and similar functions hand off the computation to highly optimized numerical linear algebra routines. No matching function, even if written in a lower level language like C or C++ via Rcpp will be nowhere near as fast and optimized as LINPACK/LAPACK/BLAS.

It would be hard, if not impossible, to match a propensity score method as fast as regression for two reasons. The propensity score method already does the regression in order to estimate the propensity scores. The matching then adds on an additional and greater cost.

Take-away

This added cost is just that, added cost. It doesn't provide a power benefit (if anything, it comes with a power penalty), computational benefit or reduction in complexity. While PS methods work, there is practically no circumstance in which they work and regression methods do not work at least as well.

Tags: statistics r regression propensity-scores

Related Posts