# Inter-ocular trauma test

I’ve recently been thinking about the role statistics can play in answering questions. I think the idea came up on the NSSD podcast a few weeks ago. Basically, problems can be divided into three classes:

- those that don’t need statistics because the answer is obvious (problems with little confounding and a strong signal-to-noise ratio)
- those that need statistics because there is noise and/or confounding
- those that can’t be helped by statistics because the noise or confounding is too large, the signal is too small, or the question is fundamentally unanswerable with statistics

I think this is a useful idea, and I’ve been applying a similar filter for years under the name of the “inter-ocular trauma test.” (I learned the term in my intro biostatistics class; it traces back to a talk by Joe Berkson in the 1950s or 60s.) The idea is that sometimes a result is so obvious that it hits you right between the eyes, hence causing inter-ocular trauma.

I like this idea and often use it as a crude filter when looking at visualizations or casually reading an analysis. Careful statistical analysis is often required to detect a signal at all, and a formal analysis is always needed to confirm what the inter-ocular trauma test suggests, but that doesn’t stop the test from being useful all the same.

One of the reasons I really like it is how simple and easy it is to understand. For example, where I live (Iowa City, Iowa) there is a small “town” of about 1,000 people located within the larger city. This small town is famous for being a speed trap: a major road enters and passes through the town, the speed limit drops from 35 mph to 25 mph, and the road runs downhill. The town employs five police officers (about twice what would be expected for a city of its size) and issues a lot of speeding tickets.

A recent article came out in the local paper saying the town issued more tickets than other areas. The reporter compared the number of tickets issued in the past year against the population of each town.

The town in question, University Heights, is located entirely within Iowa City. When asked about the difference, the University Heights police chief said that the comparison isn’t fair because the town contains a major thoroughfare and lots of people tailgate in the area during home football games. (Never mind that the stadium is in Iowa City and the only way to enter University Heights is to travel through Iowa City for several miles.)

The chief also argued that normalizing by population isn’t fair because the town has so many commuters. However, University Heights also issued the greatest raw number of tickets (907, versus 617, 802, and 366 in the comparison cities), so that argument doesn’t really hold either. At roughly 907 tickets for a town of about 1,000 residents, that is nearly one ticket per resident per year: no formal analysis needed, the result hits you right between the eyes.

```r
# NB: this picks up mid-script; the opening of pfit and its inner function
# fit, where dataGen builds the data frame and the linear estimate is
# computed, is not shown
  # propensity score estimate of the treatment effect
  ps <- Match(Y = data$outcome, Tr = data$treatment, X = model,
              ties = FALSE)$est

  # heckman mle
  hmle <- selection(data$t ~ cut(data$age, breaks = 20),
                    list(data$outcome ~ 1, data$outcome ~ 1))
  hmle <- coef(hmle)[24] - coef(hmle)[21]

  # results
  return(c(linear, ps, hmle))
  }
  results <- replicate(iter, fit(r, n))
}
```

Basically, pfit is passed arguments r and n, where r is the correlation coefficient between the error in the selection model and the error in the outcome model. At this time, I’m forcing the error variances for both outcome models, and the correlations between those errors and the selection model’s error, to be equal, although in practice this need not be the case. As usual, n is the number of observations to generate. fit then uses dataGen to generate n observations with correlation r and estimates a linear model of the outcome on treatment, a propensity score estimate of the treatment effect (using Match from the Matching package), and the Heckman selection model (aka Tobit-5) using the selection function from the sampleSelection package (as opposed to doing it as a two-stage estimate by hand). The last line calls fit iter times using replicate and returns the results as a matrix.

To work with parLapply, I wrote a second wrapper function, parSim.

```r
parSim <- function(cl, iter, r, n) {
  require(parallel)
  nw <- length(cl)
  results <- simplify2array(do.call(list,
    parLapply(cl, rep(iter / nw, nw), fun = pfit, r = r, n = n)))
  results
}
```

This function takes a cluster cl, divides the iter iterations over the number of workers in cl, and runs pfit the correct number of times on each worker using parLapply. The simplify2array call is simply to make the data easier to manipulate. The results returned by parSim form an array with each row being an estimator, each column being an iteration and each “slice” coming from one of the workers.
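
The code above assumes a cluster object cl already exists; the original setup isn’t shown, so here is a minimal sketch of what it might look like, with the worker count and the exported objects being my assumptions:

```r
library(parallel)

# hypothetical setup: a local cluster with 4 workers
cl <- makeCluster(4)

# each worker needs the simulation functions and the fitting packages
clusterExport(cl, c("pfit", "dataGen"))
invisible(clusterEvalQ(cl, {
  library(Matching)        # provides Match()
  library(sampleSelection) # provides selection()
}))
```

With a cluster like this, parSim(cl, iter = 100, r = 0.5, n = 2000) splits the 100 iterations evenly across the four workers.
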
I generated 100 replicates at each of the 199 correlation coefficient values between $$-0.99$$ and $$0.99$$ (steps of $$0.01$$). While parSim does give about a 3-fold speed-up when the cluster is of size 4, it still takes about 60-90 minutes to run for all of the values of $$\rho$$ tested.

```r
results <- NULL
s <- Sys.time()
for (r in seq(-0.99, 0.99, 0.01)) {
  print(r)
  run <- parSim(cl, iter = 100, r = r, n = 2000)
  run <- data.frame(estimator = rep(c("LS", "PS", "HML"), 100),
                    estimate = as.numeric(run),
                    r = r,
                    n = 2000)
  results <- rbind(results, run)
}
e <- Sys.time()
```

## Analysis

First things first: see how the estimators perform as a function of the set value of $$\rho$$:

```r
ggplot(results, aes(x = r, y = estimate, color = estimator)) +
  geom_point(alpha = 0.25) +
  scale_x_continuous("Correlation") +
  scale_y_continuous("Estimate (True Value = 1)")
```

Generally, there are no big surprises here: neither the regression nor the propensity score model accounts for the non-zero correlation, and the resulting omitted variable bias has the expected signs. When the correlation is zero or nearly zero, all of the models perform similarly, with estimates of about 1. However, as the correlation moves away from zero, this is no longer the case. As the correlation between the errors in the selection and outcome equations increases, the effect is over-estimated (by 100% by the propensity scores and nearly 50% by the regression). The same pattern is seen as the correlation goes to $$-1$$, with the propensity score, in extreme cases, even getting the sign wrong. However, the red smear reflecting the Heckman-style selection model estimates the treatment effect at nearly 1 with no sensitivity to the changing correlation.

The problems with the linear regression model and the PS model are even more clear when plotting the mean and the mean squared error at each correlation level:

```r
performance <- group_by(results, estimator, r) %>%
  summarize(mean = mean(estimate),
            lb = quantile(estimate, 0.025),
            ub = quantile(estimate, 0.975),
            mse = mean((1 - estimate)^2))

ggplot(performance, aes(x = r, y = mean, ymin = lb, ymax = ub,
                        color = estimator)) +
  geom_point() +
  geom_ribbon(alpha = 0.10, aes(fill = estimator)) +
  scale_x_continuous("Correlation") +
  scale_y_continuous("Mean Estimated Treatment Effect")
```

The mean estimated effect is denoted by the dots, with a 95% interval (the 2.5th to 97.5th percentiles of the simulated estimates) shown as the shaded region. The spread of the PS estimates relative to the two regression-style models is clear, as is the very high sensitivity of the propensity score estimates to endogeneity in treatment. The linear regression model exhibits a narrower interval but suffers from bias as the correlation moves away from zero, if not to the same degree as the propensity scores. The mean of the Heckman-style selection estimates barely deviates from the true value of $$1$$. The same information is visible in terms of the mean squared error.

```r
ggplot(performance, aes(x = r, color = estimator, y = mse)) +
  geom_point() +
  geom_smooth(method = "loess") +
  scale_x_continuous("Correlation") +
  scale_y_continuous("Mean Squared Error")
```

The MSE for the propensity score and least squares models both exhibit the expected parabolic shape with a global minimum at a correlation of zero. The robustness of the Heckman-style estimator to endogeneity in the treatment estimate is just as visible here.

## Code and Data

The results of the simulations can be downloaded here (2.4 MB) as a csv. The script used to generate the data can be downloaded here, although the entire script is reproduced in the above post.

## Take-Away

Endogenous relationships create real problems for proper and unbiased estimation of treatment effects, and standard methods for dealing with selection on the observables, such as propensity scores and simple regression, fail to adequately deal with this problem. Heckman-style selection models provide one means of dealing with it. Although they are not a panacea, these frameworks are a valuable tool for estimation in the presence of endogeneity due to correlation between the selection and outcome processes.

# The Problem with Propensity Scores

15 Apr 2015

## Are Propensity Scores Useful?

Effect estimation for treatments using observational data isn't always straightforward. For example, it is very common that patients who are treated with a certain medication or procedure are healthier than those who are not. Those who aren't treated may be untreated due to a higher risk of treatment in their condition or a reduced expected value of treatment (e.g., starting a statin makes little sense for a patient who is 100 years old, given the competing causes of mortality). If you simply compared the difference between the means of some response among the treated and untreated populations, the estimate would be confounded by the differences between the populations. One method of dealing with this problem was to match on the observed characteristics thought to be relevant.
Studies would find a control patient that matched the case patient on age, gender, and a number of other factors. The problem with this approach is that it is expensive and impractical to match on more than a few factors due to the curse of dimensionality: to avoid “throwing away” unmatched cases, the number of candidate controls that must be sampled increases exponentially with the number of matched factors.

Propensity scores are a possible alternative. Instead of matching points either directly or with a nearest-neighbor method in $$k$$ dimensions, a model can be constructed to express the propensity (e.g., probability) to be treated as a function of the observed variables. Matching can then be done in one dimension on this propensity. By reducing the $$k$$-dimensional problem to one dimension, it is possible to increase the number of matched factors and to use the collected data more effectively. The analysis of the matched groups can be done using a relatively simple method, such as a t test. While much can be said for reducing the complexity of the analysis to a simple bivariate test, and for the potentially greater ease of modeling treatment choice as opposed to outcome, the value of a PS-adjusted bivariate test isn't clear to me. To explore the question in more detail, I ran a few simulations in R to see how well an OLS estimator and a PS-adjusted estimator estimate a treatment effect under “best case” conditions.

## Simulation

Propensity scores are typically estimated by fitting a binomial GLM with treatment status (0 or 1) as the outcome and all other covariates as predictors. Let's generate a set of 10 predictors, treatment status, and the response variable. The predictors will all be iid standard normal and will be the only non-white-noise elements in the determination of treatment status. I selected an error term with a standard deviation of 5 empirically.

```r
set.seed(222)
X <- replicate(10, rnorm(1000))
error <- rnorm(1000, 0, 5)
probTreatment <- exp(cbind(1, X) %*% rep(1, 11) + error) /
  (1 + exp(cbind(1, X) %*% rep(1, 11) + error))
```

The vector probTreatment gives us $$\text{Pr}(T = 1 \mid x_1, x_2, \ldots, x_{10})$$. If we use rbinom, we can randomly assign treatment based on the different probabilities:

```r
t <- rbinom(1000, 1, probTreatment)
```

Now that we have treatment assigned, let's generate the response variable.

```r
y <- 1 + X %*% rep(1, 10) + t + rnorm(1000, 0, 5)
```

As mentioned above, simply comparing the values of $$y$$ for the treated and untreated groups won't produce the right estimate.

```r
t.test(y[t == 1], y[t == 0])
##
##  Welch Two Sample t-test
##
## data:  y[t == 1] and y[t == 0]
## t = 8.8739, df = 945.607, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.421779 3.797091
## sample estimates:
##  mean of x  mean of y
##  2.7115980 -0.3978372
```

The estimated treatment effect is 3.11, more than three times the true value of 1. However, if we estimate a propensity score model and match each treated observation to its nearest unmatched control, subject to the difference between the matches being less than 20% of the standard deviation of the pooled propensity scores, we get a better estimate.

First, the model:

```r
ps <- glm(t ~ X, family = binomial())
psValues <- predict(ps)
psTreated <- psValues[t == 1]
psUntreated <- psValues[t == 0]
```

The next step is to match, subject to the 20%-of-a-standard-deviation limit:

```r
# find matches, require them to be within 20% of a sd of the
# untransformed pooled propensity scores
matches <- matrix(NA, ncol = 3, nrow = NROW(y[t == 1]))
limit <- 0.2 * sd(predict(ps))
for (i in 1:NROW(y[t == 1])) {
  x <- psTreated[i]
  distance <- (x - psUntreated)^2
  if (min(distance) <= limit) {
    nn <- which.min(distance)
    # mark the matched control as used
    psUntreated[nn] <- .Machine$integer.max
    matches[i, ] <- c(i, nn, distance[nn])
  } else {
    matches[i, ] <- c(i, NA, NA)
  }
}
```

The loop takes each treated patient and searches for an unmatched untreated patient near the treated patient's propensity score and within the constraint. Once matched, the propensity score for the untreated patient is set to .Machine$integer.max to prevent further matches to the same patient (setting the value to NA introduces problems, and .Machine$integer.max is safe enough). If there is no possible match subject to the constraint, the treated patient is matched to NA to reflect this fact.

Using the matches, we can construct two new vectors, yTreatedMatched and yUntreatedMatched, for the t test. Based on propensity score theory, and true by construction in this simulation, assignment to treatment is random conditional on the observed characteristics. As a result, the two matched groups resemble what you might see in an RCT (in theory), and so the difference between them should be an unbiased and accurate estimate of the treatment effect.

```r
# throw out the unmatched patients
matches <- matches[!is.na(matches[, 2]), ]
yTreatedMatched <- y[t == 1][matches[, 1]]
yUntreatedMatched <- y[t == 0][matches[, 2]]
t.test(yTreatedMatched, yUntreatedMatched)
##
##  Welch Two Sample t-test
##
## data:  yTreatedMatched and yUntreatedMatched
## t = 2.2593, df = 654.727, p-value = 0.0242
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1224774 1.7492684
## sample estimates:
## mean of x mean of y
## 1.4946742 0.5588013
```

And, as expected, the propensity score matched estimator produces an accurate measure of the treatment effect, 0.94, compared to a true effect of 1. The result is also statistically significantly different from zero.

However, it required that we throw away data. We “collected” data on 565 treated patients but, with only 435 untreated patients available as potential matches, we were only able to match 330 of the treated patients, or 58.41%.
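
As a quick check, the match rate can be computed directly from the matches matrix (after the unmatched rows have been dropped):

```r
# matched treated patients as a share of all treated patients
nrow(matches) / sum(t == 1)
```
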

Additionally, had we instead used an OLS regression model, we could have produced similar estimates.

```r
model <- lm(y ~ X + t)
model
##
## Call:
## lm(formula = y ~ X + t)
##
## Coefficients:
## (Intercept)           X1           X2           X3           X4
##      1.1242       1.0259       0.7914       0.5494       0.8666
##          X5           X6           X7           X8           X9
##      0.7482       0.8879       0.9435       1.2965       0.8275
##         X10            t
##      0.6487       0.7250
```

Moreover, the 95% CI produced using the OLS method is narrower, ranging from 0.05 to 1.40, compared to the t test CI of 0.12 to 1.75.
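
For reference, the OLS interval quoted above can be pulled straight from the fitted model with confint:

```r
# 95% CI for the treatment coefficient from the OLS fit
confint(model)["t", ]
```
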

The next step is to see how this holds up for a variety of values of $$n$$ and sampling ratios (e.g., 2 controls per case vs 1 control per case). To this end, the function psSim simulates a propensity-score-adjusted estimate. It works by constructing a large population of size n with k covariates and 1 treatment variable. From this large population, sampleN cases are selected at random along with ratio * sampleN controls. (I'm using case and control here to mean treated vs untreated, not in the strict sense of a [case-control study](http://en.wikipedia.org/wiki/Case-control_study).)

From these sampled cases and controls, a propensity score model is estimated and the treatment effect is estimated using the matched populations. The function returns a two-element vector: the first element is the estimated treatment effect and the second is the p value from the t test.

The function is below:

```r
psSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors for the propensity score model
  pErr <- rnorm(n, 0, 5)
  pt <- exp(1 + X %*% rep(1, k) + pErr) /
    (1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)

  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr

  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))

  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]),
                                           ratio * sampleN), ]

  # estimate propensity scores using all k X predictors
  sample <- rbind(treatedSample, untreatedSample)
  ps <- glm(t ~ ., data = sample[-1],
            family = binomial())
  # find untransformed fitted values
  psTreated <- predict(ps)[sample$t == 1]
  psUntreated <- predict(ps)[sample$t == 0]

  # find matches, require them to be within 20% of a sd of the
  # untransformed pooled propensity scores
  matches <- matrix(NA, ncol = 3, nrow = nrow(treatedSample))
  limit <- 0.2 * sd(predict(ps))
  for (i in 1:nrow(treatedSample)) {
    x <- psTreated[i]
    distance <- (x - psUntreated)^2
    if (min(distance) <= limit) {
      nn <- which.min(distance)
      psUntreated[nn] <- .Machine$integer.max
      matches[i, ] <- c(i, nn, distance[nn])
    } else {
      matches[i, ] <- c(i, NA, NA)
    }
  }

  # do analysis on the matched pairs
  treatedY <- treatedSample[, 1]
  untreatedY <- untreatedSample[, 1]
  matches <- matches[!is.na(matches[, 2]), ]
  matchedTreatedY <- treatedY[matches[, 1]]
  matchedUntreatedY <- untreatedY[matches[, 2]]
  test <- t.test(matchedTreatedY, matchedUntreatedY)
  results <- c(mean(matchedTreatedY - matchedUntreatedY), test$p.value)
  results
}
```

I also wrote a second function that generates the population and sample in the same way but, instead of using propensity score matching, estimates the treatment effect directly with OLS. This function is called lmSim:

```r
lmSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors for the propensity score model
  pErr <- rnorm(n, 0, 5)
  pt <- exp(1 + X %*% rep(1, k) + pErr) /
    (1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)

  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr

  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))

  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]),
                                           ratio * sampleN), ]
  sampleData <- rbind(treatedSample, untreatedSample)

  # fit OLS model, pull the estimate and p value for the treatment effect
  model <- summary(lm(y ~ ., data = sampleData))
  results <- c(model$coefficients[2, 1], model$coefficients[2, 4])
  results
}
```

Using these two functions, I ran each estimator through 100 iterations for sample sizes of 50 to 1000 cases and control sets 1 to 2 times as large. If you are running this at home, it takes a while. There is a second post that discusses running this code in parallel (using the parallel package) and does a little optimization of the code (both technical and statistical) that is worth checking out; otherwise, be ready to wait a while (about 45-60 minutes on a 3.0 GHz CPU).

```r
psMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
psP <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsP <- matrix(NA, nrow = 100 * 3, ncol = 100)

i <- 1
for (n in seq(50, 1000, length.out = 100)) {
  for (ratio in c(1, 1.5, 2)) {
    psResults <- replicate(100, psSim(10000, 10, n, ratio))
    lmResults <- replicate(100, lmSim(10000, 10, n, ratio))
    psMean[i, ] <- psResults[1, ]
    psP[i, ] <- psResults[2, ]
    lsMean[i, ] <- lmResults[1, ]
    lsP[i, ] <- lmResults[2, ]
    i <- i + 1
  }
}
```

And some data munging to get it into a reasonable format for plotting:

```r
results <- data.frame(estimatedEffect = c(as.numeric(psMean),
                                          as.numeric(lsMean)))
results$ratio <- c(1, 1.5, 2)
results$n <- rep(seq(50, 1000, length.out = 100), each = 3)
results$method <- rep(c("PS", "LS"), each = 300 * 100)
results$p <- c(as.numeric(psP), as.numeric(lsP))
results$sig <- results$p <= 0.05
```

## Results

If we plot the estimate against the sample size (of cases):

```r
ggplot(results, aes(x = n, y = estimatedEffect)) +
  geom_point(aes(color = method), alpha = 0.25) +
  facet_grid(ratio ~ method) +
  scale_y_continuous("Estimated Treatment Effect") +
  geom_hline(aes(yintercept = 1))
```

The code produces a 3x2 grid of plots: the left column shows the estimates from OLS and the right column those from the propensity score matched t test. The rows are the different ratios of control-to-case sampling (1, 1.5 and 2).

It is clear that both estimators produce reasonably unbiased estimates of the true treatment effect of 1. Furthermore, both appear to converge towards the true value as the sample size grows.
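
A quick numeric check of that impression, using base R's aggregate (my addition, not part of the original script):

```r
# mean estimated effect by method, pooled over n and ratio
aggregate(estimatedEffect ~ method, data = results, FUN = mean)
```
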

However, it looks like the spread of the OLS estimates may be smaller at a given value of $$n$$ than that of the propensity score estimator. To check, I calculated the mean squared error (MSE) of the estimates about the true value of 1 for both methods at each sample size.

```r
# from dplyr
mse <- summarise(group_by(results, method, n, ratio),
                 mean((estimatedEffect - 1)^2))
names(mse) <- c("method", "n", "ratio", "mse")
```

When plotted

```r
ggplot(mse, aes(x = n, y = mse, color = method)) +
  geom_point() +
  facet_grid(ratio ~ .) +
  scale_x_continuous("N") +
  scale_y_continuous("MSE")
```

and especially when just a smoother is shown, not the actual points,

```r
ggplot(mse, aes(x = n, y = mse, color = method)) +
  geom_smooth(method = "loess") +
  facet_grid(ratio ~ .) +
  scale_x_continuous("N") +
  scale_y_continuous("MSE")
```

it becomes clear that the MSE of a regression method is smaller than the typical MSE of the propensity score method for a fixed $$k$$ as $$n$$ varies.

Unsurprisingly, the higher MSE also shows up in the power testing. If we look at the p values of the tests:

```r
ggplot(results, aes(x = n, y = p)) +
  geom_point(aes(color = method), alpha = 0.25) +
  facet_grid(ratio ~ method) +
  scale_y_continuous("Observed p value") +
  geom_hline(aes(yintercept = 0.05))
```

As we know that the true treatment effect is 1, we should expect fairly small p values. Both modeling frameworks appear to give similar results although, as with the estimate and MSE plots above, relative differences in performance emerge when we consider the power curve.

If we formalize this with an estimated power curve:

```r
power <- group_by(results, method, n, ratio) %>%
  summarize(power = mean(sig))

ggplot(power, aes(x = n, y = power, color = method)) +
  geom_point(aes(color = method)) +
  facet_grid(ratio ~ method) +
  geom_smooth(method = "loess") +
  scale_x_continuous("N") +
  scale_y_continuous("Power")
```

And again but with just the LOESS fit and no points:

```r
ggplot(power, aes(x = n, y = power * 100, color = method)) +
  facet_grid(ratio ~ .) +
  geom_smooth(method = "loess") +
  scale_x_continuous("N") +
  scale_y_continuous("Power")
```

We see a relative difference between the two methods, with the difference most pronounced at smaller values of $$n$$.

```r
ps <- power[power$method == "PS", ]
ols <- power[power$method == "LS", ]
diffPower <- data.frame(ratio = ps$ratio,
                        diff = ols$power - ps$power,
                        n = ps$n)
diffPower$lsMorePowerful <- diffPower$diff > 0

ggplot(diffPower, aes(x = n, y = diff * 100)) +
  geom_point(aes(color = lsMorePowerful)) +
  facet_grid(ratio ~ .) +
  geom_smooth(method = "loess", alpha = 0) +
  scale_x_continuous("N") +
  scale_y_continuous("Percentage Point Difference in Power")
```

In general, propensity score methods for a fixed $$k$$ over a varied $$n$$ are less powerful than regression methods at the same values of $$n$$ and $$k$$. These differences become small as $$n$$ gets very large, say 500-750, but for smaller values of $$n$$ they can make a meaningful difference. A 10-20 percentage point increase in power by simply using regression over matching is hard to turn down.

The benefits of using linear models become even clearer when you vary $$k$$. With the same $$n$$ and increasing $$k$$, the linear model becomes increasingly powerful relative to matching, by as much as 10 percentage points of power. And when the ratio of controls to cases gets larger, the advantage of regression is clearer still.

There are other problems with propensity scores. The propensity score models in this simulation were “perfect”: conditional on all the observed variables, treatment choice was random, as our model was an exact restatement of the data generating process. This is unlikely to be the case in reality. Few, if any, papers evaluate the performance of their propensity score model. Many don't even provide information about what went into the model and the model selection, much less a cross-validated MSE.
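
For what it's worth, a basic check is not hard. Here is a sketch of a common balance diagnostic, the standardized mean difference, applied to the first covariate of the worked example earlier in the post (the smd helper is mine, not part of the original analysis):

```r
# standardized mean difference, a common balance diagnostic
smd <- function(x1, x0) {
  (mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)
}

# before matching: all treated vs all untreated
smd(X[t == 1, 1], X[t == 0, 1])

# after matching: only the matched pairs (matches from the earlier loop)
smd(X[t == 1, 1][matches[, 1]], X[t == 0, 1][matches[, 2]])
```

If the matching has done its job, the post-matching value should be much closer to zero.
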

Propensity scores also offer no improvement over regression in terms of unobserved heterogeneity. The scores are functionally identical to running a regression and so suffer from all of the same drawbacks.

Finally, propensity scores are computationally more expensive than regression. The file k.csv contains timings from runs of the propensity score and regression methods on similar data sets with a fixed value of $$n = 1000$$ and $$k$$ ranging from 10 to 100, with 100 replications at each combination, done on my Linode VPS.

```r
timings <- read.csv("/home/jacob/documents/blog/ps/k.csv")
# have to remove an outlier
timings <- timings[timings$k != 10, ]
timings$method <- ifelse(timings$m == 0, "LS", "PS")

ggplot(timings, aes(x = k, y = t, color = method)) +
  geom_point() +
  geom_smooth(method = "loess") +
  scale_x_continuous("K") +
  scale_y_continuous("Time / s")
```

Regression is nearly always much faster than propensity score matching. More importantly, as $$k$$ increases with a fixed $$n$$, the cost of regression increases approximately linearly. The non-linear increase with respect to $$k$$ for matching is scary: a model that estimates and then matches on 500 predictors could take hours to fit, versus seconds for an OLS model.

This makes sense: lm() and similar functions hand the computation off to highly optimized numerical linear algebra routines. A matching function, even one written in a lower-level language like C or C++ via Rcpp, will be nowhere near as fast as LINPACK/LAPACK/BLAS.

It would be hard, if not impossible, to make a propensity score method as fast as regression, for two reasons: the propensity score method already performs a regression in order to estimate the propensity scores, and the matching step then adds an additional, greater cost.
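
If you want to see the gap on your own machine, a crude comparison using the psSim and lmSim functions defined above works fine (the arguments here are my choice for illustration):

```r
# one run of each at n = 10000, k = 10, 500 cases, 1:1 sampling;
# expect the matching step to dominate the difference
system.time(psSim(10000, 10, 500, 1))
system.time(lmSim(10000, 10, 500, 1))
```
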

## Take-away

This added cost is just that: added cost. It doesn't provide a power benefit (if anything, it comes with a power penalty), a computational benefit, or a reduction in complexity. While PS methods work, there is practically no circumstance in which they work and regression methods do not work at least as well.