Inter-ocular trauma test

I’ve recently been thinking about the role statistics can play in answering questions. I think the it came up on the NSSD podcast a few weeks ago. Basically, problems can be divided into three classes:

  • those that don’t need statistics because the answer is obvious (problems without much confounding and a strong signal to noise ratio)
  • those that need statistics because there is noise/confounding
  • those that can’t be helped by statistics because the noise/confounding is so large or the signal is very small or the question is fundamentally unanswerable with statistics

I think this is a useful idea and have been applying a similar system for years that I’ve learned as the “inter-ocular trauma test.” (I learned the term in my intro biostats class, I tracked it down to a talk by Joe Berkson in the 50s or 60s). The idea is that sometimes the result is so obvious that it hits you between the eyes, hence causing inter-ocular trauma.

I like this idea and often use it as a crude filter when looking at visualizations or causally reading some analysis. High end statistical analysis is often required to determine the signal and the formalization of the inter-ocular trauma test with a formal analysis is always needed, but that doesn’t stop it from being useful all the same.

One of the reasons I really like it is how simple and easy it is to understand. For example, where I live (Iowa City, Iowa) there is a small “town” of about 1,000 people located within the larger city. This small town is famous for being a speed trap - a major road enters and passes through the town at the same time as the speed limit drops to 25 mph from 35 mph and the road goes down hill. They employ 5 people officers (about 2x what would be expected for a comparable city) and issue a lot of speeding tickets.

A recent article came out in the local paper saying they issued more tickets than other areas. The reporter summarized the number of tickets issued in the past year and the population of the towns. They found the following data:

The town of University Heights is located entirely within Iowa City. When asked about the difference, the University Heights police chief said that the comparison isn’t fair because they have a major thoroughfare and lots of people tailgate in the area during home football games. (Never mind that the stadium is in Iowa City and the only way to enter University Heights is to travel through Iowa City for several miles.)

They do also say that normalizing by population isn’t fair because they have so many communters. However, they also did issue the greatest number of tickets (907 vs 617, 802 or 366). So that doesn’t really hold either.

It all makes sense in the context of incentives: On average, a ticket return $81 to the city’s pocket and saves the average taxpayer about $66 dollars a year or a household of four $263 a year in property taxes. They pay for 6% of all city services.

The inter-ocular trauma test seems to confirm the reputation for this town being a speed trap. Sure, a nice analysis looking at when the stops occur (is the football game excuse a real excuse? morning rush hour?) would be stronger, but also a bit of overkill for the question at hand.

My advise? The inter-ocular trauma test, much like the sniff test for spoiled milk, will get you far. And if you travel through University Heights, go 20 just to be safe.

Using tidytext to make sentiment analysis easy

Last week I discovered the R package tidytext and its very nice e-book detailing usage. Julia Silge and David Robinson have significantly reduced the effort it takes for me to “grok” text mining by making it “tidy.”

It certainly helped that a lot of the examples are from Pride and Prejudice and other books by Jane Austen, my most beloved author. Julia Silge’s examples on her blog doing NLP and sentiment analysis alone would have made me a life-long fan. The gifs from P&P (mostly the 1995 mini-series, to be honest) on her posts and the references in the titles made me very excited. My brain automatically started playing the theme and made me smile.

Okay, enough of that. Moving on.

Seeing her work, I started wondering what I can model to get some insight into my own life. I have a database of 92,372 text messages (basically every message sent to or from me from sometime in 2011/2012 to 2015) but text messages were weird (lots of “lol” and “haha”’s). I think there is some interesting insights there, but probably not what I wanted to cover today.

So I started thinking what other plain text data did I have that might be interesting. And then I realized I have a 149 page dissertation (excluding boilerplate and references) and it was in LaTeX (so easy to parse) and it was written in 5 different files that relate directly to the chapters (intro, lit review, methods, results and a discussion). I could do something with that!

My thesis is currently under embargo while I chop it into its respective papers (one under review, one soon to be under review and one undergoing a final revision. So close.), so I can’t link to it. However, it relates the seasonality of two infectious diseases and local weather patterns.

I wonder how my sentiment changes across the thesis. To do this, I’ll use the tidytext package. Let’s import the relevant packages now.


The tidyverse ecosystem and tidytext play well together (no surprises there) and so I also import tidyverse. The stringr package is useful for filtering out the LaTeX specific code and also for dropping words that have numbers in them (like jefferson1776 as a reference or 0.05).

Now let’s read in the data (the tex files)

thesis_words <- data_frame(file = paste0("~/thesis/thesis/", 
                         c("introduction.tex", "lit-review.tex", "methods.tex", 
                           "results.tex", "discussion.tex"))) %>%
  mutate(text = map(file, read_lines))
## # A tibble: 5 × 2
##                               file          text
##                              <chr>        <list>
## 1 ~/thesis/thesis/introduction.tex   <chr [125]>
## 2   ~/thesis/thesis/lit-review.tex <chr [1,386]>
## 3      ~/thesis/thesis/methods.tex   <chr [625]>
## 4      ~/thesis/thesis/results.tex <chr [1,351]>
## 5   ~/thesis/thesis/discussion.tex   <chr [649]>

The resulting tibble has a variable file that is the name of the file that created that row and a list-column of the text of that file.

We want to unnest() that tibble, remove the lines that are LaTeX crude (either start with \[A-Z] or \[a-z], like \section or \figure) and compute a line number.

thesis_words <- thesis_words %>%
  unnest() %>%
  filter(text != "%!TEX root = thesis.tex") %>%
  filter(!str_detect(text, "^(\\\\[A-Z,a-z])"),
         text != "") %>%
  mutate(line_number = 1:n(),
         file = str_sub(basename(file), 1, -5))
thesis_words$file <- forcats::fct_relevel(thesis_words$file, c("introduction",

Now we have a tibble with file giving us the chapter, text giving us the line of text from the tex files (when I wrote it, I strived to keep my line lengths under 80 characters, hence the relatively short value in text) and line_number giving a counter of the number of lines since the start of the thesis.

Now we want to tokenize (strip each word of any formatting and reduce down to the root word, if possible). This is easy with unnest_tokens(). I’ve also played around with the results and came up with some other words that needed to be deleted (stats terms like ci or p, LaTeX terms like _i or tabular and references/numbers).

thesis_words <- thesis_words %>%
  unnest_tokens(word, text) %>%
  filter(!str_detect(word, "[0-9]"),
         word != "fismanreview",
         word != "multicolumn",
         word != "p",
         word != "_i",
         word != "c", 
         word != "ci",
         word != "al",
         word != "dowellsars",
         word != "h",
         word != "tabular",
         word != "t",
         word != "ref",
         word != "cite",
         !str_detect(word, "[a-z]_"),
         !str_detect(word, ":"),
         word != "bar",
         word != "emph",
         !str_detect(word, "textless"))
## # A tibble: 27,787 × 3
##            file line_number        word
##          <fctr>       <int>       <chr>
## 1  introduction           1 seasonality
## 2  introduction           1          or
## 3  introduction           1         the
## 4  introduction           1    periodic
## 5  introduction           1      surges
## 6  introduction           1         and
## 7  introduction           1       lulls
## 8  introduction           1          in
## 9  introduction           1   incidence
## 10 introduction           1          is
## # ... with 27,777 more rows

Now to compute the sentiment using the words written per line in the thesis. tidytext comes with three sentiment lexicons, affin, bing and nrc. affin provides a score ranging from -5 (very negative) to +5 (very positive) fr 2,476 words. bing provides a label of “negative” or “positive” for 6,788 words. nrc provides a label (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise or trust) for 13,901 words. None of these account for negation (“I’m not sad” is a negative sentiment, not a positive one).

Using the nrc lexicon, let’s see how the emotions of my words change over the thesis.

thesis_words %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(index = line_number %/% 25, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) + 
  geom_bar(stat = "identity", alpha = 0.8) + 
  facet_wrap(~ sentiment, ncol = 5) 

I wasn’t surprised, but at least I wasn’t sad? It looks like I used more “fear” and “negative” words in the lit-review than the other sections. However, it looks like “infectious” as in “infectious diseases” is a fear/negative word. I used that word a lot more in the lit review than other sections.

I can use the bing and afinn lexicons to look at how the sentiment of the words changed over the course of the thesis.

thesis_words %>% 
  left_join(get_sentiments("bing")) %>%
  left_join(get_sentiments("afinn")) %>%
  group_by(index = line_number %/% 25, file) %>%
  summarize(afinn = mean(score, na.rm = TRUE), 
            bing = sum(sentiment == "positive", na.rm = TRUE) - sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>% 
  ggplot(aes(x = index, y = lexicon_score, fill = file)) +
    geom_bar(stat = "identity") + 
    facet_wrap(~ lexicon, scale = "free_y") +
    scale_x_continuous("Location in thesis", breaks = NULL) +
    scale_y_continuous("Lexicon Score")

Looking at the two lexicon’s scoring of my thesis, the bing lexicon seems a little more stable if we assume local correlation of sentiments is likely. It seems like I started out all doom and gloom (hey, I needed to convince my committee that it was a real problem!), moved onto more doom and gloom (did I mention this is a problem and my question hasn’t been resolved?), the methods were more neutral, results were more doom and gloom but with a slight uplift at the end followed by more doom and gloom (this really is a problem guys!) and a little bit of hope at the end (now that we know, we can fix this?).

This got me thinking about what a typical academic paper looks like. My mental model for a paper is:

  1. show that the problem is really a problem (“ is a significant cause of morbidity and mortality”)
  2. show that the problem isn’t resolved by the prior work
  3. answer the question
  4. incorporate the answer into the existing literature
  5. discussion limitations and breezily dismiss them
  6. show hope for the future

So I pulled the text of my 4 currently published papers. I’m going to call them well-children, medication time series, transfer networks and COPD readmissions.

I took the text out of each paper and copied them into plain text files and read them into R as above. I also computed line numbers within each of the different papers.

paper_words <- data_frame(file = paste0("~/projects/paper_analysis/", 
                                         c("well_child.txt", "pharm_ts.txt",
                                           "transfers.txt", "copd.txt"))) %>%
  mutate(text = map(file, read_lines)) %>%
  unnest() %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) 

paper_sentiment <- inner_join(paper_words, get_sentiments("bing")) %>%
  count(file, index = round(line_number / max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

paper_sentiment %>% ggplot(aes(x = index, y = net_sentiment, fill = file)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  facet_wrap(~ file) + 
  scale_x_continuous("Location in paper (percent)") + 
  scale_y_continuous("Bing Net Sentiment")

It looks like I wasn’t totally off. Most of the papers start out relatively negative, have super negative results sections (judging by paper location) but I was wrong about them ending on a happy note.

And the sentiment for this post:

Talking about negative sentiments is a negative sentiment. But look at the start when I was talking about Austen… that was a good time.

Easy Cross Validation in R with `modelr`

11 Nov 2016

When estimating a model, the quality of the model fit will always be higher in-sample than out-of-sample. A model will always fit the data that it is trained on, warts and all, and may use those warts and statistical noise to make predictions. As a result, a model that performs very well on a data set may perform poorly when used more generally.

Since we often can’t fit a model, put it in use and determine how well it performs (and probably wouldn’t want to even if we could), we need some methods for predicting out-of-sample error when only using the data we have.

The simplest method is to randomly divide the data into a training set and a test set. However, this provides only one test set. If you train a model, test it and then train a new model and re-test it using the same test set and see an improvement, it may have come from the model actually being better or it may come from you over-fitting the test set.

One possible way to address this problem is to use something like k-fold cross validation. The general idea is is to use some portion of your data (typically 90%) to fit a model and then test that model on the withheld portion (typically 10%) to assess the quality of the model fit. You would repeat this process over and over again until each point has been used to estimate the model \(k-1\) times and been included in the test set 1 time.

This can be done with a loop or a member of the apply family, but I’m going to use purrr to write functional code to fit the models and modelr to evaluate the models.

My “preamble” looks like this:

Data and the Model

For this example, I’m going to use the cars dataset from the caret package. It looks like this:

data("cars", package = "caret")
## Observations: 804
## Variables: 18
## $ Price       <dbl> 22661.05, 21725.01, 29142.71, 30731.94, 33358.77, ...
## $ Mileage     <int> 20105, 13457, 31655, 22479, 17590, 23635, 17381, 2...
## $ Cylinder    <int> 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Doors       <int> 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Cruise      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Sound       <int> 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,...
## $ Leather     <int> 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,...
## $ Buick       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Cadillac    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Chevy       <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Pontiac     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Saab        <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Saturn      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ convertible <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ coupe       <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hatchback   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ sedan       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ wagon       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

We want to build a model to predict the price of the car. We have the mileage, number of cylinders, size, some features (cruise control, sound system, leather seats), brand and then the style of car. The data already has dummy variables constructed but we need to drop one in order to estimate the model (e.g., set it as the reference level). We’ll drop the brand Chevy indicator (setting Chevy to the reference brand) and set the reference style to sedan. We’ll also drop the doors variable as that is perfectly co-linear with the coupe indicator and styles likely contain more information than the number of doors.

Our final model will look something like this:

select(cars, -Doors, -Chevy, -sedan) %>% 
  lm(Price ~ Milage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac + 
  Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon, data = .)

The modelr package has a useful tool for making the cross-validation folds. crossv_kfold will divide the data into \(k\) (by default 10) folds and returns a tibble with the list-columns of train (the training data for the fold), test (the test set for the fold) and .id (the fold number). For more details on the idea of list-columns, see the Many Models Chapter in R for Data Science by Garrett Grolemund and Hadley Wickham.

cars_cv <- cars %>%
  select(-Doors, -Chevy, -sedan) %>%
## # A tibble: 10 × 3
##             train           test   .id
##            <list>         <list> <chr>
## 1  <S3: resample> <S3: resample>    01
## 2  <S3: resample> <S3: resample>    02
## 3  <S3: resample> <S3: resample>    03
## 4  <S3: resample> <S3: resample>    04
## 5  <S3: resample> <S3: resample>    05
## 6  <S3: resample> <S3: resample>    06
## 7  <S3: resample> <S3: resample>    07
## 8  <S3: resample> <S3: resample>    08
## 9  <S3: resample> <S3: resample>    09
## 10 <S3: resample> <S3: resample>    10

You might notice that instead of the elements being data.frame or tibble, they are resample objects. resample objects are cool because they are lazy. Instead of making 10 copies of cars, crossv_kfold() returned what are essentially pointers to the original data. So instead of storing the data 10 times, we store it once plus a little bit for the pointers. Indeed, we can see that in the object size:

## 63.6 kB
## 93.7 kB

Awesome, right?

Now it is time to fit the model. It all can be down within the very tidy tidyverse (see what I did there?). The code below uses purrr:map() to fit the model for each one of the 10 folds and returns the RMSE on the test set.

cars_cv %>% 
  mutate(train = map(train, as_tibble)) %>%
  mutate(model = map(train, ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + 
                                   Leather + Buick + Cadillac + Pontiac + 
                                   Saab + Saturn + convertible + coupe + 
                                   hatchback + wagon, 
                                 data = .))) %>%
  mutate(rmse = map2_dbl(model, test, rmse)) %>%
  select(.id, rmse)
## # A tibble: 10 × 2
##      .id     rmse
##    <chr>    <dbl>
## 1     01 2372.046
## 2     02 2844.509
## 3     03 3398.572
## 4     04 3165.065
## 5     05 2838.691
## 6     06 2675.153
## 7     07 3720.843
## 8     08 2692.567
## 9     09 2362.007
## 10    10 3016.648

It is reasonably fast too.

ten_fold <- function(x) {
  x %>% 
    mutate(model = map(train, ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + 
                                   Leather + Buick + Cadillac + Pontiac + 
                                   Saab + Saturn + convertible + coupe + 
                                   hatchback + wagon, 
                                 data = .))) %>%
  mutate(rmse = map2_dbl(model, test, rmse)) %>%
  select(.id, rmse)

## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max
##  ten_fold(cars_cv) 45.47155 50.18167 61.01467 53.27679 60.67367 139.0201
##  neval
##    100

microbenchmark puts it a median time of about 50 ms. We can probably make it even faster though. We don’t actually need to store the model and we can do it all in one step with a function.

fitter <- function(training_data, testing_data) {
  lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac + 
       Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon, 
     data = training_data) %>% 

ten_fold_2 <- function(x) {
  x %>% 
    mutate(rmse = map2_dbl(train, test, fitter)) %>%
    select(.id, rmse)

## Unit: milliseconds
##                 expr     min       lq     mean   median       uq      max
##  ten_fold_2(cars_cv) 46.1202 49.16919 52.21734 51.25691 53.72752 100.7599
##  neval
##    100

Hey, that saved 2 or 3 milliseconds on the median. Totally worth it. (If your model fit takes a long time, it is probably worth it).

It is possible to parallelize the model fitting using multidplyr as the backend for dplyr; however, multidplyr doesn’t seem to work well with the resample objects. The easiest solution is to store nine copies of the data to allow for the cross-validation to work most simply - but I’m more likely to hurt for shortage of memory than shortage of CPU cycles/time when dealing with health care administrative data. Alternatively, you could use do() and multidplyr to filter by fold ID, fit a model, compute RMSE and proceed from there, but that solution is much less simple.

Packages like caret do provide powerful tools for cross-validation; however, I like the ability to keep my entire workflow smooth and within the tidyverse. I found that, in general, I prefer to use the modelr tools and do the model fitting the “old fashioned way.”

Parallel Simulation of Heckman Selection Model

23 Apr 2015

Parallel Simulation of Heckman Selection Model

One of the, if not the, fundamental problems in observational data analysis is the estimation of the value of the unobserved choice. If the \(i^{\text{th}}\) unit chooses the value of \(t\) on the basis of some factors \(\mathbf{x_i}\), which may include the \(E(u_i(t))\) for that unit, comparing the outcome \(\mathbf{y}\) on a set where \(t = 1\) and a set where \(t = 0\) is unlikely to provide valid inference.

Randomized controlled trials get rid of this problem by setting the \(Pr(t_i = 1)\) to be equal and independent of the values of \(x_i\) for all \(i\). This provides a good estimation of the value of \(t\).

But randomized controlled trials just aren’t reasonable for any questions, if they are even possible. And so observational data analysis is critical but it must first find some way to control for everything in \(\mathbf{x_i}\) while to compute an unbiased and valid estimate of \(E(u_i(t == 1)) - E(u_i(t == 0))\).

This can be straight-froward. It is easy to deal with outcome modifiers or selection modifiers with propensity scores or regression. However, both of these methods breakdown when faced with correlation between the error in the selection equation and the outcomes equation.

Heckman Model

The Heckman Model (aka Heckman correction/Heckit) treats this correlation between the errors as a form of mis-specification and omitted variable bias. If we could observe the behavioral factors that feed into the error terms, this would not be a problem. Alas, that is generally not possible.

What the Heckman Model suggests is that for each person we have

\[t_i^* = \mathbf{z_i} \alpha + \epsilon_i\]


\[y_{u, i}^* = \mathbf{x_i} \beta_u + \eta_{u, i}\] \[y_{t, i}^* = \mathbf{x_i} \beta_t + \eta_{t, i}\]

The first equation gives the likelihood of being given treatment \(t\), this should similar similar to propensity score models. The second set of equations denote the outcomes that would be realized if a patient were given \(t\). However, we can’t observe both outcomes on the same unit (what Rubin and Holland call the fundamental problem of casual inference).

Instead, unless we have randomized \(t\), we observe

\[t_i = \begin{cases} 0 \text{ if } \mathbf{z_i} \alpha + \epsilon_i < 0 \\ 1 \text{ if } \mathbf{z_i} \alpha + \epsilon_i \geq 0 \end{cases}\]

and from \(t_i\),

\[y_i = \begin{cases} y_{u, i}^* \text{ if } t_i = 0 \\ y_{t, u}^* \text{ if } t_i = 1 \end{cases}\]

The value \(y\) is the realized value and its dependence on \(t\) is clear.

Still, this isn’t a real problem. It can all be solved with regression or propensity scores or similar analysis. The problem comes in the error terms. Basically, often times problems present with a correlation between the error terms in the selection and outcome equations. We could write

\[\left(\begin{array}{c} \epsilon \\ \eta_u \\ \eta_t \\ \end{array} \right) = N\left( \left(\begin{array}{c}0\\ 0\\ 0\\ \end{array}\right), \left(\begin{array}{ccc} 1 & \rho_1 \sigma_1 & \rho_2 \sigma_2 \\ \rho_1 \sigma_1 & \sigma_2^2 &\rho_{12}\sigma_1 \sigma_2 \\ \rho_2 \sigma_2 & \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{array} \right) \right)\]

and whenever both \(\rho_1\) and \(\rho_2\) are not 0, there will be poor estimation of \(E(y_t^*) - E(y_u^*)\).

What the Heckman model advances in the idea that \(E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0)\) can be broken down into two parts. Rather simply, the expectation breaks down into the true expectation of the unobserved value of \(y_u^*\) plus the expected value of \(\eta_u\) conditional on \(t_i = 0\):

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = E(y_u^*) + E(\eta_u | t_i = 0)\]

Since \(t_i = 0\) if \(\mathbf{z_i} \alpha + \epsilon_i\), we can expand this to

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) + E(\eta_u | \mathbf{z_i} \alpha + \epsilon < 0)\]

Looking back towards the top, we see that \(E(y_u^*) = \mathbf{x_i} \beta_u\). Replacing that expectation in the expression yields

\[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = \mathbf{x_i} \beta_u + E(\eta_u | \mathbf{z_i} \alpha + \epsilon < 0)\] \[E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) = \mathbf{x_i} \beta_u + E(\eta_u | \epsilon < -\mathbf{z_i} \alpha)\]

It can be shown that because \(\eta_u\) and \(\epsilon\) are distributed according to a multivariate truncated normal distribution that

\[E(\eta_u | \epsilon < -\mathbf{z_i} \alpha) = \rho_1 \sigma_1 \lambda(-\mathbf{z_i} \alpha) + \epsilon_u\]

The expression \(\lambda(-\mathbf{x_i} \alpha)\) says to evaluate the inverse Mill’s ratio at the value of \(-\mathbf{x_i} \alpha\). The inverse Mill’s ratio for \(x\) is given by \(\frac{\phi(x)}{\Phi(x)}\) where \(\phi()\) and \(\Phi()\) are the standard normal pdf and cdf functions, respectively.

The proof of this is not overly complicated. The simplest way is to consider a recursive definition of the moments for a truncated normal and then the consideration that the truncated normal can be generalized to the bi variate truncated normal with a multiplication by \(\rho_1\). Fully detailing the proof would bog down in integration by parts and algebra nobody cares to see.

The exact same process can be repeated for the case of \(y_t\) leading to the equations:

\[\begin{eqnarray} E(y_u | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 0) &=& \mathbf{x_i} \beta_u - \rho_1 \sigma_1 \lambda(-\mathbf{z_i} \alpha) + \epsilon_u \\ E(y_t | \mathbf{x} = \mathbf{x_i}, \mathbf{z} = \mathbf{z_i}, t = 1) &=& \mathbf{x_i} \beta_t + \rho_2 \sigma_2 \lambda(\mathbf{z_i} \alpha) + \epsilon_t \end{eqnarray}\]

where \(\epsilon_u\) and \(\epsilon_t\) are random error uncorrelated with the selection model and a mean of zero.

This equation can be estimated in two steps. The first bit, \(\lambda(-\mathbf{z_i}\alpha)\) and \(\lambda(\mathbf{z_i} \alpha)\) can both be estimated using a probit first stage:

\[E(Pr(t_i = 1)) = \Phi\left(\mathbf{z_i} \alpha \right)\]

and the resulting Inverse Mill’s Ratio can be computed using these values and a sign depending on the observed value of \(t\).

Once this variable is estimated, the complete model can be estimated to find the effects of \(\mathbf{x_i}\) as part of a second stage using least squares.

Additionally (and perhaps better), the whole model can be estimated in a single stage using maximum likelihood. Heckman’s recommendations have historically been to estimate the model in two stages to provide starting values for a single stage ML estimator. The MLE is more sensitive to starting values compared to the two stage model and so it isn’t always best, but should be the first estimator considered.

Data Generation Details


Let’s set up a (very) simple simulated example of this problem.

A disease of interest affects patients aged 65 to 85 uniformly and you want to estimating the effect of a new long-term treatment. The treatment’s benefit is not dependent on age but the probability of realizing the treatment is. The patient needs to be on the drug for an average of 10 years to realize a treatment effect (this is not totally unrealistic for some drugs such as statins). Formally, lets say that \(l\) is the lag of the treatment effect and \(l \sim gamma(20, 2)\). That gives \(l\) a mean of 10 and a variance of of 5. Some people respond early to the treatment and some respond late but 95% respond between 6 and 14 years after the treatment is started.

What is of interest is that the expected life remaining decreases over this interval. If we use the Social Security Tables we can compute the number of years of expected life remaining by age, \(d\).

Lets combine the remaining life and the distribution of \(l\) to estimate the probability of living long enough to benefit from the treatment effect for the average person:

actTable <- data.frame(age = seq(65, 85), 
                      yearsRemaining = c(18.94, 18.18, 17.44, 16.70, 15.98, 
                                        15.27, 14.57, 13.88, 13.21, 12.55, 
                                        11.91, 11.29, 10.67, 10.09,  9.51,
                                         8.94,  8.41,  7.88,  7.38,  6.91,
actTable$prTreatmentEffect <- pgamma(actTable$yearsRemaining, 20, 2)
ggplot(actTable, aes(x = age, y = prTreatmentEffect)) + 
  geom_line() + 
  scale_x_continuous("Age / Years") + 
  scale_y_continuous("Expected Probability of Outliving Treatment Lag")

Of course, the expected probability of outliving the lag isn’t the same as a given person’s probability. So lets add some noise. I don’t want to move it too far in either direction but I do want a fair amount of error, so lets say

\[P(r \geq l)_i = \Phi \left(\Phi^{-1} \left[P(d_i \geq l) \right] + \epsilon_i \right)\]

where \(\epsilon \sim N(0, 0.5^2)\), \(\Phi\) is the standard normal CDF and \(\Phi^{-1}\) is the inverse of the standard normal CDF (converting the probability to a Z-score). The same plot as above is made below but this time includes a shaded area denoting 95% sampling bounds about the mean.

# using a probit style model for simplicity but logit or other adjustments are
# equally valid
actTable$z <- qnorm(actTable$prTreatmentEffect)
actTable$lb <- pnorm(actTable$z - 2 * 0.5)
actTable$ub <- pnorm(actTable$z + 2 * 0.5)

ggplot(actTable, aes(x = age, y = prTreatmentEffect, ymin = lb, ymax = ub)) + 
  geom_line() + 
  geom_ribbon(alpha = 0.25) + 
  scale_x_continuous("Age / Years") + 
  scale_y_continuous("Expected Probability of Outliving Treatment Lag") 

Visually, this looks like a reasonable but not excessive amount of noise around the expected probability of success.

The meaning of \(\epsilon\) can be ascribed to random differences in willingness to accept the cost of the treatment. Even if you had a \(P(r \geq l)\) very near zero, if the treatment had no cost (in dollars or side effects) it would always make sense to take the treatment as the expected value would be strictly \(> 0\).

But that isn’t how treatments work. They cost you (and the insurer) money. They cost you effort to take (even if that effort is just taking a pill). They cost you in the risk of side effects.

So in this example, \(\epsilon\) can be thought of as measuring the sensitivity to the cost or the preference for the chance that the drug works.

It can also be thought of as patients sorting on their expected gain based on unobserved (and potentially unobservant to the researcher) factors. A patient who is really good about taking medications might take the medication because the cost of taking it is low. Same with the patient who is taking it because they are not worried about the side effects or believe they are worth the outcome.

This is where the problem arises. Suppose the outcome model is simple:

\[y_i = 1 + t_i + \eta_i\]

where \(y\) is some outcome of interest, \(t\) is treatment status and \(\eta\) is the error term in the outcomes model. Suppose \(\eta \sim N(0, 2^2)\).

As a general rule, people who take their medications regularly and who do not stop due to mild side effects are going to have the better outcomes. It seems likely then that \(\text{cor}(\epsilon, \eta) \neq 0\). Because the error term in the selection model was partly due to patients sorting on their expected ability to adhere and persist through side effects, it is not a stretch to imagine that the patients with larger values of \(\epsilon\) will also have larger values of \(\eta\) and likewise for smaller values.

Generating Test Sets

To recap, the selection model is

\[P(r \geq l)_i = \Phi \left(\Phi^{-1} \left[P(d_i \geq l) \right] + \epsilon_i \right)\]

where \(\epsilon \sim N(0, 0.05^2)\)

The outcomes are determined by

\[y = 1 + I_t + \eta\]

where \(I_t\) is an indicator for the patient having survived long enough to have treatment effects and \(\eta\) is \(N(0, 2^2)\). The \(\rho_{\epsilon, \eta}\) is set when the data is generated.

The function dataGen puts this all together.

dataGen <- function(cor, n = 1000, actuarialTable = actTable) {
  # generate uniform age values in the range of ages covered by the act tab
  data <- data.frame(
    age = round(runif(n, min(actuarialTable$age), max(actuarialTable$age))))
  # merge the data from the act tab and the generated ages
  data <- dplyr::select(actuarialTable, one_of("age", "yearsRemaining")) %>% 
    inner_join(data, by = ("age" = "age"))
  # compute the probability of living past treatment lag on average
  data$prTreatmentEffect <- pgamma(data$yearsRemaining, 20, 2)
  # generate error terms, mvrnorm using sigma^2 
  s1 <- 0.5
  s2 <- 1
  Sigma <- diag(c(s1^2, s2^2))
  # compute cov based on cor provided as input
  Sigma[lower.tri(Sigma)] <- cor * (s1) * (s2)
  Sigma[upper.tri(Sigma)] <- Sigma[lower.tri(Sigma)]
  err <- mvrnorm(n, c(0, 0), Sigma)
  # assign treatment based on prTreatmentEffect + noise
  data$treatment <- rbinom(n, 1, 
                           pnorm(qnorm(data$prTreatmentEffect) + err[, 1]))
  # assign outcome according to a very simple model
  data$outcome <- 1 + data$treatment + err[, 2]

Simulation in Parallel

I’m going to generate test sets with values of \(\rho\) ranging from -0.99 to 0.99 and I’m going to run 100 iterations on data generated at each of these levels. In each iteration, I’m going to compute the OLS estimator, the PS estimator and the Heckman MLE estimator.

However,as this takes a fair bit of time to run and in interest of computation time, I’m going to use the parallel package and parLapply to do this computation.

We need to nest dataGen inside the pfit function. Basically, a cluster is set of R jobs created by the makeCluster command. These jobs cannot “see” the same environment that you exist in but only what you pass them. In this case, it is easiest to pass them inside the function call.

The function pfit looks like:

pfit <- function(iter, r, n) {
  dataGen <- function(cor, n) {
    # from above
  fit <- function(r, n) {
    # similar to above but this time call dataGen inside fit
    data <- dataGen(r, n)
    # regression estimator
    linear <- coef(lm(outcome ~ treatment, data = data))[2]
    # propensity scores (using Matching package)
    model <- glm(treatment ~ cut(age, breaks = 20), data = data, 
                 family = binomial())$fitted.values
    ps <- Match(Y = data$outcome, Tr = data$treatment, X = model, ties = FALSE)$est
    # heckman mle
    hmle <- selection(data$t ~ cut(data$age, breaks = 20), 
                      list(data$outcome ~ 1, data$outcome ~ 1))
    hmle <- coef(hmle)[24] - coef(hmle)[21]
    # results
    return(c(linear, ps, hmle))
  results <- replicate(iter, fit(r, n))

Basically, pfit is passed arguments r and n where r is the correlation coefficient between the error in the selection model and outcome model. At this time, I’m forcing the errors for both outcome models and the correlation between those errors and the selection model to be equal, although in practice this need not be the case. As normal, n is the number of observations to generate.

fit then uses dataGen to generate n observations with r correlation and estimates a linear model of the outcome on treatment, a PS estimate of the treatment effect (using Match from the Matching package) and then estimates the Heckman selection model (aka Tobit-5) using the selection function of the sampleSelection package (as opposed to doing it as a two stage and doing it by hand).

The last line calls fit iter number of times using replicate and returns the data as a matrix.

To work with parLapply I wrote a second wrapper function parSim.

parSim <- function(cl, iter, r, n) {
  nw <- length(cl)
  results <- simplify2array(,
                                              rep(iter / nw, nw),
                                              fun = pfit,
                                              r = r, 
                                              n = n)))

This function takes a cluster cl and divides the iter iterations over the number of unique jobs in cl and runs fit the correct number of times on each job using parLapply. The simplify2array is simply to make the data easier to manipulate.

The results returned by parSim take the form of an array with each row being an estimator, each column being an iteration and each “slice” coming from one of the jobs.

I generated 100 replicates at 198 correlation coefficient values between \(-0.99\) and \(0.99\) (e.g., steps of \(0.01\)). While parSim does give about a 3-fold increase when the cluster is of size 4, it still takes about 60-90 minutes to run for the 198 values of \(\rho\) tested.

results <- NULL
s <- Sys.time()
for (r in seq(-0.99, 0.99, 0.01)) {
 run <- parSim(cl, iter = 100, r = r, n = 2000) 
 run <- data.frame(estimator = rep(c("LS", "PS", "HML"), 100),
                   estimate = as.numeric(run), 
                   r = r, n = 2000)
 results <- rbind(results, run)
e <- Sys.time()


First thing, first: see how the estimators perform as a function of the set value of \(\rho\):

ggplot(results, aes(x = r, y = estimate, color = estimator)) + 
  geom_point(alpha = 0.25) + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Estimate (True Value = 1)")

Generally, no big surprises here — the effect of the “omitted variable bias” in the regression and propensity scores models, given that neither accounts for the non-zero correlation is unsurprising and has the expected signs.

When the correlation is zero or nearly zero, all of the models have similar performance with estimates about 1. However, as the correlation moves away from zero, this is no longer the case. As the correlation between the error in the selection and outcome equations increases, the effect estimate is over-estimated (by 100% among the propensity scores and nearly 50% by the regression). The same pattern is seen with correlations going to \(-1\) with the propensity score, in extreme cases, even getting the sign wrong.

However, the red smear reflecting the Heckman style selection model estimates the treatment effect at nearly 1 with no sensitivity to the changing correlation.

The problems with the linear regression model and the PS model are even more clear when plotting the mean and the mean squared error at each correlation level:

performance <- group_by(results, estimator, r) %>% 
  summarize(mean = mean(estimate), lb = quantile(estimate, 0.025),
            ub = quantile(estimate, 0.975), mse = mean((1 - estimate)^2))
ggplot(performance, aes(x = r, y = mean, ymin = lb, ymax = ub, 
                        color = estimator)) + 
  geom_point() + 
  geom_ribbon(alpha = 0.10, aes(fill = estimator)) + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Mean Estimated Treatment Effect")

The mean estimated effect is denoted by the dots with a 95% bootstrapped confidence interval about the mean shown as the shaded region. The relative spread of the PS estimates relative to the two regression-style models is clear. The very high sensitivity to endogeneity in the treatment effect for the propensity scores models is also very visible. The linear regression model exhibits smaller CI but suffers from bias as correlation moves away from zero, if not to the same degree as the propensity scores.

The mean of the Heckman style selection estimates barely deviates from the true mean of \(1\).

The same information is visible in terms of the mean squared error.

ggplot(performance, aes(x = r, color = estimator, y = mse)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("Correlation") + 
  scale_y_continuous("Mean Squared Error")

The MSE for the propensity score and least squares models both exhibit the expected parabolic shape with a global minimum at correlation of zero. The robustness of the Heckman style estimator to endogeneity in the treatment estimate.

Code and Data

The result of the simulations can be downloaded here (2.4MB) as a csv. The script used to generate the data can be downloaded here, although the entire script is reproduced in the above post.


Endogenous relationships create real problems for proper and unbiased estimation of treatment effects and standard methods for dealing with selection on the observables, such as propensity scores and simple regression, fails to adequately deal with this problem. Heckman style selection models provide one means for dealing with this problem. Although they are not a panacea, these frameworks are a valuable tool for estimation in the presence of endogeneity due to correlation between the selection and outcome processes.

The Problem with Propensity Scores

15 Apr 2015

Are Propensity Scores Useful?

Effect estimation for treatments using observation data isn't always straight forward. For example, it is very common that patients who are treated with a certain medication or procedure are healthier than those who are not treated. Those who aren't treated may not be treated due to a higher risk of treatment in their condition or a reduced expected value of treatment (e.g., a statin does not make sense as a new treatment for a patient who is 100 years old due to other causes of mortality). If you simply compared the the difference between the means of some response among the treated and untreated populations, the estimate would be confounded by the differences between the populations.

One method of dealing with this problem was to match on the observed characteristics thought to be relevant. Studies would find a control patient that matched the case patient on age, gender and a number of other factors. The problem with this approach is that it is expensive and impractical to match on anything more than a few factors due to the curse of dimensionality. To avoid “throwing away” unmatched cases, the number of required candidate controls to be sampled increases exponentially with the number of matched factors.

Propensity scores are a possible alternative to this problem. Instead of matching points either directly or using a nearest-neighbor method in \(k\) dimensions, a model can be constructed to express the propensity (e.g., probability) to be treated as a function of the observed variables. Then, matching can be done in 1 dimension with this propensity. By reducing the \(k\) dimensional problem to 1 dimension, it is possible to increase the number of matched factors and to also use the collected data more effectively. The analysis of the matched groups can be done using a relatively simple method, such as a t test.

While much can be said for reducing the complexity of the analysis to a simple bivariate test and for the potential relatively greater ease for modeling treatment choice as opposed to outcome, the value of a PS-adjusted bivariate test isn't clear to me. To explore the question in more detail, I did a few simulations in R to model how well a OLS estimator and a PS adjusted estimator did at estimating a treatment effect under the “best case” conditions.


Propensity scores are typically estimated by fitting a binomial GLM with treatment status (0 or 1) as the outcome and all other covariates being predictors. Lets generate a set of 10 predictors, treatment and the response variable. The predictors will all be iid normal and will be the only non-white noise elements in the determination of treatment status. I selected an error term with a standard deviation of 5 empirically.

X <- replicate(10, rnorm(1000))
error <- rnorm(1000, 0, 5)
probTreatment <- exp(cbind(1, X) %*% rep(1, 11) + error) / 
  (1 + exp(cbind(1, X) %*% rep(1, 11) + error))

The vector probTreatment gives us \(\text{Pr}(T = 1 | x_1, x_2, \ldots, x_{10})\). If we use rbinom, we can randomly assign treatment based on the different probabilities:

t <- rbinom(1000, 1, probTreatment)

Now that we have treatment assigned, lets generate the response variable.

y <- 1 + X %*% rep(1, 10) + t + rnorm(1000, 0, 5)

As mentioned above, simply comparing the values of \(y\) for the treated and untreated groups won't produce the right estimates.

t.test(y[t == 1], y[t == 0])
##  Welch Two Sample t-test
## data:  y[t == 1] and y[t == 0]
## t = 8.8739, df = 945.607, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.421779 3.797091
## sample estimates:
##  mean of x  mean of y 
##  2.7115980 -0.3978372

The estimated treatment effect is 3.11, a 310.94% over estimate of the true value of 1.

However, if we estimate a propensity score model and match on the nearest unmatched control subject to the difference between the matches being less than 20% of the standard deviation of the pooled propensity scores, we get a better estimate. First, the model:

ps <- glm(t ~ X, family = binomial())
psValues <- predict(ps)
psTreated <- psValues[t == 1]
psUntreated <- psValues[t == 0]

The next step is to match subject to the 20% of the standard deviation limit:

# find matches, require be within 20% of a sd of the untransformed pooled ps
matches <- matrix(NA, ncol = 3, nrow = NROW(y[t == 1]))
limit <- 0.2 * sd(predict(ps))
for (i in 1:NROW(y[t == 1])) {
  x <- psTreated[i]
  distance <- (x - psUntreated)^2
  if (min(distance) <= limit) {
    nn <- which.min(distance)
    psUntreated[nn] <- .Machine$integer.max
    matches[i, ] <- c(i, nn, distance[nn]) 
  } else {
    matches[i, ] <- c(i, NA, NA)

The loop takes each treated patient and searches for an unmatched untreated patient near the treated patient's propensity score and within the constraint. Once matched, the propensity score for the untreated patient is set to .Machine$integer.max to prevent further matches to the same patient (setting the value to NA introduces problems and .Machine$integer.max is safe enough). If there is no possible match subject to the constraints, the treated patient is matched to NA to reflect this fact.

Using the matches, we can construct two new vectors, yTreatedMatched and yUntreatedMatched for the t test. Based on propensity score theory, and true in this simulation, conditional on the observed characteristics the assignment to treatment or not treatment is random. As a result, these two groups are similar to what you might see in an RCT (in theory) and so the differences should be an unbiased and accurate estimate of the treatment effect.

# throw out the unmatched patients
matches <- matches[![, 2]), ]
yTreatedMatched <- y[t == 1][matches[, 1]]
yUntreatedMatched <- y[t == 0][matches[, 2]]
t.test(yTreatedMatched, yUntreatedMatched)
##  Welch Two Sample t-test
## data:  yTreatedMatched and yUntreatedMatched
## t = 2.2593, df = 654.727, p-value = 0.0242
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1224774 1.7492684
## sample estimates:
## mean of x mean of y 
## 1.4946742 0.5588013

And, as expected, the propensity score matched estimator produces an accurate measure of the treatment effect, 0.94, compared to a true effect of 1. The result is also statistically significantly different from zero.

However, it required that we threw away data. We “collected” data on 565 treated patients but with 435 untreated patients, we were only able to match 330, or 58.41%.

Additionally, had we instead used a OLS regression model, we could have produced similar estimates.

model <- lm(y ~ X + t)
## Call:
## lm(formula = y ~ X + t)
## Coefficients:
## (Intercept)           X1           X2           X3           X4  
##      1.1242       1.0259       0.7914       0.5494       0.8666  
##          X5           X6           X7           X8           X9  
##      0.7482       0.8879       0.9435       1.2965       0.8275  
##         X10            t  
##      0.6487       0.7250

Moreover, the 95% CI produced using the OLS method is smaller ranging from 0.05 to 1.4 compared to the t test CI ranging from 0.12 to 1.75.

The next step is to see how this holds up for a variety of values of \(n\) and sampling ratios (e.g., 2 controls per case vs 1 control per case). To this end, the function psSim simulates a propensity score adjusted model. It works by constructing a large population of size n with k covariates and 1 treatment variable. From this large population, sampleN cases are selected at random and ratio * sampleN controls are selected. (I'm using case and control here to mean treated vs untreated, not in the strict sense of a [case-control study] (

From these sampled cases and controls, a propensity score model is estimated and the treatment effect is estimated using the matched populations. The function returns a 2 element vector, the first element is the estimated treatment effect and the second is the p value from the t test.

The function is below:

psSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors to the propensity score model
  pErr <- rnorm(n, 0, 5)  
  pt <- exp(1 + X %*% rep(1, k) + pErr)/(1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)
  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr
  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))
  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]), 
                                           ratio * sampleN), ]
  # estimate propensity scores using all k X predictors
  sample <- rbind(treatedSample, untreatedSample)
  ps <- glm(t ~ ., data = sample[-1], 
            family = binomial())
  # find untransformed fitted values
  psTreated <- predict(ps)[sample$t == 1]
  psUntreated <- predict(ps)[sample$t == 0]
  # find matches, require be within 20% of a sd of the untransformed pooled ps
  matches <- matrix(NA, ncol = 3, nrow = nrow(treatedSample))
  limit <- 0.2 * sd(predict(ps))
  for (i in 1:nrow(treatedSample)) {
    x <- psTreated[i]
    distance <- (x - psUntreated)^2
    if (min(distance) <= limit) {
      nn <- which.min(distance)
      psUntreated[nn] <- .Machine$integer.max
      matches[i, ] <- c(i, nn, distance[nn]) 
    } else {
      matches[i, ] <- c(i, NA, NA)
  # do analysis
  treatedY <- treatedSample[, 1]
  untreatedY <- untreatedSample[, 1]
  matches <- matches[![, 2]), ]
  matchedTreatedY <- treatedY[matches[, 1]]
  matchedUntreatedY <- untreatedY[matches[, 2]]
  test <- t.test(matchedTreatedY, matchedUntreatedY)
  results <- c(mean(matchedTreatedY - matchedUntreatedY), test$p.value)

I also wrote a second function that generates the population and sample in the same way but instead of using propensity score matching, the treatment effect is estimated directly with OLS. This function is called lmSim:

lmSim <- function(n, k, sampleN, ratio) {
  # generate k covariates
  X <- replicate(k, rnorm(n))
  # some errors to the propensity score model
  pErr <- rnorm(n, 0, 5)  
  pt <- exp(1 + X %*% rep(1, k) + pErr)/(1 + exp(1 + X %*% rep(1, k) + pErr))
  # assign treatment status
  t <- rbinom(n, 1, pt)
  # generate response
  yErr <- rnorm(n, 0, 5)
  y <- 1 + X %*% rep(1, k) + t + yErr
  # put into big data frame
  data <- data.frame(y = y, t = t)
  data <- cbind(data, X)
  names(data) <- c("y", "t", paste(rep("x", k), 1:k, sep = ''))
  # generate sample from population
  treatedSample <- data[t == 1, ][sample(1:nrow(data[t == 1, ]), sampleN), ]
  untreatedSample <- data[t == 0, ][sample(1:nrow(data[t == 0, ]), 
                                           ratio * sampleN), ]
  sampleData <- rbind(treatedSample, untreatedSample)
  # fit OLS model
  model <- summary(lm(y ~ ., data = sampleData))
  results <- c(model$coefficients[2, 1], model$coefficients[2, 4])

Using these two functions, I ran each estimator through 100 iterations for sample sizes of 50 to 1000 cases and control sets of 1 to 2 times larger. If you are running this at home, it takes a while. There is a second post that discusses running this code in parallel (using the parallel package) that also does a little optimization to the code as well (both technically and statistically) that is worth checking out, otherwise, be ready to wait a while (about 45-60 minutes on 3.0 GHz CPU).

psMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
psP <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsMean <- matrix(NA, nrow = 100 * 3, ncol = 100)
lsP <- matrix(NA, nrow = 100 * 3, ncol = 100)
i <- 1
for (n in seq(50, 1000, length.out = 100)) {
  for (ratio in c(1, 1.5, 2)) {
    psResults <- replicate(100, psSim(10000, 10, n, ratio))
    lmResults <- replicate(100, lmSim(10000, 10, n, ratio))
    psMean[i, ] <- psResults[1, ]
    psP[i, ] <- psResults[2, ]
    lsMean[i, ] <- lmResults[1, ]
    lsP[i, ] <- lmResults[2, ]
    i <- i + 1

And some data munging to get it into a reasonable format for plotting:

results <- data.frame(estimatedEffect = c(as.numeric(psMean), 
results$ratio <- c(1, 1.5, 2)
results$n <- rep(seq(50, 1000, length.out = 100), each = 3)
results$method <- rep(c("PS", "LS"), each = 300 * 100)
results$p <- c(as.numeric(psP), as.numeric(lsP))
results$sig <- results$p <= 0.05


If we plot the estimate against the sample size (of cases):

ggplot(results, aes(x = n, y = estimatedEffect)) + 
  geom_point(aes(color = method), alpha = 0.25) + 
  facet_grid(ratio ~ method) + 
  scale_y_continuous("Estimated Treatment Effect") + 
  geom_hline(aes(yintercept = 1))

The code produces a 3x3 grid of plots, the left plots are the estimates from OLS and the right from from the propensity score matched t test. The rows are the different ratios of control to case sampling (1, 1.5 and 2).

It is clear that both estimators produce reasonably unbiased estimates of the true treatment effect of 1. Furthermore, both appear to converge to the true value of 1 as the sample size goes to infinity.

However, it looks like the range of the estimates of the OLS estimator may be smaller at a given value of \(n\) than the propensity scores estimator. To this end, I calculate the mean squared error (MSE) of the estimate from the true value of 1 for both methods at each sample size.

# from dplyr
mse <- summarise(group_by(results, method, n, ratio), mean((estimatedEffect - 1)^2))
names(mse) <- c("method", "n", "ratio", "mse")

When plotted

ggplot(mse, aes(x = n, y = mse, color = method)) + 
  geom_point() + 
  facet_grid(ratio ~ .) + 
  scale_x_continuous("N") + 

and especially when just a smoother is shown, not the actual points,

ggplot(mse, aes(x = n, y = mse, color = method)) + 
  geom_smooth(method = "loess") + 
  facet_grid(ratio ~ .) + 
  scale_x_continuous("N") + 

it becomes clear that the MSE of a regression method is smaller than the typical MSE of the propensity score method for a fixed \(k\) as \(n\) varies.

Unsurprisingly, the higher MSE also shows up in the power testing. If we look at the p values of the tests:

ggplot(results, aes(x = n, y = p)) + 
  geom_point(aes(color = method), alpha = 0.25) + facet_grid(ratio ~ method) + 
  scale_y_continuous("Observed p value") + geom_hline(aes(yintercept = 0.05))

As we known that the true treatment effect is 1, we should expect fairly small p values. Both modeling frameworks appear to give similar results, although, as with the estimate and MSE plots above, relative differences in performance emerge as we consider the power curve.

If we formalize this with an estimated power curve:

power <- group_by(results, method, n, ratio) %>%
  summarize(power = mean(sig))
ggplot(power, aes(x = n, y = power, color = method)) + 
  geom_point(aes(color = method)) + facet_grid(ratio ~ method) + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("N") + 

And again but with just the LOESS fit and no points:

ggplot(power, aes(x = n, y = power * 100, color = method)) + 
  facet_grid(ratio ~ .) + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("N") + 

We see a relative difference between the two methods with the difference most pronounced at larger values of \(k\).

ps <- power[power$method == "PS", ]
ols <- power[power$method == "LS", ]
diffPower <- data.frame(ratio = ps$ratio,
                        diff = ols$power - ps$power,
                        n = ps$n)
diffPower$`ls more powerful` <- diffPower$diff > 0
ggplot(diffPower, aes(x = n, y = diff * 100)) +
  geom_point(aes(color = `ls more powerful`)) + 
  facet_grid(ratio ~ .) + 
  geom_smooth(method = "loess", alpha = 0) + 
  scale_x_continuous("N") + 
  scale_y_continuous("Percentage Point Difference in Power")

In general, propensity score methods for a fixed \(k\) over a varied \(n\) are less powerful than regression methods at the same values of \(n\) and \(k\). These differences become small as \(n\) gets very large, say 500-750, but for smaller values of \(n\) they can make a meaningful difference. A 10-20 percentage point increase in power by simply using regression over matching is hard to turn down.

The benefits of using linear models becomes more clear when you look at the variance of \(k\). With the same \(n\) and increasing \(k\), the linear model becomes increasingly powerful, to as much as 10 percentage points greater power. When the ratio of cases and controls gets larger, the power of regression is clearer than ever.

There are other problems with propensity scores. The propensity score models in this simulation were “perfect.” Conditional on all the observed variables, treatment choice was random, as our model was an exact restatement of the data generating process. This is unlikely to be the case in reality. Few, if any, papers evaluate the performance of their propensity score model. Many don't even provide information about what went into the model and the model selection, much less a cross-validation MSE.

They offer no improvement over regression in terms of unobserved heterogeneity. The scores are functionally identical to doing a regression and so suffer from all the relevant drawbacks.

Finally, propensity scores are computationally more expensive than regression. The file k.csv contains timings from a run of the propensity score and regression methods on similar data sets with a fixed value of \(n = 1000\) and \(k\) ranging from 10 to 100 done with 100 replications at each combination on my Linode VPS.

timings <- read.csv("/home/jacob/documents/blog/ps/k.csv")
# have to remove an outlier
timings <- timings[timings$k != 10, ]
timings$method <- ifelse(timings$m == 0, "LS", "PS")
ggplot(timings, aes(x = k, y = t, color = method)) + geom_point() + 
  geom_smooth(method = "loess") + 
  scale_x_continuous("K") + 
  scale_y_continuous("Time / s")

Regression is nearly always much faster than propensity score matching. More importantly, as \(k\) increases with a fixed \(n\), the cost of regression increases approximately linearly. The non-linear increase with respect to \(k\) is scary — a model that estimates and then matches on 500 predictors could take hours to fix, or seconds with a OLS model.

This makes sense, lm() and similar functions hand off the computation to highly optimized numerical linear algebra routines. No matching function, even if written in a lower level language like C or C++ via Rcpp will be nowhere near as fast and optimized as LINPACK/LAPACK/BLAS.

It would be hard, if not impossible, to match a propensity score method as fast as regression for two reasons. The propensity score method already does the regression in order to estimate the propensity scores. The matching then adds on an additional and greater cost.


This added cost is just that, added cost. It doesn't provide a power benefit (if anything, it comes with a power penalty), computational benefit or reduction in complexity. While PS methods work, there is practically no circumstance in which they work and regression methods do not work at least as well.