Easy Cross Validation in R with `modelr`

When estimating a model, the quality of the model fit will always be higher in-sample than out-of-sample. A model will always fit the data that it is trained on, warts and all, and may use those warts and statistical noise to make predictions. As a result, a model that performs very well on a data set may perform poorly when used more generally.

Since we often can’t fit a model, put it in use and determine how well it performs (and probably wouldn’t want to even if we could), we need some methods for predicting out-of-sample error when only using the data we have.

The simplest method is to randomly divide the data into a training set and a test set. However, this provides only one test set. If you train a model, test it and then train a new model and re-test it using the same test set and see an improvement, it may have come from the model actually being better or it may come from you over-fitting the test set.

One possible way to address this problem is to use something like k-fold cross validation. The general idea is is to use some portion of your data (typically 90%) to fit a model and then test that model on the withheld portion (typically 10%) to assess the quality of the model fit. You would repeat this process over and over again until each point has been used to estimate the model \(k-1\) times and been included in the test set 1 time.

This can be done with a loop or a member of the apply family, but I’m going to use purrr to write functional code to fit the models and modelr to evaluate the models.

My “preamble” looks like this:

Data and the Model

For this example, I’m going to use the cars dataset from the caret package. It looks like this:

data("cars", package = "caret")
glimpse(cars)
## Observations: 804
## Variables: 18
## $ Price       <dbl> 22661.05, 21725.01, 29142.71, 30731.94, 33358.77, ...
## $ Mileage     <int> 20105, 13457, 31655, 22479, 17590, 23635, 17381, 2...
## $ Cylinder    <int> 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Doors       <int> 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Cruise      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Sound       <int> 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,...
## $ Leather     <int> 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,...
## $ Buick       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Cadillac    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Chevy       <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Pontiac     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Saab        <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Saturn      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ convertible <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ coupe       <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hatchback   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ sedan       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ wagon       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

We want to build a model to predict the price of the car. We have the mileage, number of cylinders, size, some features (cruise control, sound system, leather seats), brand and then the style of car. The data already has dummy variables constructed but we need to drop one in order to estimate the model (e.g., set it as the reference level). We’ll drop the brand Chevy indicator (setting Chevy to the reference brand) and set the reference style to sedan. We’ll also drop the doors variable as that is perfectly co-linear with the coupe indicator and styles likely contain more information than the number of doors.

Our final model will look something like this:

select(cars, -Doors, -Chevy, -sedan) %>% 
  lm(Price ~ Milage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac + 
  Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon, data = .)

The modelr package has a useful tool for making the cross-validation folds. crossv_kfold will divide the data into \(k\) (by default 10) folds and returns a tibble with the list-columns of train (the training data for the fold), test (the test set for the fold) and .id (the fold number). For more details on the idea of list-columns, see the Many Models Chapter in R for Data Science by Garrett Grolemund and Hadley Wickham.

cars_cv <- cars %>%
  select(-Doors, -Chevy, -sedan) %>%
  crossv_kfold(10)
cars_cv
## # A tibble: 10 × 3
##             train           test   .id
##            <list>         <list> <chr>
## 1  <S3: resample> <S3: resample>    01
## 2  <S3: resample> <S3: resample>    02
## 3  <S3: resample> <S3: resample>    03
## 4  <S3: resample> <S3: resample>    04
## 5  <S3: resample> <S3: resample>    05
## 6  <S3: resample> <S3: resample>    06
## 7  <S3: resample> <S3: resample>    07
## 8  <S3: resample> <S3: resample>    08
## 9  <S3: resample> <S3: resample>    09
## 10 <S3: resample> <S3: resample>    10

You might notice that instead of the elements being data.frame or tibble, they are resample objects. resample objects are cool because they are lazy. Instead of making 10 copies of cars, crossv_kfold() returned what are essentially pointers to the original data. So instead of storing the data 10 times, we store it once plus a little bit for the pointers. Indeed, we can see that in the object size:

pryr::object_size(cars)
## 63.6 kB
pryr::object_size(cars_cv)
## 93.7 kB

Awesome, right?

Now it is time to fit the model. It all can be down within the very tidy tidyverse (see what I did there?). The code below uses purrr:map() to fit the model for each one of the 10 folds and returns the RMSE on the test set.

cars_cv %>% 
  mutate(train = map(train, as_tibble)) %>%
  mutate(model = map(train, ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + 
                                   Leather + Buick + Cadillac + Pontiac + 
                                   Saab + Saturn + convertible + coupe + 
                                   hatchback + wagon, 
                                 data = .))) %>%
  mutate(rmse = map2_dbl(model, test, rmse)) %>%
  select(.id, rmse)
## # A tibble: 10 × 2
##      .id     rmse
##    <chr>    <dbl>
## 1     01 2372.046
## 2     02 2844.509
## 3     03 3398.572
## 4     04 3165.065
## 5     05 2838.691
## 6     06 2675.153
## 7     07 3720.843
## 8     08 2692.567
## 9     09 2362.007
## 10    10 3016.648

It is reasonably fast too.

ten_fold <- function(x) {
  x %>% 
    mutate(model = map(train, ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + 
                                   Leather + Buick + Cadillac + Pontiac + 
                                   Saab + Saturn + convertible + coupe + 
                                   hatchback + wagon, 
                                 data = .))) %>%
  mutate(rmse = map2_dbl(model, test, rmse)) %>%
  select(.id, rmse)
}

microbenchmark::microbenchmark(ten_fold(cars_cv))
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max
##  ten_fold(cars_cv) 45.47155 50.18167 61.01467 53.27679 60.67367 139.0201
##  neval
##    100

microbenchmark puts it a median time of about 50 ms. We can probably make it even faster though. We don’t actually need to store the model and we can do it all in one step with a function.

fitter <- function(training_data, testing_data) {
  lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac + 
       Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon, 
     data = training_data) %>% 
    rmse(testing_data)
}

ten_fold_2 <- function(x) {
  x %>% 
    mutate(rmse = map2_dbl(train, test, fitter)) %>%
    select(.id, rmse)
}

microbenchmark::microbenchmark(ten_fold_2(cars_cv))
## Unit: milliseconds
##                 expr     min       lq     mean   median       uq      max
##  ten_fold_2(cars_cv) 46.1202 49.16919 52.21734 51.25691 53.72752 100.7599
##  neval
##    100

Hey, that saved 2 or 3 milliseconds on the median. Totally worth it. (If your model fit takes a long time, it is probably worth it).

It is possible to parallelize the model fitting using multidplyr as the backend for dplyr; however, multidplyr doesn’t seem to work well with the resample objects. The easiest solution is to store nine copies of the data to allow for the cross-validation to work most simply - but I’m more likely to hurt for shortage of memory than shortage of CPU cycles/time when dealing with health care administrative data. Alternatively, you could use do() and multidplyr to filter by fold ID, fit a model, compute RMSE and proceed from there, but that solution is much less simple.

Packages like caret do provide powerful tools for cross-validation; however, I like the ability to keep my entire workflow smooth and within the tidyverse. I found that, in general, I prefer to use the modelr tools and do the model fitting the “old fashioned way.”

Tags: r modelr tidyverse cross_validation ml

Related Posts