Poor Donald - his tweets keep getting more negative

Last summer, David Robinson did this interesting text analysis of Donald Trump’s tweets and found that the angrier ones came from an Android device (which Trump is known to use). But he didn’t consider how Trump’s emotional state varies over time, and he certainly couldn’t have known what impact the election and more recent events would have on Trump.

Using the twitteR package and the tidyverse ecosystem (plus tidytext), this is actually a very simple analysis.

For starters, pulling Trump’s most recent tweets (the API limits us to roughly the last 3,200) is very simple:

library(twitteR)
library(tidyverse)
library(tidytext)

source("~/twitter_key.R")
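# twitter_key.R is assumed to define api_key, api_secret, access_token and
# access_token_secret, the objects passed to setup_twitter_oauth() below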

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"
trump <- userTimeline("realDonaldTrump",
                      n = 3100,
                      includeRts = TRUE,
                      excludeReplies = FALSE) %>%
  twListToDF() %>%
  as_tibble()

And then we have a tidy tibble with Trump’s tweets:

glimpse(trump)
## Observations: 3,099
## Variables: 16
## $ text          <chr> "Heading to Joint Base Andrews on #MarineOne wit...
## $ favorited     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount <dbl> 77699, 85576, 71312, 220083, 64348, 84125, 62284...
## $ replyToSN     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ created       <time> 2017-02-10 23:24:51, 2017-02-10 13:35:50, 2017-...
## $ truncated     <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ replyToSID    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ id            <chr> "830195857530183684", "830047626414477312", "830...
## $ replyToUID    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ statusSource  <chr> "<a href=\"http://twitter.com/download/iphone\" ...
## $ screenName    <chr> "realDonaldTrump", "realDonaldTrump", "realDonal...
## $ retweetCount  <dbl> 21473, 19779, 15069, 64363, 10082, 14185, 11294,...
## $ isRetweet     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Using tidytext, it is straightforward to unnest and tokenize the words in the body of the tweets:

words <- trump %>%
  select(id, statusSource, retweetCount, favoriteCount, created, isRetweet,
         text) %>%
  unnest_tokens(word, text)
words
## # A tibble: 57,239 x 7
##                    id
##                 <chr>
## 1  830195857530183684
## 2  830195857530183684
## 3  830195857530183684
## 4  830195857530183684
## 5  830195857530183684
## 6  830195857530183684
## 7  830195857530183684
## 8  830195857530183684
## 9  830195857530183684
## 10 830195857530183684
## # ... with 57,229 more rows, and 6 more variables: statusSource <chr>,
## #   retweetCount <dbl>, favoriteCount <dbl>, created <time>,
## #   isRetweet <lgl>, word <chr>

Given what David Robinson found, we might want to convert the statusSource variable into a flag for whether it was posted via an Android device:

words <- words %>%
  mutate(android = stringr::str_detect(statusSource, "Android")) %>%
  select(- statusSource)
words
## # A tibble: 57,239 x 7
##                    id retweetCount favoriteCount             created
##                 <chr>        <dbl>         <dbl>              <time>
## 1  830195857530183684        21473         77699 2017-02-10 23:24:51
## 2  830195857530183684        21473         77699 2017-02-10 23:24:51
## 3  830195857530183684        21473         77699 2017-02-10 23:24:51
## 4  830195857530183684        21473         77699 2017-02-10 23:24:51
## 5  830195857530183684        21473         77699 2017-02-10 23:24:51
## 6  830195857530183684        21473         77699 2017-02-10 23:24:51
## 7  830195857530183684        21473         77699 2017-02-10 23:24:51
## 8  830195857530183684        21473         77699 2017-02-10 23:24:51
## 9  830195857530183684        21473         77699 2017-02-10 23:24:51
## 10 830195857530183684        21473         77699 2017-02-10 23:24:51
## # ... with 57,229 more rows, and 3 more variables: isRetweet <lgl>,
## #   word <chr>, android <lgl>

Let’s now score the words in the tweets using the afinn sentiment lexicon:

words <- words %>%
  inner_join(get_sentiments("afinn"))
## Joining, by = "word"
words
## # A tibble: 5,093 x 8
##                    id retweetCount favoriteCount             created
##                 <chr>        <dbl>         <dbl>              <time>
## 1  830047626414477312        19779         85576 2017-02-10 13:35:50
## 2  830047626414477312        19779         85576 2017-02-10 13:35:50
## 3  830042498806460417        15069         71312 2017-02-10 13:15:27
## 4  829721019720015872        10082         64348 2017-02-09 15:58:01
## 5  829721019720015872        10082         64348 2017-02-09 15:58:01
## 6  829689436279603206        14185         84125 2017-02-09 13:52:31
## 7  829689436279603206        14185         84125 2017-02-09 13:52:31
## 8  829689436279603206        14185         84125 2017-02-09 13:52:31
## 9  829689436279603206        14185         84125 2017-02-09 13:52:31
## 10 829689436279603206        14185         84125 2017-02-09 13:52:31
## # ... with 5,083 more rows, and 4 more variables: isRetweet <lgl>,
## #   word <chr>, android <lgl>, score <int>

And now let’s see how the typical sentiment of those tweets has varied from April 2016 (the midst of the Republican primary) to the present:

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment)) + 
  geom_smooth() + 
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) + 
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) + 
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) + 
  labs(x = "Date", y = "Mean Afinn Sentiment Score")

The vertical lines mark the date he became the presumptive Republican nominee (May 3rd, 2016), election day (November 8th, 2016) and inauguration day. Things aren’t looking up for Trump. He seems to be more angry/sad/negative now than at any prior point during the past year.

What if we split the tweets by whether or not they were posted from an Android device:

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment, color = android)) + 
  geom_smooth() + 
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) + 
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) + 
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) + 
  labs(x = "Date", y = "Mean Afinn Sentiment Score")

We see the general trend that David Robinson identified - the Android tweets tend to be more negative than those from other platforms. It is interesting that they were more positive than the tweets presumed to be by staff right before the election. We can also see that the non-Android tweets became more positive during the transition while the Android tweets clearly became more negative. Perhaps the limits of Presidential power are stricter than he expected. It is interesting that the Android tweets are now more negative than positive, the first time this has occurred.

Interestingly, being positive or negative seems to have no effect on the number of retweets

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = retweetCount, color = android)) + 
  geom_smooth() + 
  geom_point() + 
  scale_y_log10() + 
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Retweets")
## Joining, by = "id"

or the number of favorites

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = favoriteCount, color = android)) + 
  geom_smooth() + 
  geom_point() + 
  scale_y_log10() + 
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Favorites")
## Joining, by = "id"

that a tweet gets.

Regression analysis suggests that more negative tweets get significantly more retweets, but also that this effect wears off over time (very, very slightly):

words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(retweetCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
## 
## Call:
## lm(formula = log(retweetCount) ~ created * (sentiment < 0), data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7744 -0.3806  0.0005  0.3576  3.2661 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.012e+02  3.942e+00 -25.679  < 2e-16 ***
## created                    7.488e-08  2.680e-09  27.939  < 2e-16 ***
## sentiment < 0TRUE          1.959e+01  6.086e+00   3.219  0.00132 ** 
## created:sentiment < 0TRUE -1.313e-08  4.135e-09  -3.175  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5923 on 1198 degrees of freedom
## Multiple R-squared:  0.5195, Adjusted R-squared:  0.5183 
## F-statistic: 431.7 on 3 and 1198 DF,  p-value: < 2.2e-16

A similar pattern exists for the number of favorites

words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(favoriteCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
## 
## Call:
## lm(formula = log(favoriteCount) ~ created * (sentiment < 0), 
##     data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75782 -0.35691 -0.00795  0.33800  2.48914 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.176e+02  3.452e+00 -34.068  < 2e-16 ***
## created                    8.689e-08  2.347e-09  37.020  < 2e-16 ***
## sentiment < 0TRUE          1.435e+01  5.329e+00   2.692  0.00721 ** 
## created:sentiment < 0TRUE -9.648e-09  3.621e-09  -2.664  0.00781 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5187 on 1198 degrees of freedom
## Multiple R-squared:  0.6525, Adjusted R-squared:  0.6517 
## F-statistic: 749.9 on 3 and 1198 DF,  p-value: < 2.2e-16

The positive and negative words used in the Android postings varied between the pre-election period, the transition and the time after Trump was sworn in:

words %>%
  filter(android) %>%
  mutate(phase = ifelse(as.POSIXct("2016-11-08") > created, "Before the election",
                        ifelse(as.POSIXct("2017-01-20") > created, "Transition",
                                          "In the White House"))) %>%
  group_by(phase, pos_sentiment = score >= 0, word) %>%
  count() %>%
  group_by(phase, pos_sentiment) %>%
  filter(word != "no") %>%
  top_n(3, wt = n) %>%
  arrange(pos_sentiment, phase, desc(n))
## Source: local data frame [18 x 4]
## Groups: phase, pos_sentiment [6]
## 
##                  phase pos_sentiment      word     n
##                  <chr>         <lgl>     <chr> <int>
## 1  Before the election         FALSE       bad    62
## 2  Before the election         FALSE dishonest    27
## 3  Before the election         FALSE    rigged    25
## 4   In the White House         FALSE       bad    10
## 5   In the White House         FALSE      fake    10
## 6   In the White House         FALSE       ban     6
## 7           Transition         FALSE       bad    13
## 8           Transition         FALSE     wrong    11
## 9           Transition         FALSE dishonest    10
## 10 Before the election          TRUE     great   175
## 11 Before the election          TRUE     thank    69
## 12 Before the election          TRUE       big    54
## 13  In the White House          TRUE     great     8
## 14  In the White House          TRUE       big     5
## 15  In the White House          TRUE       win     5
## 16          Transition          TRUE     great    56
## 17          Transition          TRUE       big    16
## 18          Transition          TRUE       win    14

We have the fake news to thank for “fake” making its debut after he was sworn in. At least the election was no longer “rigged” after he won it.

readr::problems() returns tidy data!

A handy little trick I picked up today when using readr.

Some background: I needed a mapping between ZIP Code Tabulation Areas (ZCTAs) and counties (to link to some urban/rural data). The Census Bureau provides a CSV-style table that includes information about each ZCTA (e.g., size, population, area by land/water type) and the FIPS codes for the state and county.

However, when I load that data using the readr package:

library(tidyverse)
zcta_to_county_mapping <- read_csv("http://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt") %>%
  select(ZCTA5, STATE, COUNTY) %>%
  mutate(STATE = as.numeric(STATE),
         COUNTY = as.numeric(COUNTY))
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   ZCTA5 = col_character(),
##   COUNTY = col_character(),
##   COAREA = col_double(),
##   COAREALAND = col_double(),
##   ZPOPPCT = col_double(),
##   ZHUPCT = col_double(),
##   ZAREAPCT = col_double(),
##   ZAREALANDPCT = col_double(),
##   COPOPPCT = col_double(),
##   COHUPCT = col_double(),
##   COAREAPCT = col_double(),
##   COAREALANDPCT = col_double()
## )
## See spec(...) for full column specifications.
## Warning: 1592 parsing failures.
##  row        col   expected     actual
## 1303 ZAREA      an integer 3298386447
## 1303 ZAREALAND  an integer 3032137295
## 1304 AREAPT     an integer 2429735568
## 1304 AREALANDPT an integer 2262437812
## 1304 ZAREA      an integer 3298386447
## .... .......... .......... ..........
## See problems(...) for more details.

It produces a warning. Looking at the few rows it returned, it seems likely that the errors are coming from overflow - read_csv() guessed that the variable was of type integer (32 bits, max value of \(2^{31} - 1\) or 2,147,483,647), but some of these values are larger than that. I looked up a few of them and saw that they were all occurring in large, unpopulated areas. One of them (ZIP code 04462) is described by UnitedStatesZipCodes.org as covering “an extremely large land area compared to other ZIP codes in the United States.”

So that seems like the source of the issue - but there were 1,592 failures! I want to make sure those failures don’t affect the variables that I’m interested in. I noticed the error message says to use problems() to see more details. I did as I was told, expecting something about as useful as the results of warnings(), but was pleased to get back a tbl_df!

Checking to make sure the errors didn’t affect my variables of interest (ZCTA5, STATE and COUNTY) was as easy as

problems(zcta_to_county_mapping) %>%
  filter(col %in% c("ZCTA5", "STATE", "COUNTY"))
## # A tibble: 0 × 4
## # ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
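If the overflowing columns did matter, one fix (a sketch, not something this analysis needed) would be to tell readr up front to parse the numeric columns as doubles, which can hold these large area values without overflowing:

zcta_full <- read_csv("http://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt",
                      col_types = cols(.default = col_double(),
                                       ZCTA5 = col_character(),
                                       STATE = col_character(),
                                       COUNTY = col_character()))
problems(zcta_full)  # should now come back empty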

I love when tools make life easier! Even the error handling returns tidy data!

Inter-ocular trauma test

I’ve recently been thinking about the role statistics can play in answering questions. I think it came up on the NSSD podcast a few weeks ago. Basically, problems can be divided into three classes:

  • those that don’t need statistics because the answer is obvious (problems without much confounding and a strong signal to noise ratio)
  • those that need statistics because there is noise/confounding
  • those that can’t be helped by statistics because the noise/confounding is so large or the signal is very small or the question is fundamentally unanswerable with statistics

I think this is a useful idea, and I’ve been applying a similar system for years under the name of the “inter-ocular trauma test.” (I learned the term in my intro biostats class; I’ve since tracked it down to a talk by Joe Berkson in the 50s or 60s.) The idea is that sometimes the result is so obvious that it hits you between the eyes, hence causing inter-ocular trauma.

I like this idea and often use it as a crude filter when looking at visualizations or casually reading some analysis. Careful statistical analysis is often required to tease out the signal, and backing up the inter-ocular trauma test with a formal analysis is always worthwhile, but that doesn’t stop the test from being useful all the same.

One of the reasons I really like it is how simple and easy it is to understand. For example, where I live (Iowa City, Iowa) there is a small “town” of about 1,000 people located within the larger city. This small town is famous for being a speed trap - a major road passes through the town, and right where it enters, the speed limit drops from 35 mph to 25 mph while the road heads downhill. The town employs 5 police officers (about twice what would be expected for a comparable city) and issues a lot of speeding tickets.

A recent article in the local paper reported that the town issues more tickets than other area departments; the reporter compiled the number of tickets issued in the past year along with the population of each town.

The town of University Heights is located entirely within Iowa City. When asked about the difference, the University Heights police chief said that the comparison isn’t fair because they have a major thoroughfare and lots of people tailgate in the area during home football games. (Never mind that the stadium is in Iowa City and the only way to enter University Heights is to travel through Iowa City for several miles.)

The chief also argues that normalizing by population isn’t fair because the town has so many commuters. However, University Heights also issued the greatest raw number of tickets (907 vs. 617, 802 or 366), so that argument doesn’t really hold either.

It all makes sense in the context of incentives: on average, a ticket returns $81 to the city’s pocket and saves the average taxpayer about $66 a year (or a household of four about $263 a year) in property taxes. Ticket revenue pays for 6% of all city services.

The inter-ocular trauma test seems to confirm this town’s reputation as a speed trap. Sure, a more careful analysis looking at when the stops occur (is the football-game excuse real? morning rush hour?) would be stronger, but it would also be a bit of overkill for the question at hand.

My advice? The inter-ocular trauma test, much like the sniff test for spoiled milk, will get you far. And if you travel through University Heights, go 20 just to be safe.

Using tidytext to make sentiment analysis easy

Last week I discovered the R package tidytext and its very nice e-book detailing usage. Julia Silge and David Robinson have significantly reduced the effort it takes for me to “grok” text mining by making it “tidy.”

It certainly helped that a lot of the examples are from Pride and Prejudice and other books by Jane Austen, my most beloved author. Julia Silge’s examples on her blog doing NLP and sentiment analysis alone would have made me a life-long fan. The gifs from P&P (mostly the 1995 mini-series, to be honest) on her posts and the references in the titles made me very excited. My brain automatically started playing the theme and made me smile.

Okay, enough of that. Moving on.

Seeing her work, I started wondering what I could model to get some insight into my own life. I have a database of 92,372 text messages (basically every message sent to or from me from sometime in 2011/2012 to 2015), but text messages are weird (lots of “lol”s and “haha”s). I think there are some interesting insights there, but probably not what I wanted to cover today.

So I started thinking about what other plain-text data I had that might be interesting. Then I realized that my 149-page dissertation (excluding boilerplate and references) is written in LaTeX (so easy to parse) and split across 5 files that map directly to the chapters (intro, lit review, methods, results and discussion). I could do something with that!

My thesis is currently under embargo while I chop it into its respective papers (one under review, one soon to be under review and one undergoing a final revision. So close.), so I can’t link to it. However, it relates the seasonality of two infectious diseases to local weather patterns.

I wonder how my sentiment changes across the thesis. To do this, I’ll use the tidytext package. Let’s import the relevant packages now.

library(tidyverse)
library(tidytext)
library(stringr)

The tidyverse ecosystem and tidytext play well together (no surprises there) and so I also import tidyverse. The stringr package is useful for filtering out the LaTeX specific code and also for dropping words that have numbers in them (like jefferson1776 as a reference or 0.05).

Now let’s read in the data (the tex files)

thesis_words <- data_frame(file = paste0("~/thesis/thesis/", 
                         c("introduction.tex", "lit-review.tex", "methods.tex", 
                           "results.tex", "discussion.tex"))) %>%
  mutate(text = map(file, read_lines))
thesis_words
## # A tibble: 5 × 2
##                               file          text
##                              <chr>        <list>
## 1 ~/thesis/thesis/introduction.tex   <chr [125]>
## 2   ~/thesis/thesis/lit-review.tex <chr [1,386]>
## 3      ~/thesis/thesis/methods.tex   <chr [625]>
## 4      ~/thesis/thesis/results.tex <chr [1,351]>
## 5   ~/thesis/thesis/discussion.tex   <chr [649]>

The resulting tibble has a variable file that is the name of the file that created that row and a list-column of the text of that file.

We want to unnest() that tibble, remove the lines that are LaTeX cruft (anything starting with a backslash followed by a letter, like \section or \begin{figure}) and compute a line number.

thesis_words <- thesis_words %>%
  unnest() %>%
  filter(text != "%!TEX root = thesis.tex") %>%
  filter(!str_detect(text, "^(\\\\[A-Z,a-z])"),
         text != "") %>%
  mutate(line_number = 1:n(),
         file = str_sub(basename(file), 1, -5))
thesis_words$file <- forcats::fct_relevel(thesis_words$file, c("introduction",
                                                  "lit-review",
                                                  "methods",
                                                  "results",
                                                  "discussion"))

Now we have a tibble with file giving us the chapter, text giving us the line of text from the tex files (when I wrote it, I strived to keep my line lengths under 80 characters, hence the relatively short value in text) and line_number giving a counter of the number of lines since the start of the thesis.
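As a quick sanity check on that cleanup (not a step from the original analysis), we can count how many lines of text survive per chapter:

thesis_words %>%
  count(file)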

Now we want to tokenize the text (split each line into individual words, stripped of punctuation and converted to lower case). This is easy with unnest_tokens(). I’ve also played around with the results and came up with some other words that needed to be deleted (stats terms like ci or p, LaTeX terms like _i or tabular, and references/numbers).

thesis_words <- thesis_words %>%
  unnest_tokens(word, text) %>%
  filter(!str_detect(word, "[0-9]"),
         word != "fismanreview",
         word != "multicolumn",
         word != "p",
         word != "_i",
         word != "c", 
         word != "ci",
         word != "al",
         word != "dowellsars",
         word != "h",
         word != "tabular",
         word != "t",
         word != "ref",
         word != "cite",
         !str_detect(word, "[a-z]_"),
         !str_detect(word, ":"),
         word != "bar",
         word != "emph",
         !str_detect(word, "textless"))
thesis_words
## # A tibble: 27,787 × 3
##            file line_number        word
##          <fctr>       <int>       <chr>
## 1  introduction           1 seasonality
## 2  introduction           1          or
## 3  introduction           1         the
## 4  introduction           1    periodic
## 5  introduction           1      surges
## 6  introduction           1         and
## 7  introduction           1       lulls
## 8  introduction           1          in
## 9  introduction           1   incidence
## 10 introduction           1          is
## # ... with 27,777 more rows

Now to compute the sentiment using the words written per line in the thesis. tidytext comes with three sentiment lexicons: afinn, bing and nrc. afinn provides a score ranging from -5 (very negative) to +5 (very positive) for 2,476 words. bing provides a label of “negative” or “positive” for 6,788 words. nrc provides a label (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise or trust) for 13,901 words. None of these account for negation (“I’m not sad” is a negative sentiment, not a positive one).
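If you want to peek at the lexicons themselves, get_sentiments() returns each one as a tidy tibble (a quick check, not part of the original analysis):

get_sentiments("afinn")                      # word plus an integer score
get_sentiments("bing") %>% count(sentiment)  # positive/negative labels
get_sentiments("nrc") %>% count(sentiment)   # ten emotion/valence labels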

Using the nrc lexicon, let’s see how the emotions of my words change over the thesis.

thesis_words %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(index = line_number %/% 25, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) + 
  geom_bar(stat = "identity", alpha = 0.8) + 
  facet_wrap(~ sentiment, ncol = 5) 

I wasn’t surprised, but at least I wasn’t sad? It looks like I used more “fear” and “negative” words in the lit-review than in the other sections. However, “infectious,” as in “infectious diseases,” counts as a fear/negative word, and I used that word a lot more in the lit review than in the other sections.

I can use the bing and afinn lexicons to look at how the sentiment of the words changed over the course of the thesis.

thesis_words %>% 
  left_join(get_sentiments("bing")) %>%
  left_join(get_sentiments("afinn")) %>%
  group_by(index = line_number %/% 25, file) %>%
  summarize(afinn = mean(score, na.rm = TRUE), 
            bing = sum(sentiment == "positive", na.rm = TRUE) - sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>% 
  ggplot(aes(x = index, y = lexicon_score, fill = file)) +
    geom_bar(stat = "identity") + 
    facet_wrap(~ lexicon, scale = "free_y") +
    scale_x_continuous("Location in thesis", breaks = NULL) +
    scale_y_continuous("Lexicon Score")

Looking at the two lexicons’ scoring of my thesis, the bing lexicon seems a little more stable if we assume local correlation of sentiments is likely. It seems like I started out all doom and gloom (hey, I needed to convince my committee that it was a real problem!), moved on to more doom and gloom (did I mention this is a problem and my question hasn’t been resolved?), the methods were more neutral, the results were more doom and gloom with a slight uplift at the end followed by more doom and gloom (this really is a problem, guys!), and there was a little bit of hope at the end (now that we know, we can fix this?).

This got me thinking about what a typical academic paper looks like. My mental model for a paper is:

  1. show that the problem is really a problem (“[disease] is a significant cause of morbidity and mortality”)
  2. show that the problem isn’t resolved by the prior work
  3. answer the question
  4. incorporate the answer into the existing literature
  5. discuss limitations and breezily dismiss them
  6. show hope for the future

So I pulled the text of my 4 currently published papers. I’m going to call them well-children, medication time series, transfer networks and COPD readmissions.

I took the text out of each paper and copied them into plain text files and read them into R as above. I also computed line numbers within each of the different papers.

paper_words <- data_frame(file = paste0("~/projects/paper_analysis/", 
                                         c("well_child.txt", "pharm_ts.txt",
                                           "transfers.txt", "copd.txt"))) %>%
  mutate(text = map(file, read_lines)) %>%
  unnest() %>%
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) 

paper_sentiment <- inner_join(paper_words, get_sentiments("bing")) %>%
  count(file, index = round(line_number / max(line_number) * 100 / 5) * 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

paper_sentiment %>% ggplot(aes(x = index, y = net_sentiment, fill = file)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  facet_wrap(~ file) + 
  scale_x_continuous("Location in paper (percent)") + 
  scale_y_continuous("Bing Net Sentiment")

It looks like I wasn’t totally off. Most of the papers start out relatively negative and have super negative results sections (judging by paper location), but I was wrong about them ending on a happy note.

And the sentiment for this post:
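A minimal sketch of how that plot can be produced, assuming the post’s R Markdown source is saved at a hypothetical path, follows the same recipe as the papers above:

post_sentiment <- data_frame(text = read_lines("~/blog/sentiment-post.Rmd")) %>%  # hypothetical path
  mutate(line_number = row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = line_number %/% 25, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

post_sentiment %>%
  ggplot(aes(x = index, y = net_sentiment)) + 
  geom_bar(stat = "identity") + 
  scale_x_continuous("Location in post", breaks = NULL) + 
  scale_y_continuous("Bing Net Sentiment")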

Talking about negative sentiments is a negative sentiment. But look at the start when I was talking about Austen… that was a good time.

Easy Cross Validation in R with `modelr`

When estimating a model, the quality of the model fit will always be higher in-sample than out-of-sample. A model will always fit the data that it is trained on, warts and all, and may use those warts and statistical noise to make predictions. As a result, a model that performs very well on a data set may perform poorly when used more generally.

Since we often can’t fit a model, put it in use and determine how well it performs (and probably wouldn’t want to even if we could), we need some methods for predicting out-of-sample error when only using the data we have.

The simplest method is to randomly divide the data into a training set and a test set. However, this provides only one test set. If you train a model, test it, then train a new model and re-test it on the same test set and see an improvement, that improvement may have come from the model actually being better, or it may have come from you over-fitting the test set.
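For reference, modelr can do that single random split as well; here is a minimal sketch using the built-in mtcars data (not the dataset analyzed below):

single_split <- modelr::resample_partition(mtcars, c(train = 0.8, test = 0.2))
fit <- lm(mpg ~ wt + hp, data = as.data.frame(single_split$train))
modelr::rmse(fit, single_split$test)  # error measured on the held-out 20%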

One possible way to address this problem is k-fold cross validation. The general idea is to use some portion of your data (typically 90%) to fit a model and then test that model on the withheld portion (typically 10%) to assess the quality of the fit. You repeat this process until each point has been used to estimate the model \(k-1\) times and been included in the test set exactly once.

This can be done with a loop or a member of the apply family, but I’m going to use purrr to write functional code to fit the models and modelr to evaluate the models.

My “preamble” just loads the packages used below.
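A minimal sketch, assuming nothing beyond the tidyverse and modelr is needed:

library(tidyverse)  # dplyr, purrr, tidyr and ggplot2
library(modelr)     # crossv_kfold(), resample objects and rmse()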

Data and the Model

For this example, I’m going to use the cars dataset from the caret package. It looks like this:

data("cars", package = "caret")
glimpse(cars)
## Observations: 804
## Variables: 18
## $ Price       <dbl> 22661.05, 21725.01, 29142.71, 30731.94, 33358.77, ...
## $ Mileage     <int> 20105, 13457, 31655, 22479, 17590, 23635, 17381, 2...
## $ Cylinder    <int> 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Doors       <int> 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Cruise      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Sound       <int> 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,...
## $ Leather     <int> 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,...
## $ Buick       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Cadillac    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Chevy       <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Pontiac     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Saab        <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Saturn      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ convertible <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ coupe       <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ hatchback   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ sedan       <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ wagon       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

We want to build a model to predict the price of the car. We have the mileage, number of cylinders, size, some features (cruise control, sound system, leather seats), brand and then the style of car. The data already has dummy variables constructed but we need to drop one in order to estimate the model (e.g., set it as the reference level). We’ll drop the brand Chevy indicator (setting Chevy to the reference brand) and set the reference style to sedan. We’ll also drop the doors variable as that is perfectly co-linear with the coupe indicator and styles likely contain more information than the number of doors.

Our final model will look something like this:

select(cars, -Doors, -Chevy, -sedan) %>% 
  lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac + 
  Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon, data = .)

The modelr package has a useful tool for making the cross-validation folds. crossv_kfold() divides the data into \(k\) folds (we use 10 below) and returns a tibble with list-columns train (the training data for each fold) and test (the corresponding test set), plus an .id column giving the fold number. For more details on the idea of list-columns, see the Many Models chapter in R for Data Science by Garrett Grolemund and Hadley Wickham.

cars_cv <- cars %>%
  select(-Doors, -Chevy, -sedan) %>%
  crossv_kfold(10)
cars_cv
## # A tibble: 10 × 3
##             train           test   .id
##            <list>         <list> <chr>
## 1  <S3: resample> <S3: resample>    01
## 2  <S3: resample> <S3: resample>    02
## 3  <S3: resample> <S3: resample>    03
## 4  <S3: resample> <S3: resample>    04
## 5  <S3: resample> <S3: resample>    05
## 6  <S3: resample> <S3: resample>    06
## 7  <S3: resample> <S3: resample>    07
## 8  <S3: resample> <S3: resample>    08
## 9  <S3: resample> <S3: resample>    09
## 10 <S3: resample> <S3: resample>    10

You might notice that instead of the elements being data.frame or tibble, they are resample objects. resample objects are cool because they are lazy. Instead of making 10 copies of cars, crossv_kfold() returned what are essentially pointers to the original data. So instead of storing the data 10 times, we store it once plus a little bit for the pointers. Indeed, we can see that in the object size:

pryr::object_size(cars)
## 63.6 kB
pryr::object_size(cars_cv)
## 93.7 kB

Awesome, right?
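If you want to look inside a fold, a resample can be materialized on demand (a quick illustration, not part of the original analysis):

as_tibble(cars_cv$train[[1]])  # the roughly 90% of rows used for fitting in fold 01
as_tibble(cars_cv$test[[1]])   # the roughly 10% of rows held out for fold 01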

Now it is time to fit the model. It can all be done within the very tidy tidyverse (see what I did there?). The code below uses purrr::map() to fit the model for each one of the 10 folds and returns the RMSE on the test set.

cars_cv %>% 
  mutate(train = map(train, as_tibble)) %>%
  mutate(model = map(train, ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + 
                                   Leather + Buick + Cadillac + Pontiac + 
                                   Saab + Saturn + convertible + coupe + 
                                   hatchback + wagon, 
                                 data = .))) %>%
  mutate(rmse = map2_dbl(model, test, rmse)) %>%
  select(.id, rmse)
## # A tibble: 10 × 2
##      .id     rmse
##    <chr>    <dbl>
## 1     01 2372.046
## 2     02 2844.509
## 3     03 3398.572
## 4     04 3165.065
## 5     05 2838.691
## 6     06 2675.153
## 7     07 3720.843
## 8     08 2692.567
## 9     09 2362.007
## 10    10 3016.648

It is reasonably fast too.

ten_fold <- function(x) {
  x %>% 
    mutate(model = map(train, ~ lm(Price ~ Mileage + Cylinder + Cruise + Sound + 
                                   Leather + Buick + Cadillac + Pontiac + 
                                   Saab + Saturn + convertible + coupe + 
                                   hatchback + wagon, 
                                 data = .))) %>%
  mutate(rmse = map2_dbl(model, test, rmse)) %>%
  select(.id, rmse)
}

microbenchmark::microbenchmark(ten_fold(cars_cv))
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max
##  ten_fold(cars_cv) 45.47155 50.18167 61.01467 53.27679 60.67367 139.0201
##  neval
##    100

microbenchmark puts it at a median time of about 50 ms. We can probably make it even faster, though. We don’t actually need to store the model, and we can do it all in one step with a function.

fitter <- function(training_data, testing_data) {
  lm(Price ~ Mileage + Cylinder + Cruise + Sound + Leather + Buick + Cadillac + 
       Pontiac + Saab + Saturn + convertible + coupe + hatchback + wagon, 
     data = training_data) %>% 
    rmse(testing_data)
}

ten_fold_2 <- function(x) {
  x %>% 
    mutate(rmse = map2_dbl(train, test, fitter)) %>%
    select(.id, rmse)
}

microbenchmark::microbenchmark(ten_fold_2(cars_cv))
## Unit: milliseconds
##                 expr     min       lq     mean   median       uq      max
##  ten_fold_2(cars_cv) 46.1202 49.16919 52.21734 51.25691 53.72752 100.7599
##  neval
##    100

Hey, that saved 2 or 3 milliseconds on the median. Totally worth it. (If your model fit takes a long time, it is probably worth it).

It is possible to parallelize the model fitting using multidplyr as the backend for dplyr; however, multidplyr doesn’t seem to work well with the resample objects. The easiest solution is to store nine copies of the data so that the cross-validation works simply - but when dealing with health care administrative data I’m more likely to be hurting from a shortage of memory than from a shortage of CPU cycles/time. Alternatively, you could use do() and multidplyr to filter by fold ID, fit a model and compute the RMSE, but that solution is much less simple.

Packages like caret do provide powerful tools for cross-validation; however, I like the ability to keep my entire workflow smooth and within the tidyverse. I found that, in general, I prefer to use the modelr tools and do the model fitting the “old fashioned way.”
