Poor Donald - his tweets keep getting more negative

Last summer, David Robinson did this interesting text analysis of Donald Trump’s tweets and found that the angrier ones came from Android (which Trump is known to use). But he didn’t consider how Trump’s emotional state varies over time, and he certainly couldn’t have considered the impact that the election and recent events would have on Trump.

Using the twitteR package and the tidyverse ecosystem (plus tidytext), this is actually a very simple analysis.

For starters, pulling Trump’s most recent tweets is very simple (the API returns at most the last 3,200; we request 3,100 here):

library(twitteR)
library(tidyverse)
library(tidytext)

source("~/twitter_key.R")

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"
trump <- userTimeline("realDonaldTrump",
                      n = 3100,
                      includeRts = TRUE,
                      excludeReplies = FALSE) %>%
  twListToDF() %>%
  as_tibble()

And then we have a tidy tibble with Trump’s tweets:

glimpse(trump)
## Observations: 3,099
## Variables: 16
## $ text          <chr> "Heading to Joint Base Andrews on #MarineOne wit...
## $ favorited     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount <dbl> 77699, 85576, 71312, 220083, 64348, 84125, 62284...
## $ replyToSN     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ created       <time> 2017-02-10 23:24:51, 2017-02-10 13:35:50, 2017-...
## $ truncated     <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ replyToSID    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ id            <chr> "830195857530183684", "830047626414477312", "830...
## $ replyToUID    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ statusSource  <chr> "<a href=\"http://twitter.com/download/iphone\" ...
## $ screenName    <chr> "realDonaldTrump", "realDonaldTrump", "realDonal...
## $ retweetCount  <dbl> 21473, 19779, 15069, 64363, 10082, 14185, 11294,...
## $ isRetweet     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Using tidytext, it is straightforward to unnest and tokenize the words in the body of the tweets:

words <- trump %>%
  select(id, statusSource, retweetCount, favoriteCount, created, isRetweet,
         text) %>%
  unnest_tokens(word, text)
words
## # A tibble: 57,239 x 7
##                    id
##                 <chr>
## 1  830195857530183684
## 2  830195857530183684
## 3  830195857530183684
## 4  830195857530183684
## 5  830195857530183684
## 6  830195857530183684
## 7  830195857530183684
## 8  830195857530183684
## 9  830195857530183684
## 10 830195857530183684
## # ... with 57,229 more rows, and 6 more variables: statusSource <chr>,
## #   retweetCount <dbl>, favoriteCount <dbl>, created <time>,
## #   isRetweet <lgl>, word <chr>
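
To see what unnest_tokens() does to a single tweet, here is a minimal sketch on a made-up tibble (the toy object and its text are assumptions for illustration, not part of the Trump data). With its default settings the function lowercases each token and strips punctuation, while the other columns are repeated on every row:

```r
library(tibble)
library(tidytext)

# Hypothetical one-row stand-in for the `trump` tibble
toy <- tibble(id = "1", text = "Make America Great Again!")

# One row per word; `id` is carried along on every row
toy_words <- unnest_tokens(toy, word, text)
toy_words
```

This is why the 3,099 tweets above expand into tens of thousands of rows, each still tagged with its tweet id.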

Given what David Robinson found, we might want to convert the statusSource variable into a flag for whether it was posted via an Android device:

words <- words %>%
  mutate(android = stringr::str_detect(statusSource, "Android")) %>%
  select(- statusSource)
words
## # A tibble: 57,239 x 7
##                    id retweetCount favoriteCount             created
##                 <chr>        <dbl>         <dbl>              <time>
## 1  830195857530183684        21473         77699 2017-02-10 23:24:51
## 2  830195857530183684        21473         77699 2017-02-10 23:24:51
## 3  830195857530183684        21473         77699 2017-02-10 23:24:51
## 4  830195857530183684        21473         77699 2017-02-10 23:24:51
## 5  830195857530183684        21473         77699 2017-02-10 23:24:51
## 6  830195857530183684        21473         77699 2017-02-10 23:24:51
## 7  830195857530183684        21473         77699 2017-02-10 23:24:51
## 8  830195857530183684        21473         77699 2017-02-10 23:24:51
## 9  830195857530183684        21473         77699 2017-02-10 23:24:51
## 10 830195857530183684        21473         77699 2017-02-10 23:24:51
## # ... with 57,229 more rows, and 3 more variables: isRetweet <lgl>,
## #   word <chr>, android <lgl>
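
As a quick check of how the flag behaves, the sketch below runs the same str_detect() call on two illustrative statusSource values. The exact strings are assumptions modelled on the truncated value shown in the glimpse() output; the real column holds similar HTML anchors:

```r
library(stringr)

# Illustrative statusSource values (assumed, not copied from the data)
sources <- c(
  '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'
)

# str_detect() is case-sensitive, so the match comes from the link text
# "Twitter for Android", not from the lowercase "android" in the URL
is_android <- str_detect(sources, "Android")
is_android
```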

Let’s now score the words using the AFINN sentiment lexicon:

words <- words %>%
  inner_join(get_sentiments("afinn"))
## Joining, by = "word"
words
## # A tibble: 5,093 x 8
##                    id retweetCount favoriteCount             created
##                 <chr>        <dbl>         <dbl>              <time>
## 1  830047626414477312        19779         85576 2017-02-10 13:35:50
## 2  830047626414477312        19779         85576 2017-02-10 13:35:50
## 3  830042498806460417        15069         71312 2017-02-10 13:15:27
## 4  829721019720015872        10082         64348 2017-02-09 15:58:01
## 5  829721019720015872        10082         64348 2017-02-09 15:58:01
## 6  829689436279603206        14185         84125 2017-02-09 13:52:31
## 7  829689436279603206        14185         84125 2017-02-09 13:52:31
## 8  829689436279603206        14185         84125 2017-02-09 13:52:31
## 9  829689436279603206        14185         84125 2017-02-09 13:52:31
## 10 829689436279603206        14185         84125 2017-02-09 13:52:31
## # ... with 5,083 more rows, and 4 more variables: isRetweet <lgl>,
## #   word <chr>, android <lgl>, score <int>
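
The drop from 57,239 rows to 5,093 happens because inner_join() keeps only the words that appear in the lexicon. A minimal sketch with a hand-made stand-in for the lexicon shows the effect (the toy tibbles and their scores are assumptions for illustration, not the real AFINN list). Note also that more recent tidytext releases appear to name the AFINN column value rather than score, so the code in this post may need adjusting there:

```r
library(dplyr)
library(tibble)

# Hand-made stand-in for the lexicon (hypothetical values)
toy_lexicon <- tibble(word = c("great", "bad"), score = c(3L, -3L))

# Tokenized words from two imaginary tweets
toy_words <- tibble(
  id   = c("1", "1", "1", "2", "2"),
  word = c("a", "great", "day", "bad", "news")
)

# Only "great" and "bad" survive; words without a score are dropped
scored <- inner_join(toy_words, toy_lexicon, by = "word")
scored
```

One consequence is that a tweet containing no lexicon words disappears from the analysis entirely.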

And now let’s see how the typical sentiment of those tweets has varied from April 2016 (the midst of the Republican primary) to the present:

words %>%
  filter(isRetweet == FALSE) %>%          # drop retweets
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%  # mean AFINN score per tweet
  ggplot(aes(x = created, y = sentiment)) +
  geom_smooth() +
  # vertical markers for the three dates discussed below
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +
  labs(x = "Date", y = "Mean Afinn Sentiment Score")

The vertical lines denote the date he became the presumptive Republican nominee (May 3rd 2016), the date of the election (Nov 8th 2016) and inauguration day (Jan 20th 2017). Things aren’t looking up for Trump. He seems to be more angry/sad/negative now than at any prior point during the past year.
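
The per-tweet value behind the smoothed line is simply the mean of the word scores within each tweet. A minimal sketch on made-up scores (the toy tibble is an assumption for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical scored words for two tweets
toy_scored <- tibble(
  id    = c("1", "1", "2", "2", "2"),
  score = c(3L, -1L, -3L, -2L, -1L)
)

# One row per tweet with its mean word score
per_tweet <- toy_scored %>%
  group_by(id) %>%
  summarize(sentiment = mean(score))
per_tweet
```

Because the mean ignores how many scored words a tweet has, a tweet with a single strongly negative word counts as much as one with several.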

What if we group the tweets by whether they came from an Android device:

words %>%
  filter(isRetweet == FALSE) %>%          # drop retweets
  group_by(id, created, android) %>%      # keep the Android flag per tweet
  summarize(sentiment = mean(score)) %>%  # mean AFINN score per tweet
  ggplot(aes(x = created, y = sentiment, color = android)) +
  geom_smooth() +
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +
  labs(x = "Date", y = "Mean Afinn Sentiment Score")

We see the general trend that David Robinson identified: the Android tweets tended to be more negative than those from other platforms. It is interesting that, right before the election, they were more positive than the tweets presumed to be by staff. We can also see that the non-Android tweets were more positive during the transition, while the Android tweets clearly became more negative. Perhaps the limits of Presidential powers are stricter than he expected. It is also interesting that the Android tweets are now more negative than positive, the first time this has occurred.

Interestingly, being positive or negative seems to have no effect on the number of retweets

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = retweetCount, color = android)) + 
  geom_smooth() + 
  geom_point() + 
  scale_y_log10() + 
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Retweets")
## Joining, by = "id"

or the number of favorites

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = favoriteCount, color = android)) + 
  geom_smooth() + 
  geom_point() + 
  scale_y_log10() + 
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Favorites")
## Joining, by = "id"

that a tweet gets.

Regression analysis of the Android tweets, however, suggests that a tweet with a negative mean score gets significantly more retweets, but also that this effect wears off with time (very, very slightly):

words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(retweetCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
## 
## Call:
## lm(formula = log(retweetCount) ~ created * (sentiment < 0), data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7744 -0.3806  0.0005  0.3576  3.2661 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.012e+02  3.942e+00 -25.679  < 2e-16 ***
## created                    7.488e-08  2.680e-09  27.939  < 2e-16 ***
## sentiment < 0TRUE          1.959e+01  6.086e+00   3.219  0.00132 ** 
## created:sentiment < 0TRUE -1.313e-08  4.135e-09  -3.175  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5923 on 1198 degrees of freedom
## Multiple R-squared:  0.5195, Adjusted R-squared:  0.5183 
## F-statistic: 431.7 on 3 and 1198 DF,  p-value: < 2.2e-16
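
Because created enters the model as seconds since 1970, the two sentiment coefficients are hard to read on their own. A rough way to interpret them, sketched here with the rounded coefficients copied from the output above, is to evaluate the combined effect of a negative tweet at a few dates. The choice of dates and the UTC time zone are our assumptions for illustration, and the rounding of the printed coefficients makes the results approximate:

```r
# Rounded coefficients from the retweet regression output
b_negative    <- 1.959e+01   # sentiment < 0TRUE
b_interaction <- -1.313e-08  # created:sentiment < 0TRUE

# Dates to evaluate at, as seconds since 1970-01-01 (UTC assumed)
dates <- as.POSIXct(c("2016-05-03", "2016-11-08", "2017-01-20"), tz = "UTC")
secs  <- as.numeric(dates)

# Difference in log(retweetCount) between negative and non-negative tweets
log_effect <- b_negative + b_interaction * secs

# Expressed as a multiplicative factor on the retweet count
data.frame(date = dates, log_effect = log_effect, factor = exp(log_effect))
```

With these rounded values the difference stays positive but shrinks across the three dates, which is the "wears off with time" pattern described above.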

A similar pattern holds for the number of favorites:

words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(favoriteCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
## 
## Call:
## lm(formula = log(favoriteCount) ~ created * (sentiment < 0), 
##     data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75782 -0.35691 -0.00795  0.33800  2.48914 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -1.176e+02  3.452e+00 -34.068  < 2e-16 ***
## created                    8.689e-08  2.347e-09  37.020  < 2e-16 ***
## sentiment < 0TRUE          1.435e+01  5.329e+00   2.692  0.00721 ** 
## created:sentiment < 0TRUE -9.648e-09  3.621e-09  -2.664  0.00781 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5187 on 1198 degrees of freedom
## Multiple R-squared:  0.6525, Adjusted R-squared:  0.6517 
## F-statistic: 749.9 on 3 and 1198 DF,  p-value: < 2.2e-16

The most common positive and negative words in the Android postings varied across three phases: before the election, during the transition, and after Trump was sworn in:

words %>%
  filter(android) %>%
  mutate(phase = ifelse(as.POSIXct("2016-11-08") > created, "Before the election",
                        ifelse(as.POSIXct("2017-01-20") > created, "Transition",
                                          "In the White House"))) %>%
  group_by(phase, pos_sentiment = score >= 0, word) %>%
  count() %>%
  group_by(phase, pos_sentiment) %>%
  filter(word != "no") %>%
  top_n(3, wt = n) %>%
  arrange(pos_sentiment, phase, desc(n))
## Source: local data frame [18 x 4]
## Groups: phase, pos_sentiment [6]
## 
##                  phase pos_sentiment      word     n
##                  <chr>         <lgl>     <chr> <int>
## 1  Before the election         FALSE       bad    62
## 2  Before the election         FALSE dishonest    27
## 3  Before the election         FALSE    rigged    25
## 4   In the White House         FALSE       bad    10
## 5   In the White House         FALSE      fake    10
## 6   In the White House         FALSE       ban     6
## 7           Transition         FALSE       bad    13
## 8           Transition         FALSE     wrong    11
## 9           Transition         FALSE dishonest    10
## 10 Before the election          TRUE     great   175
## 11 Before the election          TRUE     thank    69
## 12 Before the election          TRUE       big    54
## 13  In the White House          TRUE     great     8
## 14  In the White House          TRUE       big     5
## 15  In the White House          TRUE       win     5
## 16          Transition          TRUE     great    56
## 17          Transition          TRUE       big    16
## 18          Transition          TRUE       win    14

We have the fake news to thank for the debut of “fake” after he was sworn in. At least the election was no longer rigged after he won it.

