Using tidytext to make sentiment analysis easy

Last week I discovered the R package tidytext and its very nice e-book detailing usage. Julia Silge and David Robinson have significantly reduced the effort it takes for me to “grok” text mining by making it “tidy.”

It certainly helped that a lot of the examples are from Pride and Prejudice and other books by Jane Austen, my most beloved author. Julia Silge’s examples on her blog doing NLP and sentiment analysis alone would have made me a life-long fan. The gifs from P&P (mostly the 1995 mini-series, to be honest) on her posts and the references in the titles made me very excited. My brain automatically started playing the theme and made me smile.

Okay, enough of that. Moving on.

Seeing her work, I started wondering what I could model to get some insight into my own life. I have a database of 92,372 text messages (basically every message sent to or from me from sometime in 2011/2012 to 2015), but text messages are weird (lots of “lol”s and “haha”s). I think there are some interesting insights there, but probably not what I want to cover today.

So I started thinking about what other plain text data I had that might be interesting. Then I realized I have a 149-page dissertation (excluding boilerplate and references), written in LaTeX (so easy to parse) and split across 5 files that map directly to the chapters (intro, lit review, methods, results and discussion). I could do something with that!

My thesis is currently under embargo while I chop it into its respective papers (one under review, one soon to be under review and one undergoing a final revision. So close.), so I can’t link to it. However, it relates the seasonality of two infectious diseases to local weather patterns.

I wonder how my sentiment changes across the thesis. To do this, I’ll use the tidytext package. Let’s import the relevant packages now.

library(tidyverse)
library(tidytext)
library(stringr)

The tidyverse ecosystem and tidytext play well together (no surprises there), so I also import tidyverse. The stringr package is useful for filtering out the LaTeX-specific code and also for dropping words that have numbers in them (like jefferson1776 as a reference or 0.05).

Now let’s read in the data (the tex files).

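# tibble() replaces the deprecated data_frame()
thesis_words <- tibble(file = paste0("~/thesis/thesis/",
                                     c("introduction.tex", "lit-review.tex", "methods.tex",
                                       "results.tex", "discussion.tex"))) %>%
  # read each file into a character vector of lines, stored as a list-column
  mutate(text = map(file, read_lines))
thesis_words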
## # A tibble: 5 × 2
##                               file          text
##                              <chr>        <list>
## 1 ~/thesis/thesis/introduction.tex   <chr [125]>
## 2   ~/thesis/thesis/lit-review.tex <chr [1,386]>
## 3      ~/thesis/thesis/methods.tex   <chr [625]>
## 4      ~/thesis/thesis/results.tex <chr [1,351]>
## 5   ~/thesis/thesis/discussion.tex   <chr [649]>

The resulting tibble has a variable file that is the name of the file that created that row and a list-column of the text of that file.

We want to unnest() that tibble, remove the lines that are LaTeX cruft (lines that start with a backslash followed by a letter, like \section or \figure) and compute a line number.

thesis_words <- thesis_words %>%
  unnest(text) %>%
  filter(text != "%!TEX root = thesis.tex") %>%
  # drop lines that start with a LaTeX command (backslash followed by a letter)
  # and empty lines
  filter(!str_detect(text, "^\\\\[A-Za-z]"),
         text != "") %>%
  mutate(line_number = row_number(),
         file = str_sub(basename(file), 1, -5))
# order the chapters as they appear in the thesis
thesis_words$file <- forcats::fct_relevel(thesis_words$file, c("introduction",
                                                               "lit-review",
                                                               "methods",
                                                               "results",
                                                               "discussion"))

Now we have a tibble with file giving us the chapter, text giving us the line of text from the tex files (when I wrote it, I strived to keep my line lengths under 80 characters, hence the relatively short value in text) and line_number giving a counter of the number of lines since the start of the thesis.
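To make the filtering step concrete, here is a minimal sketch of what that regular expression does. The lines below are made up for illustration and are not from my thesis.

```r
library(stringr)

# Hypothetical lines standing in for the contents of a .tex file
lines <- c("\\section{Introduction}",
           "Seasonality is the periodic surge in incidence.",
           "",
           "%!TEX root = thesis.tex",
           "as shown previously \\cite{somekey} in other work")

# TRUE where the line starts with a backslash followed by a letter
is_command <- str_detect(lines, "^\\\\[A-Za-z]")
is_command

# Keep lines that are not commands, not empty and not the TEX root comment
kept <- lines[!is_command & lines != "" & lines != "%!TEX root = thesis.tex"]
kept
```

Note that the pattern is anchored to the start of the line, so a command in the middle of a sentence (like the \cite above) survives. That is why words like cite and ref still need to be removed after tokenizing.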

Now we want to tokenize: split each line into individual words, lowercased and stripped of punctuation. (This does not stem words down to their roots.) This is easy with unnest_tokens(). I also played around with the results and came up with some other words that needed to be deleted (stats terms like ci or p, LaTeX terms like _i or tabular, and references/numbers).

thesis_words <- thesis_words %>%
  unnest_tokens(word, text) %>%
  filter(!str_detect(word, "[0-9]"),
         word != "fismanreview",
         word != "multicolumn",
         word != "p",
         word != "_i",
         word != "c",
         word != "ci",
         word != "al",
         word != "dowellsars",
         word != "h",
         word != "tabular",
         word != "t",
         word != "ref",
         word != "cite",
         !str_detect(word, "[a-z]_"),
         !str_detect(word, ":"),
         word != "bar",
         word != "emph",
         !str_detect(word, "textless"))
thesis_words
## # A tibble: 27,787 × 3
##            file line_number        word
##          <fctr>       <int>       <chr>
## 1  introduction           1 seasonality
## 2  introduction           1          or
## 3  introduction           1         the
## 4  introduction           1    periodic
## 5  introduction           1      surges
## 6  introduction           1         and
## 7  introduction           1       lulls
## 8  introduction           1          in
## 9  introduction           1   incidence
## 10 introduction           1          is
## # ... with 27,777 more rows
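If you want to see what unnest_tokens() does without a thesis lying around, here is a small sketch on a toy tibble. The two lines of text are invented for the example; the column names match the ones used above.

```r
library(tibble)
library(tidytext)

# Hypothetical two-line "document"
toy <- tibble(line_number = 1:2,
              text = c("Seasonality, or the periodic surges",
                       "and lulls in incidence (p = 0.05)."))

# One row per word; line_number is carried along for each word
toy_words <- unnest_tokens(toy, word, text)
toy_words
```

Each word ends up on its own row, lowercased and without the surrounding punctuation, while line_number is repeated for every word from that line. Nothing is stemmed: surges stays surges.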

Now to compute the sentiment using the words written per line in the thesis. tidytext comes with three sentiment lexicons: afinn, bing and nrc. afinn provides a score ranging from -5 (very negative) to +5 (very positive) for 2,476 words. bing provides a label of “negative” or “positive” for 6,788 words. nrc provides a label (anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise or trust) for 13,901 words. None of these account for negation: “I’m not sad” still gets scored as negative because of “sad”, even though the sentence isn’t negative.
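Here is a minimal sketch of the negation problem. To keep it self-contained I use a made-up two-word lexicon (toy_lexicon is hypothetical) in the same word/sentiment shape as the bing lexicon, rather than a real one.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Hypothetical lexicon with the same columns as get_sentiments("bing")
toy_lexicon <- tibble(word = c("sad", "happy"),
                      sentiment = c("negative", "positive"))

# Word-by-word matching only sees "sad"; the "not" in front of it is ignored
scored <- tibble(text = "I'm not sad") %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word")
scored
```

The only word that matches the lexicon is sad, so the sentence counts as one negative word. Handling this properly would mean looking at pairs of words instead of single words, which I am not doing here.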

Using the nrc lexicon, let’s see how the emotions of my words change over the thesis.

thesis_words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  # bin the thesis into chunks of 25 lines and count words per emotion
  group_by(index = line_number %/% 25, file, sentiment) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = index, y = n, fill = file)) +
  geom_bar(stat = "identity", alpha = 0.8) +
  facet_wrap(~ sentiment, ncol = 5)

I wasn’t surprised, but at least I wasn’t sad? It looks like I used more “fear” and “negative” words in the lit-review than the other sections. However, it looks like “infectious” as in “infectious diseases” is a fear/negative word. I used that word a lot more in the lit review than other sections.
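A quick aside on the line_number %/% 25 in the group_by() call: %/% is integer division, so it chops the thesis into consecutive chunks of 25 lines. A small sketch with 100 made-up line numbers:

```r
# Hypothetical document with 100 lines
line_number <- 1:100

# Integer division assigns each line to a 25-line bin
index <- line_number %/% 25

# Number of lines that land in each bin
bin_sizes <- as.vector(table(index))
bin_sizes
# 24 25 25 25 1
```

Because the line numbers start at 1 rather than 0, the first bin holds only 24 lines and line 100 spills into a bin of its own. For a plot like this one that is close enough, but (line_number - 1) %/% 25 would give even bins.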

I can use the bing and afinn lexicons to look at how the sentiment of the words changed over the course of the thesis.

thesis_words %>%
  left_join(get_sentiments("bing"), by = "word") %>%
  left_join(get_sentiments("afinn"), by = "word") %>%
  group_by(index = line_number %/% 25, file) %>%
  # newer tidytext releases name the AFINN column `value` instead of `score`
  summarize(afinn = mean(score, na.rm = TRUE),
            bing = sum(sentiment == "positive", na.rm = TRUE) -
              sum(sentiment == "negative", na.rm = TRUE)) %>%
  gather(lexicon, lexicon_score, afinn, bing) %>%
  ggplot(aes(x = index, y = lexicon_score, fill = file)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ lexicon, scales = "free_y") +
  scale_x_continuous("Location in thesis", breaks = NULL) +
  scale_y_continuous("Lexicon Score")

Looking at the two lexicons’ scoring of my thesis, the bing lexicon seems a little more stable, if we assume that sentiment in nearby lines should be similar. It seems like I started out all doom and gloom (hey, I needed to convince my committee that it was a real problem!) and moved on to more doom and gloom (did I mention this is a problem and my question hasn’t been resolved?). The methods were more neutral. The results were more doom and gloom with a slight uplift at the end, followed by more doom and gloom (this really is a problem, guys!) and a little bit of hope at the end (now that we know, we can fix this?).

This got me thinking about what a typical academic paper looks like. My mental model for a paper is:

1. show that the problem is really a problem (“[disease] is a significant cause of morbidity and mortality”)
2. show that the problem isn’t resolved by the prior work
4. incorporate the answer into the existing literature
5. discuss limitations and breezily dismiss them
6. show hope for the future

So I pulled the text of my 4 currently published papers. I’m going to call them well-children, medication time series, transfer networks and COPD readmissions.

I took the text out of each paper, copied it into plain text files and read those into R as above. I also computed line numbers within each of the different papers.

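paper_words <- tibble(file = paste0("~/projects/paper_analysis/",
                                    c("well_child.txt", "pharm_ts.txt",
                                      "transfers.txt", "copd.txt"))) %>%
  # read each file into a list-column of lines, as with the thesis
  mutate(text = map(file, read_lines)) %>%
  unnest(text) %>%
  # number the lines within each paper, not across papers
  group_by(file = str_sub(basename(file), 1, -5)) %>%
  mutate(line_number = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)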

paper_sentiment <- paper_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  # location as a percentage of each paper's own length, in 5% bins;
  # grouping by file keeps max(line_number) from being taken across all papers
  group_by(file) %>%
  mutate(index = round(line_number / max(line_number) * 100 / 5) * 5) %>%
  ungroup() %>%
  count(file, index, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative)

paper_sentiment %>%
  ggplot(aes(x = index, y = net_sentiment, fill = file)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ file) +
  scale_x_continuous("Location in paper (percent)") +
  scale_y_continuous("Bing Net Sentiment")

It looks like I wasn’t totally off. Most of the papers start out relatively negative and have super negative results sections (judging by location in the paper), but I was wrong about them ending on a happy note.
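Since the papers have different lengths, the x-axis uses percent of the paper rather than raw line numbers. Here is a sketch of that binning on two made-up papers (the file names and lengths are hypothetical), assuming the maximum line number is taken within each paper.

```r
library(tibble)
library(dplyr)

# Two hypothetical papers, 37 and 111 lines long
toy <- tibble(file = rep(c("short_paper", "long_paper"), times = c(37, 111))) %>%
  group_by(file) %>%
  mutate(line_number = row_number(),
         # percent of the way through this paper, rounded to the nearest 5
         index = round(line_number / max(line_number) * 100 / 5) * 5) %>%
  ungroup()

# Both papers now run to 100 regardless of length
bin_range <- toy %>%
  group_by(file) %>%
  summarize(min_index = min(index), max_index = max(index))
bin_range
```

Without the group_by(file), max(line_number) would be the length of the longest paper, and the shorter papers would stop well short of 100 on the x-axis.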

And the sentiment for this post:

Talking about negative sentiments is a negative sentiment. But look at the start when I was talking about Austen… that was a good time.