Chapter 4 Text Mining

4.1 Recap

In my last lecture I talked about using the rvest package to pull data from the Internet. This is useful because there is a lot of information out there that you could download and analyze. It can also be a pain to clean the data and get it into the format you want, but that's life and there isn't much you can do about it.

  • Every 60 seconds on Facebook: 510,000 comments are posted and 293,000 statuses are updated

  • Facebook Photo uploads total 300 million per day

  • Worldwide, there were over 2.32 billion active Facebook users as of December 31, 2018

  • Every second, on average, around 6,000 tweets are tweeted on Twitter

  • 1,332,433 publications in PubMed for the year 2018

A tweet from any of the most-followed Twitter accounts can represent a lot of money! There are people who try to figure out just how much.

4.2 Unstructured Data

Much of the text information found in these sources is unstructured, meaning that the content is a narrative, a collection of phrases, or maybe social media posts that might involve domain-specific references or a form of slang. Not surprisingly, it can be hard to get meaningful information from text. Unstructured data is useful in that it can capture many contexts or points of view in response to a question or situation. It's also more "natural" for humans to speak and write in unstructured terms.

http://www.triella.com/unstructured-data/

Some unstructured text does conform to an ontology, vocabulary, or some form of domain approved terminology that can simplify knowledge extraction but there are always ambiguities. For example consider a body (aka “corpus”) of medical information related to Nephrology. The following terms might be used to refer to the same thing:

  • Chronic renal impairment
  • Chronic kidney disease
  • CKD
  • Kidney failure
  • Chronic renal failure
  • CRF
  • Chronic renal failure syndrome
  • Renal insufficiency
  • eGFR 44

Source: http://healtex.org/wp-content/uploads/2016/11/I4H-Healtex-tutorial-part1-optimised.pdf

Unless someone knows that these all map to the same concept, downstream analysis might lead to confusing or diluted results.
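
As a hint of what a cleanup step could look like, here is a minimal sketch (the lookup table and input terms are made up for illustration) that collapses a few of the synonyms above onto one canonical label before any counting is done:

# Hypothetical lookup that maps variant terms onto a single canonical concept
ckd_synonyms <- c("chronic renal impairment" = "chronic kidney disease",
                  "ckd"                      = "chronic kidney disease",
                  "chronic renal failure"    = "chronic kidney disease",
                  "renal insufficiency"      = "chronic kidney disease")

terms <- c("CKD", "Renal insufficiency", "hypertension")

# Normalize case, replace known synonyms, and let unknown terms pass through
mapped <- ifelse(tolower(terms) %in% names(ckd_synonyms),
                 ckd_synonyms[tolower(terms)],
                 terms)
mapped   # "chronic kidney disease" "chronic kidney disease" "hypertension"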

4.3 Structured Data

Structured data is easier to work with in the sense that it is frequently numeric in nature, so it lends itself well to statistical analysis and to building predictive models. Some structured data is standalone, meaning you can download it or obtain it as a .CSV file or from a database (e.g. the CDC) and begin to work with it immediately upon receipt.

However, structured data does NOT have to be numeric. For example, consider demographic information. As long as a patient tracking system conforms to a standard, then one's race (and the label used to describe it) would usually not change. Lists of personal allergies and medications are usually structured, although they might change over time. Smoking status ("yes" or "no") is structured. Asking someone to describe their lifetime struggles with trying to stop smoking would not produce structured information.

4.4 Hybrids

Consider a Pubmed publication. There is a lot of textual information that is supplemented by graphs, tables, and (hopefully) link outs to raw data. Of course, the tables themselves represent a form of structured data in aggregate form, so you might want to parse or extract them from the text. Also consider that clinician notes and letters are usually in narrative form but might well include structured info as well.

The take-home message here is that there is far more unstructured data than structured data, but there is a lot of cool stuff to be learned from unstructured information. So don't let that stop you from applying text mining methods to understand collections of documents.

4.5 Information Retrieval vs Information Extraction

Searching for data is different from extracting meaning from the results returned from a search. If a search engine is good it will return relevant results in response to our query. A bad search engine will return lots of “false positives”. The idea of relevance can be expressed using a number of mathematically computed ratios such as precision (aka positive predictive value) and recall (aka sensitivity). We’ll dig more into these later.

https://en.wikipedia.org/wiki/Precision_and_recall
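
As a quick illustration (with made-up counts), precision and recall are just ratios of true positives to what was returned and to what was actually relevant, respectively:

# Hypothetical search results: 40 relevant documents returned (true positives),
# 10 irrelevant documents returned (false positives), and 5 relevant
# documents that the search missed (false negatives)
tp <- 40
fp <- 10
fn <- 5

precision <- tp / (tp + fp)   # aka positive predictive value
recall    <- tp / (tp + fn)   # aka sensitivity

precision   # 0.8
recall      # 0.889 (approximately)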

So for this lecture we'll assume that we already have some meaningful results and will work with text that we have obtained (or will obtain) using the web scraping techniques from a previous lecture.

4.6 Web Scraping Revisited

So I just said that we would be focusing on how to compute on the data instead of how to retrieve it but it’s important to review some basic techniques to refresh your memory and up your R game a little bit. Let’s get some speeches given by former President Obama. This is easy to do.

Look at this URL http://www.obamaspeeches.com/

We can get all the speeches from this page if we want to using the rvest package. It's tedious and takes some time to get things to work correctly, but it's not intellectually challenging. Data scraping and cleaning are two of the more common tasks when working with unstructured text. There is no way around it, really. Most projects start out with lots of enthusiasm until the data starts rolling in and everyone realizes that there are problems with formatting, etc. Anyway, we'll get a list of all the links from the page itself. Well, all the links that refer to a speech.

library(rvest)
url <- "http://www.obamaspeeches.com/"

links <- read_html(url) %>% html_nodes("a") %>% html_attr("href") 
links[c(1:2,110:114)]
## [1] "/P-Obama-Inaugural-Speech-Inauguration.htm"                                                                      
## [2] "/E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm"                         
## [3] "003-John-Lewis-65th-Birthday-Gala-Obama-Speech.htm"                                                              
## [4] "002-Keynote-Address-at-the-2004-Democratic-National-Convention-Obama-Speech.htm"                                 
## [5] "001-2002-Speech-Against-the-Iraq-War-Obama-Speech.htm"                                                           
## [6] "http://www.cafepress.com/obamaquotes"                                                                            
## [7] "http://www.amazon.com/INSPIRE-NATION-Electrifying-Speeches-Inauguration/dp/0982100531/?tag=obama-speeches-com-20"

So there are about 114 links, but the last couple of elements don't appear to relate to speeches, so we'll want to drop them. Let's look at the tail of the list:

links[(length(links)-3):length(links)]
## [1] "002-Keynote-Address-at-the-2004-Democratic-National-Convention-Obama-Speech.htm"                                 
## [2] "001-2002-Speech-Against-the-Iraq-War-Obama-Speech.htm"                                                           
## [3] "http://www.cafepress.com/obamaquotes"                                                                            
## [4] "http://www.amazon.com/INSPIRE-NATION-Electrifying-Speeches-Inauguration/dp/0982100531/?tag=obama-speeches-com-20"

I’m going to formalize this a bit more and make it better because it turns out that some of the speech links on that page are duplicated. Besides, I also want to get back the full link so I can access it and download the text.

linkScraper <- function(baseurl="http://obamaspeeches.com") {
#
# Function to grab the speech links from the Obama Speech site
# http://obamaspeeches.com/
#
  suppressMessages(library(rvest))

  links <- read_html(baseurl) %>% html_nodes("a") %>% html_attr("href")

  # Strip leading slashes and drop the external (http...) links
  links         <- gsub("/","",links)
  cleaned_links <- links[-grep("http",links)]

  # Some speech links appear more than once, so drop the duplicates
  cleaned_links <- cleaned_links[!duplicated(cleaned_links)]

  # Rebuild the full URLs so we can fetch each speech
  retlinks <- paste(baseurl,cleaned_links,sep="/")
  return(retlinks)
}
obamalinks <- linkScraper() 

obamalinks[3:10]
## [1] "http://obamaspeeches.com/E-Barack-Obama-Speech-Manassas-Virgina-Last-Rally-2008-Election.htm"                                                                       
## [2] "http://obamaspeeches.com/E10-Barack-Obama-The-American-Promise-Acceptance-Speech-at-the-Democratic-Convention-Mile-High-Stadium--Denver-Colorado-August-28-2008.htm"
## [3] "http://obamaspeeches.com/E09-Barack-Obama-Final-Primary-Night-Presumptive-Democratic-Nominee-Speech-St-Paul-Minnesota-June-3-2008.htm"                              
## [4] "http://obamaspeeches.com/E08-Barack-Obama-North-Carolina-Primary-Night-Raleigh-NC-May-6-2008.htm"                                                                   
## [5] "http://obamaspeeches.com/E07-Barack-Obama-Pennsylvania-Primary-Night-Evansville-Indiana-April-22-2008.htm"                                                          
## [6] "http://obamaspeeches.com/E06-Barack-Obama-AP-Annual-Luncheon-Washington-DC-April-14-2008-religion-guns-pennsylvania.htm"                                            
## [7] "http://obamaspeeches.com/E05-Barack-Obama-A-More-Perfect-Union-the-Race-Speech-Philadelphia-PA-March-18-2008.htm"                                                   
## [8] "http://obamaspeeches.com/E04-Barack-Obama-March-4-Primary-Night-Texas-and-Ohio-San-Antonio-TX-March-4-2008.htm"

Now let’s get the actual speech content. Not only will we get the speech content, but we will also keep track of the specific speech number from which the text came. So we will create a factor that lets us identify each speech by a number. We don’t have to do this if we want to treat all the speech text as one big “blob”.

obamaParse <- function(url,write=FALSE) {
  library(rvest)
  retlist <- list()
  df <- data.frame()
  
  for (ii in 1:length(url)) {
    page <- read_html(url[ii]) %>% html_nodes("p") %>% html_text()
    
    # The following strips the site navigation text that appears before
    # the speech itself (everything up to and including the "Against the
    # Iraq" menu entry). See, this is the kind of thing you have to deal
    # with when working with real data.
    
    val  <- grep("Against the Iraq",page)
    page <- page[-(1:val)]
  
    # We could also strip out numbers here (e.g. with
    # gsub("[0-9]+","",page)) since they generally don't mean very much
    # in text analysis, but we'll keep them for now and remove them
    # later in the tidy workflow.
    
    # We populate each list element with a data frame that contains the
    # text of one speech (one row per paragraph) along with its
    # associated "book" number, e.g. 1,2,3... for each element of the
    # url argument
    
    retlist[[ii]] <- data.frame(text=page,book=as.character(ii),
                                stringsAsFactors = FALSE)
  }
  
  # Combine the per-speech data frames into one data frame and return it
  df <- do.call(rbind,retlist)
  return(df)
}

Let’s grab just a handful of the speeches to keep things manageable.

library(dplyr)

out <- obamaParse(obamalinks[2:12])
speeches <- out %>% group_by(book) %>% 
  mutate(linenumber = row_number()) %>% ungroup()

speeches
## # A tibble: 472 x 3
##    text                                                    book  linenumber
##    <chr>                                                   <chr>      <int>
##  1 "Election Night Victory Speech \n              Grant P… 1              1
##  2 ""                                                      1              2
##  3 "If \n              there is anyone out there who stil… 1              3
##  4 "It’s \n              the answer told by lines that st… 1              4
##  5 "It’s \n              the answer spoken by young and o… 1              5
##  6 "It’s \n              the answer that led those who ha… 1              6
##  7 "It’s \n              been a long time coming, but ton… 1              7
##  8 "I just \n              received a very gracious call … 1              8
##  9 "I want \n              to thank my partner in this jo… 1              9
## 10 "I would \n              not be standing here tonight … 1             10
## # … with 462 more rows

4.7 Exploring The Text

So there are some general activities common to text explorations. Let’s summarize:

  1. Collections of text (one or more documents) can be called a corpus
  2. We tend to break up documents into words and treat them as a bag of words
  3. Getting the text into a tidy format is important. The tidytext package helps.
  4. Basically we get each word of a phrase or document onto its own line of a data frame
  5. We then remove stop words which are filler words that don’t mean much
  6. We might also get rid of numbers
  7. Next, we stem the words (e.g. america and americas are really the same word)

The tidytext package is what we will use here, although there are a number of R packages that take a more traditional approach to analyzing text, such as the tm package. There are also the RWeka and qdap packages that help with text data.

You might also encounter references to things like Term Document Matrices or Document Term Matrices, which are sparse matrix structures that help count the number of times a word occurs in a given set of documents. The transpose of a Document Term Matrix is a Term Document Matrix (and vice versa).

Here is a very brief walkthrough on how you could do some text explorations using the tm package:

# https://stackoverflow.com/questions/24703920/r-tm-package-vcorpus-error-in-converting-corpus-to-data-frame#24704075

suppressMessages(library(tm))
library("wordcloud")
library("RColorBrewer")

# Create a vector with text
x <- c("Hello. Sir!","Tacos? On Tuesday?!?", "Hello")

# We create a Corpus
mycorpus <- Corpus(VectorSource(x))

# Get rid of the punctuation
mycorpus <- tm_map(mycorpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(mycorpus, removePunctuation): transformation
## drops documents
# Create a Term Document Matrix
dtm <- TermDocumentMatrix(mycorpus)

# We can turn it into a matrix that humans can read
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
##            word freq
## hello     hello    2
## sir         sir    1
## tacos     tacos    1
## tuesday tuesday    1
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

https://gerardnico.com/natural_language/term_document

The strengths of tm are that it more closely aligns with traditional text mining terminology and approaches. There are also a number of tools and packages that work well with tm but do not work well directly with the tidytext approach. But there are now functions in tidytext that address this concern, which can take, for example, a Term Document Matrix and turn it into a tidy data frame.
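
For example, here is a minimal sketch (assuming the dtm object created in the tm walkthrough above) of converting that sparse matrix into a tidy, one-row-per-term-per-document data frame with tidytext's tidy() method:

library(tidytext)

# Convert the sparse term-document matrix into a tidy data frame with
# one row per term/document pair and a count column
tidy(dtm)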

4.7.1 Tidy Format

In this class you have encountered previous references to the tidyverse and tools such as those found in dplyr. It shouldn’t be surprising, then, that the tidytext package has text analysis tools that fit in nicely with the tidyverse.

4.7.2 Tidy Up the text

Let’s look at the speeches data frame

speeches
## # A tibble: 472 x 3
##    text                                                    book  linenumber
##    <chr>                                                   <chr>      <int>
##  1 "Election Night Victory Speech \n              Grant P… 1              1
##  2 ""                                                      1              2
##  3 "If \n              there is anyone out there who stil… 1              3
##  4 "It’s \n              the answer told by lines that st… 1              4
##  5 "It’s \n              the answer spoken by young and o… 1              5
##  6 "It’s \n              the answer that led those who ha… 1              6
##  7 "It’s \n              been a long time coming, but ton… 1              7
##  8 "I just \n              received a very gracious call … 1              8
##  9 "I want \n              to thank my partner in this jo… 1              9
## 10 "I would \n              not be standing here tonight … 1             10
## # … with 462 more rows

This looks bad because the text still has junk characters in it. Each row also has a long, stringy line of text (in the text column) that is not only hard to read but also inconvenient to compute on. It would be better to first break up those long, obnoxious-looking lines into individual words.

library(tidytext)
tidy_speeches <- speeches %>%
  unnest_tokens(word, text)

tidy_speeches
## # A tibble: 27,849 x 3
##    book  linenumber word    
##    <chr>      <int> <chr>   
##  1 1              1 election
##  2 1              1 night   
##  3 1              1 victory 
##  4 1              1 speech  
##  5 1              1 grant   
##  6 1              1 park    
##  7 1              1 illinois
##  8 1              1 november
##  9 1              1 4       
## 10 1              1 2008    
## # … with 27,839 more rows

4.7.3 Dump The “Stop Words”

So we have some more work to do. Let’s get rid of common words that don’t substantially add to the comprehensibility of the information. There is a built-in data frame called stop_words that we can use to filter out such useless words (well, for purposes of information retrieval anyway).

data(stop_words)

tidy_speeches_nostop <- tidy_speeches %>%
  anti_join(stop_words)
## Joining, by = "word"

4.7.4 Eliminate Numbers

Let’s also remove numbers from the words.

# remove numbers
library(stringr)

nums <- tidy_speeches_nostop %>% 
  filter(str_detect(word, "^[0-9]")) %>% 
  select(word) %>% unique()

tidy_speeches_nostop_nonums <- tidy_speeches_nostop %>% 
  anti_join(nums, by = "word")
tidy_speeches_nostop_nonums %>%
  count(word, sort = TRUE) 
## # A tibble: 2,736 x 2
##    word           n
##    <chr>      <int>
##  1 change       106
##  2 time         105
##  3 people        99
##  4 america       98
##  5 american      94
##  6 country       90
##  7 americans     55
##  8 care          54
##  9 washington    54
## 10 campaign      52
## # … with 2,726 more rows

4.7.5 Stemming

Some of these words need to be “stemmed” - such as “american” and “americans”. There is a function in the SnowballC package that will allow us to do this. Here is an example.

library(SnowballC)

some_words <- c("america","americas","american","cat","cats","catastrophe")
wordStem(some_words)
## [1] "america"    "america"    "american"   "cat"        "cat"       
## [6] "catastroph"
tidy_speeches_nostop_nonums_stem <- tidy_speeches_nostop_nonums %>% 
                mutate(word_stem = wordStem(word)) 

tidy_speeches_nostop_nonums_stem
## # A tibble: 9,412 x 4
##    book  linenumber word     word_stem
##    <chr>      <int> <chr>    <chr>    
##  1 1              1 election elect    
##  2 1              1 night    night    
##  3 1              1 victory  victori  
##  4 1              1 speech   speech   
##  5 1              1 grant    grant    
##  6 1              1 park     park     
##  7 1              1 illinois illinoi  
##  8 1              1 november novemb   
##  9 1              3 doubts   doubt    
## 10 1              3 america  america  
## # … with 9,402 more rows

4.7.6 Bar Plots

library(ggplot2)

tidy_speeches_nostop_nonums_stem %>%
  count(word_stem, sort = TRUE) %>% filter(n > 40) %>%
  mutate(word_stem = reorder(word_stem, n)) %>%
  ggplot(aes(word_stem, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

http://jacobsimmering.com/2016/11/15/tidytext/

library(tidyr)

frequency <- tidy_speeches_nostop_nonums_stem %>% 
  count(book,word_stem) %>%
  group_by(book) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(book, proportion) %>% 
  gather(book, proportion, `1`:`9`)
tidy_speeches_nostop_nonums_stem %>% count(book,word_stem,sort=TRUE) %>% 
  filter(n > 14) %>% 
  ggplot(aes(word_stem, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() + facet_grid(book ~ .)

tidy_speeches_nostop_nonums_stem %>% count(book, word_stem, sort = TRUE) %>%
  group_by(book) %>%
  top_n(5) %>% 
  ungroup() %>%
  ggplot(aes(reorder_within(word_stem, n, book), n,fill = book)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~book, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(
    x = NULL, y = "Word count",
    title = "Most frequent words after removing stop words"
  )
## Selecting by n

https://richpauloo.github.io/2017-12-29-Using-tidytext-to-make-word-clouds/

4.7.7 Wordclouds

One of the premier visualizations for text documents is the word cloud. You see this a lot in newsletters and on flyers since word clouds show you frequently occurring topics in a way that makes it obvious what the most mentioned words are.

pal <- brewer.pal(8,"Dark2")

# plot the 50 most common words

tidy_speeches_nostop_nonums_stem %>%
  count(word_stem, sort = TRUE) %>%
  with(wordcloud(word_stem, n, random.order = FALSE, max.words = 50, colors=pal))

See https://shiring.github.io/text_analysis/2017/06/28/twitter_post

4.8 Term Frequency

As mentioned in the previous section, much of the content of a collection of text consists of “filler” words such as “the”, “a”, “an”, “as”, “is”, “at”, and “which”. Identifying these words is important in text mining and in search engine technology since you generally want to avoid them, as they can interfere with getting a good result. Stop words are also language dependent.

Moreover, a given domain also has its own stop words. For example, when considering documents or tweets that mention kidney disease, you might want to filter out specific references to “kidney disease” since you already know that’s the topic, and instead drill down into other words and phrases that might be important.

https://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html
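
Here is a minimal sketch of how you might layer domain-specific stop words on top of the built-in stop_words data frame (the kidney-related terms below are just illustrative):

library(dplyr)
library(tidytext)

data(stop_words)

# Hypothetical domain-specific stop words for a kidney disease corpus
domain_stop_words <- tibble(word    = c("kidney", "renal", "ckd"),
                            lexicon = "custom")

# Combine with the built-in list; use with anti_join() just like stop_words
custom_stop_words <- bind_rows(stop_words, domain_stop_words)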

4.8.1 Back To The Speeches

Let’s look at some examples of term frequency using collections of speeches from Obama and Romney when they were both candidates. I have two folders with roughly 50 speeches each.

library(readtext) # Makes it easy to read in raw text files

# We have a folder with about 50 Obama Speeches

setwd("~/Downloads/Candidates/OBAMA")
filenames <- list.files( pattern="*", full.names=TRUE)
suppressWarnings(obama <- lapply(filenames, readtext))
collected_obama <- sapply(obama,function(x) x$text) 
length(collected_obama)
## [1] 48
# Now Let's get this into tidy text. We will also put a column
# that marks the text with its respective source ("Obama","Romney")

obama_df <- tibble(line=1:length(collected_obama),
                   text=collected_obama,
                   book="obama")

# Let's tidy this up in accordance with tidy principles

tidy_obama_df <- obama_df %>%
  unnest_tokens(word, text)

tidy_obama_df
## # A tibble: 188,999 x 3
##     line book  word       
##    <int> <chr> <chr>      
##  1     1 obama a          
##  2     1 obama couple     
##  3     1 obama of         
##  4     1 obama people     
##  5     1 obama i          
##  6     1 obama just       
##  7     1 obama want       
##  8     1 obama to         
##  9     1 obama acknowledge
## 10     1 obama first      
## # … with 188,989 more rows

Next we will do the same for the Romney speeches.

# We have about 50 speeches from Romney

setwd("~/Downloads/Candidates/ROMNEY")
filenames <- list.files( pattern="*", full.names=TRUE)
suppressWarnings(romney <- lapply(filenames, readtext))
collected_romney <- sapply(romney,function(x) x$text) 
length(collected_romney)
## [1] 49
# Now Let's get this into tidy text
rom_df <- tibble(line=1:length(collected_romney),
                 text=collected_romney,
                 book="romney")

tidy_romney_df <- rom_df %>%
  unnest_tokens(word, text)

# Let's combine the two data frames

combined_tidy_speech <- rbind(tidy_obama_df,tidy_romney_df)

# Count the total number of speech words per candidate

combined_tidy_speech %>% group_by(book) %>% summarize(count=n())
## # A tibble: 2 x 2
##   book    count
##   <chr>   <int>
## 1 obama  188999
## 2 romney  90229

Next we’ll count the number of times each word occurs within speeches for each candidate:

# Get number of times a word occurs within a speech
# per candidate

book_words <- combined_tidy_speech %>% 
  count(book, word, sort=TRUE)

# Get the total number of words in each speech collection
total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

So now we can join these two data frames. What we get for our trouble is a way to compute the frequency of a given term within each speech collection for each candidate. This tells us which terms occur frequently and which do not. We can plot these.

book_words <- left_join(book_words, total_words)
## Joining, by = "book"
book_words
## # A tibble: 11,515 x 4
##    book   word      n  total
##    <chr>  <chr> <int>  <int>
##  1 obama  the    7306 188999
##  2 obama  to     6579 188999
##  3 obama  and    6287 188999
##  4 romney the    4364  90229
##  5 obama  that   4050 188999
##  6 obama  a      3776 188999
##  7 romney and    3420  90229
##  8 obama  of     3361 188999
##  9 obama  you    3341 188999
## 10 obama  in     3115 188999
## # … with 11,505 more rows
library(ggplot2)

ggplot(book_words, aes(n/total, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol = 2, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 359 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

4.9 Tf and Idf

A central question in text mining and natural language processing is how to quantify what a document is about. One measure of how important a word may be is its term frequency (tf): how frequently a word occurs in a document.

We could use a list of stop words but it’s not always the most sophisticated approach to adjusting term frequency for commonly used words.

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together): the frequency of a term adjusted for how rarely it is used.

https://www.tidytextmining.com/tfidf.html

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
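
Before using the convenience function below, here is a hand-rolled sketch of the same calculation (assuming the book_words data frame built above, with its n and total columns), just to make the arithmetic concrete:

library(dplyr)

n_docs <- n_distinct(book_words$book)   # number of document collections (2 here)

manual_tf_idf <- book_words %>%
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(book))) %>%  # rarity across collections
  ungroup() %>%
  mutate(tf     = n / total,                        # frequency within a collection
         tf_idf = tf * idf)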

We can use the bind_tf_idf function to compute tf, idf, and tf-idf for us. Furthermore, it will bind this information onto a tidy text data frame.

book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words
## # A tibble: 11,515 x 7
##    book   word      n  total     tf   idf tf_idf
##    <chr>  <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1 obama  the    7306 188999 0.0387     0      0
##  2 obama  to     6579 188999 0.0348     0      0
##  3 obama  and    6287 188999 0.0333     0      0
##  4 romney the    4364  90229 0.0484     0      0
##  5 obama  that   4050 188999 0.0214     0      0
##  6 obama  a      3776 188999 0.0200     0      0
##  7 romney and    3420  90229 0.0379     0      0
##  8 obama  of     3361 188999 0.0178     0      0
##  9 obama  you    3341 188999 0.0177     0      0
## 10 obama  in     3115 188999 0.0165     0      0
## # … with 11,505 more rows

Let’s look at a sorted list to see the high tf_idf words

book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 11,515 x 6
##    book   word            n       tf   idf   tf_idf
##    <chr>  <chr>       <int>    <dbl> <dbl>    <dbl>
##  1 romney в             106 0.00117  0.693 0.000814
##  2 obama  boo           200 0.00106  0.693 0.000733
##  3 romney enterprise     57 0.000632 0.693 0.000438
##  4 romney president's    55 0.000610 0.693 0.000423
##  5 obama  250,000       107 0.000566 0.693 0.000392
##  6 obama  bargain       100 0.000529 0.693 0.000367
##  7 obama  michelle       98 0.000519 0.693 0.000359
##  8 romney nations        44 0.000488 0.693 0.000338
##  9 obama  outstanding    82 0.000434 0.693 0.000301
## 10 romney iran           38 0.000421 0.693 0.000292
## # … with 11,505 more rows

We can plot this information

book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(book) %>% 
  top_n(15) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf

4.10 Sentiment

Attempting to understand the feeling associated with words is a complicated business, especially given that context can drastically change the meaning of words and phrases. Consider the following statements. To a human the sentiment is obvious. The third statement reflects sarcasm, which is notoriously difficult for an algorithm to pick up.

I had a good flight experience on Delta airlines.

I had a bad flight experience on Delta airlines.

Gee thanks Delta for losing my luggage.

4.10.1 Sentiment Dictionaries and Lexicons

The tidytext package has a data frame called sentiments, and the get_sentiments() function gives us access to several lexicons that will help us figure out the emotional value of words. Each lexicon handles sentiment assessment differently.

The nrc dictionary applies a single-word emotional description to a text word. Of course, a word in a piece of text can be associated with different emotions. The possible sentiments are:

unique(get_sentiments("nrc")$sentiment)
##  [1] "trust"        "fear"         "negative"     "sadness"     
##  [5] "anger"        "surprise"     "positive"     "disgust"     
##  [9] "joy"          "anticipation"

Next there is the afinn dictionary, which assigns scores from -5 to 5 corresponding to a range of negative to positive emotion - although the rating is “stepped” (i.e. integers).

get_sentiments("afinn") %>% head(.)
## # A tibble: 6 x 2
##   word       score
##   <chr>      <int>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
get_sentiments("afinn") %>% select('score') %>% range(.)
## [1] -5  5

The bing lexicon offers a binary assessment of emotion. It’s either “negative” or “positive” so there is no continuum of emotion.

unique(get_sentiments("bing")$sentiment)
## [1] "negative" "positive"
#
get_sentiments("bing") %>% head()
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faced    negative 
## 2 2-faces    negative 
## 3 a+         positive 
## 4 abnormal   negative 
## 5 abolish    negative 
## 6 abominable negative

Next up there is the loughran lexicon, which is appropriate for financial and business contexts:

unique(get_sentiments("loughran")$sentiment)
## [1] "negative"     "positive"     "uncertainty"  "litigious"   
## [5] "constraining" "superfluous"

4.10.2 An Example

head(sentiments)
## # A tibble: 6 x 4
##   word      sentiment lexicon score
##   <chr>     <chr>     <chr>   <int>
## 1 abacus    trust     nrc        NA
## 2 abandon   fear      nrc        NA
## 3 abandon   negative  nrc        NA
## 4 abandon   sadness   nrc        NA
## 5 abandoned anger     nrc        NA
## 6 abandoned fear      nrc        NA
txt <- c("I had a good flight experience on Delta airlines",
  "I had a bad flight experience on Delta airlines",
  "Gee thanks Delta for losing my luggage")

df <- tibble(1:length(txt),text=txt)
df <- df %>% unnest_tokens(word,text)

# We get rid of the stop words and then see what the nrc sentiment
# dictionary tells us about our data

df %>% anti_join(stop_words) %>% left_join(get_sentiments("nrc"))
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 19 x 3
##    `1:length(txt)` word       sentiment
##              <int> <chr>      <chr>    
##  1               1 flight     <NA>     
##  2               1 experience <NA>     
##  3               1 delta      <NA>     
##  4               1 airlines   <NA>     
##  5               2 bad        anger    
##  6               2 bad        disgust  
##  7               2 bad        fear     
##  8               2 bad        negative 
##  9               2 bad        sadness  
## 10               2 flight     <NA>     
## 11               2 experience <NA>     
## 12               2 delta      <NA>     
## 13               2 airlines   <NA>     
## 14               3 gee        <NA>     
## 15               3 delta      <NA>     
## 16               3 losing     anger    
## 17               3 losing     negative 
## 18               3 losing     sadness  
## 19               3 luggage    <NA>
df %>% anti_join(stop_words) %>% left_join(get_sentiments("bing"))
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 13 x 3
##    `1:length(txt)` word       sentiment
##              <int> <chr>      <chr>    
##  1               1 flight     <NA>     
##  2               1 experience <NA>     
##  3               1 delta      <NA>     
##  4               1 airlines   <NA>     
##  5               2 bad        negative 
##  6               2 flight     <NA>     
##  7               2 experience <NA>     
##  8               2 delta      <NA>     
##  9               2 airlines   <NA>     
## 10               3 gee        <NA>     
## 11               3 delta      <NA>     
## 12               3 losing     negative 
## 13               3 luggage    <NA>
df %>% anti_join(stop_words) %>% left_join(get_sentiments("afinn"))
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 13 x 3
##    `1:length(txt)` word       score
##              <int> <chr>      <int>
##  1               1 flight        NA
##  2               1 experience    NA
##  3               1 delta         NA
##  4               1 airlines      NA
##  5               2 bad           -3
##  6               2 flight        NA
##  7               2 experience    NA
##  8               2 delta         NA
##  9               2 airlines      NA
## 10               3 gee           NA
## 11               3 delta         NA
## 12               3 losing        -3
## 13               3 luggage       NA
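
As a small extension of this example (a sketch, not part of the original output), we could sum the afinn scores to get a net sentiment per statement. Note that the first column of df picked up the awkward name `1:length(txt)`, so we rename it first; statement 1 drops out entirely because none of its remaining words appear in the afinn lexicon, which shows how lexicon coverage limits what we can measure.

df %>% 
  anti_join(stop_words, by = "word") %>%
  rename(statement = `1:length(txt)`) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(statement) %>%
  summarize(net_score = sum(score))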

4.10.3 Speech Sentiment

Let’s apply our newfound knowledge of sentiment tools to the combined speeches of Romney and Obama. We might want to get some sense of how emotive their respective sets of speeches are. We might also want to look at the emotive range for all speeches combined, independently of candidate. Of course, these are political speeches, so you know up front that there will be a bias towards positive language. So here we will join the words from all speeches to the bing sentiment lexicon to rate the emotional content of the words comprising each speech. Note that many words in the speeches will not have an emotional rating because they don’t exist in the lexicon. This is why you should use a stop word approach or consider using a term frequency - inverse document frequency (tf-idf) approach to first eliminate filler words.

 bing_word_counts <- combined_tidy_speech %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"

So let’s look at some of the emotional content of the speeches independently of candidate (i.e. all speeches combined).

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

It’s also really easy to look at the emotions for each candidate just by using the book variable from the bing_word_counts data frame.

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment+book, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

So if we look at a word cloud of combined sentiment across all speeches independently of candidate we see the following:

pal <- brewer.pal(8,"Dark2")

# plot the 50 most common words

bing_word_counts  %>%
  with(wordcloud(word, n, random.order = FALSE, max.words = 50, colors=pal))

4.10.4 Comparison Cloud

Another thing we can do is look at a comparison cloud, which gives us a word cloud that visualizes the most common positive and negative words. There is a function called comparison.cloud() that will accomplish this, but it also requires us to do some conversion on our data frame.

library(reshape2)
combined_tidy_speech %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

We could also look at this on a per candidate basis:

library(reshape2)
combined_tidy_speech %>%
  inner_join(get_sentiments("bing")) %>% 
  filter(book=="romney") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): liberty could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): benefits could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): promises could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): recovery could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): innovation could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): greatness could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): powerful could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): destiny could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): proud could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): honor could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): commitment could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): confidence could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): courage could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): easy could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): protect could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): tough could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): extraordinary could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("gray20", "gray80"), max.words =
## 100): wonder could not be fit on page. It will not be plotted.