Chapter 5 Bag of Words Sentiment Analysis
Once we have a collection of text it’s interesting to figure out what, if anything, it might mean or imply. In text analysis and NLP (Natural Language Processing) we talk about a “Bag of Words” to describe a collection or “corpus” of unstructured text. What do we do with a “bag of words”?
- Extract meaning from collections of text (without reading !)
- Detect and analyze patterns in unstructured textual collections
- Use Natural Language Processing techniques to reach conclusions
- Discover what ideas occur in text and how they might be linked
- Determine whether the discovered patterns can be used to predict behavior
- Identify interesting ideas that might otherwise be ignored
5.1 Workflow
- Identify and Obtain text (e.g. websites, Twitter, Databases, PDFs, surveys)
- Create a text “Corpus” - a structure that contains the raw text
- Apply transformations:
- Normalize case (convert to lower case)
- Remove punctuation and stopwords
- Remove domain-specific stopwords
- Perform Analysis and Visualizations (word frequency, tagging, wordclouds)
- Do Sentiment Analysis
R has Packages to Help. These are just some of them:
- qdap - Quantitative Discourse Analysis Package
- tm - text mining applications within R
- tidytext - Text Mining using dplyr, ggplot2, and other tidyverse tools
- SentimentAnalysis - For Sentiment Analysis
However, consider that:
- Some of these are easier to use than others
- Some can be kind of a problem to install (e.g. qdap)
- They all offer similar capabilities
- We’ll look at tidytext (a minimal sketch of the full pipeline follows)
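To give you a feel for where we are headed, here is a minimal sketch of the kind of pipeline we will build with tidytext over the rest of this chapter. It assumes you already have a character vector called my_text holding your raw documents (the name is just a placeholder):

library(tidytext)
library(dplyr)

# one row per document, then one row per word ("token")
tokens <- tibble(line = seq_along(my_text), text = my_text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")   # drop common, uninformative words

tokens %>% count(word, sort = TRUE)    # word frequencies

tokens %>%                             # crude positive vs negative tally
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)

Don’t worry about the details yet; each of these steps is covered below.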
5.2 Simple Example
Find the URL for Lincoln’s March 4, 1865 Speech:
<- "https://millercenter.org/the-presidency/presidential-speeches/march-4-1865-second-inaugural-address"
url library(rvest)
<- read_html(url) %>%
lincoln_doc html_nodes(".view-transcript") %>%
html_text()
lincoln_doc
## [1] "TranscriptFellow-Countrymen: At this second appearing to take the oath of the Presidential office there is less occasion for an extended address than there was at the first. Then a statement somewhat in detail of a course to be pursued seemed fitting and proper. Now, at the expiration of four years, during which public declarations have been constantly called forth on every point and phase of the great contest which still absorbs the attention and engrosses the energies of the nation, little that is new could be presented. The progress of our arms, upon which all else chiefly depends, is as well known to the public as to myself, and it is, I trust, reasonably satisfactory and encouraging to all. With high hope for the future, no prediction in regard to it is ventured.On the occasion corresponding to this four years ago all thoughts were anxiously directed to an impending civil war. All dreaded it, all sought to avert it. While the inaugural address was being delivered from this place, devoted altogether to saving the Union without war, insurgent agents were in the city seeking to destroy it without war-seeking to dissolve the Union and divide effects by negotiation. Both parties deprecated war, but one of them would make war rather than let the nation survive, and the other would accept war rather than let it perish, and the war came.One-eighth of the whole population were colored slaves, not distributed generally over the Union. but localized in the southern part of it. These slaves constituted a peculiar and powerful interest. All knew that this interest was somehow the cause of the war. To strengthen, perpetuate, and extend this interest was the object for which the insurgents would rend the Union even by war, while the Government claimed no right to do more than to restrict the territorial enlargement of it. Neither party expected for the war the magnitude or the duration which it has already attained. Neither anticipated that the cause of the conflict might cease with or even before the conflict itself should cease. Each looked for an easier triumph, and a result less fundamental and astounding. Both read the same Bible and pray to the same God, and each invokes His aid against the other. It may seem strange that any men should dare to ask a just God's assistance in wringing their bread from the sweat of other men's faces, but let us judge not, that we be not judged. The prayers of both could not be answered. That of neither has been answered fully. The Almighty has His own purposes. \"Woe unto the world because of offenses; for it must needs be that offenses come, but woe to that man by whom the offense cometh.\" If we shall suppose that American slavery is one of those offenses which, in the providence of God, must needs come, but which, having continued through His appointed time, He now wills to remove, and that He gives to both North and South this terrible war as the woe due to those by whom the offense came, shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away. 
Yet, if God wills that it continue until all the wealth piled by the bondsman's two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said \"the judgments of the Lord are true and righteous altogether.\"With malice toward none, with charity for all, with firmness in the fight as God gives us to see the right, let us strive on to finish the work we are in, to bind up the nation's wounds, to care for him who shall have borne the battle and for his widow and his orphan, to do all which may achieve and cherish a just and lasting peace among ourselves and with all nations."
There are probably lots of words that don’t really “matter” or contribute to the “real” meaning of the speech.
word_vec <- unlist(strsplit(lincoln_doc," "))
word_vec[1:20]
## [1] "TranscriptFellow-Countrymen:" "" "At" "this"
## [5] "second" "appearing" "to" "take"
## [9] "the" "oath" "of" "the"
## [13] "Presidential" "office" "there" "is"
## [17] "less" "occasion" "for" "an"
sort(table(word_vec),decreasing = TRUE)[1:10]
## word_vec
## the to and of that for be in it a
## 54 26 24 22 11 9 8 8 8 7
How do we remove all the uninteresting words? We could do it manually:
# Remove all punctuation marks
word_vec <- gsub("[[:punct:]]","",word_vec)

stop_words <- c("the","to","and","of","the","for","in","it",
                "a","this","which","by","is","an","hqs","from",
                "that","with","as")

for (ii in 1:length(stop_words)) {
  for (jj in 1:length(word_vec)) {
    if (stop_words[ii] == word_vec[jj]) {
      word_vec[jj] <- ""
    }
  }
}

word_vec <- word_vec[word_vec != ""]
sort(table(word_vec),decreasing = TRUE)[1:10]
## word_vec
## war all be we but God shall was do let
## 11 8 8 6 5 5 5 5 4 4
word_vec[1:30]
## [1] "TranscriptFellowCountrymen" "At" "second" "appearing"
## [5] "take" "oath" "Presidential" "office"
## [9] "there" "less" "occasion" "extended"
## [13] "address" "than" "there" "was"
## [17] "at" "first" "Then" "statement"
## [21] "somewhat" "detail" "course" "be"
## [25] "pursued" "seemed" "fitting" "proper"
## [29] "Now" "at"
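Incidentally, the double for loop above gets the job done but is not idiomatic R. The same stop word removal can be written as a single vectorized step using %in%; a sketch, using the same hand-built stop_words vector (and, like the loop, it is case-sensitive):

# keep only the words that are NOT in the stop word vector
word_vec <- word_vec[!word_vec %in% stop_words]
sort(table(word_vec), decreasing = TRUE)[1:10]

Either way, hand-maintaining a stop word list gets old quickly, which is where tidytext comes in.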
5.3 tidytext
So the tidytext package provides tools to convert your body of text into individual tokens, which then simplifies the removal of less meaningful words and the creation of word frequency counts. The first thing to do is create a data frame with one row for each body of text. Since we have only one long string of text, this will be a one-row data frame.
library(tidytext)
library(tidyr)
text_df <- data_frame(line = 1:length(lincoln_doc), text = lincoln_doc)

text_df
## # A tibble: 1 × 2
## line text
## <int> <chr>
## 1 1 "TranscriptFellow-Countrymen: At this second appearing to take the oath of the Presidential office there is less occasion f…
The next step is to break up each of the text lines (we have only one) into individual rows, one per word. We also want to count the number of times that each word appears. This is known as tokenizing the data frame.
token_text <- text_df %>%
  unnest_tokens(word, text)

# Let's now count them
token_text %>% count(word,sort=TRUE)
## # A tibble: 339 × 2
## word n
## <chr> <int>
## 1 the 58
## 2 to 27
## 3 and 24
## 4 of 22
## 5 it 13
## 6 that 12
## 7 war 12
## 8 all 10
## 9 for 9
## 10 in 9
## # … with 329 more rows
But we need to get rid of the “stop words.” It’s a good thing that the tidytext package has a way to filter out the common words that do not significantly contribute to the meaning of the overall text. The stop_words data frame is built into tidytext. Take a look to see some of the words contained therein:
data(stop_words)
# Sample 40 random stop words
stop_words %>% sample_n(40)
## # A tibble: 40 × 2
## word lexicon
## <chr> <chr>
## 1 do onix
## 2 where SMART
## 3 contains SMART
## 4 him snowball
## 5 seeming SMART
## 6 ended onix
## 7 following SMART
## 8 area onix
## 9 across onix
## 10 ordered onix
## # … with 30 more rows
# Now remove stop words from the document
tidy_text <- token_text %>%
  anti_join(stop_words)
## Joining, by = "word"
# This could also be done by the following. I point this out only because some people react
# negatively to "joins" although fully understanding what joins are can only help you since
# much of what the dplyr package does is based on SQL type joins.
tidy_text <- token_text %>%
  filter(!word %in% stop_words$word)
tidy_text %>% count(word,sort=TRUE)
## # A tibble: 193 × 2
## word n
## <chr> <int>
## 1 war 12
## 2 god 5
## 3 union 4
## 4 offenses 3
## 5 woe 3
## 6 address 2
## 7 ago 2
## 8 altogether 2
## 9 answered 2
## 10 cease 2
## # … with 183 more rows
Now let's plot the most frequent words:
library(ggplot2)
tidy_text %>%
  count(word, sort = TRUE) %>%
  filter(n > 2) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
5.4 Back To The PubMed Example
We have around 935 abstracts to play with, based on our earlier work with the easyPubMed package.
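If you skipped that chapter, my_abstracts is simply a character vector with one cleaned-up abstract per element. It was built with easyPubMed along these lines; this is only a sketch, and the query string, retmax value, and cleanup details are placeholders rather than the exact ones used earlier:

library(easyPubMed)
query   <- "home dialysis"                          # placeholder query
pmids   <- get_pubmed_ids(query)                    # ask PubMed which articles match
raw_xml <- fetch_pubmed_data(pmids, retmax = 1000)  # download the matching records
articles <- articles_to_list(raw_xml)               # one raw record per list element
# article_to_df() extracts fields (title, abstract, journal, ...) from a record;
# max_chars = -1 asks for the full abstract rather than a truncated one
abstract_df  <- do.call(rbind, lapply(articles, article_to_df,
                                      max_chars = -1, getAuthors = FALSE))
my_abstracts <- abstract_df$abstract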
# Create a data frame out of the cleaned up abstracts
library(tidytext)
library(dplyr)
text_df <- data_frame(line = 1:length(my_abstracts), text = my_abstracts)
token_text <- text_df %>%
  unnest_tokens(word, text)
# Many of these words aren't helpful
token_text %>% count(total=word,sort=TRUE)
## # A tibble: 6,936 × 2
## total n
## <chr> <int>
## 1 the 3062
## 2 of 2896
## 3 and 2871
## 4 in 1915
## 5 to 1884
## 6 a 1373
## 7 dialysis 1365
## 8 patients 1335
## 9 home 1281
## 10 with 1035
## # … with 6,926 more rows
# Now remove stop words
data(stop_words)
tidy_text <- token_text %>%
  anti_join(stop_words)
# This could also be done by the following. I point this out only because some people react
# negatively to "joins" although fully understanding what joins are can only help you since
# much of what the dplyr package does is based on SQL type joins.
tidy_text <- token_text %>%
  filter(!word %in% stop_words$word)
# Arrange the text by descending word frequency
tidy_text %>%
  count(word, sort = TRUE)
## # A tibble: 6,460 × 2
## word n
## <chr> <int>
## 1 dialysis 1365
## 2 patients 1335
## 3 home 1281
## 4 hemodialysis 674
## 5 hd 463
## 6 hhd 440
## 7 patient 395
## 8 pd 303
## 9 renal 279
## 10 study 268
## # … with 6,450 more rows
Some of the most frequently occurring words are, in fact, “dialysis” and “patients,” so maybe we should consider them to be stop words also, since we already know quite well that the overall theme is, well, dialysis and kidneys. There are also synonyms and abbreviations that are somewhat redundant, such as “pd,” “hd,” and “hhd,” so let’s eliminate them also.
tidy_text <- token_text %>%
  filter(!word %in% c(stop_words$word,"dialysis","patients","home","kidney",
                      "hemodialysis","haemodialysis","patient","hhd",
                      "pd","peritoneal","hd","renal","study","care",
                      "ci","chd","nhd","disease","treatment"))

tidy_text %>%
  count(word, sort = TRUE)
## # A tibble: 6,441 × 2
## word n
## <chr> <int>
## 1 therapy 193
## 2 conventional 191
## 3 survival 191
## 4 center 186
## 5 compared 180
## 6 clinical 175
## 7 nocturnal 171
## 8 outcomes 171
## 9 quality 171
## 10 data 161
## # … with 6,431 more rows
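As an aside, rather than maintaining a long vector inside filter(), you can also extend the built-in stop_words data frame with your own domain-specific terms and reuse anti_join(). A sketch that should give the same result as the filter() above:

# add the domain-specific terms to the standard stop word table
my_stop_words <- bind_rows(stop_words,
                           tibble(word = c("dialysis","patients","home","kidney",
                                           "hemodialysis","haemodialysis","patient","hhd",
                                           "pd","peritoneal","hd","renal","study","care",
                                           "ci","chd","nhd","disease","treatment"),
                                  lexicon = "custom"))
tidy_text <- token_text %>%
  anti_join(my_stop_words, by = "word")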
Let’s do some plotting of these words
library(ggplot2)
tidy_text %>%
  count(word, sort = TRUE) %>%
filter(n > 120) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Okay, it looks like there are numbers in there which probably aren’t all that useful on their own. I suspect that the “95” is associated with the idea of a confidence interval, but there are other references to numbers as well, so let’s drop them.
grep("^[0-9]{1,3}$",tidy_text$word)[1:20]
## [1] 9 273 275 284 288 293 296 305 308 387 388 554 614 671 679 680 682 744 758 762
tidy_text_nonum <- tidy_text[grep("^[0-9]{1,3}$",tidy_text$word,invert=TRUE),]
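The same thing can be written as a filter(), which reads a bit more naturally inside a dplyr pipeline; a sketch that should produce an identical result:

tidy_text_nonum <- tidy_text %>%
  filter(!grepl("^[0-9]{1,3}$", word))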
Okay, I think we now have some reasonable data to examine. As you might have realized by now, manipulating data to get it “clean” can be tedious and frustrating, though it is an inevitable part of the process.
tidy_text_nonum %>%
  count(word, sort = TRUE) %>%
filter(n > 120) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
5.4.1 How Do You Feel?
The next step is to explore what some of these words might mean. The tidytext package gives you access to four sentiment lexicons (“afinn,” “bing,” “nrc,” and “loughran”) that help you figure out what sentiment is being expressed in your text.
# NRC Emotion Lexicon from Saif Mohammad and Peter Turney
get_sentiments("nrc") %>% sample_n(20)
## # A tibble: 20 × 2
## word sentiment
## <chr> <chr>
## 1 disgrace negative
## 2 undying anticipation
## 3 warn fear
## 4 independence anticipation
## 5 judiciary anticipation
## 6 doll joy
## 7 glorify joy
## 8 alien disgust
## 9 purify joy
## 10 flop disgust
## 11 sanguine positive
## 12 beastly negative
## 13 blackness negative
## 14 wear trust
## 15 neutral trust
## 16 affluence positive
## 17 invade anger
## 18 alienation anger
## 19 deceit disgust
## 20 contemptible negative
# the sentiment lexicon from Bing Liu and collaborators
get_sentiments("bing") %>% sample_n(20)
## # A tibble: 20 × 2
## word sentiment
## <chr> <chr>
## 1 nurturing positive
## 2 saintliness positive
## 3 split negative
## 4 object negative
## 5 mistrust negative
## 6 insolvent negative
## 7 extravagant negative
## 8 thinner positive
## 9 bombastic negative
## 10 disapointment negative
## 11 immoderate negative
## 12 dungeons negative
## 13 revengefully negative
## 14 mournful negative
## 15 sagacity positive
## 16 upgraded positive
## 17 adequate positive
## 18 spendy negative
## 19 awfully negative
## 20 displaced negative
# Tim Loughran and Bill McDonald
get_sentiments("loughran") %>% sample_n(20)
## # A tibble: 20 × 2
## word sentiment
## <chr> <chr>
## 1 duress negative
## 2 opposes negative
## 3 resigns negative
## 4 limitation negative
## 5 claimant litigious
## 6 cancel negative
## 7 honored positive
## 8 incompatibility negative
## 9 premiere positive
## 10 variably uncertainty
## 11 reassessments negative
## 12 devolved negative
## 13 lingering negative
## 14 unpredicted negative
## 15 legalized litigious
## 16 obstructing negative
## 17 forfeited negative
## 18 foreclosure negative
## 19 prejudicing negative
## 20 gained positive
# Pull out words that correspond to joy
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
nrc_joy
## # A tibble: 689 × 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
## 7 accomplished joy
## 8 achieve joy
## 9 achievement joy
## 10 acrobat joy
## # … with 679 more rows
So we will use the nrc sentiment dictionary to see the “sentiment” expressed in our abstracts.
bing_word_counts <- tidy_text_nonum %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word,sentiment,sort=TRUE)
## Joining, by = "word"
Now let’s plot the positive vs negative words:
bing_word_counts %>%
  group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
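If you just want an overall tally rather than a faceted plot, you can also count the sentiment categories directly. A sketch using the “bing” lexicon, which only distinguishes positive from negative:

tidy_text_nonum %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)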
Let’s create a word cloud
library(wordcloud)
#
tidy_text_nonum %>%
  count(word) %>%
with(wordcloud(word,n,max.words=90,scale=c(4,.5),colors=brewer.pal(8,"Dark2")))
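A useful variation is comparison.cloud(), also from the wordcloud package, which contrasts positive and negative words in a single cloud. A sketch, assuming the reshape2 package is available for acast(), which reshapes the counts into the matrix that comparison.cloud() expects:

library(reshape2)
tidy_text_nonum %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%   # words as rows, sentiments as columns
  comparison.cloud(colors = c("firebrick", "steelblue"), max.words = 80)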
5.5 BiGrams
Let’s look at bigrams. We need to go back to the cleaned abstracts and pair up words to get phrases that might be suggestive of some sentiment.
text_df <- data_frame(line = 1:length(my_abstracts), text = my_abstracts)
dialysis_bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

dialysis_bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 41,738 × 2
## bigram n
## <chr> <int>
## 1 in the 382
## 2 of the 310
## 3 home dialysis 300
## 4 home hemodialysis 279
## 5 of home 195
## 6 peritoneal dialysis 193
## 7 associated with 174
## 8 home hd 153
## 9 home haemodialysis 144
## 10 in center 144
## # … with 41,728 more rows
But we have to filter out stop words
library(tidyr)
bigrams_sep <- dialysis_bigrams %>%
  separate(bigram,c("word1","word2"),sep=" ")

stop_list <- c(stop_words$word,"dialysis","patients","home","kidney",
               "hemodialysis","haemodialysis","treatment","patient","hhd",
               "pd","peritoneal","hd","renal","study","care",
               "ci","chd","nhd","esrd","lt","95","0.001")

bigrams_filtered <- bigrams_sep %>%
  filter(!word1 %in% stop_list) %>%
  filter(!word2 %in% stop_list)

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united %>% count(bigram, sort = TRUE) %>% print(n=25)
## # A tibble: 11,842 × 2
## bigram n
## <chr> <int>
## 1 replacement therapy 71
## 2 vascular access 65
## 3 technique failure 54
## 4 confidence interval 41
## 5 left ventricular 39
## 6 blood pressure 36
## 7 short daily 35
## 8 clinical outcomes 33
## 9 thrice weekly 30
## 10 technique survival 29
## 11 hazard ratio 26
## 12 quality improvement 26
## 13 adverse events 22
## 14 6 months 21
## 15 access related 21
## 16 arteriovenous fistula 21
## 17 12 months 19
## 18 ventricular mass 18
## 19 3 times 15
## 20 buttonhole cannulation 15
## 21 cost effective 15
## 22 observational studies 15
## 23 retrospective cohort 15
## 24 cost effectiveness 14
## 25 daily life 14
## # … with 11,817 more rows
library(tidyquant)
bigram_counts %>%
  filter(n > 30) %>%
ggplot(aes(x = reorder(word1, -n), y = reorder(word2, -n), fill = n)) +
geom_tile(alpha = 0.8, color = "white") +
scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
coord_flip() +
theme_tq() +
theme(legend.position = "right") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
labs(x = "first word in pair",
y = "second word in pair")