Text wrangling and analysis
Overview:
In this report, we explore and mine text extracted from a given book, ending with a sentiment analysis using the AFINN lexicon.
Data source: You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life by Jen Sincero - PDF Drive. http://www.pdfdrive.com/you-are-a-badass-how-to-stop-doubting-your-greatness-and-start-living-an-awesome-life-e60365112.html. Accessed 14 Mar. 2022.
Data Wrangling
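The extracted text does not include the setup chunk, so the packages below are an assumption based on the functions called throughout this report (pdf_text(), unnest_tokens(), geom_text_wordcloud(), etc.); a setup chunk along these lines would be needed for the code to run:

library(tidyverse)    # dplyr, tidyr, stringr, ggplot2, forcats
library(pdftools)     # pdf_text()
library(tidytext)     # unnest_tokens(), stop_words, get_sentiments()
library(here)         # here()
library(ggwordcloud)  # geom_text_wordcloud()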
Get Book
book_text <- pdf_text(here::here('data', 'You Are a Badass_ How to Stop Doubting Your Greatness and Start Living an Awesome Life ( PDFDrive ).pdf'))
# - Each row is a page of the PDF (i.e., this is a vector of strings, one for each page)
# - Only sees text that is "selectable"

book_text_lines <- data.frame(book_text) %>%
  mutate(page = 1:n()) %>%
  mutate(text_full = str_split(book_text, pattern = '\\n')) %>%
  unnest(text_full) %>%
  mutate(text_full = str_trim(text_full))
More tidying
Now, we’ll add a new column that contains the chapter number (so we can use it as a grouping variable later on). We will use str_detect() to look for any cells in the text_full column that contain the string "CHAPTER"; when a line does, the new column will record that chapter header, which we then parse down to just the chapter number:
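As a small illustration (the two strings below are made up, not taken from the book), str_detect() returns TRUE only for lines that contain the pattern:

str_detect(c("CHAPTER 1: SOME TITLE", "some ordinary body text"), pattern = "CHAPTER ")
# [1]  TRUE FALSE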
book_text_chapts <- book_text_lines %>%
  slice(-(1:65)) %>%   # remove the first 65 lines (front matter before the chapters begin)
  mutate(chapter = ifelse(str_detect(text_full, "CHAPTER "), text_full, NA)) %>%
  fill(chapter, .direction = 'down') %>%   # carry each chapter header down to the lines below it
  separate(col = chapter, into = c("ch", "no"), sep = " ") %>%
  separate(col = no, into = c("no"), sep = ":") %>%   # keep just the number if a colon follows it
  mutate(chapter = as.numeric(no))
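A quick sanity check (a sketch, assuming the pipeline above ran as intended) that the parsed chapter numbers span the expected range:

range(book_text_chapts$chapter, na.rm = TRUE)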
Word counts by Chapter!
book_words <- book_text_chapts %>%
  unnest_tokens(word, text_full) %>%
  select(-book_text)

book_wordcount <- book_words %>%
  count(chapter, word)
Remove stop words
Those very common (and often uninteresting) words are called "stop words." See ?stop_words and View(stop_words) to look at the documentation for the stop word lexicons (from the tidytext package). We will remove stop words using dplyr::anti_join(), which will omit any words in stop_words from the book.
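Before running it on the full token list, here is a toy illustration (the four words below are invented for the example) of how anti_join() keeps only rows whose word is not in stop_words:

demo_words <- data.frame(word = c("the", "greatness", "and", "doubting"))
demo_words %>%
  anti_join(stop_words, by = 'word')
# "the" and "and" are dropped; "greatness" and "doubting" are kept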
# head(stop_words)
book_words_clean <- book_words %>%
  anti_join(stop_words, by = 'word')

nonstop_counts <- book_words_clean %>%
  filter(!word %in% c("it’s", "you’re")) %>%
  count(chapter, word)
Top 5 words from each chapter
# top 5 words per chapter
top_5_words <- nonstop_counts %>%
  group_by(chapter) %>%
  arrange(-n) %>%
  slice(1:5) %>%
  ungroup()

top_5_sub <- top_5_words %>%
  filter(n > 20) %>%
  filter(!word %in% c("it’s", "you’re"))
# ggplot of the top 5 words in the book:
ggplot(data = top_5_sub, aes(x = n, y = word)) +
  geom_col(fill = "purple") +
  facet_wrap(~chapter, scales = "free") +
  theme(panel.background = element_blank(), panel.grid = element_blank()) +
  theme(plot.caption = element_text(hjust = 0, face = "bold.italic")) +
  theme(legend.position = "bottom") +
  labs(y = "Top Words",
       x = "Count",
       caption = "Figure 1. Top words repeated more than 20 times across the different chapters")
Figure 1. From this figure, we see that in chapter 20 the word "fear" is repeated more than 20 times, which could imply that chapter 20 is about fear; this can be checked further with a sentiment analysis. Similarly, in chapter 6 the word "love" is repeated over 20 times. Word use frequency is therefore somewhat revealing of the overall thematic connotation of these chapters.
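As a quick follow-up (a sketch, assuming nonstop_counts from above), we can pull the counts of "fear" and "love" directly to see which chapters use them most:

nonstop_counts %>%
  filter(word %in% c("fear", "love")) %>%
  arrange(-n) %>%
  slice(1:10)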
Word cloud
ch1_top100 <- nonstop_counts %>%
  filter(chapter == 1) %>%
  arrange(-n) %>%
  slice(1:100)
ch1_cloud <- ggplot(data = ch1_top100, aes(label = word)) +
  geom_text_wordcloud(aes(color = n, size = n), shape = "diamond") +
  scale_size_area(max_size = 6) +
  scale_color_gradientn(colors = c("darkgreen", "blue", "purple")) +
  labs(caption = "Figure 2. Word cloud of top 100 words from chapter 1") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(legend.position = "bottom")
ch1_cloud
Sentiment analysis
The sentiment lexicon we will be using is:
- AFINN from Finn Årup Nielsen
"The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon."
WARNING: These collections include the most offensive words you can think of.
“afinn”: Words ranked from -5 (very negative) to +5 (very positive)
get_sentiments(lexicon = "afinn")
# Let's look at the pretty positive words:
afinn_pos <- get_sentiments("afinn") %>%
  filter(value %in% c(3, 4, 5))
# Check them out:
afinn_pos
Sentiment analysis with afinn:
First, let’s bind the words in book_words_clean to the afinn lexicon:
book_afinn <- book_words_clean %>%
  inner_join(get_sentiments("afinn"), by = 'word')

afinn_counts <- book_afinn %>%
  count(chapter, value)
# Plot them:
# ggplot(data = afinn_counts, aes(x = value, y = n)) +
# geom_col() +
# facet_wrap(~chapter)
# Find the mean afinn score by chapter:
afinn_means <- book_afinn %>%
  group_by(chapter) %>%
  summarize(mean_afinn = mean(value))
ggplot(data = afinn_means,
aes(x = fct_rev(factor(chapter)),
y = mean_afinn)) +
geom_col(fill = "cyan") +
coord_flip() +
theme(strip.background = element_blank()) +
theme(panel.background = element_blank(),
panel.grid = element_blank()) +
theme(plot.caption = element_text(hjust = 0, face = "bold.italic") , axis.title.y = element_blank(), axis.title.x = element_blank())+
labs(title = "Sentiment analysis by chapter",
     caption = "Figure 3. Mean AFINN sentiment score by chapter of the book, using the afinn lexicon.")
Figure 3. From this figure, the impressions drawn from the word counts are borne out: chapter 20 has a negative mean sentiment score, and its most frequent word was "fear"; similarly, chapter 6 has a positive mean sentiment score, and its most frequent word is "love."
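To make that reading concrete (again a sketch, assuming afinn_means from above), the chapters with the most negative and most positive mean AFINN scores can be pulled directly:

afinn_means %>%
  arrange(mean_afinn) %>%
  slice(1:3)    # chapters with the most negative mean score

afinn_means %>%
  arrange(-mean_afinn) %>%
  slice(1:3)    # chapters with the most positive mean score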