Text wrangling and analysis

Overview:

In this report, we mine and explore the text extracted from a book, ending with a chapter-by-chapter sentiment analysis using the AFINN lexicon.

Data source: You Are a Badass: How to Stop Doubting Your Greatness and Start Living an Awesome Life by Jen Sincero - PDF Drive. http://www.pdfdrive.com/you-are-a-badass-how-to-stop-doubting-your-greatness-and-start-living-an-awesome-life-e60365112.html. Accessed 14 Mar. 2022.
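
Setup: the code in this report assumes the following packages are attached. The list below is inferred from the functions used throughout (pdf_text(), unnest_tokens(), geom_text_wordcloud(), and so on), so treat it as a sketch of the likely setup chunk rather than a verbatim copy of it.

library(tidyverse)    # dplyr, tidyr, stringr, ggplot2, forcats
library(tidytext)     # unnest_tokens(), stop_words, get_sentiments()
library(pdftools)     # pdf_text()
library(ggwordcloud)  # geom_text_wordcloud()
library(here)         # here() for project-relative file paths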

Data Wrangling

Get Book

book_text <- pdf_text(here::here('data', 'You Are a Badass_ How to Stop Doubting Your Greatness and Start Living an Awesome Life ( PDFDrive ).pdf'))
# - Each row is a page of the PDF (i.e., this is a vector of strings, one for each page)
# - Only sees text that is "selectable"
book_text_lines <- data.frame(book_text) %>% 
  mutate(page = 1:n()) %>%
  mutate(text_full = str_split(book_text, pattern = '\\n')) %>% 
  unnest(text_full) %>% 
  mutate(text_full = str_trim(text_full)) 

More tidying

Now, we’ll add a new column that contains the chapter number (so we can use it as a grouping variable later on).

We will use str_detect() to look for any cells in the “text_full” column that contain the string “CHAPTER”; where a line does, the new column will hold that chapter heading. fill() then carries each heading down to the lines that follow it, and we split out the chapter number:

book_text_chapts <- book_text_lines %>% 
  slice(-(1:65)) %>%   # drop the first 65 lines (text before the chapters start)
  mutate(chapter = ifelse(str_detect(text_full, "CHAPTER "), text_full, NA)) %>%   # flag chapter heading lines
  fill(chapter, .direction = 'down') %>%   # carry each heading down to the lines below it
  separate(col = chapter, into = c("ch", "no"), sep = " ") %>%   # e.g. "CHAPTER 3: ..." -> "CHAPTER" and "3:"
  separate(col = no, into = c("no"), sep = ":") %>%   # keep only the number before the colon
  mutate(chapter = as.numeric(no))
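
To see how the tagging behaves, here is a toy sketch (the example lines are made up and are not from the book): str_detect() flags only the heading lines, and fill() then carries each heading down to every line that follows it until the next heading.

# Toy illustration only -- not part of the pipeline above
demo <- tibble(text_full = c("CHAPTER 1: A MADE-UP TITLE", "first line of text",
                             "second line of text", "CHAPTER 2: ANOTHER TITLE",
                             "more text")) %>% 
  mutate(chapter = ifelse(str_detect(text_full, "CHAPTER "), text_full, NA)) %>% 
  fill(chapter, .direction = 'down')
# Every row now carries its chapter heading, ready for separate()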

Word counts by Chapter!

book_words <- book_text_chapts  %>% 
  unnest_tokens(word, text_full) %>%   # one row per word
  select(-book_text)   # drop the raw page-text column
book_wordcount <- book_words %>% 
  count(chapter, word)   # word counts within each chapter

Remove stop words

Those very common (and often uninteresting) words are called “stop words.” See ?stop_words and View(stop_words) to look at the documentation for the stop word lexicons (from the tidytext package).

We will remove stop words using dplyr::anti_join(), which will omit any words in stop_words from the book.

# head(stop_words)

book_words_clean <- book_words %>% 
  anti_join(stop_words, by = 'word')   # drop any word that appears in stop_words
nonstop_counts <- book_words_clean %>% 
  filter(!word %in% c("it’s", "you’re")) %>%   # drop curly-apostrophe contractions that slip past stop_words
  count(chapter, word)
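
As a quick illustration of what anti_join() is doing here (a toy sketch, separate from the pipeline): any row whose word appears in stop_words is dropped, and everything else is kept.

# Toy illustration only
tibble(word = c("the", "and", "greatness", "doubting")) %>% 
  anti_join(stop_words, by = 'word')
# Only the non-stop-words ("greatness", "doubting") should remain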

Top 5 words from each chapter

# top 5 words per chapter
top_5_words <- nonstop_counts %>%
    group_by(chapter) %>%
    arrange(-n) %>%
    slice(1:5) %>%
    ungroup()

top_5_sub <- top_5_words %>%
    filter(n > 20) %>%
    filter(!word %in% c("it’s", "you’re"))

# GGplot of the top 5 words in the book:
ggplot(data = top_5_sub, aes(x = n, y = word)) + geom_col(fill = "purple") + facet_wrap(~chapter,
    scales = "free") + theme(panel.background = element_blank(), panel.grid = element_blank()) +
    theme(plot.caption = element_text(hjust = 0, face = "bold.italic")) + labs(y = "Top Words",
    x = "Count", caption = "Figure 1.Top words w/ repeating more than 20 times across the different chapters ") +
    theme(plot.caption = element_text(hjust = 0, face = "bold.italic")) + theme(legend.position = "bottom")

Figure 1. From this figure, we see that in chapter 20 the word “fear” is repeated more than 20 times, which could imply that chapter 20 is about fear; this can be checked further with a sentiment analysis. Similarly, in chapter 6 the word “love” is repeated over 20 times. Word-use frequency therefore appears to be somewhat revealing of the overall thematic connotation of these chapters.
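
As a quick cross-check of that reading before moving to the sentiment analysis (a sketch using the nonstop_counts object defined above), we can pull the chapter-level counts for a single word of interest:

# How often does "fear" appear in each chapter? (sketch)
nonstop_counts %>% 
  filter(word == "fear") %>% 
  arrange(-n)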

Word cloud

ch1_top100 <- nonstop_counts %>% 
  filter(chapter == 1) %>% 
  arrange(-n) %>% 
  slice(1:100)
ch1_cloud <- ggplot(data = ch1_top100, aes(label = word)) +
  geom_text_wordcloud(aes(color = n, size = n), shape = "diamond") +
  scale_size_area(max_size = 6) +
  scale_color_gradientn(colors = c("darkgreen","blue","purple")) +
  labs(caption = "Figure 2. Word cloud of top 100 words from chapter 1") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0))+
  theme(legend.position = "bottom")

ch1_cloud

Sentiment analysis

The sentiment lexicon we will be using is:

  • AFINN from Finn Årup Nielsen

The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. All of this information is tabulated in the sentiments dataset, and tidytext provides the function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.

WARNING: These collections include the most offensive words you can think of.

“afinn”: Words ranked from -5 (very negative) to +5 (very positive)


get_sentiments(lexicon = "afinn")

# Let's look at the pretty positive words:
afinn_pos <- get_sentiments("afinn") %>% 
  filter(value %in% c(3,4,5))

# Check them out:
afinn_pos
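
To make the scoring concrete before applying it to the book, here is a toy sketch: tokenize a short sentence (adapted from the book’s subtitle), join it to the AFINN table, and average the scores. Words that are not in the lexicon are simply dropped by the inner join.

# Toy illustration only -- not text from the book itself
tibble(text = "stop doubting yourself and start living an awesome life") %>% 
  unnest_tokens(word, text) %>% 
  inner_join(get_sentiments("afinn"), by = 'word') %>% 
  summarize(mean_afinn = mean(value))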

Sentiment analysis with afinn:

First, let’s bind the words in book_words_clean to the AFINN lexicon:

book_afinn <- book_words_clean %>% 
  inner_join(get_sentiments("afinn"), by = 'word')
afinn_counts <- book_afinn %>% 
  count(chapter, value)

# Plot them: 
# ggplot(data = afinn_counts, aes(x = value, y = n)) +
#   geom_col() +
#   facet_wrap(~chapter)

# Find the mean afinn score by chapter: 
afinn_means <- book_afinn %>% 
  group_by(chapter) %>% 
  summarize(mean_afinn = mean(value))

ggplot(data = afinn_means, 
       aes(x = fct_rev(factor(chapter)),
           y = mean_afinn)) +
  geom_col(fill = "cyan") +
  coord_flip() +
  theme(strip.background = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_blank(),
        plot.caption = element_text(hjust = 0, face = "bold.italic"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank()) +
  labs(title = "Sentiment Analysis by chapter", 
       caption = "Figure 3. Mean AFINN sentiment score for each chapter of the book, using the afinn lexicon.")

Figure 3. This figure supports the impressions drawn from the word counts: chapter 20 leans toward negative sentiment, and its most frequent word was “fear”; similarly, chapter 6 leans toward positive sentiment, and its most frequent word was “love.”
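
That reading can also be checked directly from the afinn_means table (a sketch using dplyr’s slice_min()/slice_max(); the specific chapters returned depend on the computed scores):

# Chapters with the lowest and highest mean AFINN scores (sketch)
afinn_means %>% slice_min(mean_afinn, n = 3)
afinn_means %>% slice_max(mean_afinn, n = 3)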