Simpsons Guest Stars
Notable topics: Text mining (tidytext package)
Recorded on: 2019-08-29
Timestamps by: Alex Cookson
Screencast
Timestamps
Using str_detect function to find guests that played themselves
Using separate_rows function and regex to get delimited values onto different rows (e.g., "Edna Krabappel; Ms. Melon" gets split into two rows)
Using parse_number function to convert a numeric variable coded as character to a proper numeric variable
Downloading and importing supplementary dataset of dialogue
Using semi_join function to filter dataframe based on values that appear in another dataframe
Using anti_join function to check which values in a dataframe do not appear in another dataframe
Using ifelse function to recode a single value with another (i.e., "Edna Krabappel" becomes "Edna Krabappel-Flanders")
Explaining the goal of all the data cleaning steps
Using sample function to get an example line for each character
Setting geom_histogram function's binwidth and center arguments to get specific bin sizes
Using unnest_tokens and anti_join functions from tidytext package to split dialogue into individual words and remove stop words (e.g., "the", "or", "and")
Using bind_tf_idf function from tidytext package to get the TF-IDF (term frequency-inverse document frequency) of individual words
Using top_n function to get the top 1 TF-IDF value for each role
Using paste0 function to combine two character variables (e.g., the separate values "Groundskeeper Willie" and "ach" become "Groundskeeper Willie: ach")
Explanation of what TF-IDF (term frequency-inverse document frequency) tells us and how it is a "catchphrase detector"
Summary of screencast
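Example sketch

The cleaning and TF-IDF steps listed above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the code from the screencast: the data frames `guests` and `dialogue`, their column names, and every value in them are hypothetical stand-ins for the real guest-star and dialogue datasets. The plotting steps (sample, geom_histogram) are left out.

```r
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tidytext)

# Hypothetical toy data; the real datasets have more columns and rows
guests <- tibble(
  guest_star = c("Guest A", "Guest B"),
  role       = c("Edna Krabappel; Ms. Melon", "Himself"),
  number     = c("12", "7")  # numeric values stored as character
)

dialogue <- tibble(
  role = c("Edna Krabappel-Flanders", "Ms. Melon", "Groundskeeper Willie"),
  line = c("Ha! Detention for everyone, Bart.",
           "Welcome to the gifted classroom.",
           "Ach, the wee bairns!")
)

guests_clean <- guests %>%
  # One row per role: "Edna Krabappel; Ms. Melon" becomes two rows
  separate_rows(role, sep = ";\\s*") %>%
  mutate(
    # Flag guests who played themselves ("Himself", "Herself", ...)
    self   = str_detect(role, "self"),
    # Convert the character column to a proper numeric column
    number = parse_number(number),
    # Recode one value so it matches the spelling used in the dialogue data
    role   = ifelse(role == "Edna Krabappel", "Edna Krabappel-Flanders", role)
  )

# Roles in the guest data that have no dialogue at all
missing_roles <- anti_join(guests_clean, dialogue, by = "role")

# Keep only dialogue spoken by roles that appear in the guest data
dialogue_guests <- semi_join(dialogue, guests_clean, by = "role")

word_tf_idf <- dialogue_guests %>%
  unnest_tokens(word, line) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word") %>% # drop "the", "or", "and", ...
  count(role, word) %>%
  bind_tf_idf(word, role, n)             # adds tf, idf and tf_idf columns

# Highest-scoring word(s) per role, labelled "role: word"
top_words <- word_tf_idf %>%
  group_by(role) %>%
  top_n(1, tf_idf) %>%
  ungroup() %>%
  mutate(label = paste0(role, ": ", word))
```

Here each role is treated as one "document", so a word scores highly when a character says it often and few other characters say it at all. That is the sense in which TF-IDF behaves like a catchphrase detector. Note that top_n keeps ties, so with data this small a role can end up with more than one row.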