The Office
Notable topics: Text mining (tidytext package), LASSO regression (glmnet package)
Recorded on: 2020-03-15
Timestamps by: Alex Cookson
Screencast
Timestamps
Overview of transcripts data
Overview of ratings data
Using fct_inorder function to create a factor with levels based on when they appear in the dataframe
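A minimal sketch of fct_inorder(), on a made-up title column (the column name and episode titles here are just placeholders):

```r
# fct_inorder() sets factor levels in the order values first appear,
# so episodes plot in airing order rather than alphabetically.
library(tibble)
library(forcats)

episodes <- tibble(title = c("Pilot", "Diversity Day", "Health Care"))

levels(fct_inorder(episodes$title))
#> [1] "Pilot"         "Diversity Day" "Health Care"
```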
Using theme and element_text to turn axis labels 90 degrees
Creating a line graph with points at each observation (using geom_line and geom_point)
Adding text labels to very high and very low-rated episodes
Using theme function's panel.grid.major argument to get rid of some extraneous gridlines, using element_blank function
Using geom_text_repel from ggrepel package to experiment with different labelling (before abandoning this approach)
Using row_number function to add episode_number field to make graphing easier
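A sketch pulling together the plotting steps above on made-up ratings data; the column names (title, imdb_rating, episode_number) and values are assumptions, and the commented-out geom_text_repel() line is the ggrepel variant that was tried before being dropped:

```r
library(tibble)
library(dplyr)
library(forcats)
library(ggplot2)
# library(ggrepel)   # only needed for the geom_text_repel() variant below

ratings <- tibble(
  title = c("Pilot", "Diversity Day", "Health Care",
            "The Alliance", "Basketball", "Hot Girl"),
  imdb_rating = c(7.5, 8.3, 7.8, 8.1, 8.4, 7.7)
) %>%
  mutate(
    episode_number = row_number(),  # sequential index, handy for graphing
    title = fct_inorder(title)      # keep airing order on the x-axis
  )

# Only the very high and very low-rated episodes get text labels
extremes <- ratings %>%
  filter(imdb_rating %in% range(imdb_rating))

ggplot(ratings, aes(title, imdb_rating, group = 1)) +
  geom_line() +                                  # line through every episode
  geom_point() +                                 # point at each observation
  geom_text(data = extremes, aes(label = title),
            vjust = -0.5, size = 3) +
  # geom_text_repel(data = extremes, aes(label = title)) +  # ggrepel variant
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1),  # turn labels 90 degrees
    panel.grid.major.x = element_blank()                # drop extra gridlines
  )
```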
Explanation of why number of ratings (votes) is relevant to interpreting the graph
Using unnest_tokens function from tidytext package to split full-sentence text field to individual words
Using anti_join function to filter out stop words (e.g., and, or, the)
Using str_remove_all function to get rid of quotation marks from character names (quirks that might pop up when parsing)
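A sketch of the tokenizing and cleaning steps above on a two-line toy transcript; the character and text columns and the stray quotation marks are illustrative assumptions:

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidytext)

transcripts <- tribble(
  ~character,  ~text,
  '"Michael"', "I am not superstitious, but I am a little stitious.",
  "Dwight",    "Identity theft is not a joke, Jim!"
)

transcript_words <- transcripts %>%
  mutate(character = str_remove_all(character, '"')) %>%  # strip stray quotes
  unnest_tokens(word, text) %>%                           # one row per word
  anti_join(stop_words, by = "word")                      # drop and/or/the, etc.

transcript_words
```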
Asking, "Are there words that are specific to certain characters?" (using bind_tf_idf function)
Using reorder_within function to re-order factors within a grouping (when a term appears in multiple groups) and scale_x_reordered function to graph
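A sketch of the tf-idf question above, using made-up per-character word counts (characters, words, and counts are placeholders):

```r
library(tibble)
library(dplyr)
library(ggplot2)
library(tidytext)

word_counts <- tribble(
  ~character, ~word,      ~n,
  "Michael",  "scranton", 40,
  "Michael",  "party",    55,
  "Michael",  "paper",    35,
  "Dwight",   "beets",    30,
  "Dwight",   "party",    20,
  "Angela",   "cats",     25,
  "Angela",   "party",    15
)

word_counts %>%
  bind_tf_idf(word, character, n) %>%        # term, document, count
  group_by(character) %>%
  slice_max(tf_idf, n = 2) %>%               # top terms per character
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, character)) %>%
  ggplot(aes(word, tf_idf)) +
  geom_col() +
  scale_x_reordered() +                      # strip the ___character suffix
  coord_flip() +
  facet_wrap(~ character, scales = "free")
```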
Asking, "What effects the popularity of an episode?"
Dealing with inconsistent episode names between datasets
Using str_remove function and some regex to remove "(Parts 1&2)" from some episode names
Using str_to_lower function to further align episode names (addresses inconsistent capitalization)
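A sketch of the episode-name cleanup; the exact regex and the example titles below are assumptions, since the annotations only describe removing a "(Parts 1&2)" suffix and lowercasing:

```r
library(stringr)

episode_names <- c("Goodbye, Michael (Parts 1&2)", "The Dundies", "E-Mail Surveillance")

cleaned <- str_to_lower(                                # normalize capitalization
  str_remove(episode_names, "\\s*\\(Parts 1&2\\)")      # drop the "(Parts 1&2)" suffix
)
cleaned
#> [1] "goodbye, michael"    "the dundies"         "e-mail surveillance"
```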
Setting up dataframe of features for a LASSO regression, with each director and writer becoming a feature with its own row
Using separate_rows function to separate episodes with multiple writers so that each has their own row
Using log2 function to transform the number-of-lines field into something more usable (since it is log-normally distributed)
Using cast_sparse function from tidytext package to create a sparse matrix of features by episode
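A sketch pulling together the feature-building steps above on toy data: separate_rows() splits multi-writer episodes, log2() rescales character line counts, and cast_sparse() turns the long feature table into an episode-by-feature sparse matrix. The column names, the ";" separator, and the values are assumptions:

```r
library(tibble)
library(dplyr)
library(tidyr)
library(tidytext)

episode_info <- tribble(
  ~episode,        ~director,    ~writer,
  "Pilot",         "Ken Kwapis", "Ricky Gervais;Stephen Merchant;Greg Daniels",
  "Diversity Day", "Ken Kwapis", "B.J. Novak"
)

line_counts <- tribble(
  ~episode,        ~character, ~n_lines,
  "Pilot",         "Michael",  80,
  "Pilot",         "Dwight",   30,
  "Diversity Day", "Michael",  95
)

features <- bind_rows(
  # one row per episode-director pair
  episode_info %>%
    transmute(episode, feature = paste0("director_", director), value = 1),
  # separate_rows() gives multi-writer episodes one row per writer
  episode_info %>%
    separate_rows(writer, sep = ";") %>%
    transmute(episode, feature = paste0("writer_", writer), value = 1),
  # log2() tames the long right tail of per-character line counts
  line_counts %>%
    transmute(episode, feature = character, value = log2(n_lines))
)

# Sparse episode-by-feature matrix: one row per episode, one column per feature
episode_feature_matrix <- cast_sparse(features, episode, feature, value)

dim(episode_feature_matrix)
#> [1] 2 7
```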
Using semi_join function as a "filtering join"
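semi_join() keeps only the rows of the left table that have a match in the right table, without adding any columns; dplyr's built-in band_members and band_instruments tables show the idea:

```r
library(dplyr)

# Unlike inner_join(), no columns from the right table are added;
# rows without a match are simply filtered out
band_members %>%
  semi_join(band_instruments, by = "name")
#> keeps John and Paul (both Beatles); Mick has no instrument row and is dropped
```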
Setting up dataframes (after we have our features) to run LASSO regression
Using cv.glmnet function from glmnet package to run a cross-validated LASSO regression
Explanation of how to pick a lambda penalty parameter
Explanation of output of LASSO model
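A sketch of the modeling steps above on simulated data: fit a cross-validated LASSO with cv.glmnet(), look at the error curve to pick a penalty (lambda.min or the more conservative lambda.1se), and read off the coefficients that survive the shrinkage. The feature matrix and ratings here are simulated stand-ins for the real episode features and IMDb ratings:

```r
library(glmnet)
library(Matrix)

set.seed(2020)

# Simulated stand-in for the episode-by-feature sparse matrix and IMDb ratings
n_episodes <- 100
n_features <- 20
x <- Matrix(rbinom(n_episodes * n_features, 1, 0.3),
            nrow = n_episodes, sparse = TRUE)
colnames(x) <- paste0("feature_", seq_len(n_features))
y <- 7 + 0.5 * x[, 1] - 0.4 * x[, 2] + rnorm(n_episodes, sd = 0.3)

# Cross-validated LASSO (glmnet's default alpha = 1 is the pure L1 penalty)
cv_fit <- cv.glmnet(x, y)

plot(cv_fit)         # cross-validated error across the lambda path
cv_fit$lambda.min    # penalty with the lowest cross-validated error
cv_fit$lambda.1se    # a more conservative (more regularized) choice

# Coefficients at the chosen penalty; most are shrunk exactly to zero
coef(cv_fit, s = "lambda.1se")
```

With many sparse, mostly-binary features (writers, directors, characters), the L1 penalty pushes most coefficients to exactly zero, which is what keeps the output readable as "which features move an episode's rating".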
Outline of why David likes regularized linear models (which is what LASSO is)
Summary of screencast