TV Golden Age
Notable topics: Data manipulation, Logistic regression
Recorded on: 2019-01-08
Timestamps by: Alex Cookson
Screencast
Timestamps
Quick tip on how to start exploring a new dataset
Investigating an inconsistency: some shows have a count of seasons that differs from the number of seasons given in the data
Using %in% operator and all function to only get shows that have a first season and don't have skipped seasons in the data
Asking, "Which seasons have the most variation in ratings?"
Using facet_wrap function to separate different shows on a line graph into multiple small graphs
Writing a custom inline function so that the breaks on the x-axis always fall on even numbers (e.g., seasons 2, 4, 6, etc.)
Committing, finding, and explaining a common error of using the same variable name when summarizing multiple things
Using integer division operator %/% to bin data into two-year bins instead of annual (e.g., 1990 and 1991 get binned to 1990)
Using subsetting (with square brackets) within the mutate function to calculate mean on only a subset of data (without needing to filter)
Using gather function (now pivot_longer) to get metrics as columns into tidy format, in order to graph them all at once with a facet_wrap
Using pmin function to lump all seasons after 4 into one row (it still shows "4", but it represents "4+")
Asking, "If season 1 is good, do you get a second season?" (show survival)
Using paste0 and spread functions to get season 1-3 ratings into three columns, one for each season
Using distinct function with .keep_all argument to remove duplicates by keeping only the first one that appears
Using logistic regression to answer, "Does season 1 rating affect the probability of getting a second season?" (note he forgets to specify the family argument, fixed at 57:25)
Using ntile function to divide data into N bins (5 in this case), then eventually using cut function instead
Adding year as an independent variable to the logistic regression model
Adding an interaction term (season 1 interacting with year) to the logistic regression model
Using augment function as a method of visualizing and interpreting coefficients of regression model
Using crossing function to create new data to test the logistic regression model on and interpret model coefficients
Fitting natural splines using the splines package, which would capture a non-linear relationship
Summary of screencast
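Several of the data manipulation steps listed above (%in% with all, %/% binning, pmin lumping, and square-bracket subsetting) can be sketched in a few lines. This is a minimal illustration rather than the code from the screencast: it uses base R instead of the dplyr verbs, and the data frame, its column names, and the helper is_complete are all hypothetical.

```r
# Hypothetical toy data: one row per show-season (values invented for illustration).
ratings <- data.frame(
  show   = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
  season = c(1, 2, 3, 2, 3, 1, 2, 3, 4, 5),
  year   = c(1990, 1991, 1992, 1995, 1996, 2000, 2001, 2002, 2003, 2004),
  rating = c(8.0, 8.2, 7.9, 7.0, 7.1, 8.5, 8.7, 8.6, 8.4, 8.1)
)

# %in% and all(): TRUE when a show has a season 1 and no gaps up to its last season.
is_complete <- function(seasons) {
  1 %in% seasons && all(seq_len(max(seasons)) %in% seasons)
}
complete_by_show <- vapply(split(ratings$season, ratings$show), is_complete, logical(1))
complete <- ratings[ratings$show %in% names(complete_by_show)[complete_by_show], ]

# %/%: integer division puts years into two-year bins (1990 and 1991 both become 1990).
complete$year_bin <- 2 * (complete$year %/% 2)

# pmin(): cap the season number at 4, so the value 4 stands for "4+".
complete$season_lumped <- pmin(complete$season, 4)

# Square-bracket subsetting: mean of season 1 ratings only, without filtering the data.
season1_mean <- mean(complete$rating[complete$season == 1])
```

Here show B is dropped because it has no season 1, while shows A and C are kept. In a dplyr pipeline the same expressions would typically sit inside group_by, filter, and mutate calls.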
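The logistic regression item above notes that the family argument was initially forgotten. A minimal sketch of why that matters, using base R's glm on hypothetical data (the data frame, column names, and values are invented for illustration and are not from the screencast):

```r
# Hypothetical data: season 1 rating and whether the show got a second season (1 = yes).
shows <- data.frame(
  season1_rating = c(6.5, 7.0, 7.2, 7.8, 8.0, 8.3, 8.6, 9.0),
  renewed        = c(0, 0, 1, 0, 1, 1, 0, 1)
)

# Without family, glm() defaults to gaussian, which is ordinary linear regression.
linear_fit <- glm(renewed ~ season1_rating, data = shows)

# With family = binomial, glm() fits a logistic regression.
logistic_fit <- glm(renewed ~ season1_rating, data = shows, family = binomial)

# Predicted probabilities of a second season, on the response (0 to 1) scale.
predicted <- predict(logistic_fit, type = "response")
```

Calling family(linear_fit) and family(logistic_fit) shows which model was actually fitted, which is a quick way to catch the omission.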