Friends
Notable topics: Data Manipulation, Linear Modeling, Pairwise Correlation, Text Mining
Recorded on: 2020-09-07
Timestamps by: Eric Fletcher
Screencast
Timestamps
Use the dplyr package's count function to count the unique values of multiple variables.
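A minimal sketch of counting over more than one variable. The toy table and its values are hypothetical stand-ins for the dialogue data, not rows from the screencast.

```r
library(dplyr)

# Hypothetical toy data standing in for the dialogue table
dialogue <- tibble(
  season  = c(1, 1, 1, 2, 2),
  speaker = c("Ross", "Ross", "Monica", "Ross", "Monica")
)

# One row per unique season/speaker combination, with the count in column n;
# sort = TRUE puts the largest counts first
speaker_counts <- dialogue %>%
  count(season, speaker, sort = TRUE)

speaker_counts
```

The resulting n column is what a later geom_col or fct_reorder step would typically use.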
Use geom_col to show how many lines of dialogue there are for each character. Use fct_reorder to reorder the speaker factor levels by sorting along n.
Use semi_join to join the friends dataset with main_cast with by = "speaker", returning all rows from friends with a match in main_cast.
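As an illustration, semi_join acts as a filter: it keeps rows of the left table that have a match in the right table and adds no columns. The two toy tables below are hypothetical.

```r
library(dplyr)

# Hypothetical toy versions of the two tables
friends_toy <- tibble(
  speaker = c("Ross Geller", "Gunther", "Monica Geller", "Ross Geller"),
  text    = c("Hi.", "Coffee?", "Welcome!", "Hello again.")
)
main_cast_toy <- tibble(speaker = c("Ross Geller", "Monica Geller"))

# Keep only dialogue rows whose speaker appears in main_cast_toy;
# the columns of friends_toy are left unchanged
main_lines <- friends_toy %>%
  semi_join(main_cast_toy, by = "speaker")

main_lines
```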
Use unite to create the episode_number variable, which pastes together season and episode with sep = ".". Then, use inner_join to combine the above dataset with friends_info with by = c("season", "episode"). Then, use mutate and the glue package instead to combine { season }.{ episode } { title }. Then use fct_reorder(episode_title, season + .001 * episode) to order it by season first, then episode.
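The steps above can be sketched on toy data as follows. The tables, titles, and counts are hypothetical, and remove = FALSE in unite is an assumption made here so that season and episode stay available for the join.

```r
library(dplyr)
library(tidyr)
library(glue)
library(forcats)

# Hypothetical per-episode line counts and episode metadata
episode_lines <- tibble(
  season  = c(1, 1, 2),
  episode = c(2, 10, 1),
  n       = c(40, 25, 30)
)
info_toy <- tibble(
  season  = c(1, 1, 2),
  episode = c(2, 10, 1),
  title   = c("B", "C", "A")
)

# unite() pastes season and episode into one string column
with_number <- episode_lines %>%
  unite(episode_number, season, episode, sep = ".", remove = FALSE)

# The glue-based alternative, followed by ordering the factor
# by season first and then by episode
with_title <- episode_lines %>%
  inner_join(info_toy, by = c("season", "episode")) %>%
  mutate(
    episode_title = glue("{ season }.{ episode } { title }"),
    episode_title = fct_reorder(episode_title, season + .001 * episode)
  )

levels(with_title$episode_title)
```

Ordering by season + .001 * episode avoids the alphabetical sort, which would place "1.10 C" before "1.2 B".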
Use geom_point to visualize episode_title and us_views_millions.
Use as.integer to change episode_title to integer class.
Add labels to geom_point using geom_text with check_overlap = TRUE so text that overlaps previous text in the same layer will not be plotted.
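A rough sketch of the labelled scatter plot. The viewing figures and titles are made up for illustration, and the vjust and hjust values are arbitrary choices rather than settings taken from the screencast.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical episodes; the view counts are not real data
episodes <- tibble(
  season            = c(1, 1, 2),
  episode           = c(2, 10, 1),
  episode_title     = c("1.2 B", "1.10 C", "2.1 A"),
  us_views_millions = c(20.2, 19.9, 32.1)
) %>%
  mutate(episode_title = fct_reorder(episode_title, season + .001 * episode))

# as.integer() on a factor returns the level index, giving a numeric x axis;
# check_overlap = TRUE skips labels that would overlap earlier ones
p <- episodes %>%
  ggplot(aes(as.integer(episode_title), us_views_millions)) +
  geom_point() +
  geom_text(aes(label = episode_title),
            check_overlap = TRUE, vjust = 1, hjust = 1)
```

Printing p draws the plot. Swapping us_views_millions for another numeric column reuses the same layers.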
Run the above plot again using imdb_rating instead of us_views_millions.
Ahead of modeling: Use geom_boxplot to visualize the distribution of speaking for main characters.
Use the complete function with fill = list(n = 0) to add rows for combinations that are missing from the data set, setting n to 0 for them instead of NA.
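For example, on a hypothetical table of per-episode line counts, complete adds the speaker/episode pair that never occurred and gives it a count of zero. The choice of episode and speaker as the completed columns is an assumption for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts: Monica has no row for episode 2
counts <- tibble(
  episode = c(1, 1, 2),
  speaker = c("Ross", "Monica", "Ross"),
  n       = c(10, 8, 12)
)

# Expand to every episode/speaker combination; new rows get n = 0
completed <- counts %>%
  complete(episode, speaker, fill = list(n = 0))

completed
```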
Demonstration of how to account for missing imdb_rating values using the fill function with .direction = "downup" to keep the imdb rating across the same title.
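A small sketch of the fill step. Grouping by title is an assumption made here so that ratings are only carried within the same title, and the ratings themselves are invented.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows where some ratings are missing
ratings <- tibble(
  title       = c("A", "A", "B", "B"),
  imdb_rating = c(NA, 8.5, 9.0, NA)
)

# "downup" fills downwards first, then upwards for any leading NAs
filled <- ratings %>%
  group_by(title) %>%
  fill(imdb_rating, .direction = "downup") %>%
  ungroup()

filled
```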
Ahead of modeling: Use summarize with cor(log2(n), imdb_rating) to find the correlation between each speaker's number of lines and imdb rating. The fact that the correlation is positive for all speakers gives David a suspicion that some episodes are longer than others because they're in 2 parts, with higher ratings due to important moments. David addresses this confounding factor by including percentage of lines instead of number of lines.
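The comparison between raw counts and share of lines can be sketched as below. All numbers are invented, and pct_lines is a hypothetical column name for each speaker's share of an episode's lines.

```r
library(dplyr)

# Hypothetical data: later episodes are longer and rated higher
speaker_lines <- tibble(
  episode     = rep(1:4, each = 2),
  speaker     = rep(c("Ross", "Monica"), times = 4),
  n           = c(10, 30, 30, 10, 40, 40, 100, 60),
  imdb_rating = rep(c(8.0, 8.2, 8.5, 9.0), each = 2)
)

by_speaker <- speaker_lines %>%
  group_by(episode) %>%
  mutate(pct_lines = n / sum(n)) %>%   # share of the episode's lines
  group_by(speaker) %>%
  summarize(
    cor_n   = cor(log2(n), imdb_rating),     # confounded by episode length
    cor_pct = cor(pct_lines, imdb_rating)    # length is divided out
  )

by_speaker
```

Because the shares within an episode sum to one, a longer episode no longer inflates every speaker's value at once.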
Visualize results with geom_boxplot, and geom_point with geom_smooth.
Use a linear model to predict imdb rating based on various variables.
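The notes do not list which variables go into the model, so the sketch below uses two hypothetical predictors (season and pct_ross, one character's share of lines) on invented data, only to show the shape of an lm call.

```r
# Hypothetical episode-level data; none of these values come from the show
episodes <- data.frame(
  season      = c(1, 1, 2, 2, 3, 3),
  pct_ross    = c(0.20, 0.30, 0.25, 0.15, 0.35, 0.10),
  imdb_rating = c(8.1, 8.4, 8.3, 8.0, 8.6, 7.9)
)

# Ordinary least squares: rating as a linear function of the predictors
fit <- lm(imdb_rating ~ season + pct_ross, data = episodes)

coef(fit)
```

summary(fit) would additionally report standard errors and p-values for each coefficient.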
Use the tidytext and tidylo packages to see what words are most common amongst characters, and whether they are said more times than would be expected by chance.
Use geom_col to visualize the most overrepresented words per character according to log_odds_weighted.
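A minimal sketch of how log_odds_weighted can be computed, assuming the usual tidytext and tidylo workflow of tokenizing, counting, and calling bind_log_odds. The dialogue lines are invented.

```r
library(dplyr)
library(tidytext)
library(tidylo)

# Hypothetical dialogue, one row per line
dialogue <- tibble(
  speaker = c("Ross", "Ross", "Joey", "Joey"),
  text    = c("dinosaurs are great", "i like dinosaurs",
              "i like food", "food is great")
)

word_odds <- dialogue %>%
  unnest_tokens(word, text) %>%          # one row per word
  count(speaker, word) %>%               # word counts per speaker
  bind_log_odds(speaker, word, n) %>%    # adds log_odds_weighted
  arrange(desc(log_odds_weighted))

word_odds
```

Taking the top few rows per speaker gives the words a geom_col chart would display.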
Use the widyr package and pairwise correlation to determine which characters tend to appear in the same scenes together.
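One way to sketch this is with widyr's pairwise_cor, treating each scene as a feature and each speaker as an item. The scene column and the toy appearances are assumptions for illustration.

```r
library(dplyr)
library(widyr)

# Hypothetical appearances: one row per speaker per scene
appearances <- tibble(
  scene   = c(1, 1, 2, 2, 3, 3, 4),
  speaker = c("Ross", "Rachel", "Ross", "Rachel", "Ross", "Joey", "Joey")
)

# Correlation between speakers based on which scenes they share;
# with no value column given, presence in a scene is treated as 1
speaker_pairs <- appearances %>%
  pairwise_cor(speaker, scene, sort = TRUE)

speaker_pairs
```

The result has one row per ordered pair of speakers, with columns item1, item2, and correlation.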
Use geom_col to visualize the correlation between characters.
Summary of screencast.