Chopped
Notable topics: Data manipulation, Modeling (Linear Regression, Random Forest, and Natural Spline)
Recorded on: 2020-08-24
Timestamps by: Eric Fletcher
Screencast
Timestamps
Use geom_histogram to visualize the distribution of episode ratings.
Use geom_point and geom_line with color = factor(season) to visualize the episode rating for every episode.
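
A minimal sketch of the per-episode rating plot described above. The data frame is a made-up stand-in, and the use of series_episode on the x axis is an assumption for illustration.

```r
library(ggplot2)

# Toy data, invented for illustration only
chopped <- data.frame(
  series_episode = 1:6,
  season = rep(1:2, each = 3),
  episode_rating = c(8.1, 8.5, 8.3, 8.8, 8.6, 9.0)
)

# One line through all episodes, with points colored by season
p <- ggplot(chopped, aes(series_episode, episode_rating)) +
  geom_line(alpha = 0.5, color = "gray") +
  geom_point(aes(color = factor(season))) +
  labs(x = "Episode number in series", y = "Episode rating", color = "Season")
```

Wrapping season in factor() gives each season its own discrete color instead of a continuous gradient.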
Use group_by and summarize to show the average rating for each season and the number of episodes in each season.
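
The per-season summary could look roughly like this. The toy data and the avg_rating column name are assumptions, as is the choice to drop unrated episodes before averaging.

```r
library(dplyr)

# Toy data, invented for illustration only
chopped <- tibble(
  season = c(1, 1, 2, 2, 2),
  episode_rating = c(8.0, 9.0, 7.0, NA, 8.0)
)

by_season <- chopped %>%
  filter(!is.na(episode_rating)) %>%  # assumption: ignore episodes without a rating
  group_by(season) %>%
  summarize(
    n_episodes = n(),                 # number of rated episodes in the season
    avg_rating = mean(episode_rating)
  )
```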
Continuing from previous row:
Use geom_line and geom_point with size = n_episodes to visualize the average rating for each season, with point size indicating the total number of episodes (larger = more episodes, smaller = fewer episodes).
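
A sketch of the season-level plot, assuming a summary table with season, avg_rating, and n_episodes columns (avg_rating is a hypothetical name; the values are invented).

```r
library(ggplot2)

# Toy summary table, invented for illustration only
by_season <- data.frame(
  season = 1:4,
  avg_rating = c(8.1, 8.4, 8.3, 8.6),
  n_episodes = c(13, 13, 10, 20)
)

# Mapping size inside aes() scales each point by its episode count
p <- ggplot(by_season, aes(season, avg_rating)) +
  geom_line() +
  geom_point(aes(size = n_episodes)) +
  labs(x = "Season", y = "Average episode rating", size = "Episodes")
```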
Use fct_reorder to reorder the episode_name factor levels by sorting along the episode_rating variable.
Use geom_point to visualize the top episodes by rating.
Use the glue package to place season number and episode number before episode name on the y axis.
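
The three steps above (labelling with glue, reordering with fct_reorder, plotting with geom_point) might be combined as follows. The season_episode column, the "season.episode" label format, and the data are assumptions made for this sketch.

```r
library(dplyr)
library(forcats)
library(glue)
library(ggplot2)

# Toy data, invented for illustration only
episodes <- tibble(
  season = c(1, 1, 2),
  season_episode = c(1, 2, 1),
  episode_name = c("Pilot Basket", "Second Round", "New Season"),
  episode_rating = c(8.2, 9.1, 7.5)
)

top_episodes <- episodes %>%
  mutate(
    # Prefix the name with season and episode number (format is an assumption)
    label = as.character(glue("{season}.{season_episode} {episode_name}")),
    # Order the labels by rating so the plot reads from lowest to highest
    label = fct_reorder(label, episode_rating)
  )

p <- ggplot(top_episodes, aes(episode_rating, label)) +
  geom_point() +
  labs(x = "Episode rating", y = NULL)
```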
Use pivot_longer to combine ingredients into one single column.
Use separate_rows with sep = ", " to separate out the ingredients, with each ingredient getting its own row.
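
A sketch of the reshaping described in the two steps above, assuming one column per course (appetizer, entree, dessert) holding a comma-separated ingredient string. The rows are invented.

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration only
chopped <- tibble(
  series_episode = c(1, 2),
  appetizer = c("baby octopus, bok choy", "salmon, rice"),
  entree = c("duck breast, green onions", "chicken, bok choy"),
  dessert = c("prunes, animal crackers", "rice, cream cheese")
)

ingredients <- chopped %>%
  # Stack the three course columns into course/ingredient pairs
  pivot_longer(
    cols = c(appetizer, entree, dessert),
    names_to = "course",
    values_to = "ingredient"
  ) %>%
  # Split each comma-separated string into one row per ingredient
  separate_rows(ingredient, sep = ", ")
```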
Use fct_lump to lump ingredients together except for the 10 most frequent.
Use fct_reorder to reorder ingredient factor levels by sorting against n.
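
A small sketch of lumping and reordering. It keeps the 2 most frequent ingredients rather than 10 only because the toy data is tiny; the data itself is invented.

```r
library(dplyr)
library(forcats)

# Toy data, invented for illustration only
ingredients <- tibble(
  ingredient = c(rep("rice", 4), rep("salmon", 3), "prunes", "duck")
)

top_ingredients <- ingredients %>%
  # Keep the 2 most frequent ingredients; everything else becomes "Other"
  count(ingredient = fct_lump(ingredient, n = 2)) %>%
  # Order the levels by count so a bar plot sorts from least to most common
  mutate(ingredient = fct_reorder(ingredient, n))
```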
Use geom_col to create a stacked bar plot to visualize the most common ingredients by course.
Use fct_relevel to reorder course factor levels to appetizer, entree, dessert.
Use fct_rev and scale_fill_discrete with guide = guide_legend(reverse = TRUE) to reorder the segments within the stacked bar plot.
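
The stacked bar plot and its ordering tweaks could be sketched like this, assuming a table of counts per ingredient and course (the counts are invented).

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Toy counts, invented for illustration only
counts <- tibble(
  ingredient = rep(c("rice", "salmon"), each = 3),
  course = rep(c("appetizer", "dessert", "entree"), times = 2),
  n = c(3, 1, 4, 2, 0, 5)
)

# Put the courses in serving order rather than alphabetical order
counts <- counts %>%
  mutate(course = fct_relevel(course, "appetizer", "entree", "dessert"))

# fct_rev() flips the stacking order of the segments;
# reversing the legend keeps it reading appetizer, entree, dessert
p <- ggplot(counts, aes(n, ingredient, fill = fct_rev(course))) +
  geom_col() +
  scale_fill_discrete(guide = guide_legend(reverse = TRUE)) +
  labs(x = "Number of appearances", y = NULL, fill = "Course")
```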
Use the widyr package and pairwise_cor to find out what ingredients appear together.
Mentioned: David Robinson - The {widyr} Package YouTube Talk at 2020 R Conference
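
A sketch of the pairwise_cor step, assuming a long table with one row per ingredient per episode. Using series_episode as the grouping feature is an assumption, and the data is invented.

```r
library(dplyr)
library(widyr)

# Toy data, invented for illustration only
ingredients <- tibble(
  series_episode = c(1, 1, 2, 2, 3, 3, 4, 4),
  ingredient = c("rice", "salmon", "rice", "salmon",
                 "duck", "prunes", "duck", "rice")
)

# Correlate ingredients by whether they show up in the same episodes
correlations <- ingredients %>%
  pairwise_cor(ingredient, series_episode, sort = TRUE)
```

The result has one row per ordered pair (item1, item2) with a correlation column; positive values mean the two ingredients tend to appear in the same episodes.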
Use ggraph, geom_edge_link, geom_node_point, and geom_node_text to create a network diagram of the ingredients, showing which ones are connected and how they relate to each other.
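
A sketch of the network diagram, assuming a table of ingredient pairs with a correlation column. The pairs are invented, and the "fr" layout and the igraph conversion are choices made for this example.

```r
library(ggraph)  # also attaches ggplot2

# Toy pairs, invented for illustration only
pairs <- data.frame(
  item1 = c("rice", "rice", "duck"),
  item2 = c("salmon", "duck", "prunes"),
  correlation = c(0.6, 0.3, 0.5)
)

# The first two columns become the edge list; the rest become edge attributes
graph <- igraph::graph_from_data_frame(pairs)

set.seed(2020)  # the layout is randomized, so fix the seed for repeatability

p <- ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```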
Use pairwise_count from widyr to count the number of times each pair of items appears together within a group defined by feature.
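
A sketch of pairwise_count on invented data, again assuming series_episode as the grouping feature. Setting upper = FALSE to list each pair once is a choice for this example.

```r
library(dplyr)
library(widyr)

# Toy data, invented for illustration only
ingredients <- tibble(
  series_episode = c(1, 1, 2, 2, 3, 3),
  ingredient = c("rice", "salmon", "rice", "salmon", "rice", "duck")
)

# Count how many episodes each pair of ingredients shares
pair_counts <- ingredients %>%
  pairwise_count(ingredient, series_episode, sort = TRUE, upper = FALSE)
```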
Use unite from the tidyr package to paste together the episode_course and series_episode columns into one column, to figure out whether any pairs of ingredients appear together in the same course across episodes.
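
A sketch of the unite step. In tidyr's unite, the first argument after the data is the name of the new column, so this example assumes the combined key is called episode_course and is built from series_episode and a course column. That reading, the course column name, and the data are assumptions.

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration only
ingredients <- tibble(
  series_episode = c(1, 1, 2),
  course = c("appetizer", "entree", "appetizer"),
  ingredient = c("rice", "duck", "salmon")
)

# Build one key per episode-and-course; the source columns are dropped by default
with_key <- ingredients %>%
  unite("episode_course", series_episode, course, sep = "_")
```

The combined key can then serve as the feature in pairwise_cor or pairwise_count, so that pairs are only counted when they share a basket in the same course.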
Use summarize with min, mean, max, and n() to create the first_season, avg_season, last_season, and n_appearances variables.
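
A sketch of the per-ingredient summary, assuming a long table with one row per ingredient appearance and a season column. The data is invented.

```r
library(dplyr)

# Toy data, invented for illustration only
ingredients <- tibble(
  season = c(1, 3, 5, 2, 2),
  ingredient = c("rice", "rice", "rice", "duck", "duck")
)

by_ingredient <- ingredients %>%
  group_by(ingredient) %>%
  summarize(
    first_season = min(season),   # earliest season the ingredient appears in
    avg_season = mean(season),    # where its appearances are centered
    last_season = max(season),    # latest season the ingredient appears in
    n_appearances = n()           # total number of appearances
  )
```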
Use slice with tail to get the n ingredients that appear in early and late seasons.
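
One way to pick both ends of a sorted table with slice and tail. Sorting by avg_season and the n_each helper variable are assumptions for this sketch, and the data is invented.

```r
library(dplyr)

# Toy summary table, invented for illustration only
by_ingredient <- tibble(
  ingredient = c("a", "b", "c", "d", "e", "f"),
  avg_season = c(3, 40, 12, 1, 25, 38)
)

n_each <- 2  # hypothetical: how many to keep from each end

early_and_late <- by_ingredient %>%
  arrange(avg_season) %>%
  # First n_each row positions plus the last n_each row positions
  slice(c(seq_len(n_each), tail(row_number(), n_each)))
```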
Use geom_boxplot to visualize the distribution of each ingredient across all seasons.
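
A sketch of the boxplot, assuming one row per ingredient appearance with the season it occurred in. The data is invented.

```r
library(ggplot2)

# Toy data, invented for illustration only
ingredients <- data.frame(
  ingredient = rep(c("rice", "duck"), each = 4),
  season = c(1, 5, 12, 30, 20, 25, 33, 40)
)

# One horizontal box per ingredient, spanning the seasons it appears in
p <- ggplot(ingredients, aes(season, ingredient)) +
  geom_boxplot() +
  labs(x = "Season", y = NULL)
```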
Fit predictive models (linear regression, random forest, and natural spline) to determine if episode rating is explained by the ingredients or season.
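
A base R sketch comparing a straight-line fit with a natural spline on season. This is a swapped-in illustration using lm and splines::ns, not necessarily the modeling workflow used in the screencast; the random forest is left out, and the ratings are simulated.

```r
library(splines)

set.seed(1)

# Simulated ratings, invented for illustration only
episodes <- data.frame(season = rep(1:10, each = 5))
episodes$episode_rating <- 8 + 0.05 * episodes$season + rnorm(50, sd = 0.2)

# Linear regression: rating changes at a constant rate per season
linear_fit <- lm(episode_rating ~ season, data = episodes)

# Natural spline: lets the season trend bend (3 degrees of freedom, an arbitrary choice)
spline_fit <- lm(episode_rating ~ ns(season, df = 3), data = episodes)

# Compare the two fits; lower AIC suggests a better trade-off of fit and complexity
model_comparison <- AIC(linear_fit, spline_fit)
```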
Use pivot_wider with values_fill = list(value = 0) to give each ingredient its own column, with 1 indicating the ingredient was used and 0 indicating it wasn't used.
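
A sketch of building the 0/1 ingredient indicators, assuming a long table keyed by series_episode. The value column name matches the values_fill call above; the data is invented.

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration only
ingredients <- tibble(
  series_episode = c(1, 1, 2),
  ingredient = c("rice", "duck", "rice")
)

wide <- ingredients %>%
  distinct(series_episode, ingredient) %>%  # one row per episode-ingredient pair
  mutate(value = 1) %>%                     # 1 marks that the ingredient was used
  pivot_wider(
    names_from = ingredient,
    values_from = value,
    values_fill = list(value = 0)           # absent combinations become 0
  )
```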
Summary of screencast.