Ramen Reviews
Notable topics: Web scraping (rvest package)
Recorded on: 2019-06-03
Timestamps by: Alex Cookson
Screencast
Timestamps
Looking at the website the data came from
Using gather function (now pivot_longer) to convert wide data to long (tidy) format
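The wide-to-long step can be sketched as below; the toy tibble is a hypothetical slice of the ramen data, not the real dataset:

```r
library(tidyverse)

# Hypothetical slice of the data: one column per categorical variable
ramen <- tibble(
  stars   = c(4.5, 3.75),
  brand   = c("Nissin", "Maruchan"),
  style   = c("Pack", "Cup"),
  country = c("Japan", "USA")
)

# gather(category, value, -stars) is now written as:
tidy_ramen <- ramen %>%
  pivot_longer(-stars, names_to = "category", values_to = "value")
```

In long form, a single `count(category, value)` covers every categorical variable at once, which is what makes the all-variables-at-once graph in the next step possible.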
Graphing counts of all categorical variables at once, then exploring them
Using fct_lump function to lump three categorical variables to the top N categories and "Other"
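A minimal sketch of the lumping, using made-up brand counts rather than the real data:

```r
library(forcats)

brands <- factor(c("Nissin", "Nissin", "Nissin",
                   "Maruchan", "Maruchan", "Acecook"))

# Keep the top n most common levels; everything else becomes "Other"
lumped <- fct_lump(brands, n = 1)
levels(lumped)  # "Nissin" "Other"
```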
Using reorder_within function to re-order factors that have the same name across multiple facets
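The trick behind `reorder_within` is that it appends the facet value to each factor level (e.g. `Nissin___brand`), so the "same" level can sort differently per facet; `scale_x_reordered` then strips the suffix off the axis labels. A sketch with toy counts standing in for the real ones:

```r
library(tidytext)
library(tidyverse)

counts <- tibble(
  category = c("brand", "brand", "country", "country"),
  value    = c("Nissin", "Maruchan", "Japan", "USA"),
  n        = c(3, 2, 4, 1)
)

counts %>%
  mutate(value = reorder_within(value, n, category)) %>%
  ggplot(aes(value, n)) +
  geom_col() +
  scale_x_reordered() +
  facet_wrap(~ category, scales = "free_y") +
  coord_flip()
```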
Using lm function (linear model) to predict star rating
Visualising effects (and 95% CI) of independent variables in linear model with a coefficient plot (TIE fighter plot)
Using fct_relevel function to get "Other" as the base reference level for categorical independent variables in a linear model
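The three steps above fit together roughly as follows. This is a sketch: the `ramen_ratings` data frame, its column names, and the lumping cutoffs (20/12/4) are assumptions, not the screencast's exact values:

```r
library(tidyverse)
library(broom)

# Lump sparse levels, then make "Other" the base reference level
lm_data <- ramen_ratings %>%
  mutate(
    brand   = fct_relevel(fct_lump(brand, 20), "Other"),
    country = fct_relevel(fct_lump(country, 12), "Other"),
    style   = fct_relevel(fct_lump(style, 4), "Other")
  )

# TIE fighter plot: point estimate plus 95% CI error bar per coefficient,
# with a dashed reference line at zero
lm(stars ~ brand + country + style, data = lm_data) %>%
  tidy(conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(estimate, term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, lty = 2)
```

Releveling to "Other" means every coefficient reads as "effect relative to an Other-category ramen", which makes the plot interpretable.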
Using extract function and regex to split a camelCase variable into two separate variables
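The camelCase split works because `lm` names coefficients like `brandNissin`: a lower-case prefix (the variable) glued to a capitalised level. A regex with two capture groups recovers both parts; the example terms here are illustrative:

```r
library(tidyr)
library(dplyr)

coefs <- tibble(term = c("brandNissin", "countryJapan", "stylePack"))

# Group 1: the lower-case run; group 2: everything from the first capital on
split_coefs <- coefs %>%
  extract(term, c("category", "value"), "^([a-z]+)([A-Z].*)$")
```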
Using facet_wrap function to split coefficient / TIE fighter plot into three separate plots, based on type of coefficient
Using geom_vline function to add reference line to graph
Using unnest_tokens function from tidytext package to explore the relationship between variety (a sparse categorical variable) and star rating
Explanation of how he would approach the variety variable with Lasso regression
Web scraping using the rvest package and SelectorGadget (Chrome extension for finding CSS selectors)
Actually writing code for web scraping, using read_html, html_node, and html_table functions
Using clean_names function from janitor package to clean up names of variables
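The table-scraping step looks roughly like this. The URL is The Ramen Rater's review list from the screencast's data source, but the `#myTable` selector is an assumption (in practice you would find it with SelectorGadget):

```r
library(rvest)
library(janitor)

# Read the page once, pull out the big review table, tidy the column names
ramen_list <- read_html("https://www.theramenrater.com/resources-2/the-list/")

ramen_reviews <- ramen_list %>%
  html_node("#myTable") %>%   # selector is an assumption
  html_table() %>%
  clean_names()               # e.g. "Review #" -> "review_number"
```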
Explanation of web scraping task: get full review text using the links from the review summary table scraped above
Using parse_number function as alternative to as.integer function to cleverly drop extra weird text in review number
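The `parse_number` trick, with a made-up messy string (the real review-number text may differ):

```r
library(readr)

# as.integer("#3180: Review") would give NA with a warning;
# parse_number() extracts the first number and drops the surrounding text
parse_number("#3180: Review")
```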
Using SelectorGadget (Chrome Extension CSS selector) to identify part of page that contains review text
Using html_nodes, html_text, and str_subset functions to write custom function to scrape review text identified in step above
Adding message function to custom scraping function to display URLs as they are being scraped
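A sketch of such a scraping function; the `.entry-content p` selector and the page structure it assumes are guesses, not confirmed from the screencast:

```r
library(rvest)
library(stringr)

get_review_text <- function(url) {
  message("Scraping ", url)          # progress report as each URL is hit
  read_html(url) %>%
    html_nodes(".entry-content p") %>%   # selector is an assumption
    html_text() %>%
    str_subset(".")                  # keep only non-empty paragraphs
}
```

Note that the body uses the `url` argument rather than a hard-coded address; hard-coding the URL inside the function is exactly the mistake caught a few steps later.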
Using unnest_tokens and anti_join functions to split review text into individual words and remove stop words (e.g., "the", "or", "and")
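The tokenise-and-filter step can be sketched with a one-row toy review:

```r
library(tidytext)
library(dplyr)

reviews <- tibble(review = 1,
                  text = "The broth was rich and the noodles were firm")

# One row per word; anti_join against tidytext's stop_words lexicon
# drops filler words like "the", "and", "was"
words <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```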
Catching a mistake in the custom function causing it to read the same URL every time
Using str_detect function to keep only the review paragraphs that contain a keyword
Using str_remove function and regex to get rid of string that follows a specific pattern
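The two string steps together, on made-up paragraphs (the "Finished" keyword and the regex pattern are assumptions):

```r
library(stringr)

paragraphs <- c(
  "Finished (complete). The broth was rich.",
  "Click here to subscribe."
)

# Keep paragraphs containing the keyword, then strip the matching prefix
kept <- paragraphs[str_detect(paragraphs, "Finished")]
cleaned <- str_remove(kept, "Finished \\(.*?\\)\\. ")
```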
Explanation of possibly and safely functions in purrr package
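The difference between the two wrappers, shown on a toy function rather than the scraper: `possibly` swaps errors for a default value, while `safely` always returns a `$result`/`$error` list. For a text scraper, `character(0)` makes a natural "empty result" default, as in the next step:

```r
library(purrr)

risky <- function(x) if (x < 0) stop("negative input") else sqrt(x)

# possibly(): failures become the `otherwise` value, no error raised
safe_sqrt <- possibly(risky, otherwise = NA_real_)
safe_sqrt(4)   # 2
safe_sqrt(-1)  # NA

# safely(): always returns list(result = ..., error = ...)
safely(risky)(-1)$result  # NULL, with the error stored alongside
```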
Reviewing output of the URL that failed to scrape, including using character(0) as a default null value
Using pairwise_cor function from widyr package to see which words tend to appear in reviews together
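A minimal sketch of `pairwise_cor` on toy (review, word) pairs; it computes the phi correlation between each pair of words across reviews:

```r
library(widyr)
library(dplyr)

# Hypothetical tidy data: one row per (review, word)
review_words <- tibble(
  review = c(1, 1, 2, 2, 3, 3),
  word   = c("broth", "rich", "broth", "rich", "noodles", "firm")
)

# Words that always co-occur (broth/rich here) get correlation 1
word_cors <- review_words %>%
  pairwise_cor(word, review, sort = TRUE)
```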
Using igraph and ggraph packages to make network plot of word correlations
Using geom_node_text function to add labels to network plot
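The network-plot pipeline, sketched with toy correlations standing in for the real `pairwise_cor` output (the 0.25 cutoff and layout are assumptions):

```r
library(igraph)
library(ggraph)
library(dplyr)

word_cors <- tibble(
  item1 = c("broth", "noodles"),
  item2 = c("rich", "firm"),
  correlation = c(0.8, 0.6)
)

set.seed(2019)  # the "fr" layout is random; fix the seed for reproducibility
word_cors %>%
  filter(correlation > 0.25) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation)) +
  geom_node_point(size = 3, colour = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```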
Including all words (not just those connected to others) as vertices in the network plot
Tweaking and refining network plot aesthetics (vertex size and colour)
Weird hack for getting a dark outline on hard-to-see vertex points
Summary of screencast