Medium Articles
Notable topics: Text mining (tidytext package)
Recorded on: 2018-12-03
Timestamps by: Alex Cookson
Screencast
Timestamps
Using summarise_at and starts_with functions to quickly sum up all variables starting with "tag_"
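A minimal sketch of that step, assuming the data frame is called medium_datasci and the tag columns are 0/1 indicators (both names are assumptions):

```r
library(dplyr)

# One row per article; tag_* columns are assumed 0/1 indicators,
# e.g. tag_ai, tag_big_data, tag_data_science, ...
medium_datasci %>%
  summarise_at(vars(starts_with("tag_")), sum)

# The same idea in current dplyr:
# medium_datasci %>% summarise(across(starts_with("tag_"), sum))
```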
Using gather function (now pivot_longer) to convert topic tag variables from wide to tall (tidy) format
Explanation of how gathering step above will let us find the most/least common tags
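A sketch of the gathering step under the same assumptions (medium_datasci with 0/1 tag_* columns):

```r
library(dplyr)
library(tidyr)
library(stringr)

# Reshape the tag_* columns into one row per (article, tag) pair,
# keep only tags the article actually has, and strip the "tag_" prefix.
medium_gathered <- medium_datasci %>%
  gather(tag, value, starts_with("tag_")) %>%   # pivot_longer() in current tidyr
  filter(value == 1) %>%
  mutate(tag = str_remove(tag, "^tag_"))

# Most (and, reversed, least) common tags:
medium_gathered %>%
  count(tag, sort = TRUE)
```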
Explanation of using median (instead of mean) as measure of central tendency for number of claps an article got
Visualizing log-normal (ish) distribution of number of claps an article gets
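One way to draw that distribution, assuming a claps column; shifting by +1 so zero-clap articles survive the log scale is an assumption, not necessarily how the screencast handles it:

```r
library(ggplot2)

# Roughly log-normal: most articles get a handful of claps, a few get thousands.
ggplot(medium_datasci, aes(claps + 1)) +
  geom_histogram() +
  scale_x_log10(labels = scales::comma_format())
```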
Using pmin function to cap reading times at 10 minutes, so articles 10 minutes or longer fall into a single bin
Changing scale_x_continuous function's breaks argument to get custom labels and tick marks on a histogram
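A sketch of the capping and relabelling, assuming a reading_time column; the bin width and break positions are illustrative:

```r
library(dplyr)
library(ggplot2)

medium_datasci %>%
  mutate(reading_time = pmin(10, reading_time)) %>%   # everything >= 10 collapses to 10
  ggplot(aes(reading_time)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(breaks = seq(2, 10, 2),
                     labels = c(2, 4, 6, 8, "10+"))    # relabel the capped bin
```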
Discussion of using mean vs. median as measure of central tendency for reading time (he decides on mean)
Starting text mining analysis
Using unnest_tokens function from tidytext package to split character string into individual words
Explanation of stop words and using the anti_join function (from dplyr) with tidytext's stop_words dataset to get rid of them
Using str_detect function to filter out "words" that are just numbers (e.g., "2", "35")
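A sketch of the tokenizing and cleaning pipeline, assuming columns named title and claps; the exact number-filtering regex is an assumption:

```r
library(dplyr)
library(tidytext)
library(stringr)

medium_words <- medium_datasci %>%
  filter(!is.na(title)) %>%
  unnest_tokens(word, title) %>%             # one row per word per title
  anti_join(stop_words, by = "word") %>%     # stop_words ships with tidytext; anti_join is dplyr
  filter(!str_detect(word, "^\\d+$"))        # drop tokens that are just numbers, e.g. "2", "35"
```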
Quick analysis of which individual words are associated with more/fewer claps ("What are the hype words?")
Using geometric mean as alternative to median to get more distinction between words (note 27:33 where he makes a quick fix)
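A sketch of the geometric-mean summary; the +1 guard against log(0) and the minimum-occurrence cutoff of 100 are assumptions:

```r
library(dplyr)

medium_words %>%
  group_by(word) %>%
  summarise(n = n(),
            geometric_mean_claps = exp(mean(log(claps + 1))) - 1) %>%
  filter(n >= 100) %>%
  arrange(desc(geometric_mean_claps))   # the "hype words" rise to the top
```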
Starting analysis of clusters of related words (e.g., "neural" is linked to "network")
Finding correlations between pairs of words using pairwise_cor function from widyr package
Using ggraph and igraph packages to make network plot of correlated pairs of words
Using geom_node_text to add labels for points (vertices) in the network plot
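A sketch of the correlation and network-plot steps; the post_id article identifier and the frequency cutoff of 250 are assumptions:

```r
library(dplyr)
library(widyr)
library(igraph)
library(ggraph)

# Correlations between words that tend to appear in the same titles
word_cors <- medium_words %>%
  add_count(word) %>%
  filter(n >= 250) %>%
  pairwise_cor(word, post_id, sort = TRUE)

word_cors %>%
  head(150) %>%                       # keep the most correlated pairs
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
```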
Filtering original data to only include words that appear in the network plot (the 150 most correlated word pairs)
Adding colour as a dimension to the network plot, representing geometric mean of claps
Changing default colour scale to one with blue = low and red = high using the scale_colour_gradient2 function
Adding dark outlines to points on network plot with a hack
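A sketch of the coloured network plot; word_summary stands for the per-word summary built earlier (word, n, geometric_mean_claps), and the outline "hack" shown here (a slightly larger dark point drawn underneath each coloured point) is one way to do it, not necessarily the screencast's exact trick:

```r
library(dplyr)
library(igraph)
library(ggraph)

top_cors <- word_cors %>% head(150)

# Vertex attributes for the words that appear in those pairs
vertices <- word_summary %>%
  filter(word %in% top_cors$item1 | word %in% top_cors$item2)

top_cors %>%
  graph_from_data_frame(vertices = vertices) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(aes(size = n * 1.1)) +                            # dark "outline" layer underneath
  geom_node_point(aes(size = n, colour = geometric_mean_claps)) +   # coloured layer on top
  geom_node_text(aes(label = name), repel = TRUE) +
  scale_colour_gradient2(low = "blue", high = "red",
                         midpoint = median(vertices$geometric_mean_claps)) +
  theme_void() +
  labs(colour = "Geometric mean of claps", size = "# of uses in titles")
```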
Starting to predict number of claps based on words in the title (Lasso regression)
Explanation of data format needed to conduct Lasso regression (and using cast_sparse function to get sparse matrix)
Bringing the number of claps in alongside the rows of the sparse matrix (using un-tidy methods)
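A sketch of building the sparse matrix and lining up the response; post_id as the article identifier is an assumption:

```r
library(dplyr)
library(tidytext)

# Lasso needs a numeric matrix: one row per article, one column per word,
# with a 1 where the word appears in that article's title.
word_matrix <- medium_words %>%
  distinct(post_id, word) %>%
  cast_sparse(post_id, word)

# Line the response up with the matrix rows (the "un-tidy" step)
claps <- medium_datasci$claps[match(rownames(word_matrix), medium_datasci$post_id)]
```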
Using cv.glmnet function (cv = cross validated) from glmnet package to run Lasso regression
Finding and fixing mistake in defining Lasso model
Explanation of Lasso model
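A sketch of the cross-validated Lasso fit; predicting log(claps + 1) rather than raw claps is an assumption, consistent with the skewed distribution noted earlier:

```r
library(glmnet)

# cv.glmnet's default alpha = 1 is the lasso penalty
cv_lasso <- cv.glmnet(word_matrix, log(claps + 1))

plot(cv_lasso)   # cross-validated error across the lambda penalty path
```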
Using tidy function from the broom package to tidy up the Lasso model
Visualizing how specific words affect the prediction of claps as lambda (Lasso's penalty parameter) changes
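A sketch of tidying the model and plotting coefficient paths; the example words are hypothetical:

```r
library(broom)
library(dplyr)
library(ggplot2)

# How each word's coefficient shrinks toward zero as the lasso penalty grows
tidy(cv_lasso$glmnet.fit) %>%
  filter(term %in% c("hadoop", "deep", "learning", "gdpr")) %>%
  ggplot(aes(lambda, estimate, colour = term)) +
  geom_line() +
  scale_x_log10() +
  labs(x = "Lambda (lasso penalty)",
       y = "Coefficient",
       colour = "Word in title")
```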
Summary of screencast