TidyTuesday Tweets
Notable topics: Text mining (tidytext package)
Recorded on: 2019-01-06
Timestamps by: Alex Cookson
Screencast
Timestamps
Importing an rds file using read_rds function
Using floor_date function from lubridate package to round dates down to the start of the month (the "floor" part means rounding down)
Asking, "Which tweets get the most retweets?"
Using contains function to select only columns that contain a certain string ("retweet" in this case)
Exploring likes/retweets ratio, including dealing with one or the other being 0 (which would otherwise mean dividing by zero)
Starting exploration of actual text of tweets
Using unnest_tokens function from tidytext package to break tweets into individual words (using token argument specifically for tweet-style text)
Using anti_join function to filter out stop words (e.g., "and", "or", "the") from tokenized data frame
Calculating summary statistics per word (average retweets and likes), then looking at distributions
Explanation of Poisson log normal distribution (number of retweets fits this distribution)
Additional example of Poisson log normal distribution (number of likes)
Explanation of geometric mean as better summary statistic than median or arithmetic mean
Using floor_date function from lubridate package to floor dates to the week level and tweaking so that a week starts on Monday (default is Sunday)
Asking, "What topic is each week about?" using just the tweet text
Calculating TF-IDF of tweets, with week as the "document"
Using top_n and group_by functions to select the top tf-idf score for each week
Using str_detect function to filter out "words" that are just numbers (e.g., 16, 36)
Using distinct function with .keep_all argument to ensure only top 1 result, as alternative to top_n function (which includes ties)
Making Jenny Bryan disappointed
Using geom_text function to add text labels to graph to show the word associated with each week
Using geom_text_repel function from ggrepel package as an alternative to geom_text function for adding text labels to graph
Using rvest package to scrape web data from a table in Tidy Tuesday README
Starting to look at #rstats tweets
Spotting signs of fake accounts with purchased followers (lots of hashtags)
Explanation of spotting fake accounts
Using str_detect to filter out web URLs
Using str_count function and some regex to count how many hashtags a tweet has
Creating a Bland-Altman plot (total on x-axis, variable of interest on y-axis)
Using geom_text function with check_overlap argument to add labels to scatterplot
Asking, "Who are the most active #rstats tweeters?"
Summary of screencast
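
As a rough illustration of three of the techniques listed above (flooring dates to a Monday-start week, counting hashtags with str_count, and summarising skewed counts with a geometric mean), here is a minimal sketch. The `tweets` data frame and its column names are hypothetical toy data, not the dataset from the screencast. The hashtag regex and the +1 offset inside the geometric mean (used so that zero counts do not break the logarithm) are assumptions, and the screencast may handle both differently.

```r
library(lubridate)
library(stringr)

# Hypothetical toy data (not the screencast's dataset)
tweets <- data.frame(
  created_at = as.Date(c("2019-01-01", "2019-01-06", "2019-01-07")),
  text = c(
    "Loving #rstats and #TidyTuesday",
    "No tags here",
    "#rstats #dataviz #r4ds plot"
  ),
  retweets = c(0, 3, 99)
)

# Floor each date to the start of its week; week_start = 1 makes weeks
# begin on Monday rather than lubridate's default of Sunday
tweets$week <- floor_date(tweets$created_at, unit = "week", week_start = 1)

# Count hashtags per tweet: "#" followed by letters, digits, or underscores
# (an assumed pattern, chosen for illustration)
tweets$hashtags <- str_count(tweets$text, "#[[:alnum:]_]+")

# Geometric mean of retweets, computed on the log scale.
# The +1 / -1 offset is an assumption to cope with zero counts.
geometric_mean <- exp(mean(log(tweets$retweets + 1))) - 1

print(tweets[, c("created_at", "week", "hashtags")])
print(geometric_mean)
```

Because the geometric mean averages on the log scale, the single tweet with 99 retweets pulls it up far less than it pulls up the arithmetic mean, which is why it suits heavily skewed counts such as retweets and likes.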