COVID-19 Open Research Dataset (CORD-19)
Notable topics: JSON-formatted data
Recorded on: 2020-03-17
Timestamps by: Alex Cookson
Screencast
Timestamps
Disclaimer that David's not an epidemiologist
Overview of dataset
Using dir function with its full.names argument to get file paths for all files in a folder
Inspecting JSON-formatted data
Introducing hoist function as a way to deal with nested lists (typical for JSON data)
Continuing to use the hoist function
Brief explanation of pluck specification
Using object.size function to check size of JSON data
Using map_chr and str_c functions together to combine paragraphs of text in a list into a single character string
Using unnest_tokens function from tidytext package to split full paragraphs into individual words
Overview of scispaCy package for Python, which has named entity recognition features
Introducing spacyr package, which is an R wrapper around the Python spaCy package (the library that scispaCy builds on)
Showing how tidytext can use a custom tokenization function (David uses spacyr package's named entity recognition)
Demonstrating the tokenize_words function from the tokenizers package
Actually using a custom tokenizer in unnest_tokens function
Using sample_n function to get a random sample of n rows
Asking, "What are groups of words that tend to occur together?"
Using pairwise_cor from widyr package to find correlation between named entities
Using ggraph and igraph packages to create a network plot
Starting to look at papers' references
Using unnest_longer then unnest_wider functions to convert lists into a tibble
Using str_trunc function to truncate long character strings to a certain number of characters
Using glue function for easy combination of strings and R code
Summary of screencast
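
A minimal sketch of the hoist, map_chr/str_c, and unnest_tokens steps listed above, assuming a toy two-paper list in place of the real CORD-19 JSON files. The field names (paper_id, metadata, title, body_text, text) and the sample sentences are illustrative assumptions, not the code from the screencast.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(tidytext)

# Hypothetical stand-in for two parsed JSON papers (not real CORD-19 records).
papers <- tibble(json = list(
  list(
    paper_id = "p1",
    metadata = list(title = "First toy paper"),
    body_text = list(
      list(text = "Coronaviruses infect many species."),
      list(text = "Bats are a common reservoir.")
    )
  ),
  list(
    paper_id = "p2",
    metadata = list(title = "Second toy paper"),
    body_text = list(
      list(text = "Vaccines take time to develop.")
    )
  )
))

parsed <- papers %>%
  # hoist() pulls named pieces of each nested list into regular columns.
  # A character vector is a pluck specification that walks down one path.
  hoist(json,
        paper_id = "paper_id",
        title = c("metadata", "title"),
        body_text = "body_text") %>%
  # For each paper, extract the text of every paragraph and paste the
  # paragraphs together into a single character string.
  mutate(full_text = map_chr(body_text,
                             ~ str_c(map_chr(.x, "text"), collapse = " "))) %>%
  # Keep only atomic columns before tokenizing.
  select(paper_id, title, full_text)

# Split each paper's text into one lowercase word per row.
words <- parsed %>%
  unnest_tokens(word, full_text)
```

Passing token = a function to unnest_tokens is how a custom tokenizer, such as one built on named entity recognition, would slot into the last step.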