COVID-19 Open Research Dataset (CORD-19)

JSON-formatted data

Published: March 17, 2020

Notable topics: JSON-formatted data

Recorded on: 2020-03-17

Timestamps by: Alex Cookson

Timestamps

| Time | Description | Functions | Packages |
|---|---|---|---|
| 0:00:55 | Disclaimer that David's not an epidemiologist | | |
| 0:02:55 | Overview of dataset | | |
| 0:07:50 | Using the dir function with its full.names argument to get file paths for all files in a folder | dir | |
| 0:09:45 | Inspecting JSON-formatted data | | |
| 0:10:40 | Introducing the hoist function as a way to deal with nested lists (typical for JSON data) | hoist | |
| 0:11:40 | Continuing to use the hoist function | hoist | |
| 0:13:10 | Brief explanation of pluck specification | pluck | |
| 0:16:35 | Using the object.size function to check the size of JSON data | object.size | |
| 0:17:40 | Using the map_chr and str_c functions together to combine paragraphs of text in a list into a single character string | map_chr, str_c | |
| 0:20:00 | Using the unnest_tokens function from the tidytext package to split full paragraphs into individual words | unnest_tokens | tidytext |
| 0:22:50 | Overview of the scispaCy package for Python, which has named entity recognition features | | |
| 0:24:40 | Introducing the spacyr package, which is an R wrapper around the Python spaCy library (on which scispaCy is built) | | spacyr |
| 0:28:50 | Showing how tidytext can use a custom tokenization function (David uses the spacyr package's named entity recognition) | | tidytext |
| 0:32:20 | Demonstrating the tokenize_words function from the tokenizers package | tokenize_words | tokenizers |
| 0:37:00 | Actually using a custom tokenizer in the unnest_tokens function | unnest_tokens | tidytext |
| 0:39:45 | Using the sample_n function to get a random sample of n rows | sample_n | |
| 0:43:25 | Asking, "What are groups of words that tend to occur together?" | | |
| 0:44:30 | Using pairwise_cor from the widyr package to find correlations between named entities | pairwise_cor | widyr |
| 0:45:40 | Using the ggraph and igraph packages to create a network plot | | ggraph, igraph |
| 0:52:05 | Starting to look at papers' references | | |
| 0:53:30 | Using the unnest_longer and then unnest_wider functions to convert lists into a tibble | unnest_wider | |
| 0:59:30 | Using the str_trunc function to truncate long character strings to a certain number of characters | str_trunc | |
| 1:06:25 | Using the glue function for easy combination of strings and R code | glue | |
| 1:19:15 | Summary of screencast | | |