US PhDs

Published: February 21, 2019

Notable topics: Data cleaning (getting messy data into tidy format)

Recorded on: 2019-02-21

Timestamps by: Alex Cookson

Timestamps

| Time | Function | Description |
| --- | --- | --- |
| 0:03:15 | read_xlsx | Using the read_xlsx function to read in an Excel spreadsheet, skipping the first few rows that don't contain data |
| 0:07:25 | | Overview of the very messy starting data |
| 0:08:20 | gather | Using the gather function to clean up a wide dataset |
| 0:09:20 | fill | Using the fill function to fill in NA values with the entry from the previous observation |
| 0:10:10 | fill, ifelse | Cleaning a variable that has number and percent stacked on top of one another, using a combination of the ifelse and fill functions |
| 0:12:00 | spread | Using the spread function on the cleaned data to separate number and percent by year |
| 0:13:50 | str_detect | Spotting a mistake where he had passed the wrong string to the str_detect function |
| 0:16:50 | sample | Using the sample function to get 6 random fields of study to graph |
| 0:18:50 | | Cleaning another dataset, which is much easier to clean |
| 0:19:05 | | Renaming the first field, even without knowing the exact name |
| 0:21:55 | | Cleaning another dataset |
| 0:23:10 | | Discussing the challenge of indentation being used in the original dataset (for group / sub-group distinction) |
| 0:25:20 | | Starting to separate out data that is appended to one another in the original dataset (all, male, female) |
| 0:27:30 | contains | Removing a field with a long name using the contains function |
| 0:28:10 | fct_recode | Using the fct_recode function to rename an oddly-named category in a categorical variable (the ifelse function is probably a better alternative) |
| 0:35:30 | | Discussing a solution to broad major field description and fine major field description (meaningfully indented in original data) |
| 0:39:40 | setdiff | Using the setdiff function to separate broad and fine major fields |
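The gather, fill, ifelse, and spread steps listed above chain together into one cleaning pipeline. The following is a minimal sketch of that idea, not the code from the screencast: the `raw` table, its column names, and all values are made up for illustration, and it assumes the field name sits on its own row above its "Number" and "Percent" rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical messy input: one column per year, with "Number" and
# "Percent" rows stacked underneath each field name.
raw <- tibble(
  field  = c("Life sciences", "Number", "Percent",
             "Engineering", "Number", "Percent"),
  `2016` = c(NA, 100, 40, NA, 150, 60),
  `2017` = c(NA, 120, 44, NA, 150, 56)
)

tidy <- raw %>%
  # Wide to long: one row per (field, year); convert = TRUE turns the
  # year key from character into integer.
  gather(year, value, -field, convert = TRUE) %>%
  # Keep the field name only on header rows, NA elsewhere ...
  mutate(major = ifelse(!field %in% c("Number", "Percent"),
                        field, NA_character_)) %>%
  # ... then carry the last non-NA name down onto the rows below it.
  fill(major) %>%
  # Drop the header rows, which carry no values.
  filter(field %in% c("Number", "Percent")) %>%
  select(major, year, metric = field, value) %>%
  # Long to wide again: one column each for Number and Percent.
  spread(metric, value)

print(tidy)
```

In current versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider, which could be swapped in here without changing the overall approach.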