HBCU Enrollment

Data Cleaning

Published: February 1, 2021

Notable topics: Data Cleaning

Recorded on: 2021-02-01

Timestamps by: Eric Fletcher

Screencast

Timestamps

str_detect

stringr

Detect the presence or absence of a pattern in a string.

separate

tidyr

Separate a character column into multiple columns using a regular expression or numeric locations.

rename

dplyr

Rename a column.

distinct

dplyr

Select only unique/distinct rows from a data frame.

expand_limits

ggplot2

Expand the plot's y-axis limits so that the axis starts at 0.

full_join

dplyr

Combine two datasets while including all rows in x and y.
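As a minimal sketch of what full_join does, assuming two small made-up tables keyed by a hypothetical `year` column (the names and values are illustrative, not the screencast's data):

```r
library(dplyr)

# Hypothetical toy tables; column names and values are illustrative only
black_enrollment <- tibble(year = c(2000, 2010), black = c(100, 120))
all_enrollment   <- tibble(year = c(2010, 2020), total = c(300, 350))

# full_join keeps every year found in either table and fills gaps with NA
joined <- full_join(black_enrollment, all_enrollment, by = "year")
joined
```

Rows that appear in only one of the inputs are kept, with NA in the columns that come from the other table.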

percent

scales

Format y-axis labels as percentages (2.5%, 50%, etc.).

bind_rows

dplyr

Bind multiple data frames by row, with an explanation of why this is not the best approach for combining the datasets given the other options.

rbind, bind_rows

base, dplyr

Brief discussion of the differences between rbind and bind_rows.
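One difference between the two, shown here as a small sketch on made-up tables (this may or may not be the point raised in the screencast): bind_rows matches columns by name and fills missing ones with NA, while base rbind requires the column names to match.

```r
library(dplyr)

# Hypothetical toy tables with one shared and one differing column
a <- tibble(year = 2000, black = 100)
b <- tibble(year = 2010, white = 200)

# bind_rows matches columns by name and fills the missing ones with NA
stacked <- bind_rows(a, b)

# rbind on data frames requires matching column names, so this call fails;
# try() captures the error instead of stopping the script
rbind_result <- try(rbind(as.data.frame(a), as.data.frame(b)), silent = TRUE)
```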

str_remove

stringr

Remove matched patterns in a string.

clean_names

janitor

Turn variable names into snake case (e.g. Standard Error becomes standard_error).

mutate_if, is.character, parse_number

dplyr, base, readr

Convert multiple columns from character to numeric, parsing out the numbers and discarding the other characters in the dataset.
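A minimal sketch of this pattern, assuming a made-up table whose columns were all read in as text (the column names and values are hypothetical):

```r
library(dplyr)
library(readr)

# Hypothetical input: every column arrived as character,
# with thousands separators and a stray footnote marker
enrollment <- tibble(
  year  = c("1990", "2000"),
  total = c("1,234", "5,678*")
)

# Apply parse_number to every character column; it keeps the number
# and drops surrounding non-numeric characters
cleaned <- enrollment %>%
  mutate_if(is.character, parse_number)
cleaned
```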

slice

dplyr

Subset rows using their positions.

gather, mutate, ifelse, str_remove, spread

tidyr, dplyr, stringr, base

Reshape the data from wide to long such that there is one row for each year and race.
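The reshaping could look roughly like the sketch below. The wide layout (a percent column and a `standard_errors_` column per race) is an assumption made for illustration, as is the use of str_detect to tell the two kinds of column apart; the screencast's actual columns may differ.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical wide table: one percent and one standard-error column per race
wide <- tibble(
  year                  = c(2000, 2010),
  white                 = c(88, 90),
  standard_errors_white = c(0.5, 0.4),
  black                 = c(80, 85),
  standard_errors_black = c(0.9, 0.8)
)

long <- wide %>%
  # Stack every non-year column into key/value pairs
  gather(key, value, -year) %>%
  mutate(
    # Label each row as a percent or a standard error
    measure = ifelse(str_detect(key, "^standard_errors_"),
                     "standard_error", "percent"),
    # Strip the prefix so only the race name remains
    race = str_remove(key, "^standard_errors_")
  ) %>%
  select(-key) %>%
  # Put the two measures side by side: one row per year and race
  spread(measure, value)
long
```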

abs

base

Compute the absolute value of x.

str_remove

stringr

Remove matched patterns in a string (e.g. black1 becomes black and white1 becomes white).

fct_reorder

forcats

Reorder factor levels in geom_line plot by sorting along another variable.
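As a small sketch of reordering levels along another variable, the example below uses made-up data with the column names `race_ethnicity` and `percent`, and passes `last` as the summary function so that levels are ordered by each group's final value, largest first:

```r
library(dplyr)
library(forcats)

# Hypothetical values; rows are sorted by year within each group
enrollment <- tibble(
  race_ethnicity = rep(c("a", "b", "c"), each = 2),
  year           = rep(c(2000, 2010), times = 3),
  percent        = c(50, 60, 70, 40, 55, 80)
)

# Order the levels by each group's last percent value, largest first
reordered <- enrollment %>%
  mutate(race_ethnicity = fct_reorder(race_ethnicity, percent, last, .desc = TRUE))

levels(reordered$race_ethnicity)
```

Because ggplot2 draws the legend in factor-level order, ordering by the final value tends to make the legend line up with the right-hand end of the lines.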

bind_rows

dplyr

Bind multiple data frames by row.

fct_relevel

forcats

Reorder factor levels by hand.

str_remove

stringr

Remove a matched pattern from a string to eliminate duplicate entries in the geom_line plot legend.

fct_reorder

forcats

Reorder factor levels in the geom_line plot by sorting along another variable, ordering by the last value so that the lines match the order shown in the legend: `fct_reorder(race_ethnicity, percent, last, .desc = TRUE)`.

read_excel

readxl

Import an external Excel data set from Data.World.

starts_with

tidyselect

Select variables whose names start with a given prefix so that they can be removed.

str_remove, group_by, first, ifelse, cumsum

stringr, dplyr

Unpack data in one column (field_gender) into two separate columns (field, gender).
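One way these functions can combine is sketched below. The layout of `field_gender` (a field name row followed by "Men" and "Women" rows, with a trailing footnote digit on some field names) is an assumption for illustration and may not match the actual spreadsheet; the `degrees` column and the helper columns `is_field` and `block` are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical input: each field row is followed by its gender rows
raw <- tibble(
  field_gender = c("Engineering", "Men", "Women", "Biology1", "Men", "Women"),
  degrees      = c(100, 70, 30, 90, 40, 50)
)

unpacked <- raw %>%
  mutate(
    # TRUE on rows that name a field rather than a gender
    is_field = !field_gender %in% c("Men", "Women"),
    # Running count of field rows gives each block of rows its own id
    block = cumsum(is_field)
  ) %>%
  group_by(block) %>%
  mutate(
    # The first row of each block holds the field name; drop footnote digits
    field  = str_remove(first(field_gender), "\\d+$"),
    # Field rows hold the total; the other rows hold a gender
    gender = ifelse(is_field, "Total", field_gender)
  ) %>%
  ungroup() %>%
  select(field, gender, degrees)
unpacked
```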

Summary of screencast.

Timestamps

0:02:45

0:03:30

0:03:30

0:04:20

0:05:55

0:06:20

0:11:00

0:12:30

0:14:55

0:16:10

0:17:10

0:18:10

0:18:50

0:20:15

0:21:25

0:24:55

0:25:35

0:29:25

0:36:05

0:37:45

0:38:50

0:40:35

0:44:20

0:49:00

0:49:20

0:58:00