College Majors and Income
Graphing for EDA (Exploratory Data Analysis)
Notable topics: Graphing for EDA (Exploratory Data Analysis)
Recorded on: 2018-10-14
Timestamps by: Alex Cookson
Timestamps
Using read_csv function to import data directly from GitHub into R (without cloning the repository)
Creating a histogram (geom_histogram), then a boxplot (geom_boxplot), to explore the distribution of salaries
Using fct_reorder function to sort boxplot of college majors by salary
Using dollar_format function from scales package to convert scientific notation to dollar format (e.g., "4e+04" becomes "$40,000")
Creating a dotplot (geom_point) of the 20 top-earning majors (includes adjusting the axis, using the colour aesthetic, and adding error bars)
Using str_to_title function to convert string from ALL CAPS to Title Case
Creating a Bland-Altman graph to explore relationship between sample size and median salary
Using geom_text_repel function from ggrepel package to get text labels on scatter plot points
Using count function's wt argument to specify what should be counted (default is number of rows)
Spicing up a dull bar graph by adding a redundant colour aesthetic (trick from Julia Silge)
Starting to explore relationship between gender and salary
Creating a stacked bar graph (geom_col) of gender breakdown within majors
Using summarise_at to aggregate men and women from majors into categories of majors
Graphing scatterplot (geom_point) of share of women and median salary
Using geom_smooth function to add a line of best fit to scatterplot above
Explanation of why not to aggregate first when performing a statistical test (including explanation of Simpson's Paradox)
Fixing geom_smooth so that we get one overall line while still being able to map to the colour aesthetic
Predicting median salary from share of women with weighted linear regression (to take sample sizes into account)
Using nest function and tidy function from the broom package to apply a linear model to many categories at once
Using p.adjust function to adjust p-values to correct for multiple testing (using FDR, False Discovery Rate)
Showing how to add an appendix to an R Markdown file with code that doesn't run when compiled
Using fct_lump function to aggregate major categories into the top four and an "Other" category
Adding sample size to the size aesthetic within the aes function
Using ggplotly function from plotly package to create an interactive scatterplot (tooltips appear when moused over)
Exploring IQR (Interquartile Range) of salaries by major
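Example: sorting a boxplot with fct_reorder and labelling the axis with dollar_format

A minimal sketch of the fct_reorder and dollar_format steps listed above. The data frame and its column names (major_category, median) are made up for illustration and are not taken from the screencast's dataset.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(scales)

# Hypothetical salaries (assumption): three categories, three majors each
majors <- tibble(
  major_category = rep(c("Business", "Education", "Engineering"), each = 3),
  median = c(45000, 40000, 50000, 32000, 30000, 35000, 65000, 60000, 70000)
)

# Reorder the factor levels by each category's median salary
# (fct_reorder summarises with median() by default)
majors <- majors %>%
  mutate(major_category = fct_reorder(major_category, median))

# dollar_format() returns a labelling function, so the axis shows
# "$40,000" rather than 4e+04
salary_plot <- ggplot(majors, aes(major_category, median)) +
  geom_boxplot() +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip()

dollar_format()(40000)  # "$40,000"
```

Printing salary_plot draws the boxplots with the lowest-paid category at the bottom. Without fct_reorder the categories would appear in alphabetical order.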
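Example: what the wt argument of count changes

A small sketch of the difference between counting rows and counting a weighted total, as mentioned in the count item above. The data and the column name total are hypothetical.

```r
library(dplyr)

# Hypothetical data (assumption): one row per major, with a made-up
# number of graduates in `total`
majors <- tibble(
  major_category = c("Engineering", "Engineering", "Arts", "Arts", "Arts"),
  total = c(1000, 2000, 300, 200, 100)
)

# Default: n is the number of rows (majors) in each category
rows_per_category <- majors %>%
  count(major_category)

# With wt: n is the sum of `total` within each category
grads_per_category <- majors %>%
  count(major_category, wt = total)
```

Here rows_per_category reports 3 majors for Arts and 2 for Engineering, while grads_per_category reports 600 and 3,000 graduates respectively.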
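Example: one weighted regression per category, with FDR-adjusted p-values

A hedged sketch of how the weighted regression, nest/tidy, and p.adjust steps listed above might fit together. The data are simulated, and the column names (major_category, share_women, sample_size, median) are assumptions rather than the screencast's exact code.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Simulated data (assumption): 3 categories with 10 majors each
set.seed(42)
majors <- tibble(
  major_category = rep(c("A", "B", "C"), each = 10),
  share_women = runif(30),
  sample_size = sample(20:500, 30),
  median = 60000 - 20000 * share_women + rnorm(30, sd = 5000)
)

slopes <- majors %>%
  # One row per category, with that category's majors in a list-column
  nest(data = -major_category) %>%
  mutate(
    # Weighted fit: majors with larger samples count for more
    model = map(data, ~ lm(median ~ share_women, data = .x,
                           weights = sample_size)),
    # tidy() turns each model into a data frame of coefficients
    tidied = map(model, tidy)
  ) %>%
  select(major_category, tidied) %>%
  unnest(tidied) %>%
  # Keep only the slope term from each model
  filter(term == "share_women") %>%
  # Correct for fitting several models at once (False Discovery Rate)
  mutate(fdr = p.adjust(p.value, method = "fdr"))
```

The result has one row per category, holding the estimated slope, its raw p-value, and the adjusted value in fdr. The adjusted values are never smaller than the raw ones.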