Seattle Pet Names
Notable topics: Hypergeometric hypothesis testing, Adjusting for multiple hypothesis testing
Recorded on: 2019-03-15
Timestamps by: Alex Cookson
Screencast
Timestamps
Using mdy function from lubridate package to convert character-formatted date to date-class
Exploratory bar graph showing top species of cats, using geom_col function
Specifying facet_wrap function's ncol argument to get graphs stacked vertically (instead of side-by-side)
Asking, "Are some animal names associated with particular dog breeds?"
Explanation of add_count function
Adding up various metrics (e.g., number of names overall, number of breeds overall), but note a mistake that gets fixed at 17:05
Calculating a ratio for names that appear over-represented within a breed, then explaining how small samples can be misleading
Spotting and fixing an aggregation mistake
Explanation of how to investigate which names might be over-represented within a breed
Explanation of how to use hypergeometric distribution to test for name over-representation
Using phyper function to calculate p-values for a one-sided hypergeometric test
Additional explanation of hypergeometric distribution
First investigation of why and how to interpret a p-value histogram (second at 29:45, third at 37:45, and answer at 39:30)
Noticing that we are missing zeros (i.e., having a breed/name combination with 0 dogs), which is important for the hypergeometric test
Using complete function to turn implicit zeros (for breed/name combination) into explicit zeros
Second investigation of p-value histogram (after adding in implicit zeros)
Explanation of multiple hypothesis testing and correction methods (e.g., Bonferroni, Holm), and applying them using p.adjust function
Explanation of False Discovery Rate (FDR) control as a method for correcting for multiple hypothesis testing, and applying it using p.adjust function
Third investigation of p-value histogram, to hunt for under-represented names
Answer to why the p-value distribution is not well-behaved
Using crossing function to create a simulated dataset to explore how different values affect the p-value
Explanation of how total number of names and total number of breeds affect p-value
More general explanation of what different shapes of p-value histogram might indicate
Renaming variables within a transmute function, using backticks to get names with spaces in them
Using kable function from the knitr package to create a nice-looking table
Explanation of one-sided p-value (as opposed to two-sided p-value)
Summary of screencast
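
The statistical steps listed above (explicit zeros, a one-sided hypergeometric test, and multiple-testing correction) can be sketched in a few lines of base R. This is a minimal illustration, not the code from the screencast: every count, breed, and name below is invented, and the zero-filling step uses expand.grid and merge as a base-R stand-in for the complete function from tidyr.

```r
# All numbers below are hypothetical, made up for illustration only.

# 1. Turn implicit zeros into explicit zeros.
#    (Base-R stand-in for tidyr's complete(); the toy table is invented.)
counts <- data.frame(breed = c("Lab", "Lab", "Pug"),
                     name  = c("Max", "Bella", "Max"),
                     n     = c(3, 2, 1))
all_pairs <- expand.grid(breed = unique(counts$breed),
                         name  = unique(counts$name),
                         stringsAsFactors = FALSE)
filled <- merge(all_pairs, counts, all.x = TRUE)  # joins on breed and name
filled$n[is.na(filled$n)] <- 0                    # Pug/Bella becomes an explicit 0

# 2. One-sided hypergeometric test for over-representation.
overall     <- 1000  # all dogs
name_total  <- 20    # dogs with a given name, across all breeds
breed_total <- 50    # dogs of a given breed
observed    <- 5     # dogs with that name within that breed

# phyper(q, m, n, k) returns P(X <= q). With lower.tail = FALSE it returns
# P(X > q), so passing q = observed - 1 yields P(X >= observed).
p_over <- phyper(observed - 1,
                 m = name_total,
                 n = overall - name_total,
                 k = breed_total,
                 lower.tail = FALSE)

# Cross-check: sum the point probabilities from observed up to the maximum.
p_check <- sum(dhyper(observed:min(name_total, breed_total),
                      m = name_total,
                      n = overall - name_total,
                      k = breed_total))

# 3. Adjust a vector of p-values for multiple testing.
p_values <- c(0.001, 0.01, 0.02, 0.04, 0.2)
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
p_holm       <- p.adjust(p_values, method = "holm")
p_fdr        <- p.adjust(p_values, method = "fdr")  # Benjamini-Hochberg

print(filled)
print(c(p_over = p_over, p_check = p_check))
print(rbind(p_bonferroni, p_holm, p_fdr))
```

The observed - 1 offset matters because the test asks for the probability of seeing at least the observed count; passing observed directly would drop the observed value itself from the tail. On the same input, Bonferroni is the most conservative of the three adjustments and FDR control the least.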