Seattle Pet Names

Hypergeometric hypothesis testing, Adjusting for multiple hypothesis testing

Published: March 15, 2019

Recorded on: 2019-03-15

Timestamps by: Alex Cookson

Screencast

Timestamps

0:02:40  Using mdy function from lubridate package to convert character-formatted date to Date class

0:04:20  Exploratory bar graph showing top species of cats, using geom_col function

0:06:30  Specifying facet_wrap function's ncol argument to get graphs stacked vertically (instead of side-by-side)

0:09:55  Asking, "Are some animal names associated with particular dog breeds?"

0:11:15  Explanation of add_count function

0:12:35  Adding up various metrics (e.g., number of names overall, number of breeds overall), but note a mistake that gets fixed at 17:05

0:16:10  Calculating a ratio for names that appear over-represented within a breed, then explaining how small samples can be misleading

0:17:05  Spotting and fixing an aggregation mistake

0:17:55  Explanation of how to investigate which names might be over-represented within a breed

0:18:55  Explanation of how to use hypergeometric distribution to test for name over-representation

0:20:40  Using phyper function to calculate p-values for a one-sided hypergeometric test

0:23:30  Additional explanation of hypergeometric distribution

0:24:00  First investigation of why and how to interpret a p-value histogram (second at 29:45, third at 37:45, and answer at 39:30)

0:25:15  Noticing that we are missing zeros (i.e., breed/name combinations with 0 dogs), which is important for the hypergeometric test

0:27:10  Using complete function to turn implicit zeros (for breed/name combinations) into explicit zeros

0:29:45  Second investigation of p-value histogram (after adding in implicit zeros)

0:31:55  Explanation of multiple hypothesis testing and correction methods (e.g., Bonferroni, Holm), and applying them using p.adjust function

0:34:25  Explanation of False Discovery Rate (FDR) control as a method for correcting for multiple hypothesis testing, and applying it using p.adjust function

0:37:45  Third investigation of p-value histogram, to hunt for under-represented names

0:39:30  Answer to why the p-value distribution is not well-behaved

0:42:40  Using crossing function to create a simulated dataset to explore how different values affect the p-value

0:44:55  Explanation of how the total number of names and total number of breeds affect the p-value

0:46:00  More general explanation of what different shapes of p-value histogram might indicate

0:47:30  Renaming variables within a transmute function, using backticks to get names with spaces in them

0:49:20  Using kable function from the knitr package to create a nice-looking table

0:50:00  Explanation of one-sided p-value (as opposed to two-sided p-value)

0:53:55  Summary of screencast
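
The core steps covered between 18:55 and 34:25 (explicit zeros, the one-sided hypergeometric test, and multiple-testing adjustment) can be sketched in R roughly as follows. This is a minimal sketch and not the screencast's actual code: the toy `dogs` data and the column names `breed_total`, `name_total`, `total_dogs`, `p_holm`, and `p_fdr` are assumptions made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration: one row per licensed dog.
dogs <- tibble(
  breed = c(rep("Poodle", 6), rep("Labrador", 10), rep("Beagle", 4)),
  name  = c(rep("Fifi", 5), "Max",              # Poodles
            rep("Max", 6), rep("Bella", 4),     # Labradors
            rep("Bella", 2), "Max", "Fifi")     # Beagles
)

name_breed <- dogs %>%
  count(breed, name) %>%
  # Make implicit zeros explicit: pairs that never occur get n = 0.
  complete(breed, name, fill = list(n = 0L)) %>%
  # Totals per breed and per name, weighted by the pair counts.
  add_count(breed, wt = n, name = "breed_total") %>%
  add_count(name, wt = n, name = "name_total") %>%
  mutate(
    total_dogs = sum(n),
    # One-sided test: P(at least n dogs in this breed share this name),
    # drawing breed_total dogs from a pool with name_total "successes".
    p_value = phyper(n - 1, name_total, total_dogs - name_total,
                     breed_total, lower.tail = FALSE),
    # Adjust for testing every breed/name pair at once.
    p_holm = p.adjust(p_value, method = "holm"),
    p_fdr  = p.adjust(p_value, method = "fdr")
  ) %>%
  arrange(p_value)

print(name_breed)
```

Passing `n - 1` with `lower.tail = FALSE` makes the tail include the observed count, so rows with `n = 0` get a p-value of 1. Without the `complete` step those rows would be missing entirely, which is the gap noticed at 25:15.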