Himalayan Climbers
Notable topics: Data Manipulation, Empirical Bayes, Logistic Regression Model
Recorded on: 2020-09-21
Timestamps by: Eric Fletcher
Screencast
Timestamps
Create a `geom_col` chart to visualize the 50 tallest mountains.
Use `fct_reorder` to reorder the `peak_name` factor levels by sorting along the `height_metres` variable.
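A minimal sketch of the reorder-then-plot step, assuming a small hypothetical `peaks` table (the names and heights below are made up for illustration, and the cutoff is 3 rather than 50):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical stand-in for the peaks data; values are illustrative only
peaks <- tibble(
  peak_name = c("Peak A", "Peak B", "Peak C", "Peak D"),
  height_metres = c(7200, 8100, 6900, 7800)
)

tallest <- peaks %>%
  slice_max(height_metres, n = 3) %>%   # the screencast keeps the top 50
  mutate(peak_name = fct_reorder(peak_name, height_metres))

# Bars are drawn in factor-level order, so the tallest peak ends up on top
p <- ggplot(tallest, aes(height_metres, peak_name)) +
  geom_col() +
  labs(x = "Height (metres)", y = NULL)
```

Without `fct_reorder`, ggplot2 would order the bars alphabetically by `peak_name`.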
Use `summarize` with `across` to get the total number of climbs, climbers, and deaths, and the first year climbed.
Use `mutate` to calculate the percent death rate for members and hired staff.
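The summarize-then-mutate pattern from the last two steps might look roughly like this. The `expeditions` table and its column names are assumptions made for the sketch, not necessarily the columns used in the screencast:

```r
library(dplyr)

# Hypothetical expedition-level rows; column names are assumed
expeditions <- tibble(
  peak_id = c("P1", "P1", "P2"),
  year = c(1953, 1975, 1980),
  members = c(10, 8, 5),
  member_deaths = c(1, 0, 1),
  hired_staff = c(20, 12, 4),
  hired_staff_deaths = c(2, 1, 0)
)

peak_summary <- expeditions %>%
  group_by(peak_id) %>%
  summarize(
    n_climbs = n(),
    across(members:hired_staff_deaths, sum),  # sum every count column at once
    first_climb = min(year),
    .groups = "drop"
  ) %>%
  mutate(
    pct_death = member_deaths / members,
    pct_hired_staff_death = hired_staff_deaths / hired_staff
  )
```

`across` saves writing one `sum()` call per count column.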
Use `inner_join` and `select` to join with the `peaks` dataset by `peak_id`.
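As a sketch of the join, assuming two small hypothetical tables that share a `peak_id` key (the extra `climbing_status` column is only there to show what `select` drops):

```r
library(dplyr)

# Hypothetical per-peak summary and peaks lookup table
peak_summary <- tibble(peak_id = c("P1", "P2", "P3"), n_climbs = c(2, 1, 4))
peaks <- tibble(
  peak_id = c("P1", "P2"),
  peak_name = c("Peak A", "Peak B"),
  height_metres = c(8100, 7200),
  climbing_status = c("Climbed", "Climbed")
)

# Keep only the lookup columns of interest, then join on the shared key
joined <- peak_summary %>%
  inner_join(peaks %>% select(peak_id, peak_name, height_metres), by = "peak_id")
```

Because this is an inner join, `P3` is dropped: it has no match in `peaks`.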
Touching on statistical noise and how it distorts the death rate for mountains with few climbs, and how to account for it using statistical methods including Beta Binomial Regression and Empirical Bayes.
Further description of Empirical Bayes and how it avoids overestimating the death rate for mountains with fewer climbers.
Recommended reading: Introduction to Empirical Bayes: Examples from Baseball Statistics by David Robinson
Use the `ebbr` package (Empirical Bayes for Binomial in R) to create an Empirical Bayes estimate for each mountain by fitting a prior distribution across the data and adjusting the death rates down or up based on that prior.
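To make the shrinkage idea concrete, here is a base-R sketch that does not call `ebbr` at all. It swaps in a simple method-of-moments fit of a Beta prior, which is an assumption of this sketch and not necessarily how `ebbr` fits its prior; the counts are made up:

```r
# Hypothetical death and climber counts per peak (illustrative only)
deaths  <- c(0, 1, 5, 20, 2)
members <- c(10, 15, 400, 1500, 30)
raw <- deaths / members

# Method-of-moments Beta(alpha0, beta0) prior fitted across all peaks
m <- mean(raw)
v <- var(raw)
common <- m * (1 - m) / v - 1
alpha0 <- m * common
beta0  <- (1 - m) * common

# Posterior mean: each raw rate is pulled toward the prior mean,
# and peaks with fewer climbers are pulled harder
shrunk <- (deaths + alpha0) / (members + alpha0 + beta0)
```

The peak with 0 deaths in 10 climbers moves up from a raw rate of zero, while the peak with 20 deaths in 1500 climbers barely moves.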
Use a `geom_point` chart to visualize the difference between the raw death rate and the new `ebbr` fitted death rate.
Use `geom_point` to visualize how deadly each mountain is, with `geom_errorbarh` showing the lower and upper bounds of the 95% credible interval.
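A rough sketch of where such an interval can come from, assuming a Beta prior with hypothetical parameters `alpha0` and `beta0` and made-up counts. The interval is read off the Beta posterior with `qbeta`, which may differ in detail from what `ebbr` reports:

```r
library(ggplot2)

# Hypothetical counts; the prior parameters below are assumed, not fitted
peaks <- data.frame(
  peak_name = c("Peak A", "Peak B", "Peak C"),
  deaths = c(1, 5, 20),
  members = c(15, 400, 1500)
)
alpha0 <- 1
beta0 <- 30

# Beta posterior for each peak's death rate
peaks$alpha1 <- peaks$deaths + alpha0
peaks$beta1 <- peaks$members - peaks$deaths + beta0
peaks$fitted <- peaks$alpha1 / (peaks$alpha1 + peaks$beta1)

# 95% credible interval: the 2.5% and 97.5% posterior quantiles
peaks$low <- qbeta(0.025, peaks$alpha1, peaks$beta1)
peaks$high <- qbeta(0.975, peaks$alpha1, peaks$beta1)

p <- ggplot(peaks, aes(fitted, reorder(peak_name, fitted))) +
  geom_point() +
  geom_errorbarh(aes(xmin = low, xmax = high), height = 0.2) +
  labs(x = "Estimated death rate", y = NULL)
```

Peaks with fewer climbers get visibly wider intervals.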
Use `geom_point` to visualize the relationship between death rate and height of mountain.
There is no clear relationship, but David briefly mentions how one could use Beta Binomial Regression to look further for possible relationships or trends.
Use `geom_histogram` and `geom_boxplot` to visualize the distribution of the time it took climbers to go from basecamp to the mountain’s high point, for successful climbs only.
Use `mutate` to calculate the number of days it took climbers to get from basecamp to the highpoint.
Add a column using `case_when` and `str_detect` to identify strings in `termination_reason` that contain the word `Success` and rename them to `Success`. Also shows how to use a vector with `%in%` to change multiple values in `termination_reason` to `NA` and the rest to `Failed`.
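The recoding might be sketched as follows. The `termination_reason` values and the contents of `na_reasons` are assumptions chosen for illustration; which reasons the screencast maps to `NA` is not stated here:

```r
library(dplyr)
library(stringr)

# Hypothetical sample of termination reasons
climbs <- tibble(
  termination_reason = c(
    "Success (main peak)", "Success (subpeak)", "Bad weather",
    "Unknown", "Did not attempt climb", "Illness"
  )
)

# Assumed set of reasons treated as missing rather than failed
na_reasons <- c("Unknown", "Did not attempt climb")

climbs <- climbs %>%
  mutate(success = case_when(
    str_detect(termination_reason, "Success") ~ "Success",
    termination_reason %in% na_reasons ~ NA_character_,
    TRUE ~ "Failed"
  ))
```

`case_when` evaluates its conditions in order, so the catch-all `TRUE ~ "Failed"` has to come last.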
Use `fct_lump` to show the top 10 mountains while lumping the other factor levels (mountains) into `other`.
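A small sketch of the lumping, using a made-up vector of peak names and keeping the 2 most frequent instead of 10. Note that `fct_lump` names the lumped level `"Other"` by default:

```r
library(forcats)

# Hypothetical climbs: one peak name per climb
peak <- c(rep("A", 5), rep("B", 3), "C", rep("D", 2))

# Keep the 2 most frequent peaks; everything else becomes "Other"
lumped <- fct_lump(peak, n = 2)
counts <- table(lumped)
```

Here `counts` has three entries: `A`, `B`, and `Other` (the combined `C` and `D` climbs).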
For just Mount Everest, use `geom_histogram` and `geom_density` with `fill = success` to visualize the days from basecamp to highpoint for climbs that ended in `success`, `failure` or `other`.
For just Mount Everest, use `geom_histogram` to see the distribution of climbs per year.
For just Mount Everest, use `geom_line` and `geom_point` to visualize `pct_death` over time by decade.
Use `mutate` with `pmax` and integer division to create a decade variable that lumps together the data for 1970 and before.
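A minimal sketch of the decade bucketing on a few made-up years, assuming the floor is set at 1970:

```r
library(dplyr)

# Hypothetical climb years
climbs <- tibble(year = c(1953, 1969, 1970, 1978, 1985, 2019))

climbs <- climbs %>%
  mutate(
    # %/% is integer division: 1985 %/% 10 is 198, so 10 * 198 is 1980
    # pmax() then raises any decade below 1970 up to 1970
    decade = pmax(10 * (year %/% 10), 1970)
  )
```

`pmax` works element-wise, unlike `max`, which would collapse the whole column to a single value.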
Write a function for summary statistics such as `n_climbs`, `pct_success`, `first_climb`, `pct_death`, and `pct_hired_staff_death`.
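One way such a helper could look. The function name `summarize_climbs` and the input columns `success`, `died`, `hired`, and `year` are hypothetical; the screencast's own function may be defined differently:

```r
library(dplyr)

# Hypothetical helper: works on a grouped or ungrouped table that has
# logical columns success, died, hired and a numeric column year
summarize_climbs <- function(tbl) {
  tbl %>%
    summarize(
      n_climbs = n(),
      pct_success = mean(success),
      first_climb = min(year),
      pct_death = mean(died),
      pct_hired_staff_death = mean(died[hired]),  # deaths among hired staff only
      .groups = "drop"
    )
}

# Made-up climber-level rows
members <- tibble(
  decade = c(1980, 1980, 1980, 1990, 1990),
  year = c(1982, 1985, 1988, 1991, 1996),
  success = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  died = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  hired = c(FALSE, TRUE, TRUE, TRUE, FALSE)
)

by_decade <- members %>%
  group_by(decade) %>%
  summarize_climbs()
```

Wrapping the summary in a function lets the same statistics be reused after any `group_by`, for example by peak or by decade.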
For just Mount Everest, use `geom_line` and `geom_point` to visualize `pct_success` over time by decade.
For just Mount Everest, use `geom_line` and `geom_point` to visualize `pct_hired_staff_deaths` over time by decade.
David decides to visualize the `pct_hired_staff_deaths` and `pct_death` charts together on the same plot.
For just Mount Everest, fit a logistic regression model to predict the probability of death, using `format.pval` to format the `p.value`.
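A sketch of the model fit on simulated data (the data below are generated for illustration and are not results from the climbing dataset; using `age` as the only predictor is also an assumption):

```r
set.seed(2020)

# Simulated climbers: death probability rises slightly with age
n <- 500
age <- round(runif(n, 18, 70))
died <- rbinom(n, 1, plogis(-4 + 0.04 * age))

# Logistic regression: binomial family with the default logit link
model <- glm(died ~ age, family = "binomial")

# Pull the p-values from the coefficient table and format them for display
coefs <- summary(model)$coefficients
p_values <- format.pval(coefs[, "Pr(>|z|)"], digits = 2)
```

`format.pval` only formats: very small p-values are printed in a compact form such as `<2e-16` rather than as long decimals. The p-values themselves come from the fitted model.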
Use `fct_lump` to lump together all `expedition_role` factor levels except for the n most frequent.
Use `group_by` with integer division and `summarize` to calculate `n_climbers` and `pct_death` for age bucketed into decades.
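The age bucketing can be sketched like this, with made-up climbers and an assumed logical `died` column:

```r
library(dplyr)

# Hypothetical climbers
members <- tibble(
  age = c(24, 29, 31, 38, 45, 52),
  died = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# group_by() can create the grouping column on the fly
by_age <- members %>%
  group_by(age_decade = 10 * (age %/% 10)) %>%
  summarize(
    n_climbers = n(),
    pct_death = mean(died),
    .groups = "drop"
  )
```

Ages 24 and 29 both land in the `20` bucket, 31 and 38 in the `30` bucket, and so on.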
Use `geom_point` and `geom_errorbarh` to visualize the logistic regression model with confidence intervals.
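A rough sketch of a coefficient plot on simulated data. The predictors `age` and `hired` are assumptions, and the intervals here are Wald intervals from base R's `confint.default`, which may not be how the screencast computes them:

```r
library(ggplot2)

set.seed(2020)

# Simulated climbers (illustrative only, not the real dataset)
n <- 500
sim <- data.frame(
  age = round(runif(n, 18, 70)),
  hired = rbinom(n, 1, 0.3)
)
sim$died <- rbinom(n, 1, plogis(-4 + 0.04 * sim$age + 0.3 * sim$hired))

model <- glm(died ~ age + hired, data = sim, family = "binomial")

# Wald 95% confidence intervals on the log-odds scale
bounds <- confint.default(model)
ci <- data.frame(
  term = names(coef(model)),
  estimate = unname(coef(model)),
  low = bounds[, 1],
  high = bounds[, 2]
)
ci <- ci[ci$term != "(Intercept)", ]

p <- ggplot(ci, aes(estimate, term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = low, xmax = high), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Estimate (log-odds)", y = NULL)
```

An interval that crosses the dashed line at zero means the data are consistent with that term having no effect.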
Summary of screencast