Himalayan Climbers

Data Manipulation, Empirical Bayes, Logistic Regression Model

Published: September 21, 2020

Notable topics: Data Manipulation, Empirical Bayes, Logistic Regression Model

Recorded on: 2020-09-21

Timestamps by: Eric Fletcher


Screencast

Timestamps

Functions: ggplot, fct_reorder

Packages: ggplot, forcats

Create a geom_col chart to visualize the top 50 tallest mountains.

Use fct_reorder to reorder the peak_name factor levels by sorting along the height_metres variable.

Functions: summarize, across, arrange, mutate, inner_join

Packages: dplyr

Use summarize with across to get the total number of climbs, climbers, deaths, and first year climbed.

Use mutate to calculate the percent death rate for members and hired staff.

Use inner_join and select to join with peaks dataset by peak_id.

A discussion of statistical noise: death rates for mountains with few climbs are unreliable, and statistical methods such as Beta Binomial Regression and Empirical Bayes can account for this.

Further description of Empirical Bayes and how it avoids overestimating the death rate for mountains with few climbers.

Recommended reading: Introduction to Empirical Bayes: Examples from Baseball Statistics by David Robinson

Functions: add_ebb_estimate, geom_point, geom_abline

Packages: ebbr, ggplot

Use the ebbr package (Empirical Bayes for Binomial in R) to create an Empirical Bayes estimate for each mountain: fit a prior distribution across all mountains, then adjust each raw death rate up or down toward that prior.
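The screencast does this in R with ebbr. As a rough illustration of the shrinkage arithmetic only, here is a minimal Python sketch. It is not the ebbr implementation: it assumes a Beta prior fitted by the method of moments (ebbr's own fitting may differ), the counts are made up, and the names fit_beta_prior and shrink are hypothetical.

```python
from statistics import mean, pvariance


def fit_beta_prior(rates):
    """Method-of-moments fit of a Beta(alpha, beta) prior to observed rates.

    Assumption: this is a simplification for illustration, not ebbr's method.
    """
    m = mean(rates)
    v = pvariance(rates)
    # For Beta(a, b): mean = a / (a + b), variance = m * (1 - m) / (a + b + 1)
    total = m * (1 - m) / v - 1  # this is alpha + beta
    return m * total, (1 - m) * total


def shrink(deaths, climbers, alpha, beta):
    """Posterior mean of the death rate under a Beta(alpha, beta) prior."""
    return (deaths + alpha) / (climbers + alpha + beta)


# Hypothetical (deaths, climbers) counts, not taken from the real dataset
peaks = {
    "Peak A": (1, 20),
    "Peak B": (30, 2000),
    "Peak C": (10, 400),
    "Peak D": (3, 300),
    "Peak E": (8, 200),
}

raw = {name: d / n for name, (d, n) in peaks.items()}
alpha, beta = fit_beta_prior(list(raw.values()))
prior_mean = alpha / (alpha + beta)
shrunk = {name: shrink(d, n, alpha, beta) for name, (d, n) in peaks.items()}

for name in peaks:
    print(f"{name}: raw={raw[name]:.4f} shrunk={shrunk[name]:.4f}")
```

In this sketch the peak with only 20 climbers moves much further toward the prior mean than the peak with 2000 climbers, which is the behaviour the screencast describes.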

Use a geom_point chart to visualize the difference between the raw death rate and new ebbr fitted death rate.

Functions: ggplot, fct_reorder, geom_errorbarh

Packages: ggplot, forcats

Use geom_point to visualize how deadly each mountain is, with geom_errorbarh showing the 95% credible interval from its lower to its upper bound.

Functions: geom_point

Packages: ggplot, forcats

Use geom_point to visualize the relationship between death rate and height of mountain.

There is not a clear relationship, but David does briefly mention how one could use Beta Binomial Regression to further inspect for possible relationships / trends.

Functions: mutate, case_when, str_detect, fct_lump, fct_reorder

Packages: dplyr, stringr, forcats

Use geom_histogram and geom_boxplot to visualize the distribution of time it took climbers to go from basecamp to the mountain’s high point for successful climbs only.

Use mutate to calculate the number of days it took climbers to get from basecamp to the highpoint.

Add a column using case_when and str_detect: values of termination_reason that contain the word Success are renamed to Success, a vector with %in% changes several other values to NA, and the rest become Failed.

Use fct_lump to show the top 10 mountains while lumping the other factor levels (mountains) into Other.
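The case_when / str_detect recoding described above is done in R in the screencast. The logic can be sketched in Python as follows. The function name classify, the NA_REASONS set, and the example strings are all hypothetical stand-ins, not values confirmed by the source.

```python
# Hypothetical set of reasons that should become missing (NA in R, None here)
NA_REASONS = {"Unknown", "Attempt rumoured"}


def classify(termination_reason):
    """Recode a termination reason as 'Success', None, or 'Failed'."""
    if "Success" in termination_reason:   # plays the role of str_detect
        return "Success"
    if termination_reason in NA_REASONS:  # plays the role of %in%
        return None
    return "Failed"                       # the catch-all branch of case_when


# Illustrative values only
examples = ["Success (main peak)", "Unknown", "Bad weather", "Success (subpeak)"]
print([classify(r) for r in examples])
```

The order of the checks matters in the same way it does in case_when: the first matching condition wins.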

Functions: geom_histogram, geom_density

Packages: ggplot

For just Mount Everest, use geom_histogram and geom_density with fill = success to visualize the days from basecamp to highpoint for climbs that ended in success, failure or other.

Functions: geom_histogram

Packages: ggplot

For just Mount Everest, use geom_histogram to see the distribution of climbs per year.

Functions: mutate, pmax, geom_line, geom_point

Packages: ggplot, base, dplyr

For just Mount Everest, use geom_line and geom_point to visualize pct_death over time by decade.

Use mutate with pmax and integer division to create a decade variable that lumps together the data for 1970 and before.
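The decade trick is plain arithmetic, so it translates directly out of R. A minimal Python sketch, assuming the floor is the 1970 bucket (the function name decade and the floor parameter are hypothetical; Python's max stands in for R's vectorized pmax):

```python
def decade(year, floor=1970):
    """Round a year down to its decade, lumping earlier decades into `floor`."""
    # Integer division drops the last digit: 1985 // 10 * 10 == 1980
    return max(10 * (year // 10), floor)


for y in (1953, 1979, 1985, 2019):
    print(y, decade(y))
```

With this floor, every climb before 1980 lands in the 1970 bucket, so sparse early years are pooled into one point on the chart.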

Functions: function

Write a function for summary statistics such as n_climbs, pct_success, first_climb, pct_death, and pct_hired_staff_death.

Functions: mutate, pmax, geom_line, geom_point

Packages: ggplot, base, dplyr

For just Mount Everest, use geom_line and geom_point to visualize pct_success over time by decade.

Functions: mutate, pmax, geom_line, geom_point

Packages: ggplot, base, dplyr

For just Mount Everest, use geom_line and geom_point to visualize pct_hired_staff_deaths over time by decade.

David decides to visualize the pct_hired_staff_deaths and pct_death charts together on the same plot.

Functions: fct_lump, glm, format.pval

Packages: forcats, stats, broom, base

For just Mount Everest, fit a logistic regression model to predict the probability of death, using format.pval to format the p.value.

Use fct_lump to lump together all expedition_role factors except for the n most frequent.
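For readers unfamiliar with fct_lump, the idea is to keep the n most frequent levels and relabel everything else. A minimal Python sketch of that idea, using made-up role names; lump_top_n is a hypothetical helper, and its tie handling (first seen wins) may differ from what forcats does:

```python
from collections import Counter


def lump_top_n(values, n, other="Other"):
    """Keep the n most frequent values and replace the rest with `other`."""
    keep = {level for level, _ in Counter(values).most_common(n)}
    return [v if v in keep else other for v in values]


# Hypothetical expedition roles, not taken from the real dataset
roles = ["Climber"] * 5 + ["H-A Worker"] * 3 + ["Leader"] * 2 + ["Doctor", "Cook"]
lumped = lump_top_n(roles, 3)
print(Counter(lumped))
```

Lumping rare roles this way keeps the regression from estimating a separate coefficient for categories with only a handful of observations.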

Functions: group_by, summarize

Packages: dplyr

Use group_by with integer division and summarize to calculate n_climbers and pct_death for age bucketed into decades.
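The same group-and-summarize step can be sketched in Python. The records below are made up, summarize_by_age_decade is a hypothetical name, and pct_death is returned as a fraction between 0 and 1:

```python
from collections import defaultdict

# Hypothetical (age, died) records; the real data has one row per expedition member
members = [
    (24, False), (27, True), (31, False), (35, False),
    (38, False), (42, True), (45, False), (51, False),
]


def summarize_by_age_decade(rows):
    """Bucket ages into decades, then count climbers and compute the death rate."""
    groups = defaultdict(lambda: [0, 0])  # decade -> [n_climbers, n_deaths]
    for age, died in rows:
        bucket = 10 * (age // 10)  # integer division, like %/% in R
        groups[bucket][0] += 1
        groups[bucket][1] += int(died)
    return {
        bucket: {"n_climbers": n, "pct_death": deaths / n}
        for bucket, (n, deaths) in sorted(groups.items())
    }


summary = summarize_by_age_decade(members)
for bucket, stats in summary.items():
    print(bucket, stats)
```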

Functions: geom_point, geom_errorbarh, conf.int

Packages: ggplot, broom

Use geom_point and geom_errorbarh to visualize the logistic regression model with confidence intervals.

Summary of screencast

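Timestamps

0:03:00 Top 50 tallest mountains with geom_col and fct_reorder

0:08:50 Climbs, climbers, deaths, and first year climbed; join with peaks

0:11:20 Statistical noise in death rates for mountains with few climbs

0:14:30 Further description of Empirical Bayes

0:17:00 Empirical Bayes estimates with ebbr

0:21:20 How deadly each mountain is, with 95% credible intervals

0:26:35 Death rate versus height of mountain

0:28:00 Days from basecamp to highpoint for successful climbs

0:35:30 Mount Everest: days from basecamp to highpoint by outcome

0:38:40 Mount Everest: climbs per year

0:39:55 Mount Everest: pct_death by decade

0:41:30 Function for summary statistics

0:46:20 Mount Everest: pct_success by decade

0:47:10 Mount Everest: pct_hired_staff_deaths by decade

0:50:45 Mount Everest: logistic regression model for probability of death

0:56:30 n_climbers and pct_death by age decade

0:59:45 Logistic regression model with confidence intervals

1:03:30 Summary of screencast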