Beach Volleyball
Notable topics: Data cleaning, Logistic regression
Recorded on: 2020-05-18
Timestamps by: Eric Fletcher
Screencast
Timestamps
Use pivot_longer from the tidyr package to pivot the data set from wide to long.
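As a rough illustration of the wide-to-long step, here is a minimal sketch. The data frame matches_wide and its values are made up for this example and are not the screencast's data.

```r
library(tidyr)

# Hypothetical wide data: one row per match, one column per player slot.
matches_wide <- data.frame(
  match_id  = 1:2,
  w_player1 = c("A", "C"),
  l_player1 = c("B", "D")
)

# Gather the player columns into name/value pairs (wide to long).
matches_long <- pivot_longer(
  matches_wide,
  cols      = c(w_player1, l_player1),
  names_to  = "name",
  values_to = "value"
)
```

Each match now contributes one row per player column, with the old column name stored in name.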
Use mutate_at from the dplyr package with starts_with to change the class to character for all columns that start with w_ and l_.
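A small sketch of that conversion, assuming a toy data frame (stats and its columns are hypothetical names chosen for illustration):

```r
library(dplyr)

# Hypothetical data: two prefixed columns and one that should stay numeric.
stats <- data.frame(w_rank = c(1, 2), l_rank = c(3, 4), year = c(2018, 2019))

# Convert every column whose name starts with w_ or l_ to character.
stats_chr <- stats %>%
  mutate_at(vars(starts_with("w_"), starts_with("l_")), as.character)
```

Having all the w_ and l_ columns share one class lets them be pivoted into a single value column later.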
Use separate from the tidyr package to separate the name variable into three columns with extra = "merge" and fill = "right".
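The effect of the extra and fill arguments is easier to see on a few sample names. This is a sketch only: the data frame long_stats, the sample values, and the output column name stat are assumptions made for the example.

```r
library(tidyr)

# Hypothetical long-format names with a varying number of "_" pieces.
long_stats <- data.frame(
  name  = c("w_player1_tot_attacks", "l_p1_age", "w_rank"),
  value = c("10", "25", "3")
)

# Split into three columns. extra = "merge" keeps surplus pieces together
# in the last column; fill = "right" pads short names with NA on the right.
parts <- separate(
  long_stats, name,
  into  = c("winner_loser", "player", "stat"),
  sep   = "_",
  extra = "merge",
  fill  = "right"
)
```

With these inputs, the first name keeps tot_attacks intact in the third column, while w_rank has only two pieces and so gets NA in the third.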
Use rename from the dplyr package to rename w_player1, w_player2, l_player1, and l_player2.
Use pivot_wider from the tidyr package to pivot the name variable from long to wide.
Use str_to_upper from the stringr package to convert the winner_loser values w and l to uppercase.
Add unique row numbers for each match using mutate with row_number from the dplyr package.
Separate the score values into multiple rows using separate_rows from the tidyr package.
Use separate from the tidyr package to split the actual scores into two columns, one for the winner's score w_score and another for the loser's score l_score.
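The two score-splitting steps can be sketched together as follows. The score format (sets separated by ", " and points by "-") and the matches data frame are assumptions for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical matches, each with a comma-separated list of set scores.
matches <- data.frame(
  match_id = 1:2,
  score    = c("21-18, 21-12", "21-19, 18-21, 15-10")
)

sets <- matches %>%
  # One row per set.
  separate_rows(score, sep = ", ") %>%
  # Split each set into winner and loser points, converted to integers.
  separate(score, c("w_score", "l_score"), sep = "-", convert = TRUE)
```

Because match_id is carried along, each set row can still be traced back to its match.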
Use na_if from the dplyr package to change the Forfeit or other value in the score variable to NA.
Use str_remove from the stringr package to remove retired from scores that include it.
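A minimal sketch of these two clean-up steps on a plain character vector. The sample scores and the exact " retired" pattern (with a leading space) are assumptions.

```r
library(dplyr)
library(stringr)

# Hypothetical raw score strings.
score <- c("21-18, 21-12", "Forfeit or other", "21-18, 14-7 retired")

# Replace the placeholder value with NA, then strip the "retired" suffix.
score_clean <- score %>%
  na_if("Forfeit or other") %>%
  str_remove(" retired")
```

Note that str_remove edits the string rather than dropping the row, so the retired match keeps the points that were recorded.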
Determine how many times the winner's score w_score is greater than the loser's score l_score at least 1/3 of the time.
Use summarize from the dplyr package to create summary statistics including the number of matches, winning percentage, date of first match, and date of most recent match.
Use type_convert from the readr package to convert character class variables to numeric.
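For illustration, a short sketch of type_convert on a toy data frame (player_stats and its columns are hypothetical):

```r
library(readr)

# Hypothetical data where numbers were stored as text.
player_stats <- data.frame(
  player    = c("A", "B"),
  tot_aces  = c("3", "5"),
  tot_kills = c("10", "12"),
  stringsAsFactors = FALSE
)

# Re-guess the type of every character column; numeric-looking text
# becomes numeric, while genuine text such as player stays character.
converted <- type_convert(player_stats)
```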
Use summarize_all from the dplyr package to calculate which fraction of the data is not NA.
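The fraction of non-missing values per column is just the mean of a logical vector. A minimal sketch, with a made-up data frame stats_na:

```r
library(dplyr)

# Hypothetical stats with some missing values.
stats_na <- data.frame(
  tot_aces  = c(1, NA, 3, NA),
  tot_kills = c(5, 6, 7, NA)
)

# For each column, the share of rows that are not NA.
coverage <- stats_na %>%
  summarize_all(~ mean(!is.na(.)))
```

Here half of tot_aces and three quarters of tot_kills are present.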
Use summarize from the dplyr package to determine each player's number of matches, winning percentage, average attacks, average errors, average kills, average aces, average serve errors, and total rows with data for years prior to 2019.
The summary statistics are then used to explore how we could predict whether a player will win in 2019, using geom_point and logistic regression. Initially, David wanted to predict performance based on players' first-year performance. (NOTE - David mistakenly grouped by year and age. He catches this around 1:02:00.)
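To make the logistic regression step concrete, here is a minimal base R sketch on simulated data. It is not the model from the screencast: the predictor avg_kills, the outcome won, and the simulated relationship are all assumptions for this example.

```r
set.seed(2020)

# Simulated players: a pre-2019 average and a 0/1 outcome for a 2019 match.
players <- data.frame(avg_kills = runif(200, 0, 1))
players$won <- rbinom(200, 1, plogis(-1 + 2 * players$avg_kills))

# Logistic regression: log-odds of winning as a linear function of avg_kills.
model <- glm(won ~ avg_kills, data = players, family = "binomial")

# Predicted win probability for a hypothetical player with avg_kills = 0.5.
p_win <- predict(model, newdata = data.frame(avg_kills = 0.5), type = "response")
```

Using type = "response" returns a probability between 0 and 1 rather than the log-odds.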
Use year from the lubridate package within a group_by to determine the age for each player given their birthdate.
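One way to sketch the age calculation, assuming age is taken as the difference between the match year and the birth year. The appearances data frame and its column names are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical player appearances with match dates and birthdates.
appearances <- data.frame(
  player    = c("A", "A", "B"),
  date      = as.Date(c("2018-06-01", "2019-06-01", "2019-06-01")),
  birthdate = as.Date(c("1990-03-15", "1990-03-15", "1995-10-02"))
)

# Compute year and age inside group_by, then count rows per group.
by_age <- appearances %>%
  group_by(player,
           year = year(date),
           age  = year(date) - year(birthdate)) %>%
  summarize(n_rows = n()) %>%
  ungroup()
```

This year-difference ignores whether the birthday has passed yet in the match year, so it can be off by one compared with an exact age.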
Turn the summary statistics at timestamp 42:00 into a . %>% (dot-pipe) function.
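Starting a pipeline with a dot creates a reusable function. A minimal sketch of the idea, with a hypothetical summarize_players function and toy data rather than the screencast's full set of statistics:

```r
library(dplyr)

# A pipeline that starts with . becomes a function of its input.
summarize_players <- . %>%
  summarize(
    n_matches  = n(),
    pct_winner = mean(winner_loser == "W")
  )

# Hypothetical results: one row per player per match.
results <- data.frame(
  player       = c("A", "A", "B", "B"),
  winner_loser = c("W", "L", "W", "W")
)

# Reuse the same summary on any grouping.
player_summary <- results %>%
  group_by(player) %>%
  summarize_players()
```

The same summarize_players can then be applied after a different group_by without repeating the summary code.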
Summary of screencast.