Beyonce and Taylor Swift Lyrics
Notable topics: Text analysis, tf_idf, Log odds ratio, Diverging bar graph, Lollipop graph
Recorded on: 2020-09-28
Timestamps by: Eric Fletcher
Screencast
Timestamps
Use fct_reorder from the forcats package to reorder the title factor levels by the sales variable in a geom_col plot.
Use labels = dollar from the scales package to format the geom_col x-axis values as currency.
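A minimal sketch of the two plotting steps above. The album names and sales figures are placeholders invented for illustration, not data from the screencast:

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(scales)

# Placeholder data invented for illustration
album_sales <- data.frame(
  title = c("Album A", "Album B", "Album C"),
  sales = c(1200000, 4500000, 2300000),
  stringsAsFactors = FALSE
)

# Reorder the title levels so the bars are sorted by sales
ordered_sales <- album_sales %>%
  mutate(title = fct_reorder(title, sales))

# Horizontal bars with the x-axis formatted as currency
sales_plot <- ggplot(ordered_sales, aes(sales, title)) +
  geom_col() +
  scale_x_continuous(labels = dollar)
```

Printing sales_plot draws the chart; without fct_reorder the bars would appear in alphabetical order of title.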
Use rename_all(str_to_lower) to convert variable names to lowercase.
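As a small illustration of the renaming step, assuming a data frame with mixed-case column names (the columns here are placeholders):

```r
library(dplyr)
library(stringr)

# Placeholder columns with mixed-case names
raw <- data.frame(Artist = "a", Title = "b", Lyrics = "c",
                  stringsAsFactors = FALSE)

# Apply str_to_lower to every column name
lowered <- raw %>%
  rename_all(str_to_lower)

names(lowered)
```

rename_all has been superseded since dplyr 1.0.0; rename_with(str_to_lower) is the current equivalent.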
Use unnest_tokens from the tidytext package to split the lyrics into one word per row.
Use anti_join from the dplyr package with the stop_words dataset from the tidytext package to find the most common words in the lyrics, excluding stop words.
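The tokenizing and stop-word steps above can be sketched as follows. The two "songs" are made-up placeholder text, not real lyrics, and the column names are assumptions:

```r
library(dplyr)
library(tidytext)

# Made-up placeholder text, not real lyrics
songs <- data.frame(
  title = c("Song One", "Song Two"),
  lyrics = c("The rain is falling on the street", "Dancing in the rain"),
  stringsAsFactors = FALSE
)

# One row per word; unnest_tokens lowercases by default
words <- songs %>%
  unnest_tokens(word, lyrics)

# Drop stop words, then count what remains
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```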
Use bind_tf_idf from the tidytext package to determine tf (the proportion of each album's words accounted for by a given word) and idf (how rare the word is across albums); their product, tf_idf, indicates how specific each word is to a particular album.
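A small worked example of bind_tf_idf, using invented counts for two placeholder albums:

```r
library(dplyr)
library(tidytext)

# Invented word counts per album
album_words <- data.frame(
  album = c("A", "A", "B", "B"),
  word  = c("love", "rain", "love", "fire"),
  n     = c(3, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Adds tf, idf, and tf_idf columns (arguments: term, document, count)
album_tf_idf <- album_words %>%
  bind_tf_idf(word, album, n)
```

Here "love" appears in both albums, so its idf is log(2 / 2) = 0 and its tf_idf is 0, while "rain" appears only in album A and gets idf = log(2 / 1).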
Use reorder_within with scale_y_reordered to reorder the bars within each facet panel. David replaces top_n with slice_max from the dplyr package to show the top 10 words with with_ties = FALSE.
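A sketch of the faceted plot described above, on invented scores. It keeps the top 2 words per album rather than 10 only because the placeholder data is tiny:

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Invented tf_idf scores, including ties within each album
word_scores <- data.frame(
  album  = c("A", "A", "A", "B", "B", "B"),
  word   = c("rain", "fire", "love", "fire", "love", "rain"),
  tf_idf = c(0.5, 0.3, 0.3, 0.4, 0.4, 0.1),
  stringsAsFactors = FALSE
)

# with_ties = FALSE guarantees exactly n rows per group
top_words <- word_scores %>%
  group_by(album) %>%
  slice_max(tf_idf, n = 2, with_ties = FALSE) %>%
  ungroup()

# reorder_within sorts words separately inside each facet;
# scale_y_reordered strips the suffix it adds to the labels
facet_plot <- top_words %>%
  mutate(word = reorder_within(word, tf_idf, album)) %>%
  ggplot(aes(tf_idf, word)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ album, scales = "free_y")
```

With the default with_ties = TRUE, album A would return three rows because two words share the score 0.3.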
Use bind_log_odds from the tidylo package to calculate the weighted log odds of each word within each album, that is, how much more common the word is in a specific album than across all the other albums.
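A minimal sketch of bind_log_odds on invented counts, assuming a tidylo version whose output column is named log_odds_weighted (the name used in these notes):

```r
library(dplyr)
library(tidylo)

# Invented counts: "love" dominates album A, "rain" dominates album B
album_words <- data.frame(
  album = c("A", "A", "B", "B"),
  word  = c("love", "rain", "love", "rain"),
  n     = c(10, 1, 1, 10),
  stringsAsFactors = FALSE
)

# Arguments are the set (album), the feature (word), and the count column
album_log_odds <- album_words %>%
  bind_log_odds(album, word, n)
```

A positive value means the word is over-represented in that album relative to the others; a negative value means it is under-represented.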
Use filter(str_length(word) <= 3) to come up with a list of common filler words to remove, such as ah, uh, ha, ey, eeh, and huh.
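One way to sketch this step: inspect the short words first, then drop a hand-picked list. The word column below is placeholder data, and removing the fillers with %in% is an assumption about how the list is applied:

```r
library(dplyr)
library(stringr)

# Placeholder tokens
words <- data.frame(
  word = c("uh", "love", "ah", "huh", "story", "uh"),
  stringsAsFactors = FALSE
)

# Inspect short words to spot fillers
short_words <- words %>%
  filter(str_length(word) <= 3) %>%
  count(word, sort = TRUE)

# Filler list taken from the notes above
filler <- c("ah", "uh", "ha", "ey", "eeh", "huh")

cleaned <- words %>%
  filter(!word %in% filler)
```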
Use mdy from the lubridate package and str_remove(released, " \\(.*\\)") from the stringr package to parse the dates in the released variable.
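A short sketch of the date parsing, assuming the released values look like a month-day-year date optionally followed by a parenthesized note (the strings below are illustrative):

```r
library(lubridate)
library(stringr)

# Illustrative values; the parenthesized suffix is an assumed format
released <- c("October 24, 2006 (US)", "July 24, 2020")

# Strip " (...)" and parse what is left as month-day-year
parsed <- mdy(str_remove(released, " \\(.*\\)"))
```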
Use inner_join from the dplyr package to join taylor_swift_words with release_dates.
David ends up having to use fct_recode since the albums reputation and folklore were not lowercase in a previous table, which excluded them from the inner_join.
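The mismatch and its fix can be sketched like this. The table contents are placeholders, and recoding the release_dates side (rather than the words side) is an assumption:

```r
library(dplyr)
library(forcats)

# Placeholder tables; album capitalization differs between them
taylor_swift_words <- data.frame(
  album = c("reputation", "folklore", "Lover"),
  word  = c("rain", "fire", "love"),
  stringsAsFactors = FALSE
)
release_dates <- data.frame(
  album    = c("Reputation", "Folklore", "Lover"),
  released = as.Date(c("2017-11-10", "2020-07-24", "2019-08-23")),
  stringsAsFactors = FALSE
)

# Only "Lover" matches before the fix
unmatched <- inner_join(taylor_swift_words, release_dates, by = "album")

# fct_recode takes new = "old"; convert back to character for the join
release_dates_fixed <- release_dates %>%
  mutate(album = as.character(
    fct_recode(album, reputation = "Reputation", folklore = "Folklore")
  ))

joined <- inner_join(taylor_swift_words, release_dates_fixed, by = "album")
```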
Use fct_reorder from the forcats package to reorder the album factor levels by the released variable, for use in the faceted geom_col.
Use bind_rows from the dplyr package to bind ts with beyonce, then unnest_tokens from the tidytext package to get one word per row per artist.
Use bind_log_odds to figure out which words are more likely to come from a Taylor Swift song or a Beyonce song.
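A sketch of the combined pipeline for the two artists. The text is made-up placeholder content, not real lyrics, and the column names are assumptions:

```r
library(dplyr)
library(tidytext)
library(tidylo)

# Made-up placeholder text, not real lyrics
ts <- data.frame(artist = "Taylor Swift", lyrics = "rain rain rain fire",
                 stringsAsFactors = FALSE)
beyonce <- data.frame(artist = "Beyonce", lyrics = "fire fire fire rain",
                      stringsAsFactors = FALSE)

# Stack both artists, tokenize, and count words per artist
artist_words <- bind_rows(ts, beyonce) %>%
  unnest_tokens(word, lyrics) %>%
  count(artist, word)

# Weighted log odds of each word for each artist
artist_log_odds <- artist_words %>%
  bind_log_odds(artist, word, n)
```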
Use slice_max from the dplyr package to select the top 100 words by num_words_total and then the top 25 by log_odds_weighted. Results are used to create a diverging bar chart showing which words are most characteristic of Beyonce versus Taylor Swift songs.
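A sketch of the selection and the diverging bar chart, on invented values. Ranking by the absolute value of log_odds_weighted, and treating positive values as Taylor Swift, are both assumptions of this sketch:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Invented values for illustration
word_odds <- data.frame(
  word = c("rain", "fire", "love", "night", "road", "gold"),
  num_words_total = c(40, 25, 90, 15, 30, 55),
  log_odds_weighted = c(2.1, -1.4, 0.3, -2.6, 1.2, -0.8),
  stringsAsFactors = FALSE
)

top_words <- word_odds %>%
  # Most frequent words overall, then the most lopsided among them
  slice_max(num_words_total, n = 100, with_ties = FALSE) %>%
  slice_max(abs(log_odds_weighted), n = 25, with_ties = FALSE) %>%
  mutate(
    word = fct_reorder(word, log_odds_weighted),
    # Assumed sign convention: positive means Taylor Swift
    direction = ifelse(log_odds_weighted > 0, "Taylor Swift", "Beyonce")
  )

# Bars extend left or right of zero depending on the artist
diverging_plot <- ggplot(top_words,
                         aes(log_odds_weighted, word, fill = direction)) +
  geom_col()
```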
Use scale_x_continuous to make the log_odds_weighted scale more interpretable.
Take the previous plot and turn it into a lollipop graph with geom_point(aes(size = num_words_total, color = direction)).
Use ifelse to change the 1x value on the x-axis to same.
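A sketch of the lollipop graph with relabeled axis breaks. It assumes the plotted value is a base-2 log ratio so that breaks can be read as multiples; that assumption, the ratio_label helper, the geom_segment stems, and the data are all hypothetical and not taken from the screencast:

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical helper: show base-2 log ratios as "2x", "4x", ...,
# and label the zero point (a 1x ratio) as "same"
ratio_label <- function(log2_ratio) {
  multiple <- 2^abs(log2_ratio)
  ifelse(log2_ratio == 0, "same", paste0(multiple, "x"))
}

# Invented values for illustration
word_ratios <- data.frame(
  word = c("rain", "fire", "love", "night"),
  num_words_total = c(40, 25, 90, 15),
  log2_ratio = c(2.1, -1.4, 0.3, -2.6),
  stringsAsFactors = FALSE
) %>%
  mutate(
    word = fct_reorder(word, log2_ratio),
    direction = ifelse(log2_ratio > 0, "Taylor Swift", "Beyonce")
  )

# Thin stems from zero plus sized, colored points form the lollipops
lollipop <- ggplot(word_ratios, aes(log2_ratio, word)) +
  geom_segment(aes(x = 0, xend = log2_ratio, yend = word)) +
  geom_point(aes(size = num_words_total, color = direction)) +
  scale_x_continuous(breaks = -3:3, labels = ratio_label)
```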
Create a scatter plot with geom_point and geom_abline to show the most popular words they use in common.
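A sketch of such a scatter plot, assuming one row per word with each artist's usage frequency in its own column. The column names, the log scales, the text labels, and the numbers are all assumptions:

```r
library(ggplot2)

# Invented per-artist word frequencies
word_freq <- data.frame(
  word = c("love", "night", "rain", "fire"),
  taylor_swift = c(0.020, 0.008, 0.006, 0.001),
  beyonce = c(0.025, 0.007, 0.001, 0.005),
  stringsAsFactors = FALSE
)

# geom_abline defaults to slope 1, intercept 0: words on the line
# are used equally often by both artists
common_plot <- ggplot(word_freq, aes(taylor_swift, beyonce)) +
  geom_abline(color = "red") +
  geom_point() +
  geom_text(aes(label = word), vjust = 1, hjust = 1) +
  scale_x_log10() +
  scale_y_log10()
```

Words toward the upper right are popular with both artists; distance from the line shows how lopsided the usage is.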
Summary of screencast.