Tidy Tuesday Exercise

Loading Libraries

library(ggplot2) #Loading some libraries I may use for this exercise
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.4      ✔ forcats 0.5.2 
✔ purrr   1.0.1      
Warning: package 'tidyr' was built under R version 4.2.3
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggthemes)
library(dplyr)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(tibble)

“Getting the Data” Manually

age_gaps <-  readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv') #Here I manually read in the csv used for this week's TidyTuesday
Rows: 1155 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): movie_name, director, actor_1_name, actor_2_name, character_1_gend...
dbl  (5): release_year, age_difference, couple_number, actor_1_age, actor_2_age
date (2): actor_1_birthdate, actor_2_birthdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(age_gaps)
spc_tbl_ [1,155 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ movie_name        : chr [1:1155] "Harold and Maude" "Venus" "The Quiet American" "The Big Lebowski" ...
 $ release_year      : num [1:1155] 1971 2006 2002 1998 2010 ...
 $ director          : chr [1:1155] "Hal Ashby" "Roger Michell" "Phillip Noyce" "Joel Coen" ...
 $ age_difference    : num [1:1155] 52 50 49 45 43 42 40 39 38 38 ...
 $ couple_number     : num [1:1155] 1 1 1 1 1 1 1 1 1 1 ...
 $ actor_1_name      : chr [1:1155] "Ruth Gordon" "Peter O'Toole" "Michael Caine" "David Huddleston" ...
 $ actor_2_name      : chr [1:1155] "Bud Cort" "Jodie Whittaker" "Do Thi Hai Yen" "Tara Reid" ...
 $ character_1_gender: chr [1:1155] "woman" "man" "man" "man" ...
 $ character_2_gender: chr [1:1155] "man" "woman" "woman" "woman" ...
 $ actor_1_birthdate : Date[1:1155], format: "1896-10-30" "1932-08-02" ...
 $ actor_2_birthdate : Date[1:1155], format: "1948-03-29" "1982-06-03" ...
 $ actor_1_age       : num [1:1155] 75 74 69 68 81 59 62 69 57 77 ...
 $ actor_2_age       : num [1:1155] 23 24 20 23 38 17 22 30 19 39 ...
 - attr(*, "spec")=
  .. cols(
  ..   movie_name = col_character(),
  ..   release_year = col_double(),
  ..   director = col_character(),
  ..   age_difference = col_double(),
  ..   couple_number = col_double(),
  ..   actor_1_name = col_character(),
  ..   actor_2_name = col_character(),
  ..   character_1_gender = col_character(),
  ..   character_2_gender = col_character(),
  ..   actor_1_birthdate = col_date(format = ""),
  ..   actor_2_birthdate = col_date(format = ""),
  ..   actor_1_age = col_double(),
  ..   actor_2_age = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
glimpse(age_gaps)#Gives me a snapshot of the columns in the df
Rows: 1,155
Columns: 13
$ movie_name         <chr> "Harold and Maude", "Venus", "The Quiet American", …
$ release_year       <dbl> 1971, 2006, 2002, 1998, 2010, 1992, 2009, 1999, 199…
$ director           <chr> "Hal Ashby", "Roger Michell", "Phillip Noyce", "Joe…
$ age_difference     <dbl> 52, 50, 49, 45, 43, 42, 40, 39, 38, 38, 36, 36, 35,…
$ couple_number      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ actor_1_name       <chr> "Ruth Gordon", "Peter O'Toole", "Michael Caine", "D…
$ actor_2_name       <chr> "Bud Cort", "Jodie Whittaker", "Do Thi Hai Yen", "T…
$ character_1_gender <chr> "woman", "man", "man", "man", "man", "man", "man", …
$ character_2_gender <chr> "man", "woman", "woman", "woman", "man", "woman", "…
$ actor_1_birthdate  <date> 1896-10-30, 1932-08-02, 1933-03-14, 1930-09-17, 19…
$ actor_2_birthdate  <date> 1948-03-29, 1982-06-03, 1982-10-01, 1975-11-08, 19…
$ actor_1_age        <dbl> 75, 74, 69, 68, 81, 59, 62, 69, 57, 77, 59, 56, 65,…
$ actor_2_age        <dbl> 23, 24, 20, 23, 38, 17, 22, 30, 19, 39, 23, 20, 30,…
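
The column specification message printed by read_csv() above can be avoided by declaring the column types up front. The following is a minimal sketch, assuming readr 2.x; it parses a two-row inline sample (values copied from the output above) rather than the full file, so it runs without a network connection. The names sample_csv and sample_gaps are illustrative only.

```r
library(readr)

# Two rows copied from the output above, inlined so the example runs offline
sample_csv <- I("movie_name,release_year,age_difference
Harold and Maude,1971,52
Venus,2006,50")

# Declaring col_types up front also suppresses the column specification message
sample_gaps <- read_csv(
  sample_csv,
  col_types = cols(
    movie_name = col_character(),
    release_year = col_integer(),
    age_difference = col_integer()
  )
)

sample_gaps
```

The same col_types argument (or show_col_types = FALSE, as the message itself suggests) could be passed to the read_csv() call that loads the full dataset.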

Data Wrangling

From the glimpse output, we can see that the dataset has 13 columns and 1,155 rows. To pare it down, I will first do some data wrangling to remove the columns I do not need.

Given the theme of this week's Tidy Tuesday, I am interested in exploring trends in actor age difference over time, comparing movies released in the 1960s and 1970s with those released in 2018-2022. I am also interested in comparing the age differences of actors who starred in movies by my two favorite directors, Wes Anderson and Alfred Hitchcock.

age_gaps1 <- age_gaps[-c(3, 5:13)] #Keeping only movie_name, release_year, and age_difference for my specific data visualization [Age Difference in lead actors from 1960-1979 vs. 2018-2022]

age_gaps_2_yr <- age_gaps1 %>% filter(   #Selecting the years I want to keep in my dataset
  release_year %in% c(1960:1979, 2018:2022))   #release_year is numeric, so I compare against numbers rather than strings
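
The same wrangling step can also be written by selecting columns by name and filtering on year ranges, which is less fragile if the column order ever changes. This is a sketch on a small toy table, not the full dataset: the first four rows are taken from the str() output above, while the last row is a made-up placeholder so that the 2018-2022 window is not empty. The names toy_gaps and toy_subset are illustrative.

```r
library(dplyr)
library(tibble)

# First four rows come from the str() output above;
# the last row is a made-up placeholder, not a real movie
toy_gaps <- tribble(
  ~movie_name,          ~release_year, ~director,       ~age_difference,
  "Harold and Maude",   1971,          "Hal Ashby",     52,
  "Venus",              2006,          "Roger Michell", 50,
  "The Quiet American", 2002,          "Phillip Noyce", 49,
  "The Big Lebowski",   1998,          "Joel Coen",     45,
  "Placeholder Film",   2019,          "Placeholder",   10
)

toy_subset <- toy_gaps %>%
  # Keep columns by name instead of by position
  select(movie_name, release_year, age_difference) %>%
  # Keep only the two periods of interest
  filter(between(release_year, 1960, 1979) |
           between(release_year, 2018, 2022))

toy_subset
```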

Data Visualization

library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
#Utilizing Plotly for interaction: Age Difference of Lead Actors in movies by year between 1960-1979 and after the Me Too movement, 2018-2022
gaps <- plot_ly(
  age_gaps_2_yr,
  type = "scatter",
  mode = "markers",
  x = ~release_year,
  y = ~age_difference,
  hoverinfo = "text",
  hovertext = ~paste("Movie Name:", movie_name)) %>%   #Show the movie name on hover; axis titles are set in layout()
  layout(title = "Age Difference of Lead Actors (Male/Female) in Movies from 1960-1979 and 2018-2022", xaxis = list(title = 'Movie Release Year'), yaxis = list(title = 'Lead Actors Age Difference'))

gaps

The largest age difference occurred in Harold and Maude, released in 1971, where the two actors were 52 years apart. Apart from that outlier, the age differences look broadly similar across the years visualized in this exercise.
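
One way to check the impression that the two periods look similar is to summarise the age difference per period instead of judging the scatter plot by eye. The sketch below assumes a hypothetical helper, label_period(), and uses placeholder values (only the Harold and Maude row, 1971 with a gap of 52, comes from the output above), so the numbers it prints say nothing about the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: label each release year with its period
label_period <- function(year) {
  case_when(
    between(year, 1960, 1979) ~ "1960-1979",
    between(year, 2018, 2022) ~ "2018-2022",
    TRUE ~ NA_character_
  )
}

# Placeholder values, except the first row (Harold and Maude)
toy_gaps <- tibble(
  release_year   = c(1971, 1965, 2019, 2021),
  age_difference = c(52, 8, 10, 4)
)

period_summary <- toy_gaps %>%
  mutate(period = label_period(release_year)) %>%
  group_by(period) %>%
  summarise(
    n_couples  = n(),
    median_gap = median(age_difference),
    max_gap    = max(age_difference),
    .groups    = "drop"
  )

period_summary
```

Applying the same mutate() and summarise() steps to age_gaps_2_yr would give the per-period medians for the actual dataset.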

More Data Visualization by Director

age_gaps3dir<-age_gaps[-c(5:13)] #Here I will remove columns I do not need

dir<-age_gaps3dir %>% filter(   #Selecting the directors I want to keep in my dataset; Hitchcock and Anderson are my two favorites!
  director %in% c( "Wes Anderson", "Alfred Hitchcock"))

dad<- ggplot(dir, aes(x = release_year, y = age_difference)) +
    geom_point(aes(color = factor(director)))

dad + labs( x= "Movie Release Year",
    y= "Actor Age Difference",
    color= "Director",
    title= "Actor Age Difference in Alfred Hitchcock and Wes Anderson Films")

From the graph, it looks like Alfred Hitchcock's films tend to feature couples with larger age differences than Wes Anderson's films.
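
A per-director summary can back up what the graph suggests. The following is a sketch only: summarise_by_director() is a hypothetical helper, and the toy rows use placeholder director names and values rather than anything from the real dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: count couples and summarise age gaps per director
summarise_by_director <- function(df) {
  df %>%
    group_by(director) %>%
    summarise(
      n_couples  = n(),
      median_gap = median(age_difference),
      max_gap    = max(age_difference),
      .groups    = "drop"
    ) %>%
    arrange(desc(median_gap))
}

# Placeholder rows, not real values
toy_dir <- tibble(
  director       = c("Director A", "Director A", "Director B"),
  age_difference = c(20, 10, 5)
)

director_summary <- summarise_by_director(toy_dir)
director_summary
```

Calling summarise_by_director(dir) on the filtered data frame from this section would show the number of couples and the median age difference for each of the two directors.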