Tidy Tuesday Exercise 2

Data Dictionary

This data dictionary will provide insight into the dataset we will be working on today.

`egg-production.csv`

variable	class	description
observed_month	double	Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
prod_type	character	type of egg product: hatching, table eggs
prod_process	character	type of production process and housing: cage-free (organic), cage-free (non-organic), all. The value ‘all’ includes cage-free and conventional housing.
n_hens	double	number of hens produced by hens for a given month-type-process combo
n_eggs	double	number of eggs producing eggs for a given month-type-process combo
source	character	Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.

`cage-free-percentages.csv`

variable	class	description
observed_month	double	Month in which report observations are collected,Dates are recorded in ISO 8601 format YYYY-MM-DD
percent_hens	double	observed or computed percentage of cage-free hens relative to all table-egg-laying hens
percent_eggs	double	computed percentage of cage-free eggs relative to all table eggs,This variable is not available for data sourced from the Egg Markets Overview report
source	character	Original USDA report from which data are sourced. Values correspond to titles of PDF reports. Date of report is included in title.

1. Loading in Necessary Libraries

library(tidyverse) #Working with multiple Tidy packages

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.4      ✔ forcats 0.5.2

Warning: package 'tidyr' was built under R version 4.2.3

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(tidymodels) #Model building

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages

library(dplyr) #Data wrangling
library(here) #setting pathways

here() starts at C:/GitHub/MADA/kimberlyperez-MADA-portfolio

library(rpart) #Model fitting


Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune

library(ranger) #Model fitting

Warning: package 'ranger' was built under R version 4.2.3

library(glmnet) #Model fitting

Warning: package 'glmnet' was built under R version 4.2.3

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-7

library(purrr)
library(stacks)

Warning: package 'stacks' was built under R version 4.2.3

2. Loading Data

Let’s try a different way to load the data in

tuesdata <- tidytuesdayR::tt_load('2023-04-11')

--- Compiling #TidyTuesday Information for 2023-04-11 ----

--- There are 2 files available ---

--- Starting Download ---


    Downloading file 1 of 2: `egg-production.csv`
    Downloading file 2 of 2: `cage-free-percentages.csv`

--- Download complete ---

tuesdata <- tidytuesdayR::tt_load(2023, week = 15) #loading in datasets from github

--- Compiling #TidyTuesday Information for 2023-04-11 ----

--- There are 2 files available ---

--- Starting Download ---


    Downloading file 1 of 2: `egg-production.csv`
    Downloading file 2 of 2: `cage-free-percentages.csv`

--- Download complete ---

ep <- tuesdata$`egg-production` #dataset 1
cf <- tuesdata$`cage-free-percentages` #dataset 2

Now, we will explore the egg production and cage free datasets

Egg Production

glimpse(ep)

Rows: 220
Columns: 6
$ observed_month <date> 2016-07-31, 2016-08-31, 2016-09-30, 2016-10-31, 2016-1…
$ prod_type      <chr> "hatching eggs", "hatching eggs", "hatching eggs", "hat…
$ prod_process   <chr> "all", "all", "all", "all", "all", "all", "all", "all",…
$ n_hens         <dbl> 57975000, 57595000, 57161000, 56857000, 57116000, 57750…
$ n_eggs         <dbl> 1147000000, 1142700000, 1093300000, 1126700000, 1096600…
$ source         <chr> "ChicEggs-09-23-2016.pdf", "ChicEggs-10-21-2016.pdf", "…

summary(ep)

 observed_month        prod_type         prod_process           n_hens         
 Min.   :2016-07-31   Length:220         Length:220         Min.   : 13500000  
 1st Qu.:2017-09-30   Class :character   Class :character   1st Qu.: 17284500  
 Median :2018-11-15   Mode  :character   Mode  :character   Median : 59939500  
 Mean   :2018-11-14                                         Mean   :110839873  
 3rd Qu.:2019-12-31                                         3rd Qu.:125539250  
 Max.   :2021-02-28                                         Max.   :341166000  
     n_eggs             source         
 Min.   :2.981e+08   Length:220        
 1st Qu.:4.240e+08   Class :character  
 Median :1.155e+09   Mode  :character  
 Mean   :2.607e+09                     
 3rd Qu.:2.963e+09                     
 Max.   :8.601e+09

Based on the glimpse function there are 220 rows and six variables to explore. I see some variables that may not be necessary for this exercise and will thus remove them during the wrangling process.

Cage-free

summary(cf)

 observed_month        percent_hens    percent_eggs       source         
 Min.   :2007-12-31   Min.   : 3.20   Min.   : 9.557   Length:96         
 1st Qu.:2017-05-23   1st Qu.:13.46   1st Qu.:14.521   Class :character  
 Median :2018-11-15   Median :17.30   Median :16.235   Mode  :character  
 Mean   :2018-05-12   Mean   :17.95   Mean   :17.095                     
 3rd Qu.:2020-02-28   3rd Qu.:23.46   3rd Qu.:19.460                     
 Max.   :2021-02-28   Max.   :29.20   Max.   :24.546                     
                                      NA's   :42

glimpse(cf)

Rows: 96
Columns: 4
$ observed_month <date> 2007-12-31, 2008-12-31, 2009-12-31, 2010-12-31, 2011-1…
$ percent_hens   <dbl> 3.20000, 3.50000, 3.60000, 4.40000, 5.40000, 6.00000, 5…
$ percent_eggs   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 9.634938, NA, 9…
$ source         <chr> "Egg-Markets-Overview-2019-10-19.pdf", "Egg-Markets-Ove…

Based on the glimpse function there are 96 rows and four variables to explore. I see one variables that may not be necessary for this exercise and will thus remove it during the wrangling process.

2. Data Wrangling

Egg Production

ep<- 
  ep %>%
  select(-c('source')) %>%
  mutate(egg_hen = n_eggs / n_hens,
         prod_process = recode(prod_process, "cage-free (non-organic)" = "CF_NO"),
         prod_process = recode(prod_process, "cage-free (organic)" = "CF_O"))

Cage-Free

cf<- 
  cf%>%
  select(!source)

3.Data Visualization

Egg Production

ep_plot<- 
  ep %>%
  ggplot(aes(x=prod_type, y=egg_hen)) +
  geom_col(aes(fill=prod_process)) +
  xlab("Production Type") +
  ylab("Average Egg Production Per Hen") +
  ggtitle("Average Egg production Per Hen Given Production Type & Process")
 
ep_plot + scale_fill_discrete(name="Production Process")

There is not much variation in the production process when comparing cage-free organic, non-organic, or all (see above for what this encompasses), however, there is a difference in average egg production for hatching eggs versus table eggs. This is not surprising as the majority of hens produce table eggs which are sold for consumption, while a smaller majority are diverted to be reared on farms or backyards as production hens (e.g., egg or meat). Unlike other agricultural sectors, poultry sector is vertically integrated with minimal breed variation (e.g. Table eggs produced from hybrid White Leghorns v.hatching eggs from a variety of breeds including heritage), thus, a uniform distribution in production processes, across the various sectors, is expected.

ep_time<- 
  ep %>%
  ggplot() +
  geom_line(
    aes(x=observed_month,
        y=n_eggs, 
        color=prod_process)) + 
      theme_dark()+ 
      labs ( x= "Year",
             y= "Total Number of Eggs", 
             title= "Number of Eggs Produced per Production Process Over a Five Year Period",
             color= "Production Process")

ep_time

Not an average so likely wont use this…

Cage-free

cf_plot<- 
  cf %>%
  ggplot () + geom_line(
    aes(x= observed_month,
         y=percent_hens)) +
      theme_dark() +
      labs(x="Year",
           y= "Percent of hens (%)",
           title= "Percentage of Cage-free Hens from 2007-2021")
cf_plot

I think the vast increase after 2015 is interesting, however, likely increased because of industry cooperation and commitments as well as state legislature (i.e., Western states [Oregon] only selling products containing “cage-free” eggs in stores).

Hypothesis

Given the varying needs based on consumer demands, I would expect the majority of egg production to come from tabled eggs, as these are for consumption and hatched eggs are for production purposes. Thus:

Predictor: Production Type Outcome: Average Number of Eggs per Hen

4. Training: Splitting Data

Here I will utilize and recreate prior machine learning (ML) models utilized in previous exercises to assess this data.

ep_split<- initial_split(ep, prop=3/4)

train_data<- training(ep_split)
test_data<- testing(ep_split)

set.seed(321)

ep_fold_train<-vfold_cv(train_data, v=5, repeats=5, strata=n_eggs)

ep_fold_test<-vfold_cv(test_data, v=5, repeats=5, strata=n_eggs)

Warning: The number of observations in each quantile is below the recommended threshold of 20.
• Stratification will use 2 breaks instead.
The number of observations in each quantile is below the recommended threshold of 20.
• Stratification will use 2 breaks instead.
The number of observations in each quantile is below the recommended threshold of 20.
• Stratification will use 2 breaks instead.
The number of observations in each quantile is below the recommended threshold of 20.
• Stratification will use 2 breaks instead.
The number of observations in each quantile is below the recommended threshold of 20.
• Stratification will use 2 breaks instead.

Recipes for Train and Test Data plus Defining the Model

ep_recipe_train<- 
  recipe(egg_hen~prod_type, data=train_data)

ep_recipe_test<-
   recipe(egg_hen~prod_type, data=test_data)

lm_mod<- 
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

Now, let’s train the data…

null_train_recipe<- recipe(egg_hen~1, data=train_data)
null_train_recipe

Recipe

Inputs:

    role #variables
 outcome          1

We need a workflow. Again, I am following my process from previous exercises to walk me through this portion. I am tailoring the code to fit my dataset.

train_wfn<- 
  workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(null_train_recipe)

Time to FIT!

train_fit <-
  fit_resamples(train_wfn, resamples= ep_fold_train)

! Fold1, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

Great, let’s check out the RMSE

train_rmse<- collect_metrics(train_fit)

train_rmse

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 rmse    standard     2.20    25  0.0407 Preprocessor1_Model1
2 rsq     standard   NaN        0 NA      Preprocessor1_Model1

Doing the same for the test data

null_test_recipe<- 
  recipe(egg_hen~1, data=test_data)

test_wfn<- workflow() %>%
  add_model(lm_mod) %>%
  add_recipe(null_test_recipe)

test_fit<- 
  fit_resamples(test_wfn, resamples = ep_fold_test)

! Fold1, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat1: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat2: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat3: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat4: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold1, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold2, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold3, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold4, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

! Fold5, Repeat5: internal: A correlation computation is required, but `estimate` is constant and ha...

test_rmse<- collect_metrics(test_fit)

test_rmse

# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 rmse    standard     2.11    25  0.0841 Preprocessor1_Model1
2 rsq     standard   NaN        0 NA      Preprocessor1_Model1

Let’s move on from our null model to fitting some other models we have learned about (e.g., Tree,Linear Model, RF). We’ll start with a Tree Model

Tree Model

#Identifying HP
tune_dtree <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("regression")

tune_dtree

Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = tune()
  tree_depth = tune()

Computational engine: rpart

**Tuning, Workflow Creation, and More Tuning using other methods

#Tuning Grid
grid_dt <-
  dials::grid_regular(
    cost_complexity(), 
    tree_depth(), 
    levels = 5)

grid_dt

# A tibble: 25 × 2
   cost_complexity tree_depth
             <dbl>      <int>
 1    0.0000000001          1
 2    0.0000000178          1
 3    0.00000316            1
 4    0.000562              1
 5    0.1                   1
 6    0.0000000001          4
 7    0.0000000178          4
 8    0.00000316            4
 9    0.000562              4
10    0.1                   4
# … with 15 more rows

#And now creating the workflow 
dt_wf<-
  workflow() %>%
  add_model(tune_dtree) %>%
  add_recipe(ep_recipe_train)

#More Tuning 
resamp<- 
  dt_wf %>%
  tune_grid(
    resample= ep_fold_train,
    grid= grid_dt)

#Let's plot it!
resamp %>%
  autoplot()

Collect Metrics

resamp %>%
  collect_metrics()

# A tibble: 50 × 8
   cost_complexity tree_depth .metric .estimator  mean     n std_err .config    
             <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>      
 1    0.0000000001          1 rmse    standard   0.930    25  0.0219 Preprocess…
 2    0.0000000001          1 rsq     standard   0.819    25  0.0129 Preprocess…
 3    0.0000000178          1 rmse    standard   0.930    25  0.0219 Preprocess…
 4    0.0000000178          1 rsq     standard   0.819    25  0.0129 Preprocess…
 5    0.00000316            1 rmse    standard   0.930    25  0.0219 Preprocess…
 6    0.00000316            1 rsq     standard   0.819    25  0.0129 Preprocess…
 7    0.000562              1 rmse    standard   0.930    25  0.0219 Preprocess…
 8    0.000562              1 rsq     standard   0.819    25  0.0129 Preprocess…
 9    0.1                   1 rmse    standard   0.930    25  0.0219 Preprocess…
10    0.1                   1 rsq     standard   0.819    25  0.0129 Preprocess…
# … with 40 more rows

resamp %>%
  show_best(n=1)

Warning: No value of `metric` was given; metric 'rmse' will be used.

# A tibble: 1 × 8
  cost_complexity tree_depth .metric .estimator  mean     n std_err .config     
            <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>       
1    0.0000000001          1 rmse    standard   0.930    25  0.0219 Preprocesso…

The code produced the output: RMSE=0.83; standard error=0.02.

Now let’s look at the best performing model

bt<- resamp %>%
  select_best()

Warning: No value of `metric` was given; metric 'rmse' will be used.

bt

# A tibble: 1 × 3
  cost_complexity tree_depth .config              
            <dbl>      <int> <chr>                
1    0.0000000001          1 Preprocessor1_Model01

Nice! Now that we have that out of the way, we need to create a final fit.

final_wf<- 
  dt_wf %>%
  finalize_workflow(bt)

final_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 1e-10
  tree_depth = 1

Computational engine: rpart

dt_fin_fit<- final_wf %>%
  fit(train_data)

Residuals

#Piping several augment helps determine predictions from OG train data, while mutate will make a new row with calculated residuals
dt_res<- dt_fin_fit %>%
  augment(train_data) %>%
  select(c(.pred, egg_hen)) %>%
  mutate(.resid=egg_hen-.pred)

dt_res

# A tibble: 165 × 3
   .pred egg_hen  .resid
   <dbl>   <dbl>   <dbl>
 1  23.5    24.6  1.12  
 2  23.5    22.8 -0.728 
 3  23.5    23.5 -0.0454
 4  23.5    23.2 -0.278 
 5  23.5    23.2 -0.278 
 6  19.0    19.2  0.222 
 7  23.5    23.3 -0.200 
 8  23.5    21.1 -2.39  
 9  19.0    18.5 -0.475 
10  23.5    23.6  0.0321
# … with 155 more rows

Great, let’s do some plotting!

#Pred v. Actual
dtpred_plot<- ggplot(dt_res,
                     aes(x=egg_hen,
                         y=.pred)) +
  geom_point()

dtpred_plot

#Pred v. Res
dtpredres_plot<- ggplot(dt_res,
                     aes(x=.pred,
                         y=.resid)) +
  geom_point()

dtpredres_plot

Nice! Let’s move onto Linear Model! While this is a basic model to run, ML is broad! Including a simple model may provide good insight!

Linear Model

We will follow a similar process as above [SWF]

#Set
lm<- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

#Workflow
lm_wf<- 
  workflow() %>%
  add_model(lm) %>%
  add_recipe(ep_recipe_train)

#Fit
epf<- lm_wf %>%
  fit(data=train_data)

Now to assessing performance and Checking Out Residuals.

aug_test <- augment(epf, train_data)

rmse <- aug_test %>% rmse(truth = egg_hen, .pred)

rsq <- aug_test %>% rsq(truth = egg_hen, .pred)

metrics<- full_join(rmse, rsq)

Joining, by = c(".metric", ".estimator", ".estimate")

metrics

# A tibble: 2 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.927
2 rsq     standard       0.823

#Residuals
epm<- lm(egg_hen~prod_type, data = train_data)

res<- resid(epm)

plot(fitted(epm), res) +
abline(0,0)

integer(0)

Quick and easy!

Random Forest

I saved the best for last. Let’s explore the Random Forest (RF) Model!

cores <- parallel::detectCores()
cores

[1] 8

rf <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger", num.threads = cores) %>% 
  set_mode("regression")

Workflow and Tuning Grid Time

rf_wf <- workflow() %>%
  add_model(rf) %>%
  add_recipe(ep_recipe_train)

rfg  <- expand.grid(mtry = c(3, 4, 5, 6), min_n = c(40,50,60), trees = c(500,1000))

rf_resamp <- 
  rf_wf %>% 
  tune_grid(ep_fold_train,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(yardstick::rmse))

i Creating pre-processing data to finalize unknown parameter: mtry

rf_resamp %>%
  collect_metrics()

# A tibble: 22 × 8
    mtry min_n .metric .estimator  mean     n std_err .config              
   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1     1    30 rmse    standard   0.930    25  0.0219 Preprocessor1_Model01
 2     1    28 rmse    standard   0.930    25  0.0219 Preprocessor1_Model02
 3     1     7 rmse    standard   0.930    25  0.0219 Preprocessor1_Model03
 4     1    39 rmse    standard   0.930    25  0.0219 Preprocessor1_Model04
 5     1    12 rmse    standard   0.930    25  0.0218 Preprocessor1_Model05
 6     1    26 rmse    standard   0.930    25  0.0219 Preprocessor1_Model06
 7     1    36 rmse    standard   0.930    25  0.0220 Preprocessor1_Model07
 8     1    34 rmse    standard   0.930    25  0.0219 Preprocessor1_Model08
 9     1    14 rmse    standard   0.930    25  0.0219 Preprocessor1_Model09
10     1     3 rmse    standard   0.930    25  0.0220 Preprocessor1_Model10
# … with 12 more rows

Let’s plot it

rf_resamp %>%
  autoplot()

Let’s Check out the Metrics of this RF

rf_resamp %>%
  show_best(n=1)

# A tibble: 1 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     1    12 rmse    standard   0.930    25  0.0218 Preprocessor1_Model05

best<- rf_resamp %>%
  select_best(method="rmse")

Final Fits- same process as we did above!

wf_final<-
  rf_wf %>%
  finalize_workflow(best)

rf_ff<-
  wf_final %>%
  fit(train_data)

Residuals

rf_res<- rf_ff %>%
  augment(train_data) %>%
  select(c(.pred, egg_hen)) %>%
  mutate(.resid = egg_hen - .pred)

rf_res

# A tibble: 165 × 3
   .pred egg_hen  .resid
   <dbl>   <dbl>   <dbl>
 1  23.5    24.6  1.11  
 2  23.5    22.8 -0.730 
 3  23.5    23.5 -0.0478
 4  23.5    23.2 -0.280 
 5  23.5    23.2 -0.280 
 6  19.0    19.2  0.221 
 7  23.5    23.3 -0.203 
 8  23.5    21.1 -2.39  
 9  19.0    18.5 -0.476 
10  23.5    23.6  0.0297
# … with 155 more rows

Continue following workflow from above…visualize using ggplot

rf_predp<- ggplot(rf_res,
                  aes(y=.pred,
                      x=egg_hen)) +geom_point() 

rf_predp

rf_resp<- ggplot(rf_res,
                  aes(x=.pred,
                      y=.resid)) +geom_point() 
rf_resp

For the final assessment with the test data, I will utilize the RF model, both RMSE for RF and our Tree model were nearly the same ~0.83 after rounding.

Random Forest on Test Data

rf_wf_test <- workflow() %>%
  add_model(rf) %>%
  add_recipe(ep_recipe_test)

rfgt  <- expand.grid(mtry = c(3, 4, 5, 6), min_n = c(40,50,60), trees = c(500,1000))

rf_resamptest <- 
  rf_wf_test %>% 
  tune_grid(ep_fold_test,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(yardstick::rmse))

i Creating pre-processing data to finalize unknown parameter: mtry

rf_resamptest %>%
  collect_metrics()

# A tibble: 24 × 8
    mtry min_n .metric .estimator  mean     n std_err .config              
   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1     1    18 rmse    standard   0.844    25  0.0266 Preprocessor1_Model01
 2     1    26 rmse    standard   0.844    25  0.0266 Preprocessor1_Model02
 3     1    32 rmse    standard   0.844    25  0.0265 Preprocessor1_Model03
 4     1    36 rmse    standard   0.844    25  0.0265 Preprocessor1_Model04
 5     1    15 rmse    standard   0.843    25  0.0267 Preprocessor1_Model05
 6     1    29 rmse    standard   0.843    25  0.0266 Preprocessor1_Model06
 7     1     4 rmse    standard   0.844    25  0.0266 Preprocessor1_Model07
 8     1    23 rmse    standard   0.843    25  0.0266 Preprocessor1_Model08
 9     1    13 rmse    standard   0.844    25  0.0265 Preprocessor1_Model09
10     1    12 rmse    standard   0.844    25  0.0265 Preprocessor1_Model10
# … with 14 more rows

Let’s plot it

rf_resamptest %>%
  autoplot()

Let’s Check out the Metrics of this RF

rf_resamptest %>%
  show_best(n=1)

# A tibble: 1 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     1    15 rmse    standard   0.843    25  0.0267 Preprocessor1_Model05

best<- rf_resamptest %>%
  select_best(method="rmse")

Final Fits- same process as we did above!

wf_finaltest<-
  rf_wf_test %>%
  finalize_workflow(best)

rf_fftest<-
  wf_finaltest%>%
  fit(test_data)

Residuals

rf_restest<- rf_ff %>%
  augment(test_data) %>%
  select(c(.pred, egg_hen)) %>%
  mutate(.resid = egg_hen - .pred)

rf_restest

# A tibble: 55 × 3
   .pred egg_hen .resid
   <dbl>   <dbl>  <dbl>
 1  19.0    19.8  0.838
 2  19.0    19.6  0.639
 3  19.0    18.6 -0.404
 4  19.0    18.6 -0.411
 5  19.0    18.6 -0.375
 6  19.0    17.3 -1.72 
 7  19.0    19.1  0.160
 8  19.0    19.1  0.100
 9  19.0    19.2  0.238
10  19.0    19.7  0.712
# … with 45 more rows

Continue following workflow from above utilizing test…visualize using ggplot

rf_predptest<- ggplot(rf_restest,
                  aes(y=.pred,
                      x=egg_hen)) +geom_point()


rf_predptest

rf_resptest<- ggplot(rf_restest,
                  aes(x=.pred,
                      y=.resid)) +geom_point() 
rf_resptest

Quick Discussion

This exercise contained a multistep process, from loading in the Tidy Tuesday data via a different method to data wrangling which allowed me to clean the data and remove any unnecessary columns/rows/variables. Because this Tidy Tuesday contained two data sets, I visualized both and ultimately decided to use the egg production data as the cage-free data had limited variables and would limit the hypothesis I could formulate.

I love production animals and am an owner of hatch chickens so I decided to explore the hatch v. table egg and the production of eggs. I also visualized this and decided it would be interesting to explore. I ran several models and ultimately decided on the random forest model given its RMSE in comparison to the null model. I am also trying to become more proficient with ML so I did not want to select a simple model such as the LM I ran. I referred back to previous exercises to ensure my workflow and process was consistent. I began with my trained data and once I selected the final model, I went through the same process with the test data. RF performed better than the null model (first model ran) which is indicative of a relationship. The RMSE of the test data is a bit higher than the train data which would make me cautious in utilizing this to make predictions or for overall reproducibility.