
Bagged Tree Models Not Robust to Column Names and Tuning Arguments #1114

@ncalliencsu

Description


PROBLEM STATEMENT

I had trouble with bagged tree models. The code below stops and throws an error at the code chunk just below the heading "Fit the BTREE Model to CV Folds". The error looks like this:

```
→ A | error: Expecting '}'
There were issues with some computations   A: x10
```
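For anyone reproducing this: the summarized "A: x10" line hides the full error text. If I understand the tune helpers correctly, `show_notes()` will print the captured notes from the most recent tuning run (a sketch, assuming tune >= 1.0; `.Last.tune.result` is the object tune stores automatically):

```r
library(tune)

# After the failing tune_grid() call, print the notes that were
# summarized as "A: x10"; .Last.tune.result holds the most recent
# tuning result in the session.
show_notes(.Last.tune.result)
```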

The workaround was to simplify the column names and to change the arguments in the model specification in the code chunk "Define Bagged Tree (BTREE) Model and Engine"

from

tree_depth = integer(5), 
min_n = integer(10)  

to

tree_depth = 5
min_n = 10

The code associated with tuning the bagged tree model appears to be fragile with respect to both the column names in the data set and the model-spec arguments for tree_depth and min_n. All of the other models work fine, so I think this is a reportable issue.
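To make one part of this concrete: `integer(5)` does not mean "the integer 5". It is the length-`n` constructor form, so the original spec may have been handed a vector of five zeros rather than a scalar:

```r
# integer(n) is a constructor: it returns a length-n vector of zeros.
integer(5)
#> [1] 0 0 0 0 0

# The scalar the model spec wants is written 5 or 5L:
identical(integer(5), 5L)
#> [1] FALSE
```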

**CODE**
---
title: "HW9 Modeling Practice"
format: html
editor: visual
---

## Read Data and Library Setup
```{r}

library(baguette)
library(readr)
library(lubridate)
library(recipes)
library(tidyverse)
library(tidymodels)
library(tree) 
library(vip)

raw_data <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/SeoulBikeData.csv",
                      locale = locale(encoding = "latin1"))

```

```{r}
library(dplyr)
library(lubridate)
library(forcats)

bike_data <- 
  raw_data |>
  rename(
    DATE = `Date`,
    RBC = `Rented Bike Count`,
    HOUR = `Hour`,
    "TEMP(deg C)" = `Temperature(°C)`,
    "RH(%)" = `Humidity(%)`,
    "WD_SPD(m/s)" = `Wind speed (m/s)`,
    "VIS(10m)" = `Visibility (10m)`,
    "DP_TEMP(deg C)" = `Dew point temperature(°C)`,
    "SOL_RAD(MJ/m2)" = `Solar Radiation (MJ/m2)`,
    "RAIN_FALL(mm)" = `Rainfall(mm)`,
    "SNOW_FALL(cm)" = `Snowfall (cm)`,
    SEASONS = `Seasons`,
    HOLIDAY = `Holiday`,
    FUNC_DAY = `Functioning Day`
  ) |>
  mutate(
    DATE = dmy(DATE),
    HOLIDAY = as_factor(HOLIDAY),
    SEASONS = as_factor(SEASONS),
    FUNC_DAY = as_factor(FUNC_DAY)
  )

write.csv(bike_data, "bike_data.csv")
```

Summarize Across Hours
```{r}

bd_sub <- bike_data |>
  group_by(DATE, SEASONS, HOLIDAY) |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE), .names = '{col}_MEAN'),
            SUM_RBC = sum(RBC),
            "SUM_RAIN_FALL(mm)" = sum(`RAIN_FALL(mm)`),
            "SUM_SNOW_FALL(cm)" = sum(`SNOW_FALL(cm)`)) |>
  select(DATE, SEASONS, HOLIDAY, starts_with("SUM"), everything(),
         -RBC_MEAN, -HOUR_MEAN, -`RAIN_FALL(mm)_MEAN`, -`SNOW_FALL(cm)_MEAN`)
                        
write.csv(bd_sub, "bd_sub_after.csv")

```

## Split the Data
Use functions from tidymodels to split the data into a training and test set (75/25 split). Use the strata argument to stratify the split on the seasons variable. On the training set, create a 10-fold CV split.

The code splits the dataset bd_sub into training and test sets using tidymodels in R. initial_split(bd_sub, prop = 3/4, strata = SEASONS) sends 75% of the rows to training and 25% to testing while preserving the distribution of SEASONS.

bike_data_split: the split object created by initial_split(). It records how the data were divided into training and testing sets, but does not contain the actual data frames.

bike_train_data: the training set, extracted from bike_data_split using training(bike_data_split). It contains 75% of the original data, sampled at random within strata.

bike_test_data: the testing set, extracted from bike_data_split using testing(bike_data_split). It contains the remaining 25% of the data.

```{r}
library(tidymodels)

set.seed(222)
# Put 3/4 of the data into the training set; bike_data_split stores the split object.
bike_data_split <- initial_split(bd_sub, prop = 3/4, strata = SEASONS)
               
# Create data frames for the two sets:
bike_train_data <- training(bike_data_split)
bike_test_data  <- testing(bike_data_split)

#On the training set, create a 10 fold CV split
bike_CV_folds <- vfold_cv(bike_train_data, v = 10)

```

## Recipes for HW 9

This recipe is for all model types except MLR.
```{r}

  #Define Recipe 1
  recipe1 <- 
    recipe(SUM_RBC ~ ., data = bike_train_data) |> 
    step_date(DATE, features = c("dow")) |>              
    step_mutate(
      DAY_TYPE = factor(
        ifelse(
          DATE_dow == "Sat" | DATE_dow == "Sun", "WKEND", "WKDAY"))) |>
    step_rm(DATE_dow, DATE) |>
    step_normalize(all_numeric_predictors()) |>
    step_dummy(all_nominal_predictors())


recipe1_prep <- prep(recipe1, training = bike_train_data)
recipe1_baked <- bake(recipe1_prep, new_data = bike_test_data)  
 
write.csv(recipe1_baked, "recipe1_baked.csv")

```

MLR Recipe

```{r}
#Define Recipe 2

recipe2 <- 
  recipe(SUM_RBC ~ ., data = bike_train_data) |> 
  step_date(DATE, features = c("dow")) |>              
  step_mutate(
    DAY_TYPE = factor(
      ifelse(
        DATE_dow == "Sat" | DATE_dow == "Sun", "WKEND", "WKDAY"))) |>
  step_rm(DATE_dow, DATE) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(
  terms = ~ starts_with("HOLIDAY"):starts_with("SEASONS") +
           `TEMP(deg C)_MEAN`:starts_with("SEASONS") +
           `TEMP(deg C)_MEAN`:`SUM_RAIN_FALL(mm)`
)
  
recipe2_prep <- prep(recipe2, training = bike_train_data)
recipe2_baked <- bake(recipe2_prep, new_data = bike_test_data)  
 
write.csv(recipe2_baked, "recipe2_baked.csv")

```

## MLR Model from HW 8

The best model was already selected from HW 8 as Recipe 2, so that step is not shown here.

### Define MLR Model Object
```{r}
MLR_model <- linear_reg() |>
  set_engine("lm")

```

### Define MLR Workflow
```{r}
MLR_wkf <- workflow() |>
  add_recipe(recipe2) |>
  add_model(MLR_model)
```

### Fit the MLR to CV Folds
```{r}
MLR_CV_fit <- MLR_wkf |>
  fit_resamples(bike_CV_folds)
```
Fitting this workflow produces a warning that I also saw in HW 8 and was unable to resolve. Professor Post said the code looked fine.

### Fit the MLR to the Training Set
```{r}
MLR_fitT <- MLR_wkf |>
  fit(bike_train_data)
```

MLR_fitT: this fits the chosen model (the best multiple linear regression from HW 8) to the entire training set.

### Fit the MLR to the Test Set
```{r}
MLR_final <- MLR_wkf |>
   last_fit(bike_data_split)
```

MLR_final: This uses last_fit() to train the model on the training set and evaluate its predictions on the test set, giving performance metrics for out-of-sample data.

## LASSO Model

### Define LASSO Model Object
```{r}
LASSO_model <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")
```

### Create Workflow for LASSO Model
```{r}
LASSO_wkf <- workflow() |>
  add_recipe(recipe1) |>
  add_model(LASSO_model)
```

### Fit the LASSO Workflow to CV Folds with tune_grid() and grid_regular()

We saw how to fit a workflow to a set of CV folds with fit_resamples(). Since we have a tuning parameter here, we don't want to use that function. Instead, we use tune_grid(), which fits the model to the CV folds while letting us specify the set of tuning parameters to consider.

This means we are actually doing a bunch of model fits on the CV folds (one for each tuning-parameter combination). In tune_grid() we specify the tuning-parameter values with the grid argument; grid_regular() is a helper that chooses a grid of reasonable values.

```{r}
LASSO_grid <- LASSO_wkf |>
  tune_grid(resamples = bike_CV_folds,
            grid = grid_regular(penalty(), levels = 200)) 
```

### Select Best Model with select_best()
```{r}
LASSO_best <- LASSO_grid |>
  select_best(metric = "rmse")
```

### Fit Best LASSO Model to the Training Set with finalize_workflow()
```{r}
LASSO_fitT <- LASSO_wkf |>
  finalize_workflow(LASSO_best) |>
  fit(bike_train_data)
```

finalize_workflow() tells R to finish our training with a specific setting of the terms we set to tune() in our model definition.

### Fit the LASSO Model to the Test Set
```{r}
LASSO_final <- LASSO_wkf |>
  finalize_workflow(LASSO_best) |>
  last_fit(bike_data_split)
```

## Regression Tree Model

### Define Regression Tree (RTREE) Model and Engine
```{r}
RTREE_model <- decision_tree(tree_depth = tune(),
                          min_n = 20,
                          cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")
```

### Create Workflow for RTREE Model
```{r}
RTREE_wkf <- workflow() |>
  add_recipe(recipe1) |>
  add_model(RTREE_model)
```

### Fit the RTREE Model to CV Folds
```{r}
RTREE_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = c(10, 5))
RTREE_fits <- RTREE_wkf |> 
  tune_grid(resamples = bike_CV_folds,
            grid = RTREE_grid)
```

### Select the Best RTREE Model with select_best()
```{r}
RTREE_best <- select_best(RTREE_fits, metric = "rmse")
```

### Fit Best RTREE model on the Training Set with finalize_workflow()
```{r}
RTREE_fitT <- RTREE_wkf |>
  finalize_workflow(RTREE_best)|>
  fit(bike_train_data)
```

### Fit the RTREE model to the Test Set
```{r}
RTREE_final <- RTREE_wkf |>
  finalize_workflow(RTREE_best) |>
  last_fit(bike_data_split)
```

## Bagged Tree Model

### Define Bagged Tree (BTREE) Model and Engine
```{r eval=FALSE}
# These are the arguments that trigger the failure.  Note that integer(5)
# and integer(10) are length-n constructors (vectors of zeros), not scalars;
# the workaround is tree_depth = 5, min_n = 10.
BTREE_model <- bag_tree(tree_depth = integer(5), min_n = integer(10), cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression") |>
  translate()
```

### Create Workflow for BTREE Model
```{r eval=FALSE}
BTREE_wkf <- workflow() |>
  add_recipe(recipe1) |>
  add_model(BTREE_model)
```

### Fit the BTREE Model to CV Folds
```{r eval=FALSE}
BTREE_tuned <- BTREE_wkf |>
  tune_grid(resamples = bike_CV_folds,
            grid = grid_regular(cost_complexity(),
                                levels = 15),
            metrics = metric_set(rmse))
```


### Select the Best BTREE Model with select_best()
```{r eval=FALSE}
BTREE_best <- select_best(BTREE_tuned, metric = "rmse")
```

### Fit Best BTREE model on the Training Set with finalize_workflow()
```{r eval=FALSE}
BTREE_fitT <- BTREE_wkf |>
  finalize_workflow(BTREE_best)|>
  fit(bike_train_data)
```

### Fit the BTREE model to the Test Set
```{r eval=FALSE}
BTREE_final <- BTREE_wkf |>
  finalize_workflow(BTREE_best) |>
  last_fit(bike_data_split)
```

## Random Forest Model

### Define Random Forest (RF) Model and Engine
```{r}
RF_model <- rand_forest(mode = "regression", mtry = tune()) |>
  set_engine("ranger", importance = "impurity")
```

### Create Workflow for RF Model
```{r}
RF_wkf <- workflow() |>
  add_recipe(recipe1) |>
  add_model(RF_model)
```

### Fit the RF Model to CV Folds
```{r}
RF_tuned <- RF_wkf |>
  tune_grid(resamples = bike_CV_folds,
            grid = 7,
            metrics = metric_set(rmse))
```

### Select the Best RF Model with select_best()
```{r}
RF_best <- select_best(RF_tuned, metric = "rmse")
```

### Fit Best RF model on the Training Set with finalize_workflow()
```{r}
RF_fitT <- RF_wkf |>
  finalize_workflow(RF_best)|>
  fit(bike_train_data)
```

### Fit the RF model to the Test Set
```{r}
RF_final <- RF_wkf |>
  finalize_workflow(RF_best) |>
  last_fit(bike_data_split)
```

## Compare All Final Models on the Test set
```{r}

rbind(MLR_final |> compute_metrics(metric_set(rmse, mae)),
      LASSO_final |> compute_metrics(metric_set(rmse, mae)),
      RTREE_final |> compute_metrics(metric_set(rmse, mae)),
#     BTREE_final |> compute_metrics(metric_set(rmse, mae)),
      RF_final |> compute_metrics(metric_set(rmse, mae)))
```

Here we can see that the RF Model is the best model with the lowest RMSE and MAE.

## Extract the Final Model Fits and Report a Summary of Each Model
- For the LASSO and MLR models, report the final coefficient tables
- For the regression tree model, give a plot of the final fit
- For the bagged tree and random forest models, produce a variable importance plot
  - For the random forest model, this is a bit complicated. Check this out and see if you can get it to work.

MLR Final Model Fit and Summary
```{r}
MLR_final |> 
  extract_fit_parsnip() |> 
  tidy()
```

LASSO Final Model Fit and Summary
```{r}
LASSO_final |> 
  extract_fit_parsnip() |> 
  tidy()
```

RTREE Final Model Fit and Summary
```{r}
RTREE_final |>
  extract_fit_engine() |>
  rpart.plot::rpart.plot(cex = 0.5 , roundint = FALSE)
```

BTREE Final Model Fit and Summary
```{r eval=FALSE}
BTREE_fit <- BTREE_final |>
  extract_fit_engine()

# The bagger object from baguette stores variable importance in $imp
BTREE_fit$imp |>
  mutate(term = factor(term, levels = term)) |>
  ggplot(aes(x = term, y = value)) +
  geom_bar(stat = "identity") +
  coord_flip()
```

RF Final Model Fit and Summary
```{r}
RF_final |>
  extract_fit_parsnip() |>
  vip(num_features = 10)
```

## Fit the Final RF Model to the Entire Data Set
```{r}
Final_Model <- RF_wkf |>
  finalize_workflow(RF_best) |>
  fit(bd_sub)

# Variable importance plot from the final fit
Final_Model |>
  extract_fit_parsnip() |>
  vip(num_features = 10)
```

Labels: bug (an unexpected problem or unintended behavior)