In this tutorial, we will cover decision trees and random forests. These are supervised learning algorithms that can be used for classification or regression.
Please click the button below to open an interactive version of all course
R tutorials through RStudio Cloud.
Note: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.
Click the button below to launch an interactive RStudio environment using
Binder.org. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.
The code below will load the required packages and data sets for this tutorial. We will need a new package for this lesson,
rpart.plot. This package is used for visualizing decision tree models.
If working with RStudio Desktop, please install the rpart.plot package before proceeding.
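If the package is not already installed, it can be added from CRAN:

```r
# Install rpart.plot from CRAN (only needed once per machine)
install.packages('rpart.plot')
```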
library(tidyverse)
library(tidymodels)
library(vip)
library(rpart.plot)

# Telecommunications customer churn data
churn_df <- read_rds(url('https://gmudatamining.com/data/churn_data.rds'))
We will be working with the
churn_df data frame in this lesson. Take a moment to explore this data set below.
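As a starting point, and assuming churn_df has been loaded by the code above, the following sketch gives a quick overview of the data:

```r
# Variables, their types, and the first few values
glimpse(churn_df)

# Number of rows and columns
dim(churn_df)
```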
A row in this data frame represents a customer at a telecommunications company. Each customer has purchased phone and internet services from this company.
The response variable in this data is
canceled_service which indicates whether the customer terminated their services.
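To see how the response is distributed, we can tabulate its levels (a quick sketch, assuming churn_df is loaded as above):

```r
# Count of customers by response category
churn_df %>% count(canceled_service)
```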
To demonstrate fitting decision trees, we will use the
churn_df data set and predict
canceled_service using all available predictor variables.
A decision tree is specified with the
decision_tree() function from
tidymodels and has three hyperparameters: cost_complexity, tree_depth, and
min_n. Since we will need to perform hyperparameter tuning, we will create cross validation folds from our training data.
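The model specification described above can be sketched as follows, with tune() placeholders marking the hyperparameters to be tuned later:

```r
# Decision tree specification with all three hyperparameters set to tune()
tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune()) %>% 
              set_engine('rpart') %>% 
              set_mode('classification')
```

Setting the engine to 'rpart' uses the rpart package to fit the tree, which is why rpart.plot can later visualize the fitted model.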
We will split the data into a training and test set. The training data will be further divided into 5 folds for hyperparameter tuning.
set.seed(314) # Remember to always set your seed. Any integer will work

churn_split <- initial_split(churn_df, prop = 0.75,
                             strata = canceled_service)

churn_training <- churn_split %>% training()
churn_test <- churn_split %>% testing()

# Create folds for cross validation on the training data set
## These will be used to tune model hyperparameters
set.seed(314)
churn_folds <- vfold_cv(churn_training, v = 5)
Now we can create a feature engineering recipe for this data. We will train the following transformations on our training data: a Yeo-Johnson transformation and normalization for all numeric predictors, and dummy variable encoding for all nominal predictors.
churn_recipe <- recipe(canceled_service ~ ., data = churn_training) %>% 
                step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
                step_normalize(all_numeric(), -all_outcomes()) %>% 
                step_dummy(all_nominal(), -all_outcomes())
Let’s check to see if the feature engineering steps have been carried out correctly.
churn_recipe %>% prep() %>% bake(new_data = churn_training)
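With the recipe and cross validation folds in place, the pieces can be combined for tuning. The sketch below assumes a tunable tree specification (tree_model, with tune() placeholders for the three hyperparameters) and combines it with the recipe in a workflow:

```r
# Assumed tunable decision tree specification
tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune()) %>% 
              set_engine('rpart') %>% 
              set_mode('classification')

# Bundle the model and recipe into a workflow
tree_workflow <- workflow() %>% 
                 add_model(tree_model) %>% 
                 add_recipe(churn_recipe)

# Tune over a small regular grid using the cross validation folds
set.seed(314)
tree_tuning <- tree_workflow %>% 
               tune_grid(resamples = churn_folds,
                         grid = grid_regular(cost_complexity(),
                                             tree_depth(),
                                             min_n(),
                                             levels = 2))

# Display the best hyperparameter combinations by ROC AUC
tree_tuning %>% show_best(metric = 'roc_auc')
```

The levels = 2 grid is kept deliberately small here; a larger grid or a space-filling design would explore the hyperparameter space more thoroughly at the cost of longer run time.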