In this tutorial, we will cover decision trees and random forests. These are supervised learning algorithms that can be used for classification or regression.


Please click the button below to open an interactive version of all course R tutorials through RStudio Cloud.

Note: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.



Click the button below to launch an interactive, pre-configured RStudio environment in your browser using Binder. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to start, and you will not be able to save your work.





The code below will load the required packages and data sets for this tutorial. We will need a new package for this lesson, rpart.plot, which is used for visualizing decision tree models.

If working with RStudio Desktop, please install the rpart, rpart.plot, and ranger packages.


library(tidyverse)
library(tidymodels)
library(vip)
library(rpart.plot)

# Telecommunications customer churn data
churn_df <- read_rds(url('https://gmudatamining.com/data/churn_data.rds'))



Data

We will be working with the churn_df data frame in this lesson. Take a moment to explore this data set below.

A row in this data frame represents a customer at a telecommunications company. Each customer has purchased phone and internet services from this company.

The response variable in this data is canceled_service, which indicates whether the customer terminated their services.
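
For example, glimpse() and count() from the tidyverse provide a quick overview. This exploration snippet is our own suggestion rather than part of the required tutorial code:

# Suggested exploration, not required for the tutorial:
# glimpse() lists each column along with its type
glimpse(churn_df)

# count() shows how many customers canceled their service
churn_df %>% count(canceled_service)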





Decision Trees

To demonstrate fitting decision trees, we will use the churn_df data set and predict canceled_service using all available predictor variables.

A decision tree is specified with the decision_tree() function from tidymodels and has three hyperparameters: cost_complexity, tree_depth, and min_n. Since we will need to perform hyperparameter tuning, we will create cross validation folds from our training data.
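
As a preview, here is a minimal sketch of what that model specification might look like. The object name tree_model is our own choice, and the actual tuning happens in a later step:

# A sketch of the model specification to be tuned later.
# tune() marks each hyperparameter as a placeholder for tuning.
tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune()) %>% 
              set_engine('rpart') %>%    # fit with the rpart package
              set_mode('classification') # canceled_service is categorical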



Data Splitting

We will split the data into a training and test set. The training data will be further divided into 5 folds for hyperparameter tuning.

set.seed(314) # Remember to always set your seed. Any integer will work

churn_split <- initial_split(churn_df, prop = 0.75, 
                             strata = canceled_service)

churn_training <- churn_split %>% training()

churn_test <- churn_split %>% testing()

# Create folds for cross validation on the training data set
## These will be used to tune model hyperparameters
set.seed(314)

churn_folds <- vfold_cv(churn_training, v = 5)



Feature Engineering

Now we can create a feature engineering recipe for this data. We will train the following transformations on our training data.


  • Remove skewness from numeric predictors
  • Normalize all numeric predictors
  • Create dummy variables for all nominal predictors


churn_recipe <- recipe(canceled_service ~ ., data = churn_training) %>% 
                       step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
                       step_normalize(all_numeric(), -all_outcomes()) %>% 
                       step_dummy(all_nominal(), -all_outcomes())



Let’s check to see if the feature engineering steps have been carried out correctly.


churn_recipe %>% 
  prep() %>% 
  bake(new_data = churn_training)
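
If the recipe worked as intended, the numeric predictors in the baked data should now be centered and scaled, and each nominal predictor (other than the outcome, canceled_service) should appear as a set of 0/1 dummy variable columns.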