In this homework assignment, we will focus on machine learning with tidymodels.

To complete this assignment, students must download the R notebook template and open the file in their RStudio application.




Load Packages and Data

The R code chunk below will load the tidyverse, tidymodels, and discrim packages as well as the mobile_carrier_df data set.

Note: You will have to install the klaR, kknn, and discrim packages for this assignment. If you get an error running the code below, make sure that you have installed the required packages in your RStudio desktop environment. To install any package, navigate to the bottom right pane of RStudio, select the Packages tab and click the Install button.
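As an alternative to the Packages tab, the check can be done from the console. The snippet below is a minimal sketch that only reports which of the three packages are missing; the variable names are illustrative and not part of the assignment template.

```r
# Packages named in the note above
required_pkgs <- c("klaR", "kknn", "discrim")

# requireNamespace() returns TRUE when a package is installed and can be loaded
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Print the command to run rather than installing automatically
  message("Run: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All required packages are installed.")
}
```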



library(tidyverse)
library(tidymodels)
library(discrim)

mobile_carrier_df <- read_rds(url('https://gmudatamining.com/data/mobile_carrier_data.rds'))



The mobile_carrier_df data frame contains information on U.S. customers for a national mobile service carrier.

Each row represents a customer who did or did not cancel their service. The response variable in this data set is named canceled_plan and has levels of ‘yes’ or ‘no’. The predictor variables in this data frame contain information about the customers’ residence region and mobile call activity.

Our goal in this assignment is to predict canceled_plan with various machine learning algorithms including logistic regression, LDA, and KNN.


mobile_carrier_df



Logistic Regression

In each part of the code blocks below, your assignment is to replace the ---- with the correct input for fitting a logistic regression to the mobile_carrier_df data frame.


Data Splitting

set.seed(271)

mobile_split <- initial_split( ---- , prop = 0.75,
                              strata = ---- )

mobile_training <- ---- %>% training()

mobile_test <- ---- %>% testing()


# Create cross validation folds for hyperparameter tuning
set.seed(271)

mobile_folds <- vfold_cv(----, v = 5)
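To see what stratified splitting does, here is a small sketch on simulated data. The toy_df data frame and its columns are made up for illustration and are unrelated to mobile_carrier_df. Stratifying on the outcome should keep the yes/no proportions roughly equal in the training and test sets.

```r
library(tidymodels)

# Hypothetical toy data with an imbalanced binary outcome
set.seed(123)
toy_df <- tibble(
  outcome = factor(sample(c("yes", "no"), size = 200, replace = TRUE,
                          prob = c(0.3, 0.7)),
                   levels = c("yes", "no")),
  x1 = rnorm(200),
  x2 = runif(200)
)

# Hold out 25% of rows for testing, stratifying on the outcome
toy_split <- initial_split(toy_df, prop = 0.75, strata = outcome)
toy_training <- toy_split %>% training()
toy_test <- toy_split %>% testing()

# Five cross-validation folds built from the training rows only
toy_folds <- vfold_cv(toy_training, v = 5)

# Compare class proportions in the two sets
toy_training %>% count(outcome) %>% mutate(prop = n / sum(n))
toy_test %>% count(outcome) %>% mutate(prop = n / sum(n))
```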



Feature Engineering

Create a feature engineering pipeline, mobile_recipe, with the following transformations:


  • Remove skewness from numeric predictors
  • Normalize all numeric predictors
  • Create dummy variables for all nominal predictors


mobile_recipe <- recipe(----, data = ----) %>% 
                 ---- %>% 
                 ---- %>% 
                 ----
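The three transformations listed above can be sketched on a toy data frame as follows. Everything here (toy_df, its columns, toy_recipe) is hypothetical, and step_YeoJohnson() is only one of several recipes steps that reduce skewness, so treat this as an illustration of the pattern rather than the required answer.

```r
library(tidymodels)

# Hypothetical toy data: one skewed numeric, one roughly normal numeric,
# and one nominal predictor
set.seed(1)
toy_df <- tibble(
  outcome = factor(sample(c("yes", "no"), 100, replace = TRUE),
                   levels = c("yes", "no")),
  monthly_minutes = rexp(100, rate = 1 / 50),
  age = rnorm(100, mean = 40, sd = 10),
  region = factor(sample(c("east", "south", "west"), 100, replace = TRUE),
                  levels = c("east", "south", "west"))
)

toy_recipe <- recipe(outcome ~ ., data = toy_df) %>%
  # Reduce skewness in numeric predictors
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  # Center and scale numeric predictors to mean 0, sd 1
  step_normalize(all_numeric(), -all_outcomes()) %>%
  # Convert nominal predictors to 0/1 indicator columns
  step_dummy(all_nominal(), -all_outcomes())

# prep() estimates the transformations; bake() applies them
baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = toy_df)

baked
```

The order matters: the skewness step runs before normalization, and dummy variables are created last so that the 0/1 indicators are not centered and scaled.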



Check Transformations

You should get the results below when you apply your feature engineering transformations to the training data.


mobile_recipe %>% 
  prep() %>% 
  bake(new_data = mobile_training)



Specify Logistic Regression Model

Next, specify a logistic regression model using the appropriate parsnip function. Use the “glm” engine.


logistic_model <- ---- %>% 
                  set_engine(----) %>% 
                  set_mode(----)



Create a Workflow

Next, combine your model and recipe into a single workflow, logistic_wf.


logistic_wf <- workflow() %>% 
               ---- %>% 
               ----
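The model-plus-recipe pattern can be sketched on the two_class_dat data set that ships with the modeldata package (an assumption of this example, not part of the assignment). The names toy_recipe, toy_model, and toy_wf are illustrative.

```r
library(tidymodels)

# Example data from modeldata: outcome Class, numeric predictors A and B
data(two_class_dat, package = "modeldata")

toy_recipe <- recipe(Class ~ A + B, data = two_class_dat) %>%
  step_normalize(all_numeric(), -all_outcomes())

# A parsnip model specification: model type, engine, and mode
toy_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# A workflow bundles the preprocessing and the model into one object
toy_wf <- workflow() %>%
  add_model(toy_model) %>%
  add_recipe(toy_recipe)

# Fitting the workflow preps the recipe and trains the model in one call
toy_wf_fit <- fit(toy_wf, data = two_class_dat)

toy_preds <- predict(toy_wf_fit, new_data = head(two_class_dat), type = "prob")
toy_preds
```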



Fit Model

Fit your workflow using the last_fit() function. This will train your model on the training data and calculate predictions on the test data.


logistic_fit <-  ---- %>% 
                 last_fit(----)



Collect Predictions

Use the collect_predictions() function to create a data frame of test results.


logistic_results <-  ---- %>% 
                     collect_predictions()
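As a sketch of how last_fit() and collect_predictions() work together, the example below again uses two_class_dat from modeldata (an assumption for illustration). last_fit() takes the split object, trains on its training portion, and predicts on its test portion.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

set.seed(42)
toy_split <- initial_split(two_class_dat, prop = 0.75, strata = Class)

toy_wf <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(logistic_reg() %>%
              set_engine("glm") %>%
              set_mode("classification"))

# Train on the training rows, evaluate on the test rows
toy_fit <- toy_wf %>%
  last_fit(split = toy_split)

# One row per test observation, with class probabilities,
# the predicted class, and the true outcome
toy_results <- toy_fit %>%
  collect_predictions()

toy_results

# Default test-set metrics
collect_metrics(toy_fit)
```

The probability columns are named after the outcome levels (here .pred_Class1 and .pred_Class2), which is what the ROC functions expect.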



ROC Curve

Calculate the ROC Curve, area under the ROC curve, and the confusion matrix on the test data. You should get the results below.


## ROC Curve
roc_curve( ---- , truth = ---- , estimate = ---- ) %>% 
  autoplot()

# ROC AUC
roc_auc(----, truth = ----, ----)

# Confusion Matrix
conf_mat(----, truth = ----, estimate = ----)
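The three yardstick calls can be tried on the two_class_example data set that ships with yardstick, which has a truth column, class probabilities Class1 and Class2, and a predicted class column. This is a sketch on example data, not the assignment results.

```r
library(tidymodels)

data(two_class_example, package = "yardstick")

# ROC curve: pass the true outcome and the probability column
# for the event of interest (by default, the first factor level)
toy_roc_plot <- roc_curve(two_class_example, truth = truth, Class1) %>%
  autoplot()
toy_roc_plot

# Area under the ROC curve, using the same arguments
toy_auc <- roc_auc(two_class_example, truth = truth, Class1)
toy_auc

# The confusion matrix compares predicted classes, not probabilities
toy_cm <- conf_mat(two_class_example, truth = truth, estimate = predicted)
toy_cm
```

Note the difference in inputs: roc_curve() and roc_auc() use a probability column, while conf_mat() uses the predicted class column.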



Results

ROC Curve

ROC AUC

Confusion Matrix

          Truth
Prediction yes  no
       yes  89  37
       no   87 355





Linear Discriminant Analysis

In this section we will modify the steps from above to fit an LDA model to the mobile_carrier_df data. We have already created our training and test sets and our cross-validation folds, and we have specified our feature engineering recipe.

To fit an LDA model, we must specify an LDA object with discrim_regularized(), create an LDA workflow, and fit our model with last_fit().


Specify LDA model


lda_model <- discrim_regularized(----) %>% 
             set_engine(----) %>% 
             set_mode(----)
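discrim_regularized() describes a family of discriminant models controlled by two arguments, frac_common_cov and frac_identity. As we understand the discrim documentation, setting frac_common_cov = 1 and frac_identity = 0 corresponds to LDA. The sketch below fits such a model to two_class_dat from modeldata (an assumption for illustration) and requires the klaR package.

```r
library(tidymodels)
library(discrim)

data(two_class_dat, package = "modeldata")

# frac_common_cov = 1: all classes share one covariance matrix
# frac_identity = 0: no shrinkage toward the identity matrix
toy_lda <- discrim_regularized(frac_common_cov = 1, frac_identity = 0) %>%
  set_engine("klaR") %>%
  set_mode("classification")

toy_lda_fit <- toy_lda %>%
  fit(Class ~ A + B, data = two_class_dat)

toy_lda_preds <- predict(toy_lda_fit, new_data = head(two_class_dat),
                         type = "prob")
toy_lda_preds
```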


Create LDA Workflow


lda_wf <- workflow() %>% 
          add_model(----) %>% 
          add_recipe(----)


Fit Model

lda_fit <- ---- %>% 
           last_fit(----)



Collect Predictions

Use the collect_predictions() function to create a data frame of test results.


lda_results <-   ---- %>% 
                 collect_predictions()


ROC Curve and Confusion Matrix

Calculate the ROC Curve, area under the ROC curve, and the confusion matrix on the test data. You should get the results below.


## ROC Curve
roc_curve( ---- , truth = ---- , estimate = ---- ) %>% 
  autoplot()

# ROC AUC
roc_auc(----, truth = ----, ----)


# Confusion Matrix
conf_mat(----, truth = ----, estimate = ----)



Results

ROC Curve

ROC AUC

# ROC AUC
roc_auc(lda_results, truth = canceled_plan, .pred_yes)

Confusion Matrix

# Confusion Matrix
conf_mat(lda_results, truth = canceled_plan, estimate = .pred_class)
          Truth
Prediction yes  no
       yes  84  32
       no   92 360



KNN Classification

In this section we will modify the steps from above to fit a KNN model to the mobile_carrier_df data.

To fit a KNN model, we must specify a KNN object with nearest_neighbor(), create a KNN workflow, tune our hyperparameter, neighbors, and fit our model with last_fit().


Specify KNN model


knn_model <- nearest_neighbor(----) %>% 
             set_engine(----) %>% 
             set_mode(----)


Create KNN Workflow


knn_wf <- workflow() %>% 
          add_model(----) %>% 
          add_recipe(----)


Tune Hyperparameter

Create Tuning Grid

Next, create a grid with the following values of neighbors: 10, 15, 25, 45, 60, 80, 100, 120, 140, and 180.


## Create a grid of hyperparameter values to test
k_grid <- tibble(neighbors = ----)
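Once a grid exists, it is evaluated against the cross-validation folds with tune_grid() from the tune package. The sketch below shows that step on two_class_dat from modeldata, with a shorter grid than the one above. All object names are illustrative, and the example requires the kknn package.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

set.seed(271)
toy_folds <- vfold_cv(two_class_dat, v = 5)

# tune() marks neighbors as a hyperparameter to be chosen by tuning
toy_knn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

toy_knn_wf <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(toy_knn)

# One row per candidate value; the column name must match the tuned argument
toy_grid <- tibble(neighbors = c(5, 15, 45))

# Fit and evaluate every candidate on every fold
toy_tuning <- toy_knn_wf %>%
  tune_grid(resamples = toy_folds, grid = toy_grid)

# Average performance per candidate value
toy_tuning %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc")

# The candidate with the highest mean area under the ROC curve
toy_best <- toy_tuning %>%
  select_best(metric = "roc_auc")
toy_best
```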



Select Best Model

Use the select_best() function to select the best model from our tuning results based on the area under the ROC curve.


## Select best model based on roc_auc
best_k <- ---- %>% 
          select_best(metric = ----)




Finalize Workflow

The last step is to use finalize_workflow() to add our optimal model to our workflow object.


## Finalize workflow by adding the best performing model

final_knn_wf <- ---- %>% 
                finalize_workflow(----)
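To see what finalize_workflow() changes, the sketch below finalizes a toy workflow with a hand-written one-row tibble in place of real tuning results. The value 15 is arbitrary and the object names are illustrative. The tune() placeholder in the model specification is replaced by the supplied value.

```r
library(tidymodels)

toy_knn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

toy_wf <- workflow() %>%
  add_formula(Class ~ A + B) %>%
  add_model(toy_knn)

# Stand-in for the output of select_best(): one row, one column per
# tuned hyperparameter (the value here is arbitrary)
toy_best <- tibble(neighbors = 15)

final_toy_wf <- toy_wf %>%
  finalize_workflow(toy_best)

# The model specification now has neighbors = 15 instead of tune()
final_spec <- extract_spec_parsnip(final_toy_wf)
final_spec
```

The finalized workflow has no remaining tune() placeholders, so it can be passed to last_fit() like any other workflow.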



Fit Model

knn_fit <- ---- %>% 
           last_fit(split = ----)



Collect Predictions

Use the collect_predictions() function to create a data frame of test results.


knn_results <-   ---- %>% 
                 collect_predictions()


ROC Curve and Confusion Matrix

Calculate the ROC Curve, area under the ROC curve, and the confusion matrix on the test data. You should get the results below.


## ROC Curve
roc_curve( ---- , truth = ---- , estimate = ---- ) %>% 
  autoplot()

# ROC AUC
roc_auc(----, truth = ----, ----)


# Confusion Matrix
conf_mat(----, truth = ----, estimate = ----)



Results

ROC Curve

ROC AUC

# ROC AUC
roc_auc(knn_results, truth = canceled_plan, .pred_yes)

Confusion Matrix

# Confusion Matrix
conf_mat(knn_results, truth = canceled_plan, estimate = .pred_class)
          Truth
Prediction yes  no
       yes  64  16
       no  112 376
 



Copyright © David Svancer 2020