In this tutorial, we will learn about classification and logistic regression with `tidymodels`

. We will be using the `heart_disease`

and `employee`

_data` data sets to demonstrate concepts in this tutorial.

The `heart_disease`

data contains demographics and outcomes of various medical tests for patients in a heart disease study. The variable of interest in this data is `heart_disease`

and it indicates whether a patient was diagnosed with heart disease (Yes or No).

Please click the button below to open an interactive version of all course `R`

tutorials through RStudio Cloud.

**Note**: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.

Click the button below to launch an interactive RStudio environment using `Binder.org`

. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.

The code below will load the required packages and data sets for this tutorial.

```
library(tidyverse)
library(tidymodels)
library(vip)
# Load employee attrition data
employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))
# Heart disease data
heart_df <- read_rds(url('https://gmudatamining.com/data/heart_disease.rds')) %>%
select(heart_disease, age, chest_pain, max_heart_rate, resting_blood_pressure)
# View heart disease data
heart_df
```

In logistic regression, we are estimating the probability that our **Bernoulli** response variable is equal to the `Positive`

class.

In classification, the event we are interested in predicting, such as `heart_disease = 'Yes'`

in our heart disease example, is known as the `Postive`

event. Whereas the remaining event, `heart_disease = 'No'`

, is the `Negative`

event.

In our course tutorials, we will follow the model fitting process that is expected to be followed on the course analytics project. When fitting a classification model, whether logistic regression or a different type of algorithm, we will take the following steps:

- Split the data into a training and test set
- Specify a feature engineering pipeline with the
`recipes`

package - Specify a
`parsnip`

model object - Package your recipe and model into a workflow
- Fit your workflow to the training data
- If your model has hyperparameters, perform hyperparameter tuning - this will be covered next week

- Evaluate model performance on the test set by studying the confusion matrix, ROC curve, and other performance metrics

Let’s demonstrate this process using logistic regression and the `heart_df`

data.

The first step in modeling is to split our data into a training and test set. In the classification setting, we must also make sure that the response variable in our data set is a factor.

By default, `tidymodels`

maps the first level of a factor to the `Positive`

class while calculating performance metrics. Therefore, before we split our data and proceed to modeling, we need to make sure that the event we are trying to predict is the first level of our response variable.

For the `heart_df`

data, the event we are interested in predicting is `heart_disease = 'Yes'`

. We can use the `levels()`

function to check the current ordering of the levels of the `heart_disease`

variable.

`levels(heart_df$heart_disease)`

`[1] "yes" "no" `

Since ‘yes’ is the first level, we don’t need to take an further steps and can proceed to splitting our data.

In the code below, we use the `initial_split()`

function from `rsample`

to create our training and testing data using the `heart_df`

data.

```
## Always remember to set your seed
## Add an integer to the argument of set.seed()
set.seed(345)
heart_split <- initial_split(heart_df, prop = 0.75,
strata = heart_disease)
heart_training <- heart_split %>% training()
heart_test <- heart_split %>% testing()
```

The next step in the modeling process is to define our feature engineering steps. In the code below, we process our numeric predictors by removing skewness and normalizing, and create dummy variables from our `chest_pain`

predictor.

When creating a feature engineering pipeline, it’s important to exclude `prep()`

and `bake()`

because these will be implemented automatically in our workflow that is created at a later stage

```
heart_recipe <- recipe(heart_disease ~ ., data = heart_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
```

However, it is always good practice to check that the feature engineering recipe is doing what we expect. The code below processes our training data with `prep()`

and `bake()`

so that we can have a look at the results.

```
heart_recipe %>%
prep() %>%
bake(new_data = heart_training)
```

Next, we define our logistic regression model object. In this case, we use the `logistic_reg()`

function from `parsnip`

. Our engine is `glm`

and our mode is `classification.

```
logistic_model <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
```

Now we can combine our model object and recipe into a single workflow object using the `workflow()`

function.

```
heart_wf <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(heart_recipe)
```

Next we fit our workflow to the training data. This is done by passing our workflow object to the `fit()`

function

```
heart_logistic_fit <- heart_wf %>%
fit(data = heart_training)
```

Once we have trained our logistic regression model on our training data, we can optionally study variable importance with the `vip()`

function.

The first step is to extract the trained model from our workflow fit, `heart_logistic_fit`

. This can be done by passing `heart_logistic_fit`

to the `pull_workflow_fit()`

function.

```
heart_trained_model <- heart_logistic_fit %>%
pull_workflow_fit()
```

Next we pass `heart_trained_model`

to the `vip()`

function. This will return a `ggplot`

object with the variable importance scores from our model. The importance scores are based on the z-statistics associated with each predictor.

We see from the results below, that asymptomatic chest pain, maximum heart rate, and resting blood pressure, are the most important predictors of heart disease from our data set.

`vip(heart_trained_model)`

The next step in the modeling process is to assess the accuracy of our model on new data. This is done by obtaining predictions on our test data set with our trained model object, `heart_logistic_fit`

.

Before we can do this, we create a results data frame with the following data:

- The true response values from our test set
- The predicted response category for each row of our test data
- The estimated probabilities for each response category

All of this data can be put together using the `predict()`

function.

To obtain the predicted category for each row in our test set, we pass the `heart_logistic_fit`

object to the `predict`

function and specify `new_data = heart_test`

.

We will get a data frame with a column named `.pred_class`

which has the predicted category (yes/no) for each row of our test data set.

```
predictions_categories <- predict(heart_logistic_fit, new_data = heart_test)
predictions_categories
```

Next we need to obtain the estimated probabilities for each category of our response variable.

This is done with the same code as above but with the additional argument, `type = ‘prob’)

In this case we get a data frame with the following columns, `.pred_yes`

and `.pred_no`

. The `tidymodels`

package will always use the following convention for naming these columns `.pred_level_of_factor_in_response_variable`

```
predictions_probabilities <- predict(heart_logistic_fit, new_data = heart_test, type = 'prob')
predictions_probabilities
```

The final step is to combine the results from above with the true response variable values in our test data set.

```
# Combine
test_results <- heart_test %>% select(heart_disease) %>%
bind_cols(predictions_categories) %>%
bind_cols(predictions_probabilities)
test_results
```

The `yardstick`

package from `tidymodels`

has a number of functions for calculating performance metrics on the results of a machine learning algorithm.

Important function from this package include `conf_mat()`

, `f_meas()`

, `roc_curve()`

, `roc_auc()`

All of these functions take a data frame with the structure of our `test_results`

as the first argument. The input data frame must contain the three pieces of information mentioned at the beginning of this section:

- The true response values from our test set
- The predicted response category for each row of our test data
- The estimated probabilities for each response category

The first result to explore is usually the confusion matrix. The `conf_mat()`

function will produce one for us. It takes the following important arguments:

`data`

- the first argument is a data frame with model results (usually on the test set)`truth`

- a factor column with the true response categories`estimate`

- a factor column with the predicted response categories

The results of this function are a confusion matrix with the predicted categories in the rows and true categories in the columns.

By default, all `yardstick`

functions map the first level of the response variable to the positive class. The `conf_mat()`

function orders the output by displaying the positive class first in both the rows and columns.

Form the results below, we have 58 correct predictions on our test data set. Since we have 75 rows in our test data, this gives us an accuracy of 77%.

We have 9 false positives (we predicted the positive class (‘yes’) but the truth was ‘no’) and 8 false negatives.

`conf_mat(test_results, truth = heart_disease, estimate = .pred_class)`

```
Truth
Prediction yes no
yes 26 9
no 8 32
```

The F_{1} score is a performance metric that equally balances our false positive and false negative mistakes. The range of an F_{1} score is from 0 (worst) to 1 (best).

The `f_meas()`

function from `yardstick`

is used to calculate this metric. It takes the same arguments as `conf_mat()`

.

From the results below, we have an F_{1} score 0.75 on our test data results.

`f_meas(test_results, truth = heart_disease, estimate = .pred_class)`

The ROC curve is a way to visualize the performance of any classification model. The plot includes the sensitivity on the y-axis and (1 - specificity on the x-axis for all possible probability cut-off values.

The default probability cut-off value used by classification models is 0.5. But changing this can guard against either false positives or false negatives. The ROC curve plots all of this information in one plot.

**What to for**: the best ROC curve is as close as possible to the point (0, 1) that is at the top left corner of the plot. The closer the ROC curve is to that point throughout the entire range, the better the classification model.

The dashed line through the middle of the graph represents a model that is just guessing at random.

The first step in creating an ROC curve is to pass our results data frame to the `roc_curve()`

function. This function takes a data frame with model results, the `truth`

, and `estimate`

columns to produce a data frame with the specificy and sensitivity for all probability thresholds.

`roc_curve(test_results, truth = heart_disease, estimate = .pred_yes)`

To plot this data, we simply pass the results of `roc_curve()`

to the `autoplot()`

function.

```
roc_curve(test_results, truth = heart_disease, estimate = .pred_yes) %>%
autoplot()
```

Another important performance metric is the area under the ROC curve. This metric can be loosely interpreted as a letter grade.

In terms of model performance, an area under the ROC value between 0.9 - 1 indicates an “A”, 0.8 - 0.9 a “B”, and so forth. Anything below a 0.6 is an “F” and indicates poor model performance.

To calculate the area under the ROC curve, we use the `roc_auc()`

.

This function takes the results data frame as the first argument, the `truth`

column as the second argument, and the column of estimated probabilities for the positive class as the third argument.

From our results below, our model gets a “B”.

`roc_auc(test_results, truth = heart_disease, .pred_yes)`

It is also possible to create a custom metric set using the `metric_set()`

function. This function takes `yardstick`

function names as arguments and returns a new function that we can use to calculate that set of metrics.

In the code below we create a new function, `my_metrics()`

that will calculate the accuracy and F_{1} from my results data frame.

```
my_metrics <- metric_set(accuracy, f_meas)
my_metrics(test_results, truth = heart_disease, estimate = .pred_class)
```

Just like with linear regression, we can automate the process of fitting a logistic regression model by using the `last_fit()`

function. This will automatically give use the predictions and metrics on our test data set.

In the example below, we will fit the same model as above, but with `last_fit()`

instead of `fit()`

.

The `last_fit()`

function takes a workflow object as the first argument and a data split object as the second. It will trained the model on the training data and provide predictions and calculate metrics on the test set.

```
last_fit_model <- heart_wf %>%
last_fit(split = heart_split)
```

To obtain the metrics on the test set (accuracy and roc_auc by default) we use `collect_metrics()`

.

`last_fit_model %>% collect_metrics()`

We can also obtain a data frame with test set results by using the `collect_predictions()`

function.

```
last_fit_results <- last_fit_model %>%
collect_predictions()
last_fit_results
```

We can use this data frame to make an ROC plot by using `roc_curve()`

and `autoplot()`

.

```
last_fit_results %>%
roc_curve(truth = heart_disease, estimate = .pred_yes) %>%
autoplot()
```

Let’s go through one more example of fitting a logistic regression. This time we will predict whether employees will leave a company or not using the `employee_data`

data frame.

In this case, our event of interest is `left_company == 'Yes'`

. This is what we would like to map to the positive class when calculating our performance metrics.

The code below shows that we need to recode the `left_company`

column so that ‘Yes’ is the first level of the factor.

`levels(employee_data$left_company)`

`[1] "No" "Yes"`

The code below reorders the levels of the `left_company`

using the `factor()`

function. We just add ‘Yes’ as the first level in the `levels`

argument.

```
employee_data <- employee_data %>%
mutate(left_company = factor(left_company,
levels = c('Yes', 'No')))
levels(employee_data$left_company)
```

`[1] "Yes" "No" `

Now we can proceed to split our data with `initial_split()`

.

```
set.seed(314)
employee_split <- initial_split(employee_data, prop = 0.75,
strata = left_company)
employee_training <- employee_split %>% training()
employee_test <- employee_split %>% testing()
```

Next we create our feature engineering pipeline. In this case, we will perform the same steps as our prior feature engineering:

- Remove skewness from numeric predictors
- Normalize all numeric predictors
- Create dummy variables for all nominal predictors

```
employee_recipe <- recipe(left_company ~ ., data = employee_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
```

Let’s check to see if the feature engineering steps have been carried out correctly.

```
employee_recipe %>%
prep() %>%
bake(new_data = employee_training)
```

In this case, we will be using the same model object as before, `logistic_model`

.

```
employee_wf <- workflow() %>%
add_model(logistic_model) %>%
add_recipe(employee_recipe)
```

`last_fit()`

Finally we will train our model and estimate performance on our test data set using the `last_fit()`

function.

```
last_fit_employee <- employee_wf %>%
last_fit(split = employee_split)
```

To obtain the metrics on the test set (accuracy and roc_auc by default) we use `collect_metrics()`

. Based on area under the ROC curve, our model has an “A+”.

`last_fit_employee %>% collect_metrics()`

We can also obtain a data frame with test set results by using the `collect_predictions()`

function. Notice that that our column with estimated probabilities for the positive class (‘Yes’ in our factor `left_company`

) is now named `.pred_Yes`

’.

```
last_fit_employee <- last_fit_employee %>%
collect_predictions()
last_fit_employee
```

We can use this data frame to make an ROC plot by using `roc_curve()`

and `autoplot()`

.

```
last_fit_employee %>%
roc_curve(truth = left_company, estimate = .pred_Yes) %>%
autoplot()
```