In this tutorial, we will learn about classification with discriminant analysis and the K-nearest neighbors (KNN) algorithm. KNN can be used for both regression and classification, and it will serve as our first example of hyperparameter tuning. We will use two data sets, churn_df and home_sales, to demonstrate these algorithms.


Please click the button below to open an interactive version of all course R tutorials through RStudio Cloud.

Note: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.



Click the button below to launch an interactive RStudio environment using Binder.org. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch, and you will not be able to save your work.





The code below will load the required packages and data sets for this tutorial. We will need a new package for this lesson, discrim. This package is part of tidymodels and serves as a general interface to discriminant analysis algorithms in R.

When installing discrim, you will also need to install the klaR package.
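If either package is not yet installed, a sketch like the one below installs only what is missing. It assumes a CRAN mirror is already configured for your R session; the variable names needed and missing_pkgs are illustrative and not part of the course materials.

```r
# Packages this lesson needs beyond tidyverse and tidymodels
needed <- c("discrim", "klaR")

# requireNamespace() returns FALSE when a package is not installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Install only the packages that are missing (assumes a CRAN mirror is set)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Run this once before the library() calls; afterwards the packages load as usual.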


library(tidyverse)
library(tidymodels)
library(discrim) # for discriminant analysis

# Telecommunications customer churn data
churn_df <- read_rds(url('https://gmudatamining.com/data/churn_data.rds'))


# Seattle home sales
home_sales <- read_rds(url('https://gmudatamining.com/data/home_sales.rds')) %>% 
              select(-selling_date)



Data

We will be working with the churn_df and home_sales data frames in this lesson.

Take a moment to explore these data sets below.



Telecommunication Customer Churn


A row in this data frame represents a customer at a telecommunications company. Each customer has purchased phone and internet services from this company.

The response variable in this data set is canceled_service, which indicates whether the customer terminated their services.
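Before fitting a classifier, it can help to check how the two outcomes of canceled_service are distributed. The sketch below is one way to do this; it reloads the data so that it runs on its own, and the class_counts and proportion names are our own choices, not part of the data set.

```r
library(tidyverse)

# Telecommunications customer churn data (same source as above)
churn_df <- read_rds(url('https://gmudatamining.com/data/churn_data.rds'))

# Column names and types at a glance
glimpse(churn_df)

# Number and share of customers in each response category
class_counts <- churn_df %>%
  count(canceled_service) %>%
  mutate(proportion = n / sum(n))

class_counts
```

A large imbalance between the categories would be worth keeping in mind when we later evaluate model accuracy.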


Seattle Home Sales


A row in this data frame represents a home that was sold in the Seattle area between 2014 and 2015.

The response variable in this data is selling_price.
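Since selling_price will be the outcome in our regression examples, a quick numeric summary is a reasonable first look. The sketch below assumes selling_price is stored as a numeric column; the summary column names (n_homes, min_price, and so on) are illustrative.

```r
library(tidyverse)

# Seattle home sales (same source as above, dropping selling_date)
home_sales <- read_rds(url('https://gmudatamining.com/data/home_sales.rds')) %>%
  select(-selling_date)

# Column names and types at a glance
glimpse(home_sales)

# Basic summary of the response variable
price_summary <- home_sales %>%
  summarise(n_homes      = n(),
            min_price    = min(selling_price, na.rm = TRUE),
            median_price = median(selling_price, na.rm = TRUE),
            max_price    = max(selling_price, na.rm = TRUE))

price_summary
```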