This is an individual assignment and will be a chance for you to perform an applied data science project on a real data set.
We will be working with the loans_df data frame in this project. This data set contains information on over 4,000 individuals who secured a personal loan in 2017 from a national bank. A description of the data and its variables is provided below.
The objective of this project is to explore the factors that lead to loan default and develop a machine learning algorithm that will predict the likelihood of an applicant defaulting on their loan in the future.
To complete this assignment, students must download the R notebook template and open the file in their RStudio application. Please click the button below to download the template.
The loans_df data frame contains information on 3- and 5-year loans that were originated in 2017 by a national bank for customers residing in the Middle Atlantic and Northeast regions of the United States.
The company is looking to see if it can determine the factors that lead to loan default and whether it can predict if a customer will eventually default on their loan.
The bank has experienced record levels of customers defaulting on their loans in the past couple of years and this is leading to large financial losses.
The goal is to become better at identifying customers at risk of defaulting on their loans to minimize financial losses.
Specifically, the broad questions that the bank is trying to answer include:
The data set contains a mixture of applicant financial information (income, debt ratios, etc.) and applicant behavior (number of open accounts, historical engagement with the bank’s products, number of missed payments, etc.).
The response variable in this data is loan_default. This variable records whether an applicant eventually defaulted on their loan and indicates a financial loss to the bank.
Note: The response variable has been coded as a factor with ‘yes’ as the first level. This is the format that tidymodels expects for calculating model performance metrics. There is no need to recode this variable in your machine learning process.
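To see why the level order matters, here is a minimal sketch using a small hypothetical data frame (toy_results is made up for illustration and is not part of loans_df). It assumes the yardstick package, which tidymodels uses for performance metrics and which by default treats the first factor level as the event of interest.

```r
library(yardstick)

# Hypothetical toy predictions (NOT taken from loans_df)
toy_results <- data.frame(
  loan_default = factor(c("yes", "no", "yes", "no"), levels = c("yes", "no")),
  .pred_class  = factor(c("yes", "no", "no",  "no"), levels = c("yes", "no"))
)

# Confirm that 'yes' is the first level of the response
levels(toy_results$loan_default)

# Sensitivity is computed for the first level ('yes') by default:
# one of the two true defaults was predicted correctly, so .estimate is 0.5
sens(toy_results, truth = loan_default, estimate = .pred_class)
```

Running levels(loans_df$loan_default) in your own notebook is a quick way to confirm the coding before you start modeling.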
|Variable|Definition|Data Type|
|---|---|---|
|loan_default|Did the borrower default on their loan (yes/no)|Factor|
|installment|Monthly payment amount|Numeric|
|loan_purpose|Purpose of the loan|Factor|
|application_type|Loan application type (individual or joint)|Factor|
|term|Loan term (three/five year)|Factor|
|homeownership|Borrower(s) homeownership status|Factor|
|current_job_years|Years employed at current job|Numeric|
|debt_to_income|Debt-to-income ratio at application time|Numeric|
|total_credit_lines|Total number of open credit lines|Integer|
|years_credit_history|Years of credit history|Numeric|
|missed_payment_2_yr|History of missed payments in the last 2 years (yes/no)|Factor|
|history_bankruptcy|History of bankruptcy (yes/no)|Factor|
|history_tax_liens|History of tax liens (yes/no)|Factor|
In this section, you must think of at least 5 relevant questions that explore the relationship between loan_default and the other variables in the loans_df data set. The goal of your analysis should be discovering which variables drive the differences between customers who do and do not default on their loans.
You must answer each question and provide supporting data summaries with either a summary data frame (using dplyr or tidyr) or a plot (using ggplot) or both.
In total, you must have a minimum of 3 plots (created with ggplot) and 3 summary data frames (created with dplyr) for the exploratory data analysis section. Among the plots you produce, you must have at least 3 different types (e.g., box plot, bar chart, histogram, scatter plot).
See the example question below.
Note: To add an R code chunk to any section of your project, you can use the keyboard shortcut Ctrl + Alt + i (Cmd + Option + i on a Mac) or the insert button at the top of your R project template notebook file.
Are there differences in loan default rates by loan purpose?
Answer: Yes, the data indicates that credit card and medical loans have substantially higher default rates than any other type of loan. In fact, both of these loan types have default rates above 50%. This is nearly twice the average default rate for all other loan types.
    library(tidyverse)

    # Number of customers, number of defaults, and default rate (in percent) by loan purpose
    default_rates <- loans_df %>%
      group_by(loan_purpose) %>%
      summarise(n_customers = n(),
                customers_default = sum(loan_default == 'yes'),
                default_percent = 100 * mean(loan_default == 'yes'))

    # Bar chart of the default rate for each loan purpose
    ggplot(data = default_rates, mapping = aes(x = loan_purpose, y = default_percent)) +
      geom_bar(stat = 'identity', fill = '#006EA1', color = 'white') +
      labs(title = 'Loan Default Rate by Purpose of Loan',
           x = 'Loan Purpose',
           y = 'Default Percentage') +
      theme_light()
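A question about a numeric predictor could be approached in a similar way. The sketch below is only an illustration: the question (does debt-to-income differ by default status?) is hypothetical, and toy_loans is a small made-up tibble standing in for loans_df so that the code runs on its own. It pairs a dplyr summary data frame with a box plot, which would count as a second plot type.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows standing in for loans_df (values are made up)
toy_loans <- tibble(
  loan_default   = factor(c("yes", "yes", "yes", "no", "no", "no"),
                          levels = c("yes", "no")),
  debt_to_income = c(30, 26, 34, 12, 18, 15)
)

# Summary data frame: debt-to-income by default status
dti_summary <- toy_loans %>%
  group_by(loan_default) %>%
  summarise(n_customers = n(),
            median_dti  = median(debt_to_income),
            mean_dti    = mean(debt_to_income))

dti_summary

# Box plot of the same relationship
ggplot(toy_loans, aes(x = loan_default, y = debt_to_income, fill = loan_default)) +
  geom_boxplot() +
  labs(title = "Debt-to-Income Ratio by Loan Default Status",
       x = "Loan Default", y = "Debt-to-Income Ratio") +
  theme_light()
```

In your notebook, replace toy_loans with loans_df and write the answer from what the real summary and plot show.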
In this section of the project, you will fit two classification algorithms to predict the response variable, loan_default. You should use all of the other variables in the loans_df data as predictor variables for each model.
You must follow the machine learning steps below.
The data splitting and feature engineering steps should only be done once so that your models are using the same data and feature engineering steps for training.
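Split the loans_df data into a training and test set (remember to set your seed)
Create cross-validation folds with vfold_cv() (remember to set your seed)
If you tune hyperparameters with a random grid search, be careful not to make the size argument of grid_random() too large. I recommend size = 10 or smaller.
Select the best model with select_best() and finalize your workflow
Evaluate model performance by plotting an ROC curve with autoplot() and calculating the area under the ROC curve on your test data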
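The steps above can be sketched end to end as follows. This is one possible layout, not the required solution. Several choices are assumptions made for illustration: toy_loans is simulated data standing in for loans_df, the 75/25 split, the 5 folds, the recipe steps, the decision tree and logistic regression models, and the use of last_fit() to score the test set.

```r
library(tidymodels)

# Simulated stand-in for loans_df (NOT the real data)
set.seed(314)
n <- 400
toy_loans <- tibble(
  debt_to_income = runif(n, 0, 40),
  installment    = runif(n, 50, 900),
  term           = factor(sample(c("three_year", "five_year"), n, replace = TRUE))
) %>%
  mutate(loan_default = factor(
    ifelse(runif(n) < plogis(-2 + 0.08 * debt_to_income), "yes", "no"),
    levels = c("yes", "no")))

# Split once; both models share the same training and test sets
set.seed(314)
loans_split <- initial_split(toy_loans, prop = 0.75, strata = loan_default)
loans_train <- training(loans_split)

# Feature engineering, also specified once and reused by both models
loans_recipe <- recipe(loan_default ~ ., data = loans_train) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Model 1: decision tree with hyperparameters marked for tuning
tree_model <- decision_tree(cost_complexity = tune(),
                            tree_depth = tune(),
                            min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wf <- workflow() %>%
  add_recipe(loans_recipe) %>%
  add_model(tree_model)

# Cross-validation folds built from the training data
set.seed(314)
loans_folds <- vfold_cv(loans_train, v = 5, strata = loan_default)

# Small random grid (size = 10) to keep tuning time manageable
set.seed(314)
tree_grid <- grid_random(cost_complexity(), tree_depth(), min_n(), size = 10)

tree_tuning <- tune_grid(tree_wf, resamples = loans_folds, grid = tree_grid,
                         metrics = metric_set(roc_auc))

# Keep the best combination and finalize the workflow
best_tree     <- select_best(tree_tuning, metric = "roc_auc")
final_tree_wf <- finalize_workflow(tree_wf, best_tree)

# Fit on the training data and predict on the test data
tree_last_fit <- last_fit(final_tree_wf, split = loans_split)
test_preds    <- collect_predictions(tree_last_fit)

# ROC curve and area under the curve on the test set
roc_curve(test_preds, truth = loan_default, .pred_yes) %>% autoplot()
roc_auc(test_preds, truth = loan_default, .pred_yes)

# Model 2 (no tuning needed): reuse the same split and recipe
logistic_wf <- workflow() %>%
  add_recipe(loans_recipe) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

logistic_last_fit <- last_fit(logistic_wf, split = loans_split)
collect_metrics(logistic_last_fit)
```

Because both workflows share loans_split and loans_recipe, their test-set ROC AUC values can be compared directly.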
Write a summary of your overall findings and recommendations to the executives at the bank. Think of this section as the closing remarks of a presentation, where you summarize your key findings and model performance and make recommendations to improve loan processes at the bank.
Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:
An introduction where you explain the business problem and goals of your data analysis
What problem(s) is this company trying to solve? Why are they important to their future success?
What was the goal of your analysis? What questions were you trying to answer and why do they matter?
What were the interesting findings from your analysis and why are they important for the business?
This section is meant to establish the need for your recommendations in the following section
Your recommendations to the company on how to reduce loan default rates
Each recommendation must be supported by your data analysis results
You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation
You must also describe the potential business impact of your recommendation:
Why is this a good recommendation?
What benefits will the business achieve?