Before starting this assignment, please download R and RStudio Desktop on your computer. Both are open-source and free to use.

Detailed installation instructions can be found here

To complete this assignment, students must download the R notebook template and open the file in their RStudio application. Please click the button below to download the template.

After completing the assignment, please upload the template (.Rmd file) to Blackboard as your submission.




Introduction

This semester we will be working with a data set from the field of Human Resources Analytics.

Broadly speaking, this field is concerned with using employee data within a company to optimize objectives such as employee satisfaction, productivity, project management, and most commonly, avoiding employee attrition.

Ideally, companies would like to keep attrition rates (the proportion of employees leaving a company for other opportunities) as low as possible due to the variable costs and business disruptions that come with having to replace productive employees on short notice.

The objective of this project is to perform an exploratory data analysis on the employee_data data set to uncover potential solutions for minimizing employee attrition rates.



Employee Attrition Data

The employee_data data frame is loaded below and consists of 1,470 employee records for a U.S.-based product company. Each row in this data frame represents one employee at this company, described by the variables listed in the table below.


library(tidyverse)

employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))



Variable Information

| Variable | Definition | Data Type |
|---|---|---|
| left_company | Did the employee leave the company? (Yes/No) | Factor |
| department | Department within the company | Factor |
| job_level | Job Level (Associate - Vice President) | Factor |
| salary | Employee yearly salary (US Dollars) | Numeric |
| weekly_hours | Self-reported average weekly hours spent on the job (company survey) | Numeric |
| business_travel | Level of required business travel | Factor |
| yrs_at_company | Tenure at the company (years) | Numeric |
| yrs_since_promotion | Years since last promotion | Numeric |
| previous_companies | Number of previous companies for which the employee has worked | Numeric |
| job_satisfaction | Self-reported job satisfaction (company survey) | Factor |
| performance_rating | Most recent annual performance rating | Factor |
| marital_status | Marital status (Single, Married, or Divorced) | Factor |
| miles_from_home | Distance from employee address to office location | Numeric |



Raw Data

employee_data



Exploratory Data Analysis


Executives at this company have hired you as a data science consultant to identify the factors that lead to employees leaving their company.

They would like for you to explore why employees are leaving their company and make recommendations on how to minimize this behavior.

You must think of at least 5 relevant questions that explore the relationship between left_company and the other variables in the employee_data data frame.

The goal of your analysis should be discovering which variables drive the differences between employees who do and do not leave the company.

You must answer each question and provide supporting data summaries with either a summary data frame (using dplyr/tidyr) or a plot (using ggplot) or both.

In total, you must have a minimum of 3 plots and 3 summary data frames for the exploratory data analysis section. Among the plots you produce, you must have at least 3 different types (e.g., box plot, bar chart, histogram, heat map).
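As a rough illustration of what a summary data frame paired with a bar chart might look like for a categorical variable, here is a minimal sketch. The helper name attrition_rate_by and the toy_data rows are hypothetical and made up for this example; they are not part of the assignment template. The sketch also assumes left_company uses the levels "Yes" and "No", as described in the variable table.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper (not part of the template): number of employees,
# number who left, and proportion who left for each level of group_var.
attrition_rate_by <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(n_employees = n(),
              n_left = sum(left_company == "Yes"),
              pct_left = mean(left_company == "Yes"),
              .groups = "drop")
}

# Made-up toy rows, used only so this sketch runs on its own.
toy_data <- tibble(
  left_company = factor(c("Yes", "Yes", "No", "Yes", "No", "No")),
  department   = factor(c("Sales", "Sales", "Sales", "IT", "IT", "IT"))
)

dept_summary <- attrition_rate_by(toy_data, department)
dept_summary

# Bar chart of the proportion of employees who left, per department
ggplot(dept_summary, aes(x = department, y = pct_left)) +
  geom_col() +
  labs(title = "Attrition Rate by Department (toy data)",
       x = "Department", y = "Proportion of Employees Who Left")
```

In the notebook, the same helper could be called on the real data, for example attrition_rate_by(employee_data, department), once employee_data has been loaded.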

Each question must be answered with supporting evidence from your tables and plots. See the example question below.



Sample Question

Is there a relationship between employees leaving the company and their current salary?


Answer: Yes, the data indicates that employees who leave the company tend to have lower salaries when compared to employees who do not. Among the 237 employees that left the company, the average salary was $76,625. This is over $20,000 less than the average salary of employees who did not leave the company.

Among the employees who did not leave the company, only 10% have a salary that is less than or equal to $60,000. Among employees who did leave the company, this increases to 34%.

Summary Table

employee_data %>% group_by(left_company) %>% 
                  summarise(n_employees = n(),
                            min_salary = min(salary),
                            avg_salary = mean(salary),
                            max_salary = max(salary),
                            sd_salary = sd(salary),
                            pct_less_60k = mean(salary <= 60000))
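The pct_less_60k column works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch with made-up salaries (these numbers are illustrative and are not taken from employee_data):

```r
# Made-up salaries, for illustration only
salaries <- c(55000, 60000, 72000, 98000)

# Logical vector: TRUE where the salary is at most $60,000
at_or_below_60k <- salaries <= 60000

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values
pct_at_or_below_60k <- mean(at_or_below_60k)
pct_at_or_below_60k
```

Here two of the four made-up salaries meet the condition, so the result is 0.5. Inside summarise() with group_by(left_company), the same calculation is carried out separately for each group.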

Data Visualization

# Density histogram of salary, one panel per attrition status.
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0).
ggplot(data = employee_data, aes(x = salary, fill = left_company)) + 
   geom_histogram(aes(y = after_stat(density)), color = "white", bins = 20) +
   facet_wrap(~ left_company, nrow = 2) +
   labs(title = "Employee Salary Distribution by Status (Left the Company - Yes/No)",
           x = "Salary (US Dollars)", y = "Density")
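A histogram is only one way to view this relationship. As a rough sketch of a second plot type, the same comparison could be drawn as a box plot. The helper name salary_boxplot and the toy_data rows below are hypothetical and made up so the example runs on its own; they are not part of the assignment template.

```r
library(ggplot2)

# Hypothetical helper (not part of the template): box plot of salary
# for employees who did and did not leave the company.
salary_boxplot <- function(data) {
  ggplot(data, aes(x = left_company, y = salary, fill = left_company)) +
    geom_boxplot() +
    labs(title = "Employee Salary by Status (Left the Company - Yes/No)",
         x = "Left the Company", y = "Salary (US Dollars)")
}

# Made-up toy rows, for illustration only
toy_data <- data.frame(
  left_company = factor(c("Yes", "Yes", "Yes", "No", "No", "No")),
  salary       = c(52000, 61000, 75000, 68000, 90000, 104000)
)

p <- salary_boxplot(toy_data)
p
```

In the notebook, salary_boxplot(employee_data) would draw the same chart for the real data. A box plot makes the medians and spread of the two groups easy to compare side by side, while the histogram shows the full shape of each distribution.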



Summary of Results

Write an executive summary of your overall findings and recommendations to the executives at this company. Think of this section as your closing remarks of a presentation, where you summarize your key findings and make recommendations to improve HR processes at the company.

Your executive summary must be written in a professional tone, with minimal grammatical errors, and should include the following sections:

  1. An introduction where you explain the business problem and goals of your data analysis

    • What problem(s) is this company trying to solve? Why are they important to their future success?

    • What was the goal of your analysis? What questions were you trying to answer and why do they matter?


  2. Highlights and key findings from your Exploratory Data Analysis section

    • What were the interesting findings from your analysis and why are they important for the business?

    • This section is meant to establish the need for your recommendations in the following section


  3. Your recommendations to the company on how to reduce employee attrition rates

    • Each recommendation must be supported by your data analysis results

    • You must clearly explain why you are making each recommendation and which results from your data analysis support this recommendation

    • You must also describe the potential business impact of your recommendation:

      • Why is this a good recommendation?

      • What benefits will the business achieve?