In this tutorial, we will cover more advanced techniques for data manipulation with dplyr.

Please click the button below to open an interactive version of all course R tutorials through RStudio Cloud.

Note: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.



Click the button below to launch a pre-configured RStudio environment within your browser using Binder.org. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch, and you will not be able to save your work.





First let’s load the tidyverse and skimr packages as well as the employee_data tibble into our R environment.


# This will load all 8 of the tidyverse packages
library(tidyverse)
library(skimr)

# Load employee_data
employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))

# View data
employee_data



Review

Before we begin our study of advanced techniques for data manipulation, let’s review the core dplyr functions from our last tutorial. Let’s say that we are interested in answering the following question:

Do employees who have left the company tend to work more weekly hours, on average?


One way to answer this question would be to execute the following dplyr workflow:

  • start with employee_data
  • create a new variable that standardizes the weekly_hours column by subtracting the mean and dividing by the standard deviation
  • select only the rows of employee_data where left_company is “Yes”
  • calculate the average of the standardized weekly hours variable


Remember that when we standardize a numeric variable by subtracting the mean and dividing by the standard deviation, the resulting values are in standard deviation units.

So a 2.2 value on this scale represents a value that is 2.2 standard deviations to the right of the average value in the original data. A value of 0 on this scale represents the average value in the original data.
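As a quick illustration, the toy example below (with made-up values) standardizes a small vector.

# Standardizing a made-up vector of values
hours <- c(40, 45, 50, 55, 60)

(hours - mean(hours)) / sd(hours)
# The mean value (50) maps to 0; values above the mean map to positive
# standard deviation units and values below the mean to negative ones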

The code below executes our analytics workflow, and we see that employees who left the company tend to work more weekly hours on average. In fact, these employees work nearly 2 standard deviations to the right of the average hours across all employees in the company.


employee_data %>% 
  mutate(hours_scaled = (weekly_hours - mean(weekly_hours)) / sd(weekly_hours)) %>% 
  filter(left_company == 'Yes') %>% 
  summarise(avg_weekly_hours = mean(hours_scaled))



Note that we standardized our data before selecting employees that left the company.

  • What if we selected the employees who left and then created our standardized weekly hours variable?
  • Would this have answered our question?


Try to perform this with dplyr and see what you get. What does the result mean in this case?
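If you would like to check your answer, below is a sketch of that reversed workflow.

# Filter first, then standardize: mean() and sd() are now computed
# using only the employees who left, so the average of hours_scaled
# is 0 by construction
employee_data %>% 
  filter(left_company == 'Yes') %>% 
  mutate(hours_scaled = (weekly_hours - mean(weekly_hours)) / sd(weekly_hours)) %>% 
  summarise(avg_weekly_hours = mean(hours_scaled))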



Split-Apply-Combine

A common data analysis technique, known as split-apply-combine, involves creating statistical summaries by groups within a data frame.

Using the split-apply-combine technique with dplyr usually involves taking a data frame, forming subsets with the group_by() function, applying a summary function to the groups, and collecting the results into a single data frame.

group_by()

The group_by() function allows you to perform tasks by groups in your data. A typical application of the group_by() function is when addressing questions such as What is the average salary of employees by department?

The group_by() function will separate a data frame into distinct groups using the variable(s) that is/are supplied as the argument. Behind the scenes, the group_by() function splits a data frame into multiple data frames, one for each unique value of the grouping variable. Any dplyr functions that follow afterward are applied to the groups within the grouped data frame.

Below is an example of how we would answer the question from above.


employee_data %>% 
  group_by(department) %>% # form groups by department
  summarise(average_salary = mean(salary)) # calculate mean within each group



The group_by() function can be used to form groups with multiple levels. This happens when we pass more than one variable into the function.

Let’s say we are interested in answering the following question: What is the average salary of employees by department and whether they left the company or not?

The small adjustment in the code below will answer this for us.


employee_data %>% 
  group_by(left_company, department) %>% # form groups by department and left_company
  summarise(average_salary = mean(salary)) # calculate mean within each combination of values



Test Your Knowledge

What dplyr commands would you need to add to the previous code so that results are sorted as they appear below?
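The expected output is not reproduced here, but as a hint, sorting a summary is done by piping the result into the arrange() function. Below is a sketch of one possibility (the pictured sort order is an assumption).

# One possible ordering: by department, then by left_company
employee_data %>% 
  group_by(left_company, department) %>% 
  summarise(average_salary = mean(salary)) %>% 
  arrange(department, left_company)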




The group_by() function can also be used in combination with the mutate() function.

Imagine if we wanted to accomplish the following:

Create a new salary_scaled_dept variable that scales the salary variable by subtracting the mean and dividing by the standard deviation. However, unlike in our previous example, we want to scale each salary value within each department.

This task is accomplished by the code below. First we group employee_data by the unique values in the department variable and then apply the transformation. Within mutate(), the function mean(salary) is calculating the mean salary value within each group.


employee_data %>% 
  group_by(department) %>% 
  mutate(salary_scaled_dept = (salary - mean(salary)) / sd(salary)) %>% 
  select(department, salary_scaled_dept)



What if we want to create a salary_scaled variable, like in our review example, as well? Maybe we would like to see how the standardized values differ across all employees as well as within each department.

To achieve this, we must add a step to our code above.

First we use mutate() to create a salary_scaled variable across all rows. We then pass this into a group_by() function where we perform the same calculation by group. This complex set of transformations is easy to follow when we use the %>% operator.


employee_data %>% 
  mutate(salary_scaled = (salary - mean(salary)) / sd(salary)) %>% # scale across all rows
  group_by(department) %>% # form groups by department
  mutate(salary_scaled_dept = (salary - mean(salary)) / sd(salary)) %>% # scale within each group
  select(department, salary_scaled, salary_scaled_dept)



Exploring Grouped Data Frames with skim()

In lesson 2, we explored data frames with the skim() function from the skimr package.

The skim() function takes a data frame as input and provides a high-level summary of the structure of the data frame and the values within its columns.

As output, we get:

  • the number of rows and columns along with the data types present in our data
  • a summary of our factor variables, including the number of missing observations, whether our factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels
  • a summary of our numeric variables, including the number of missing observations, the mean and standard deviation, a five-number summary, and a small histogram of the distribution of values


An added bonus of the skim() function is that it also accepts a grouped data frame. In this case all summary statistics will be calculated by group.

In the code below, we use the %>% operator to group the employee_data tibble by department and pass it into the skim() function. Notice that all summaries are now provided by each department.


# View data frame summaries by group

employee_data %>% group_by(department) %>% 
  skim()
Data summary
Name Piped data
Number of rows 1470
Number of columns 13
_______________________
Column type frequency:
factor 6
numeric 6
________________________
Group variables department

Variable type: factor

skim_variable department n_missing complete_rate ordered n_unique top_counts
left_company Marketing 0 1 FALSE 2 No: 198, Yes: 40
left_company Sales 0 1 FALSE 2 No: 175, Yes: 77
left_company Research 0 1 FALSE 2 No: 284, Yes: 9
left_company Product Development 0 1 FALSE 2 No: 128, Yes: 50
left_company IT and Analytics 0 1 FALSE 2 No: 369, Yes: 30
left_company Finance and Operations 0 1 FALSE 2 No: 79, Yes: 31
job_level Marketing 0 1 FALSE 5 Sen: 81, Man: 59, Dir: 51, Ass: 28
job_level Sales 0 1 FALSE 5 Sen: 73, Man: 63, Dir: 56, Ass: 41
job_level Research 0 1 FALSE 5 Sen: 101, Dir: 74, Man: 64, Vic: 32
job_level Product Development 0 1 FALSE 5 Sen: 57, Man: 39, Dir: 39, Ass: 28
job_level IT and Analytics 0 1 FALSE 5 Sen: 131, Man: 90, Dir: 88, Ass: 51
job_level Finance and Operations 0 1 FALSE 5 Sen: 33, Man: 29, Dir: 23, Ass: 15
business_travel Marketing 0 1 FALSE 3 Rar: 173, Fre: 37, Non: 28
business_travel Sales 0 1 FALSE 3 Rar: 176, Fre: 49, Non: 27
business_travel Research 0 1 FALSE 3 Rar: 220, Fre: 50, Non: 23
business_travel Product Development 0 1 FALSE 3 Rar: 111, Fre: 45, Non: 22
business_travel IT and Analytics 0 1 FALSE 3 Rar: 278, Fre: 78, Non: 43
business_travel Finance and Operations 0 1 FALSE 3 Rar: 85, Fre: 18, Non: 7
job_satisfaction Marketing 0 1 FALSE 4 Ver: 77, Hig: 70, Low: 49, Med: 42
job_satisfaction Sales 0 1 FALSE 4 Hig: 87, Ver: 69, Low: 49, Med: 47
job_satisfaction Research 0 1 FALSE 4 Hig: 95, Ver: 95, Low: 58, Med: 45
job_satisfaction Product Development 0 1 FALSE 4 Hig: 59, Ver: 46, Med: 38, Low: 35
job_satisfaction IT and Analytics 0 1 FALSE 4 Ver: 132, Hig: 105, Med: 92, Low: 70
job_satisfaction Finance and Operations 0 1 FALSE 4 Ver: 40, Low: 28, Hig: 26, Med: 16
performance_rating Marketing 0 1 FALSE 5 Mee: 95, Exc: 80, Exc: 39, Min: 20
performance_rating Sales 0 1 FALSE 5 Mee: 95, Exc: 84, Exc: 41, Min: 27
performance_rating Research 0 1 FALSE 5 Mee: 104, Exc: 99, Exc: 75, Min: 12
performance_rating Product Development 0 1 FALSE 5 Mee: 69, Exc: 52, Min: 25, Exc: 24
performance_rating IT and Analytics 0 1 FALSE 5 Mee: 141, Exc: 129, Exc: 87, Min: 34
performance_rating Finance and Operations 0 1 FALSE 5 Mee: 42, Exc: 28, Exc: 20, Min: 18
marital_status Marketing 0 1 FALSE 3 Mar: 101, Sin: 82, Div: 55
marital_status Sales 0 1 FALSE 3 Mar: 119, Sin: 81, Div: 52
marital_status Research 0 1 FALSE 3 Mar: 134, Sin: 84, Div: 75
marital_status Product Development 0 1 FALSE 3 Mar: 86, Sin: 57, Div: 35
marital_status IT and Analytics 0 1 FALSE 3 Mar: 187, Sin: 124, Div: 88
marital_status Finance and Operations 0 1 FALSE 3 Mar: 46, Sin: 42, Div: 22

Variable type: numeric

skim_variable department n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salary Marketing 0 1 93265.58 36514.10 30559.08 72344.72 87856.66 117603.05 206581.5 ▂▇▃▁▁
salary Sales 0 1 90148.67 37746.84 30488.15 66664.75 85180.55 116104.17 197965.1 ▅▇▅▁▂
salary Research 0 1 99425.10 37606.81 33276.54 73047.08 91205.32 119657.44 203528.9 ▂▇▅▁▂
salary Product Development 0 1 92429.42 37322.31 32443.79 68360.73 87066.84 114993.62 195345.4 ▅▇▅▁▂
salary IT and Analytics 0 1 93897.64 37367.83 29848.56 70115.29 88521.84 116088.12 211621.0 ▃▇▃▁▁
salary Finance and Operations 0 1 93893.34 39945.11 37115.29 70317.59 88760.26 113516.54 212134.7 ▆▇▃▁▂
weekly_hours Marketing 0 1 49.96 4.81 41.00 47.00 49.00 52.00 66.0 ▃▇▂▂▁
weekly_hours Sales 0 1 51.58 5.81 40.00 47.00 50.00 56.00 65.0 ▂▇▃▃▂
weekly_hours Research 0 1 48.53 3.57 40.00 46.00 48.00 50.00 62.0 ▂▇▇▁▁
weekly_hours Product Development 0 1 50.89 5.39 42.00 46.00 49.50 55.00 64.0 ▇▇▆▃▂
weekly_hours IT and Analytics 0 1 49.23 3.93 40.00 47.00 49.00 51.00 64.0 ▂▇▆▁▁
weekly_hours Finance and Operations 0 1 51.98 5.37 43.00 48.00 51.00 55.00 66.0 ▅▇▃▃▁
yrs_at_company Marketing 0 1 6.95 6.01 0.00 3.00 5.00 9.75 36.0 ▇▃▁▁▁
yrs_at_company Sales 0 1 6.86 6.37 0.00 2.00 5.00 9.00 40.0 ▇▂▁▁▁
yrs_at_company Research 0 1 7.13 6.21 0.00 3.00 5.00 9.00 33.0 ▇▅▁▁▁
yrs_at_company Product Development 0 1 6.79 6.15 0.00 2.00 5.00 9.75 33.0 ▇▅▁▁▁
yrs_at_company IT and Analytics 0 1 7.30 6.25 0.00 3.00 5.00 10.00 36.0 ▇▃▁▁▁
yrs_at_company Finance and Operations 0 1 6.48 5.06 0.00 3.00 5.00 9.00 21.0 ▇▅▃▂▁
yrs_since_promotion Marketing 0 1 2.29 3.25 0.00 0.00 1.00 3.00 15.0 ▇▁▁▁▁
yrs_since_promotion Sales 0 1 2.12 3.17 0.00 0.00 1.00 3.00 15.0 ▇▁▁▁▁
yrs_since_promotion Research 0 1 2.24 3.48 0.00 0.00 1.00 2.00 15.0 ▇▁▁▁▁
yrs_since_promotion Product Development 0 1 1.87 2.72 0.00 0.00 1.00 2.00 13.0 ▇▂▁▁▁
yrs_since_promotion IT and Analytics 0 1 2.30 3.28 0.00 0.00 1.00 2.50 15.0 ▇▁▁▁▁
yrs_since_promotion Finance and Operations 0 1 2.07 3.13 0.00 0.00 1.00 2.00 13.0 ▇▁▁▁▁
previous_companies Marketing 0 1 3.17 1.53 1.00 2.00 3.00 4.00 7.0 ▇▇▂▂▂
previous_companies Sales 0 1 3.41 1.58 1.00 2.00 3.00 5.00 7.0 ▇▇▂▃▃
previous_companies Research 0 1 3.05 1.51 1.00 2.00 3.00 3.00 7.0 ▇▇▁▂▂
previous_companies Product Development 0 1 3.57 1.68 1.00 2.00 3.00 5.00 7.0 ▆▇▂▃▃
previous_companies IT and Analytics 0 1 3.09 1.56 1.00 2.00 3.00 3.00 7.0 ▇▇▁▁▂
previous_companies Finance and Operations 0 1 3.46 1.62 1.00 2.00 3.00 5.00 7.0 ▆▇▁▂▃
miles_from_home Marketing 0 1 8.89 8.11 1.00 2.00 7.00 12.00 29.0 ▇▅▁▁▂
miles_from_home Sales 0 1 9.85 8.17 1.00 2.00 8.00 15.25 29.0 ▇▅▂▂▂
miles_from_home Research 0 1 9.37 7.83 1.00 3.00 7.00 14.00 29.0 ▇▅▂▂▂
miles_from_home Product Development 0 1 8.98 8.10 1.00 2.00 7.00 12.75 29.0 ▇▅▁▂▂
miles_from_home IT and Analytics 0 1 9.04 8.45 1.00 2.00 6.00 14.50 29.0 ▇▃▁▂▂
miles_from_home Finance and Operations 0 1 8.79 7.43 1.00 2.00 7.00 12.00 29.0 ▇▆▂▂▁



As before, we can select a subset of columns to explore by typing their names within the skim() function.


# Select a subset of variables

employee_data %>% group_by(department) %>% 
  skim(left_company, salary)
Data summary
Name Piped data
Number of rows 1470
Number of columns 13
_______________________
Column type frequency:
factor 1
numeric 1
________________________
Group variables department

Variable type: factor

skim_variable department n_missing complete_rate ordered n_unique top_counts
left_company Marketing 0 1 FALSE 2 No: 198, Yes: 40
left_company Sales 0 1 FALSE 2 No: 175, Yes: 77
left_company Research 0 1 FALSE 2 No: 284, Yes: 9
left_company Product Development 0 1 FALSE 2 No: 128, Yes: 50
left_company IT and Analytics 0 1 FALSE 2 No: 369, Yes: 30
left_company Finance and Operations 0 1 FALSE 2 No: 79, Yes: 31

Variable type: numeric

skim_variable department n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salary Marketing 0 1 93265.58 36514.10 30559.08 72344.72 87856.66 117603.1 206581.5 ▂▇▃▁▁
salary Sales 0 1 90148.67 37746.84 30488.15 66664.75 85180.55 116104.2 197965.1 ▅▇▅▁▂
salary Research 0 1 99425.10 37606.81 33276.54 73047.08 91205.32 119657.4 203528.9 ▂▇▅▁▂
salary Product Development 0 1 92429.42 37322.31 32443.79 68360.73 87066.84 114993.6 195345.4 ▅▇▅▁▂
salary IT and Analytics 0 1 93897.64 37367.83 29848.56 70115.29 88521.84 116088.1 211621.0 ▃▇▃▁▁
salary Finance and Operations 0 1 93893.34 39945.11 37115.29 70317.59 88760.26 113516.5 212134.7 ▆▇▃▁▂



Counting rows with n() and count()

The group_by() function is also used for obtaining counts of factor or character variable values.

The n() function is used to count the number of rows within a data frame. It can be used within either summarise() or mutate(). If a grouped data frame is used, the n() function will tally the number of rows within each level of the grouping variable.

Unlike other dplyr functions, n() takes no arguments. In the example below, we group employee_data by left_company and use n() within summarise() to count the number of rows within each level of left_company. This will tally the number of “Yes” and “No” values within left_company.


# The n() function in dplyr returns the number of rows per group
employee_data %>% group_by(left_company) %>% 
                  summarise(number_employees = n())


This type of summary is so common that dplyr provides a function named count() to produce counts by the levels of a single variable or by combinations of multiple variables. This function creates a column named n with the counts. The count() function is best used with factor or character columns that do not have too many unique values.


# Count the occurrence of each unique value in department
employee_data %>% count(department)


# Counts can be sorted
employee_data %>% count(department, sort = TRUE)


# Count variable can be renamed
employee_data %>% count(department, sort = TRUE, name = 'number_of_employees')



The count() function also takes multiple variables as arguments. This will count all of the combinations of variable values that are present in the data.

# The count() function also takes multiple arguments
employee_data %>% count(left_company, department, name = 'number_of_employees')


Keep in mind that all dplyr functions return a data frame, so we can easily use other dplyr functions on the results of count().

# Arrange the output by department and left_company
employee_data %>% count(left_company, department, name = 'number_of_employees') %>% 
  arrange(department, left_company)



One more example. How many employees are there by job satisfaction, and what is the average distance from home within each category?

employee_data %>% group_by(job_satisfaction) %>% 
                  summarise(number_of_employees = n(),
                            avg_miles = mean(miles_from_home))



Can you figure out why the last code chunk in this section doesn’t work?

Hint: highlight and execute the code below on its own.

employee_data %>% count(job_satisfaction, name = 'number_of_employees')


What does this give you?

This is what is being passed to summarise() by the second %>% operator in the expression below.

Why doesn’t summarise() know how to calculate the mean of the miles_from_home variable?


employee_data %>% count(job_satisfaction, name = 'number_of_employees') %>% 
                  summarise(avg_miles = mean(miles_from_home))
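To spell out the diagnosis (this is our explanation of the hint above): count() returns a tibble containing only the counting variable and the count column, so miles_from_home no longer exists by the time summarise() runs.

# count() keeps only job_satisfaction and number_of_employees
employee_data %>% count(job_satisfaction, name = 'number_of_employees') %>% 
                  colnames()
# [1] "job_satisfaction"    "number_of_employees"

The fix is to compute both summaries together with group_by() and summarise(), as in the example above.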



A Warning on Working With group_by()

When using group_by(), it’s important to know that all subsequent summarise() and mutate() calls are applied by group. If you need to apply functions to all rows of a data frame and not by groups, then use the ungroup() function to un-group the input data frame.

Let’s say that we want to create a data frame that has employees grouped by left_company and job_level. For each group, we want the number of employees in that group and the percentage these employees represent of all employees in the data set.

In the code below, I first calculate this the incorrect way. Notice that percent_of_total_employees sums to 200, not 100. This is because summarise() drops only the last grouping variable (job_level), so the result it produces is still grouped by left_company. The mutate() that follows is therefore applied within each of the “Yes” and “No” groups, and each group’s percentages sum to 100.

# Not what we wanted
employee_data %>% group_by(left_company, job_level) %>% 
                  summarise(employees = n()) %>% 
                  mutate(percent_of_total_employees = 100*(employees/sum(employees)))



To avoid these types of surprises, I recommend dropping groups after summary statistics have been calculated. This is done with the ungroup() function.

# This is correct
employee_data %>% group_by(left_company, job_level) %>% 
                  summarise(employees = n()) %>% 
                  ungroup() %>% 
                  mutate(percent_of_total_employees = 100*(employees/sum(employees)))



Obtaining Distinct Values

The n() function counts the number of rows in a data frame, while the n_distinct() function counts the number of unique values observed in a variable. This is useful in many situations.

Like the n() function, n_distinct() is used within summarise() or mutate().

Let’s say we wanted to know the number of distinct departments in the employee_data data frame. The code below will calculate this. We simply pass the department variable into n_distinct().

employee_data %>% 
  summarise(number_of_departments = n_distinct(department))
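Since summary functions respect grouping, n_distinct() can also be combined with group_by(). As a small extension of the code above, the sketch below counts the distinct job levels within each department.

# Count the number of unique job_level values per department
employee_data %>% 
  group_by(department) %>% 
  summarise(distinct_job_levels = n_distinct(job_level))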



Advanced Filtering

Ranking Data with top_n()

The top_n() function can be used to subset a data frame based on the ranked values of a particular variable. This function takes three inputs:

  • a data frame
  • number of rows to return
  • variable used for ordering


In the example below, I filter the employee_data dataset to include only the top 3 rows based on the salary values. By default, top_n() will return the rows with the 3 largest values of salary.

employee_data %>% top_n(3, salary)



If there are any ties, the top_n() function will return all tied rows, so you may get more rows than you asked for. Notice that we get 4 rows in the output below. This is because the top 3 values of yrs_at_company are 40, 37, and 36, and there are two instances of 36 in the data.

employee_data %>% top_n(3, yrs_at_company)
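Note that in dplyr 1.0 and later, top_n() has been superseded by slice_max(), which keeps ties by default just like top_n():

# Equivalent to top_n(3, yrs_at_company); ties are kept by default
employee_data %>% slice_max(yrs_at_company, n = 3)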



Selecting Rows with slice()

The slice() function is used to select rows by a numeric index. If you want only the first 3 rows of a data frame, you could use slice(1:3).

Let’s re-create the top 3 analysis we just performed, but this time we’ll keep only the first 3 rows. First, here are two simple examples of selecting rows by position.


# Keep just the first 3 rows
employee_data %>% slice(1:3)


# Keep rows 1, 200, 1002
employee_data %>% slice(1, 200, 1002)



The slice() function can be used as an alternative to top_n() when paired with arrange(). This has the benefit of returning exactly the number of rows requested, with tied values dropped.

The code below returns the top 3 values of yrs_at_company just as above, but only keeps the first 3 rows. This is achieved by first ordering the values of yrs_at_company in descending order.

employee_data %>% arrange(desc(yrs_at_company)) %>% 
                  slice(1:3)
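In recent versions of dplyr, slice_max() accomplishes this in a single step when we set with_ties = FALSE:

# Keep exactly 3 rows with the largest yrs_at_company, dropping ties
employee_data %>% slice_max(yrs_at_company, n = 3, with_ties = FALSE)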



The slice() and top_n() functions are also useful for creating subsets by groups in a data frame.

Let’s say that we wanted to filter the employee_data data frame to only include the top 2 employees with the largest values of salary for each department. The code below shows how we would accomplish this with group_by() and top_n().

employee_data %>% group_by(department) %>% 
                  top_n(2, salary)


To make things easier to interpret, we can arrange the results from above by department and descending salary:

employee_data %>% group_by(department) %>% 
                  top_n(2, salary) %>% 
                  arrange(department, desc(salary))



Using Logical Subsetting

Special Properties of Logical Vectors

Imagine we have data from a survey we recently conducted where 7 people responded and provided their age. This data is stored in the age vector below.

age <- c(23, 31, 27, 41, 54, 34, 25)

age
[1] 23 31 27 41 54 34 25

What if we would like to know the number of people who are 30 or older, and what percentage of the total respondents this group represents?

We can answer this question by first using the >= operator to find where values stored in the age vector are greater than or equal to the value 30. Anytime we use comparison operators (>, >=, <, <=, ==) on a vector, we will get a logical vector consisting of TRUE/FALSE values indicating where our condition was met.

For example, running the code below produces a sequence of TRUE/FALSE values that test where our respondents are 30 or older in the age vector.

age >= 30
[1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE

Two Important Properties

To answer our question above, we can use the following properties of logical vectors in R:

  • the sum of a logical vector returns the number of TRUE values
  • the mean of a logical vector returns the proportion of TRUE values

We see from the output below that 4 people in our survey were 30 years or older and that this represents 57% of the total respondents.

sum(age >= 30)
[1] 4
mean(age >= 30)
[1] 0.5714286



Logical Subsetting Within summarise()

We can use these properties to extend our capabilities with summarise(). What if someone asked us how many employees there are in each department, along with the count and proportion of those employees who make less than $60,000?

employee_data %>% group_by(department) %>% 
                  summarise(employees = n(),
                            employees_less_60 = sum(salary < 60000),
                            employees_less_60_prop = mean(salary < 60000))



Common Data Manipulation Tasks

This section covers some common data analysis tasks that we will be performing on data frames throughout the semester. To demonstrate these tasks, let’s work with the following simple data frame.

The code below creates a simple data set with results from a job survey.

survey <- tibble(age = c(26, 31, 28, 42, 31, 37, 51, 29),
                 job_function = c('Biotechnology', 'Analytics', 'Machine Learning',
                                  'Marketing','Biotechnology', 'Machine Learning', 
                                  'Analytics', 'Biotechnology'),
                 job_industry = c('Healthcare', 'Healthcare', 'Financial Services',
                                  'Retail', 'Non-Profit', 'Education',
                                  'Retail', 'Healthcare'),
                 job_level = c('Entry', 'Mid', 'Mid', 'Senior', 'Mid', 
                               'Mid', 'Senior', 'Mid'),
                 salary = c(75500, 87600, 97000, 92000, 89000,
                            108500, 121000, 94000))

survey



Recoding Categorical Variables

In many cases, you will need to re-code the values of a given variable. In our data, we have a job_function variable that captures our survey respondents’ specific role within their company.

What if we wanted to replace “Analytics” and “Machine Learning” with “Data Science”? We can achieve this goal by using the recode() function.

This function takes the vector to modify as its first argument and a series of replacements in the form old value = new value, separated by commas. Any values that are left out will be kept the same.

Below is an example of how to use this function on the job_function column within our survey data.


recode(survey$job_function,
       'Analytics' = 'Data Science',
       'Machine Learning' = 'Data Science')
[1] "Biotechnology" "Data Science"  "Data Science"  "Marketing"    
[5] "Biotechnology" "Data Science"  "Data Science"  "Biotechnology"



To obtain the results as a factor variable, we can use the recode_factor() function. The recode_factor() function has the same arguments as recode(), but will return a factor as a result.

In the code below, we will use mutate() to create new variables that recode “Analytics” and “Machine Learning” into “Data Science”. One is created with recode(), which returns a character variable, and the other with recode_factor().

Notice in the output, that even though the two new columns look identical, their types are different.


survey %>% 
  mutate(job_function_chr = recode(job_function,
                                 'Analytics' = 'Data Science',
                                 'Machine Learning' = 'Data Science'),
         job_function_fct = recode_factor(job_function,
                                          'Analytics' = 'Data Science',
                                          'Machine Learning' = 'Data Science'))


For more examples of the recode() functions, see the tidyverse documentation



Creating Variables with case_when()

The case_when() function from the dplyr package is particularly useful when you need to create a new variable that relies on a complex combination of existing variables in a data frame.

The case_when() function takes a sequence of two-sided formulas. The left hand side determines which values match this case and the right hand side provides the replacement value.

The general syntax is logical condition ~ replacement value, where the logical condition can involve multiple variables from a data frame. The sequence ends with TRUE ~ value for all other cases.

The sequence is evaluated in the order that it is put into case_when(). This syntax is best understood with an example.
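To see the order-dependence on its own, here is a minimal example with a made-up vector before we turn to the survey data.

# Conditions are checked top to bottom; the first match wins
x <- c(5, 15, 25)

case_when(x > 20 ~ 'large',
          x > 10 ~ 'medium',
          TRUE ~ 'small')
# [1] "small"  "medium" "large"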


In the code below, we use case_when() to create a new character vector from the age and job_function columns in the survey data.

Let’s say that we would like to create a new column that labels people who are 30 or older and work in Analytics or Machine Learning as “Data Science, 30+”, and people who are 30 or older and work in Biotechnology as “Biotechnology, 30+”. For all other respondents to our survey, we would like “Other”.

survey_updated <- survey %>% 
                  mutate(ds_biotech_30 = case_when(age >= 30 & job_function %in% c('Analytics', 'Machine Learning') ~ 'Data Science, 30+',
                                                age >= 30 & job_function == "Biotechnology" ~ "Biotechnology, 30+",
                                                TRUE ~ 'Other'))

survey_updated


For more examples of the case_when() function, see the tidyverse documentation



Binning Numeric Variables

Converting numeric columns into categories is a common data analysis task. A numeric variable can be partitioned into categories by using the cut() function. Let’s say that we want to create a categorical age variable with the following values: “Less than 25”, “25 - 35”, and “36 and older”.

The cut() function divides a numeric vector into intervals and codes the values according to which interval they fall into. The cut() function takes the following arguments:

  • x - a numeric vector
  • breaks - a numeric vector of cut points (can include -Inf and Inf)
  • labels - a vector of labels for the resulting intervals (optional)
  • right - logical (default is TRUE), indicating whether the intervals should be closed on the right (i.e. (a, b])


The nice thing about cut() is that it returns a factor whose levels follow the order of the intervals (set ordered_result = TRUE if you need an ordered factor).

Below is an example of using cut() on the age vector in survey.

The symbols -Inf and Inf in the breaks argument stand for negative and positive infinity, so the first and last intervals are unbounded and will capture the smallest and largest observed values.

Using the breaks from below, the cut() function will segment the age values into the following categories:


(-infinity, 24], (24, 35], and (35, +infinity]


# original values
survey$age
[1] 26 31 28 42 31 37 51 29
# Using cut()
cut(x = survey$age,
    breaks = c(-Inf, 24, 35, Inf),
    labels = c("Less than 25", "25 - 35", "36 and older"),
    right = TRUE)
[1] 25 - 35      25 - 35      25 - 35      36 and older 25 - 35     
[6] 36 and older 36 and older 25 - 35     
Levels: Less than 25 25 - 35 36 and older


Here is an example of doing the above task with dplyr and mutate().

survey %>% 
  mutate(age_category = cut(age,
                            breaks = c(-Inf, 24, 35, Inf),
                            labels = c("Less than 25", "25 - 35", "36 and older"),
                            right = TRUE)) %>% 
  select(age, age_category)



Automatic Binning

There are three very useful functions, loaded with the tidyverse (they come from the ggplot2 package), that perform automatic binning of numeric data for quick exploratory analysis. These functions take a numeric vector as input and return a factor with labeled levels.

  • cut_interval() makes n groups with equal range
  • cut_number() makes n groups with (approximately) equal numbers of observations
  • cut_width() makes groups of a given width


cut_interval()

# Original variable
survey$age
[1] 26 31 28 42 31 37 51 29
# cut_interval
age_interval <- cut_interval(survey$age, n = 3)

# View results
age_interval
[1] [26,34.3]   [26,34.3]   [26,34.3]   (34.3,42.7] [26,34.3]   (34.3,42.7]
[7] (42.7,51]   [26,34.3]  
Levels: [26,34.3] (34.3,42.7] (42.7,51]


# Obtain factor level counts
summary(age_interval)
  [26,34.3] (34.3,42.7]   (42.7,51] 
          5           2           1 



cut_number()

# cut_number
age_number <- cut_number(survey$age, n = 3)

# View results
age_number
[1] [26,29.7] (29.7,35] [26,29.7] (35,51]   (29.7,35] (35,51]   (35,51]  
[8] [26,29.7]
Levels: [26,29.7] (29.7,35] (35,51]
# View factor level counts
summary(age_number)
[26,29.7] (29.7,35]   (35,51] 
        3         2         3 



cut_width()

The cut_width() function takes a width argument, which specifies the width of each interval, and a boundary argument for centering intervals. By default, boundary is equal to 0.5, which may lead to intervals that have non-integer end points. I usually set boundary to 0 so that the intervals always fall between integer values.

# Original variable
survey$age
[1] 26 31 28 42 31 37 51 29
# cut_width
age_width <- cut_width(survey$age, width = 10, boundary = 0)
# View result
age_width
[1] [20,30] (30,40] [20,30] (40,50] (30,40] (30,40] (50,60] [20,30]
Levels: [20,30] (30,40] (40,50] (50,60]
# View counts
summary(age_width)
[20,30] (30,40] (40,50] (50,60] 
      3       3       1       1 



Example with employee_data

Let’s use the cut_width() function to accomplish the following data analysis workflow with dplyr:


  • Start with employee_data
  • Create a miles_category variable that bins miles_from_home values in 5 mile increments
  • Form groups by left_company and miles_category
  • Count the number of employees that fall into each combination


employee_data %>% 
  mutate(miles_category = cut_width(miles_from_home, width = 5, boundary = 0)) %>% 
  group_by(left_company, miles_category) %>% 
  summarise(employees = n())
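As a small variation (not part of the original workflow), the grouping and tallying steps can be collapsed into a single count() call:

# count() forms the groups and tallies the rows in one step
employee_data %>% 
  mutate(miles_category = cut_width(miles_from_home, width = 5, boundary = 0)) %>% 
  count(left_company, miles_category, name = 'employees')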