In this tutorial, we will cover more advanced techniques for data manipulation with `dplyr`.

Please click the button below to open an interactive version of all course `R` tutorials through RStudio Cloud.

**Note**: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.

Click the button below to launch an interactive RStudio environment using `Binder.org`. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.

First, let’s load the `tidyverse` and `skimr` packages as well as the `employee_data` tibble into our `R` environment.

```
# This will load all 8 of the tidyverse packages
library(tidyverse)
library(skimr)
# Load employee_data
employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))
# View data
employee_data
```

Before we begin our study of advanced techniques for data manipulation, let’s review the core `dplyr` functions from our last tutorial. Let’s say that we are interested in answering the following question:

Do employees who have left the company tend to work more weekly hours, on average?

One way to answer this question would be to execute the following `dplyr` workflow:

- start with `employee_data`
- create a new variable that standardizes the `weekly_hours` column by subtracting the mean and dividing by the standard deviation
- select only the rows of `employee_data` where `left_company` is **yes**
- calculate the average of the standardized weekly hours variable

Remember that when we standardize a numeric variable by subtracting the mean and dividing by the standard deviation, the resulting values are in `standard deviation units`.

So a 2.2 value on this scale represents a value that is 2.2 standard deviations to the right of the average value in the original data. A value of 0 on this scale represents the average value in the original data.

The code below executes our analytics workflow and we see that employees who left the company tend to work more weekly hours on average. In fact, these employees work nearly 2 standard deviations to the right of the average hours across all employees in the company.

```
employee_data %>%
  mutate(hours_scaled = (weekly_hours - mean(weekly_hours)) / sd(weekly_hours)) %>%
  filter(left_company == 'Yes') %>%
  summarise(avg_weekly_hours = mean(hours_scaled))
```

Note that we standardized our data **before** selecting employees that left the company.

- What if we selected the employees who left and then created our standardized weekly hours variable?
- Would this have answered our question?

Try to perform this with `dplyr` and see what you get. What does it mean in this case?
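As a hint, the reversed order can be sketched on a tiny made-up tibble (the values below are illustrative and are not taken from `employee_data`). Filtering first means `mean()` and `sd()` are computed only from the employees who left, so the group’s average scaled value is 0 by construction:

```r
library(dplyr)

# Illustrative data only -- not the real employee_data values
demo <- tibble(left_company = c('Yes', 'Yes', 'No', 'No'),
               weekly_hours = c(60, 58, 45, 47))

# Filter first, then standardize: the mean and sd now come from the
# filtered rows only, so the scaled values average to exactly 0
demo %>%
  filter(left_company == 'Yes') %>%
  mutate(hours_scaled = (weekly_hours - mean(weekly_hours)) / sd(weekly_hours)) %>%
  summarise(avg_weekly_hours = mean(hours_scaled))
```

The result is always 0 regardless of the data, so it no longer tells us how the hours of employees who left compare to all employees.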

A common data analysis technique, known as `split-apply-combine`, involves creating statistical summaries **by groups** within a data frame.

Using the `split-apply-combine` technique with `dplyr` usually involves taking a data frame, forming subsets with the `group_by()` function, applying a summary function to the groups, and collecting the results into a single data frame.

The `group_by()` function allows you to perform tasks `by groups` in your data. A typical application of the `group_by()` function is when addressing questions such as *What is the average salary of employees by department*?

The `group_by()` function will separate a data frame into distinct groups using the variable(s) supplied as the argument. Behind the scenes, the `group_by()` function splits a data frame into multiple data frames, one for each unique value of the grouping variable. Any `dplyr` functions that follow are applied to the groups within the grouped data frame.

Below is an example of how we would answer the question from above.

```
employee_data %>%
  group_by(department) %>%                   # form groups by department
  summarise(average_salary = mean(salary))   # calculate mean within each group
```

The `group_by()` function can be used to form groups with multiple levels. This happens when we pass more than one variable into the function.

Let’s say we are interested in answering the following question: *What is the average salary of employees by department and whether they left the company or not*?

The small adjustment to our previous code below will answer this for us.

```
employee_data %>%
  group_by(left_company, department) %>%     # form groups by left_company and department
  summarise(average_salary = mean(salary))   # calculate mean within each combination of values
```

What `dplyr` commands would you need to add to the previous code so that the results are sorted as they appear below?
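One plausible answer is to pipe the grouped summary into `arrange()`. Since the target output is not reproduced here, the ordering below (by `left_company`, then descending `average_salary`) is an assumption; the sketch uses a small made-up tibble so it runs on its own:

```r
library(dplyr)

# Made-up data for illustration only
demo <- tibble(left_company = c('No', 'No', 'Yes', 'Yes'),
               department   = c('Sales', 'Marketing', 'Sales', 'Marketing'),
               salary       = c(90000, 95000, 80000, 70000))

# Append an arrange() step to sort the grouped summary
demo %>%
  group_by(left_company, department) %>%
  summarise(average_salary = mean(salary)) %>%
  arrange(left_company, desc(average_salary))
```

The same `arrange()` step can be appended to the `employee_data` pipeline above.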

The `group_by()` function can also be used in combination with the `mutate()` function.

Imagine if we wanted to accomplish the following:

Create a new `salary_scaled_dept` variable that scales the `salary` variable by subtracting the mean and dividing by the standard deviation. However, unlike in our previous example, we want to scale each `salary` value **within each department**.

This task is accomplished by the code below. First we group `employee_data` by the unique values in the `department` variable and then apply the transformation. Within `mutate()`, the function `mean(salary)` is calculating the mean `salary` value **within each group**.

```
employee_data %>%
  group_by(department) %>%
  mutate(salary_scaled_dept = (salary - mean(salary)) / sd(salary)) %>%
  select(department, salary_scaled_dept)
```

What if we want to create a `salary_scaled` variable, like in our review example, as well? Maybe we would like to see how the standardized values differ across all employees as well as within each department.

To achieve this, we must add a step to our code above.

First we use `mutate()` to create a `salary_scaled` variable across all rows. We then pass this into a `group_by()` function where we perform the same calculation by group. This complex set of transformations is easy to follow when we use the `%>%` operator.

```
employee_data %>%
  mutate(salary_scaled = (salary - mean(salary)) / sd(salary)) %>%       # scale across all rows
  group_by(department) %>%                                               # form groups by department
  mutate(salary_scaled_dept = (salary - mean(salary)) / sd(salary)) %>%  # scale within each group
  select(department, salary_scaled, salary_scaled_dept)
```

`skim()`

In lesson 2, we explored data frames with the `skim()` function from the `skimr` package.

The `skim()` function takes a data frame as input and provides a high-level summary of the structure of the data frame and the values within its columns.

As output, we get:

- the number of rows and columns along with the data types present in our data
- a summary of our factor variables, including the number of missing observations, whether our factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels
- a summary of our numeric variables, which includes the number of missing observations, the mean and standard deviation, a five-number summary, and a plot of the distribution of values

An added bonus of the `skim()` function is that it also accepts a grouped data frame. In this case, all summary statistics will be calculated by group.

In the code below, we use the `%>%` operator to group the `employee_data` tibble by `department` and pass it into the `skim()` function. Notice that all summaries are now provided by each department.

```
# View data frame summaries by groups
employee_data %>%
  group_by(department) %>%
  skim()
```

Name | Piped data |
---|---|
Number of rows | 1470 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
factor | 6 |
numeric | 6 |
________________________ | |
Group variables | department |

**Variable type: factor**

skim_variable | department | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
left_company | Marketing | 0 | 1 | FALSE | 2 | No: 198, Yes: 40 |
left_company | Sales | 0 | 1 | FALSE | 2 | No: 175, Yes: 77 |
left_company | Research | 0 | 1 | FALSE | 2 | No: 284, Yes: 9 |
left_company | Product Development | 0 | 1 | FALSE | 2 | No: 128, Yes: 50 |
left_company | IT and Analytics | 0 | 1 | FALSE | 2 | No: 369, Yes: 30 |
left_company | Finance and Operations | 0 | 1 | FALSE | 2 | No: 79, Yes: 31 |
job_level | Marketing | 0 | 1 | FALSE | 5 | Sen: 81, Man: 59, Dir: 51, Ass: 28 |
job_level | Sales | 0 | 1 | FALSE | 5 | Sen: 73, Man: 63, Dir: 56, Ass: 41 |
job_level | Research | 0 | 1 | FALSE | 5 | Sen: 101, Dir: 74, Man: 64, Vic: 32 |
job_level | Product Development | 0 | 1 | FALSE | 5 | Sen: 57, Man: 39, Dir: 39, Ass: 28 |
job_level | IT and Analytics | 0 | 1 | FALSE | 5 | Sen: 131, Man: 90, Dir: 88, Ass: 51 |
job_level | Finance and Operations | 0 | 1 | FALSE | 5 | Sen: 33, Man: 29, Dir: 23, Ass: 15 |
business_travel | Marketing | 0 | 1 | FALSE | 3 | Rar: 173, Fre: 37, Non: 28 |
business_travel | Sales | 0 | 1 | FALSE | 3 | Rar: 176, Fre: 49, Non: 27 |
business_travel | Research | 0 | 1 | FALSE | 3 | Rar: 220, Fre: 50, Non: 23 |
business_travel | Product Development | 0 | 1 | FALSE | 3 | Rar: 111, Fre: 45, Non: 22 |
business_travel | IT and Analytics | 0 | 1 | FALSE | 3 | Rar: 278, Fre: 78, Non: 43 |
business_travel | Finance and Operations | 0 | 1 | FALSE | 3 | Rar: 85, Fre: 18, Non: 7 |
job_satisfaction | Marketing | 0 | 1 | FALSE | 4 | Ver: 77, Hig: 70, Low: 49, Med: 42 |
job_satisfaction | Sales | 0 | 1 | FALSE | 4 | Hig: 87, Ver: 69, Low: 49, Med: 47 |
job_satisfaction | Research | 0 | 1 | FALSE | 4 | Hig: 95, Ver: 95, Low: 58, Med: 45 |
job_satisfaction | Product Development | 0 | 1 | FALSE | 4 | Hig: 59, Ver: 46, Med: 38, Low: 35 |
job_satisfaction | IT and Analytics | 0 | 1 | FALSE | 4 | Ver: 132, Hig: 105, Med: 92, Low: 70 |
job_satisfaction | Finance and Operations | 0 | 1 | FALSE | 4 | Ver: 40, Low: 28, Hig: 26, Med: 16 |
performance_rating | Marketing | 0 | 1 | FALSE | 5 | Mee: 95, Exc: 80, Exc: 39, Min: 20 |
performance_rating | Sales | 0 | 1 | FALSE | 5 | Mee: 95, Exc: 84, Exc: 41, Min: 27 |
performance_rating | Research | 0 | 1 | FALSE | 5 | Mee: 104, Exc: 99, Exc: 75, Min: 12 |
performance_rating | Product Development | 0 | 1 | FALSE | 5 | Mee: 69, Exc: 52, Min: 25, Exc: 24 |
performance_rating | IT and Analytics | 0 | 1 | FALSE | 5 | Mee: 141, Exc: 129, Exc: 87, Min: 34 |
performance_rating | Finance and Operations | 0 | 1 | FALSE | 5 | Mee: 42, Exc: 28, Exc: 20, Min: 18 |
marital_status | Marketing | 0 | 1 | FALSE | 3 | Mar: 101, Sin: 82, Div: 55 |
marital_status | Sales | 0 | 1 | FALSE | 3 | Mar: 119, Sin: 81, Div: 52 |
marital_status | Research | 0 | 1 | FALSE | 3 | Mar: 134, Sin: 84, Div: 75 |
marital_status | Product Development | 0 | 1 | FALSE | 3 | Mar: 86, Sin: 57, Div: 35 |
marital_status | IT and Analytics | 0 | 1 | FALSE | 3 | Mar: 187, Sin: 124, Div: 88 |
marital_status | Finance and Operations | 0 | 1 | FALSE | 3 | Mar: 46, Sin: 42, Div: 22 |

**Variable type: numeric**

skim_variable | department | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
salary | Marketing | 0 | 1 | 93265.58 | 36514.10 | 30559.08 | 72344.72 | 87856.66 | 117603.05 | 206581.5 | ▂▇▃▁▁ |
salary | Sales | 0 | 1 | 90148.67 | 37746.84 | 30488.15 | 66664.75 | 85180.55 | 116104.17 | 197965.1 | ▅▇▅▁▂ |
salary | Research | 0 | 1 | 99425.10 | 37606.81 | 33276.54 | 73047.08 | 91205.32 | 119657.44 | 203528.9 | ▂▇▅▁▂ |
salary | Product Development | 0 | 1 | 92429.42 | 37322.31 | 32443.79 | 68360.73 | 87066.84 | 114993.62 | 195345.4 | ▅▇▅▁▂ |
salary | IT and Analytics | 0 | 1 | 93897.64 | 37367.83 | 29848.56 | 70115.29 | 88521.84 | 116088.12 | 211621.0 | ▃▇▃▁▁ |
salary | Finance and Operations | 0 | 1 | 93893.34 | 39945.11 | 37115.29 | 70317.59 | 88760.26 | 113516.54 | 212134.7 | ▆▇▃▁▂ |
weekly_hours | Marketing | 0 | 1 | 49.96 | 4.81 | 41.00 | 47.00 | 49.00 | 52.00 | 66.0 | ▃▇▂▂▁ |
weekly_hours | Sales | 0 | 1 | 51.58 | 5.81 | 40.00 | 47.00 | 50.00 | 56.00 | 65.0 | ▂▇▃▃▂ |
weekly_hours | Research | 0 | 1 | 48.53 | 3.57 | 40.00 | 46.00 | 48.00 | 50.00 | 62.0 | ▂▇▇▁▁ |
weekly_hours | Product Development | 0 | 1 | 50.89 | 5.39 | 42.00 | 46.00 | 49.50 | 55.00 | 64.0 | ▇▇▆▃▂ |
weekly_hours | IT and Analytics | 0 | 1 | 49.23 | 3.93 | 40.00 | 47.00 | 49.00 | 51.00 | 64.0 | ▂▇▆▁▁ |
weekly_hours | Finance and Operations | 0 | 1 | 51.98 | 5.37 | 43.00 | 48.00 | 51.00 | 55.00 | 66.0 | ▅▇▃▃▁ |
yrs_at_company | Marketing | 0 | 1 | 6.95 | 6.01 | 0.00 | 3.00 | 5.00 | 9.75 | 36.0 | ▇▃▁▁▁ |
yrs_at_company | Sales | 0 | 1 | 6.86 | 6.37 | 0.00 | 2.00 | 5.00 | 9.00 | 40.0 | ▇▂▁▁▁ |
yrs_at_company | Research | 0 | 1 | 7.13 | 6.21 | 0.00 | 3.00 | 5.00 | 9.00 | 33.0 | ▇▅▁▁▁ |
yrs_at_company | Product Development | 0 | 1 | 6.79 | 6.15 | 0.00 | 2.00 | 5.00 | 9.75 | 33.0 | ▇▅▁▁▁ |
yrs_at_company | IT and Analytics | 0 | 1 | 7.30 | 6.25 | 0.00 | 3.00 | 5.00 | 10.00 | 36.0 | ▇▃▁▁▁ |
yrs_at_company | Finance and Operations | 0 | 1 | 6.48 | 5.06 | 0.00 | 3.00 | 5.00 | 9.00 | 21.0 | ▇▅▃▂▁ |
yrs_since_promotion | Marketing | 0 | 1 | 2.29 | 3.25 | 0.00 | 0.00 | 1.00 | 3.00 | 15.0 | ▇▁▁▁▁ |
yrs_since_promotion | Sales | 0 | 1 | 2.12 | 3.17 | 0.00 | 0.00 | 1.00 | 3.00 | 15.0 | ▇▁▁▁▁ |
yrs_since_promotion | Research | 0 | 1 | 2.24 | 3.48 | 0.00 | 0.00 | 1.00 | 2.00 | 15.0 | ▇▁▁▁▁ |
yrs_since_promotion | Product Development | 0 | 1 | 1.87 | 2.72 | 0.00 | 0.00 | 1.00 | 2.00 | 13.0 | ▇▂▁▁▁ |
yrs_since_promotion | IT and Analytics | 0 | 1 | 2.30 | 3.28 | 0.00 | 0.00 | 1.00 | 2.50 | 15.0 | ▇▁▁▁▁ |
yrs_since_promotion | Finance and Operations | 0 | 1 | 2.07 | 3.13 | 0.00 | 0.00 | 1.00 | 2.00 | 13.0 | ▇▁▁▁▁ |
previous_companies | Marketing | 0 | 1 | 3.17 | 1.53 | 1.00 | 2.00 | 3.00 | 4.00 | 7.0 | ▇▇▂▂▂ |
previous_companies | Sales | 0 | 1 | 3.41 | 1.58 | 1.00 | 2.00 | 3.00 | 5.00 | 7.0 | ▇▇▂▃▃ |
previous_companies | Research | 0 | 1 | 3.05 | 1.51 | 1.00 | 2.00 | 3.00 | 3.00 | 7.0 | ▇▇▁▂▂ |
previous_companies | Product Development | 0 | 1 | 3.57 | 1.68 | 1.00 | 2.00 | 3.00 | 5.00 | 7.0 | ▆▇▂▃▃ |
previous_companies | IT and Analytics | 0 | 1 | 3.09 | 1.56 | 1.00 | 2.00 | 3.00 | 3.00 | 7.0 | ▇▇▁▁▂ |
previous_companies | Finance and Operations | 0 | 1 | 3.46 | 1.62 | 1.00 | 2.00 | 3.00 | 5.00 | 7.0 | ▆▇▁▂▃ |
miles_from_home | Marketing | 0 | 1 | 8.89 | 8.11 | 1.00 | 2.00 | 7.00 | 12.00 | 29.0 | ▇▅▁▁▂ |
miles_from_home | Sales | 0 | 1 | 9.85 | 8.17 | 1.00 | 2.00 | 8.00 | 15.25 | 29.0 | ▇▅▂▂▂ |
miles_from_home | Research | 0 | 1 | 9.37 | 7.83 | 1.00 | 3.00 | 7.00 | 14.00 | 29.0 | ▇▅▂▂▂ |
miles_from_home | Product Development | 0 | 1 | 8.98 | 8.10 | 1.00 | 2.00 | 7.00 | 12.75 | 29.0 | ▇▅▁▂▂ |
miles_from_home | IT and Analytics | 0 | 1 | 9.04 | 8.45 | 1.00 | 2.00 | 6.00 | 14.50 | 29.0 | ▇▃▁▂▂ |
miles_from_home | Finance and Operations | 0 | 1 | 8.79 | 7.43 | 1.00 | 2.00 | 7.00 | 12.00 | 29.0 | ▇▆▂▂▁ |

As before, we can select a subset of columns to explore by typing their names within the `skim()` function.

```
# Select a subset of variables
employee_data %>%
  group_by(department) %>%
  skim(left_company, salary)
```

Name | Piped data |
---|---|
Number of rows | 1470 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 1 |
________________________ | |
Group variables | department |

**Variable type: factor**

skim_variable | department | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
left_company | Marketing | 0 | 1 | FALSE | 2 | No: 198, Yes: 40 |
left_company | Sales | 0 | 1 | FALSE | 2 | No: 175, Yes: 77 |
left_company | Research | 0 | 1 | FALSE | 2 | No: 284, Yes: 9 |
left_company | Product Development | 0 | 1 | FALSE | 2 | No: 128, Yes: 50 |
left_company | IT and Analytics | 0 | 1 | FALSE | 2 | No: 369, Yes: 30 |
left_company | Finance and Operations | 0 | 1 | FALSE | 2 | No: 79, Yes: 31 |

**Variable type: numeric**

skim_variable | department | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
salary | Marketing | 0 | 1 | 93265.58 | 36514.10 | 30559.08 | 72344.72 | 87856.66 | 117603.1 | 206581.5 | ▂▇▃▁▁ |
salary | Sales | 0 | 1 | 90148.67 | 37746.84 | 30488.15 | 66664.75 | 85180.55 | 116104.2 | 197965.1 | ▅▇▅▁▂ |
salary | Research | 0 | 1 | 99425.10 | 37606.81 | 33276.54 | 73047.08 | 91205.32 | 119657.4 | 203528.9 | ▂▇▅▁▂ |
salary | Product Development | 0 | 1 | 92429.42 | 37322.31 | 32443.79 | 68360.73 | 87066.84 | 114993.6 | 195345.4 | ▅▇▅▁▂ |
salary | IT and Analytics | 0 | 1 | 93897.64 | 37367.83 | 29848.56 | 70115.29 | 88521.84 | 116088.1 | 211621.0 | ▃▇▃▁▁ |
salary | Finance and Operations | 0 | 1 | 93893.34 | 39945.11 | 37115.29 | 70317.59 | 88760.26 | 113516.5 | 212134.7 | ▆▇▃▁▂ |

`n()` and `count()`

The `group_by()` function is also used for obtaining counts of factor or character variable values.

The `n()` function is used to count the number of rows within a data frame. It can be used within either `summarise()` or `mutate()`. If a grouped data frame is used, the `n()` function will tally the number of rows within each level of the grouping variable.

Unlike other `dplyr` functions, `n()` takes no arguments. In the example below, we group `employee_data` by `left_company` and use `n()` within `summarise()` to count the number of rows within the levels of `left_company`. This will tally the number of “Yes” and “No” values within `left_company`.

```
# The n() function in dplyr returns the number of rows per group
employee_data %>%
  group_by(left_company) %>%
  summarise(number_employees = n())
```

This type of summary is so common that `dplyr` provides a function named `count()` to provide counts by levels of single or multiple variable combinations. This function creates a column named `n` with the counts. The `count()` function is best used with factor or character columns that do not have too many unique values.

```
# Count the occurrence of each unique value in department
employee_data %>% count(department)
```

```
# Counts can be sorted
employee_data %>% count(department, sort = TRUE)
```

```
# Count variable can be renamed
employee_data %>% count(department, sort = TRUE, name = 'number_of_employees')
```

The `count()` function also takes multiple variables as arguments. This will count all of the combinations of variable values that are present in the data.

```
# The count() function also takes multiple arguments
employee_data %>% count(left_company, department, name = 'number_of_employees')
```

Keep in mind that all `dplyr` functions return a data frame, so we can easily use other `dplyr` functions on the results of `count()`.

```
# Arrange the output by department and left_company
employee_data %>%
  count(left_company, department, name = 'number_of_employees') %>%
  arrange(department, left_company)
```

One more example. How many employees are there by job satisfaction, and what is their average distance from home within each category?

```
employee_data %>%
  group_by(job_satisfaction) %>%
  summarise(number_of_employees = n(),
            avg_miles = mean(miles_from_home))
```

Can you figure out why the code below doesn’t work?

**Hint**: highlight and execute

`employee_data %>% count(job_satisfaction, name = 'number_of_employees')`

What does this give you?

This is what is being passed to `summarise()` by the second `%>%` operator in the expression below.

Why doesn’t `summarise()` know how to calculate the mean of the `miles_from_home` variable?

```
employee_data %>%
  count(job_satisfaction, name = 'number_of_employees') %>%
  summarise(avg_miles = mean(miles_from_home))
```
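To see why, note that `count()` returns a data frame containing only the counted variable and the new count column, so `miles_from_home` no longer exists by the time it reaches `summarise()`. Here is a sketch on a small made-up tibble (illustrative values only):

```r
library(dplyr)

# Made-up data for illustration only
demo <- tibble(job_satisfaction = c('Low', 'High', 'High'),
               miles_from_home  = c(5, 12, 8))

counted <- demo %>% count(job_satisfaction, name = 'number_of_employees')

# Only the counted variable and the count column survive
names(counted)
```

The columns are just `job_satisfaction` and `number_of_employees`, which is why `mean(miles_from_home)` fails.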

`group_by()`

When using `group_by()`, it’s important to know that all `summarise()` and `mutate()` functions are `applied by group`. If you need to apply functions to all rows of a data frame and not by groups, then use the `ungroup()` function to un-group the input data frame.

Let’s say that we want to create a data frame that has employees grouped by `left_company` and `job_level`. For each group, we want the number of employees in that group and the percentage these employees represent of **all** employees in the data set.

In the code below, I first calculate this the incorrect way. Notice that `percent_of_total_employees` sums to 200. This is because the `mutate()` function was applied to the grouped data frame that `summarise()` produced, so we get the percentage for each group within “Yes” and “No”.

```
# Not what we wanted
employee_data %>%
  group_by(left_company, job_level) %>%
  summarise(employees = n()) %>%
  mutate(percent_of_total_employees = 100 * (employees / sum(employees)))
```

To avoid this type of surprise, I recommend dropping groups after summary statistics have been calculated. This is done with the `ungroup()` function.

```
# This is correct
employee_data %>%
  group_by(left_company, job_level) %>%
  summarise(employees = n()) %>%
  ungroup() %>%
  mutate(percent_of_total_employees = 100 * (employees / sum(employees)))
```

The `n()` function counts the number of rows in a data frame, while the `n_distinct()` function counts the number of unique values observed in a variable. This is useful in many situations.

Like the `n()` function, `n_distinct()` is used within `summarise()` or `mutate()`.

Let’s say we wanted to know the number of distinct departments in the `employee_data` data frame. The code below will calculate this. We simply pass the `department` variable into `n_distinct()`.

```
employee_data %>%
  summarise(number_of_departments = n_distinct(department))
```

`top_n()`

The `top_n()` function can be used to subset a data frame based on the ranked values of a particular variable. This function takes three inputs:

- a data frame
- number of rows to return
- variable used for ordering

In the example below, I filter the `employee_data` dataset to include only the top 3 rows based on the `salary` values. By default, `top_n()` will return the rows with the 3 largest values of `salary`.

`employee_data %>% top_n(3, salary)`

If there are any ties, the `top_n()` function will return all tied values. Notice that we get 4 rows in the output below. This is because the top 3 values of `yrs_at_company` are 40, 37, and 36, and there are two instances of 36 in the data.

`employee_data %>% top_n(3, yrs_at_company)`

`slice()`

The `slice()` function is used to select rows by a numeric index. If you want only the first 3 rows of a data frame, you could use *slice(1:3)*.

Let’s re-create the top 3 analysis we just performed, but this time we’ll keep only the first 3 rows.

```
# Keep just the first 3 rows
employee_data %>% slice(1:3)
```

```
# Keep rows 1, 200, 1002
employee_data %>% slice(1, 200, 1002)
```

The `slice()` function can be used as an alternative to `top_n()` when paired with `arrange()`. This has the benefit of dropping tied values.

The code below returns the top 3 values of `yrs_at_company` just as above, but only keeps the first 3 rows. This is achieved by first ordering the values of `yrs_at_company` in descending order.

```
employee_data %>%
  arrange(desc(yrs_at_company)) %>%
  slice(1:3)
```

The `slice()` and `top_n()` functions are also useful for creating subsets by groups in a data frame.

Let’s say that we wanted to filter the `employee_data` data frame to only include the top 2 employees with the largest values of `salary` for each department. The code below shows how we would accomplish this with `group_by()` and `top_n()`.

```
employee_data %>%
  group_by(department) %>%
  top_n(2, salary)
```

To make things easier to interpret, we can arrange the results from above by `department` and `salary`.
```
employee_data %>%
  group_by(department) %>%
  top_n(2, salary) %>%
  arrange(department, desc(salary))
```

Imagine we have data from a survey we recently conducted where 7 people responded and provided their age. This data is stored in the age vector below.

```
age <- c(23, 31, 27, 41, 54, 34, 25)
age
```

`[1] 23 31 27 41 54 34 25`

What if we would like to know the number of people who are 30 or older and what percentage of the total respondents this group represents?

We can answer this question by first using the `>=` operator to find where values stored in the `age` vector are greater than or equal to the value 30. Anytime we use comparison operators (>, >=, <, <=, ==) on a vector, we will get a logical vector consisting of `TRUE/FALSE` values indicating where our condition was met.

For example, running the code below produces a sequence of `TRUE/FALSE` values that test where our respondents are 30 or older in the `age` vector.

`age >= 30`

`[1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE`

To answer our question above, we can use the following properties of logical vectors in `R`:

- the sum of a logical vector returns the number of TRUE values
- the mean of a logical vector returns the proportion of TRUE values

We see from the output below that 4 people in our survey were 30 years or older and that this represents 57% of the total respondents.

`sum(age >= 30)`

`[1] 4`

`mean(age >= 30)`

`[1] 0.5714286`

`summarise()`

We can use these properties to extend our capabilities with `summarise()`. What if someone asked us how many employees there are by department and the count and proportion of those employees that make less than $60,000?

```
employee_data %>%
  group_by(department) %>%
  summarise(employees = n(),
            employees_less_60 = sum(salary < 60000),
            employees_less_60_prop = mean(salary < 60000))
```

This section covers some common data analysis tasks that we will be performing on data frames throughout the semester. To demonstrate these tasks, let’s work with the following simple data frame.

The code below creates a simple data set with results from a job survey.

```
survey <- tibble(age = c(26, 31, 28, 42, 31, 37, 51, 29),
                 job_function = c('Biotechnology', 'Analytics', 'Machine Learning',
                                  'Marketing', 'Biotechnology', 'Machine Learning',
                                  'Analytics', 'Biotechnology'),
                 job_industry = c('Healthcare', 'Healthcare', 'Financial Services',
                                  'Retail', 'Non-Profit', 'Education',
                                  'Retail', 'Healthcare'),
                 job_level = c('Entry', 'Mid', 'Mid', 'Senior', 'Mid',
                               'Mid', 'Senior', 'Mid'),
                 salary = c(75500, 87600, 97000, 92000, 89000,
                            108500, 121000, 94000))

survey
```

In many cases, you will need to re-code the values of a given variable. In our data, we have a `job_function` variable that captures our survey respondents’ specific role within their company.

What if we wanted to replace “Analytics” and “Machine Learning” with “Data Science”? We can achieve this goal by using the `recode()` function.

This function takes a vector to modify as its first argument and a series of replacements in the form `old value = new value`, separated by commas. All values that are left out will be kept the same.

Below is an example of how to use this function on the `job_function` column within our `survey` data.

```
recode(survey$job_function,
       'Analytics' = 'Data Science',
       'Machine Learning' = 'Data Science')
```

```
[1] "Biotechnology" "Data Science" "Data Science" "Marketing"
[5] "Biotechnology" "Data Science" "Data Science" "Biotechnology"
```

To obtain the results as a factor variable, we can use the `recode_factor()` function. The `recode_factor()` function has the same arguments as `recode()`, but will return a factor as a result.

In the code below, we will use `mutate()` to create new variables that recode “Analytics” and “Machine Learning” into “Data Science”. One will be created with `recode()`, which returns a character variable, and one with `recode_factor()`.

Notice in the output that even though the two new columns look identical, their types are different.

```
survey %>%
  mutate(job_function_chr = recode(job_function,
                                   'Analytics' = 'Data Science',
                                   'Machine Learning' = 'Data Science'),
         job_function_fct = recode_factor(job_function,
                                          'Analytics' = 'Data Science',
                                          'Machine Learning' = 'Data Science'))
```

For more examples of the `recode()` functions, see the tidyverse documentation.

`case_when()`

The `case_when()` function from the `dplyr` package is particularly useful when you need to create a new variable that relies on a complex combination of existing variables in a data frame.

The `case_when()` function takes a sequence of two-sided formulas. The left-hand side determines which values match the case, and the right-hand side provides the replacement value.

The general syntax is `logical condition ~ replacement value`, where the logical condition can involve multiple variables from a data frame. The sequence ends with `TRUE ~ value for all other cases`.

The sequence is evaluated in the order that it is put into `case_when()`. This syntax is best understood with an example.

In the code below, we use `case_when()` to create a new character vector from the `age` and `job_function` columns in the `survey` data.

Let’s say that we would like to create a new column that labels people who are 30 years or older and work in Analytics or Machine Learning as “Data Science, 30+”, and people who are 30 or older and work in Biotechnology as “Biotechnology, 30+”. For all other respondents to our survey, we would like “Other”.

```
survey_updated <- survey %>%
  mutate(ds_biotech_30 = case_when(age >= 30 & job_function %in% c('Analytics', 'Machine Learning') ~ 'Data Science, 30+',
                                   age >= 30 & job_function == 'Biotechnology' ~ 'Biotechnology, 30+',
                                   TRUE ~ 'Other'))

survey_updated
```

For more examples of the `case_when()` function, see the tidyverse documentation.

Converting numeric columns into categories is a common data analysis task. A numeric variable can be partitioned into categories by using the `cut()` function. Let’s say that we want to create a categorical age variable with the following values: “Less than 25”, “25 - 35”, and “36 and older”.

The `cut()` function divides a numeric vector into intervals and codes the values according to which interval they fall into. The `cut()` function takes the following arguments:

- `x` - a numeric vector
- `breaks` - a numeric vector of cut points (can include `-Inf` and `Inf`)
- `labels` - a vector of labels for the resulting intervals (optional)
- `right` - logical (default is TRUE), indicating whether the intervals should be closed on the right (i.e. (a, b])

The nice thing about `cut()` is that it returns an ordered factor vector as a result.

Below is an example of using `cut()` on the `age` vector in `survey`.

The symbols `-Inf` and `Inf` in the `breaks` argument stand for negative and positive infinity, which guarantees that the smallest and largest observed values fall within the first and last intervals, respectively.

Using the `breaks` from below, the `cut()` function will segment the `age` values into the following categories: ages of 24 or less become “Less than 25”, ages from 25 to 35 become “25 - 35”, and ages of 36 or more become “36 and older”.

```
# original values
survey$age
```

`[1] 26 31 28 42 31 37 51 29`

```
# Using cut()
cut(x = survey$age,
    breaks = c(-Inf, 24, 35, Inf),
    labels = c("Less than 25", "25 - 35", "36 and older"),
    right = TRUE)
```

```
[1] 25 - 35 25 - 35 25 - 35 36 and older 25 - 35
[6] 36 and older 36 and older 25 - 35
Levels: Less than 25 25 - 35 36 and older
```

Here is an example of doing the above task with `dplyr` and `mutate()`.

```
survey %>%
  mutate(age_category = cut(age,
                            breaks = c(-Inf, 24, 35, Inf),
                            labels = c("Less than 25", "25 - 35", "36 and older"),
                            right = TRUE)) %>%
  select(age, age_category)
```

There are three very useful functions within the `tidyverse` package that perform automatic binning of numeric data for quick exploratory analysis. These functions take a numeric vector as input and return a factor with labeled levels.

- `cut_interval()` makes n groups with `equal range`
- `cut_number()` makes n groups with (approximately) `equal numbers of observations`
- `cut_width()` makes groups of a `given width`

`cut_interval()`

```
# Original variable
survey$age
```

`[1] 26 31 28 42 31 37 51 29`

```
# cut_interval
age_interval <- cut_interval(survey$age, n = 3)
# View results
age_interval
```

```
[1] [26,34.3] [26,34.3] [26,34.3] (34.3,42.7] [26,34.3] (34.3,42.7]
[7] (42.7,51] [26,34.3]
Levels: [26,34.3] (34.3,42.7] (42.7,51]
```

```
# Obtain factor level counts
summary(age_interval)
```

```
[26,34.3] (34.3,42.7] (42.7,51]
5 2 1
```

`cut_number()`

```
# cut_number
age_number <- cut_number(survey$age, n = 3)
# View results
age_number
```

```
[1] [26,29.7] (29.7,35] [26,29.7] (35,51] (29.7,35] (35,51] (35,51]
[8] [26,29.7]
Levels: [26,29.7] (29.7,35] (35,51]
```

```
# View factor level counts
summary(age_number)
```

```
[26,29.7] (29.7,35] (35,51]
3 2 3
```

`cut_width()`

The `cut_width()` function takes a `width` argument which specifies the width of each interval and a `boundary` argument for centering intervals. By default, boundary is equal to 0.5, which may lead to intervals that have non-integer end points. I usually set boundary to 0 so that the intervals always fall between integer values.

```
# Original variable
survey$age
```

`[1] 26 31 28 42 31 37 51 29`

```
# cut_width
age_width <- cut_width(survey$age, width = 10, boundary = 0)
```

```
# View result
age_width
```

```
[1] [20,30] (30,40] [20,30] (40,50] (30,40] (30,40] (50,60] [20,30]
Levels: [20,30] (30,40] (40,50] (50,60]
```

```
# View counts
summary(age_width)
```

```
[20,30] (30,40] (40,50] (50,60]
3 3 1 1
```

`employee_data`

Let’s use the `cut_width()` function to accomplish the following data analysis workflow with `dplyr`:

- Start with `employee_data`
- Create a `miles_category` variable that bins `miles_from_home` values in 5 mile increments
- Form groups by `left_company` and `miles_category`
- Count the number of employees that fall into each combination

```
employee_data %>%
  mutate(miles_category = cut_width(miles_from_home, width = 5, boundary = 0)) %>%
  group_by(left_company, miles_category) %>%
  summarise(employees = n())
```