In this tutorial, we will learn about `R` functions and data analysis with the `tidyverse` package.

Please click the button below to open an interactive version of all course `R` tutorials through RStudio Cloud.

**Note**: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.

Click the button below to launch an interactive RStudio environment using `Binder.org`. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.

In this section, we will learn about common built-in functions that are useful for obtaining summary statistics, ranking data, and data analysis. We will also learn how to write our own custom functions in `R`.

The functions below are useful for studying the distribution of numeric values within a data set. All of these functions take a *numeric vector* as their input.

- `min()` - Returns the minimum value
- `max()` - Returns the maximum value
- `range()` - Returns a vector of length 2 containing the minimum and maximum observed values
- `median()` - Returns the median value (50th percentile)
- `fivenum()` - Returns a vector of length 5 with the minimum, 25th percentile, median, 75th percentile, and maximum values
- `quantile()` - Returns the specified percentile(s) of a set of numeric values

Let's begin by obtaining the range of values present in a numeric vector.

```
data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)
# minimum value in data_vector
min(data_vector)
```

`[1] 3`

```
# maximum value
max(data_vector)
```

`[1] 30`

```
# range of data values
range(data_vector)
```

`[1] 3 30`

The `median()` and `quantile()` functions are used for obtaining specific percentiles from a distribution of numbers. A percentile is a value at or below which a given percentage of the values fall. For example, the 50th percentile (also called the median) represents the center of a set of numeric data: 50% of all the values are less than or equal to it.

The `quantile()` function takes two inputs. The first is a numeric vector of data values and the second is a vector with values ranging from 0 to 1, representing the percentile(s) to calculate.

```
# median
median(data_vector)
```

`[1] 14`

```
# 30th percentile
quantile(data_vector, 0.3)
```

```
30%
9.1
```

```
# 30th, 60th, and 90th percentiles
quantile(data_vector, c(0.3, 0.6, 0.9))
```

```
30% 60% 90%
9.1 15.0 28.7
```

The `fivenum()` function calculates the five-number summary (minimum, 25th percentile, median, 75th percentile, maximum) of a numeric vector.

`fivenum(data_vector)`

`[1] 3.00 9.05 14.00 22.80 30.00`

The `mean()` and `sd()` functions are used to calculate the mean and standard deviation of a set of data values.

```
# mean value
mean(data_vector)
```

`[1] 15.57273`

```
# standard deviation
sd(data_vector)
```

`[1] 9.241114`

The `sum()` and `cumsum()` functions are used for summing the numbers within a vector. The `sum()` function simply returns the sum of all numbers within a vector.

The `cumsum()` function calculates a cumulative sum at every position within a vector. It always returns a vector of the same length as its input.

```
# sum of all values
sum(data_vector)
```

`[1] 171.3`

```
# cumulative sum
cumsum(data_vector)
```

` [1] 3.0 12.0 23.2 37.2 65.9 95.9 110.9 131.9 137.6 146.7 171.3`

The `abs()` and `rank()` functions are useful for ranking data values. The `abs()` function returns the absolute value of every element of a vector.

```
negative_data <- c(-2, 4.5, -6, 10, 12)
# returns the absolute value of all elements
abs(negative_data)
```

`[1] 2.0 4.5 6.0 10.0 12.0`

The `rank()` function returns the ranks of a set of data values from smallest to largest. The smallest value is given a rank of 1.

`data_vector`

` [1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6`

`rank(data_vector)`

` [1] 1 3 5 6 10 11 7 8 2 4 9`

To obtain ranks from largest to smallest, where rank 1 represents the largest value, just take the rank of the negative of a numeric vector. In the example below, the value 30 is given a rank of 1.

`data_vector`

` [1] 3.0 9.0 11.2 14.0 28.7 30.0 15.0 21.0 5.7 9.1 24.6`

`rank(-data_vector)`

` [1] 11 9 7 6 2 1 5 4 10 8 3`

There are many cases where we will have to write our own functions to accomplish tasks in an analytics project. `R` functions can be defined to take any number of inputs (usually called arguments) but return only one object.

The basic syntax of creating a function with arguments x and y is as follows:

```
my_function <- function(x, y) {
  # R code here
}
```
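As a minimal illustration of this syntax, here is a small two-argument function (`abs_max()` is a made-up example, not a built-in):

```
# abs_max() is a hypothetical example function with two arguments:
# it returns whichever input is larger in absolute value
abs_max <- function(x, y) {
  max(abs(x), abs(y))
}
abs_max(-8, 5)
```

`[1] 8`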

Assume that we would like to write a function that takes a numeric vector as input and returns a vector of scaled values. For each value in our original vector, we will subtract the mean and divide by the standard deviation. In Statistics, this transformation is sometimes called a **z-score**.

In the code cell below, I demonstrate how this can be done *without* writing a function.

```
numeric_data <- c(3, 8, 4, 7, 12, 2)
# Calculate the z-scores of numeric_data
(numeric_data - mean(numeric_data)) / sd(numeric_data)
```

`[1] -0.8017837 0.5345225 -0.5345225 0.2672612 1.6035675 -1.0690450`

Instead of typing the above expression every time we need to perform this transformation, let’s write a custom function that performs this task.

I will show two equivalent ways of writing this function and discuss the difference.

Note that the input value is named `x`. This is completely arbitrary. The input could also have been named `input`, as long as the same name is used within the body of the function. In the code below, `x` simply represents the numeric vector that we expect to be passed into the function.

```
z_score_1 <- function(x) {
  return((x - mean(x)) / sd(x))
}
```

```
# Let's test our function
age_vector <- c(18, 24, 21, 37, 51, 34, 41)
z_score_1(age_vector)
```

`[1] -1.1992327 -0.6955550 -0.9473939 0.3957468 1.5709949 0.1439079 0.7315320`

By default, an `R` function returns the result of the **last** operation it performed. The code below is an equivalent way of writing the same function. In this case we do not need to use `return` to give us the result.

```
# Equivalent
z_score_2 <- function(x) {
  (x - mean(x)) / sd(x)
}
```

```
# Check results
z_score_2(age_vector)
```

`[1] -1.1992327 -0.6955550 -0.9473939 0.3957468 1.5709949 0.1439079 0.7315320`

The `return()` call is useful when you need to return a **list of results** from a function. The function below creates three objects, `mean_x`, `sd_x`, and `scaled_data`. To obtain all of these results, we must use `return` and build a list that contains all of the objects.

```
# return a list
z_score_3 <- function(x) {
  mean_x <- mean(x)                  # Calculate and save the mean
  sd_x <- sd(x)                      # Calculate and save the standard deviation
  scaled_data <- (x - mean_x)/sd_x   # Save the transformed vector
  return(list(mean_value = mean_x,
              sd_value = sd_x,
              scaled_vector = scaled_data))
}
```

```
detailed_results <- z_score_3(age_vector)
# View the results
detailed_results
```

```
$mean_value
[1] 32.28571
$sd_value
[1] 11.91238
$scaled_vector
[1] -1.1992327 -0.6955550 -0.9473939 0.3957468 1.5709949 0.1439079 0.7315320
```
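Individual components of the returned list can be accessed with the `$` operator. The snippet below repeats the definitions from above so that it runs on its own:

```
# Definitions repeated from above
z_score_3 <- function(x) {
  mean_x <- mean(x)
  sd_x <- sd(x)
  scaled_data <- (x - mean_x)/sd_x
  return(list(mean_value = mean_x,
              sd_value = sd_x,
              scaled_vector = scaled_data))
}
age_vector <- c(18, 24, 21, 37, 51, 34, 41)
detailed_results <- z_score_3(age_vector)
# Access a single component of the result list
detailed_results$mean_value
```

`[1] 32.28571`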

This section will cover the basics of data manipulation using the `tidyverse` package. Before we can use the package, we must load it into our environment with `library(tidyverse)`. This will import all of the functions available in the `tidyverse` package into our environment.

The `tidyverse` is a collection of 8 packages that are designed specifically for data science tasks.

In this course, I have installed all required packages into our RStudio Cloud environment. If you are ever working with RStudio on your desktop, you must install packages before they can be used. This is done with the following code: `install.packages('tidyverse')`.

To get more details about the `tidyverse` package, see the tidyverse documentation.

We will also load the `skimr` package, which is used for exploring the structure of a data frame.

```
# This will load all 8 of the tidyverse packages
library(tidyverse)
library(skimr)
```

The first package we will explore is `tibble`. The `tibble` package is used for creating special types of data frames called tibbles.

Tibbles are data frames with added properties and functionality. Many of the core functions in the `tidyverse` take tibbles as arguments and return them as results after execution.

`R` has many built-in datasets that can be loaded as data frames. One example is the `iris` data frame. To load this data, you just have to type `iris` in the `R` console.

Each row in `iris` represents a flower, with corresponding measurements of the length and width of its sepal and petal.

By default, `R` will try to print every row of a data frame, easily overwhelming your console. Another property of `R` data frames is that each row is labeled with a number. These are known as row names.

`iris`

```
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
57 6.3 3.3 4.7 1.6 versicolor
58 4.9 2.4 3.3 1.0 versicolor
59 6.6 2.9 4.6 1.3 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
62 5.9 3.0 4.2 1.5 versicolor
63 6.0 2.2 4.0 1.0 versicolor
64 6.1 2.9 4.7 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
66 6.7 3.1 4.4 1.4 versicolor
67 5.6 3.0 4.5 1.5 versicolor
68 5.8 2.7 4.1 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
70 5.6 2.5 3.9 1.1 versicolor
71 5.9 3.2 4.8 1.8 versicolor
72 6.1 2.8 4.0 1.3 versicolor
73 6.3 2.5 4.9 1.5 versicolor
74 6.1 2.8 4.7 1.2 versicolor
75 6.4 2.9 4.3 1.3 versicolor
76 6.6 3.0 4.4 1.4 versicolor
77 6.8 2.8 4.8 1.4 versicolor
78 6.7 3.0 5.0 1.7 versicolor
79 6.0 2.9 4.5 1.5 versicolor
80 5.7 2.6 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
85 5.4 3.0 4.5 1.5 versicolor
86 6.0 3.4 4.5 1.6 versicolor
87 6.7 3.1 4.7 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
89 5.6 3.0 4.1 1.3 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
92 6.1 3.0 4.6 1.4 versicolor
93 5.8 2.6 4.0 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
95 5.6 2.7 4.2 1.3 versicolor
96 5.7 3.0 4.2 1.2 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
100 5.7 2.8 4.1 1.3 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3.0 5.5 2.1 virginica
114 5.7 2.5 5.0 2.0 virginica
115 5.8 2.8 5.1 2.4 virginica
116 6.4 3.2 5.3 2.3 virginica
117 6.5 3.0 5.5 1.8 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
120 6.0 2.2 5.0 1.5 virginica
121 6.9 3.2 5.7 2.3 virginica
122 5.6 2.8 4.9 2.0 virginica
123 7.7 2.8 6.7 2.0 virginica
124 6.3 2.7 4.9 1.8 virginica
125 6.7 3.3 5.7 2.1 virginica
126 7.2 3.2 6.0 1.8 virginica
127 6.2 2.8 4.8 1.8 virginica
128 6.1 3.0 4.9 1.8 virginica
129 6.4 2.8 5.6 2.1 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
133 6.4 2.8 5.6 2.2 virginica
134 6.3 2.8 5.1 1.5 virginica
135 6.1 2.6 5.6 1.4 virginica
136 7.7 3.0 6.1 2.3 virginica
137 6.3 3.4 5.6 2.4 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
```
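To avoid printing all 150 rows as above, the base `R` function `head()` returns just the first rows of a data frame (6 by default):

```
# Print only the first rows of iris instead of all 150
head(iris)
# Or request a specific number of rows
head(iris, n = 3)
```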

To convert any `R` data frame into a tibble, we can use the `as_tibble()` function from the `tibble` package. In the code below, we create a tibble named `iris_tbl`.

A nice property of tibbles is that they only print the first 10 rows of data and label each column with its respective data type. In the output below, “dbl” stands for numeric.

```
iris_tbl <- as_tibble(iris)
iris_tbl
```

When we pass `iris_tbl` to the `str()` function, the output confirms that we have a tibble.

`str(iris_tbl)`

```
tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

In general, tibbles are much easier to work with than data frames. However, not all `R` functions are able to work with them. If you ever encounter this situation, it is easy to convert a tibble back to a data frame with the `as.data.frame()` function.

The code below converts our `iris_tbl` back to a data frame.

```
iris_df <- as.data.frame(iris_tbl)
str(iris_df)
```

```
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

`tibble()`

We can create tibbles from individual vectors using the `tibble()` function. This is similar to how data frames are created with the `data.frame()` function.

One major difference is that `tibble()` allows you to reference variables within the function call. You can even use `R` functions to create new columns. See the example below, which uses `tibble()` to create a simple dataset.

```
my_tbl <- tibble(column_1 = c(1, 3, 7, 2.5, 22),
                 column_2 = c('A', 'B', 'C', 'D', 'E'),
                 column_3 = (column_1 * 2) + 10,
                 column_4 = column_1 + mean(column_1))
my_tbl
```

Before we are able to perform data analysis, we must import data into our `R` environment.

The `tidyverse` package loads the `readr` package, which contains a number of functions for importing data into `R`.

The `read_delim()` function is used to import flat files such as comma-delimited (.csv) or tab-delimited (.txt) files. It takes many arguments, but the 3 most important are:

- `file` - the path to a file on your computer or the website address of the data file
- `delim` - the type of delimiter in the data file (either "," for comma, "\t" for tab, or any other character)
- `col_names` - TRUE or FALSE to indicate whether the file has column names

To see how this function works, let’s import the Wine Dataset from the UCI Machine Learning Repository.

If there are no column names in a dataset, `read_delim()` will auto-generate names that begin with an **X** and cycle through a sequence of integers.

The `read_delim()` function will also print a message to the `R` console about the data types it has assigned to each column.

```
wine_data <- read_delim('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                        delim = ',',
                        col_names = FALSE)
```

```
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_double(),
  X11 = col_double(),
  X12 = col_double(),
  X13 = col_double(),
  X14 = col_double()
)
```

`wine_data`

In this course, we will be loading tibbles from our course website with the `read_rds()` function (as demonstrated below). However, I recommend that you refer to the readr documentation to get more familiar with reading different types of data into your `R` environment.

The code below will import a data set from our course website. The data consists of 1,470 employee records for a U.S. based product company. The rows in this data frame represent the attributes of an employee at this company across the variables listed in the table below.

Variable | Definition |
---|---|
left_company | Did the employee leave the company? (Yes/No) |
department | Department within the company |
job_level | Job Level (Associate - Vice President) |
salary | Employee yearly salary (US Dollars) |
weekly_hours | Self-reported average weekly hours spent on the job (company survey) |
business_travel | Level of required business travel |
yrs_at_company | Tenure at the company (years) |
yrs_since_promotion | Years since last promotion |
previous_companies | Number of previous companies for which the employee has worked |
job_satisfaction | Self-reported job satisfaction (company survey) |
performance_rating | Most recent annual performance rating |
marital_status | Marital status (Single, Married, or Divorced) |
miles_from_home | Distance from employee address to office location |

This data is a special type of data frame known as a `tibble`. All data frames in the `tidyverse` are usually stored in this format. It has special properties, which include better printing features and labels for column data types.

```
employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))
# View data
employee_data
```

`skimr`

The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.

The `skim()` function can be used to accomplish all of this. It takes your data frame as an argument. In the output below, we first get the number of rows and columns along with the data types present in our data.

The results are then grouped by the type of variables in our data.

First we get a summary of our factor variables, including the number of missing observations, whether our factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels.

Then we get a summary of our numeric variables which include the number of missing observations, the mean and standard deviation, a five number summary, and a plot of the distribution of values.

```
# View data frame properties and summary statistics
skim(employee_data)
```

Name | employee_data |
---|---|
Number of rows | 1470 |
Number of columns | 13 |
Column type frequency: | |
factor | 7 |
numeric | 6 |
Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
left_company | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
department | 0 | 1 | FALSE | 6 | IT : 399, Res: 293, Sal: 252, Mar: 238 |
job_level | 0 | 1 | FALSE | 5 | Sen: 476, Man: 344, Dir: 331, Ass: 185 |
business_travel | 0 | 1 | FALSE | 3 | Rar: 1043, Fre: 277, Non: 150 |
job_satisfaction | 0 | 1 | FALSE | 4 | Ver: 459, Hig: 442, Low: 289, Med: 280 |
performance_rating | 0 | 1 | FALSE | 5 | Mee: 546, Exc: 472, Exc: 286, Min: 136 |
marital_status | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
salary | 0 | 1 | 94076.25 | 37590.24 | 29848.56 | 70379.48 | 88555.53 | 117099.9 | 212134.7 | ▃▇▃▁▁ |
weekly_hours | 0 | 1 | 50.02 | 4.82 | 40.00 | 47.00 | 49.00 | 52.0 | 66.0 | ▂▇▃▂▁ |
yrs_at_company | 0 | 1 | 7.01 | 6.13 | 0.00 | 3.00 | 5.00 | 9.0 | 40.0 | ▇▂▁▁▁ |
yrs_since_promotion | 0 | 1 | 2.19 | 3.22 | 0.00 | 0.00 | 1.00 | 3.0 | 15.0 | ▇▁▁▁▁ |
previous_companies | 0 | 1 | 3.24 | 1.58 | 1.00 | 2.00 | 3.00 | 4.0 | 7.0 | ▇▇▂▂▃ |
miles_from_home | 0 | 1 | 9.19 | 8.11 | 1.00 | 2.00 | 7.00 | 14.0 | 29.0 | ▇▅▂▂▂ |

It is also possible to select a subset of variables to explore. Just pass a sequence of unquoted variable names into the `skim()` function.

The `skimr` package has many more features for exploring data. Once we cover the fundamentals of `dplyr` in the next sections, I encourage interested students to explore the skimr documentation.

```
# View data frame properties and summary statistics
skim(employee_data, left_company, department, salary, weekly_hours)
```

Name | employee_data |
---|---|
Number of rows | 1470 |
Number of columns | 13 |
Column type frequency: | |
factor | 2 |
numeric | 2 |
Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
left_company | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
department | 0 | 1 | FALSE | 6 | IT : 399, Res: 293, Sal: 252, Mar: 238 |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
salary | 0 | 1 | 94076.25 | 37590.24 | 29848.56 | 70379.48 | 88555.53 | 117099.9 | 212134.7 | ▃▇▃▁▁ |
weekly_hours | 0 | 1 | 50.02 | 4.82 | 40.00 | 47.00 | 49.00 | 52.0 | 66.0 | ▂▇▃▂▁ |

`dplyr`

In this section we will cover data manipulation with the `dplyr` package. This is one of the core `tidyverse` packages used for exploring data frames.

Chapter 5 of R for Data Science covers the basics of manipulating data frames in `R`. In this tutorial, I would like to provide additional examples of the main functions of `dplyr`, including `filter()`, `select()`, `arrange()`, `summarise()`, and `mutate()`.

The first argument to all of these functions is a data frame, followed by additional arguments that perform various manipulations on the data. The output from all of these functions is also a special type of data frame known as a `tibble`.

`filter()`

The `filter()` function is used for subsetting rows of a data frame. It is much more intuitive than subsetting with the base `R` operators `[ ]` and `[[ ]]`.

The first argument to `filter()` is a data frame, followed by one or more logical conditions on the variables within the data frame. **Logical conditions separated by a comma are treated as an AND (&) operation**. The advantage of `dplyr` is that you can pass variable names of a data frame in raw, unquoted form to many functions. The `filter()` function returns a data frame that has been subset by the logical conditions within its arguments.

```
# employees that left the company
filter(employee_data, left_company == 'Yes')
```

```
# View employees that left from the Sales department
filter(employee_data, left_company == 'Yes', department == 'Sales')
```

To filter a data frame using an OR condition, we must use the `|` operator.

```
# employees from Sales or Marketing department
filter(employee_data, department == 'Sales' | department == 'Marketing')
```

Another way to execute OR statements is with the `%in%` operator. This operator checks whether a column's values match at least one element within a vector. In many cases, it can save lots of typing. The code below produces the same result as the previous command.

```
# employees from Sales or Marketing department
filter(employee_data, department %in% c('Sales', 'Marketing'))
```

What if we are interested in employees from Sales or Marketing that make over $80,000? We can just add another condition to the previous code. Remember that conditions separated by a comma represent an AND operation. So in the code below, we are passing the following condition: `salary` > 80000 **AND** (`department` is Sales **OR** `department` is Marketing).

```
# employees from Sales or Marketing department
filter(employee_data, salary > 80000, department %in% c('Sales', 'Marketing'))
```

`select()`

The `select()` function allows you to select a subset of columns from a data frame. There are multiple ways to enter the selection condition, and there are many helper functions, such as `starts_with()`, `ends_with()`, and `contains()`. See the documentation for more examples.
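As a quick sketch of these helpers, the snippet below applies them to a small made-up tibble (`demo_tbl` is invented here for illustration; any data frame works the same way):

```
library(tidyverse)

# demo_tbl is a made-up tibble for illustrating select() helpers
demo_tbl <- tibble(yrs_at_company = c(3, 5),
                   yrs_since_promotion = c(1, 2),
                   salary = c(90000, 85000))

# Columns whose names start with 'yrs'
select(demo_tbl, starts_with('yrs'))

# Columns whose names contain 'promotion'
select(demo_tbl, contains('promotion'))
```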

We can select columns by using unquoted column names.

```
# Select the first three columns
select(employee_data, left_company, department, job_level)
```

We can also select columns by using their numeric positions.

```
# Select the first three columns with a numeric vector
select(employee_data, c(1, 2, 3))
```

We can also pass a sequence of numeric positions separated by a comma.

```
# Select the first three columns with raw numbers
select(employee_data, 1, 2, 3)
```

Adding a `-` in front of numeric positions or variable names excludes those variables and returns all others.

```
# Select all columns except department and job_level
select(employee_data, -department, -job_level)
```

```
# Exclude the first 5 columns
select(employee_data, -1, -2, -3, -4, -5)
```