In this tutorial, we will learn about R functions and data analysis with the tidyverse package.

Please click the button below to open an interactive version of all course R tutorials through RStudio Cloud.

Note: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.



Click the button below to launch an interactive RStudio environment using Binder.org. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.





Functions

In this section we will learn about common built-in functions that are useful for obtaining summary statistics, ranking data, and data analysis. We will also learn how to write our own custom functions in R.


Built-in Functions

Percentiles

The functions below are useful for studying the distribution of numeric values within a data set. All of these functions take a numeric vector as their input.

  • min()
    • Returns the minimum value
  • max()
    • Returns the maximum value
  • range()
    • Returns a vector of length 2 with the range of observed values (minimum and maximum values)
  • median()
    • Returns the median value (50th percentile)
  • fivenum()
    • Returns a vector of length 5 with the minimum, 25th percentile, median, 75th percentile, maximum values
  • quantile()
    • Returns the specified percentile(s) of a set of numeric values


Examples

Obtaining the range of values present in a numeric vector.

data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)

# minimum value in data_vector
min(data_vector)
[1] 3
# maximum value
max(data_vector)
[1] 30
# range of data values
range(data_vector)
[1]  3 30



The median() and quantile() functions are used for obtaining specific percentiles from a distribution of numbers. A percentile of a set of numbers is a value at or below which a given percentage of the values fall. For example, the 50th percentile (also called the median) represents the center of a set of numeric data. This means that 50% of all the values are less than or equal to the 50th percentile.

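The quantile() function is usually called with two inputs. The first is a numeric vector of data values and the second is a vector with values ranging from 0 to 1, representing the percentile(s) to calculate. If the second input is omitted, quantile() returns the 0th, 25th, 50th, 75th, and 100th percentiles.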

# median
median(data_vector)
[1] 14
# 30th percentile
quantile(data_vector, 0.3)
30% 
9.1 
# 30th, 60th, and 90th percentiles
quantile(data_vector, c(0.3, 0.6, 0.9))
 30%  60%  90% 
 9.1 15.0 28.7 
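
One way to check the definition of a percentile is to count the share of values that fall at or below a computed percentile. The sketch below reuses data_vector from the example above; the names p30 and share_at_or_below are introduced here only for illustration.

```r
data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)

# 30th percentile (unname() drops the "30%" label from the result)
p30 <- unname(quantile(data_vector, 0.3))

# Proportion of values at or below the 30th percentile
share_at_or_below <- mean(data_vector <= p30)
share_at_or_below
```

With only 11 values the share can only be a multiple of 1/11, so it lands near 0.3 rather than exactly on it.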



The fivenum() function calculates the five number summary (min, 25th, median, 75th, max) of a numeric vector.

fivenum(data_vector)
[1]  3.00  9.05 14.00 22.80 30.00



Mean and Standard Deviation

The mean() and sd() functions are used to calculate the mean and standard deviation of a set of data values.

# mean value
mean(data_vector)
[1] 15.57273
# standard deviation
sd(data_vector)
[1] 9.241114
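
To see what these two functions compute, we can reproduce them by hand. This is a minimal sketch; manual_mean and manual_sd are illustrative names, and the formula used is the sample standard deviation (dividing by n - 1), which is what sd() calculates.

```r
data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)

n <- length(data_vector)

# Mean: the sum of the values divided by their count
manual_mean <- sum(data_vector) / n

# Sample standard deviation: note the n - 1 in the denominator
manual_sd <- sqrt(sum((data_vector - manual_mean)^2) / (n - 1))

# Compare with the built-in functions (all.equal allows for rounding error)
all.equal(manual_mean, mean(data_vector))
all.equal(manual_sd, sd(data_vector))
```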



Adding Elements of a Numeric Vector

The sum() and cumsum() functions are used for summing the numbers within a vector. The sum() function simply returns the sum of all numbers within a vector.

The cumsum() function calculates a cumulative sum for every position within a vector. This function always returns a vector of the same length as the input.

# sum of all values
sum(data_vector)
[1] 171.3
# cumulative sum
cumsum(data_vector)
 [1]   3.0  12.0  23.2  37.2  65.9  95.9 110.9 131.9 137.6 146.7 171.3
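
The two functions are closely related: the last element of a cumulative sum is the overall sum. A cumulative sum can also be turned into a running mean by dividing each position by the number of values seen so far. The sketch below illustrates both ideas; running_total and running_mean are names chosen for this example.

```r
data_vector <- c(3, 9, 11.2, 14, 28.7, 30, 15, 21, 5.7, 9.1, 24.6)

running_total <- cumsum(data_vector)

# The last element of the cumulative sum matches sum()
all.equal(tail(running_total, 1), sum(data_vector))

# Running mean: cumulative sum divided by the position index
running_mean <- running_total / seq_along(data_vector)

# The final running mean matches mean()
all.equal(tail(running_mean, 1), mean(data_vector))
```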



Functions Useful for Ranking Data

The abs() and rank() functions are useful for ranking data values. The abs() function returns the absolute values of a vector.

negative_data <- c(-2, 4.5, -6, 10, 12)

# returns the absolute value of all elements
abs(negative_data)
[1]  2.0  4.5  6.0 10.0 12.0


The rank() function returns the ranks of a set of data values from smallest to largest. The smallest value is given a rank of 1.

data_vector
 [1]  3.0  9.0 11.2 14.0 28.7 30.0 15.0 21.0  5.7  9.1 24.6
rank(data_vector)
 [1]  1  3  5  6 10 11  7  8  2  4  9


To obtain ranks from largest to smallest, where rank 1 represents the largest value, just take the rank of the negative of a numeric vector. In the example below, the value 30 is given a rank of 1.

data_vector
 [1]  3.0  9.0 11.2 14.0 28.7 30.0 15.0 21.0  5.7  9.1 24.6
rank(-data_vector)
 [1] 11  9  7  6  2  1  5  4 10  8  3
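
The two functions can be combined to rank values by magnitude, ignoring sign. The sketch below also shows how rank() treats tied values; tied_values is a made-up vector used only for this illustration.

```r
negative_data <- c(-2, 4.5, -6, 10, 12)

# Rank by magnitude, with rank 1 given to the largest absolute value
magnitude_rank <- rank(-abs(negative_data))
magnitude_rank

# Tied values share the average of their ranks by default
tied_values <- c(10, 20, 20, 30)
average_ranks <- rank(tied_values)
average_ranks

# ties.method = "min" gives every tied value the smallest of their ranks
min_ranks <- rank(tied_values, ties.method = "min")
min_ranks
```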



Writing Functions in R

There are many cases when we will have to write our own functions to achieve tasks in an analytics project. R functions can be defined to take any number of inputs (usually called arguments) but only return one object.

The basic syntax of creating a function with arguments x and y is as follows:

my_function <- function(x, y) {
               # R code here
}
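
As a concrete instance of this syntax, here is a small function with two arguments. The function percent_change() and its argument names are hypothetical and chosen only to illustrate the pattern.

```r
# percent_change() is a hypothetical example function
percent_change <- function(old_value, new_value) {
  (new_value - old_value) / old_value * 100
}

# Arguments can be matched by position
single_result <- percent_change(50, 75)
single_result

# Arguments can also be matched by name, and vectors work element by element
vector_result <- percent_change(old_value = c(10, 20), new_value = c(15, 10))
vector_result
```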



Assume that we would like to write a function that takes a numeric vector as input and returns a vector of scaled values. For each value in our original vector, we will subtract the mean and divide by the standard deviation. In Statistics, this transformation is sometimes called a z-score.

In the code cell below, I demonstrate how this can be done without writing a function.

numeric_data <- c(3, 8, 4, 7, 12, 2)

# Calculate the z-scores of numeric_data
(numeric_data - mean(numeric_data)) / sd(numeric_data)
[1] -0.8017837  0.5345225 -0.5345225  0.2672612  1.6035675 -1.0690450



Instead of typing the above expression every time we need to perform this transformation, let’s write a custom function that performs this task.

I will show two equivalent ways of writing this function and discuss the difference.

Note that the input value is named x. This is completely arbitrary. The input value could also have been named input as long as the same name is used within the code of the function. In our code below, x simply represents the numeric vector that we expect to get passed into the function.

z_score_1 <- function(x) {
              return((x - mean(x))/sd(x))
}
# Let's test our function
age_vector <- c(18, 24, 21, 37, 51, 34, 41)

z_score_1(age_vector)
[1] -1.1992327 -0.6955550 -0.9473939  0.3957468  1.5709949  0.1439079  0.7315320



By default, an R function returns the results of the last operation that it performed. The code below is an equivalent way of writing the same function. In this case we do not need to use return to give us the result.

# Equivalent
z_score_2 <- function(x) {
              (x - mean(x))/sd(x)
}
# Check results
z_score_2(age_vector)
[1] -1.1992327 -0.6955550 -0.9473939  0.3957468  1.5709949  0.1439079  0.7315320
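
One situation where an explicit return() does change behavior is when we want to leave a function early. The sketch below is a hypothetical variant, z_score_safe(), that returns zeros when every value is identical, since dividing by a standard deviation of 0 would otherwise produce NaN. It assumes x has at least two non-missing values.

```r
# z_score_safe() is a hypothetical variant of the z-score function
z_score_safe <- function(x) {
  sd_x <- sd(x)

  # Exit early when all values are identical (standard deviation of 0)
  if (sd_x == 0) {
    return(rep(0, length(x)))
  }

  # Otherwise the last expression is returned as usual
  (x - mean(x)) / sd_x
}

constant_result <- z_score_safe(c(5, 5, 5))
constant_result

age_vector <- c(18, 24, 21, 37, 51, 34, 41)
scaled_ages <- z_score_safe(age_vector)
scaled_ages
```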



An explicit return() call is a clear way to hand back several results at once. The function below creates three objects, mean_x, sd_x, and scaled_data. Because a function can only return one object, we build a list that contains all three and pass it to return().

# return a list
z_score_3 <- function(x) {
                mean_x <- mean(x)  # Calculate and save the mean
                
                sd_x <- sd(x)  # Calculate and save the standard deviation
                
                scaled_data <- (x - mean_x)/sd_x  # Save the transformed vector
                
                return(list(mean_value = mean_x,
                            sd_value = sd_x,
                            scaled_vector = scaled_data)) 
}
detailed_results <- z_score_3(age_vector)

# View the results
detailed_results
$mean_value
[1] 32.28571

$sd_value
[1] 11.91238

$scaled_vector
[1] -1.1992327 -0.6955550 -0.9473939  0.3957468  1.5709949  0.1439079  0.7315320
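
Once the results are stored in a list, individual elements can be pulled out by name. The sketch below rebuilds a list with the same structure as the one returned by z_score_3() so that it can run on its own.

```r
age_vector <- c(18, 24, 21, 37, 51, 34, 41)

# Same structure as the list returned by z_score_3()
detailed_results <- list(
  mean_value    = mean(age_vector),
  sd_value      = sd(age_vector),
  scaled_vector = (age_vector - mean(age_vector)) / sd(age_vector)
)

# Extract a single element by name with $
detailed_results$mean_value

# [[ ]] with a quoted name does the same thing
detailed_results[["sd_value"]]

# names() lists the elements that are available
names(detailed_results)
```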



Introduction to the Tidyverse

This section will cover the basics of data manipulation using the tidyverse package. Before we can use the package, we must load it into our environment with the following code: library(tidyverse). This will import all of the functions available in the tidyverse package into our environment.

The tidyverse is a collection of 8 packages that are designed specifically for data science tasks.

In this course, I have installed all required packages into our RStudio Cloud environment. If you are ever working with RStudio on your desktop, you must install packages before they can be used. This is done with the following code: install.packages('tidyverse').

To get more details about the tidyverse package, see the tidyverse documentation.

We will also load the skimr package which is used for exploring the structure of a data frame.

# This will load all 8 of the tidyverse packages
library(tidyverse)
library(skimr)



Tibbles

The first package we will explore is tibble. The tibble package is used for creating special types of data frames called tibbles.

Tibbles are data frames with added properties and functionality. Many of the core functions in the tidyverse take tibbles as arguments and return them as results after execution.

Creating tibbles

R has many built-in datasets that can be loaded as data frames. One example is the iris data frame. To load this data, you just have to type iris in the R console.

Each row in iris represents a flower with corresponding measurements of the length and width of the sepal and petal.

By default, R will try to print every row of a data frame, easily overwhelming your console. Another property of R data frames is that each row is labeled with a number. These are known as row labels.

iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
13           4.8         3.0          1.4         0.1     setosa
14           4.3         3.0          1.1         0.1     setosa
15           5.8         4.0          1.2         0.2     setosa
16           5.7         4.4          1.5         0.4     setosa
17           5.4         3.9          1.3         0.4     setosa
18           5.1         3.5          1.4         0.3     setosa
19           5.7         3.8          1.7         0.3     setosa
20           5.1         3.8          1.5         0.3     setosa
21           5.4         3.4          1.7         0.2     setosa
22           5.1         3.7          1.5         0.4     setosa
23           4.6         3.6          1.0         0.2     setosa
24           5.1         3.3          1.7         0.5     setosa
25           4.8         3.4          1.9         0.2     setosa
26           5.0         3.0          1.6         0.2     setosa
27           5.0         3.4          1.6         0.4     setosa
28           5.2         3.5          1.5         0.2     setosa
29           5.2         3.4          1.4         0.2     setosa
30           4.7         3.2          1.6         0.2     setosa
31           4.8         3.1          1.6         0.2     setosa
32           5.4         3.4          1.5         0.4     setosa
33           5.2         4.1          1.5         0.1     setosa
34           5.5         4.2          1.4         0.2     setosa
35           4.9         3.1          1.5         0.2     setosa
36           5.0         3.2          1.2         0.2     setosa
37           5.5         3.5          1.3         0.2     setosa
38           4.9         3.6          1.4         0.1     setosa
39           4.4         3.0          1.3         0.2     setosa
40           5.1         3.4          1.5         0.2     setosa
41           5.0         3.5          1.3         0.3     setosa
42           4.5         2.3          1.3         0.3     setosa
43           4.4         3.2          1.3         0.2     setosa
44           5.0         3.5          1.6         0.6     setosa
45           5.1         3.8          1.9         0.4     setosa
46           4.8         3.0          1.4         0.3     setosa
47           5.1         3.8          1.6         0.2     setosa
48           4.6         3.2          1.4         0.2     setosa
49           5.3         3.7          1.5         0.2     setosa
50           5.0         3.3          1.4         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
54           5.5         2.3          4.0         1.3 versicolor
55           6.5         2.8          4.6         1.5 versicolor
56           5.7         2.8          4.5         1.3 versicolor
57           6.3         3.3          4.7         1.6 versicolor
58           4.9         2.4          3.3         1.0 versicolor
59           6.6         2.9          4.6         1.3 versicolor
60           5.2         2.7          3.9         1.4 versicolor
61           5.0         2.0          3.5         1.0 versicolor
62           5.9         3.0          4.2         1.5 versicolor
63           6.0         2.2          4.0         1.0 versicolor
64           6.1         2.9          4.7         1.4 versicolor
65           5.6         2.9          3.6         1.3 versicolor
66           6.7         3.1          4.4         1.4 versicolor
67           5.6         3.0          4.5         1.5 versicolor
68           5.8         2.7          4.1         1.0 versicolor
69           6.2         2.2          4.5         1.5 versicolor
70           5.6         2.5          3.9         1.1 versicolor
71           5.9         3.2          4.8         1.8 versicolor
72           6.1         2.8          4.0         1.3 versicolor
73           6.3         2.5          4.9         1.5 versicolor
74           6.1         2.8          4.7         1.2 versicolor
75           6.4         2.9          4.3         1.3 versicolor
76           6.6         3.0          4.4         1.4 versicolor
77           6.8         2.8          4.8         1.4 versicolor
78           6.7         3.0          5.0         1.7 versicolor
79           6.0         2.9          4.5         1.5 versicolor
80           5.7         2.6          3.5         1.0 versicolor
81           5.5         2.4          3.8         1.1 versicolor
82           5.5         2.4          3.7         1.0 versicolor
83           5.8         2.7          3.9         1.2 versicolor
84           6.0         2.7          5.1         1.6 versicolor
85           5.4         3.0          4.5         1.5 versicolor
86           6.0         3.4          4.5         1.6 versicolor
87           6.7         3.1          4.7         1.5 versicolor
88           6.3         2.3          4.4         1.3 versicolor
89           5.6         3.0          4.1         1.3 versicolor
90           5.5         2.5          4.0         1.3 versicolor
91           5.5         2.6          4.4         1.2 versicolor
92           6.1         3.0          4.6         1.4 versicolor
93           5.8         2.6          4.0         1.2 versicolor
94           5.0         2.3          3.3         1.0 versicolor
95           5.6         2.7          4.2         1.3 versicolor
96           5.7         3.0          4.2         1.2 versicolor
97           5.7         2.9          4.2         1.3 versicolor
98           6.2         2.9          4.3         1.3 versicolor
99           5.1         2.5          3.0         1.1 versicolor
100          5.7         2.8          4.1         1.3 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
104          6.3         2.9          5.6         1.8  virginica
105          6.5         3.0          5.8         2.2  virginica
106          7.6         3.0          6.6         2.1  virginica
107          4.9         2.5          4.5         1.7  virginica
108          7.3         2.9          6.3         1.8  virginica
109          6.7         2.5          5.8         1.8  virginica
110          7.2         3.6          6.1         2.5  virginica
111          6.5         3.2          5.1         2.0  virginica
112          6.4         2.7          5.3         1.9  virginica
113          6.8         3.0          5.5         2.1  virginica
114          5.7         2.5          5.0         2.0  virginica
115          5.8         2.8          5.1         2.4  virginica
116          6.4         3.2          5.3         2.3  virginica
117          6.5         3.0          5.5         1.8  virginica
118          7.7         3.8          6.7         2.2  virginica
119          7.7         2.6          6.9         2.3  virginica
120          6.0         2.2          5.0         1.5  virginica
121          6.9         3.2          5.7         2.3  virginica
122          5.6         2.8          4.9         2.0  virginica
123          7.7         2.8          6.7         2.0  virginica
124          6.3         2.7          4.9         1.8  virginica
125          6.7         3.3          5.7         2.1  virginica
126          7.2         3.2          6.0         1.8  virginica
127          6.2         2.8          4.8         1.8  virginica
128          6.1         3.0          4.9         1.8  virginica
129          6.4         2.8          5.6         2.1  virginica
130          7.2         3.0          5.8         1.6  virginica
131          7.4         2.8          6.1         1.9  virginica
132          7.9         3.8          6.4         2.0  virginica
133          6.4         2.8          5.6         2.2  virginica
134          6.3         2.8          5.1         1.5  virginica
135          6.1         2.6          5.6         1.4  virginica
136          7.7         3.0          6.1         2.3  virginica
137          6.3         3.4          5.6         2.4  virginica
138          6.4         3.1          5.5         1.8  virginica
139          6.0         3.0          4.8         1.8  virginica
140          6.9         3.1          5.4         2.1  virginica
141          6.7         3.1          5.6         2.4  virginica
142          6.9         3.1          5.1         2.3  virginica
143          5.8         2.7          5.1         1.9  virginica
144          6.8         3.2          5.9         2.3  virginica
145          6.7         3.3          5.7         2.5  virginica
146          6.7         3.0          5.2         2.3  virginica
147          6.3         2.5          5.0         1.9  virginica
148          6.5         3.0          5.2         2.0  virginica
149          6.2         3.4          5.4         2.3  virginica
150          5.9         3.0          5.1         1.8  virginica



Converting Data Frames to Tibbles

To convert any R data frame into a tibble, we can use the as_tibble() function from the tibble package. In the code below, we create a tibble named iris_tbl.

A nice property of tibbles is that they print only the first 10 rows of data and label each column with its respective data type. In the output below, “dbl” stands for double, R's default numeric type.

iris_tbl <- as_tibble(iris)

iris_tbl



When we pass iris_tbl to the str() function, we see that it lets us know that we have a tibble.

str(iris_tbl)
tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
 $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
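
Besides str(), a few other checks can confirm whether an object is a tibble. The sketch below uses is_tibble() from the tibble package along with base R's class() and is.data.frame().

```r
library(tibble)

iris_tbl <- as_tibble(iris)

# is_tibble() is TRUE for tibbles and FALSE for plain data frames
is_tibble(iris_tbl)
is_tibble(iris)

# A tibble carries extra classes on top of "data.frame"
class(iris_tbl)

# Because of that, a tibble still counts as a data frame
is.data.frame(iris_tbl)
```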



Converting Tibbles to Data Frames

In general, tibbles are much easier to work with than data frames. However, not all R functions are able to work with them. If you ever encounter this situation, it is easy to convert a tibble back to a data frame with the as.data.frame() function.

The code below converts our iris_tbl back to a data frame.

iris_df <- as.data.frame(iris_tbl)

str(iris_df)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...



Creating Tibbles with tibble()

We can create tibbles from individual vectors using the tibble() function. This is similar to how data frames are created with the data.frame() function.

One major difference is that tibble() allows you to reference variables within the function call. You can even use R functions to create new columns. See the example below that uses tibble() to create a simple dataset.

my_tbl <- tibble(column_1 = c(1, 3, 7, 2.5, 22),
                 column_2 = c('A', 'B', 'C', 'D', 'E'),
                 column_3 = (column_1 * 2) + 10,
                 column_4 = column_1 + mean(column_1))

my_tbl
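
To make the difference from data.frame() concrete, the sketch below tries the same column reference with both functions. The column names base_value and doubled are made up for this example, and it assumes no object named base_value already exists in your workspace.

```r
library(tibble)

# tibble() builds columns in order, so doubled can refer to base_value
tbl_result <- tibble(base_value = c(1, 2, 3),
                     doubled = base_value * 2)
tbl_result

# data.frame() cannot see base_value while building doubled, so it fails;
# tryCatch() captures the error message instead of stopping the script
df_result <- tryCatch(
  data.frame(base_value = c(1, 2, 3), doubled = base_value * 2),
  error = function(e) conditionMessage(e)
)
df_result
```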



Introduction to Data Analysis

Loading Data into R

Before we are able to perform data analysis, we must import data into our R environment.

The tidyverse package loads the readr package which contains a number of functions for importing data into R.

The read_delim() function is used to import flat files such as comma-delimited (.csv) or tab-delimited (.txt) files.

The read_delim() function takes many arguments, but the 3 most important are:

  • file - the first argument is the path to a file on your computer or website address of the data file
  • delim - the type of delimiter in the data file (either “,” for comma, “\t” for tab, or any other character)
  • col_names - TRUE or FALSE to indicate whether a file has column names

To see how this function works, let’s import the Wine Dataset from the UCI Machine Learning Repository.

If there are no column names in a dataset, read_delim() will auto-generate names that begin with an X and cycle through a sequence of integers.

The read_delim() function will also print a message to the R console about the data types it has assigned to each column.


wine_data <- read_delim('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                        delim = ',',
                        col_names = FALSE)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_double(),
  X11 = col_double(),
  X12 = col_double(),
  X13 = col_double(),
  X14 = col_double()
)
wine_data
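
The same three arguments can be tried on a small local file. The sketch below writes a made-up, tab-delimited file to a temporary location and reads it twice: once with auto-generated column names and once with names supplied as a character vector, which col_names also accepts. The file contents and column names are invented for illustration, and suppressMessages() is used only to hide the column specification message.

```r
library(readr)

# Write a small tab-delimited file with no header row (made-up values)
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("1\t14.2\tA", "2\t13.1\tB"), tmp_file)

# col_names = FALSE: names are generated as X1, X2, X3
auto_named <- suppressMessages(
  read_delim(tmp_file, delim = "\t", col_names = FALSE)
)
names(auto_named)

# col_names can also be a character vector of names to use
custom_named <- suppressMessages(
  read_delim(tmp_file, delim = "\t", col_names = c("id", "score", "group"))
)
names(custom_named)

# Remove the temporary file
unlink(tmp_file)
```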



In this course, we will be loading tibbles from our course website with the read_rds() function (as demonstrated below).

However, I recommend that you refer to the readr documentation to get more familiar with reading different types of data into your R environment.

Employee Attrition Data

The code below will import a data set from our course website. The data consists of 1,470 employee records for a U.S.-based product company. Each row in this data frame represents one employee, described by the variables listed in the table below.

Variable              Definition
left_company          Did the employee leave the company? (Yes/No)
department            Department within the company
job_level             Job Level (Associate - Vice President)
salary                Employee yearly salary (US Dollars)
weekly_hours          Self-reported average weekly hours spent on the job (company survey)
business_travel       Level of required business travel
yrs_at_company        Tenure at the company (years)
yrs_since_promotion   Years since last promotion
previous_companies    Number of previous companies for which the employee has worked
job_satisfaction      Self-reported job satisfaction (company survey)
performance_rating    Most recent annual performance rating
marital_status        Marital status (Single, Married, or Divorced)
miles_from_home       Distance from employee address to office location


This data is a special type of data frame known as a tibble. Data frames in the tidyverse are usually stored in this format. It has special properties which include better printing features and labels for column data types.

employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))

# View data
employee_data



Exploring Data Frames with skimr

The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.

The skim() function can be used to accomplish all of this. It takes your data frame as an argument. In the output below, we first get the number of rows and columns along with the data types present in our data.

The results are then grouped by the type of variables in our data.

First we get a summary of our factor variables, including the number of missing observations, whether our factor levels are ordered, the count of unique levels, and an abbreviated list of the most frequent factor levels.

Then we get a summary of our numeric variables, which includes the number of missing observations, the mean and standard deviation, a five number summary, and a plot of the distribution of values.

# View data frame properties and summary statistics
skim(employee_data)
Data summary
Name employee_data
Number of rows 1470
Number of columns 13
_______________________
Column type frequency:
factor 7
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
left_company 0 1 FALSE 2 No: 1233, Yes: 237
department 0 1 FALSE 6 IT : 399, Res: 293, Sal: 252, Mar: 238
job_level 0 1 FALSE 5 Sen: 476, Man: 344, Dir: 331, Ass: 185
business_travel 0 1 FALSE 3 Rar: 1043, Fre: 277, Non: 150
job_satisfaction 0 1 FALSE 4 Ver: 459, Hig: 442, Low: 289, Med: 280
performance_rating 0 1 FALSE 5 Mee: 546, Exc: 472, Exc: 286, Min: 136
marital_status 0 1 FALSE 3 Mar: 673, Sin: 470, Div: 327

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salary 0 1 94076.25 37590.24 29848.56 70379.48 88555.53 117099.9 212134.7 ▃▇▃▁▁
weekly_hours 0 1 50.02 4.82 40.00 47.00 49.00 52.0 66.0 ▂▇▃▂▁
yrs_at_company 0 1 7.01 6.13 0.00 3.00 5.00 9.0 40.0 ▇▂▁▁▁
yrs_since_promotion 0 1 2.19 3.22 0.00 0.00 1.00 3.0 15.0 ▇▁▁▁▁
previous_companies 0 1 3.24 1.58 1.00 2.00 3.00 4.0 7.0 ▇▇▂▂▃
miles_from_home 0 1 9.19 8.11 1.00 2.00 7.00 14.0 29.0 ▇▅▂▂▂



It is also possible to select a subset of variables to explore. Just pass a sequence of unquoted variable names into the skim() function.

The skimr package has many more features for exploring data. Once we cover the fundamentals of dplyr in the next sections, I encourage interested students to explore the skimr documentation.

# View summary statistics for a subset of variables
skim(employee_data, left_company, department, salary, weekly_hours)
Data summary
Name employee_data
Number of rows 1470
Number of columns 13
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
left_company 0 1 FALSE 2 No: 1233, Yes: 237
department 0 1 FALSE 6 IT : 399, Res: 293, Sal: 252, Mar: 238

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salary 0 1 94076.25 37590.24 29848.56 70379.48 88555.53 117099.9 212134.7 ▃▇▃▁▁
weekly_hours 0 1 50.02 4.82 40.00 47.00 49.00 52.0 66.0 ▂▇▃▂▁



Data Manipulation with dplyr

In this section we will cover data manipulation with the dplyr package. This is one of the core tidyverse packages used for exploring data frames.

Chapter 5 of R for Data Science covers the basics of manipulating data frames in R. In this tutorial, I would like to provide additional examples of the main functions of dplyr, including filter(), select(), arrange(), summarise(), and mutate().

The first argument to all of these functions is a data frame, followed by additional arguments that perform various manipulations on the data. When the input is a tibble, as it is throughout this tutorial, the output from these functions will also be a tibble.

filter()

The filter() function is used for subsetting rows of a data frame. It is much more intuitive than subsetting with the base R operators [ ] and [[ ]].

The first argument to filter() is a data frame, followed by one or more logical conditions on the variables within the data frame. Logical conditions separated by a comma are combined with an AND (&) operation. An advantage of dplyr is that you can pass the variable names of a data frame in raw, unquoted form to many functions. The filter() function returns a data frame containing only the rows that satisfy the logical conditions in its arguments.

# employees that left the company
filter(employee_data, left_company == 'Yes') 


# View employees that left from the Sales department
filter(employee_data, left_company == 'Yes', department == 'Sales') 
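
To confirm that a comma behaves like &, the sketch below runs both forms on a small made-up tibble. toy_employees is a hypothetical stand-in for employee_data that borrows three of its column names; the rows are invented.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for employee_data (made-up rows)
toy_employees <- tibble(
  left_company = c("Yes", "No", "Yes", "No"),
  department   = c("Sales", "Sales", "IT", "Marketing"),
  salary       = c(72000, 95000, 88000, 101000)
)

# Conditions separated by a comma
with_comma <- filter(toy_employees, left_company == "Yes", department == "Sales")

# The same conditions joined with &
with_and <- filter(toy_employees, left_company == "Yes" & department == "Sales")

# Both approaches keep the same rows
all.equal(with_comma, with_and)
```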



To filter a data frame using an OR condition, we must use the | operator.

# employees from Sales or Marketing department
filter(employee_data, department == 'Sales' | department == 'Marketing')



Another way to express OR conditions is with the %in% operator. It checks whether each value in a column matches at least one element of a vector. In many cases, it can save lots of typing. The code below produces the same result as the previous command.

# employees from Sales or Marketing department
filter(employee_data, department %in% c('Sales', 'Marketing'))
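
It can help to look at %in% on its own before using it inside filter(). The sketch below applies it to a plain character vector and then checks that the | and %in% versions of a filter agree. The toy_employees tibble is a hypothetical stand-in for employee_data with invented rows.

```r
library(dplyr)
library(tibble)

# %in% returns one TRUE/FALSE per element on its left-hand side
membership <- c("Sales", "IT", "Marketing") %in% c("Sales", "Marketing")
membership

# Hypothetical stand-in for employee_data (made-up rows)
toy_employees <- tibble(
  department = c("Sales", "IT", "Marketing", "Research"),
  salary     = c(72000, 88000, 101000, 64000)
)

with_or <- filter(toy_employees, department == "Sales" | department == "Marketing")
with_in <- filter(toy_employees, department %in% c("Sales", "Marketing"))

# Both approaches keep the same rows
all.equal(with_or, with_in)
```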



What if we are interested in employees from Sales or Marketing that make over $80,000? We can just add another condition to the previous code. Remember that conditions separated by a comma represent an AND operation. So in the code below, we are passing the following condition: employees with salary > 80000 AND (department is Sales OR department is Marketing).

# employees from Sales or Marketing department with a salary over $80,000
filter(employee_data, salary > 80000, department %in% c('Sales', 'Marketing'))



select()

The select() function allows you to select a subset of columns from a data frame. There are multiple ways to enter the selection condition and many helper functions, such as starts_with(), ends_with(), and contains(). See the documentation for more examples.

We can select columns by using unquoted column names.

# Select the first three columns
select(employee_data, left_company, department, job_level)


We can also select columns by using their numeric positions.

# Select the first three columns with a numeric vector
select(employee_data, c(1, 2, 3))


We can also pass a sequence of numeric positions separated by commas.

# Select the first three columns with raw numbers
select(employee_data, 1, 2, 3)


Adding a - in front of numeric positions or variable names excludes those variables and returns all others.

# Select all columns except department and job_level
select(employee_data, -department, -job_level)


# Exclude the first 5 columns
select(employee_data, -1, -2, -3, -4, -5)
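
The helper functions mentioned at the start of this section select columns by matching their names. The sketch below applies starts_with(), ends_with(), and contains() to a one-row, made-up tibble; toy_employees is a hypothetical stand-in that borrows a few column names from employee_data.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for employee_data (one made-up row)
toy_employees <- tibble(
  left_company        = "No",
  department          = "IT",
  job_level           = "Manager",
  job_satisfaction    = "High",
  yrs_at_company      = 5,
  yrs_since_promotion = 2
)

# Columns whose names begin with "yrs"
yrs_columns <- select(toy_employees, starts_with("yrs"))
names(yrs_columns)

# Columns whose names end with "company"
company_columns <- select(toy_employees, ends_with("company"))
names(company_columns)

# Columns whose names contain "job" anywhere
job_columns <- select(toy_employees, contains("job"))
names(job_columns)
```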