This tutorial will focus on exploratory data analysis with `R`. We will introduce new functions that automatically summarize various combinations of data types. These functions can be viewed as helpers/extensions of `dplyr` and `ggplot2` that automate some portions of the data analysis process.

Please click the button below to open an interactive version of all course `R` tutorials through RStudio Cloud.

**Note**: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.

Click the button below to launch an interactive RStudio environment using `Binder.org`. This opens a pre-configured RStudio environment within your browser. Unlike RStudio Cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.

The `R` code below will import the employee attrition and Seattle home sales data sets. It also loads a number of new `R` packages.

- `janitor` - useful for common data cleaning and exploration tasks, including cross tabulation tables
- `funModeling` - a set of functions for automatic plotting and descriptive statistics
- `corrr` - functions for creating and exploring correlations
- `corrplot` - tools for visualizing correlations between sets of numeric variables

```
# Make sure to install new packages
library(funModeling)
library(skimr)
library(janitor)
library(corrr)
library(corrplot)
library(tidyverse)
```

```
# Employee attrition data
employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))
```

```
# Seattle home sales data
home_sales <- read_rds(url('https://gmudatamining.com/data/home_sales.rds')) %>%
  select(-selling_date)
```

We will be working with the `employee_data` and `home_sales` data frames in this lesson. Take a moment to explore these data sets below.

The data consists of 1,470 employee records for a U.S.-based product company. Each row in this data frame represents an employee at this company, whether or not they left the company (`left_company`), and their associated characteristics.

A row in this data frame represents a home that was sold in the Seattle area between 2014 and 2015. The response variable in this data is `selling_price`.

The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.
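
Before turning to specialized packages, these checks can also be sketched with base `R` functions. The example below uses the built-in `mtcars` data set purely for illustration:

```r
# Basic structural checks with base R, illustrated on the built-in mtcars data
dim(mtcars)            # number of rows and columns
names(mtcars)          # column names
sapply(mtcars, class)  # data type of each column
sum(is.na(mtcars))     # total count of missing values
```

The `skim()` function introduced below bundles all of these checks, and more, into a single call.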

The `skim()` function was introduced in tutorial 2 and can be used to accomplish these tasks.

The `skim()` function takes a data frame or tibble as an argument and produces a detailed set of summary statistics, including:

- the number of rows and columns in the data frame
- the number of missing values per column, counts of unique values, and other numeric summary statistics grouped by column type

We see from the results below that `employee_data` has 1,470 rows, 13 columns, 7 factor columns, 6 numeric columns, and no missing values (`n_missing`) across all columns of the data frame.

```
# View data frame properties and summary statistics
skim(employee_data)
```

Data summary | |
---|---|
Name | employee_data |
Number of rows | 1470 |
Number of columns | 13 |
Column type frequency: | |
factor | 7 |
numeric | 6 |
Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
left_company | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
department | 0 | 1 | FALSE | 6 | IT : 399, Res: 293, Sal: 252, Mar: 238 |
job_level | 0 | 1 | FALSE | 5 | Sen: 476, Man: 344, Dir: 331, Ass: 185 |
business_travel | 0 | 1 | FALSE | 3 | Rar: 1043, Fre: 277, Non: 150 |
job_satisfaction | 0 | 1 | FALSE | 4 | Ver: 459, Hig: 442, Low: 289, Med: 280 |
performance_rating | 0 | 1 | FALSE | 5 | Mee: 546, Exc: 472, Exc: 286, Min: 136 |
marital_status | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
salary | 0 | 1 | 94076.25 | 37590.24 | 29848.56 | 70379.48 | 88555.53 | 117099.9 | 212134.7 | ▃▇▃▁▁ |
weekly_hours | 0 | 1 | 50.02 | 4.82 | 40.00 | 47.00 | 49.00 | 52.0 | 66.0 | ▂▇▃▂▁ |
yrs_at_company | 0 | 1 | 7.01 | 6.13 | 0.00 | 3.00 | 5.00 | 9.0 | 40.0 | ▇▂▁▁▁ |
yrs_since_promotion | 0 | 1 | 2.19 | 3.22 | 0.00 | 0.00 | 1.00 | 3.0 | 15.0 | ▇▁▁▁▁ |
previous_companies | 0 | 1 | 3.24 | 1.58 | 1.00 | 2.00 | 3.00 | 4.0 | 7.0 | ▇▇▂▂▃ |
miles_from_home | 0 | 1 | 9.19 | 8.11 | 1.00 | 2.00 | 7.00 | 14.0 | 29.0 | ▇▅▂▂▂ |

Categorical variables are either nominal or ordinal with respect to measurement scale. Therefore the most common summary statistics include frequency counts and proportions.
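
As a quick illustration of these statistics, base `R` provides `table()` for frequency counts and `prop.table()` for proportions. The sketch below uses a small hypothetical vector of department labels rather than the course data:

```r
# Frequency counts and proportions for a categorical variable (hypothetical data)
dept <- c('Sales', 'IT', 'Sales', 'Research', 'IT', 'Sales')

counts <- table(dept)        # counts per distinct level
props  <- prop.table(counts) # proportions (each count divided by the total)

counts
props
```

The `skim()` and `tabyl()` functions shown in this lesson compute these same quantities, but with tidier output for data frames.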

The `skim()` function is a good first step for analyzing character or factor variables. Each time the `skim()` function is run, it produces a data frame as the output.

When we print this data frame to the console, it has the nice printing properties seen in the previous section. One of the columns in the output data frame is named `skim_type` and records the variable type for a particular column.

We can use this column in combination with the `filter()` function to subset the output based on variable type.

For example, the code below will display only the categorical variables in the `employee_data` data frame. Since categorical data can be stored as factor or character columns within a data frame, we must add both to our filter condition.

The `n_unique` and `top_counts` fields in the output below give the number of unique values and the most frequent (abbreviated) values within each column.

```
skim(employee_data) %>%
  filter(skim_type %in% c('factor', 'character'))
```

Data summary | |
---|---|
Name | employee_data |
Number of rows | 1470 |
Number of columns | 13 |
Column type frequency: | |
factor | 7 |
Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
left_company | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
department | 0 | 1 | FALSE | 6 | IT : 399, Res: 293, Sal: 252, Mar: 238 |
job_level | 0 | 1 | FALSE | 5 | Sen: 476, Man: 344, Dir: 331, Ass: 185 |
business_travel | 0 | 1 | FALSE | 3 | Rar: 1043, Fre: 277, Non: 150 |
job_satisfaction | 0 | 1 | FALSE | 4 | Ver: 459, Hig: 442, Low: 289, Med: 280 |
performance_rating | 0 | 1 | FALSE | 5 | Mee: 546, Exc: 472, Exc: 286, Min: 136 |
marital_status | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |

Crosstabs (also called contingency tables) are used to summarize the frequency counts in categorical variables. The `tabyl()` function from the `janitor` package can be used to create crosstabs in `R`.

The `tabyl()` function from the `janitor` package is useful for creating tables of count values for distinct levels of factor or character variables. The `tabyl()` function takes a data frame as the first argument, followed by one or two variables of interest.

The `janitor` package also includes a number of `adorn()` functions, which do things like add row and column totals and percentages to a `tabyl()` result. These are demonstrated in the `R` code below.

The result of either a `tabyl()` or `adorn()` function will be a data frame.

To explore the full set of features provided by this function, refer to the following website:

We can summarize the frequency counts of a single variable by passing a data frame and the column of interest into `tabyl()`. By default, we obtain the counts, `n`, and proportions, `percent`, of the levels of a categorical column.

```
# Summarize a single categorical variable
tabyl(employee_data, department)
```

Since the output of `tabyl()` is a data frame, we can use `dplyr` functions such as `filter()` to subset the results.

```
tabyl(employee_data, department) %>%
  filter(percent >= 0.17)
```

The function `adorn_totals()` can be used to add column totals. To accomplish this, we add `'row'` as a parameter to `adorn_totals()`. This is a bit confusing, since to get column totals we need to input `'row'`. This argument instructs `adorn_totals()` to place the sum of values in a new row below the table.

```
# Add column total in the last row
tabyl(employee_data, department) %>% adorn_totals('row')
```

The power of the `tabyl()` function rests in its ability to produce formatted two-way crosstabs that display counts and percentages for the interaction of two categorical variables.

To produce a two-way crosstab, just pass the two columns of interest into the `tabyl()` function.

`tabyl(employee_data, left_company, business_travel)`

In the example below, additional `adorn()` functions are used to add useful summary statistics to the simple table from above. We add both column and row totals with the `adorn_totals()` function, add row percentages with the `adorn_percentages()` function, remove decimals from our row percentages with the `adorn_pct_formatting()` function, combine counts with our percentage values in each cell with the `adorn_ns()` function, and add a combined title in the first column with `adorn_title()`.

With row percentages, we are exploring the distribution of `business_travel` within each category of `left_company`. For example, we see that of the employees who left the company, 29% traveled frequently for business.

```
tabyl(employee_data, left_company, business_travel) %>%
  adorn_totals(c('row', 'col')) %>%
  adorn_percentages('row') %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>%
  adorn_title('combined')
```

To explore column percentages, we simply replace `'row'` with `'col'` in our `adorn_percentages()` function. In this case, we are studying the distribution of `left_company` values within each category of `business_travel`.

```
tabyl(employee_data, left_company, business_travel) %>%
  adorn_totals(c('row', 'col')) %>%
  adorn_percentages('col') %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>%
  adorn_title('combined')
```

The most common visualizations for categorical data are bar and column charts that plot the frequency of distinct values within a variable.

The `freq()` function from the `funModeling` package can be used to automatically create bar and column charts for the factor or character variables within a data frame. The `freq()` function takes a data frame as the first argument and produces frequency plots and tables of **all** factor/character variables.

This function is meant for interactively exploring data, not for saving results. To save the results, we would have to use the `tabyl()` and `ggplot()` functions to create custom tables and plots.

`freq(employee_data)`

```
left_company frequency percentage cumulative_perc
1 No 1233 83.88 83.88
2 Yes 237 16.12 100.00
```

```
department frequency percentage cumulative_perc
1 IT and Analytics 399 27.14 27.14
2 Research 293 19.93 47.07
3 Sales 252 17.14 64.21
4 Marketing 238 16.19 80.40
5 Product Development 178 12.11 92.51
6 Finance and Operations 110 7.48 100.00
```

```
job_level frequency percentage cumulative_perc
1 Senior Manager 476 32.38 32.38
2 Manager 344 23.40 55.78
3 Director 331 22.52 78.30
4 Associate 185 12.59 90.89
5 Vice President 134 9.12 100.00
```

```
business_travel frequency percentage cumulative_perc
1 Rarely 1043 70.95 70.95
2 Frequently 277 18.84 89.79
3 None 150 10.20 100.00
```

```
job_satisfaction frequency percentage cumulative_perc
1 Very High 459 31.22 31.22
2 High 442 30.07 61.29
3 Low 289 19.66 80.95
4 Medium 280 19.05 100.00
```

```
performance_rating frequency percentage cumulative_perc
1 Meets Expectations 546 37.14 37.14
2 Exceeds Expectations 472 32.11 69.25
3 Exceptional 286 19.46 88.71
4 Minimally Effective 136 9.25 97.96
5 Not Effective 30 2.04 100.00
```

```
marital_status frequency percentage cumulative_perc
1 Married 673 45.78 45.78
2 Single 470 31.97 77.75
3 Divorced 327 22.24 100.00
```

`[1] "Variables processed: left_company, department, job_level, business_travel, job_satisfaction, performance_rating, marital_status"`

To calculate numerical summaries, including averages, standard deviations, and percentiles for numeric variables, we can use the `skim()` function.

The `skim()` function will automatically calculate all key descriptive statistics for the numeric variables in our data. In the example below, we use the `skim()` function on the `home_sales` data and select the numeric columns for printing.

In the output under the `Variable type: numeric` section, we obtain a complete summary of our numeric variables. Calculated statistics include:

- the mean and standard deviation
- a five number summary: the minimum (`p0`), the 25th (`p25`), 50th (`p50`), and 75th (`p75`) percentiles, and the maximum (`p100`)
- histogram plots

From the output below, we are able to say the following about the `selling_price` variable:

- the average selling price of homes is $516,613.50 with a standard deviation of $182,978.15
- the median selling price of homes is $479,950
- the range of selling prices is between $260,000 and $970,000
- the interquartile range is $288,500
- the distribution of selling prices is skewed to the right based on the histogram. This means the values tend to cluster more towards lower selling prices
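
The interquartile range quoted above can be verified directly from the `p25` and `p75` quartiles reported by `skim()`:

```r
# Interquartile range of selling_price, computed from the quartiles in the skim() output
p25 <- 360000
p75 <- 648500

iqr <- p75 - p25
iqr  # 288500
```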

```
skim(home_sales) %>%
  filter(skim_type == 'numeric')
```

Data summary | |
---|---|
Name | home_sales |
Number of rows | 6659 |
Number of columns | 10 |
Column type frequency: | |
numeric | 9 |
Group variables | None |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
selling_price | 0 | 1 | 516613.50 | 182978.15 | 260000 | 360000.0 | 479950.0 | 648500.0 | 970000.0 | ▇▆▅▃▂ |
house_age | 0 | 1 | 13.83 | 8.74 | 0 | 7.0 | 12.0 | 21.0 | 30.0 | ▆▇▅▃▅ |
bedrooms | 0 | 1 | 3.45 | 0.78 | 0 | 3.0 | 3.0 | 4.0 | 9.0 | ▁▇▇▁▁ |
bathrooms | 0 | 1 | 2.54 | 0.48 | 0 | 2.5 | 2.5 | 2.5 | 7.5 | ▁▇▁▁▁ |
sqft_living | 0 | 1 | 2302.69 | 770.17 | 550 | 1690.0 | 2230.0 | 2800.0 | 7120.0 | ▅▇▂▁▁ |
sqft_lot | 0 | 1 | 12814.40 | 33201.81 | 600 | 3800.0 | 6160.0 | 9254.0 | 1024068.0 | ▇▁▁▁▁ |
sqft_basement | 0 | 1 | 133.21 | 305.05 | 0 | 0.0 | 0.0 | 0.0 | 2200.0 | ▇▁▁▁▁ |
floors | 0 | 1 | 1.96 | 0.45 | 1 | 2.0 | 2.0 | 2.0 | 3.5 | ▁▇▁▁▁ |
scenic_views | 0 | 1 | 0.10 | 0.50 | 0 | 0.0 | 0.0 | 0.0 | 4.0 | ▇▁▁▁▁ |

Another useful function in the `funModeling` package is `plot_num()`, which takes a data frame and automatically plots histograms of the numeric variables.

`plot_num(home_sales)`

Studying the correlation structure of a numeric data set is important for uncovering linear associations between variables.

The correlation coefficient is a measure of the linear relationship between two numeric variables and ranges from -1 to 1. A value of -1 indicates a perfect inverse linear relationship, where the values of one variable **decrease** linearly as the values of another increase. A value of 1 indicates a perfect positive linear relationship, where the values of one variable **increase** linearly as the values of another increase.

A correlation of 0 indicates that there is no linear relationship between two numeric variables. However, this does not mean that the two variables are unassociated. Variables may have non-linear relationships despite having correlations of 0.
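
Both points can be demonstrated with the base `R` `cor()` function. In the sketch below, a quadratic relationship symmetric around 0 produces a correlation of exactly 0, even though the two variables are clearly related:

```r
x <- -5:5

# Perfect positive linear relationship: correlation of 1
cor(x, 2 * x + 1)

# Strong non-linear (quadratic) relationship, yet correlation of 0
cor(x, x^2)
```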

A correlation data frame can be produced by passing a data frame **of numeric variables** to the `correlate()` function from the `corrr` package. Correlation data frames contain the correlation coefficients for every pair of numeric variables in a data frame.

If you have non-numeric data, make sure to filter your data frame to remove these columns. We will demonstrate creating correlation data frames using the `home_sales` data.

The `select()` function from `dplyr` can conditionally select columns by their data type properties. This is done by passing one of the following function names into the `where()` function within `select()`:

- is.numeric
- is.factor
- is.character

The example below will select only the numeric columns from `home_sales`. This is what we need to pass into the `correlate()` function to calculate a correlation data frame.

`home_sales %>% select(where(is.numeric))`
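
Putting the pieces together, the selected numeric columns can be piped directly into `correlate()`. A sketch, assuming the packages and `home_sales` data loaded earlier in this lesson:

```r
# Correlation data frame for all numeric columns of home_sales
home_sales %>%
  select(where(is.numeric)) %>%
  correlate()
```

The result is a tibble with one row and one column per numeric variable, which can be filtered and reshaped with the usual `dplyr` verbs.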