This tutorial will focus on exploratory data analysis with R. We will introduce new functions that automatically summarize various combinations of data types. These functions can be viewed as helpers/extensions of dyplr and ggplot2 that automate some portions of the data analysis process.


Please click the button below to open an interactive version of all course R tutorials through RStudio Cloud.

Note: you will need to register for an account before opening the project. Please remember to use your GMU e-mail address.



Click the button below to launch an interactive RStudio environment using Binder.org. This will launch a pre-configured RStudio environment within your browser. Unlike RStudio cloud, this service has no monthly usage limits, but it may take up to 10 minutes to launch and you will not be able to save your work.


Binder



The R code below will import the employee attrition and Seattle home sales data sets. It also loads a number of new R packages.


  • janitor - useful for common data cleaning and exploration tasks including cross tabulation tables
  • funModeling - a set of functions for automatic plotting and descriptive statistics
  • corrr - functions for creating and exploring correlations
  • corrplot - tools for visualizing correlations between sets of numeric variables


# Make sure to install new packages
library(funModeling)
library(skimr)
library(janitor)
library(corrr)
library(corrplot)
library(tidyverse)
# Employee attrition data
employee_data <- read_rds(url('https://gmudatamining.com/data/employee_data.rds'))
# Seattle home sales data
home_sales <- read_rds(url('https://gmudatamining.com/data/home_sales.rds')) %>% 
              select(-selling_date)



Data

We will be working with the employee_data and home_sales data frames in this lesson.

Take a moment to explore these data sets below.



Employee Attrition Data


The data consists of 1,470 employee records for a U.S. based product company. The rows in this data frame represent an employee at this company that either left the company or not (left_company) and their associated characteristics.


Seattle Home Sales


A row in this data frame represents a home that was sold in the Seattle area between 2014 and 2015.

The response variable in this data is selling_price.




First Step in Data Analysis

Number of Observations, Missing Values, and Summary Statistics

The first step in a data analysis project is to explore your data source. This includes summarizing the values within each column, checking for missing data, checking the data types of each column, and verifying the number of rows and columns.

The skim() function was introduced in tutorial 2 and can be used to accomplish these tasks.

The skim() function takes a data frame or tibble as an argument and produces a detailed set of summary statistics including:


  • the number of rows and columns in the data frame
  • the number of missing values per column, counts of unique values, and other numeric summary statistics grouped by column type


We see from the results below that employee_data has 1,470 rows, 13 columns, 7 factor columns, 6 numeric columns, and no missing values (n_missing) across all columns of the data frame.


# View data frame properties and summary statistics
skim(employee_data)
Data summary
Name employee_data
Number of rows 1470
Number of columns 13
_______________________
Column type frequency:
factor 7
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
left_company 0 1 FALSE 2 No: 1233, Yes: 237
department 0 1 FALSE 6 IT : 399, Res: 293, Sal: 252, Mar: 238
job_level 0 1 FALSE 5 Sen: 476, Man: 344, Dir: 331, Ass: 185
business_travel 0 1 FALSE 3 Rar: 1043, Fre: 277, Non: 150
job_satisfaction 0 1 FALSE 4 Ver: 459, Hig: 442, Low: 289, Med: 280
performance_rating 0 1 FALSE 5 Mee: 546, Exc: 472, Exc: 286, Min: 136
marital_status 0 1 FALSE 3 Mar: 673, Sin: 470, Div: 327

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salary 0 1 94076.25 37590.24 29848.56 70379.48 88555.53 117099.9 212134.7 ▃▇▃▁▁
weekly_hours 0 1 50.02 4.82 40.00 47.00 49.00 52.0 66.0 ▂▇▃▂▁
yrs_at_company 0 1 7.01 6.13 0.00 3.00 5.00 9.0 40.0 ▇▂▁▁▁
yrs_since_promotion 0 1 2.19 3.22 0.00 0.00 1.00 3.0 15.0 ▇▁▁▁▁
previous_companies 0 1 3.24 1.58 1.00 2.00 3.00 4.0 7.0 ▇▇▂▂▃
miles_from_home 0 1 9.19 8.11 1.00 2.00 7.00 14.0 29.0 ▇▅▂▂▂



Exploring Categorical Variables

Categorical variables are either nominal or ordinal with respect to measurement scale. Therefore the most common summary statistics include frequency counts and proportions.



Skimming by Variable Type

The skim() function is a good first step for analyzing character or factor variables. Each time the skim() function is run, it produces a data frame as the output.

When we print this data frame to the console, it has the nice printing properties seen in the previous section. One of the columns in the output data frame is named skim_type and records the variable type for a particular column.

We can use this column in combination with the filter() function to subset the output based on variable type.

For example, the code below will display only the categorical variables in the employee_data data frame. Since categorical data can be stored as factors or character columns within a data frame, we must add both to our filter condition.

The n_unique and top_counts fields in the output below gives the number of unique values and most frequent (abbreviated) values within each column.


skim(employee_data) %>% 
  filter(skim_type %in% c('factor', 'character'))
Data summary
Name employee_data
Number of rows 1470
Number of columns 13
_______________________
Column type frequency:
factor 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
left_company 0 1 FALSE 2 No: 1233, Yes: 237
department 0 1 FALSE 6 IT : 399, Res: 293, Sal: 252, Mar: 238
job_level 0 1 FALSE 5 Sen: 476, Man: 344, Dir: 331, Ass: 185
business_travel 0 1 FALSE 3 Rar: 1043, Fre: 277, Non: 150
job_satisfaction 0 1 FALSE 4 Ver: 459, Hig: 442, Low: 289, Med: 280
performance_rating 0 1 FALSE 5 Mee: 546, Exc: 472, Exc: 286, Min: 136
marital_status 0 1 FALSE 3 Mar: 673, Sin: 470, Div: 327



Crosstabs

Crosstabs (also called contingency tables) are used to summarize the the frequency counts in categorical variables. The tabyl() function from the janitor package can be used to create crosstabs in R.



Single Variables

The tabyl() function from the janitor package is useful for creating tables of count values for distinct levels of factor or character variables. The tabyl() function takes a data frame as the first argument followed by one or two variables of interest.

The tabyl() function includes a number of adorn() functions which do things like add row and column totals and percentages. These are demonstrated in the R code below.

The results of either a tably() or adorn() function will be a data frame.

To explore the full set of features provided by this function, refer to the following website:

tabyl tutorial


We can summarize the frequency counts of a single variable by passing a data frame and the column of interest into tabyl(). By default, we obtain the counts, n, and proportions, percent, of the levels of a categorical column.


# Summarize a single categorical variable
tabyl(employee_data, department)


Since the output of tably() is a data frame, we can use dplyr functions such as filter() to subset the results.


tabyl(employee_data, department) %>% 
  filter(percent >= 0.17)



The function adorn_totals() can be used to add column totals. To accomplish this, we add 'row' as a parameter to adorn_totals(). This is a bit confusing, since to get column totals we need to input 'row'. This argument instructs adorn_totals() to place the sum of values in the rows below the table.


# Add column total in the last row
tabyl(employee_data, department) %>% adorn_totals('row')



Two Variables

The power of the tabyl() function rests in its ability to produces formatted two-way crosstabs that display counts and percentages for the interaction of two categorical variables.

To produce a two-way crosstab, just pass the two columns of interest into the tabyl() function.


tabyl(employee_data, left_company, business_travel)



Table With Row Percentages

In the example below, additional adorn() functions are used to add useful summary statistics to the simple table from above. We add both column and row totals with the adorn_totals() function, row percentages with the adorn_percentages() function, remove decimals from our row percentages with adorn_pct_formatting() function, combine counts with our percentage values in each cell with the adorn_ns() function, and add a combined title in the first column with adorn_title().

With row percentages, we are exploring the distribution of business_travel within each category of left_company. For example, we see that of the employees who left the company, 29% traveled frequently for business.


tabyl(employee_data, left_company, business_travel) %>%
  adorn_totals(c('row', 'col')) %>%
  adorn_percentages('row') %>% 
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>%
  adorn_title('combined')



Table With Column Percentages

To explore columns percentages, we simply replace row with col in our adorn_percentages() functions. In this case, we are studying the distribution of left_company values within each category of business_travel.


tabyl(employee_data, left_company, business_travel) %>%
  adorn_totals(c('row', 'col')) %>%
  adorn_percentages('col') %>% 
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>%
  adorn_title('combined')



Visualizing Categorical Variables

The most common visualization for categorical data are bar and column charts that plot the frequency of distinct values within a variable.



Automatic Plotting

The freq() function from the funModeling package can be used to automatically create bar and column charts for the factor or character variables within a data frame. The freq() function takes a data frame as the first argument and produces frequency plots and tables of all factor/character variables.

This function is meant to interactively explore data, not for saving the results. To save the results, we would have to use the tabyl() and ggplot() functions to create custom tables and plots.



freq(employee_data)

  left_company frequency percentage cumulative_perc
1           No      1233      83.88           83.88
2          Yes       237      16.12          100.00

              department frequency percentage cumulative_perc
1       IT and Analytics       399      27.14           27.14
2               Research       293      19.93           47.07
3                  Sales       252      17.14           64.21
4              Marketing       238      16.19           80.40
5    Product Development       178      12.11           92.51
6 Finance and Operations       110       7.48          100.00

       job_level frequency percentage cumulative_perc
1 Senior Manager       476      32.38           32.38
2        Manager       344      23.40           55.78
3       Director       331      22.52           78.30
4      Associate       185      12.59           90.89
5 Vice President       134       9.12          100.00

  business_travel frequency percentage cumulative_perc
1          Rarely      1043      70.95           70.95
2      Frequently       277      18.84           89.79
3            None       150      10.20          100.00

  job_satisfaction frequency percentage cumulative_perc
1        Very High       459      31.22           31.22
2             High       442      30.07           61.29
3              Low       289      19.66           80.95
4           Medium       280      19.05          100.00

    performance_rating frequency percentage cumulative_perc
1   Meets Expectations       546      37.14           37.14
2 Exceeds Expectations       472      32.11           69.25
3          Exceptional       286      19.46           88.71
4  Minimally Effective       136       9.25           97.96
5        Not Effective        30       2.04          100.00

  marital_status frequency percentage cumulative_perc
1        Married       673      45.78           45.78
2         Single       470      31.97           77.75
3       Divorced       327      22.24          100.00
[1] "Variables processed: left_company, department, job_level, business_travel, job_satisfaction, performance_rating, marital_status"





Exploring Numeric Variables

To calculate numerical summaries, including averages, standard deviations, and percentiles for numeric variables, we can use the skim() function.



Descriptive Statistics

The skim() function will automatically calculate all key descriptive statistics for the numeric variables in our data. In the example below, we use the skim() function on the home_sales data and select the numeric columns printing.

In the output under the Variable type: numeric section, we obtain a complete summary of our numeric variables. Calculated statistics include:


  • the mean and standard deviation
  • a five number summary (minimum (p0), 25th, 50th, and 75th percentiles, and the maximum (p100))
  • histogram plots


From the output below, we are able to say the following about the selling_price variable:


  • the average selling price of homes is $516,613.50 with a standard deviation of $182,978.15
  • the median selling price of homes is $479,950
  • the range of selling prices is between $260,000 and $970,000
  • the interquartile range is $288,500
  • the distribution of selling prices is skewwed to the right based on the histogram. This means the values tend to cluster more towards lower selling prices


skim(home_sales) %>% 
  filter(skim_type == 'numeric')
Data summary
Name home_sales
Number of rows 6659
Number of columns 10
_______________________
Column type frequency:
numeric 9
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
selling_price 0 1 516613.50 182978.15 260000 360000.0 479950.0 648500.0 970000.0 ▇▆▅▃▂
house_age 0 1 13.83 8.74 0 7.0 12.0 21.0 30.0 ▆▇▅▃▅
bedrooms 0 1 3.45 0.78 0 3.0 3.0 4.0 9.0 ▁▇▇▁▁
bathrooms 0 1 2.54 0.48 0 2.5 2.5 2.5 7.5 ▁▇▁▁▁
sqft_living 0 1 2302.69 770.17 550 1690.0 2230.0 2800.0 7120.0 ▅▇▂▁▁
sqft_lot 0 1 12814.40 33201.81 600 3800.0 6160.0 9254.0 1024068.0 ▇▁▁▁▁
sqft_basement 0 1 133.21 305.05 0 0.0 0.0 0.0 2200.0 ▇▁▁▁▁
floors 0 1 1.96 0.45 1 2.0 2.0 2.0 3.5 ▁▇▁▁▁
scenic_views 0 1 0.10 0.50 0 0.0 0.0 0.0 4.0 ▇▁▁▁▁




Automatic Plotting

Another useful function in the funModeling package is plot_num() which takes a data frame and automatically plots histograms of numeric variables



plot_num(home_sales)



Correlation Analysis

Studying the correlation structure of a numeric data set is important for uncovering linear associations between variables.

The correlation coefficient is a measure of the linear relationship between two numeric variables and ranges from -1 to 1. A value of -1 indicates an inverse linear relationship, where the values of one variable decrease linearly as the values of another increase.

A value of 1 indicates a positive linear relationships where the values of one variable increase linearly as the values of another increase.

A correlation of 0 indicates that there is no linear relationship between two numeric variables. However, this does not mean that two variables are not associated with each other. Variables many have non-linear relationships despite having correlations of 0.



Creating Correlation Data Frames

A correlation data frame can be produced by passing a data frame of numeric variables to the correlate() function from the corrr package. Correlation data frames contain the correlation coefficients for every pair of numeric variables in a data frame.

If you have non-numeric data, make sure to filter your data frame to remove these columns. We will demonstrate creating correlation data frames using the home_sales data.

The select() function from dplyr can conditionally select columns by their data type properties. This is done by passing one of the following function names into the where() function within select():


  • is.numeric
  • is.factor
  • is.character


The example below will select only numeric columns from home_sales. This is what we need to pass into the correlate() function to calculate a correlation data frame.


home_sales %>% select(where(is.numeric))