In this tutorial, we will learn about data types and common data structures used in R.

Data types represent different types of information that can be stored in R. The most common R data types are:

• numeric
• integer
• logical
• character

Data structures provide a way to organize and work with different types of data. The data structures we will learn about include:

• vectors
• matrices
• lists
• data frames

Before starting this tutorial, I recommend watching the video below. It was created by the RStudio team and serves as a quick tour of the many features of the RStudio IDE, which is what we will be using this semester through RStudio Cloud. Please see the RStudio Cheatsheet for more information about the features available in RStudio.

# Data Types

The most common R data types include numeric, integer, logical, and character. The table below provides examples of how each data type is represented in R.

| Data Type | Example Values |
|---|---|
| Numeric (Double) | 8.123, 10, 2.71812 |
| Integer | 1L, 19L, 2000L |
| Logical | TRUE, FALSE, T, F |
| Character | "A character string of text", "d", "8.23" |

Numeric data types represent real numbers, such as 2.345, $$\pi$$, and 4.001.

Integer data types represent whole counting numbers and are entered into R by adding an “L” after the number.

Logical data types represent the logical conditions TRUE and FALSE. They can be entered as the unquoted text, TRUE, or just T, for example.

Character data types represent text data and must be enclosed in quotes, either single (') or double (").
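To check how R has stored a value, the base functions class() and typeof() can be used. The short sketch below reuses example values from the table above; the variable names are only illustrative.

```r
# Illustrative values, one per data type (the names are hypothetical)
num_value <- 8.123
int_value <- 19L
lgl_value <- TRUE
chr_value <- "8.23"

# class() reports the data type of each value
class(num_value)   # "numeric"
class(int_value)   # "integer"
class(lgl_value)   # "logical"
class(chr_value)   # "character"

# typeof() shows that numeric values are stored as doubles
typeof(num_value)  # "double"

# A whole number without the L suffix is still numeric, not integer
class(10)          # "numeric"
```

Note that "8.23" in quotes is text, so arithmetic on it will fail until it is converted to a number.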

# Data Structures

## Summary of R Data Structures

The most common data structures in R can be categorized by their dimension (one or two) and by the restrictions they place on the data types of their contents.

They can be homogeneous, where all elements are of the same data type, or heterogeneous, where the contents may have multiple data types.

In this course, we will be using vectors, matrices, data frames, and lists. The key features of these data structures are summarized in the table below.

| Data Structure | Data Type Restriction | Dimension |
|---|---|---|
| Vector | Homogeneous | 1 |
| List | Heterogeneous | 1 |
| Matrix | Homogeneous | 2 |
| Data frame | Heterogeneous | 2 |

## Vectors

### Creating Vectors

A vector is a one-dimensional sequence of data elements of the same type.

Vectors are constructed with the c() function. To assign a vector to a variable, use the <- operator (a keyboard shortcut of this symbol is “Alt” + “-”).

The code below will create a numeric vector with 4 elements and print the result to the R console.

# A numeric vector
c(4, 23, 4.1, 3.5)
[1]  4.0 23.0  4.1  3.5

To assign the results to a variable in our R environment, we use the <- operator.

# Assign results to numeric_vec
numeric_vec <- c(4, 23, 4.1, 3.5)

When working with any data structure in R, it is important to be able to explore its contents and obtain information about the type of data stored in the structure.

To get information about any data structure, we can use the str() function. This will display the data type of a vector and print its contents.

# Check the type and contents of numeric_vec
# We see that it is numeric (num)
str(numeric_vec)
 num [1:4] 4 23 4.1 3.5

Another important attribute of a vector is the number of data elements it contains. This is obtained by passing the vector into the length() function.

length(numeric_vec)
[1] 4

The c() function can be used to create vectors from single data elements separated by commas, pre-defined vectors, vectors created with the c() function, or a mixture of all of these. The examples below demonstrate various ways of creating new vectors.

# Combine a pre-defined vector with additional data
numeric_vec_2 <- c(numeric_vec, 4.7, 5.1)

# View result
numeric_vec_2
[1]  4.0 23.0  4.1  3.5  4.7  5.1

# Adding another c() function within an outer c() function
numeric_vec_3 <- c(1.2,
                   numeric_vec,
                   c(1.1, 2.4, 4.1))

numeric_vec_3
[1]  1.2  4.0 23.0  4.1  3.5  1.1  2.4  4.1

### Special Functions for Creating Numeric Vectors

There are two useful functions, seq() and :, for creating numeric or integer vectors.

The seq() function has three important arguments:

• The first is from (where should the values begin)
• The second is to (where should the values end)
• The third is by (by how much should the values increment)

These arguments can be provided by name, as shown below

seq_vec <- seq(from = 1, to = 6, by = 1)

str(seq_vec)
 num [1:6] 1 2 3 4 5 6

Or by position

seq(1, 6, 1)
[1] 1 2 3 4 5 6

The seq() function will create integer vectors if we pass numbers followed by “L” into the function.

seq_int_vec <- seq(1L, 10L, 2L)

str(seq_int_vec)
 int [1:5] 1 3 5 7 9

The : function can be used to quickly generate a vector that increments by one. The vector is created using the following rule: start value:end value. When both values are whole numbers, the result is stored as an integer vector, with or without the "L" suffix.

# Integer vector
1:5
[1] 1 2 3 4 5

# Also an integer vector
1L:10L
 [1]  1  2  3  4  5  6  7  8  9 10

# Also works in reverse
5:1
[1] 5 4 3 2 1

To learn more about any function in R, type ? followed by the function name in the console, for example ?seq. Operators such as : must be quoted, as in ?":". The help page will pop up in the lower right portion of RStudio.
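The seq() function is not limited to whole-number steps. As a small sketch, the calls below show a fractional increment, a negative increment, and the length.out argument, which asks for a fixed number of evenly spaced values instead of a step size. The variable names are illustrative.

```r
# Increment by a fraction
quarter_steps <- seq(from = 0, to = 1, by = 0.25)
quarter_steps
# [1] 0.00 0.25 0.50 0.75 1.00

# Ask for four evenly spaced values between 1 and 10
four_values <- seq(from = 1, to = 10, length.out = 4)
four_values
# [1]  1  4  7 10

# Count down with a negative increment
count_down <- seq(from = 10, to = 2, by = -4)
count_down
# [1] 10  6  2
```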

### Coercion Rules in R

All elements of a vector must be of the same type. When different data types are combined into a single vector, every element is coerced to the type with the highest precedence among them. The precedence order, from highest to lowest, is:

1. character
2. numeric
3. integer
4. logical

This means that if you mix character elements with numeric and integer elements, then all elements get converted to character (since it has higher precedence).

These rules are important to understand, since many errors that show up in your code will be due to a mismatch of data types.

The vector below will get converted to a character vector.

vector_1 <- c(2.45, 5.1, 1L, 'character')

str(vector_1)
 chr [1:4] "2.45" "5.1" "1" "character"

The vector below will get converted to a numeric vector. Notice that R converts logical values in the following way: TRUE becomes 1 and FALSE becomes 0.

vector_2 <- c(4.234, 10L, TRUE, T, FALSE)

str(vector_2)
 num [1:5] 4.23 10 1 1 0

This vector will be converted to an integer vector.

vector_3 <- c(10L, 5L, TRUE, FALSE)

str(vector_3)
 int [1:4] 10 5 1 0
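The examples above show coercion that R performs automatically. Conversions can also be requested explicitly with the base functions as.numeric(), as.integer(), and as.character(). The sketch below uses illustrative values; note that text which does not look like a number becomes NA.

```r
# Character to numeric: the text "8.23" becomes the number 8.23
as.numeric("8.23")
# [1] 8.23

# Logical to integer follows the same rule as automatic coercion
as.integer(c(TRUE, FALSE, TRUE))
# [1] 1 0 1

# Numeric to character
as.character(c(2.45, 5.1))
# [1] "2.45" "5.1"

# Text that is not a number cannot be converted; R returns NA with a warning
# (suppressWarnings() is used here only to keep the output clean)
mixed_text <- c("10", "twenty", "30")
converted <- suppressWarnings(as.numeric(mixed_text))
converted
# [1] 10 NA 30
```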

Create a numeric vector with a range from 1 to 22 that increments by 3. You should get the result below when you execute your code in R

[1]  1  4  7 10 13 16 19 22

Using the class_vct defined below, create a new vector called updated_class_vct that contains the information printed below

class_vct <- c("MIS 431", "MIS 432")
[1] "MIS 310" "MIS 431" "MIS 432" "MBA 738"

## Factors

Factors are a special data structure for working with categorical data. Categorical data represents data that only differs by label (such as ‘yes’/‘no’) or rank (such as ‘1st’, ‘2nd’, etc.).

In R, factors are a special type of labeled integer vector. Factors are created with the factor() function, which takes as arguments a vector, the levels of the factor, and the labels of the factor.

Think of factors as a way to label your data. Factors should only be used when you have a pre-determined number of categories.

# Creating a factor vector
weekday_factor <- factor(c('M', 'T', 'W', 'Th', 'F', 'M', 'W'),
                         levels = c('M', 'T', 'W', 'Th', 'F'),
                         labels = c('Monday', 'Tuesday', 'Wednesday',
                                    'Thursday', 'Friday'))
# View results
weekday_factor
[1] Monday    Tuesday   Wednesday Thursday  Friday    Monday    Wednesday
Levels: Monday Tuesday Wednesday Thursday Friday

The str() function will tell us that the vector is a factor, display some of the levels, and show the underlying mapping of levels to integer values that happened behind the scenes.

The levels of a factor are always mapped to a sequence of numbers starting at 1 and increasing by 1 for every level. This mapping is based on the order in which the levels are entered in factor().

str(weekday_factor)
 Factor w/ 5 levels "Monday","Tuesday",..: 1 2 3 4 5 1 3

The summary() function will automatically count the occurrence of factor labels.

summary(weekday_factor)
   Monday   Tuesday Wednesday  Thursday    Friday 
        2         1         2         1         1 

Factors can also be created with numeric vectors as input. Let’s say that we have a vector of 1s and 0s where 1 represents the occurrence of an event and 0 otherwise. The code below shows how to create a labeled factor from the data.

event_indicator <- c(1, 0, 0, 1, 0, 0)

event_fct <- factor(event_indicator,
                    levels = c(0, 1),
                    labels = c('No', 'Yes'))

summary(event_fct)
 No Yes 
  4   2 

Note that the order in which the levels are entered affects how they are stored in the factor.

event_fct_2 <- factor(event_indicator,
                      levels = c(1, 0),
                      labels = c('Yes', 'No'))

summary(event_fct_2)
Yes  No 
  2   4 

To access the levels of any factor and see their order, use the levels() function.

levels(event_fct)
[1] "No"  "Yes"
levels(event_fct_2)
[1] "Yes" "No" 

By default, if we do not provide input to the levels and labels arguments in factor(), levels are automatically assigned in alphabetic order (for character vectors) or numeric order. The labels are then set to the levels values.

fct_from_chr <- factor(c('Yes', 'No', 'No', 'Yes'))

str(fct_from_chr)
 Factor w/ 2 levels "No","Yes": 2 1 1 2
fct_from_num <- factor(c(1, 1, 1, 4, 5))

str(fct_from_num)
 Factor w/ 3 levels "1","4","5": 1 1 1 2 3
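Because a factor stores integer codes behind the scenes, converting a factor of numbers back to numeric values needs some care. The sketch below rebuilds the same factor as fct_from_num under an illustrative name and compares two conversions.

```r
# A factor built from numbers (same values as fct_from_num above)
num_fct <- factor(c(1, 1, 1, 4, 5))

# as.integer() returns the underlying level codes, not the original values
as.integer(num_fct)
# [1] 1 1 1 2 3

# Converting to character first recovers the original numbers
as.numeric(as.character(num_fct))
# [1] 1 1 1 4 5
```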

The survey vector below represents survey responses where people indicated their level of comfort with data analysis.

survey <- c(1, 3, 3, 2, 2, 1, 1, 1, 1)

The numeric values have the following meaning:

• 1 represents ‘not comfortable’
• 2 represents ‘moderately comfortable’
• 3 represents ‘very comfortable’

Use the factor() function to label this vector. You should get the results below if you pass your factor into the summary() function.

       not comfortable moderately comfortable       very comfortable 
                     5                      2                      2 

## Matrices

A matrix is an R data structure that stores a collection of data arranged in a 2 dimensional table with rows and columns. Like vectors, all data elements of a matrix must be of the same type.

If you build a matrix with vectors of different data types, the matrix will be coerced with the same precedence rules as above. A matrix can’t store a numeric column as well as a character column. This would get coerced into a character matrix.

Matrices are created with the matrix() function. Just as c() combines values into a vector, cbind() and rbind() combine vectors into a matrix by column or by row.

Matrices can also store row and column names. To check these, use the rownames() and colnames() functions.

The code below builds a matrix with the matrix() function.

# Build a numeric 2 X 2 matrix
A <- matrix(data = c(1, 2, 3, 4), # data is entered as a vector
            nrow = 2,             # number of rows
            ncol = 2,             # number of columns
            byrow = TRUE)         # read data in by row (default is FALSE)

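The str() function will let us know that we have a numeric matrix with 2 rows and 2 columns. The values are listed column by column (1 3 2 4), which is the order in which R stores a matrix internally.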

# View properties with str()
str(A)
 num [1:2, 1:2] 1 3 2 4

We can check the dimensions of a matrix using the dim() function.

# Check the dimensions of A (rows,columns)
dim(A)
[1] 2 2
# View A
A
     [,1] [,2]
[1,]    1    2
[2,]    3    4
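To make the effect of the byrow argument easier to see, the sketch below fills two matrices with the same six values, once by column (the default) and once by row. The object names are illustrative.

```r
# Same six values, filled two different ways
by_column <- matrix(1:6, nrow = 2)                # default: fill column by column
by_row    <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill row by row

by_column
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# A matrix is stored column by column, which is the order as.vector() returns
as.vector(by_row)
# [1] 1 4 2 5 3 6
```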

Matrices can also be created with pre-defined vectors.

# We can build a matrix with a predefined vector
vector_1 <- c(1, 2, 3, 4)

B <- matrix(vector_1,
nrow = 2,
ncol = 2,
byrow = FALSE)
B
     [,1] [,2]
[1,]    1    3
[2,]    2    4

We can combine multiple vectors to create matrices using the cbind() and rbind() functions.

cbind() will stack vectors horizontally (by column) and rbind() will stack vectors vertically (by row).

vector_2 <- c(5, 6, 7, 8)

C <- cbind(vector_1, vector_2)

C
     vector_1 vector_2
[1,]        1        5
[2,]        2        6
[3,]        3        7
[4,]        4        8

Since we created our matrix using named vectors, our matrix C has column names. To view the column names of any matrix, use the colnames() function.

# Obtain column names
colnames(C)
[1] "vector_1" "vector_2"

rbind() creates matrices by row.

D <- rbind(vector_1, vector_2)

D
         [,1] [,2] [,3] [,4]
vector_1    1    2    3    4
vector_2    5    6    7    8

We can access the row names by using the rownames() function on our matrix D.

rownames(D)
[1] "vector_1" "vector_2"

We can also create or overwrite row and column names for any matrix. In the code below, we assign column names to our matrix D and overwrite the original row names.

colnames(D) <- c('column_1', 'column_2', 'column_3', 'column_4')
rownames(D) <- c('row_1', 'row_2')

D
      column_1 column_2 column_3 column_4
row_1        1        2        3        4
row_2        5        6        7        8

Use the matrix() function to create the matrix below. Note that this matrix has custom row and column labels.

              variable_1 variable_2
observation_1          1          5
observation_2          3          9

## Lists

Like matrices, lists are R objects that can hold multiple vectors. But unlike matrices, they are one dimensional and can store mixed data types.

The advantage of lists is that they can hold varying data types with different lengths and dimensions. Think of lists as special vectors that can store different data structures in each position. Lists can be recursive, meaning that a list can contain a list.

Lists are very important because most output from statistical models in R, such as linear regression or clustering, is returned as a list.

Lists are created with the list() function. To obtain the named contents of a list, if any exist, use the names() function.

We will discuss how to obtain the various objects within a list in the subsetting section of this tutorial.

my_list <- list(char_vector = c('A', 'B'),
                numeric_vector = c(1.2, 3.4, 5, 12.01),
                a_matrix = cbind(c(1, 2), c(3, 4)),
                a_list = list(c(1L, 4L), c('A', 'D', 'E')))
# View contents
my_list
$char_vector
[1] "A" "B"

$numeric_vector
[1]  1.20  3.40  5.00 12.01

$a_matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$a_list
$a_list[[1]]
[1] 1 4

$a_list[[2]]
[1] "A" "D" "E"

We can use the names() function to see whether a list has named elements. Remember that a list is one-dimensional. We can see from the output below that my_list contains 4 elements, where the vector named char_vector is in the first position of the list.

# View the names (if they exist) of my_list
names(my_list)
[1] "char_vector"    "numeric_vector" "a_matrix"       "a_list"        
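As a small illustration of model output being returned as a list, the sketch below fits a linear regression with lm() and inspects the result. It uses the built-in mtcars data set, which is an assumption made for this example and not part of this tutorial's data.

```r
# Fit a simple linear regression on the built-in mtcars data set
fit <- lm(mpg ~ wt, data = mtcars)

# The fitted model object is a list
is.list(fit)
# [1] TRUE

# names() works on it just like on any other list
"coefficients" %in% names(fit)
# [1] TRUE

# View all of the named elements stored in the model object
names(fit)
```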

## Data Frames

The most common R data structure is the data frame. A data frame is a specialized 2-dimensional list that must contain equal-length vectors, possibly of varying type. Data frames are created with the data.frame() function.

A good comparison for a data frame would be a SQL table.

Data frames are rectangular tables of data, where columns represent variables and rows represent observations on these variables. Unlike general one-dimensional lists, data frames must have vectors of the same length. However, vectors may have different types, such as numeric, character, or factor.

To obtain the names of the variables in a data frame, use names() or colnames(). To get the number of rows in a data frame, use nrow() or dim().

# Let's create a simple data frame with the data.frame function
my_data <- data.frame(student_id = c(100234, 132454, 453123),
                      session = c("7 AM", "7 PM", "7 AM"))
# View the data
my_data
  student_id session
1     100234    7 AM
2     132454    7 PM
3     453123    7 AM

We can obtain the column names of our data frame by using either names() or colnames().

# Get the variable names
names(my_data)
[1] "student_id" "session"   
colnames(my_data)
[1] "student_id" "session"   

To get the number of rows or columns in a data frame, we can use nrow(), ncol(), or dim().

ncol(my_data)
[1] 2
nrow(my_data)
[1] 3
# Number of rows and columns
dim(my_data)
[1] 3 2
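Since a data frame is a special kind of list, the list tools from the previous section also work on it. The sketch below uses a small made-up data frame (the names and values are illustrative only) to show this.

```r
# A small illustrative data frame (names and values are made up)
grades <- data.frame(student = c("A", "B", "C"),
                     score = c(90, 85, 77))

# A data frame is a list whose elements are the columns
is.list(grades)
# [1] TRUE

# length() therefore counts columns, not rows
length(grades)
# [1] 2

# str() summarizes the dimensions and the type of every column
str(grades)
```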

# Operators in R

In this section we will cover the most common operators in R that are used to manipulate data structures.

## Assignment Operator

The <- operator is used to create variables in R. This operator will assign what is to the right of it to the variable name on the left side. We’ve been using this throughout the tutorial.

A keyboard shortcut for <- is ‘Alt’ + ‘-’ in Windows.

my_vector <-  c(10, 20)
my_vector
[1] 10 20

## Arithmetic Operators

Standard mathematical operations, such as addition and multiplication, are available in R. In the examples below, these operations are demonstrated with the appropriate operators.

R computes operations in a vectorized manner, meaning that if you add two vectors, for example, the addition is performed element-wise within the corresponding vectors.

If you multiply a vector by a single number, all elements of the vector are multiplied by that number. In R this is called recycling (other languages often call it broadcasting).

# + operator adds two vectors, element-wise
v_1 <- c(2, 4, 7)
v_2 <- c(2, 5, 8)
v_1 + v_2
[1]  4  9 15
# - operator subtracts two vectors
v_1 - v_2
[1]  0 -1 -1
# * operator multiplies two vectors
v_1 * v_2
[1]  4 20 56
# Multiplication by a constant
5*v_1
[1] 10 20 35
# Exponentiation
v_1 ^ 2
[1]  4 16 49
# / operator divides two vectors
v_1/v_2
[1] 1.000 0.800 0.875
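The same rule that lets a single number combine with a whole vector also applies to vectors of unequal length: R repeats (recycles) the shorter vector until it matches the longer one. The sketch below uses illustrative vectors to show the behavior.

```r
# Vectors of unequal length: the shorter one is repeated to match the longer one
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

long_vec + short_vec
# [1] 11 22 13 24

# Equivalent to writing the repetition out by hand
long_vec + c(10, 20, 10, 20)
# [1] 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still returns a result but issues a warning, which usually signals a mistake.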

## Relational Operators

Relational operators, such as <= or >, are used to compare data values to each other. A relational operator always returns a logical vector.

# Check where elements of v_1 are greater than v_2
# Produces a logical vector
v_1 > v_2
[1] FALSE FALSE FALSE
# >= greater than or equal to
v_1 >= v_2
[1]  TRUE FALSE FALSE
# <
v_1 < v_2
[1] FALSE  TRUE  TRUE
# <=
v_1 <= v_2
[1] TRUE TRUE TRUE
# == operator checks for equality
v_1 == v_2
[1]  TRUE FALSE FALSE
# != operator finds where the two vectors are not equal
v_1 != v_2
[1] FALSE  TRUE  TRUE
# Check where v_1 is greater than 2
v_1 > 2
[1] FALSE  TRUE  TRUE
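Logical vectors produced by relational operators can be summarized directly, because TRUE counts as 1 and FALSE as 0 in arithmetic. A brief sketch, using an illustrative vector with the same values as v_1:

```r
scores <- c(2, 4, 7)   # same values as v_1 above

# How many elements are greater than 2?
sum(scores > 2)
# [1] 2

# What proportion of elements are greater than 2?
mean(scores > 2)
# [1] 0.6666667

# At which positions is the condition TRUE?
which(scores > 2)
# [1] 2 3
```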

## Logical Operators

Logical operators, such as AND (&), OR (|), and NOT (!), are used to perform calculations with logical data types in R. These are important for filtering rows of data frames where variables meet certain conditions.

a <- 5
b <- 10

The & operator represents the logical AND operation. It will return TRUE if both conditions on the left and right of it are TRUE.

a == 5 & b == 10
[1] TRUE

For vectors, all pairwise elements are compared.

v_1 >= 3 & v_2 >= 2
[1] FALSE  TRUE  TRUE

The | operator represents the logical OR operation. It will return TRUE if at least one of the conditions on the left and right of it is TRUE.

a > 6 | b > 7
[1] TRUE

The | operator also compares all pairwise elements.

v_1 > 5 | v_2 > 5
[1] FALSE FALSE  TRUE

Finally, ! is used for negation. This means it will convert all TRUE values to FALSE and FALSE values to TRUE.

L1 <- c(TRUE, FALSE, TRUE)
L1
[1]  TRUE FALSE  TRUE
# Give L1 opposite logical values
!L1
[1] FALSE  TRUE FALSE

Be careful when using the ! operator. It is always best to enclose any logical test within parentheses to make sure you are getting the negation.

In R, ! has lower precedence than the relational operators, so !v_1 >= v_2 is parsed the same way as !(v_1 >= v_2). However, ! has higher precedence than & and |, so an expression of the form !x & y negates only x. Writing the parentheses explicitly, as in the code below, makes it clear which test is being negated.

# Find where v_1 is not greater than or equal to v_2
!(v_1 >= v_2)
[1] FALSE  TRUE  TRUE
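Two base functions, any() and all(), collapse a logical vector into a single TRUE or FALSE. The sketch below uses illustrative logical vectors x and y, and also shows how ! interacts with & when parentheses are left out.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)

# any(): is at least one element TRUE?
any(x & y)
# [1] TRUE

# all(): is every element TRUE?
all(x | y)
# [1] FALSE

# ! binds more tightly than &, so !x & y means (!x) & y
!x & y
# [1] FALSE FALSE FALSE

# Parentheses negate the result of the whole AND operation instead
!(x & y)
# [1]  TRUE  TRUE FALSE
```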

# Subsetting Data Structures

When you want to extract elements of a vector that meet a logical condition, or vectors stored in lists or data frames, you will have to subset the R objects with the [ ], [[ ]] or $ functions. This is best demonstrated with some examples.

## Subsetting Vectors

### Logical Subsetting

It’s possible to subset vectors in R by using logical or numeric vectors. I will demonstrate both methods below.

number_vector <- c(1, 3, 6, 10)

The code below produces a logical vector that indicates where number_vector is greater than 5.

number_vector > 5
[1] FALSE FALSE  TRUE  TRUE

We can use the logical vector result from above to get only the elements from number_vector that are greater than 5. We just place the logical condition within the [ ] function to the right of the original vector.

number_vector[number_vector > 5]
[1]  6 10

We can also combine conditions with logical operators.

number_vector > 3 | number_vector <= 10
[1] TRUE TRUE TRUE TRUE
number_vector[number_vector > 3 | number_vector <= 10]
[1]  1  3  6 10

### Numeric Subsetting

We can also subset vectors with a numeric indexing vector. The code below returns the 2nd and 4th elements from number_vector.

number_vector[c(2, 4)]
[1]  3 10

Unlike in many other programming languages, indexing of elements within R data structures starts at 1, not 0.

number_vector[1]
[1] 1

## Subsetting Lists

Remember that lists are collections of various data structures. To access the data structures stored within lists, we can use the [ ], [[ ]] or $ functions.

If you have a list, call it my_list, then my_list[1] returns a list with the first element of my_list. This is different from my_list[[1]], which returns the contents of the first element of my_list.

If you imagine that my_list is a large box that contains 10 small boxes, then my_list[1] returns the first box (which is still a box), while my_list[[1]] returns the contents of the first box (which may no longer be a box).

student_list <- list(student_id = c(12, 15),
                     section = c('001', '003'),
                     age = c(26, 20))

The code below will return a list that contains the first element of student_list.

student_list[1]
$student_id
[1] 12 15

# Check properties with str()
str(student_list[1])
List of 1
 $ student_id: num [1:2] 12 15

If we want to extract the first element from the list, we need to use [[ ]]. Notice that str() now tells us that we got a numeric vector.

student_list[[1]]
[1] 12 15
# Check properties with str()
str(student_list[[1]])
 num [1:2] 12 15

You can subset a list with the names of the elements stored in it. The code below is equivalent to the code above.

student_list[["student_id"]]
[1] 12 15

What if we want to extract the first number from the student_id vector that is in the first position of student_list? We will have to use [[ ]] followed by [ ].

student_list[["student_id"]][1]
[1] 12
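The same chaining works when a list contains another list. The sketch below uses a made-up nested list (all names and values are illustrative) and steps into it one level at a time.

```r
# A list that contains another list
course <- list(title = "R basics",
               roster = list(ids = c(12, 15),
                             sections = c("001", "003")))

# Step into the inner list, then into one of its vectors
course[["roster"]][["sections"]]
# [1] "001" "003"

# Add [ ] at the end to pick a single element of that vector
course[["roster"]][["sections"]][2]
# [1] "003"
```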

The $ operator is a shortcut for extracting elements from a list and works like [[ ]].

# This will extract the student_id vector
student_list$student_id
[1] 12 15

This is an alternate way to get the first element of the student_id vector within student_list.

student_list$student_id[1]
[1] 12

## Subsetting Data Frames

To subset rows and columns of a data frame, we can use the following syntax: my_data_frame[row condition, column condition]

The row/column conditions may be numeric indexes, logical expressions, or vectors of column names (for column selection).

my_data_frame <- data.frame(make = c("Toyota","Honda","Ford", "Toyota", "Ford", "Honda"),
                            mpg = c(34, 33, 22, 32, 29, 27),
                            cylinders = c(4, 4, 8, 6, 6, 8))
# View data
my_data_frame

### Numeric Indexing

In the example below, we select rows 1 - 3 and columns 1 - 2. Remember that the : function creates a sequence of whole numbers. 1:3 will create the vector 1, 2, 3.

# First three rows, first two columns
my_data_frame[1:3,1:2]
my_data_frame[c(1, 2, 3), c(1, 2)]
# row 2, columns 1 and 3
my_data_frame[2, c(1, 3)]

Leaving a row or column condition blank will return all values along that axis.

# Rows 1 and 2, all columns
my_data_frame[1:2, ]

### Logical Indexing

We can also pass logical vectors into the row condition to obtain a subset of our data. Let’s demonstrate this by selecting rows that have mpg values greater than or equal to 30.

# Create logical vector
logical_condition <- my_data_frame$mpg >= 30
# View results
logical_condition
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

Now we pass this logical vector into the row condition.

# Use to subset data
my_data_frame[logical_condition, ]

In practice, you would write this in one step.

my_data_frame[my_data_frame$mpg >= 30, ]

### Numeric and Logical Subsetting

Both methods of indexing can be combined in a single expression. Below are some examples.

# All rows with at least 32 mpg, columns 2 and 3
my_data_frame[my_data_frame$mpg >= 32, c(2, 3)]
# The same conditions, but using column names
my_data_frame[my_data_frame$mpg >= 32, c("mpg", "cylinders")]

### Extracting Columns

Just like with lists, a single [ ] will return a data frame and a double [[ ]] will extract a column vector from a data frame.

# This returns a data frame
my_data_frame[1]
str(my_data_frame[1])
'data.frame': 6 obs. of 1 variable:
 $ make: chr  "Toyota" "Honda" "Ford" "Toyota" ...
# This extracts the first column
my_data_frame[[1]]
[1] "Toyota" "Honda"  "Ford"   "Toyota" "Ford"   "Honda" 
str(my_data_frame[[1]])
 chr [1:6] "Toyota" "Honda" "Ford" "Toyota" "Ford" "Honda"

The above can also be accomplished by using the name of the first column.

my_data_frame[['make']]
[1] "Toyota" "Honda"  "Ford"   "Toyota" "Ford"   "Honda" 

Just like with lists, we can use $ instead of [[ ]]. In this case, you must use the name of the column.

my_data_frame$make
[1] "Toyota" "Honda"  "Ford"   "Toyota" "Ford"   "Honda" 
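Row conditions can also be built from more than one column by combining logical tests with &, and an extracted column can itself be subset before it is summarized. The sketch below rebuilds the same data under the illustrative name cars_df so that it runs on its own.

```r
# Illustrative copy of the data used in my_data_frame
cars_df <- data.frame(make = c("Toyota", "Honda", "Ford", "Toyota", "Ford", "Honda"),
                      mpg = c(34, 33, 22, 32, 29, 27),
                      cylinders = c(4, 4, 8, 6, 6, 8))

# Rows that satisfy two conditions at once, combined with &
efficient_small <- cars_df[cars_df$mpg >= 30 & cars_df$cylinders == 4, ]
nrow(efficient_small)
# [1] 2

# Subset one column by a condition on another column, then summarize it
mean(cars_df$mpg[cars_df$make == "Toyota"])
# [1] 33
```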

# Exercises

## Exercise 1

In this exercise, we will be working with the following list

my_list <- list(classes_offered = c("MIS 431", "MIS 310", "MIS 410", "MIS 412"),
                student_data = data.frame(student_id = c(54, 100, 32, 423,
                                                         2, 19, 39),
                                          age = c(18, 22, 27, 18, 29,
                                                  22, 20),
                                          gpa = c(3.1, 2.8, 3.7, 3.4, 3.2,
                                                  3.4, 3.2),
                                          stringsAsFactors = FALSE))

Write the R code that calculates the median value (use the median() function) of the gpa variable in student_data. All you need to do is pass the gpa vector into the median() function.

To read about the median() function, just execute the following in your R session: ?median

## Exercise 2

Subset my_data_frame to only include rows that have a cylinders value of 4.