Education Committee


Introduction to R

Part 2: Preparing for Data Analysis

By Arun Ramamurthy


Introduction

In this second of SAAS’s crash courses on R programming, you will learn how to actually use R to conduct data analysis, or more precisely, how to clean your data to prepare it for analysis. You will continue your foray into R packages with this primer on tidyverse, a collection of packages meant to make data science in R rapid and painless. In particular, you will learn how to use dplyr and tidyr for data cleaning, readr and readxl for data import, and finally you will get an introduction to R’s functional programming tools, such as the map and reduce functions from the purrr package.


About this Document

Prerequisites

The immediate prerequisite for this tutorial is r1. You will also need to install both R and RStudio to use the associated r2-workbook. Visit r0 for general information on the philosophy and functionality of R and RStudio, as well as installation guides for both.


r2

This document contains textbook-style information on R programming. It will cover the essentials of several tidyverse packages such as dplyr, tidyr, readr, readxl, and purrr.

Throughout this tutorial, you will be working with three distinct datasets, to give you familiarity and practice with the tidyverse. These datasets are as follows:

  1. iris, Edgar Anderson’s classic multivariate dataset on 150 flowers in a Canadian field. This dataset will be used to illustrate dplyr, tidyr, and purrr functions.
  2. articles, an unclean dataset of Wellcome Trust’s open access spend from 2012 to 2013. This dataset will be used to give readers practice with dplyr and even r1’s stringr section.
  3. plant-ants, an untidy matrix of plant-ant interactions recorded in 1996. This dataset will be used to give readers practice with tidyr and readxl.

r2-workbook

The r2-workbook contains associated exercises to work through as you learn about the concepts within this document. They aim to give you practice and familiarity with R programming concepts and functions. At the end of each section of this document, solve the problems in the matching section of the workbook to reinforce your understanding of the material.


The tidyverse Ecosystem

Before we begin learning about the various packages contained within the tidyverse collection, it’s important to conceptually understand what tidyverse aspires to do.


Introduction

According to the official website,

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Let’s unpack this a little. The tidyverse is a collection of R packages. These R packages all share the same underlying data structures, grammar, and philosophy, and so are designed to work well together. Finally, the tidyverse is designed for data science. It aspires to make data science in R rapid, intuitive, and cohesive.

As you will see throughout the rest of this tutorial, the packages in the tidyverse each cover separate areas of data science, but work so well together that you will find yourself tying them into a single seamless workflow. This natural tendency to pipe readr commands into dplyr commands into broom commands into ggplot2 commands (the list could go on and on!) is a direct result of the underlying philosophy and grammar pioneered by the tidyverse.

The bottom-line: The tidyverse is designed to make data science as human as possible.
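As a taste of that seamlessness, here is a minimal sketch of a single piped chain using the built-in iris dataset. Each verb below (filter, mutate, group_by, summarise) is covered in detail later in this tutorial; the Sepal.Ratio column is invented just for illustration:

```r
library(tidyverse)

# One seamless chain: subset rows, derive a column, then summarise per group
iris %>%
  filter(Sepal.Length > 5) %>%                        # keep longer-sepaled flowers
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) %>% # derive a new variable
  group_by(Species) %>%                               # condition on species
  summarise(`Mean Sepal Ratio` = mean(Sepal.Ratio))   # one row per species
```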


Installation

Installing every package in the canonical tidyverse takes only a single command:

install.packages("tidyverse")

From there, you can load the tidyverse packages into your work environment with:

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ readr   1.1.1
	## ✔ tibble  1.3.4     ✔ purrr   0.2.4
	## ✔ tidyr   0.7.2     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
	## ✖ tidyr::extract()   masks magrittr::extract()
	## ✖ dplyr::filter()    masks stats::filter()
	## ✖ dplyr::lag()       masks stats::lag()
	## ✖ purrr::set_names() masks magrittr::set_names()

Notice the output message, which tells you which specific packages of the tidyverse were loaded.

While we’re at it, let’s also make sure magrittr is loaded into our environment too:

library(magrittr)

Data Wrangling

The first general topic we will cover in this tutorial is data wrangling. Data wrangling, or manipulating dataframes, has two components: data cleaning and tidying. While you may wrangle your dataframe for a variety of purposes (e.g. changing the units of your variables or selecting only part of your dataset), they usually fall into one of those two broader categories.

Unfortunately for all data scientists, statisticians, and machine learning specialists, in the wild, real datasets are ugly. There are missing values, random whitespace where you would never guess it, typos, annoying header cells, and more! You cannot analyze or even visualize your data while it’s still “dirty”. Data cleaning is the process of massaging your data into an error-free form (usually tidy) that works best for your analysis. As you’ll soon learn in your Statistics career,

80% of your time working with data is spent cleaning.

Cleaning data is one thing (hint: dplyr’s got your back!), and involves a lot of time and care, but tidying data is a stricter standard still. As you’ll learn next week in r3, the tidyverse runs on tidy data. Its functions are optimized to work seamlessly with any clean dataset, as long as the dataset is tidy.

A tidy dataset has two (loose) properties:

  1. Every observation is a single row.
  2. Every variable is a single column.

While interpretations of observation and variable vary, it helps to know that tidy data is the opposite of wide data: tidy dataframes tend to be longer, rather than wider. For example, if you’re working with a monthly-resolution time series of restaurant sales data, the wide dataset would have a row for each year and a column for each month. The tidy equivalent would have three columns: one for the year, one for the month, and one for that month’s sales.
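To make the restaurant example concrete, here is a hypothetical sketch of both shapes (the wide_sales and tidy_sales names and all of the numbers are invented for illustration):

```r
# Wide form: one row per year, one column per month (only two months shown)
wide_sales <- data.frame(Year = c(2015, 2016),
                         Jan  = c(100, 120),
                         Feb  = c(110, 125))

# Tidy form: one row per (year, month) observation, one column per variable
tidy_sales <- data.frame(Year  = c(2015, 2015, 2016, 2016),
                         Month = c("Jan", "Feb", "Jan", "Feb"),
                         Sales = c(100, 110, 120, 125))
```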

A diagram of tidy data, from RStudio’s Data Wrangling Cheatsheet, might be useful:

Just as dplyr is specialized for manipulating data, the tidyr package is specialized to take cleaned data and output its tidy equivalent. Without further ado, let’s learn how to use these powerful tidyverse packages!


dplyr

dplyr is the tidyverse package for data manipulation. First, let’s do a quick review of dataframes with the simple functions to help view data.


head, View, arrange

The head function is useful to just see the first few entries of a dataframe. Let’s use head to print the first six lines of iris, one of R’s default datasets.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
	## 1          5.1         3.5          1.4         0.2  setosa
	## 2          4.9         3.0          1.4         0.2  setosa
	## 3          4.7         3.2          1.3         0.2  setosa
	## 4          4.6         3.1          1.5         0.2  setosa
	## 5          5.0         3.6          1.4         0.2  setosa
	## 6          5.4         3.9          1.7         0.4  setosa

If you look at the documentation for head with ?head, you can verify that the default number of lines shown is 6.

Rather than printing the dataframe to the console, the View function (note the capital V!) will open up a new pane in RStudio to view your dataframe as a spreadsheet, and also displays additional information like the number of observations (rows) and variables (columns). This pane is filterable and searchable, so it’s useful to quickly glance through large datasets! Let’s view our iris dataset in full:

View(iris) ## Run this command in your R Console!

Although you can sort your dataset in the View pane, suppose you want to sort your dataframe before using head to see the first six entries by some column. This is where dplyr’s sorting function comes in, arrange. By default, it will arrange your dataframe from least to greatest of some column variable. For example, let’s check the shortest petals in the iris dataset:

iris %>% arrange(Petal.Length) %>% head
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
	## 1          4.6         3.6          1.0         0.2  setosa
	## 2          4.3         3.0          1.1         0.1  setosa
	## 3          5.8         4.0          1.2         0.2  setosa
	## 4          5.0         3.2          1.2         0.2  setosa
	## 5          4.7         3.2          1.3         0.2  setosa
	## 6          5.4         3.9          1.3         0.4  setosa

If we wanted to check the longest petals instead, we would run:

iris %>% arrange(desc(Petal.Length)) %>% head
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
	## 1          7.7         2.6          6.9         2.3 virginica
	## 2          7.7         3.8          6.7         2.2 virginica
	## 3          7.7         2.8          6.7         2.0 virginica
	## 4          7.6         3.0          6.6         2.1 virginica
	## 5          7.9         3.8          6.4         2.0 virginica
	## 6          7.3         2.9          6.3         1.8 virginica

arrange is the first of the dplyr functions we’ve learned. Now that we’re warmed up, let’s tackle the rest.
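One more note on arrange before moving on: it accepts multiple columns, with ties in the first column broken by the second, and so on. A small sketch:

```r
library(tidyverse)

# Sort by Species alphabetically, then by descending Petal.Length within species
iris %>% arrange(Species, desc(Petal.Length)) %>% head(3)
```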


Dataframe Subsetting

One of the most essential operations in data science is to select some subset of either the columns or the rows of a dataset.


Row Selection (sample_*, slice, filter)

First, we’ll learn how to select only certain rows in a dataset, either randomly, by row number, or according to a logical expression.


sample_n, sample_frac

To randomly select a few rows from your dataset, use sample_n, which takes a dataframe and a number of entries to sample. There’s also sample_frac, which takes a dataframe and a fraction of the size of the dataframe to sample. Observe the following behaviors:

iris %>% sample_n(size = 5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
	## 54           5.5         2.3          4.0         1.3 versicolor
	## 147          6.3         2.5          5.0         1.9  virginica
	## 132          7.9         3.8          6.4         2.0  virginica
	## 47           5.1         3.8          1.6         0.2     setosa
	## 105          6.5         3.0          5.8         2.2  virginica
## NOTE: iris has 150 entries. 33% of 150 is ~50.
	iris %>% sample_frac(.33) %>% nrow
## [1] 50
iris %>% sample_n(50) %>% nrow
## [1] 50
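Because these functions sample randomly, your output will differ from the output printed above on every run. If you need a reproducible draw, seed R’s random number generator first with set.seed (the seed value 2018 here is arbitrary):

```r
library(tidyverse)

set.seed(2018)  # any fixed integer makes the draw reproducible
iris %>% sample_n(size = 3)

set.seed(2018)  # re-seeding with the same value yields the identical sample
iris %>% sample_n(size = 3)
```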

slice

To select rows by row number, we use the slice function. n() is a helper function that, when used inside another dplyr verb, returns the number of rows of the dataframe being operated on. Observe the following behaviors:

iris %>% slice(c(1,2,3,100,150))
## # A tibble: 5 x 5
	##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
	##          <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
	## 1          5.1         3.5          1.4         0.2     setosa
	## 2          4.9         3.0          1.4         0.2     setosa
	## 3          4.7         3.2          1.3         0.2     setosa
	## 4          5.7         2.8          4.1         1.3 versicolor
	## 5          5.9         3.0          5.1         1.8  virginica
iris %>% slice((n() - 3):n())
## # A tibble: 4 x 5
	##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
	##          <dbl>       <dbl>        <dbl>       <dbl>    <fctr>
	## 1          6.3         2.5          5.0         1.9 virginica
	## 2          6.5         3.0          5.2         2.0 virginica
	## 3          6.2         3.4          5.4         2.3 virginica
	## 4          5.9         3.0          5.1         1.8 virginica

filter

What if you wanted all the entries in your dataframe that satisfied certain properties? filter is a dplyr verb that takes in a dataframe and any number of logical expressions about the dataframe’s columns. Observe the following behaviors:

iris %>% filter(Sepal.Width < 2.5)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
	## 1           4.5         2.3          1.3         0.3     setosa
	## 2           5.5         2.3          4.0         1.3 versicolor
	## 3           4.9         2.4          3.3         1.0 versicolor
	## 4           5.0         2.0          3.5         1.0 versicolor
	## 5           6.0         2.2          4.0         1.0 versicolor
	## 6           6.2         2.2          4.5         1.5 versicolor
	## 7           5.5         2.4          3.8         1.1 versicolor
	## 8           5.5         2.4          3.7         1.0 versicolor
	## 9           6.3         2.3          4.4         1.3 versicolor
	## 10          5.0         2.3          3.3         1.0 versicolor
	## 11          6.0         2.2          5.0         1.5  virginica
iris %>% filter(Sepal.Width < 2.5, Species != "versicolor")
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
	## 1          4.5         2.3          1.3         0.3    setosa
	## 2          6.0         2.2          5.0         1.5 virginica
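Note that comma-separated conditions in filter are combined with AND, as in the second example above. For OR logic, use | inside a single expression, and use %in% to match a column against a set of values:

```r
library(tidyverse)

# OR logic: unusually narrow OR unusually wide sepals
iris %>% filter(Sepal.Width < 2.5 | Sepal.Width > 4.0) %>% nrow()

# %in% keeps rows whose Species is in the given set (50 + 50 = 100 rows)
iris %>% filter(Species %in% c("setosa", "virginica")) %>% nrow()
## [1] 100
```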

Column Selection (select, rename, pull)

Next, we’ll learn how to select specific columns, rather than rows, of our data. We will also learn how to rename columns, and how to extract singular columns as vectors rather than sub-dataframes with pull.


select

select accepts a dataframe and one or more column names, and returns a new dataframe with just those columns. Apart from giving the exact column names you want, select also uses several helper verbs, like ends_with and contains. You can read about these verbs with ?starts_with. Observe the following behaviors:

iris %>% select(Species, Petal.Width, Petal.Length) %>% head
##   Species Petal.Width Petal.Length
	## 1  setosa         0.2          1.4
	## 2  setosa         0.2          1.4
	## 3  setosa         0.2          1.3
	## 4  setosa         0.2          1.5
	## 5  setosa         0.2          1.4
	## 6  setosa         0.4          1.7
iris %>% select(starts_with("Petal")) %>% head
##   Petal.Length Petal.Width
	## 1          1.4         0.2
	## 2          1.4         0.2
	## 3          1.3         0.2
	## 4          1.5         0.2
	## 5          1.4         0.2
	## 6          1.7         0.4
iris %>% select(- Species) %>% head
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
	## 1          5.1         3.5          1.4         0.2
	## 2          4.9         3.0          1.4         0.2
	## 3          4.7         3.2          1.3         0.2
	## 4          4.6         3.1          1.5         0.2
	## 5          5.0         3.6          1.4         0.2
	## 6          5.4         3.9          1.7         0.4
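Two more select idioms worth knowing: the : operator selects a contiguous range of columns, and the contains helper matches a literal substring anywhere in a column name:

```r
library(tidyverse)

# All columns from Sepal.Length through Petal.Length, inclusive
iris %>% select(Sepal.Length:Petal.Length) %>% names()
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length"

# Every column whose name contains "Width"
iris %>% select(contains("Width")) %>% names()
## [1] "Sepal.Width" "Petal.Width"
```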

rename

You can also use select to rename columns as you select them. Observe the following behaviors:

iris %>% select(`Species Name` = Species) %>% head
##   Species Name
	## 1       setosa
	## 2       setosa
	## 3       setosa
	## 4       setosa
	## 5       setosa
	## 6       setosa
iris %>% select("Species Name" = Species, everything()) %>% head
##   Species Name Sepal.Length Sepal.Width Petal.Length Petal.Width
	## 1       setosa          5.1         3.5          1.4         0.2
	## 2       setosa          4.9         3.0          1.4         0.2
	## 3       setosa          4.7         3.2          1.3         0.2
	## 4       setosa          4.6         3.1          1.5         0.2
	## 5       setosa          5.0         3.6          1.4         0.2
	## 6       setosa          5.4         3.9          1.7         0.4

However, if you don’t want to specify that you want everything after you list your renamed columns, you can use the rename function, which has the same syntax as select but includes all of the other columns as well. The following command is nearly equivalent to the previous command:

iris %>% rename("Species Name" = Species) %>% head
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Name
	## 1          5.1         3.5          1.4         0.2       setosa
	## 2          4.9         3.0          1.4         0.2       setosa
	## 3          4.7         3.2          1.3         0.2       setosa
	## 4          4.6         3.1          1.5         0.2       setosa
	## 5          5.0         3.6          1.4         0.2       setosa
	## 6          5.4         3.9          1.7         0.4       setosa

pull

Importantly, note that select returns a dataframe, not a vector. If we wanted to extract just a single column from a dataframe as a vector, we can use the pull command. Observe the following behaviors:

iris %>% select(Petal.Width) %>% str()
## 'data.frame':    150 obs. of  1 variable:
	##  $ Petal.Width: num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
iris %>% pull(Petal.Width) %>% str()
##  num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
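pull is especially handy mid-pipe, where base R’s $ extraction would force you to break the chain. For example, to feed a single column straight into a summary function:

```r
library(tidyverse)

# Equivalent to mean(iris$Petal.Width), but stays inside the pipe
iris %>% pull(Petal.Width) %>% mean()
## [1] 1.199333
```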

Adding & Modifying Columns (mutate, transmute)

We now know how to subset and rename the columns of dataframes, but what about adding new ones? In this section, we’ll learn about mutate, dplyr’s general function for modifying or adding columns, and its close cousin, transmute.


mutate

mutate applies a window function to particular columns of a dataframe. A window function outputs one value for every row passed to it.

mutate is used for two extremely common use cases. The first use for mutate is to modify existing columns. For example, suppose we wanted to view the iris dataset, but wanted to capitalize the species names. Observe the following behavior:

iris %>% head(3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
	## 1          5.1         3.5          1.4         0.2  setosa
	## 2          4.9         3.0          1.4         0.2  setosa
	## 3          4.7         3.2          1.3         0.2  setosa
iris %>% mutate(Species = toupper(Species)) %>% head(3) ## Still ugly...
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
	## 1          5.1         3.5          1.4         0.2  SETOSA
	## 2          4.9         3.0          1.4         0.2  SETOSA
	## 3          4.7         3.2          1.3         0.2  SETOSA
tocamelcase <- function(str) {
	  str_sub(str, 1, 1) <- str_sub(str, 1, 1) %>% toupper
	  return(str)
	}
	iris %>% mutate(Species = tocamelcase(Species)) %>% head(3) ## Much better :)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
	## 1          5.1         3.5          1.4         0.2  Setosa
	## 2          4.9         3.0          1.4         0.2  Setosa
	## 3          4.7         3.2          1.3         0.2  Setosa

The second use case of mutate is to add new columns. Suppose we wanted to compute the Total.Length of the flowers in iris, which is the sum of their Petal.Length and Sepal.Length. New columns are appended on the right by default. Observe the following behavior:

iris %>% mutate(Total.Length = Petal.Length + Sepal.Length) %>% head(3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Total.Length
	## 1          5.1         3.5          1.4         0.2  setosa          6.5
	## 2          4.9         3.0          1.4         0.2  setosa          6.3
	## 3          4.7         3.2          1.3         0.2  setosa          6.0
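mutate can also build columns conditionally. Here is a sketch using base R’s ifelse to label each flower by petal length (the Size column and its 4 cm cutoff are invented for illustration):

```r
library(tidyverse)

# ifelse is vectorized: it returns one label per row
iris %>%
  mutate(Size = ifelse(Petal.Length > 4, "long", "short")) %>%
  head(3)
```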

transmute

A close cousin of mutate is transmute, which essentially does a mutate and then only selects the columns you mutated. You can also forcibly include columns by just having them as arguments without assignments. Observe the following behavior:

iris %>% transmute(Petal.Length, Sepal.Length, Total.Length = Petal.Length + Sepal.Length) %>% head(3)
##   Petal.Length Sepal.Length Total.Length
	## 1          1.4          5.1          6.5
	## 2          1.4          4.9          6.3
	## 3          1.3          4.7          6.0

summarise, summary

A very common operation in data science is computing statistics, or summary functions of data. Some example summary functions include mean, median, mode, max, and min. The dplyr verb for computing summary statistics for a dataframe is summarise. For example, suppose we wanted to know the average petal and sepal lengths in iris. Observe the following behavior:

iris %>% summarise(`Average Petal Length` = mean(Petal.Length),
	                   `Average Sepal Length` = mean(Sepal.Length))
##   Average Petal Length Average Sepal Length
	## 1                3.758             5.843333
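summarise can compute any number of statistics in one call, and dplyr’s n() helper counts rows, so a quick overview of a column might look like:

```r
library(tidyverse)

iris %>% summarise(Count = n(),
                   `SD Petal Length` = sd(Petal.Length) %>% round(2))
```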

Of note, there’s another function in base R called summary that seems similar to summarise but is actually completely different. Just as View, head, and str can be used to quickly glance at a dataset, summary accepts a dataframe and prints summary statistics for each column, for human use. For numeric columns, it gives the five-number summary (plus the mean) of the distribution of that column. For factor columns, it gives the counts of each factor level. Observe the following behavior:

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
	##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
	##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
	##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
	##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
	##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
	##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
	##        Species
	##  setosa    :50
	##  versicolor:50
	##  virginica :50
	##
	##
	## 

Grouped Data

While computing statistics is somewhat useful for whole datasets, we are often faced with questions about statistics for particular groups in our data. For example, we may wish to know the mean petal length for setosa, virginica, and versicolor (the mean for each Species) flowers. To condition by a particular variable, use the group_by function. Then, functions like summarise and mutate will apply per group rather than to the entire dataset at once. Observe the following behaviors:

str(iris)
## 'data.frame':    150 obs. of  5 variables:
	##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
	##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
	##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
	##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
	##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
grouped_iris <- iris %>% group_by(Species)
	str(grouped_iris)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  150 obs. of  5 variables:
	##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
	##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
	##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
	##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
	##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
	##  - attr(*, "vars")= chr "Species"
	##  - attr(*, "drop")= logi TRUE
	##  - attr(*, "indices")=List of 3
	##   ..$ : int  0 1 2 3 4 5 6 7 8 9 ...
	##   ..$ : int  50 51 52 53 54 55 56 57 58 59 ...
	##   ..$ : int  100 101 102 103 104 105 106 107 108 109 ...
	##  - attr(*, "group_sizes")= int  50 50 50
	##  - attr(*, "biggest_group_size")= int 50
	##  - attr(*, "labels")='data.frame':   3 obs. of  1 variable:
	##   ..$ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
	##   ..- attr(*, "vars")= chr "Species"
	##   ..- attr(*, "drop")= logi TRUE

As you can see from the output above, there are a few new attributes in our grouped_iris dataframe, like group_sizes and labels. You can also group by multiple variables at once, by passing multiple arguments into group_by. Finally, you can un-group a grouped dataframe by simply passing it through the ungroup function.


Grouped summarise

When paired with summarise, grouped data will apply the summary function per group, outputting a single row per group. Observe the following behavior:

iris %>%
	  group_by(Species) %>%
	  summarise(`Average Petal Length` = mean(Petal.Length),
	            `Average Sepal Length` = mean(Sepal.Length))
## # A tibble: 3 x 3
	##      Species `Average Petal Length` `Average Sepal Length`
	##       <fctr>                  <dbl>                  <dbl>
	## 1     setosa                  1.462                  5.006
	## 2 versicolor                  4.260                  5.936
	## 3  virginica                  5.552                  6.588
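A particularly common grouped summary is a per-group row count with n(); dplyr’s count function is shorthand for exactly this pattern:

```r
library(tidyverse)

# The long way: group, then count rows per group
iris %>% group_by(Species) %>% summarise(Count = n())

# count() collapses the group_by + summarise(n()) pattern into one call
iris %>% count(Species)
```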

Grouped mutate

When paired with mutate, grouped data will apply the window function per group, outputting a row for each row in the original dataset. Observe the following behavior:

iris %>% group_by(Species) %>%
	  mutate(`Petal Length Percentile within Species` = percent_rank(Petal.Length) %>% round(2)) %>%
	  select(Species, Petal.Length, `Petal Length Percentile within Species`) %>% sample_n(3) ## Notice: sample_n is for each group!
## # A tibble: 9 x 3
	## # Groups:   Species [3]
	##      Species Petal.Length `Petal Length Percentile within Species`
	##       <fctr>        <dbl>                                    <dbl>
	## 1     setosa          1.6                                     0.76
	## 2     setosa          1.5                                     0.49
	## 3     setosa          1.3                                     0.08
	## 4 versicolor          4.0                                     0.22
	## 5 versicolor          4.3                                     0.47
	## 6 versicolor          4.4                                     0.51
	## 7  virginica          5.7                                     0.63
	## 8  virginica          5.1                                     0.18
	## 9  virginica          5.8                                     0.69

Combining Datasets

At this point, we’ve covered all the essential dplyr verbs used to manipulate and transform individual datasets. But what if we wanted to combine datasets?

For example, suppose we had three separate dataframes with the same column variables, and wanted to combine them into a single dataset. dplyr’s bind_* functions are useful here.

Alternatively, suppose we had two dataframes that share a single column, and we wanted a new dataframe that encapsulated the columns in both in a single dataframe. dplyr’s *_join functions are useful here.


Binds

Suppose rather than having a single iris dataset, we had three distinct datasets, one for each group, as so:

head(iris_setosa, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
	## 1          5.1         3.5          1.4         0.2  setosa
	## 2          4.9         3.0          1.4         0.2  setosa
	## 3          4.7         3.2          1.3         0.2  setosa
head(iris_virginica, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
	## 1          6.3         3.3          6.0         2.5 virginica
	## 2          5.8         2.7          5.1         1.9 virginica
	## 3          7.1         3.0          5.9         2.1 virginica
head(iris_versicolor, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
	## 1          7.0         3.2          4.7         1.4 versicolor
	## 2          6.4         3.2          4.5         1.5 versicolor
	## 3          6.9         3.1          4.9         1.5 versicolor

If we wanted to bind these together into a single dataframe, we first notice that we want to combine rows, not columns. Observe the following behavior:

bind_rows(iris_setosa, iris_virginica, iris_versicolor) %>% sample_n(size = 10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
	## 115          5.6         2.9          3.6         1.3 versicolor
	## 16           5.7         4.4          1.5         0.4     setosa
	## 56           7.6         3.0          6.6         2.1  virginica
	## 52           5.8         2.7          5.1         1.9  virginica
	## 98           6.5         3.0          5.2         2.0  virginica
	## 141          5.5         2.6          4.4         1.2 versicolor
	## 82           7.9         3.8          6.4         2.0  virginica
	## 74           6.3         2.7          4.9         1.8  virginica
	## 76           7.2         3.2          6.0         1.8  virginica
	## 147          5.7         2.9          4.2         1.3 versicolor

If we instead bound the columns together, we would see the following behavior:

bind_cols(iris_setosa, iris_virginica, iris_versicolor) %>% str
## 'data.frame':    50 obs. of  15 variables:
	##  $ Sepal.Length : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
	##  $ Sepal.Width  : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
	##  $ Petal.Length : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
	##  $ Petal.Width  : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
	##  $ Species      : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
	##  $ Sepal.Length1: num  6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
	##  $ Sepal.Width1 : num  3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
	##  $ Petal.Length1: num  6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
	##  $ Petal.Width1 : num  2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
	##  $ Species1     : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
	##  $ Sepal.Length2: num  7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
	##  $ Sepal.Width2 : num  3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
	##  $ Petal.Length2: num  4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
	##  $ Petal.Width2 : num  1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
	##  $ Species2     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...

Joins

The other class of functions that combine datasets together are the *_join family. A join will align two datasets based on one or more key columns, then combine them together. There are a variety of different joins. I find that the differences are best illustrated visually. Look over this diagram from RStudio’s Data Wrangling Cheatsheet:

As a simple example of a join, suppose we had two dataframes:

worker_hours <- data.frame(Name = c("Alex", "Bridgett", "Cooper"), Shift = c("Day", "Day", "Night"))
	worker_hours
##       Name Shift
	## 1     Alex   Day
	## 2 Bridgett   Day
	## 3   Cooper Night
worker_pay <- data.frame(Name = c("Cooper", "Alex", "Bridgett"), Pay = c(12.5, 15, 14))
	worker_pay
##       Name  Pay
	## 1   Cooper 12.5
	## 2     Alex 15.0
	## 3 Bridgett 14.0

Obviously, these two dataframes are talking about the same people, but are in different orders. A bind_cols won’t quite work out-of-the-box. Here’s where join is useful:

full_join(worker_hours, worker_pay)
## Joining, by = "Name"
##       Name Shift  Pay
	## 1     Alex   Day 15.0
	## 2 Bridgett   Day 14.0
	## 3   Cooper Night 12.5
## More specifically,
	full_join(worker_hours, worker_pay, by = c("Name"))
##       Name Shift  Pay
	## 1     Alex   Day 15.0
	## 2 Bridgett   Day 14.0
	## 3   Cooper Night 12.5

By default, dplyr will try to guess which columns you are joining by, joining by columns with matching names automatically. Since this is the default behavior for joins, this is known as a natural join.
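The other join flavors differ only in which non-matching rows they keep. A small sketch with hypothetical workers, where each dataframe mentions someone the other lacks (the hours and pay frames below are invented for illustration):

```r
library(tidyverse)

hours <- data.frame(Name = c("Alex", "Bridgett"), Shift = c("Day", "Night"))
pay   <- data.frame(Name = c("Alex", "Cooper"),   Pay   = c(15, 12.5))

left_join(hours, pay, by = "Name")   # all rows of hours; Bridgett's Pay is NA
inner_join(hours, pay, by = "Name")  # only matching keys: just Alex
full_join(hours, pay, by = "Name")   # everyone: Alex, Bridgett, and Cooper
```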


tidyr

Now that you’re all caught up on dplyr, let’s look at its sister package, tidyr. tidyr also has functions to clean data, but its primary focus is on tidying data. Recall from the introduction of this section that tidy data has two properties:

  1. Every observation is a single row.
  2. Every variable is a single column.

Let’s illustrate the difference between tidy and wide data with a couple of examples:

head(wide_dataset)
##   Year       Jan      Feb       Mar       Apr       May      June
	## 1 2015  97.36476 109.8633 101.81566  96.76982 107.30600  93.08192
	## 2 2016 111.08158  89.7403  91.72008 112.65582 106.20730  96.43232
	## 3 2017 112.82670 112.1236  99.70381  97.37177  97.82331 100.32960
	##        July       Aug       Sep       Oct      Nov       Dec
	## 1  97.89800  82.64665 102.48015  95.81758 97.55374  98.75837
	## 2  96.83301 107.57467 105.48091 112.74316 87.09467  91.67383
	## 3 101.32571 105.09665  85.28725  96.74593 88.63767 109.71750
str(wide_dataset)
## 'data.frame':    3 obs. of  13 variables:
	##  $ Year: int  2015 2016 2017
	##  $ Jan : num  97.4 111.1 112.8
	##  $ Feb : num  109.9 89.7 112.1
	##  $ Mar : num  101.8 91.7 99.7
	##  $ Apr : num  96.8 112.7 97.4
	##  $ May : num  107.3 106.2 97.8
	##  $ June: num  93.1 96.4 100.3
	##  $ July: num  97.9 96.8 101.3
	##  $ Aug : num  82.6 107.6 105.1
	##  $ Sep : num  102.5 105.5 85.3
	##  $ Oct : num  95.8 112.7 96.7
	##  $ Nov : num  97.6 87.1 88.6
	##  $ Dec : num  98.8 91.7 109.7
head(tidy_dataset)
##   Year Month Stock Price
	## 1 2015   Jan    97.36476
	## 2 2015   Feb   109.86328
	## 3 2015   Mar   101.81566
	## 4 2015   Apr    96.76982
	## 5 2015   May   107.30600
	## 6 2015  June    93.08192
str(tidy_dataset)
## 'data.frame':    36 obs. of  3 variables:
	##  $ Year       : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
	##  $ Month      : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 2 3 4 5 6 7 8 9 10 ...
	##  $ Stock Price: num  97.4 109.9 101.8 96.8 107.3 ...

Tidy data is very desirable for tidyverse functions because tidy data works best with concepts such as grouping, window functions, and the grammar for ggplot2, which we will learn next week.


gather

gather is how you get from wide data to tidy data. gather will gather columns to instead treat them as rows. The syntax for gather is df %>% gather(key, value, ...). The key is the name of the first of the new columns, which will hold the current column names. The value is the name of the second of the new columns, which will hold the information associated with the current column names. Finally, the ... represents the columns you wish to pull down into rows.

This is best illustrated with an example. Let’s turn wide_dataset into tidy_dataset:

tidy_dataset_2 <- wide_dataset %>% gather("Month", "Stock Price", -Year) # Gather every column that is NOT Year
	head(tidy_dataset_2)
##   Year Month Stock Price
	## 1 2015   Jan    97.36476
	## 2 2016   Jan   111.08158
	## 3 2017   Jan   112.82670
	## 4 2015   Feb   109.86328
	## 5 2016   Feb    89.74030
	## 6 2017   Feb   112.12365

spread

The inverse of the gather function is spread. In other words, spread is how you get from tidy data back to wide data. spread takes the values of one column and spreads them out as new column names. The syntax for spread is similar to gather: df %>% spread(key, value). The key is the column whose values you wish to spread out as column names. The value is the column whose values will fill the new columns you’re making.

Again, this is best illustrated with an example. Let’s turn the tidy_dataset into the wide_dataset:

wide_dataset_2 <- tidy_dataset %>% spread(Month, `Stock Price`)
	head(wide_dataset_2)
##   Year       Jan      Feb       Mar       Apr       May      June
	## 1 2015  97.36476 109.8633 101.81566  96.76982 107.30600  93.08192
	## 2 2016 111.08158  89.7403  91.72008 112.65582 106.20730  96.43232
	## 3 2017 112.82670 112.1236  99.70381  97.37177  97.82331 100.32960
	##        July       Aug       Sep       Oct      Nov       Dec
	## 1  97.89800  82.64665 102.48015  95.81758 97.55374  98.75837
	## 2  96.83301 107.57467 105.48091 112.74316 87.09467  91.67383
	## 3 101.32571 105.09665  85.28725  96.74593 88.63767 109.71750

NA Values

tidyr also contains a useful function for dealing with NA values. NA stands for Not Available, and is used to denote missing values in datasets. For example, suppose we had a dataset that was missing some values:

missing_dataset
##        Name Age
	## 1     Alice  23
	## 2       Bob  20
	## 3 Catherine  NA
	## 4     David  19
	## 5    Esther  NA

In this dataset, we don’t know the ages of Catherine and Esther! NA values are annoying to deal with, because they are infectious - any operation involving an NA value, with the exception of is.na, will return NA, no matter what. Keeping too many NA values in your dataset may cause your outputs of mutate or summarise to be NA as well.
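A few quick expressions show how infectious NA really is:

```r
# NA propagates through arithmetic, comparisons, and summaries alike
1 + NA              ## NA
mean(c(1, 2, NA))   ## NA
NA > 5              ## NA - even comparisons come back unknown
is.na(NA)           ## TRUE - the one reliable exception

# Many summary functions accept na.rm = TRUE to ignore NAs instead
mean(c(1, 2, NA), na.rm = TRUE)  ## 1.5
```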


Using tidyr (drop_na)

One common approach to NA values is to simply remove them. This is very easily accomplished with tidyr’s drop_na function. Observe the following behavior:

missing_dataset %>% drop_na()
##    Name Age
	## 1 Alice  23
	## 2   Bob  20
	## 4 David  19

Using dplyr

The behavior of drop_na can be emulated with filter and the predicate is.na. Observe the following behavior:

missing_dataset %>% filter(! is.na(Age))
##    Name Age
	## 1 Alice  23
	## 2   Bob  20
	## 3 David  19
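Dropping rows is not the only option. tidyr also provides replace_na, which fills missing values with a stand-in you choose. A minimal sketch, rebuilding the missing_dataset from above and substituting 0 for the unknown ages (whether 0 is a sensible stand-in depends on your analysis):

```r
library(tidyr)

missing_dataset <- data.frame(
  Name = c("Alice", "Bob", "Catherine", "David", "Esther"),
  Age  = c(23, 20, NA, 19, NA))

# Replace NA in the Age column with 0 instead of dropping those rows
filled <- replace_na(missing_dataset, list(Age = 0))
filled$Age
## [1] 23 20  0 19  0
```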

Data Import

Phew! That was a lot of functions to cover, but you should now have the essentials down to tackle most data cleaning and manipulation tasks. Now, let’s talk about how to import your (probably dirty) data into R.


readr & readxl

The tidyverse contains two packages for importing data files into R as dataframes. The first is readr, used for most general file formats, including CSV (comma-separated value files) and TSV (tab-separated value files) files. The second is readxl, which is specialized for reading Excel spreadsheets, and includes optional arguments to only read specific regions and sheets.

The syntax for readr’s various read_* functions is fairly standard across all file types. For whatever file you’re importing, check the documentation for that filetype. As an example, check the documentation for read_csv in your R Console with ?read_csv. Some remarks on the more useful arguments:

  • file: The (relative or absolute) path to the file you want to import as a dataframe
  • col_names: If TRUE (default) will interpret the first line as the names of the columns. If FALSE, will name them X1, X2, etc. automatically. If a vector of strings is passed instead of a logical value, will use those as column names.
  • na: What values should be converted into NA upon import?
  • skip: Some CSV’s start with the first few lines as a description of the dataset, rather than the dataset itself. Use this argument to skip entire lines from your CSV when importing into a dataframe.
  • n_max: Sometimes, you only want the first few rows. Use n_max to tell readr how many rows you want to keep in your outputted dataframe.
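The arguments above can be combined in one call. A minimal sketch - the file contents and the temporary path are made up for illustration, with a two-line description at the top and "?" marking a missing value:

```r
library(readr)

# Write a small made-up CSV to a throwaway temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Worker pay data",
             "Collected by hand",
             "Name,Pay",
             "Alex,15",
             "Bridgett,?",
             "Cooper,12.5"), tmp)

# Skip the two description lines, treat "?" as NA,
# and keep only the first two data rows
workers <- read_csv(tmp, skip = 2, na = "?", n_max = 2)
workers$Name  # "Alex", "Bridgett"
workers$Pay   # 15, NA
```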

The syntax for readxl’s read_excel is very similar to that of the read_* functions. By default, read_excel will try to guess if you’re reading an .xls or .xlsx file by reading the file extension. Read the documentation of read_excel. Some remarks on the more useful arguments:

  • sheet: Since some Excel spreadsheets contain multiple sheets, this can be either the name of the sheet you want or its position (e.g. 1, 2, 3, etc.)
  • range: If you simply want a particular range (rectangular region of cells) from your sheet, you can specify it with a notation like "B3:D87", which will cover all cells with rows 3:87 and columns B, C, D. This takes precedence over skip and n_max.

Column Types

You’ll notice I left out col_types in the remarks above. The col_types argument is usually a confusing argument for new users of readr and readxl, so this entire section is dedicated to explaining how it works!

By default, col_types = NULL, and readr will try to guess the correct types for each column by reading a few rows in each column. However, it’s not always correct in its guess.

One common point of confusion is the difference between factors and characters. Recall that a factor is a type that refers to categorical data with a finite number of categories. R can work with factors more efficiently than with raw strings, since checking equality and sorting on a fixed set of levels is faster than parsing text. Base R’s read.csv has historically converted strings into factors by default, but readr does the opposite: it imports text columns as plain characters and never guesses a factor. If you want a factor column, you have to ask for one explicitly.

To specify which column types you want for each column, use the cols and col_* helper functions. An example usage looks like this:

read_csv("somedataset.csv", col_names = c("A", "B", "C"), col_types = cols(A = col_double(), B = col_date(), C = col_guess()))

Alternatively, you can skip the cols function altogether and just pass a string of single letters, each representing a col_* function. This abbreviated command is equivalent to the one above:

read_csv("somedataset.csv", col_names = c("A", "B", "C"), col_types = "dD?")

Here is a quick tabular summary of all of the different column types and their equivalent col_* functions and abbreviations:

Data Type   col_* Function   String Abbreviation
guess       col_guess        “?”
character   col_character    “c”
integer     col_integer      “i”
number      col_number       “n”
double      col_double       “d”
logical     col_logical      “l”
Date        col_date         “D”
factor      col_factor       N/A
skip        col_skip         “-”
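Since col_factor has no single-letter abbreviation, requesting a factor always means spelling it out with cols(). A minimal sketch, writing a made-up CSV to a temporary file for illustration:

```r
library(readr)

# A made-up CSV written to a throwaway temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Shift", "Alex,Day", "Bridgett,Day", "Cooper,Night"), tmp)

# col_factor must be requested explicitly via cols()
workers <- read_csv(tmp,
                    col_types = cols(Name  = col_character(),
                                     Shift = col_factor(levels = c("Day", "Night"))))
class(workers$Shift)   ## "factor"
levels(workers$Shift)  ## "Day" "Night"
```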

Functional Programming in R (purrr)

Now that we’ve reviewed the essentials of data wrangling and data import, we’re going to cover one last topic - functional programming in R. Functional programming is the paradigm that relies on applying functions to elements in multi-element structures. As we’ve seen so far, R has many multi-element structures, including vectors, lists, and dataframes. Because so much of data science relies on algorithms that “for each X, do Y” or “for each X, compute f(X)”, R contains a bunch of functions designed for functional programming. Known as the apply family, it consists of functions like mapply, tapply, sapply, and lapply. However, as with dataframe manipulation and data visualization, the tidyverse has packages that make things even easier. In this section, we will give a brief primer on tidyverse’s functional programming library, purrr.


The map Family

The functions in the map family all follow the same pattern: they take a collection of elements (e.g. a list), apply a function to each one, and return the new collection of elements. For example, suppose you had a list of three-element vectors, each representing a quadratic polynomial. You know the general Quadratic Formula for finding the larger root, and you wish to find the larger root of each polynomial without copying-and-pasting or doing it all manually. This is where the map family comes in.
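The quadratic example can be sketched directly with map_dbl (covered below). The coefficient vectors and the larger_root helper here are made up for illustration, and we assume each quadratic has real roots:

```r
library(purrr)

# Each vector holds the coefficients (a, b, c) of a*x^2 + b*x + c
quadratics <- list(c(1, -3,  2),   # roots 1 and 2
                   c(1,  0, -4),   # roots -2 and 2
                   c(2, -4, -6))   # roots -1 and 3

# Larger root via the Quadratic Formula (assumes real roots)
larger_root <- function(coefs) {
  a <- coefs[1]; b <- coefs[2]; c <- coefs[3]
  (-b + sqrt(b^2 - 4 * a * c)) / (2 * a)
}

map_dbl(quadratics, larger_root)
## [1] 2 2 3
```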


map

As a simpler problem, suppose we had a list of vectors and wanted to find the maximal value of each one. With map that’s easy! map accepts a list and a function, and returns a list where the function was applied to each element. Observe the following behaviors:

my_vecs <- list(
	  c(1,5,4),
	  1:10,
	  13:300,
	  c("r" = 15, "s" = 9, "t" = 6, "u" = 1, "d" = 25, "o" = 6)
	)
	str(my_vecs)
## List of 4
	##  $ : num [1:3] 1 5 4
	##  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
	##  $ : int [1:288] 13 14 15 16 17 18 19 20 21 22 ...
	##  $ : Named num [1:6] 15 9 6 1 25 6
	##   ..- attr(*, "names")= chr [1:6] "r" "s" "t" "u" ...
map(my_vecs, max) ## applies `max` to each element of `my_vecs`
## [[1]]
	## [1] 5
	##
	## [[2]]
	## [1] 10
	##
	## [[3]]
	## [1] 300
	##
	## [[4]]
	## [1] 25

map_int, map_dbl, map_lgl, map_chr

Hmmm… map always returns a list. But wouldn’t it be better if our output this time were a numeric vector, since the results are all of the same type and the output looks flat anyway?

To return a vector rather than a list, use the map_* functions. In this case, since we want our output to be a numeric vector, we use map_dbl:

map_dbl(my_vecs, max) ## Much cleaner than `map` in this case
## [1]   5  10 300  25

Here is a quick tabular summary of the more common map_* functions:

map Function   Input Type     Output Type
map            list, vector   list
map_lgl        list, vector   logical vector
map_int        list, vector   integer vector
map_dbl        list, vector   double vector
map_chr        list, vector   character vector

There are many more functions related to map. Run ?map in your R Console to check them out!
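One convenience worth knowing: everywhere a map function expects a function, purrr also accepts a one-sided formula as shorthand for an anonymous function, with .x standing in for the current element. A small sketch, reusing a shortened version of my_vecs from above:

```r
library(purrr)

my_vecs <- list(c(1, 5, 4), 1:10, 13:300)

# ~ max(.x) - min(.x) is purrr shorthand for function(.x) max(.x) - min(.x)
map_dbl(my_vecs, ~ max(.x) - min(.x))
## [1]   4   9 287
```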

More purrr

purrr is an extremely versatile package with dozens of functions. We cannot cover them all here, but I’ll touch upon a couple to give you a flavor of what purrr is used for.


reduce

If you want to cumulatively apply a function to each element of a list or vector, use the reduce function. By default, reduce will apply an operation repeatedly to the list, from left to right. reduce_right is a version of reduce that instead combines from right to left. Mathematically, reduce(x, f) will return $f(\ldots f(f(x_1, x_2), x_3) \ldots, x_n)$. Observe the following behavior, where reduce is used to do iterative subtraction (subtract is magrittr’s alias for the - operator):

v <- 6:1
	reduce(v, subtract)
## [1] -9
# 6 - 5 - 4 - 3 - 2 - 1 = -9
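If you also want to see the intermediate values, purrr’s accumulate function works like reduce but keeps every step along the way. A quick sketch using the backtick form of the - operator, which avoids needing magrittr’s alias:

```r
library(purrr)

v <- 6:1
reduce(v, `-`)      # only the final value
## [1] -9
accumulate(v, `-`)  # every running result: 6, 6-5, 6-5-4, ...
## [1]  6  1 -3 -6 -8 -9
```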

If you ever have a list of dataframes, and want to apply the same combining function to all of them, reduce works here too! Observe the following behavior:

iris_dfs <- list(iris_setosa, iris_virginica, iris_versicolor)
	iris_dfs %>% reduce(bind_rows) %>% sample_n(10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
	## 3            4.7         3.2          1.3         0.2     setosa
	## 25           4.8         3.4          1.9         0.2     setosa
	## 39           4.4         3.0          1.3         0.2     setosa
	## 43           4.4         3.2          1.3         0.2     setosa
	## 47           5.1         3.8          1.6         0.2     setosa
	## 84           6.3         2.8          5.1         1.5  virginica
	## 147          5.7         2.9          4.2         1.3 versicolor
	## 81           7.4         2.8          6.1         1.9  virginica
	## 17           5.4         3.9          1.3         0.4     setosa
	## 137          6.7         3.1          4.7         1.5 versicolor

walk

Suppose you want to print the length of each element of a nested list. walk will apply a function to each element of a list, just like map, but is used purely for side effects like printing: it invisibly returns its input rather than a transformed result. Observe the following behavior:

complicated <- list(
	  c(1,2,3),
	  3:1,
	  list("pink",
	       "blue",
	       "saas",
	       "colors"),
	  list(1:100, 100:10, 10:1, "hello", "CX!", "world")
	)
	walk(complicated, function(element){print(length(element))})
## [1] 3
	## [1] 3
	## [1] 4
	## [1] 6

Conclusion

This ends our textbook-style tutorial on dplyr, readr, and other tidyverse packages. For more practice, check out the mini-projects section of r2-workbook.


Sneak Peek at r3

Next week, we will be covering my personal favorite topic, data visualization in R with ggplot2! You will continue practicing tidying your data for effective piping into ggplot2 functions. We will also cover the first of our machine learning algorithms in R, linear regression using the lm function and the broom package.


Additional Reading