tidyr
This lab focuses on manipulating, cleaning, and preparing data for visualization (or other analyses) using packages from the tidyverse suite.
Directions (Please read before starting)
\(~\)
Show below are the acceptance rates of three different liberal arts colleges from 2018 to 2020:
As a data scientist, you must be able to connect the format of a data set with the information presented in a graph.
ggplot()
function?\(~\)
Data in a wide format record many different values or variables for a single entity in a single row (ie: acceptance rates for a college in different years, or different test scores for the same employee).
Data in a long format use multiple rows and a single column for the outcome or value of interest (ie: acceptance rate, test score, etc.) with additional columns identifying the meaning of that value.
Example:
The ggplot2
package, as well as the implementations of
many statistical models, expect data in long
format.
\(~\)
Converting between “wide” and “long” formats is often the most challenging step in creating a “tidy” data set, or one that is fully prepared for graphing/modeling.
In general, tidy data are defined by the following criteria:
This lab will introduce several data manipulation functions used to help obtain tidy data.
\(~\)
This lab primarily uses tidyr
package, which is used to
“tidy” or reshape data into a more usable format. It will also use the
ggplot2
package.
# Please install and load the following packages
# install.packages("tidyr")
library(tidyr)
library(ggplot2)
The lab will use several data sets:
collegeAdm <- read.csv("https://remiller1450.github.io/data/college_adm.csv")
bluechips <- read.csv("https://remiller1450.github.io/data/bluechips.csv")
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
\(~\)
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
\(~\)
Consider the collegeAdm
data frame:
## Look at all 9 observations
head(collegeAdm, n = 9)
## Adm_Rate Year College
## 1 28.9 2018 Grinnell
## 2 24.4 2019 Grinnell
## 3 23.1 2020 Grinnell
## 4 21.2 2018 Carlton
## 5 19.8 2019 Carlton
## 6 19.1 2020 Carlton
## 7 33.7 2018 Oberlin
## 8 36.2 2019 Oberlin
## 9 36.4 2020 Oberlin
This data is currently in “long” format, meaning that each row is
associated with a single observation (year, college, rate). We can
convert our college admit data to a “wide” format using the
pivot_wider()
function:
## Pivot from long to wide to get 1 row per Year
wideCollegeAdm <- pivot_wider(collegeAdm,
id_cols = Year,
names_from = College,
values_from = Adm_Rate)
head(wideCollegeAdm)
## # A tibble: 3 × 4
## Year Grinnell Carlton Oberlin
## <int> <dbl> <dbl> <dbl>
## 1 2018 28.9 21.2 33.7
## 2 2019 24.4 19.8 36.2
## 3 2020 23.1 19.1 36.4
The following arguments guide the transformation:
id_cols
defines the rows of the “wide” data (i.e., each
row will be a unique value of “Year”)names_from
defines the single column from the “long”
data that should be spread into multiple distinct columns in the “wide”
data (i.e., each “College” should get a column)values_from
defines the single column from the “long”
data containing the values used to populate the cells of the “wide” data
(i.e., the columns created for each “College” should contain values of
“Adm_Rate”)Notice what happens when id_cols
and
names_from
are swapped:
## Pivot from long to wide to get 1 row per College
wideCollegeAdm2 <- pivot_wider(collegeAdm,
id_cols = College,
names_from = Year,
values_from = Adm_Rate)
head(wideCollegeAdm2)
## # A tibble: 3 × 4
## College `2018` `2019` `2020`
## <chr> <dbl> <dbl> <dbl>
## 1 Grinnell 28.9 24.4 23.1
## 2 Carlton 21.2 19.8 19.1
## 3 Oberlin 33.7 36.2 36.4
Similarly, the pivot_long()
function will transform
“wide” data into “long” data:
pivot_longer(wideCollegeAdm2,
cols = c("2018", "2019", "2020"),
names_to = "Year",
values_to = "Adm_Rate")
## # A tibble: 9 × 3
## College Year Adm_Rate
## <chr> <chr> <dbl>
## 1 Grinnell 2018 28.9
## 2 Grinnell 2019 24.4
## 3 Grinnell 2020 23.1
## 4 Carlton 2018 21.2
## 5 Carlton 2019 19.8
## 6 Carlton 2020 19.1
## 7 Oberlin 2018 33.7
## 8 Oberlin 2019 36.2
## 9 Oberlin 2020 36.4
cols
defines the column(s) used in the pivot, here this
is each of the years variables that will be combined into a single
variablenames_to
is the name of the single column in the “long”
data frame that will store the column names of the “wide” data
framevalues_to
is the name of the single column in the
“long” data frame that will store the values from the cells of
the “wide” data frameIt is worth noting that there a few alternative ways to specify the
argument cols
. The first is a syntax that is unique to the
“tidyverse” and takes the form !variable
to specify that we
want to include all column names except variable
.
In the case of our admit date, it would look like !College
to indicate that we want all column names except college
## Note: for more than one column, you can use !C(col1, col2, ...)
pivot_longer(wideCollegeAdm2,
cols = !College,
names_to = "Year",
values_to = "Adm_Rate")
## # A tibble: 9 × 3
## College Year Adm_Rate
## <chr> <chr> <dbl>
## 1 Grinnell 2018 28.9
## 2 Grinnell 2019 24.4
## 3 Grinnell 2020 23.1
## 4 Carlton 2018 21.2
## 5 Carlton 2019 19.8
## 6 Carlton 2020 19.1
## 7 Oberlin 2018 33.7
## 8 Oberlin 2019 36.2
## 9 Oberlin 2020 36.4
Alternatively, you can use the names()
function, which
returns all of the column names of a data frame, along with the
setdiff()
function, which returns the difference between
two sets. So, for example, if we wanted all of the column names
except for college, we could express this as follows:
names(wideCollegeAdm2)
## [1] "College" "2018" "2019" "2020"
setdiff(names(wideCollegeAdm2), "College")
## [1] "2018" "2019" "2020"
pivot_longer(wideCollegeAdm2,
cols = setdiff(names(wideCollegeAdm2), "College"),
names_to = "Year",
values_to = "Adm_Rate")
## # A tibble: 9 × 3
## College Year Adm_Rate
## <chr> <chr> <dbl>
## 1 Grinnell 2018 28.9
## 2 Grinnell 2019 24.4
## 3 Grinnell 2020 23.1
## 4 Carlton 2018 21.2
## 5 Carlton 2019 19.8
## 6 Carlton 2020 19.1
## 7 Oberlin 2018 33.7
## 8 Oberlin 2019 36.2
## 9 Oberlin 2020 36.4
Ultimately, you are welcome to use whichever you find most comfortable or easiest to remember.
Question 1: Convert the bluechips
data
to a long format where each stock’s closing price on the first trading
day of each year is recorded in a single column named “Price”.
Question 2: Starting with the long format data frame
you created in Question #1, recreate the original bluechips
data set using pivot_wider()
.
\(~\)
tidyr
functionsReshaping is often only one step in the process of tidying a data
set. Sometimes, data will contain multiple variables in a single column.
For example, consider the polls
data set:
head(polls)
## Poll Date Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 - 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 - 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 - 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 - 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 - 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 - 7/11 837 RV NA 40 36
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
The column Date
contains two distinct variables, the
start and end of the poll’s sampling period. Similarly,
Sample
also contains two variables, the number of
participants in the poll and the population that was sampled (registered
voters or likely voters).
The separate()
function is used to break a
character column into multiple new columns:
## Example #1
tidy_polls <- separate(polls,
col = Date,
into = c("Begin", "End"),
sep = " - ")
head(tidy_polls)
## Poll Begin End Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 7/11 837 RV NA 40 36
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
col
is the column to be separatedinto
indicates the names of the new columns that should
result from the separationsep
is the character string used to identify splitting
points (in this example it’s -
surrounded by a space on
each side)## Example #2
tidy_polls <- separate(polls,
col = Sample,
into = c("Size", "Population"))
head(tidy_polls)
## Poll Date Size Population MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 - 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 - 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 - 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 - 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 - 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 - 7/11 837 RV NA 40 36
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
In Example #2, the “sep” argument is not explicitly provided. In this
situation, separate()
will attempt to guess an appropriate
separator. While this is not generally recommended, it works well for
simple character strings. More complex strings might require the use of
regular expressions, a topic we’ll cover later this
semester.
Question 3 (Part A): Using either the
pivot_longer()
or pivot_wider()
function,
create a version of the tidy_polls
data containing the
variables Candidate
and Percentage
, where
Candidate
is taken from the names of the last four columns
of the data frame, and “Percentage” is taken from the values contained
in these columns. It should look like this:
tt <- pivot_longer(polls, cols = c("Clinton..D.", "Trump..R.", "Johnson..L.", "Stein..G."), names_to = "Candidate", values_to = "Percentage")
head(tt, n = 4)
## # A tibble: 4 × 6
## Poll Date Sample MoE Candidate Percentage
## <chr> <chr> <chr> <dbl> <chr> <int>
## 1 Monmouth 7/14 - 7/16 688 LV 3.7 Clinton..D. 45
## 2 Monmouth 7/14 - 7/16 688 LV 3.7 Trump..R. 43
## 3 Monmouth 7/14 - 7/16 688 LV 3.7 Johnson..L. 5
## 4 Monmouth 7/14 - 7/16 688 LV 3.7 Stein..G. 1
Question 3 (Part B): Using the
separate()
function, split the column “Candidate” (created
in Part A) into two distinct columns containing the name of the
candidate (ie: Clinton, Trump, etc.) and their political party (ie: D,
R, etc.). Hint: periods, or .
, are a special
character in R
, but you can reference one using the
expression: \\.
. The \\
is known as an “escape
sequence” telling R
to ignore the special character status
of whatever follows. If you have multiple special characters in a row,
you will need to be sure to “escape” each one of them
\(~\)
The “airlines” data set (loaded below) contains data used in the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.
airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv")
head(airlines)
## airline avail_seat_km_per_week incidents.1985_1999
## 1 Aer Lingus 320906734 2
## 2 Aeroflot* 1197672318 76
## 3 Aerolineas Argentinas 385803648 6
## 4 Aeromexico* 596871813 3
## 5 Air Canada 1865253802 2
## 6 Air France 3004002661 14
## fatal_accidents.1985_1999 fatalities.1985_1999 incidents.2000_2014
## 1 0 0 0
## 2 14 128 6
## 3 0 0 1
## 4 1 64 5
## 5 0 0 2
## 6 4 79 6
## fatal_accidents.2000_2014 fatalities.2000_2014
## 1 0 0
## 2 1 88
## 3 0 0
## 4 0 0
## 5 0 0
## 6 2 337
Recall that a “tidy” version of these data should satisfy the following:
Question 4 (Part A): Use pivot_longer()
to gather the last six columns of the airlines
data into a
column named accidents
and a column named
count
.
Question 4 (Part B): Use separate()
to
split the accidents
column into two variables named
type
and years
. Hint: remember that
the period is a special character in R
.
Question 4 (Part C): Use pivot_wider()
to spread out the type
column into three new columns
containing the type of accident. Your data should now contain two rows
per airline (one for each time period), you can check if the first few
rows match those printed below.
## # A tibble: 5 × 6
## airline avail_seat_km_per_week years incidents fatal_accidents fatalities
## <chr> <dbl> <chr> <int> <int> <int>
## 1 Aer Lingus 320906734 1985… 2 0 0
## 2 Aer Lingus 320906734 2000… 0 0 0
## 3 Aeroflot* 1197672318 1985… 76 14 128
## 4 Aeroflot* 1197672318 2000… 6 1 88
## 5 Aerolineas … 385803648 1985… 6 0 0
Question 4 (Part D): Using ggplot2
,
create a scatter plot displaying the relationship between
avail_seat_km_per_week
, fatal_accidents
, and
years
Question 5 The iris
dataset builtin to
R
is a collection of measurements (in cm) of the sepal and
petal dimensions of 50 different flowers coming from three different
species of iris (I have verified that all of the values in this dataset
are what they should be – no surprise numerics or factors!)
Your goal is to recreate the following graphic using the
tidyr
functions covered in this lab
## Load data
data(iris)