tidyrThis lab is intended to help introduce you to the tidyr
package, one of several tools we have available for manipulating,
cleaning, and preparing our data according to “tidy principles”
This exercise will be done in class. For completeness, it is included here.
Shown below are the acceptance rates of three different liberal arts colleges from 2018 to 2020:
Question 0: Use the plot above to answer the following:
Based on the answers you gave above, create a sketch of what the associated dataset likely looks like in R.
These next few sections will highlight a few relevant definitions for creating and working with tidy data
We saw previously that data we will be primarily working with in this course will be tidy data. Tidy data is data that subscribes to the following properties:
As we will see, what exactly constitutes an “observation” or a “variable” will depend on context, and a given data frame may be oriented in a number of different ways. This will depend on whether data is stored in wide format or long format.
Data that is in a wide format is dat such that each row constitutes a single unique entity. Necessarily this is a bit ambiguous, but for now let’s entertain the idea that is a unique item and every attribute that can be specified about this item is included as a variable. In this case, our observation is the single entity, and the variables are each thing that can possibly be said about it
Data that is in a long format differs in that our single unique entity can be repeated in multiple rows. This happens frequently when we have repeated observations of a single entity. In this case, an observation is a specific instance of our single unique entity, and the variables are each thing that can be said about that specific instance.
Consider this illustration below, where we have an employee ID (signifying a single unique entity) who has been given both a pre-test and a post-test, scored between 0-100:
On the left hand side, we see data in wide format. Each single unique entity (employee ID) has a single row, and the variables each indicate all of the tests that they took. The values in the cells represent their scores
On the right hand side, we see data in long format. Here, each single unique entity is repeated, according to the number of observations. The variables in this case are now entirely different: the first indicates which type of test was taken (pre/post), and the third variable accounts for their respective scores.
The goal of the present lab is to illustrate how to identify wide/long data and convert it as necessary into whichever form is needed.
This lab will require both ggplot2 (already installed)
and tidyr (likely not installed). Copy
install.packages("tidyr") into your console to install.
Then, load the required packages and data by copying the R chunk below
into your own lab
library(ggplot2)
library(tidyr)
admit <- read.csv("https://remiller1450.github.io/data/college_adm.csv")
bluechips <- read.csv("https://remiller1450.github.io/data/bluechips.csv")
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
admit – 2018-2020 admission data for three selective
collegesbluechip – Closing prices on the first day of trading
from 2010-2021 for four stockspolls – Select polling data leading up to the 2016 US
Presidential Elections scraped from RealClearPolitics.comMain functions we introduce are:
pivot_wider() – Function to transform long data to
widepivot_longer() – Transform wide data to longseparate() – Used to expand a single column into
multiple columns based on a delimiterThough we will not cover this topic directly in this lab, the
tidyr and dplyr package both subscribe to a
framework of “tidy selection”, a collection of functions used to
identify and select names of columns. See
?tidyr_tidy_select if you are curious.
The first step in identifying if a dataset is in long format is to ask yourself two questions:
For example, suppose we consider the college admission data from
above and decide that the individual college will serve as our unique
entity. In wide form, each college should appear in at most one row.
Once we have determined which column contains our unique entity, we need
to specify which columns will become new variables (typically,
categorical) and what values those variables will take. We’ll illustrate
with an example; note that we can pass in the variable names directly
(e.g., College and Year) just as we do when
using dplyr functions.
admit_wide <- pivot_wider(data = admit,
id_cols = College,
names_from = Year,
values_from = Adm_Rate)
admit_wide
## # A tibble: 3 × 4
## College `2018` `2019` `2020`
## <chr> <dbl> <dbl> <dbl>
## 1 Grinnell 28.9 24.4 23.1
## 2 Carlton 21.2 19.8 19.1
## 3 Oberlin 33.7 36.2 36.4
From this transformation, we see:
id_cols specifies our single unique entity. These are
now each on one rownames_from defines the columns from our “long” data
that we want to cast wide. Almost always, this will be a categorical
variablevalues_from indicates which values the column in
names_to should take. In this case, each Year
was subsequently associated with its Adm_RatePivoting from wide to long works similarly, though in reverse order. We need to decide:
For example, if I wanted to cast our wide dataset above into something long, I might use the following call:
## I don't assign this to any value so it prints out by default
pivot_longer(data = admit_wide,
cols = c("2018", "2019", "2020"), # which variables become values of new var
names_to = "Year", # name of new var
values_to = "Adm_Rate")
## # A tibble: 9 × 3
## College Year Adm_Rate
## <chr> <chr> <dbl>
## 1 Grinnell 2018 28.9
## 2 Grinnell 2019 24.4
## 3 Grinnell 2020 23.1
## 4 Carlton 2018 21.2
## 5 Carlton 2019 19.8
## 6 Carlton 2020 19.1
## 7 Oberlin 2018 33.7
## 8 Oberlin 2019 36.2
## 9 Oberlin 2020 36.4
This last example illustrates a case in which “tidy selection” can
make this process easier. If, for example, we wanted to collapse down
hundreds of rows, we wouldn’t want to type each out individually.
Instead, looking through ?tidyr_tidy_select can sometimes
hint at useful tools. In this case
## Alternate method to select variables
pivot_longer(data = admit_wide,
cols = starts_with("20"),
names_to = "Year",
values_to = "Adm_Rate")
## # A tibble: 9 × 3
## College Year Adm_Rate
## <chr> <chr> <dbl>
## 1 Grinnell 2018 28.9
## 2 Grinnell 2019 24.4
## 3 Grinnell 2020 23.1
## 4 Carlton 2018 21.2
## 5 Carlton 2019 19.8
## 6 Carlton 2020 19.1
## 7 Oberlin 2018 33.7
## 8 Oberlin 2019 36.2
## 9 Oberlin 2020 36.4
Typically, using a tidy selection function will be preferred as there are fewer possibilities for a typo and it is robust to changes in your data (e.g., what if somebody sent you an updated dataset that also included 2021-2025. Would you want to go make those changes everywhere in your code?)
separate()Reshaping your data is typically only one step in the process of
preparing data for analysis. It’s not uncommon to create new variables
out of old with functions such as mutate(), but sometimes
the operation is much simpler, and all that is needed is to turn the one
column of a dataset into two. This most frequently appears as a
consequence of data entry. For example, consider the Date
variable in the polls data:
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
head(polls)
## Poll Date Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 - 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 - 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 - 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 - 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 - 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 - 7/11 837 RV NA 40 36
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
This variable in fact contains two distinct variables, the start and
end dates, which we may be interested in individually. Likewise, the
Sample column contains both the number of voters, as well
as the type of voters (LV = Likely Voter, RV = Registered voter). The
separate() function is used to break a single column into
multiple new columns. We do this by:
We illustrate with the Date variable here
separate(polls,
col = Date,
into = c("Begin", "End"),
sep = " - ")
## Poll Begin End Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 7/11 837 RV NA 40 36
## 7 McClatchy/Marist 7/5 7/9 1053 RV 3.0 40 35
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
## 7 10 5
The sep argument is new to us in that it accepts some
arguments we are likely not familiar with. The most basic arguments are
simply character strings, wrapped in quotes, indicating what the
delimiter should be. Some delimiters are special characters
that handled uniquely in R. Additionally it takes as arguments
regular expressions. We will learn about these in a future lab
as well.
Lastly
This question will use the bluechips dataset
bluechips <- read.csv("https://remiller1450.github.io/data/bluechips.csv")
print(bluechips)
## Year AAPL KO JNJ AXP
## 1 2010 7.6432 28.52 64.68 40.92
## 2 2011 11.7704 32.61 62.82 43.40
## 3 2012 14.6868 35.07 65.88 48.39
## 4 2013 19.6082 37.60 70.84 58.75
## 5 2014 19.7546 40.66 91.03 89.45
## 6 2015 27.3325 42.14 104.52 93.02
## 7 2016 26.3375 42.40 100.48 67.59
## 8 2017 29.0375 41.80 115.84 75.35
## 9 2018 43.0650 45.54 139.23 98.94
## 10 2019 35.5475 46.64 125.72 93.43
## 11 2020 74.3575 54.69 144.28 124.60
## 12 2021 129.4100 52.76 156.50 118.04
## 13 2022 182.0100 59.30 171.54 168.21
Part A: Convert the bluechips data to a
long format where each stock’s closing price on the first trading day of
each year is recorded in a single column named “Price”.
Part B:: Starting with the long format data frame
you created in Part A, recreate the original bluechips data
set using pivot_wider().
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
Part A: Using the polls data, pivot
either wide or long to create a new data.frame where each row is
associated with a single poll and presidential candidate. The resulting
data.frame should have 28 rows and 6 columns
Part B: Using the separate() function
and the data.frame from Part A, split the candidate column into two
separate columns, one for each their name and party affiliation. Note
that . is considered a special character in R; to use it as
separator, it must be “escaped” using two slashes (\\.). If
you have multiple special characters in succession, each must be escaped
individually.
The “airlines” data set (loaded below) contains data used in the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.
airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv")
head(airlines)
## airline avail_seat_km_per_week incidents.1985_1999
## 1 Aer Lingus 320906734 2
## 2 Aeroflot* 1197672318 76
## 3 Aerolineas Argentinas 385803648 6
## 4 Aeromexico* 596871813 3
## 5 Air Canada 1865253802 2
## 6 Air France 3004002661 14
## fatal_accidents.1985_1999 fatalities.1985_1999 incidents.2000_2014
## 1 0 0 0
## 2 14 128 6
## 3 0 0 1
## 4 1 64 5
## 5 0 0 2
## 6 4 79 6
## fatal_accidents.2000_2014 fatalities.2000_2014
## 1 0 0
## 2 1 88
## 3 0 0
## 4 0 0
## 5 0 0
## 6 2 337
Follow the steps below, investigating between each step that your resulting dataset makes sense. (Note the available seat km per week is a constructed measure of how frequently individual airlines fly)
airline dataset so that the last 6 columns
are collapsed into a new variable called accidents taking
values under the new variable, counttype and year. Recall that a period is a
special character.type of accident is
represented in its own column. This should give you a dataset with 112
observations with 6 variablesavail_seat_km_per_week, fatal_accidents, and
years. Using a log scale transform on the x-axis containing
avail_set_km_per_week. Based on the resulting plot, does it
seem that the relationship between seat kilometers and fatal accidents
is the same in both time periods? What would you conclude?The iris dataset built-in to R is a
collection of measurements (in cm) of the sepal and petal dimensions of
50 different flowers coming from three different species of iris
Your goal is to recreate the following graphic using the
tidyr functions covered in this lab
## Load data
data(iris)