Introduction

This lab is intended to help introduce you to the tidyr package, one of several tools we have available for manipulating, cleaning, and preparing our data according to “tidy principles”

Motivation

This exercise will be done in class. For completeness, it is included here.

Shown below are the acceptance rates of three different liberal arts colleges from 2018 to 2020:

Question 0: Use the plot above to answer the following:

  • What variables are included in this plot? What are the associated aesthetics?
  • What layers have been added to the plot?
  • How many individual values does this dataset appear to have?

Based on the answers you gave above, create a sketch of what the associated dataset likely looks like in R.

Definitions and Setup

These next few sections will highlight a few relevant definitions for creating and working with tidy data

Tidy Data

We saw previously that data we will be primarily working with in this course will be tidy data. Tidy data is data that subscribes to the following properties:

  1. Each row is an observation
  2. Each column is a variable
  3. Columns and rows combine to make a tabular array
  4. Each cell of the table corresponds to a value

As we will see, what exactly constitutes an “observation” or a “variable” will depend on context, and a given data frame may be oriented in a number of different ways. This will depend on whether data is stored in wide format or long format.

  • Data that is in a wide format is dat such that each row constitutes a single unique entity. Necessarily this is a bit ambiguous, but for now let’s entertain the idea that is a unique item and every attribute that can be specified about this item is included as a variable. In this case, our observation is the single entity, and the variables are each thing that can possibly be said about it

  • Data that is in a long format differs in that our single unique entity can be repeated in multiple rows. This happens frequently when we have repeated observations of a single entity. In this case, an observation is a specific instance of our single unique entity, and the variables are each thing that can be said about that specific instance.

Consider this illustration below, where we have an employee ID (signifying a single unique entity) who has been given both a pre-test and a post-test, scored between 0-100:

On the left hand side, we see data in wide format. Each single unique entity (employee ID) has a single row, and the variables each indicate all of the tests that they took. The values in the cells represent their scores

On the right hand side, we see data in long format. Here, each single unique entity is repeated, according to the number of observations. The variables in this case are now entirely different: the first indicates which type of test was taken (pre/post), and the third variable accounts for their respective scores.

The goal of the present lab is to illustrate how to identify wide/long data and convert it as necessary into whichever form is needed.

Setup

This lab will require both ggplot2 (already installed) and tidyr (likely not installed). Copy install.packages("tidyr") into your console to install. Then, load the required packages and data by copying the R chunk below into your own lab

library(ggplot2)
library(tidyr)

admit <- read.csv("https://remiller1450.github.io/data/college_adm.csv")
bluechips <- read.csv("https://remiller1450.github.io/data/bluechips.csv")
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
  • admit – 2018-2020 admission data for three selective colleges
  • bluechip – Closing prices on the first day of trading from 2010-2021 for four stocks
  • polls – Select polling data leading up to the 2016 US Presidential Elections scraped from RealClearPolitics.com

Lab

Main functions we introduce are:

  • pivot_wider() – Function to transform long data to wide
  • pivot_longer() – Transform wide data to long
  • separate() – Used to expand a single column into multiple columns based on a delimiter

Though we will not cover this topic directly in this lab, the tidyr and dplyr package both subscribe to a framework of “tidy selection”, a collection of functions used to identify and select names of columns. See ?tidyr_tidy_select if you are curious.

Pivoting long to wide

The first step in identifying if a dataset is in long format is to ask yourself two questions:

  • What is considered my “single unique entity”
  • Are entries for this entry repeated in my dataset

For example, suppose we consider the college admission data from above and decide that the individual college will serve as our unique entity. In wide form, each college should appear in at most one row. Once we have determined which column contains our unique entity, we need to specify which columns will become new variables (typically, categorical) and what values those variables will take. We’ll illustrate with an example; note that we can pass in the variable names directly (e.g., College and Year) just as we do when using dplyr functions.

admit_wide <- pivot_wider(data = admit, 
                          id_cols = College, 
                          names_from = Year, 
                          values_from = Adm_Rate)
admit_wide
## # A tibble: 3 × 4
##   College  `2018` `2019` `2020`
##   <chr>     <dbl>  <dbl>  <dbl>
## 1 Grinnell   28.9   24.4   23.1
## 2 Carlton    21.2   19.8   19.1
## 3 Oberlin    33.7   36.2   36.4

From this transformation, we see:

  • id_cols specifies our single unique entity. These are now each on one row
  • names_from defines the columns from our “long” data that we want to cast wide. Almost always, this will be a categorical variable
  • values_from indicates which values the column in names_to should take. In this case, each Year was subsequently associated with its Adm_Rate

Pivoting wide to long

Pivoting from wide to long works similarly, though in reverse order. We need to decide:

  • Which collection variables we want to “lengthen” (typically, these are variables that are similar in kind)
  • The name of the new variable that should take as its values the names of the old variables
  • The name of the new variable that should take as its values the values of the old variables

For example, if I wanted to cast our wide dataset above into something long, I might use the following call:

## I don't assign this to any value so it prints out by default
pivot_longer(data = admit_wide, 
             cols = c("2018", "2019", "2020"), # which variables become values of new var
             names_to = "Year", # name of new var
             values_to = "Adm_Rate")
## # A tibble: 9 × 3
##   College  Year  Adm_Rate
##   <chr>    <chr>    <dbl>
## 1 Grinnell 2018      28.9
## 2 Grinnell 2019      24.4
## 3 Grinnell 2020      23.1
## 4 Carlton  2018      21.2
## 5 Carlton  2019      19.8
## 6 Carlton  2020      19.1
## 7 Oberlin  2018      33.7
## 8 Oberlin  2019      36.2
## 9 Oberlin  2020      36.4

This last example illustrates a case in which “tidy selection” can make this process easier. If, for example, we wanted to collapse down hundreds of rows, we wouldn’t want to type each out individually. Instead, looking through ?tidyr_tidy_select can sometimes hint at useful tools. In this case

## Alternate method to select variables
pivot_longer(data = admit_wide, 
             cols = starts_with("20"), 
             names_to = "Year", 
             values_to = "Adm_Rate")
## # A tibble: 9 × 3
##   College  Year  Adm_Rate
##   <chr>    <chr>    <dbl>
## 1 Grinnell 2018      28.9
## 2 Grinnell 2019      24.4
## 3 Grinnell 2020      23.1
## 4 Carlton  2018      21.2
## 5 Carlton  2019      19.8
## 6 Carlton  2020      19.1
## 7 Oberlin  2018      33.7
## 8 Oberlin  2019      36.2
## 9 Oberlin  2020      36.4

Typically, using a tidy selection function will be preferred as there are fewer possibilities for a typo and it is robust to changes in your data (e.g., what if somebody sent you an updated dataset that also included 2021-2025. Would you want to go make those changes everywhere in your code?)

Using separate()

Reshaping your data is typically only one step in the process of preparing data for analysis. It’s not uncommon to create new variables out of old with functions such as mutate(), but sometimes the operation is much simpler, and all that is needed is to turn the one column of a dataset into two. This most frequently appears as a consequence of data entry. For example, consider the Date variable in the polls data:

polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
head(polls)
##                     Poll        Date  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688 LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837 RV  NA          40        36
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2

This variable in fact contains two distinct variables, the start and end dates, which we may be interested in individually. Likewise, the Sample column contains both the number of voters, as well as the type of voters (LV = Likely Voter, RV = Registered voter). The separate() function is used to break a single column into multiple new columns. We do this by:

  1. Specifying the data and the column we wish to separate
  2. The new variables we want it to be separated into
  3. The delimiter or string that is used to identify split points

We illustrate with the Date variable here

separate(polls, 
         col = Date, 
         into = c("Begin", "End"), 
         sep = " - ")
##                     Poll Begin  End  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth  7/14 7/16  688 LV 3.7          45        43
## 2                CNN/ORC  7/13 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post  7/11 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov   7/9 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK   7/7 7/11  837 RV  NA          40        36
## 7       McClatchy/Marist   7/5  7/9 1053 RV 3.0          40        35
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
## 7          10         5

The sep argument is new to us in that it accepts some arguments we are likely not familiar with. The most basic arguments are simply character strings, wrapped in quotes, indicating what the delimiter should be. Some delimiters are special characters that handled uniquely in R. Additionally it takes as arguments regular expressions. We will learn about these in a future lab as well.

Lastly

Practice

Question 1

This question will use the bluechips dataset

bluechips <- read.csv("https://remiller1450.github.io/data/bluechips.csv")
print(bluechips)
##    Year     AAPL    KO    JNJ    AXP
## 1  2010   7.6432 28.52  64.68  40.92
## 2  2011  11.7704 32.61  62.82  43.40
## 3  2012  14.6868 35.07  65.88  48.39
## 4  2013  19.6082 37.60  70.84  58.75
## 5  2014  19.7546 40.66  91.03  89.45
## 6  2015  27.3325 42.14 104.52  93.02
## 7  2016  26.3375 42.40 100.48  67.59
## 8  2017  29.0375 41.80 115.84  75.35
## 9  2018  43.0650 45.54 139.23  98.94
## 10 2019  35.5475 46.64 125.72  93.43
## 11 2020  74.3575 54.69 144.28 124.60
## 12 2021 129.4100 52.76 156.50 118.04
## 13 2022 182.0100 59.30 171.54 168.21

Part A: Convert the bluechips data to a long format where each stock’s closing price on the first trading day of each year is recorded in a single column named “Price”.

Part B:: Starting with the long format data frame you created in Part A, recreate the original bluechips data set using pivot_wider().

Question 2

polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")

Part A: Using the polls data, pivot either wide or long to create a new data.frame where each row is associated with a single poll and presidential candidate. The resulting data.frame should have 28 rows and 6 columns

Part B: Using the separate() function and the data.frame from Part A, split the candidate column into two separate columns, one for each their name and party affiliation. Note that . is considered a special character in R; to use it as separator, it must be “escaped” using two slashes (\\.). If you have multiple special characters in succession, each must be escaped individually.

Question 3

The “airlines” data set (loaded below) contains data used in the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.

airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv")
head(airlines)
##                 airline avail_seat_km_per_week incidents.1985_1999
## 1            Aer Lingus              320906734                   2
## 2             Aeroflot*             1197672318                  76
## 3 Aerolineas Argentinas              385803648                   6
## 4           Aeromexico*              596871813                   3
## 5            Air Canada             1865253802                   2
## 6            Air France             3004002661                  14
##   fatal_accidents.1985_1999 fatalities.1985_1999 incidents.2000_2014
## 1                         0                    0                   0
## 2                        14                  128                   6
## 3                         0                    0                   1
## 4                         1                   64                   5
## 5                         0                    0                   2
## 6                         4                   79                   6
##   fatal_accidents.2000_2014 fatalities.2000_2014
## 1                         0                    0
## 2                         1                   88
## 3                         0                    0
## 4                         0                    0
## 5                         0                    0
## 6                         2                  337

Follow the steps below, investigating between each step that your resulting dataset makes sense. (Note the available seat km per week is a constructed measure of how frequently individual airlines fly)

  1. Pivot the airline dataset so that the last 6 columns are collapsed into a new variable called accidents taking values under the new variable, count
  2. Separate the accidents column into two variables named type and year. Recall that a period is a special character.
  3. Widen the dataset so that each type of accident is represented in its own column. This should give you a dataset with 112 observations with 6 variables
  4. Create a scatter plot displaying the relationship between avail_seat_km_per_week, fatal_accidents, and years. Using a log scale transform on the x-axis containing avail_set_km_per_week. Based on the resulting plot, does it seem that the relationship between seat kilometers and fatal accidents is the same in both time periods? What would you conclude?

Question 4

The iris dataset built-in to R is a collection of measurements (in cm) of the sepal and petal dimensions of 50 different flowers coming from three different species of iris

Your goal is to recreate the following graphic using the tidyr functions covered in this lab

## Load data
data(iris)