This lab focuses on manipulating, cleaning, and preparing data for visualization (or other analyses) using packages from the tidyverse suite.

Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Preamble

Motivation

Show below are the acceptance rates of three different liberal arts colleges from 2018 to 2020:

As a data scientist, you must be able to connect the format of a data set with the information presented in a graph.

  1. If creating an excel spreadsheet, how do you think most people would record these data?
  2. What format of these data is most accessible to the ggplot() function?

\(~\)

Long vs. Wide Data

Data in a wide format record many different values or variables for a single entity in a single row (ie: acceptance rates for a college in different years, or different test scores for the same employee).

Data in a long format use multiple rows and a single column for the outcome or value of interest (ie: acceptance rate, test score, etc.) with additional columns identifying the meaning of that value.

Example:

The ggplot2 package, as well as the implementations of many statistical models, expect data in long format.

\(~\)

Tidy Data

Converting between “wide” and “long” formats is often the most challenging step in creating a “tidy” data set, or one that is fully prepared for graphing/modeling.

In general, tidy data are defined by the following criteria:

  1. Every column is a variable
  2. Every row is an observation
  3. Every cell is a single value

This lab will introduce several data manipulation functions used to help obtain tidy data.

\(~\)

Packages and Datasets

This lab primarily uses tidyr package, which is used to “tidy” or reshape data into a more usable format. It will also use the ggplot2 package.

# Please install and load the following packages
# install.packages("tidyr")
library(tidyr)
library(ggplot2)

The lab will use several data sets:

collegeAdm <- read.csv("https://remiller1450.github.io/data/college_adm.csv")
  • Description: Admissions rates of three Midwestern liberal arts colleges according to acceptancerate.com
bluechips <- read.csv("https://remiller1450.github.io/data/bluechips.csv")
  • Description: Closing prices on the first trading day of the year from 2010 to 2021 for four stocks that The Motley Fool calls “blue chip” investments.
polls <- read.csv("https://remiller1450.github.io/data/polls2016.csv")
  • Description: Polling data leading up to the 2016 US Presidential Elections scraped from RealClearPolitics.com

\(~\)

Lab

At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.

\(~\)

Pivoting between long and wide formats

Consider the collegeAdm data frame:

## Look at all 9 observations
head(collegeAdm, n = 9)
##   Adm_Rate Year  College
## 1     28.9 2018 Grinnell
## 2     24.4 2019 Grinnell
## 3     23.1 2020 Grinnell
## 4     21.2 2018  Carlton
## 5     19.8 2019  Carlton
## 6     19.1 2020  Carlton
## 7     33.7 2018  Oberlin
## 8     36.2 2019  Oberlin
## 9     36.4 2020  Oberlin

This data is currently in “long” format, meaning that each row is associated with a single observation (year, college, rate). We can convert our college admit data to a “wide” format using the pivot_wider() function:

## Pivot from long to wide to get 1 row per Year
wideCollegeAdm <- pivot_wider(collegeAdm, 
                              id_cols = Year,
                              names_from = College, 
                              values_from = Adm_Rate)
head(wideCollegeAdm)
## # A tibble: 3 × 4
##    Year Grinnell Carlton Oberlin
##   <int>    <dbl>   <dbl>   <dbl>
## 1  2018     28.9    21.2    33.7
## 2  2019     24.4    19.8    36.2
## 3  2020     23.1    19.1    36.4

The following arguments guide the transformation:

  • id_cols defines the rows of the “wide” data (i.e., each row will be a unique value of “Year”)
  • names_from defines the single column from the “long” data that should be spread into multiple distinct columns in the “wide” data (i.e., each “College” should get a column)
  • values_from defines the single column from the “long” data containing the values used to populate the cells of the “wide” data (i.e., the columns created for each “College” should contain values of “Adm_Rate”)

Notice what happens when id_cols and names_from are swapped:

## Pivot from long to wide to get 1 row per College
wideCollegeAdm2 <- pivot_wider(collegeAdm, 
                              id_cols = College,
                              names_from = Year, 
                              values_from = Adm_Rate)
head(wideCollegeAdm2)
## # A tibble: 3 × 4
##   College  `2018` `2019` `2020`
##   <chr>     <dbl>  <dbl>  <dbl>
## 1 Grinnell   28.9   24.4   23.1
## 2 Carlton    21.2   19.8   19.1
## 3 Oberlin    33.7   36.2   36.4

Similarly, the pivot_long() function will transform “wide” data into “long” data:

pivot_longer(wideCollegeAdm2, 
             cols = c("2018", "2019", "2020"), 
             names_to = "Year", 
             values_to = "Adm_Rate")
## # A tibble: 9 × 3
##   College  Year  Adm_Rate
##   <chr>    <chr>    <dbl>
## 1 Grinnell 2018      28.9
## 2 Grinnell 2019      24.4
## 3 Grinnell 2020      23.1
## 4 Carlton  2018      21.2
## 5 Carlton  2019      19.8
## 6 Carlton  2020      19.1
## 7 Oberlin  2018      33.7
## 8 Oberlin  2019      36.2
## 9 Oberlin  2020      36.4
  • cols defines the column(s) used in the pivot, here this is each of the years variables that will be combined into a single variable
  • names_to is the name of the single column in the “long” data frame that will store the column names of the “wide” data frame
  • values_to is the name of the single column in the “long” data frame that will store the values from the cells of the “wide” data frame

It is worth noting that there a few alternative ways to specify the argument cols. The first is a syntax that is unique to the “tidyverse” and takes the form !variable to specify that we want to include all column names except variable. In the case of our admit date, it would look like !College to indicate that we want all column names except college

## Note: for more than one column, you can use !C(col1, col2, ...)
pivot_longer(wideCollegeAdm2, 
             cols = !College, 
             names_to = "Year", 
             values_to = "Adm_Rate")
## # A tibble: 9 × 3
##   College  Year  Adm_Rate
##   <chr>    <chr>    <dbl>
## 1 Grinnell 2018      28.9
## 2 Grinnell 2019      24.4
## 3 Grinnell 2020      23.1
## 4 Carlton  2018      21.2
## 5 Carlton  2019      19.8
## 6 Carlton  2020      19.1
## 7 Oberlin  2018      33.7
## 8 Oberlin  2019      36.2
## 9 Oberlin  2020      36.4

Alternatively, you can use the names() function, which returns all of the column names of a data frame, along with the setdiff() function, which returns the difference between two sets. So, for example, if we wanted all of the column names except for college, we could express this as follows:

names(wideCollegeAdm2)
## [1] "College" "2018"    "2019"    "2020"
setdiff(names(wideCollegeAdm2), "College")
## [1] "2018" "2019" "2020"
pivot_longer(wideCollegeAdm2, 
             cols = setdiff(names(wideCollegeAdm2), "College"), 
             names_to = "Year", 
             values_to = "Adm_Rate")
## # A tibble: 9 × 3
##   College  Year  Adm_Rate
##   <chr>    <chr>    <dbl>
## 1 Grinnell 2018      28.9
## 2 Grinnell 2019      24.4
## 3 Grinnell 2020      23.1
## 4 Carlton  2018      21.2
## 5 Carlton  2019      19.8
## 6 Carlton  2020      19.1
## 7 Oberlin  2018      33.7
## 8 Oberlin  2019      36.2
## 9 Oberlin  2020      36.4

Ultimately, you are welcome to use whichever you find most comfortable or easiest to remember.

Question 1: Convert the bluechips data to a long format where each stock’s closing price on the first trading day of each year is recorded in a single column named “Price”.

Question 2: Starting with the long format data frame you created in Question #1, recreate the original bluechips data set using pivot_wider().

\(~\)

Other tidyr functions

Reshaping is often only one step in the process of tidying a data set. Sometimes, data will contain multiple variables in a single column. For example, consider the polls data set:

head(polls)
##                     Poll        Date  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688 LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837 RV  NA          40        36
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2

The column Date contains two distinct variables, the start and end of the poll’s sampling period. Similarly, Sample also contains two variables, the number of participants in the poll and the population that was sampled (registered voters or likely voters).

The separate() function is used to break a character column into multiple new columns:

## Example #1
tidy_polls <- separate(polls, 
                       col = Date, 
                       into = c("Begin", "End"),
                       sep = " - ")
head(tidy_polls)
##                     Poll Begin  End  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth  7/14 7/16  688 LV 3.7          45        43
## 2                CNN/ORC  7/13 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post  7/11 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov   7/9 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK   7/7 7/11  837 RV  NA          40        36
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
  • col is the column to be separated
  • into indicates the names of the new columns that should result from the separation
  • sep is the character string used to identify splitting points (in this example it’s - surrounded by a space on each side)
## Example #2
tidy_polls <- separate(polls, 
                       col = Sample, 
                       into = c("Size", "Population"))
head(tidy_polls)
##                     Poll        Date Size Population MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688         LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872         RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816         RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000         RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932         RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837         RV  NA          40        36
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2

In Example #2, the “sep” argument is not explicitly provided. In this situation, separate() will attempt to guess an appropriate separator. While this is not generally recommended, it works well for simple character strings. More complex strings might require the use of regular expressions, a topic we’ll cover later this semester.

Question 3 (Part A): Using either the pivot_longer() or pivot_wider() function, create a version of the tidy_polls data containing the variables Candidate and Percentage, where Candidate is taken from the names of the last four columns of the data frame, and “Percentage” is taken from the values contained in these columns. It should look like this:

tt <- pivot_longer(polls, cols = c("Clinton..D.", "Trump..R.", "Johnson..L.", "Stein..G."), names_to = "Candidate", values_to = "Percentage")
head(tt, n = 4)
## # A tibble: 4 × 6
##   Poll     Date        Sample   MoE Candidate   Percentage
##   <chr>    <chr>       <chr>  <dbl> <chr>            <int>
## 1 Monmouth 7/14 - 7/16 688 LV   3.7 Clinton..D.         45
## 2 Monmouth 7/14 - 7/16 688 LV   3.7 Trump..R.           43
## 3 Monmouth 7/14 - 7/16 688 LV   3.7 Johnson..L.          5
## 4 Monmouth 7/14 - 7/16 688 LV   3.7 Stein..G.            1

Question 3 (Part B): Using the separate() function, split the column “Candidate” (created in Part A) into two distinct columns containing the name of the candidate (ie: Clinton, Trump, etc.) and their political party (ie: D, R, etc.). Hint: periods, or ., are a special character in R, but you can reference one using the expression: \\.. The \\ is known as an “escape sequence” telling R to ignore the special character status of whatever follows. If you have multiple special characters in a row, you will need to be sure to “escape” each one of them

\(~\)

Practice

The “airlines” data set (loaded below) contains data used in the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.

airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv")
head(airlines)
##                 airline avail_seat_km_per_week incidents.1985_1999
## 1            Aer Lingus              320906734                   2
## 2             Aeroflot*             1197672318                  76
## 3 Aerolineas Argentinas              385803648                   6
## 4           Aeromexico*              596871813                   3
## 5            Air Canada             1865253802                   2
## 6            Air France             3004002661                  14
##   fatal_accidents.1985_1999 fatalities.1985_1999 incidents.2000_2014
## 1                         0                    0                   0
## 2                        14                  128                   6
## 3                         0                    0                   1
## 4                         1                   64                   5
## 5                         0                    0                   2
## 6                         4                   79                   6
##   fatal_accidents.2000_2014 fatalities.2000_2014
## 1                         0                    0
## 2                         1                   88
## 3                         0                    0
## 4                         0                    0
## 5                         0                    0
## 6                         2                  337

Recall that a “tidy” version of these data should satisfy the following:

  • Each row is a single airline in a specific time period (ie: Air Canada in 1985-1999 or Alaska Airlines in 2000-2014)
  • Each column contains only a single variable
  • Each cell contains only a single value

Question 4 (Part A): Use pivot_longer() to gather the last six columns of the airlines data into a column named accidents and a column named count.

Question 4 (Part B): Use separate() to split the accidents column into two variables named type and years. Hint: remember that the period is a special character in R.

Question 4 (Part C): Use pivot_wider() to spread out the type column into three new columns containing the type of accident. Your data should now contain two rows per airline (one for each time period), you can check if the first few rows match those printed below.

## # A tibble: 5 × 6
##   airline      avail_seat_km_per_week years incidents fatal_accidents fatalities
##   <chr>                         <dbl> <chr>     <int>           <int>      <int>
## 1 Aer Lingus                320906734 1985…         2               0          0
## 2 Aer Lingus                320906734 2000…         0               0          0
## 3 Aeroflot*                1197672318 1985…        76              14        128
## 4 Aeroflot*                1197672318 2000…         6               1         88
## 5 Aerolineas …              385803648 1985…         6               0          0

Question 4 (Part D): Using ggplot2, create a scatter plot displaying the relationship between avail_seat_km_per_week, fatal_accidents, and years

\(~\)

Question 5 The iris dataset builtin to R is a collection of measurements (in cm) of the sepal and petal dimensions of 50 different flowers coming from three different species of iris (I have verified that all of the values in this dataset are what they should be – no surprise numerics or factors!)

Your goal is to recreate the following graphic using the tidyr functions covered in this lab

## Load data
data(iris)