dplyr
This lab focuses on manipulating, cleaning, and preparing data for visualization (and other analyses) using packages from the tidyverse suite.
Directions (Please read before starting)
\(~\)
This lab will primarily use the `dplyr
package, which is
used for “data wrangling”, or the process of cleaning, restructuring,
and enriching a data set to make it more usable.
# Please install and load the following packages
# install.packages("dplyr")
library(dplyr)
library(ggplot2)
The lab will use several data sets:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
colleges
data set
records attributes and outcomes for all primarily undergraduate
institutions in the United States with at least 400 full-time students
for the year 2019.bluechips <- read.csv("https://remiller1450.github.io/data/bluechips.csv")
As a data scientist, you should strive to write code that is:
Below is an example that is written in “base R” (that is, without any
external packages), that begins by taking the bluechips
data and creating a new data frame containing the average of each of the
four stock prices for each year for the years 2013, 2017, and 2021.
## Base R example
tmp1 <- subset(bluechips, Year %in% c(2013, 2017, 2021))
tmp2 <- data.frame(Year = tmp1$Year, Avg = (tmp1$AAPL + tmp1$KO + tmp1$JNJ + tmp1$AXP)/4)
tmp2
## Year Avg
## 1 2013 46.69955
## 2 2017 65.50687
## 3 2021 114.17750
We can greatly streamline our workflow in this example using a method
known as piping (%>%
), which we can
think of as a programming statement that is saying, “and then”, along
with a number of action/verb oriented functions that self-describe what
they are doing
## Good example
bluechips %>%
filter(Year %in% c(2013, 2017, 2021)) %>% # Subset to include the target years
mutate(Avg = (AAPL + KO + JNJ + AXP)/4) %>% # Calculate average
select(Year, Avg) # Drop everything but year and average
## Year Avg
## 1 2013 46.69955
## 2 2017 65.50687
## 3 2021 114.17750
The %>%
symbol will “pipe” the output of a preceding
function into the first argument of the subsequent function (usually as
the “data” argument).
Below is a description of each line within the piping example given above:
bluechips
is piped forward (into the
filter()
function on the next line)filter()
subsets the data it receives to include only
the target years and the resulting subset is piped forward (into the
mutate()
function on the next line)mutate()
adds a new column called “Avg” to the data
frame it received, and the resulting data frame is piped forward (into
the select()
function on the next line)select()
function drops all variables other than “Year”
and “Avg”Because the output of this pipeline is not stored as an object, the final data frame is simply printed. If we planned on using data frame prepared by this pipeline in a future data visualization or model, we’d want to store it as its own object:
## Storing the manipulated data set
new_bluechips <- bluechips %>%
filter(Year %in% c(2013, 2017, 2021)) %>% # Include only target years
mutate(Avg = (AAPL + KO + JNJ + AXP)/4) %>% # Calculate average
select(Year, Avg) # Drop extra vars
Note: all functions in the tidyverse
suite of
packages are compatible with the %>%
operator, so we can
including pivoting steps in a pipeline.
\(~\)
At this point you will begin working with your partner. Please read through the text/examples and make sure you both understand before attempting to answer the embedded questions.
\(~\)
The dplyr
package contains a suite of functions designed
to make data manipulation easier. The package’s core functions can be
viewed as verbs:
Verb/Function | Meaning |
---|---|
filter |
pick specific observations (i.e. specific rows) |
arrange |
reorder the rows |
select |
pick variables by their names (i.e. specific columns) |
mutate |
add new derived columns to a data frame |
summarize |
aggregate many rows into a summary measure |
Importantly, these functions can be strung together using piping. But first, let’s see a few examples of how they work individually.
\(~\)
The filter()
function is nearly identical to the
subset()
function from base R. The only difference is that
you can provide multiple logical conditions as separate arguments with
commas (rather than building a single condition using
&
). You should also be aware of it because of its
compatibility with piping.
Example:
colleges %>% filter(State == "IA", ACT_median > 25)
## Name City State Enrollment Private Region Adm_Rate
## 1 Cornell College Mount Vernon IA 1022 Private Plains 0.6102
## 2 Drake University Des Moines IA 2952 Private Plains 0.6766
## 3 Grinnell College Grinnell IA 1683 Private Plains 0.2438
## 4 Luther College Decorah IA 1974 Private Plains 0.6257
## 5 University of Iowa Iowa City IA 23410 Public Plains 0.8267
## ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1 27 23 23 55817 16457 68832 0.4674157
## 2 27 24 24 53507 21160 85563 0.6097561
## 3 32 30 30 65814 20369 101979 0.5348837
## 4 26 23 23 54045 16779 67833 0.5242718
## 5 26 23 23 22607 14547 91440 0.5623043
## PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.8856 0.0456 0.0595 0.0209 0.6776860
## 2 0.9025 0.0370 0.0407 0.0250 0.6641791
## 3 0.7933 0.0971 0.0841 0.0400 0.7413793
## 4 0.9335 0.0223 0.0282 0.0172 0.6457399
## 5 0.9158 0.0303 0.0377 0.0224 0.6086310
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.7310345 22130 43000
## 2 0.7139175 19197 58300
## 3 0.8372093 15000 49100
## 4 0.7553191 25250 47400
## 5 0.6572554 16173 51900
\(~\)
The arange()
function sorts the rows of your data by one
or more numeric variables:
colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
arrange(ACT_median)
## Name City State Enrollment Private Region Adm_Rate
## 1 Luther College Decorah IA 1974 Private Plains 0.6257
## 2 University of Iowa Iowa City IA 23410 Public Plains 0.8267
## 3 Cornell College Mount Vernon IA 1022 Private Plains 0.6102
## 4 Drake University Des Moines IA 2952 Private Plains 0.6766
## 5 Grinnell College Grinnell IA 1683 Private Plains 0.2438
## ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1 26 23 23 54045 16779 67833 0.5242718
## 2 26 23 23 22607 14547 91440 0.5623043
## 3 27 23 23 55817 16457 68832 0.4674157
## 4 27 24 24 53507 21160 85563 0.6097561
## 5 32 30 30 65814 20369 101979 0.5348837
## PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.9335 0.0223 0.0282 0.0172 0.6457399
## 2 0.9158 0.0303 0.0377 0.0224 0.6086310
## 3 0.8856 0.0456 0.0595 0.0209 0.6776860
## 4 0.9025 0.0370 0.0407 0.0250 0.6641791
## 5 0.7933 0.0971 0.0841 0.0400 0.7413793
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.7553191 25250 47400
## 2 0.6572554 16173 51900
## 3 0.7310345 22130 43000
## 4 0.7139175 19197 58300
## 5 0.8372093 15000 49100
When sorting by multiple variables, the one listed first will be
given priority. Additionally, values can be arranged in descending order
via the desc()
function:
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
arrange(ACT_median, desc(Adm_Rate))
IA_selective
## Name City State Enrollment Private Region Adm_Rate
## 1 University of Iowa Iowa City IA 23410 Public Plains 0.8267
## 2 Luther College Decorah IA 1974 Private Plains 0.6257
## 3 Drake University Des Moines IA 2952 Private Plains 0.6766
## 4 Cornell College Mount Vernon IA 1022 Private Plains 0.6102
## 5 Grinnell College Grinnell IA 1683 Private Plains 0.2438
## ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1 26 23 23 22607 14547 91440 0.5623043
## 2 26 23 23 54045 16779 67833 0.5242718
## 3 27 24 24 53507 21160 85563 0.6097561
## 4 27 23 23 55817 16457 68832 0.4674157
## 5 32 30 30 65814 20369 101979 0.5348837
## PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.9158 0.0303 0.0377 0.0224 0.6086310
## 2 0.9335 0.0223 0.0282 0.0172 0.6457399
## 3 0.9025 0.0370 0.0407 0.0250 0.6641791
## 4 0.8856 0.0456 0.0595 0.0209 0.6776860
## 5 0.7933 0.0971 0.0841 0.0400 0.7413793
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.6572554 16173 51900
## 2 0.7553191 25250 47400
## 3 0.7139175 19197 58300
## 4 0.7310345 22130 43000
## 5 0.8372093 15000 49100
## Illustration of ordering priority, uncomment to run
# df <- data.frame(x = c(1,2,1,2,1,2),
# y = rep(c("b", "a"), each = 3),
# z = 1:6)
# df %>% arrange(x)
# df %>% arrange(x, y)
# df %>% arrange(x, y, desc(z))
Question #1: Filter the colleges
data
to include only private colleges in the Mid East and New England
regions. Then sort these schools according to the variable
“PercentFemale” such that the school with the largest share of female
students appears at the top of the list.
\(~\)
The select()
function is used to reduce the number of
variables in a data set:
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
select(Name, ACT_median, Cost, Net_Tuition)
IA_selective
## Name ACT_median Cost Net_Tuition
## 1 Cornell College 27 55817 16457
## 2 Drake University 27 53507 21160
## 3 Grinnell College 32 65814 20369
## 4 Luther College 26 54045 16779
## 5 University of Iowa 26 22607 14547
Sometimes you’ll want to keep most of your variables, dropping only a
few that are no longer necessary. To drop a variable using
select()
, you can place a -
character in front
of its name:
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
select(-State, -City) # Remove State and City
IA_selective
## Name Enrollment Private Region Adm_Rate ACT_median ACT_Q1
## 1 Cornell College 1022 Private Plains 0.6102 27 23
## 2 Drake University 2952 Private Plains 0.6766 27 24
## 3 Grinnell College 1683 Private Plains 0.2438 32 30
## 4 Luther College 1974 Private Plains 0.6257 26 23
## 5 University of Iowa 23410 Public Plains 0.8267 26 23
## ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale PercentWhite
## 1 23 55817 16457 68832 0.4674157 0.8856
## 2 24 53507 21160 85563 0.6097561 0.9025
## 3 30 65814 20369 101979 0.5348837 0.7933
## 4 23 54045 16779 67833 0.5242718 0.9335
## 5 23 22607 14547 91440 0.5623043 0.9158
## PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.0456 0.0595 0.0209 0.6776860
## 2 0.0370 0.0407 0.0250 0.6641791
## 3 0.0971 0.0841 0.0400 0.7413793
## 4 0.0223 0.0282 0.0172 0.6457399
## 5 0.0303 0.0377 0.0224 0.6086310
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.7310345 22130 43000
## 2 0.7139175 19197 58300
## 3 0.8372093 15000 49100
## 4 0.7553191 25250 47400
## 5 0.6572554 16173 51900
\(~\)
The mutate()
function is used to add a new column to
your data that is a function of one or more existing variables:
IA_selective <- colleges %>%
filter(State == "IA" & ACT_median > 25) %>%
mutate(Expected_Discount = (Cost - Net_Tuition)/Cost) %>%
select(Name, Cost, Net_Tuition, Expected_Discount)
IA_selective
## Name Cost Net_Tuition Expected_Discount
## 1 Cornell College 55817 16457 0.7051615
## 2 Drake University 53507 21160 0.6045377
## 3 Grinnell College 65814 20369 0.6905066
## 4 Luther College 54045 16779 0.6895365
## 5 University of Iowa 22607 14547 0.3565267
In the example shown above we add a new variable, “Expected_Discount”, that is a function of “Cost” and “New_Tuition”.
Question #2: Using the entire colleges
data set, create a new data frame containing only “Name”,
“State”, and a new variable named “Debt_Cost_Ratio” that describes each
college’s “Debt_median” relative to its expected cumulative cost of
attendance (as a ratio) under the assumption that a student enrolls for
4 years and “Cost” increases by 3% each year. Print this new data frame
as part of your answer. Hint: In year 1 the cumulative cost for
a student is Cost
, in year 2 the cumulative cost is
Cost + 1.03*Cost
, etc.,.
\(~\)
The summarize()
(or summarise()
) function
will calculate summary statistics that require an aggregation of rows.
For example:
colleges %>%
filter(State == "IA") %>%
summarize(Median_Cost = median(Cost))
## Median_Cost
## 1 43520
Without group-wise manipulation (explained in the next section),
summarize()
is most useful for generating a customized set
of summary measures:
colleges %>%
filter(State == "IA") %>%
summarize(Min_Cost = min(Cost),
TenPer_Cost = quantile(Cost, 0.1), ## 10th percentile
Median_Cost = median(Cost),
NinetyPer_Cost = quantile(Cost, 0.9), ## 90th percentile
Max_Cost = max(Cost))
## Min_Cost TenPer_Cost Median_Cost NinetyPer_Cost Max_Cost
## 1 20476 22368.2 43520 54256.2 65814
Question #3: Using the summarize()
function, report the interquartile range (IQR) and standard deviation of
the variable “Debt_Cost_Ratio” that you created in Question #2.
Hint: recall that the IQR is calculated as the 75th percentile
minus the 25th percentile.
\(~\)
Frequently, we’d like to manipulate data separately within certain groups. For example, you might want separate numeric summaries describing the colleges in different states, or maybe even for the private and public schools within each of those different states.
Groupwise manipulations require two steps:
group_by()
function is used to create an
internal tag defining the desired groupings.dplyr
functions (usually summarize()
or
mutate()
), and those functions are executed separately on
each group.Shown below are a few examples.
In Example #1, we find the median cost for colleges within each state located in the “Plains” region:
## Example #1
colleges %>%
filter(Region == "Plains") %>%
group_by(State) %>%
summarize(Median_Cost = median(Cost, na.rm = TRUE))
## # A tibble: 7 × 2
## State Median_Cost
## <chr> <dbl>
## 1 IA 43520
## 2 KS 38832
## 3 MN 35887
## 4 MO 30279
## 5 ND 19299
## 6 NE 29258.
## 7 SD 22609
In Example #2, we find each state’s median cost separately for
private and public colleges located in Iowa, Minnesota, or Missouri. We
also introduce the n()
function that, when used with
grouped data, will report the total number of observations in each
group. Here, we use it to record “N”, the number of colleges belonging
to each group reported in the summary.,
We also use n()
to count the number of colleges,
recorded as “N”, belonging to each group reported in the summary.
## Example #2
colleges %>%
filter(State == "IA" | State == "MN" | State == "MO") %>%
group_by(State, Private) %>%
summarize(Median_Cost = median(Cost, na.rm = TRUE),
N = n())
## # A tibble: 6 × 4
## # Groups: State [3]
## State Private Median_Cost N
## <chr> <chr> <dbl> <int>
## 1 IA Private 44206 25
## 2 IA Public 21295 4
## 3 MN Private 48860 24
## 4 MN Public 21416 12
## 5 MO Private 37788. 32
## 6 MO Public 19346 13
In Example #3, we find how much each state (either IA or MN) deviates from the average cost of all colleges within the same state.
## Example #3
colleges %>%
filter(State == "IA" | State == "MN" ) %>%
group_by(State) %>%
mutate(Cost_Avg = mean(Cost, na.rm = TRUE),
Cost_vs_Avg = Cost - mean(Cost, na.rm = TRUE)) %>%
select(Name, State, Cost, Cost_Avg, Cost_vs_Avg)
## # A tibble: 65 × 5
## # Groups: State [2]
## Name State Cost Cost_Avg Cost_vs_Avg
## <chr> <chr> <int> <dbl> <dbl>
## 1 Augsburg University MN 51251 37714. 13537.
## 2 Bemidji State University MN 20379 37714. -17335.
## 3 Bethany Lutheran College MN 36961 37714. -753.
## 4 Bethel University MN 50325 37714. 12611.
## 5 Briar Cliff University IA 42423 41987. 436.
## 6 Buena Vista University IA 45332 41987. 3345.
## 7 Capella University MN 20152 37714. -17562.
## 8 Carleton College MN 68835 37714. 31121.
## 9 Central College IA 50547 41987. 8560.
## 10 Clarke University IA 44891 41987. 2904.
## # ℹ 55 more rows
# colleges %>%
# filter(State == "IA" | State == "MN" ) %>%
# group_by(State) %>%
# mutate(Cost_Avg = mean(Cost, na.rm = TRUE),
# Cost_vs_Avg = Cost - Cost_Avg) %>% # We can also use other variables defined in mutate
# select(Name, State, Cost, Cost_Avg, Cost_vs_Avg)
Notice how summarize()
returns an object with 1 row per
group, while mutate()
returns an object with 1 row per
observation.
\(~\)
Intensive care units, or ICUs, are primary spaces in hospitals that are reserved for patients in critical condition. The data below is a random sample of n = 200 ICU patients from a research hospital affiliated with Carnegie Mellon University (CMU).
icu <- read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")
The data dictionary below documents each variable that was recorded:
Question #4: Filter the ICU data to include only
patients whose visit involves an infection. Then, for the “Age”
variable, find the mean, standard deviation, and group size (found using
the function n()
) of patients with and without a previous
admission in the prior 6 months. That is, your solution should indicate
the total number of patients with and without previous admission, along
with each group’s mean age and standard deviation.
Question #5: Considering all ICU patients in these
data, use the group_by()
and mutate()
functions to a sex-specific Z-score for the variable “HeartRate” of each
patient. Note: you should be using different means and standard
deviations within each sex to calculate this Z-score. If you are
unfamiliar with Z-scores, they take the form: \(Z = \tfrac{\text{Value} - \text{Mean}}{\text{Std.
Dev}}\)