dplyr
This lab serves as our introduction to the dplyr (pronounced dee-ply-R) package in R, which provides a cohesive framework for working with and manipulating data. Let's begin by creating a new R code chunk and copying in the code below. In particular, note that we now need to include library(dplyr), along with library(ggplot2), at the top of our documents when we are ready to knit.
## Copy and paste the code chunk below into the top of your R Markdown document
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")
## For this lab, we don't need all of the columns so we will
# select only a few of them
college <- select(college, Name, State, Enrollment, Type,
                  Region, Adm_Rate, ACT_median, Cost, Net_Tuition,
                  Avg_Fac_Salary, Debt_median)
## Load data from ggplot package
data(mpg)
Data manipulation can mean a lot of things. For this lab, we will focus on the four main operations we will be using on standard, table-shaped datasets. These operations are organized around verbs, each of which corresponds to a function within the dplyr package. The verbs for today are filter, mutate, summarize, and group_by.
The dplyr package contains a suite of functions that accommodate these procedures, often chained together in a sequence of operations connected with our pipe operator, %>% (Ctrl+Shift+M on PC, Cmd+Shift+M on Mac). Recall that the pipe operator takes the left hand side and passes it into the function on the right hand side. This creates a "pipe" in which we can string together a sequence of operations.
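As a minimal illustration, taking the mean of a small vector (made up here just for illustration) can be written either way; the two calls below are equivalent:
## A small vector just for illustration
vals <- c(2, 4, 6, 8)
## Calling the function directly
mean(vals)
## [1] 5
## Passing the vector in with the pipe
vals %>% mean()
## [1] 5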
For example, if I wanted to find the average cost for public and private schools in Iowa, my code may look something like this:
college %>% filter(State %in% "IA") %>% # Filter by state
  group_by(Type) %>%                # Group by type
  summarize(meanCost = mean(Cost))  # Summarize the average cost
## # A tibble: 2 × 2
## Type meanCost
## <chr> <dbl>
## 1 Private 46645.
## 2 Public 21420
Here, we first filter our data to include only schools in Iowa, then we group by the Type variable so that when we summarize the data by finding the mean, we are doing so within each of the defined groups.
Along with some helper functions to facilitate these operations, the main functions we will be using from the dplyr package are as follows:

Verb/Function | Meaning |
---|---|
filter | pick specific observations (i.e. specific rows) |
mutate | add new columns to a data frame |
summarize | aggregate many rows into a summary measure |
group_by | perform subsequent operations separately on the groups created by a categorical variable |
Often when organizing or manipulating our data, we are doing so according to a set of conditions (i.e., grab only private schools or schools with enrollment larger than 2000 students). Rather than try to find and grab these observations directly, we instead specify these conditions as logical statements for R to process – as such, it doesn’t matter how large our dataset is or if the rows get shuffled around; so long as we correctly specify a set of logical conditions, we will always end up with our intended output. This need motivates our next section, a short exploration of the world of logical operators.
Before beginning with dplyr, it will be helpful to introduce the idea of logical operators. A few common logical operators are described in the table below:

Operator | Description |
---|---|
== | equal to |
!= | not equal to |
%in% | equal to any element in a given vector |
< and <= | less than (strict) or less than or equal to |
> and >= | greater than (strict) or greater than or equal to |
& | and |
\| | or |
! | not |
Though this seems like a lot, they don’t really have to be memorized; many of them will soon become intuitive and we will have many opportunities to practice.
Briefly, a logical operator is an operator (like +) that evaluates if a particular statement is TRUE or FALSE and returns those logical values accordingly. Consider a simple case using single scalar values:
## Using less than operator
3 < 4
## [1] TRUE
## Equality operator
3 == 4
## [1] FALSE
A common use of logical operators is to evaluate a vector against some condition, determining for which elements of a vector a logical statement is true or false. Consider a vector of length five, using logical operators to determine which values are greater than or equal to 3:
## Create a vector with values 1,2,3,4,5
x <- 1:5
x
## [1] 1 2 3 4 5
## Determine for which positions the value in the vector is greater than
## or equal to 3
x >= 3
## [1] FALSE FALSE TRUE TRUE TRUE
In this case, the operation returns a logical vector indicating that our statement is TRUE in the third, fourth, and fifth positions and FALSE everywhere else. Generally speaking, the logical vector returned will be the same length as the vector on the left hand side.
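We can verify this quickly with the length() function (a small check, using the same x as above):
## The comparison returns one TRUE/FALSE for each element of x
length(x)
## [1] 5
length(x >= 3)
## [1] 5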
A very common use of logical vectors is subsetting, whereby we retain only those elements of a vector for which a condition is true. We can do this by either assigning our logical values to a new vector, or by using them directly inside of the brackets, []:
## Create a vector of logical values
logic_vector <- x >= 3
## Use this to subset x
x[logic_vector]
## [1] 3 4 5
## Subset directly
x[x >= 3]
## [1] 3 4 5
There are two important ways in which we might want to modify our logical vectors. The first is to negate them, which involves swapping all of the TRUE/FALSE values.
x <- 1:5
## Where is x less than 4
x < 4
## [1] TRUE TRUE TRUE FALSE FALSE
## Where is x not less than 4
!(x < 4)
## [1] FALSE FALSE FALSE TRUE TRUE
Typically, we negate logical vectors with a combination of ! and ().
Sometimes we need to express logical statements that are more complicated than a single logical check. For example, consider a vector with the numbers 1-10 where we want to take all of the numbers between 3 and 7. Typing this directly will result in an error:
x <- 1:10
3 < x < 7
## Error: <text>:2:7: unexpected '<'
## 1: x <- 1:10
## 2: 3 < x <
## ^
What we need instead is to combine logical statements with a compound
operator like &
:
x <- 1:10
(3 < x) & (x < 7)
## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## We can use this to subset
x[(3 < x) & (x < 7)]
## [1] 4 5 6
More complicated logical operations like this are where the negation operator is most useful:
## First expression
(3 < x) & (x < 7)
## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## Negation of expression
!((3 < x) & (x < 7))
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## Use the negation to subset the opposite
x[!((3 < x) & (x < 7))]
## [1] 1 2 3 7 8 9 10
%in%
The last operator worth exploring in some more detail is the value matching operator, %in%. This operator always asks the question: "Which values on my left hand side (LHS) are also included on the right hand side (RHS)?" This distinction is important in specifying the order in which we include things:
## We see that the first and second position of LHS are also in RHS
c("a", "a", "b", "c") %in% c("a", "d", "r")
## [1] TRUE TRUE FALSE FALSE
## We see the first position on LHS is also in RHS
c("a", "d", "r") %in% c("a", "a", "b", "c")
## [1] TRUE FALSE FALSE
As we will shortly see, keeping this straight will be easier than you might think. One final thing to keep in mind when value matching is that we should never use ==, as it will check position-by-position for equality; this is a common source of error for new R users.
x <- c("kitty", "cat")
y <- c("cat", "kitty")
## Because positions don't align, these will both be false
x == y
## [1] FALSE FALSE
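By contrast, the %in% operator asks whether each value on the left appears anywhere on the right, so the positions do not need to align. Using the same x and y as above:
## Every value of x appears somewhere in y, regardless of position
x %in% y
## [1] TRUE TRUE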
Copy in the variables below and use logical operators as necessary to answer the questions. When you have finished, verify with somebody near you that your solutions look the same.
x <- 1:10
y <- 11:20
z <- seq(10, 100, by = 10)
char1 <- c("apple", "banana", "orange", "kiwi", "kiwi", "grape",
"mango", "necatrine", "apple", "tomato")
char2 <- c("grape", "starfruit", "apple", "grape")
Note: these problems will use indexing on the variable z to help verify correctness. For example, suppose I wanted to select values of x where x is equal to 2 or 4, and I tell you that when subsetting z it should return c(20, 40); what I mean is:
## Create index based on values of x
idx <- x %in% c(2, 4)
## When I subset z with idx, it returns 20 and 40
z[idx]
## [1] 20 40
The following problems are solved the same way:
1. Create a logical vector indicating where x is less than 7 and y is greater than or equal to 14. Verify correctness by using this logical vector to subset z: it should return the values c(40, 50, 60).
2. Create a logical vector indicating which of the values from char2 are not in char1.
3. Finally, create a logical vector where x is less than 3 OR where the values from char1 are included in char2. Use this vector to subset z. It should return c(10, 20, 60, 90).
filter
We now move on to investigating the dplyr functions. Each of the functions we will be working with today takes a data frame as its first argument. Additionally, each of the functions returns a data frame. Because everything will involve a data frame coming in and a data frame going out, we will make use of the pipe operator to chain our operations together. As the sequence of our operations becomes more sophisticated, we will find that using the pipe makes the entire sequence much easier to read and ultimately understand.
The first function we will cover is filter(), where we will now have a chance to make use of our logical operators from the previous section. filter() works by subsetting our data according to which observations (rows) satisfy a logical condition. For example, if I wanted to filter my entire college dataset to contain only schools with a median ACT greater than 29, I can do so with a logical operator like this:
## I need to save my operations to a new variable if I want to use it
act30 <- college %>% filter(ACT_median > 29)
## The head() function allows me to see the first 6 rows of my data
head(act30)
## Name State Enrollment Type Region Adm_Rate
## 1 Amherst College MA 1855 Private New England 0.1281
## 2 Bentley University MA 4177 Private New England 0.4321
## 3 Binghamton University NY 13990 Public Mid East 0.3981
## 4 Boston College MA 9639 Private New England 0.2789
## 5 Boston University MA 17238 Private New England 0.2209
## 6 Brandeis University MA 3627 Private New England 0.3115
## ACT_median Cost Net_Tuition Avg_Fac_Salary Debt_median
## 1 33 71300 27541 116775 11900
## 2 30 66180 30307 126936 23250
## 3 30 25846 8187 88011 16750
## 4 33 70588 34327 137907 16000
## 5 32 70216 33686 127224 22500
## 6 31 70848 24212 118584 24708
The filter function can take more than one expression if needed, either separated by commas or joined by the logical operator &. Here, we illustrate filtering our dataset to include only colleges with a median ACT of 30 or more that are located in Iowa (IA) or Minnesota (MN):
## REMEMBER: I need to assign this to a new variable name to use it again
act30_iamn <- college %>% filter(ACT_median > 29,
                                 State %in% c("IA", "MN"))
head(act30_iamn)
## Name State Enrollment Type Region Adm_Rate ACT_median Cost
## 1 Grinnell College IA 1683 Private Plains 0.2438 32 65814
## 2 Macalester College MN 2160 Private Plains 0.4124 31 66280
## Net_Tuition Avg_Fac_Salary Debt_median
## 1 20369 101979 15000
## 2 23225 95607 19432
Notice how the %in% operator is used here: State %in% c("IA", "MN") first considers the variable State and then matches it to each of the values in the vector c("IA", "MN").
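As a side note, separating conditions with a comma inside filter() is equivalent to joining them with &; a sketch of the same filter written with an explicit & is shown below and returns the same two schools:
## Equivalent to separating the two conditions with a comma
act30_iamn <- college %>% filter(ACT_median > 29 & State %in% c("IA", "MN"))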
Question 2: Filter the college dataset to include only schools located in the Plains and the Great Lakes regions and with enrollments less than 20,000. Using your filtered data, create a two-way table with the variables Region and Type. Based on this table, in which region do most private schools tend to be located?
mutate
The purpose of mutating is typically to create new variables out of old ones. For example, our college dataset contains variables for both cost and median debt, and it may be of interest to see, for a given school, the typical ratio of debt to cost:
## Use mutate to create new variable called debt_cost_ratio
debt_cost <- college %>%
mutate(debt_cost_ratio = Debt_median/Cost)
Within the mutate function, we see debt_cost_ratio = Debt_median/Cost; the left hand side of this indicates the new variable name that we are creating, while the right hand side shows how we created it out of the existing variables Debt_median and Cost. We are often creating new variables with the intent of analyzing them, which we can do easily with ggplot:
## Notice I am using debt_cost data.frame since it contains
## the new variable that I created
ggplot(debt_cost, aes(debt_cost_ratio, Type)) +
geom_boxplot()
From this, we see that students at public colleges generally take out as debt a much greater proportion of the total cost of college.
The ifelse() function
A very common trick in data analysis, and one that we have seen a few times in class, is that of changing a continuous variable into a categorical one. A very useful function for doing so is the ifelse() function. ifelse() takes three arguments:

1. a logical condition
2. the value to return where the condition is TRUE
3. the value to return where the condition is FALSE
For example, consider a variable x with the values 1-10, which we may want to assign to one of two groups depending on its value.
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
# If x < 5, return "A", otherwise return "B"
ifelse(x < 5, "A", "B")
## [1] "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"
This is saying: everywhere that x is less than 5, return the value "A"; otherwise, return the value "B".
Now consider our college dataset with the variable Adm_Rate, indicating the proportion of applicants that are admitted. This is a continuous variable by default, but we could use it to create a new variable indicating whether or not a school is considered selective:
ia_select <- college %>%
filter(State %in% "IA") %>%
mutate(Selective = ifelse(Adm_Rate < 0.3, "selective", "not selective"))
# How many selective schools are in Iowa?
with(ia_select, table(Selective))
## Selective
## not selective selective
## 24 1
Question 3: Use the mutate() and ifelse() functions to create a new variable called "Size" that has the value "large" if enrollment is greater than 10,000 and "small" if enrollment is less than 10,000. Then, create an appropriate bar chart to determine which region has the greatest proportion of small schools.
Question 4: Mutating is also helpful for changing the class of a variable that may be coded numerically but should nonetheless be treated as categorical. A class that we have not yet formally introduced is the factor, which takes a vector and returns a categorical variable based on its values. This is particularly useful, as we will see, when dealing with numerical vectors that have a small number of unique values (a short illustration of factor() follows this question). For this question, please do the following:

1. Using the mpg dataset that is built into ggplot2, create a scatter plot with the variable displ (engine displacement, an indicator of size) on the x-axis and cty (city miles per gallon) on the y-axis. Add cyl (number of cylinders) as a color aesthetic.
2. Use mutate() to change the variable cyl in the mpg dataset to a factor variable (a variable that is categorical) using cyl = factor(cyl).
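As a quick illustration of what factor() does (using a small made-up vector, not the mpg data), notice that the result is a categorical variable with a fixed set of levels:
## A numeric vector with only a few unique values
cylinders <- c(4, 6, 8, 4, 6)
factor(cylinders)
## [1] 4 6 8 4 6
## Levels: 4 6 8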
.summarize
The summarize() function helps us calculate descriptive statistics requiring an aggregation of rows. As a result, we will nearly always end up with fewer rows than we began with. For example, I may be interested in a table of quantitative statistics describing the variable "Cost":
## Notice here that I am NOT assigning it to a new variable name.
## That will cause my summary to print out, but it will not save it in my
## environment for later use. If you ever cannot find your modified data.frame
## make sure you saved it to a new variable first
college %>% summarize(meanCost = mean(Cost),
                      sdCost = sd(Cost),
                      minCost = min(Cost),
                      maxCost = max(Cost))
## meanCost sdCost minCost maxCost
## 1 37129 15251 12223 75735
The general syntax for something like meanCost = mean(Cost) is such that the name of our new variable is on the left hand side, while the function of our original variable "Cost" is on the right hand side. You will also note that the data frame returned contains only the new variables that were created in the summarize() function.
While the summarize() function is useful, it is most frequently used with group_by(), which we illustrate next.
group_by
Typically when we are summarizing information in our data frame, we want to do so at a group level. We can indicate that we wish to do this using the group_by() function. The function works in two steps:

1. group_by() determines which categorical variables we wish to use to create groups and creates invisible tags so that other functions know these groups exist. Typically this is one variable, but it can also be several.
2. When functions from the dplyr package see data that is grouped (typically the summarize() function), they perform whatever operations they intend to at the group level.

The application of this is straightforward enough. Here, we indicate that we wish to group our college dataset by "Region" and then find the mean cost of attending:
college %>% group_by(Region) %>%
summarize(meanCost = mean(Cost))
## # A tibble: 8 × 2
## Region meanCost
## <chr> <dbl>
## 1 Far West 40409.
## 2 Great Lakes 37403.
## 3 Mid East 43212.
## 4 New England 45682.
## 5 Plains 35810.
## 6 Rocky Mountains 26503.
## 7 South East 32933.
## 8 South West 31289.
All of the functions that we have introduced in this lab form a cohesive framework for interactive data analysis. In particular, they were designed with the philosophy in mind that they should be able to be chained together to succinctly perform some of the most common applications. Consider for example the code below, where we:

1. filter() to keep only schools in Iowa, Missouri, and Minnesota
2. mutate() to create a new debt-to-cost ratio variable
3. group_by() State and Type
4. summarize() the mean and standard deviation of the ratio within each group
college %>%
  filter(State %in% c("IA", "MO", "MN")) %>%
  mutate(debt_cost_ratio = Debt_median / Cost) %>%
  group_by(State, Type) %>%
  summarize(mean_dcr = mean(debt_cost_ratio),
            sd_dcr = sd(debt_cost_ratio))
## # A tibble: 6 × 4
## # Groups: State [3]
## State Type mean_dcr sd_dcr
## <chr> <chr> <dbl> <dbl>
## 1 IA Private 0.399 0.0597
## 2 IA Public 0.780 0.0967
## 3 MN Private 0.423 0.0783
## 4 MN Public 0.673 0.0797
## 5 MO Private 0.420 0.121
## 6 MO Public 0.716 0.0459
Very quickly we are able to see that public schools in Iowa have the largest debt to cost ratio, while private schools in Iowa have the lowest.
It is worth noting an important distinction between mutate() and summarize() when using group_by(), namely:

- mutate() will always create a new variable with the same number of rows as the original data
- summarize() will always reduce the number of rows to the number of groups. It will also delete any variables that are not a part of the grouping or the summarizing

Consider this toy example to see how the behavior changes between each. We have a group column, specifying group, a val column, specifying a value, and an extra column which is here to be a placeholder. This dataset is constructed so that group "A" has a mean value of 5, while group "B" has a mean value of 50.
df <- data.frame(group = rep(c("A", "B"), each = 4),
val = c(rnorm(4, 5), rnorm(4, 50)),
extra = -1)
df
## group val extra
## 1 A 3.8731 -1
## 2 A 4.7078 -1
## 3 A 5.4792 -1
## 4 A 5.2028 -1
## 5 B 49.4511 -1
## 6 B 49.0922 -1
## 7 B 49.2017 -1
## 8 B 51.1756 -1
If I first group and then mutate, it will add an additional column, returning the mean value for each group:
df %>% group_by(group) %>%
mutate(meanVal = mean(val))
## # A tibble: 8 × 4
## # Groups: group [2]
## group val extra meanVal
## <chr> <dbl> <dbl> <dbl>
## 1 A 3.87 -1 4.82
## 2 A 4.71 -1 4.82
## 3 A 5.48 -1 4.82
## 4 A 5.20 -1 4.82
## 5 B 49.5 -1 49.7
## 6 B 49.1 -1 49.7
## 7 B 49.2 -1 49.7
## 8 B 51.2 -1 49.7
Contrast this with the summarize function:
df %>% group_by(group) %>%
summarize(meanVal = mean(val))
## # A tibble: 2 × 2
## group meanVal
## <chr> <dbl>
## 1 A 4.82
## 2 B 49.7
Here, the “extra” column has been removed, and all of the rows have been reduced down to the number of values in the grouping variable. Both methods have their uses, depending on the context.
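As one example of where a grouped mutate() is the natural choice (a hypothetical centering step, not something we need elsewhere in this lab), we can subtract each group's mean from its values while keeping all eight rows:
## Center val within each group; every row is retained
df %>% group_by(group) %>%
  mutate(centered = val - mean(val))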
Finally, it is useful to be aware of the helper function n(), which can be used inside of summarize(); this permits us to display the number of observations in our groups in addition to other statistical summaries:
college %>%
  filter(State %in% c("IA", "MO", "MN")) %>%
  mutate(debt_cost_ratio = Debt_median / Cost) %>%
  group_by(State, Type) %>%
  summarize(Count = n()) # helper function
## # A tibble: 6 × 3
## # Groups: State [3]
## State Type Count
## <chr> <chr> <int>
## 1 IA Private 22
## 2 IA Public 3
## 3 MN Private 16
## 4 MN Public 10
## 5 MO Private 18
## 6 MO Public 10
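Because n() is just another expression inside of summarize(), it can sit alongside other summaries; a sketch combining the count with the mean debt-to-cost ratio might look like this:
## Count and mean computed together for each State/Type group
college %>%
  filter(State %in% c("IA", "MO", "MN")) %>%
  mutate(debt_cost_ratio = Debt_median / Cost) %>%
  group_by(State, Type) %>%
  summarize(Count = n(),
            mean_dcr = mean(debt_cost_ratio))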
Question 5: Intensive care units, or ICUs, are primary spaces in hospitals that are reserved for patients in critical condition. The data below is a random sample of n = 200 ICU patients from a research hospital affiliated with Carnegie Mellon University (CMU).
icu <- read.csv("https://collinn.github.io/data/icuadmit.csv")
The relevant variables for this question are Previous, indicating whether the patient had a previous admission to the ICU in the prior 6 months (coded 0/1), and Age, the patient's age.
Using the icu dataset and the functions described in this lab, complete the following stages:

1. Use mutate() with ifelse() to change the Previous variable to have values "no" and "yes" instead of 0 and 1.
2. Use group_by() and summarize() to find the mean age, the standard deviation of age, and the number (n()) of patients with and without a previous admission to the ICU in the prior 6 months.

Your solution should indicate the total number of patients with and without previous admission, along with each group's mean age and standard deviation. It will contain two rows and four columns; your final output should look like this:
## Previous MeanAge SDAge N
## 1 no 62.369 17.842 65
## 2 yes 57.000 17.601 19
(This is all FYI and can be safely skipped if you are not interested)
A short bonus section: the output of the summarize() function is often something we want to present. We can turn this into an HTML table using a function from the knitr package, which should already be installed on your computer (it is used to knit these documents).
## Load package
library(knitr)
## Create summary
tab <- college %>%
  filter(State %in% c("IA", "MO", "MN")) %>%
  mutate(debt_cost_ratio = Debt_median / Cost) %>%
  group_by(State, Type) %>%
  summarize(Count = n())
kable(tab)
State | Type | Count |
---|---|---|
IA | Private | 22 |
IA | Public | 3 |
MN | Private | 16 |
MN | Public | 10 |
MO | Private | 18 |
MO | Public | 10 |
You can see other ways to modify this by investigating ?kable (stands for Knit tABLE).
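For example, kable() accepts optional arguments such as digits, col.names, and caption; a minimal sketch adding a caption to the table above:
## Add a caption to the HTML table
kable(tab, caption = "Number of colleges by state and type")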
Finally, we can install one additional package that gives us even more options for presenting tables. Copy the line below into your console (not into your R Markdown document) and remember, this is something you only need to do once:
install.packages("kableExtra")
This package includes the kable_styling() function, which provides more formatting options. We use it by piping in a kable:
library(knitr)
library(kableExtra)
kable(tab) %>% kable_styling(full_width = FALSE, font_size = 30)
State | Type | Count |
---|---|---|
IA | Private | 22 |
IA | Public | 3 |
MN | Private | 16 |
MN | Public | 10 |
MO | Private | 18 |
MO | Public | 10 |
You can learn more with ?kable_styling.
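For instance, kable_styling() also accepts a bootstrap_options argument for things like striped rows and hover highlighting (a sketch only; the exact look depends on your document theme):
kable(tab) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)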