dplyr
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## College data
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
## For this lab, we don't need all of the columns so we will
# select only a few of them
college <- select(college, Name, State, Enrollment, Private,
Region, Adm_Rate, ACT_median, Cost, Net_Tuition,
Avg_Fac_Salary, Debt_median,
FourYearComp_Males, FourYearComp_Females,
Salary10yr_median)
This lab serves as our introduction to the dplyr
(pronounced dee-ply-R) package in R
which provides a
cohesive framework for working with and manipulating data. In
particular, there are four aspects of data manipulation on which we will
focus today:
The dplyr
package contains a suite of functions that
accommodate these procedures, often chained together in a
sequence of operations, connected with our pipe operator,
%>%
. For example, if I wanted to subset my data to only
include colleges in Iowa and then compute the average cost for both
public and private schools, my code would look like this:
college %>% filter(State %in% "IA") %>%
group_by(Private) %>%
summarize(meanCost = mean(Cost))
## # A tibble: 2 × 2
## Private meanCost
## <chr> <dbl>
## 1 Private 46645.
## 2 Public 21420
Here, we first filter our data to include only schools in
Iowa, then we group by the Private
variable, and
finally we summarize the cost.
We will introduce several additional helper functions in this lab when they are needed, but primarily, we will be concerned with these functions:
Verb/Function | Meaning |
---|---|
select |
Choose variables we wish to keep in our data |
filter |
pick specific observations (i.e. specific rows) |
mutate |
add new derived columns to a data frame |
summarize |
aggregate many rows into a summary measure |
group_by |
perform subsequent operations separtely on the groups created by a categorical variable |
filter
The purpose of filtering is to reduce a data set to only contain rows that meet a certain criteria. That criteria must be specified in the form of a logical expression. A few common logical operators are described in the table below:
Operator | Description |
---|---|
== |
equal to |
!= |
not equal to |
%in% |
equal to any element in a given vector |
< and <= |
less than (strict) or less than or equal to |
> and >= |
greater than (strict) or greater than or equal to |
& |
and |
| |
or |
! |
not |
When we use the filter
function, we specify a logical
criteria for our data to meet. For example, here is how we would filter
to keep only colleges with a median ACT of 30 or more
act30 <- college %>% filter(ACT_median > 29)
head(act30)
## Name State Enrollment Private Region Adm_Rate
## 1 Amherst College MA 1855 Private New England 0.1281
## 2 Bentley University MA 4177 Private New England 0.4321
## 3 Binghamton University NY 13990 Public Mid East 0.3981
## 4 Boston College MA 9639 Private New England 0.2789
## 5 Boston University MA 17238 Private New England 0.2209
## 6 Brandeis University MA 3627 Private New England 0.3115
## ACT_median Cost Net_Tuition Avg_Fac_Salary Debt_median FourYearComp_Males
## 1 33 71300 27541 116775 11900 0.84211
## 2 30 66180 30307 126936 23250 0.81663
## 3 30 25846 8187 88011 16750 0.68695
## 4 33 70588 34327 137907 16000 0.87246
## 5 32 70216 33686 127224 22500 0.80713
## 6 31 70848 24212 118584 24708 0.78462
## FourYearComp_Females Salary10yr_median
## 1 0.89691 65000
## 2 0.89170 86900
## 3 0.76490 61600
## 4 0.91693 72500
## 5 0.79152 65300
## 6 0.84483 57900
The filter
function can take more than one expression if
needed, either separated by commas or the logical operator
&
. Here, we illustrate filtering our dataset to include
only those with median ACT of 30 or more and colleges located
in Iowa (IA) or Minnesota
act30_iamn <- college %>% filter(ACT_median > 29,
State %in% c("IA", "MN"))
head(act30_iamn)
## Name State Enrollment Private Region Adm_Rate ACT_median Cost
## 1 Grinnell College IA 1683 Private Plains 0.2438 32 65814
## 2 Macalester College MN 2160 Private Plains 0.4124 31 66280
## Net_Tuition Avg_Fac_Salary Debt_median FourYearComp_Males
## 1 20369 101979 15000 0.74138
## 2 23225 95607 19432 0.80714
## FourYearComp_Females Salary10yr_median
## 1 0.83721 49100
## 2 0.91237 47600
Notice how the %in%
operator is used here:
State %in% c("IA", "MN")
first considers the variable
State
and then matches it to each of the values in the
vector c("IA", "MN")
Question 1: Filter the college dataset to include
only schools located in the Plains and the Great Lakes regions and with
enrollments less than 20,000. Using your filtered data, create a two-way
table with the variables Region
and
Private
mutate
The purpose of mutating is typically to create new variables out of old ones. For example, our college dataset contains variables for both cost and median debt, and it may be of interest to see for a given school what the typical ratio of debt to cost is
debt_cost <- college %>%
mutate(debt_cost_ratio = Debt_median/Cost)
Within the mutate
function, we see
debt_cost_ratio = Debt_median/Cost
; the left hand side of
this indicates the new variable name that we are creating, while the
right hand side shows us how we created it out of the existing variables
Debt_median
and Cost
.
ggplot(debt_cost, aes(debt_cost_ratio, Private)) +
geom_boxplot()
From this, we see that students at public colleges take a generally much greater proportion of the total cost of college out as debt.
A very helpful function for changing the values of a vector or for
turning a continuous variable into a categorical one is the
ifelse()
function. ifelse()
takes three
arguments:
TRUE
FALSE
For example, consider a variable x
with the values 1-10
which we may want to assign to one of two groups, depending on its
value.
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
# If x < 5, return "A", otherwise return "B"
ifelse(x < 5, "A", "B")
## [1] "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"
Now consider our college dataset with the variable
Adm_Rate
, indicating the percentage of applicants that are
admitted. This is a continuous variable by default, but we could use it
to create a new variable indicating whether or not a school is
considered selective:
ia_select <- college %>%
filter(State %in% "IA") %>%
mutate(Selective = ifelse(Adm_Rate < 0.3, "selective", "not selective"))
# How many selective schools are in Iowa?
with(ia_select, table(Selective))
## Selective
## not selective selective
## 24 1
Question 2: Use the mutate()
and
ifelse()
functions to create a new variable called “Size”
that has the value “large” if enrollment is greater than 10,000 and
“small” if enrollment is less than 10,000. Then, create an appropriate
bar chart to determine which region has the greatest proportion of small
schools
summarize
The summarize()
function helps us calculate descriptive
statistics requiring an aggregation of rows. As a result, we will nearly
always end up with fewer rows than with which we begin. For example, I
may be interested in a table of quantitative statistics describing the
variable “Cost”:
college %>% summarize(meanCost = mean(Cost),
sdCost = sd(Cost),
minCost = min(Cost),
maxCost = max(Cost))
## meanCost sdCost minCost maxCost
## 1 37129 15251 12223 75735
The general syntax for something like
meanCost = mean(Cost)
is such that the name of our new
variable is on the left hand side, while the function of our original
variable “Cost” is on the right hand side. You will also note that the
data frame returned contains only the new variables that were
created in the summarize()
function.
While the summarize()
function is useful, it is most
frequently used with group_by()
, which we illustrate
next.
group_by
Typically when we are summarizing information in our data frame, we
want to do so at a group level. We can indicate that we wish to
do this using the group_by()
function. The function works
in two steps:
group_by()
determines which categorical
variables we wish to use to create groups and creates invisible tags so
that other functions know these groups exist. Typically this is one
variable, but it can also be several.dplyr
package see data that
is grouped (typically with the summarize()
function), it
performs whatever operations it intends to at the group level.The application of this is straightforward enough. Here, we indicate that we wish to group our college dataset by “Region” and then find the mean cost of attending
college %>% group_by(Region) %>%
summarize(meanCost = mean(Cost))
## # A tibble: 8 × 2
## Region meanCost
## <chr> <dbl>
## 1 Far West 40409.
## 2 Great Lakes 37403.
## 3 Mid East 43212.
## 4 New England 45682.
## 5 Plains 35810.
## 6 Rocky Mountains 26503.
## 7 South East 32933.
## 8 South West 31289.
All of the functions that we have introduced in this lab form a cohesive framework for interactive data analysis. In particular, there were designed with the philosophy in mind that they should be able to be chained together to succinctly perform some of the most common applications. Consider for example the code below where we
filter()
mutate()
group_by()
summarize()
college %>%
filter(State %in% c("IA", "MO", "MN")) %>%
mutate(debt_cost_ratio = Debt_median / Cost) %>%
group_by(State, Private) %>%
summarize(mean_dcr = mean(debt_cost_ratio),
sd_dcr = sd(debt_cost_ratio))
## # A tibble: 6 × 4
## # Groups: State [3]
## State Private mean_dcr sd_dcr
## <chr> <chr> <dbl> <dbl>
## 1 IA Private 0.399 0.0597
## 2 IA Public 0.780 0.0967
## 3 MN Private 0.423 0.0783
## 4 MN Public 0.673 0.0797
## 5 MO Private 0.420 0.121
## 6 MO Public 0.716 0.0459
Very quickly we are able to see that public schools in Iowa have the largest debt to cost ratio, while private schools in Iowa have the lowest.
Lastly, we have a helperl function n()
which can be used
in side of summarize – this permits us to display the number of
observations in our groups in addition to other statistical
summaries
college %>%
filter(State %in% c("IA", "MO", "MN")) %>%
mutate(debt_cost_ratio = Debt_median / Cost) %>%
group_by(State, Private) %>%
summarize(Count = n()) # helper function
## # A tibble: 6 × 3
## # Groups: State [3]
## State Private Count
## <chr> <chr> <int>
## 1 IA Private 22
## 2 IA Public 3
## 3 MN Private 16
## 4 MN Public 10
## 5 MO Private 18
## 6 MO Public 10
Question 3: Intensive care units, or ICUs, are primary spaces in hospitals that are reserved for patients in critical condition. The data below is a random sample of n = 200 ICU patients from a research hospital affiliated with Carnegie Mellon University (CMU).
icu <- read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")
The data dictionary below documents each variable that was recorded:
Using the icu
dataset and the functions described in
this lab, complete the following stages:
Previous
variable to have values “no” and
“yes” instead of 0 and 1n()
) of patients with and
without a previous admission to the ICU in the prior 6 months.Your solution should indicate the total number of patients with and without previous admission, along with each group’s mean age and standard deviation. It will contain two rows and four columns – your final output should look like this:
## Previous MeanAge SDAge N
## 1 no 62.369 17.842 65
## 2 yes 57.000 17.601 19