Lab 5 – dplyr

library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## College data
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")

## For this lab, we don't need all of the columns so we will  
# select only a few of them
college <- select(college, Name, State, Enrollment, Private, 
                  Region, Adm_Rate, ACT_median, Cost, Net_Tuition, 
                  Avg_Fac_Salary, Debt_median, 
                  FourYearComp_Males, FourYearComp_Females, 
                  Salary10yr_median)

Introduction

This lab serves as our introduction to the dplyr (pronounced dee-ply-R) package in R which provides a cohesive framework for working with and manipulating data. In particular, there are four aspects of data manipulation on which we will focus today:

Filtering – Filtering involves creating subsets of our data according to some criterion. For example, we may be interested in filtering our college dataset to only include private college or colleges located in Iowa
Mutating – Mutating is involved with the creation of new variables in our dataset or the transformation of existing one
Summarizing – Summarizing is used to calculate descriptive statistics
Grouping – Grouping allows us to specify groups of our data for the preceding functions rather than acting upon the entire dataset

The dplyr package contains a suite of functions that accommodate these procedures, often chained together in a sequence of operations, connected with our pipe operator, %>%. For example, if I wanted to subset my data to only include colleges in Iowa and then compute the average cost for both public and private schools, my code would look like this:

college %>% filter(State %in% "IA") %>% 
  group_by(Private) %>% 
  summarize(meanCost = mean(Cost))

## # A tibble: 2 × 2
##   Private meanCost
##   <chr>      <dbl>
## 1 Private   46645.
## 2 Public    21420

Here, we first filter our data to include only schools in Iowa, then we group by the Private variable, and finally we summarize the cost.

We will introduce several additional helper functions in this lab when they are needed, but primarily, we will be concerned with these functions:

Verb/Function	Meaning
`select`	Choose variables we wish to keep in our data
`filter`	pick specific observations (i.e. specific rows)
`mutate`	add new derived columns to a data frame
`summarize`	aggregate many rows into a summary measure
`group_by`	perform subsequent operations separtely on the groups created by a categorical variable

`filter`

The purpose of filtering is to reduce a data set to only contain rows that meet a certain criteria. That criteria must be specified in the form of a logical expression. A few common logical operators are described in the table below:

Operator	Description
`==`	equal to
`!=`	not equal to
`%in%`	equal to any element in a given vector
`<` and `<=`	less than (strict) or less than or equal to
`>` and `>=`	greater than (strict) or greater than or equal to
`&`	and
`\|`	or
`!`	not

When we use the filter function, we specify a logical criteria for our data to meet. For example, here is how we would filter to keep only colleges with a median ACT of 30 or more

act30 <- college %>% filter(ACT_median > 29)
head(act30)

##                    Name State Enrollment Private      Region Adm_Rate
## 1       Amherst College    MA       1855 Private New England   0.1281
## 2    Bentley University    MA       4177 Private New England   0.4321
## 3 Binghamton University    NY      13990  Public    Mid East   0.3981
## 4        Boston College    MA       9639 Private New England   0.2789
## 5     Boston University    MA      17238 Private New England   0.2209
## 6   Brandeis University    MA       3627 Private New England   0.3115
##   ACT_median  Cost Net_Tuition Avg_Fac_Salary Debt_median FourYearComp_Males
## 1         33 71300       27541         116775       11900            0.84211
## 2         30 66180       30307         126936       23250            0.81663
## 3         30 25846        8187          88011       16750            0.68695
## 4         33 70588       34327         137907       16000            0.87246
## 5         32 70216       33686         127224       22500            0.80713
## 6         31 70848       24212         118584       24708            0.78462
##   FourYearComp_Females Salary10yr_median
## 1              0.89691             65000
## 2              0.89170             86900
## 3              0.76490             61600
## 4              0.91693             72500
## 5              0.79152             65300
## 6              0.84483             57900

The filter function can take more than one expression if needed, either separated by commas or the logical operator &. Here, we illustrate filtering our dataset to include only those with median ACT of 30 or more and colleges located in Iowa (IA) or Minnesota

act30_iamn <- college %>% filter(ACT_median > 29, 
                            State %in% c("IA", "MN"))
head(act30_iamn)

##                 Name State Enrollment Private Region Adm_Rate ACT_median  Cost
## 1   Grinnell College    IA       1683 Private Plains   0.2438         32 65814
## 2 Macalester College    MN       2160 Private Plains   0.4124         31 66280
##   Net_Tuition Avg_Fac_Salary Debt_median FourYearComp_Males
## 1       20369         101979       15000            0.74138
## 2       23225          95607       19432            0.80714
##   FourYearComp_Females Salary10yr_median
## 1              0.83721             49100
## 2              0.91237             47600

Notice how the %in% operator is used here: State %in% c("IA", "MN") first considers the variable State and then matches it to each of the values in the vector c("IA", "MN")

Question 1: Filter the college dataset to include only schools located in the Plains and the Great Lakes regions and with enrollments less than 20,000. Using your filtered data, create a two-way table with the variables Region and Private

`mutate`

The purpose of mutating is typically to create new variables out of old ones. For example, our college dataset contains variables for both cost and median debt, and it may be of interest to see for a given school what the typical ratio of debt to cost is

debt_cost <- college %>% 
  mutate(debt_cost_ratio = Debt_median/Cost)

Within the mutate function, we see debt_cost_ratio = Debt_median/Cost; the left hand side of this indicates the new variable name that we are creating, while the right hand side shows us how we created it out of the existing variables Debt_median and Cost.

ggplot(debt_cost, aes(debt_cost_ratio, Private)) + 
  geom_boxplot()

From this, we see that students at public colleges take a generally much greater proportion of the total cost of college out as debt.

A very helpful function for changing the values of a vector or for turning a continuous variable into a categorical one is the ifelse() function. ifelse() takes three arguments:

A logical expression
What to return if TRUE
What to return if FALSE

For example, consider a variable x with the values 1-10 which we may want to assign to one of two groups, depending on its value.

x <- 1:10
x

##  [1]  1  2  3  4  5  6  7  8  9 10

# If x < 5, return "A", otherwise return "B"
ifelse(x < 5, "A", "B")

##  [1] "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"

Now consider our college dataset with the variable Adm_Rate, indicating the percentage of applicants that are admitted. This is a continuous variable by default, but we could use it to create a new variable indicating whether or not a school is considered selective:

ia_select <- college %>% 
  filter(State %in% "IA") %>% 
  mutate(Selective = ifelse(Adm_Rate < 0.3, "selective", "not selective"))

# How many selective schools are in Iowa?
with(ia_select, table(Selective))

## Selective
## not selective     selective 
##            24             1

Question 2: Use the mutate() and ifelse() functions to create a new variable called “Size” that has the value “large” if enrollment is greater than 10,000 and “small” if enrollment is less than 10,000. Then, create an appropriate bar chart to determine which region has the greatest proportion of small schools

`summarize`

The summarize() function helps us calculate descriptive statistics requiring an aggregation of rows. As a result, we will nearly always end up with fewer rows than with which we begin. For example, I may be interested in a table of quantitative statistics describing the variable “Cost”:

college %>% summarize(meanCost = mean(Cost), 
                      sdCost = sd(Cost), 
                      minCost = min(Cost), 
                      maxCost = max(Cost))

##   meanCost sdCost minCost maxCost
## 1    37129  15251   12223   75735

The general syntax for something like meanCost = mean(Cost) is such that the name of our new variable is on the left hand side, while the function of our original variable “Cost” is on the right hand side. You will also note that the data frame returned contains only the new variables that were created in the summarize() function.

While the summarize() function is useful, it is most frequently used with group_by(), which we illustrate next.

`group_by`

Typically when we are summarizing information in our data frame, we want to do so at a group level. We can indicate that we wish to do this using the group_by() function. The function works in two steps:

First, group_by() determines which categorical variables we wish to use to create groups and creates invisible tags so that other functions know these groups exist. Typically this is one variable, but it can also be several.
Whenever functions from the dplyr package see data that is grouped (typically with the summarize() function), it performs whatever operations it intends to at the group level.

The application of this is straightforward enough. Here, we indicate that we wish to group our college dataset by “Region” and then find the mean cost of attending

college %>% group_by(Region) %>% 
  summarize(meanCost = mean(Cost))

## # A tibble: 8 × 2
##   Region          meanCost
##   <chr>              <dbl>
## 1 Far West          40409.
## 2 Great Lakes       37403.
## 3 Mid East          43212.
## 4 New England       45682.
## 5 Plains            35810.
## 6 Rocky Mountains   26503.
## 7 South East        32933.
## 8 South West        31289.

All of the functions that we have introduced in this lab form a cohesive framework for interactive data analysis. In particular, there were designed with the philosophy in mind that they should be able to be chained together to succinctly perform some of the most common applications. Consider for example the code below where we

First limit our colleges to those located in Iowa (IA), Missouri (MO), and Minnesota (MN) using filter()
We then compute the debt to cost ratio for each school just as we did before using mutate()
We indicate that we wish to group our data according to both State and Private, which we do with group_by()
Finally, we summarize our data to find the mean and standard deviation of the debt to cost ratio using summarize()

college %>% 
  filter(State %in% c("IA", "MO", "MN")) %>% 
  mutate(debt_cost_ratio = Debt_median / Cost) %>% 
  group_by(State, Private) %>% 
  summarize(mean_dcr = mean(debt_cost_ratio), 
            sd_dcr = sd(debt_cost_ratio))

## # A tibble: 6 × 4
## # Groups:   State [3]
##   State Private mean_dcr sd_dcr
##   <chr> <chr>      <dbl>  <dbl>
## 1 IA    Private    0.399 0.0597
## 2 IA    Public     0.780 0.0967
## 3 MN    Private    0.423 0.0783
## 4 MN    Public     0.673 0.0797
## 5 MO    Private    0.420 0.121 
## 6 MO    Public     0.716 0.0459

Very quickly we are able to see that public schools in Iowa have the largest debt to cost ratio, while private schools in Iowa have the lowest.

Lastly, we have a helperl function n() which can be used in side of summarize – this permits us to display the number of observations in our groups in addition to other statistical summaries

college %>% 
  filter(State %in% c("IA", "MO", "MN")) %>% 
  mutate(debt_cost_ratio = Debt_median / Cost) %>% 
  group_by(State, Private) %>% 
  summarize(Count = n()) # helper function

## # A tibble: 6 × 3
## # Groups:   State [3]
##   State Private Count
##   <chr> <chr>   <int>
## 1 IA    Private    22
## 2 IA    Public      3
## 3 MN    Private    16
## 4 MN    Public     10
## 5 MO    Private    18
## 6 MO    Public     10

Question 3: Intensive care units, or ICUs, are primary spaces in hospitals that are reserved for patients in critical condition. The data below is a random sample of n = 200 ICU patients from a research hospital affiliated with Carnegie Mellon University (CMU).

icu <- read.csv("https://remiller1450.github.io/data/ICUAdmissions.csv")

The data dictionary below documents each variable that was recorded:

ID - Patient ID number
Status - Patient status: 0=lived or 1=died
Age - Patient’s age (in years)
Sex - 0=male or 1=female
Race - Patient’s race: 1=white, 2=black, or 3=other
Service - Type of service: 0=medical or 1=surgical
Cancer - Is cancer involved? 0=no or 1=yes
Renal - Is chronic renal failure involved? 0=no or 1=yes
Infection - Is infection involved? 0=no or 1=yes
CPR - Patient received CPR prior to admission? 0=no or 1=yes
Systolic - Systolic blood pressure (in mm of Hg)
HeartRate - Pulse rate (beats per minute)
Previous - Previous admission to ICU within 6 months? 0=no or 1=yes
Type - Admission type: 0=elective or 1=emergency
Fracture - Fractured bone involved? 0=no or 1=yes
PO2 - Partial oxygen level from blood gases under 60? 0=no or 1=yes
PH - pH from blood gas under 7.25? 0=no or 1=yes
PCO2 - Partial carbon dioxide level from blood gas over 45? 0=no or 1=yes
Bicarbonate - Bicarbonate from blood gas under 18? 0=no or 1=yes
Creatinine - Creatinine from blood gas over 2.0? 0=no or 1=yes
Consciousness - Level upon arrival: 0=conscious, 1=deep stupor, or 2=coma

Using the icu dataset and the functions described in this lab, complete the following stages:

Change the Previous variable to have values “no” and “yes” instead of 0 and 1
Filter the data to only include patients whose visit involves an infection
For the “Age” variable, find the mean, standard deviation, and group size (found using the function n()) of patients with and without a previous admission to the ICU in the prior 6 months.

Your solution should indicate the total number of patients with and without previous admission, along with each group’s mean age and standard deviation. It will contain two rows and four columns – your final output should look like this:

##   Previous MeanAge  SDAge  N
## 1       no  62.369 17.842 65
## 2      yes  57.000 17.601 19

Lab 5 – `dplyr`

2024-02-22

Introduction

`filter`

`mutate`

`summarize`

`group_by`