knitr::opts_chunk$set(echo = TRUE,
fig.align = 'center',
fig.width = 4,
fig.height = 4,
message = FALSE, warning = FALSE)
library(ggplot2)
library(dplyr)
theme_set(theme_bw(base_size = 16))
Frequently in this class (and elsewhere), we will find that some tasks are more convenient with a table and others with a data.frame. Converting a data.frame to a table is simple enough and is likely something we have done many times before:
## Read in college dataset
college <- read.csv("https://collinn.github.io/data/college2019.csv")
## Construct new size variable based on enrollment
college$Size <- ifelse(college$Enrollment < 2500, "Small", "Large")
## Cross classify values of observations
with(college, table(Type, Size))
## Size
## Type Large Small
## Private 207 440
## Public 374 74
Note here the use of the with() function, which takes a data.frame as its first argument and, as its second, an expression that uses the column names of that data.frame. This is in contrast to, for example, extracting the vectors individually:
## More typing and loses "Type" and "Size" headers
table(college$Type, college$Size)
##
## Large Small
## Private 207 440
## Public 374 74
Moving from a table to a data.frame is less obvious, however. To start, it's worthwhile to ask ourselves what we need this data.frame to look like. From the table, we can anticipate that it should have two columns, Type and Size, and one row for each observation counted in the table.
We begin by creating a data.frame that contains all combinations of values we see in our table using the expand.grid() function. This function works by taking a collection of vectors and building every combination of their values:
df <- expand.grid(Type = c("Private", "Public"),
Size = c("Large", "Small"))
## data.frame with values
df
## Type Size
## 1 Private Large
## 2 Public Large
## 3 Private Small
## 4 Public Small
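As a small sketch of how the number of rows grows, expand.grid() also accepts more than two vectors. The Region variable and its levels below are made up for illustration and are not part of the college data:

```r
## Hypothetical three-variable example (Region is illustrative only)
grid3 <- expand.grid(Type = c("Private", "Public"),
                     Size = c("Large", "Small"),
                     Region = c("East", "West", "South"))

## One row per combination: 2 * 2 * 3
nrow(grid3)
## [1] 12

## The first variable changes fastest, the last variable slowest
head(grid3, 4)
```

The ordering matters later on: the first variable cycles through its values before the second one changes.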
Having established our values, we now need to expand the data.frame to the appropriate size. To do this, we will use two ideas: indexing and replication.
We are likely aware that we can use integer values in the rows of a data.frame to indicate which rows we would like to subset:
## Get second row
df[2, ]
## Type Size
## 2 Public Large
We can also use this to change the order of the rows:
## Change order
df[c(2,1,4,3), ]
## Type Size
## 2 Public Large
## 1 Private Large
## 4 Public Small
## 3 Private Small
In fact, this generalizes to any set of integers supplied, so long as they correspond with a row number:
## We can extract the same row multiple times
df[c(2,2,2,2,2), ]
## Type Size
## 2 Public Large
## 2.1 Public Large
## 2.2 Public Large
## 2.3 Public Large
## 2.4 Public Large
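The row names above (2.1, 2.2, and so on) are generated automatically by R so that row names stay unique. As a minimal sketch, assuming we do not need to track which original row each copy came from, they can be reset. The names small and dup are used only for this illustration:

```r
## Illustrative data.frame
small <- data.frame(Type = c("Private", "Public"),
                    Size = c("Large", "Large"))

## Repeat the second row three times
dup <- small[c(2, 2, 2), ]
rownames(dup)
## [1] "2"   "2.1" "2.2"

## Reset to the default 1, 2, 3, ...
rownames(dup) <- NULL
rownames(dup)
## [1] "1" "2" "3"
```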
Having an idea of how to use row numbers to reproduce entries, we now consider how to specify how many of each we need.
Replication means several things in R that we will interact with in the coming weeks. For now, it is sufficient to introduce the rep() function. The rep() function takes several arguments, but we need only two for now: a vector to replicate and the number of times to replicate it:
rep(x = 1:3, times = 2)
## [1] 1 2 3 1 2 3
The times argument is conceptually a bit weird, as it doesn’t always behave the way we might expect. Consider what happens when we give it a vector:
rep(1:3, times = c(2,2,2))
## [1] 1 1 2 2 3 3
When times gets a vector argument, it repeats each element of 1:3 the corresponding number of times. When it gets a single integer, it repeats the entire vector supplied to x. This is in contrast to the each argument, which only takes a single integer and operates the same way that times does when supplied a vector whose entries are all equal to that integer.
rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
## The 3 is ignored since it only takes a single integer value
rep(1:3, each = c(2,3))
## [1] 1 1 2 2 3 3
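One quick check when building a vector for times is that the length of the result equals the sum of the counts. A small sketch, with counts chosen arbitrarily for illustration:

```r
## Arbitrary counts, one per element of x
counts <- c(2, 0, 3)
out <- rep(c("a", "b", "c"), times = counts)
out
## [1] "a" "a" "c" "c" "c"

## The result has sum(counts) elements; a count of 0 drops that element
length(out) == sum(counts)
## [1] TRUE
```

Note that when times is a vector of length greater than one, it needs to have one entry per element of x.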
At any rate, the times version is what we want, since it lets us specify a different number of copies for each row of our original data.frame:
## Supply row numbers and a times argument
idx <- rep(1:nrow(df), times = 1:4)
idx
## [1] 1 2 2 3 3 3 4 4 4 4
df[idx, ]
## Type Size
## 1 Private Large
## 2 Public Large
## 2.1 Public Large
## 3 Private Small
## 3.1 Private Small
## 3.2 Private Small
## 4 Public Small
## 4.1 Public Small
## 4.2 Public Small
## 4.3 Public Small
table(df[idx, ])
## Size
## Type Large Small
## Private 1 3
## Public 2 4
From here, we have all of the tools to recreate our original data.frame from the table. You’ll need to be careful to put the counts in the correct order (down each column from top to bottom, moving through the columns from left to right), but our original table gives us the values we need for the times argument:
## Original values
with(college, table(Type, Size))
## Size
## Type Large Small
## Private 207 440
## Public 374 74
## Create idx for each row
idx <- rep(1:nrow(df), times = c(207, 374, 440, 74))
## Create data.frame
df <- df[idx, ]
head(df)
## Type Size
## 1 Private Large
## 1.1 Private Large
## 1.2 Private Large
## 1.3 Private Large
## 1.4 Private Large
## 1.5 Private Large
## Confirm correctness
table(df)
## Size
## Type Large Small
## Private 207 440
## Public 374 74
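Rather than typing the counts by hand, they can also be read off the table itself. The sketch below rebuilds the table from the counts printed above, so that it runs without downloading the data (tab, combos, and rebuilt are names introduced here). It shows that as.vector() returns the counts column by column, which is the same order as the rows produced by expand.grid():

```r
## Rebuild the 2 x 2 table from the counts printed above
tab <- as.table(matrix(c(207, 374, 440, 74), nrow = 2,
                       dimnames = list(Type = c("Private", "Public"),
                                       Size = c("Large", "Small"))))

## Counts in column-major order: down the first column, then the second
as.vector(tab)
## [1] 207 374 440  74

## Same combinations, in the same order as the counts
combos <- expand.grid(dimnames(tab))
rebuilt <- combos[rep(seq_len(nrow(combos)), times = as.vector(tab)), ]

## Cross-classifying the rebuilt data.frame recovers the table
all(table(rebuilt) == tab)
## [1] TRUE
```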
Question 1 Using what we learned above, create a data.frame based on the following table:
## Infarction
## Group Yes No
## Placebo 24 1356
## Aspirin 13 1367
Question 2 The Titanic dataset differentiates itself from others that we have seen in that it is a four-way table across the variables Class, Sex, Age, and Survived. Start with:
as.data.frame(Titanic)
and create a data.frame similar to the one created above. From here, recreate the following plot:
Functions are an important part of interacting with the R language, their primary utility being the ability to abstract specific instances of code to solve more general problems. For example, we can write a function to find the hypotenuse of a right triangle, given the lengths of its other two sides.
f <- function(x, y) {
sqrt(x^2 + y^2)
}
f(3,4)
## [1] 5
The main components of a function are its name, its arguments, and its body. By default, any function in R will return as its value the result of the last expression.
It’s worth noting that we can also specify a default value for each of our function arguments that will be overwritten if we pass a value in for it:
f <- function(x, y = 2) {
sqrt(x^2 + y^2)
}
f(3) # uses y = 2 by default
## [1] 3.6056
f(3, 4) # overwritten with y = 4
## [1] 5
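Arguments can also be matched by name, in which case their order no longer matters. A short sketch using the same hypotenuse calculation (named hyp and hyp2 here, purely for illustration, so as not to overwrite f):

```r
hyp <- function(x, y = 2) {
  sqrt(x^2 + y^2)
}

## Matched by position
hyp(3, 4)
## [1] 5

## Matched by name, in any order
hyp(y = 4, x = 3)
## [1] 5

## return() can be used to make the returned value explicit
hyp2 <- function(x, y = 2) {
  return(sqrt(x^2 + y^2))
}
hyp2(3, 4)
## [1] 5
```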
Question 3 Write a function that, given a table, will return the relative risk of the first two rows. Using the Aspirin study, verify that the RR is 1.82.
Question 4 Write a function that, given a table, will return the odds ratio of the first two rows. Using the Aspirin study, verify that \(\hat{\theta} = 1.83\).
A somewhat unique aspect of the R language is how it handles scoping. For example, a variable that is defined by itself in the global environment can be called directly:
## Define x and then use it
x <- 2
x^2
## [1] 4
Confusion arises, however, when a variable is defined with the same name in multiple places. For example, consider a data.frame with a column named x for which we wish to take the mean. What is being returned with df[[x]]?
df <- data.frame(x = 1:5, y = 21:25)
mean(df[[x]])
## [1] 23
Here, as our expression is being evaluated in the global environment, it first looks for x there. Finding that x = 2, df[[x]] is equivalent to df[[2]], giving us instead the mean value of the second column of df. If we wish to get the first column directly, we need to pass in a character string instead:
mean(df[["x"]])
## [1] 3
This can get confusing if you use the same name several times:
y <- "x"
mean(df[[y]])
## [1] 3
Note that this is in direct contrast to dplyr, which uses non-standard evaluation to refer to columns directly by name:
library(dplyr)
x <- "y"
summarize(df, mean(x))
## mean(x)
## 1 3
This works because data.frames in R function as “mini-environments” where the names of the columns make up the variables in the environment. In the code chunk above, summarize(df, mean(x)) is looking for the variable x within the data.frame df instead of checking the global environment, where x = "y".
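The with() function from the start of this lab behaves in a similar way: names are looked up among the columns of the data.frame first, and the calling environment (here, the global environment) is only consulted for names that are not columns. A small sketch, where dat and w are names made up for illustration:

```r
dat <- data.frame(x = 1:5, y = 21:25)
x <- 100   # global variable with the same name as a column
w <- 10    # global variable with no matching column

## x is found in dat, so the global x is never used
with(dat, mean(x))
## [1] 3

## w is not a column, so R falls back to the global environment
with(dat, mean(x) + w)
## [1] 13
```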
This has implications for how we interact with and subset data.frames inside of a function which has its own scoping rules:
f_dplyr <- function(z) {
summarize(df, mean(z))
}
## It's trying to find the value of mean("y")
x <- "y"
f_dplyr(x)
## mean(z)
## 1 NA
## We see that if we swap x with a number instead
x <- -99
f_dplyr(x)
## mean(z)
## 1 -99
If that wasn’t confusing enough, see what happens when we change the argument of f_dplyr to x instead of z:
f_dplyr_x <- function(x) {
summarize(df, mean(x))
}
## In both cases, it is using `x` from the df environment
f_dplyr_x("y")
## mean(x)
## 1 3
f_dplyr_x(-99)
## mean(x)
## 1 3
This is all just to say, using dplyr inside of functions can be a recipe for total disaster. Instead, we can rely on base R syntax for subsetting with character vectors:
f_base <- function(x) {
mean(df[[x]])
}
f_base("y") # Grabs correct column
## [1] 23
x <- "y"
f_base(x) # Still grabs correct column
## [1] 23
f_base(99) # error because there is no column 99
## Error in .subset2(x, i, exact = exact): subscript out of bounds
As a final note on scoping: when a function is called and a variable inside of it is used, R will first look inside the function for that variable. If it can’t be found, it then “moves up” a level to look in the environment in which the function was defined (here, the global environment); this is why we could use df inside of the functions above without passing it in directly. If a variable is defined inside of a function, though, R will use that instead. It is generally considered poor practice to use variables that are not defined in a function or passed in directly, especially as the code used becomes more complex.
x <- 4
f1 <- function() {
x^2
}
f2 <- function() {
x <- 2
x^2
}
f1() ## Finds x in global env
## [1] 16
f2() ## Finds x in function env
## [1] 4
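Two short sketches related to the points above. The first shows that a function looks up a missing variable where the function was defined, not inside whichever function happens to call it. The second shows one way to avoid relying on outside variables altogether, by passing the data.frame in as an argument. The names g, col_mean, data, col, and dat are made up for illustration:

```r
x <- 4
f1 <- function() {
  x^2
}
g <- function() {
  x <- 10   # local to g; f1 does not see this value
  f1()
}
g()
## [1] 16

## Pass in everything the function needs
col_mean <- function(data, col) {
  mean(data[[col]])
}
dat <- data.frame(x = 1:5, y = 21:25)
col_mean(dat, "y")
## [1] 23
```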
Question 5 This problem will use the tips dataset:
tips <- read.csv("https://collinn.github.io/data/tips.csv")
head(tips)
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
Modify your odds ratio function from Question 4 to take three arguments: a data.frame and two character strings naming the variables to use for the rows and columns of the table. It should generate output like this:
oddsratio(tips, "sex", "smoker")
## [1] 1.0122
oddsratio(tips, "smoker", "time")
## [1] 0.77397
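Question 6 (Bonus) Modify the function from Question 5 so that it also prints the table and accepts flipRow and flipCol arguments that reverse the order of the rows and columns, producing output like this: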
oddsratio2(tips, "time", "smoker")
##
## No Yes
## Dinner 106 70
## Lunch 45 23
## [1] 0.77397
oddsratio2(tips, "time", "smoker", flipRow = TRUE)
##
## No Yes
## Lunch 45 23
## Dinner 106 70
## [1] 1.292
oddsratio2(tips, "time", "smoker", flipCol = TRUE)
##
## Yes No
## Dinner 70 106
## Lunch 23 45
## [1] 1.292
oddsratio2(tips, "time", "smoker", flipCol = TRUE, flipRow = TRUE)
##
## Yes No
## Lunch 23 45
## Dinner 70 106
## [1] 0.77397