knitr::opts_chunk$set(echo = TRUE, 
                      fig.align = 'center', 
                      fig.width = 4, 
                      fig.height = 4, 
                      message = FALSE, warning = FALSE)
library(ggplot2)
library(dplyr)

theme_set(theme_bw(base_size = 16))

Making tables and data.frames

Frequently in this class (and elsewhere), we will find that one of a table or a data.frame is more convenient than the other for the task at hand. Converting a data.frame to a table is simple enough and is likely something we have done many times before:

## Read in college dataset
college <- read.csv("https://collinn.github.io/data/college2019.csv")

## Construct new size variable based on enrollment
college$Size <- ifelse(college$Enrollment < 2500, "Small", "Large")

## Cross classify values of observations
with(college, table(Type, Size))
##          Size
## Type      Large Small
##   Private   207   440
##   Public    374    74

Note here the use of the with() function, which takes a data.frame as its first argument and, as its second, an expression that uses the column names of that data.frame. This is in contrast to, for example, extracting the vectors individually:

## More typing and loses "Type" and "Size" headers
table(college$Type, college$Size)
##          
##           Large Small
##   Private   207   440
##   Public    374    74
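
If the `$` form is preferred, naming the arguments passed to table() should also keep the headers. A minimal sketch, using a small made-up data.frame (`toy` and its values are hypothetical, not the college data):

```r
## Small made-up data.frame standing in for `college` (hypothetical values)
toy <- data.frame(Type = c("Private", "Public", "Private", "Public"),
                  Size = c("Small", "Large", "Small", "Small"))

## Naming the arguments supplies the dimension headers
tab_named <- table(Type = toy$Type, Size = toy$Size)

## The with() version, for comparison
tab_with <- with(toy, table(Type, Size))

## Both tables carry the "Type" and "Size" headers
names(dimnames(tab_named))
names(dimnames(tab_with))
```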

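Moving from a table to a data.frame is less obvious, however. To start, it’s worthwhile to ask ourselves what we need this data.frame to look like. From the table, we can anticipate that it will need one column for Type and one for Size, with one row for each of the observations counted in the table.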

We begin by creating a data.frame that contains all combinations of values we see in our table using the expand.grid() function. This function works by taking a collection of vectors and building every combination of their values:

df <- expand.grid(Type = c("Private", "Public"),
                  Size = c("Large", "Small"))

## data.frame with values
df
##      Type  Size
## 1 Private Large
## 2  Public Large
## 3 Private Small
## 4  Public Small
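
The same idea extends past two variables. As a small sketch (the `Region` variable and its values are made up for illustration), the number of rows is the product of the lengths of the vectors supplied:

```r
## Three variables with 2, 2, and 3 values (Region is hypothetical)
grid3 <- expand.grid(Type = c("Private", "Public"),
                     Size = c("Large", "Small"),
                     Region = c("East", "Midwest", "West"))

## One row per combination: 2 * 2 * 3
nrow(grid3)

## The first variable cycles fastest, the last one slowest
head(grid3, 4)
```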

Having established our values, we now need to expand the data.frame to the appropriate size. To do this, we will use two ideas: indexing and replication.

Indexing

We are likely aware that we can supply integer values in the row position of a data.frame to indicate which rows we would like to extract:

## Get second row
df[2, ]
##     Type  Size
## 2 Public Large

We can also use this to change the order of the rows:

## Change order
df[c(2,1,4,3), ]
##      Type  Size
## 2  Public Large
## 1 Private Large
## 4  Public Small
## 3 Private Small

In fact, this generalizes to any set of integers supplied, so long as they correspond with a row number:

## We can extract the same row multiple times
df[c(2,2,2,2,2), ]
##       Type  Size
## 2   Public Large
## 2.1 Public Large
## 2.2 Public Large
## 2.3 Public Large
## 2.4 Public Large
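
The row labels 2.1, 2.2, and so on are R keeping the row names unique. If plain 1, 2, 3 labels are preferred, one option is to reset the row names. A minimal sketch (`df_demo` and `rows` are illustrative names, rebuilt here so the chunk runs on its own):

```r
## Same grid as above, rebuilt so this sketch is self-contained
df_demo <- expand.grid(Type = c("Private", "Public"),
                       Size = c("Large", "Small"))

## Extract the second row three times
rows <- df_demo[c(2, 2, 2), ]

## Setting the row names to NULL restores the default 1, 2, 3 labels
rownames(rows) <- NULL
rows
```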

Having an idea of how to use row numbers to reproduce entries, we now consider how to specify how many of each we need.

Replication

Replication means several things in R that we will interact with in the coming weeks. For now, it is sufficient to introduce the rep() function. The rep() function takes several arguments, but two matter for now: a vector to replicate and the number of times to replicate it:

rep(x = 1:3, times = 2)
## [1] 1 2 3 1 2 3

The times argument is conceptually a bit weird, as it doesn’t always behave the way we might expect. Consider what happens when I give it the following:

rep(1:3, times = c(2,2,2))
## [1] 1 1 2 2 3 3

When times gets a vector argument, it repeats each element of 1:3 the corresponding number of times. When it gets a single integer, it repeats the entire vector supplied to x that many times. This is in contrast to the each argument, which only takes a length-one integer and repeats every element that many times, just as times does when supplied a vector of identical values.

rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
## The 3 is ignored since it only takes a single integer value
rep(1:3, each = c(2,3))
## [1] 1 1 2 2 3 3
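
The vector form of times becomes more interesting when the counts differ. A short sketch with made-up values (`vals` and `counts` are illustrative names); note that the counts are matched to the elements by position, so the two vectors should have the same length:

```r
## Made-up values and counts
vals <- c("a", "b", "c")
counts <- c(3, 1, 2)

## Each value is repeated by the count in the same position
out <- rep(vals, times = counts)
out

## The result has sum(counts) elements
length(out) == sum(counts)
```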

At any rate, the times version is what we want, since it lets us specify a different number of copies for each row of our original data.frame:

## Supply row numbers and a times argument
idx <- rep(1:nrow(df), times = 1:4)
idx
##  [1] 1 2 2 3 3 3 4 4 4 4
df[idx, ]
##        Type  Size
## 1   Private Large
## 2    Public Large
## 2.1  Public Large
## 3   Private Small
## 3.1 Private Small
## 3.2 Private Small
## 4    Public Small
## 4.1  Public Small
## 4.2  Public Small
## 4.3  Public Small
table(df[idx, ])
##          Size
## Type      Large Small
##   Private     1     3
##   Public      2     4

Putting it together

From here, we have all of the tools to recreate our original data.frame from the table. The counts supplied to the times argument must be in the same order as the rows of df, which means reading the table top to bottom within each column, moving left to right across columns:

## Original values
with(college, table(Type, Size))
##          Size
## Type      Large Small
##   Private   207   440
##   Public    374    74
## Create idx for each row
idx <- rep(1:nrow(df), times = c(207, 374, 440, 74))

## Create data.frame
df <- df[idx, ]
head(df)
##        Type  Size
## 1   Private Large
## 1.1 Private Large
## 1.2 Private Large
## 1.3 Private Large
## 1.4 Private Large
## 1.5 Private Large
## Confirm correctness
table(df)
##          Size
## Type      Large Small
##   Private   207   440
##   Public    374    74
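
Typing the counts by hand makes it easy to get the order wrong. One way to check the order, sketched here on a small made-up table (`tab`, `grid`, and `long` are illustrative names), is to unroll the table with as.vector(), which goes down the first column and then the second. This is the same order in which expand.grid() lays out its rows.

```r
## Small made-up 2x2 table of counts
tab <- as.table(matrix(c(3, 1, 2, 4), nrow = 2,
                       dimnames = list(Group = c("A", "B"),
                                       Outcome = c("Yes", "No"))))

## expand.grid() accepts the list of dimension names directly
grid <- expand.grid(dimnames(tab))

## Counts unrolled column by column, matching the rows of grid
counts <- as.vector(tab)

## Repeat each row of grid by its count
long <- grid[rep(seq_len(nrow(grid)), times = counts), ]

## Tabulating the long data.frame should give back the counts in tab
table(long)
```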

Question 1 Using what we learned above, create a data.frame based on the following table:

##          Infarction
## Group      Yes   No
##   Placebo   24 1356
##   Aspirin   13 1367

Question 2 The Titanic dataset differs from others that we have seen in that it is a four-way table across the variables Class, Sex, Age, and Survived. Start with:

as.data.frame(Titanic)

and create a data.frame similar to the one created above. From here, recreate the following plot:

Functions

Functions are an important part of interacting with the R language, their primary utility being the ability to abstract specific instances of code to solve more general problems. For example, we can write a function to find the hypotenuse of a right triangle, given the lengths of its two legs.

f <- function(x, y) {
  sqrt(x^2 + y^2)
}

f(3,4)
## [1] 5

The main components of a function are its name, its arguments, and its body. By default, any function in R will return as its value the result of the last expression.
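
To make the return behavior concrete, here is a small sketch comparing the implicit form with an explicit return(). The function names and the negative-length check are made up for illustration:

```r
## Implicit: the value of the last expression is returned
hyp_implicit <- function(x, y) {
  sqrt(x^2 + y^2)
}

## Explicit: return() hands back a value and exits the function early
hyp_explicit <- function(x, y) {
  if (x < 0 || y < 0) {
    return(NA)  # side lengths should not be negative
  }
  sqrt(x^2 + y^2)
}

hyp_implicit(3, 4)
hyp_explicit(3, 4)
hyp_explicit(-3, 4)
```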

It’s worth noting that we can also specify a default value for each of our function arguments that will be overwritten if we pass a value in for it:

f <- function(x, y = 2) {
  sqrt(x^2 + y^2)
}

f(3) # uses y = 2 by default
## [1] 3.6056
f(3, 4) # overwritten with y = 4
## [1] 5
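
Arguments can also be matched by name rather than by position, in which case their order no longer matters. A minimal sketch, using the same function body under a different (illustrative) name:

```r
## Same body as f above, renamed so it does not overwrite f
hyp <- function(x, y = 2) {
  sqrt(x^2 + y^2)
}

hyp(3, 4)          # matched by position: x = 3, y = 4
hyp(y = 4, x = 3)  # matched by name, so the order does not matter
hyp(x = 3)         # y falls back to its default of 2
```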

Question 3 Write a function that, given a table, will return the relative risk of the first two rows. Using the Aspirin study, verify that the RR is 1.82.

Question 4 Write a function that, given a table, will return the odds ratio of the first two rows. Using the Aspirin study, verify that \(\hat{\theta} = 1.83\)

Functions, data.frames, and dplyr

A somewhat unique aspect of the R language is how it handles scoping. For example, a variable that is defined by itself in the global environment can be called directly:

## Define x and then use it
x <- 2
x^2
## [1] 4

Confusion arises, however, when a variable is defined with the same name in multiple places. For example, consider a data.frame with a column named x for which we wish to take the mean. What is being returned with df[[x]]?

df <- data.frame(x = 1:5, y = 21:25)
mean(df[[x]])
## [1] 23

Here, as our expression is being evaluated in the global environment, it first looks for x there. Finding that x=2, df[[x]] is equivalent to df[[2]], giving us instead the mean value of the second column of df. If we wish to get the first column directly, we need to pass in a character string instead:

mean(df[["x"]])
## [1] 3

This can get confusing if you use the same name several times:

y <- "x"
mean(df[[y]])
## [1] 3

Note that this is in direct contrast to dplyr, which uses non-standard evaluation so that column names can be used directly:

library(dplyr)
x <- "y"
summarize(df, mean(x))
##   mean(x)
## 1       3

This works because data.frames in R function as “mini-environments” where the names of the columns make up the variables in the environment. In the code chunk above, summarize(df, mean(x)) looks for the variable x within the data.frame df first, and so never reaches the global environment where x = "y".
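
The with() function from earlier shows the same lookup order without dplyr. A small sketch (`demo` and `shift` are illustrative names): names that match a column are taken from the data.frame, and anything else is looked up outside of it.

```r
## A global x that is NOT the column
x <- "y"
demo <- data.frame(x = 1:5, y = 21:25)

## Inside with(), the column named x is found first
with(demo, mean(x))

## shift is not a column, so it is found in the global environment
shift <- 100
with(demo, mean(x + shift))
```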

This has implications for how we interact with and subset data.frames inside of a function which has its own scoping rules:

f_dplyr <- function(z) {
  summarize(df, mean(z))
}

## It's trying to find the value of mean("y")
x <- "y"
f_dplyr(x)
##   mean(z)
## 1      NA
## We see that if we swap x with a number instead
x <- -99
f_dplyr(x)
##   mean(z)
## 1     -99

If that wasn’t confusing enough, see what happens when we change the argument of f_dplyr to x instead of z:

f_dplyr_x <- function(x) {
  summarize(df, mean(x))
}

## In both cases, it is using `x` from the df environment
f_dplyr_x("y")
##   mean(x)
## 1       3
f_dplyr_x(-99)
##   mean(x)
## 1       3

All of this is to say that using dplyr inside of functions can be a recipe for total disaster. Instead, we can rely on base R syntax for subsetting with character vectors:

f_base <- function(x) {
  mean(df[[x]])
}

f_base("y") # Grabs correct column
## [1] 23
x <- "y"
f_base(x) # Still grabs correct column
## [1] 23
f_base(99) # error because there is no column 99
## Error in .subset2(x, i, exact = exact): subscript out of bounds
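
For completeness, dplyr does provide a way to refer to a column by a character string inside its verbs: the .data pronoun. A minimal sketch, assuming a reasonably recent version of dplyr is installed (`f_pronoun`, `demo`, and `avg` are illustrative names, not part of the lab):

```r
library(dplyr)

demo <- data.frame(x = 1:5, y = 21:25)

## .data[[col]] looks up the column whose name is stored in col
f_pronoun <- function(data, col) {
  summarize(data, avg = mean(.data[[col]]))
}

f_pronoun(demo, "y")
f_pronoun(demo, "x")
```

The base R version above remains the simpler choice for this lab; this is only meant to show that the string-based approach has a dplyr counterpart.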

As a final note on scoping: when a function is called and a variable inside of it is used, R will first look inside the function for that variable. If it can’t be found, it then “moves up” a level to look in the environment in which the function was defined, which for the functions in this lab is the global environment; this is why we could use df inside of the functions above without passing it in directly. If a variable is defined inside of a function, though, it will use that instead. It is generally considered poor practice to use variables that are not defined in a function or passed in directly, especially as the code used becomes more complex.

x <- 4

f1 <- function() {
  x^2
}

f2 <- function() {
  x <- 2
  x^2
}

f1() ## Finds x in global env
## [1] 16
f2() ## Finds x in function env
## [1] 4
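
One way to follow the advice above is to pass the data.frame in as an argument rather than relying on one that happens to exist outside the function. A minimal sketch (`col_mean`, `demo_a`, and `demo_b` are illustrative names):

```r
## Everything the function needs arrives through its arguments
col_mean <- function(data, col) {
  mean(data[[col]])
}

demo_a <- data.frame(x = 1:5, y = 21:25)
demo_b <- data.frame(x = c(10, 20), y = c(0, 1))

## The same function now works on any data.frame
col_mean(demo_a, "y")
col_mean(demo_b, "x")
```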

Question 5 This problem will use the tips dataset:

tips <- read.csv("https://collinn.github.io/data/tips.csv")
head(tips)
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

Modify your odds ratio function from Question 4 to take three arguments: a data.frame and two character strings naming the variables to use for the rows and the columns of the table. It should generate output like this:

oddsratio(tips, "sex", "smoker")
## [1] 1.0122
oddsratio(tips, "smoker", "time")
## [1] 0.77397

Question 6 (Bonus) Modify the function from Question 5 so that:

  • It prints out the table associated with the odds ratio
  • It takes an additional argument that allows you to flip the columns or the rows

oddsratio2(tips, "time", "smoker")
##         
##           No Yes
##   Dinner 106  70
##   Lunch   45  23
## [1] 0.77397
oddsratio2(tips, "time", "smoker", flipRow = TRUE)
##         
##           No Yes
##   Lunch   45  23
##   Dinner 106  70
## [1] 1.292
oddsratio2(tips, "time", "smoker", flipCol = TRUE)
##         
##          Yes  No
##   Dinner  70 106
##   Lunch   23  45
## [1] 1.292
oddsratio2(tips, "time", "smoker", flipCol = TRUE, flipRow = TRUE)
##         
##          Yes  No
##   Lunch   23  45
##   Dinner  70 106
## [1] 0.77397