Introduction

Titanic Data

For this lab we will be using the Titanic dataset built into R, providing information on the fate of the passengers on the fatal maiden voyage of the ocean liner Titanic summarized according to economic status (class), sex, age, and survival. See ?Titanic for more details.

As the Titanic dataset in R is, by default, stored as a 4-dimensional array of counts, we will start by transforming it into a data.frame with one row per person.

library(ggplot2)
library(dplyr)

theme_set(theme_bw())


## Data for lab
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

## head() shows us the first 6 rows of a data.frame
head(titanic)
##     Class  Sex   Age Survived
## 3     3rd Male Child       No
## 3.1   3rd Male Child       No
## 3.2   3rd Male Child       No
## 3.3   3rd Male Child       No
## 3.4   3rd Male Child       No
## 3.5   3rd Male Child       No
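
To see what the rep() line above is doing, here is a minimal sketch on a tiny made-up frequency table (the counts data frame and its values are invented for illustration and are not part of the Titanic data). Each row number is repeated Freq times, so a row with a count of 0 disappears entirely, which is presumably why the row names in head(titanic) start at 3 rather than 1.

```r
## A tiny, made-up frequency table (hypothetical values)
counts <- data.frame(
  Group = c("a", "b", "c"),
  Freq  = c(2, 0, 3)
)

## rep() repeats each row number Freq times; row 2 has Freq = 0 and is dropped
idx <- rep(1:nrow(counts), times = counts$Freq)
idx

## Indexing by the repeated row numbers gives one row per observation
long <- counts[idx, ]
long$Freq <- NULL
long
```

The result has five rows (two for "a", three for "c"), matching the total of the Freq column.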

Our goal in this lab is to familiarize ourselves with the use of tables in R. To this end, there are two primary functions we will be concerning ourselves with:

  • The table() function which returns a count of categorical data
  • The proportions() function which takes a table as an argument and returns a (optionally conditional) table of proportions

In addition to these, we will introduce four helper functions to assist us:

  • The “pipe” operator, %>% (Ctrl+Shift+M), which takes the results of the left hand side of the pipe and passes them to the right hand side. This is helpful for “chaining together” a sequence of operations. We will use this extensively later in the semester
  • The sort() function will sort either a table or numeric vector in increasing or decreasing order
  • The addmargins() function will sum the margins of a table
  • We can use facet_grid to facet two nested categorical variables in ggplot2

You can learn more about these functions, and see some additional examples, by using the help documentation (e.g., ?table).

“Pipe” Operator

The pipe operator, %>% (Ctrl+Shift+M) is a special type of operator (similar to +, for example) that is included in the R package dplyr which we can load into our R session with library(dplyr). While not essential, the pipe operator facilitates interactive programming by “chaining together” sequences of operations; it works by taking the output on the left hand side and putting it as the first argument on the right hand side.

For example, the standard deviation of a vector is the square root of its variance. Consider two ways that we could compute it, with and without the pipe operator:

## Create random vector x with 50 entries
# (rnorm = random normal)
x <- rnorm(50)

## Without pipe
sqrt(var(x))
## [1] 1.0113
## With pipe
var(x) %>% sqrt()
## [1] 1.0113
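
As a further sketch, the pipe can chain more than two steps, and the right hand side can still take extra arguments of its own. The vector x below is made up for illustration; the point is only that each step receives the previous result as its first argument, so round(2) on the right of a pipe means round(previous_result, 2).

```r
library(dplyr)  # provides the %>% operator

## A small made-up vector
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

## Nested calls are read from the inside out
nested <- round(sqrt(var(x)), 2)

## Piped calls are read left to right
piped <- x %>% var() %>% sqrt() %>% round(2)

## Both give the same value, which also matches the built-in sd()
identical(nested, piped)
all.equal(piped, round(sd(x), 2))
```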

Using Tables

The first part of this lab will be oriented around the construction and manipulation of one- and two-way tables in R.

One-way tables

The first function we will introduce is the table() function; in its most basic form, table() takes as its argument a single vector, returning a named numeric vector detailing the quantities of each category. There are two ways to express this in R, though I tend to prefer the first. This is both because the first method prints out the name of the variable corresponding to the table and it omits the ungainly use of $ for selecting variables (this issue is more evident when considering two-way tables, as we will see):

## How many people lived or died on the Titanic
with(titanic, table(Survived))
## Survived
##   No  Yes 
## 1490  711
## How many men and women were on the Titanic
table(titanic$Sex)
## 
##   Male Female 
##   1731    470

Each gives us an example of a frequency table. Note that this corresponds directly with a univariate bar chart that we saw previously

ggplot(titanic, aes(Sex)) + geom_bar(fill = 'hotpink1')

Additionally, we see in both cases that all of our observations are included in the count, as both of the table totals sum up to 2201. We can verify this with the helper function addmargins which will include a “Sum” column adding up the table

## Assign table to tab variable
tab <- with(titanic, table(Survived))

## Adds the Sum value
addmargins(tab)
## Survived
##   No  Yes  Sum 
## 1490  711 2201

Notice how in this case we first assigned the result of table() to a variable, tab (though you could name it whatever you wanted) and then passed that variable to addmargins(). This is an excellent example of a process that is facilitated with the pipe operator, %>% (Ctrl+Shift+M):

## This is equivalent to what we saw above
with(titanic, table(Survived)) %>% addmargins()
## Survived
##   No  Yes  Sum 
## 1490  711 2201

We may also be interested in identifying the proportion of individuals who survived or died on the Titanic. Similar to the addmargins() function, we can pass the results of table() to the function proportions()

with(titanic, table(Survived)) %>% proportions()
## Survived
##      No     Yes 
## 0.67697 0.32303

Here we see that 68% of the passengers aboard the Titanic died while 32% survived. This is the same information that we would have found had we computed the values by hand:

\[ \text{% Dead} = \frac{\# \text{Dead}}{\text{Total Passengers}} = \frac{1490}{2201} = 0.677 \]

Finally, let’s introduce the sort() function, which takes either a one-way table or a vector and sorts the values from smallest to largest (or in alphabetical order if they are character strings)

## Unsorted table
with(titanic, table(Class))
## Class
##  1st  2nd  3rd Crew 
##  325  285  706  885
## Table sorted smallest to largest
with(titanic, table(Class)) %>% sort()
## Class
##  2nd  1st  3rd Crew 
##  285  325  706  885
## Table sorted largest to smallest
with(titanic, table(Class)) %>% sort(decreasing = TRUE)
## Class
## Crew  3rd  1st  2nd 
##  885  706  325  285

There are a few things to note about the sort() function:

  1. Like addmargins() and proportions(), we can pass an argument to sort() using the pipe operator %>%
  2. By default, sort() reorders observations from smallest to largest. We can change this behavior by adding decreasing = TRUE as an argument to sort() as we did above
  3. When we have a named vector, i.e., a set of values with names, such as the case with the table above, sort() will sort the vector according to its values, not its names.
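
To illustrate the third point, here is a short sketch contrasting sorting by value with sorting by name. Sorting by name is not something sort() does for us here, so this sketch uses order() on names(tab) instead; that is one possible approach, not the only one.

```r
## Rebuild the lab data so this example runs on its own
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

tab <- with(titanic, table(Class))

## sort() uses the counts: largest count first
by_count <- sort(tab, decreasing = TRUE)
names(by_count)

## order() on the names gives positions that sort by name instead
by_name <- tab[order(names(tab), decreasing = TRUE)]
names(by_name)
```

Note that 1st and 2nd trade places between the two results: 1st has the larger count, but 2nd comes later alphabetically.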

Let’s practice using some of these functions on the Titanic dataset.

Question 1

Part A Create a frequency table using the titanic data set to find how many children and adults were on board the Titanic.

Part B Determine what percentage of the passengers on-board the Titanic were adults.

Part C Determine what percentage of the passengers on-board the Titanic were members of the crew.

Two-way Tables

The two-way table, as the name suggests, is a table of two categorical variables. In R, this can be done by passing each of the two vectors in as arguments, with the first vector becoming the rows and the second becoming the columns:

## Sex as row
with(titanic, table(Sex, Survived))
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344
## Sex as column
with(titanic, table(Survived, Sex))
##         Sex
## Survived Male Female
##      No  1364    126
##      Yes  367    344

You can, as before, use only the table() function with the $ operator to extract vectors, though in addition to losing the variable name in the table, you also have to duplicate much of your typing

table(titanic$Sex, titanic$Survived)
##         
##            No  Yes
##   Male   1364  367
##   Female  126  344

The two-way table gives us frequencies of the cross-sections of groups: for example, in the one-way table we saw that there were 470 women on-board the Titanic. By also including the Survived variable, I can see that of the 470 women, 126 died while 344 lived. Additionally, of the 711 individuals who survived, the two-way table shows me that 367 were men and 344 were women. Note that this table corresponds to both of the following plots (note that it is customary for the row variable to serve as the x-axis)

## See ?scale_fill_brewer for more palette colors
ggplot(titanic, aes(x = Sex, fill = Survived)) + 
  geom_bar(position = "dodge") + 
  scale_fill_brewer(palette = "Set2")

We can recover the information from the one-way tables by using addmargins() to give us row and column totals

## Add row and column margins to the table
with(titanic, table(Sex, Survived)) %>% addmargins()
##         Survived
## Sex        No  Yes  Sum
##   Male   1364  367 1731
##   Female  126  344  470
##   Sum    1490  711 2201

We should see that the Sum row and column are exactly the one-way tables we created in the last section for their corresponding variables. In both cases, they sum to 2201 in the bottom right corner.


Just as before, we can pass a two-way table to the proportions() function to return a table of proportions:

with(titanic, table(Sex, Survived)) %>% proportions()
##         Survived
## Sex            No      Yes
##   Male   0.619718 0.166742
##   Female 0.057247 0.156293

By default, this returns absolute proportions, calculated by dividing each of the entries in the two-way table by its total of 2201. This table of proportions tells us, for example, that of all of the passengers who were on the Titanic, 62% of them were males who died.

We can specify conditional proportions by passing in an additional argument called margin to the proportions() function. In R, 1 refers to rows and 2 refers to columns: so, in order to compute a table of proportions conditional on the row variable (meaning that proportions are taken within the row), we will pass the argument margin = 1 to the proportions() function:

# Compute row proportions
with(titanic, table(Sex, Survived)) %>% proportions(margin = 1)
##         Survived
## Sex           No     Yes
##   Male   0.78798 0.21202
##   Female 0.26809 0.73191

We can be sure that we are conditioning on the row margins because the sum of each row is equal to 1.
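
As a quick check of this claim, the sketch below sums the rows and columns of the conditional table using the base R functions rowSums() and colSums(), and then reproduces the row proportions by hand by dividing each entry by its row total.

```r
## Rebuild the lab data so this example runs on its own
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

tab <- with(titanic, table(Sex, Survived))
row_props <- proportions(tab, margin = 1)

## Each row totals 1 because we conditioned on the row variable
rowSums(row_props)

## The columns, by contrast, generally do not total 1
colSums(row_props)

## The same proportions by hand: each entry divided by its row total
by_hand <- tab / rowSums(tab)
all.equal(as.vector(row_props), as.vector(by_hand))
```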

From this, we see that given that an individual was male, 21% survived while 79% did not. Similarly, given that an individual was female, we see that 27% were cast to a watery grave while 73% went to normal graves at some other point in their lives. In this example, we are conditioning on sex, our explanatory variable, with survival status serving as our response.

The variable we condition on should also serve as the x-axis in any conditional plots we make

ggplot(titanic, aes(Sex, fill = Survived)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Purples")

Finding column proportions works the same way, passing margin = 2 into our function instead

# Compute column proportions
with(titanic, table(Sex, Survived)) %>% proportions(margin = 2)
##         Survived
## Sex            No      Yes
##   Male   0.915436 0.516174
##   Female 0.084564 0.483826

Doing so changes which variable we include as the x-axis on our plot

ggplot(titanic, aes(Survived, fill = Sex)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Accent")

We will wrap up this section by showing a slightly more detailed use of addmargins() for two-way tables. Just like proportions(), addmargins() also takes a margin argument telling it which dimension to sum over, though its convention is unfortunately the reverse of what proportions() would lead us to expect: margin = 1 adds a Sum row (totals for each column), while margin = 2 adds a Sum column (totals for each row):

## Row and column margins
with(titanic, table(Sex, Age)) %>% addmargins()
##         Age
## Sex      Child Adult  Sum
##   Male      64  1667 1731
##   Female    45   425  470
##   Sum      109  2092 2201
## Adds a row for sums (across the columns)
with(titanic, table(Sex, Age)) %>% addmargins(margin = 1)
##         Age
## Sex      Child Adult
##   Male      64  1667
##   Female    45   425
##   Sum      109  2092
## Adds a column for sums (across the rows)
with(titanic, table(Sex, Age)) %>% addmargins(margin = 2)
##         Age
## Sex      Child Adult  Sum
##   Male      64  1667 1731
##   Female    45   425  470
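
One way to keep the two conventions straight is to use the functions together. In the sketch below, proportions(margin = 1) computes proportions within each row, and addmargins(margin = 2) then appends a Sum column totalling across each row, so every entry of that column should be 1.

```r
## Rebuild the lab data so this example runs on its own
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

tab <- with(titanic, table(Sex, Age))

## Proportions within each row (conditioning on Sex)
row_props <- proportions(tab, margin = 1)

## Append a Sum column; each row of proportions totals 1
out <- addmargins(row_props, margin = 2)
out
```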

Let’s conclude this section with a little bit of practice.

Question 2

Part A How many children were included in second class?

Part B What percentage of the crew survived? How about children? (You will need two different tables to answer this question)

Part C What proportion of individuals who survived were members of the crew? Construct the plot associated with the table you create.

Three-way tables

A natural extension of the two-way table is the three-way table (and four-way, and so on). These differ from the two-way and one-way tables in that switching the order of the variables is no longer as simple as changing out the rows and columns. We won’t be asked to do much with three-way tables, but it is worth considering what information can be gained from them. Consider for example the two tables below:

## Table 1
with(titanic, table(Class, Sex, Survived))
## , , Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   422    106
##   Crew  670      3
## 
## , , Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    62    141
##   2nd    25     93
##   3rd    88     90
##   Crew  192     20
## Table 2
with(titanic, table(Sex, Class, Survived))
## , , Survived = No
## 
##         Class
## Sex      1st 2nd 3rd Crew
##   Male   118 154 422  670
##   Female   4  13 106    3
## 
## , , Survived = Yes
## 
##         Class
## Sex      1st 2nd 3rd Crew
##   Male    62  25  88  192
##   Female 141  93  90   20
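
Two base R functions that are not otherwise used in this lab can make three-way tables easier to read, and are sketched here as optional extras. ftable() prints the whole table as a single flat grid, and margin.table() sums over the dimensions we leave out, which lets us recover a two-way table from a three-way one.

```r
## Rebuild the lab data so this example runs on its own
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

tab3 <- with(titanic, table(Class, Sex, Survived))

## Print the three-way table as one flat table
ftable(tab3)

## Keep dimensions 2 (Sex) and 3 (Survived), summing over Class
sex_surv <- margin.table(tab3, margin = c(2, 3))
sex_surv
```

The collapsed table should match the Sex by Survived two-way table from the previous section.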

Complementary to the three-way table, we introduce here the ggplot2 function facet_grid(); it works just as facet_wrap(), except it takes two categorical variables and creates a grid of facets

ggplot(titanic, aes(Class)) + 
  geom_bar() + 
  facet_grid(Survived ~ Sex)

There are no questions for this section, but you should be aware of the facet_grid() function and how it works.

Odds Ratios and Contingency Tables

In class, we introduced the contingency table, a special two-way table made up of binary variables, common in fields of epidemiology and biostatistics:

               Event   Non-Event
Exposure         A         B
No Exposure      C         D

Given a table such as this, we are often interested in asking if the response variable (columns) is associated with the explanatory variable (rows). While there are a multitude of methods for doing so, we have limited our discussion for now to the topic of odds and odds ratios.

While there are several packages available in R to facilitate this, most of them include machinery beyond what we currently need. As such, we will briefly detail how to do it manually here.

First, we should understand that data in R (including matrices and tables) are stored in column order, meaning that if we were to index all of the elements from 1 to the last, we would begin with the first element being in the top left and then iteratively proceed down the column. For example, consider this matrix with multiples of ten as its entries:

mm <- matrix(10*1:4, nrow = 2)
mm
##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40

The “first” element of this matrix is 10 since it is in the top left, and we can access it similar to how we selected elements of a vector in the previous lab

## The first element is 10
mm[1]
## [1] 10

Moving vertically down the column, we see that our second should be 20. This ends the first column so we start again at the next, giving us our third as 30, and finally the fourth as 40.

## The second element is 20, in the same column as 10
mm[2]
## [1] 20
## The third element is 30, starting at the top of the next column
mm[3]
## [1] 30
## And so on
mm[4]
## [1] 40

Now consider a table from the titanic data looking at survival for men and women

tab <- with(titanic, table(Sex, Survived))
tab
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344

The indexing works the same. If I wanted to find the total number of men who survived, I would start counting down from the top left and see that it is the third position. As such, I would extract it with the index 3

tab[3]
## [1] 367
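
If counting positions feels error-prone, the same cell can also be selected by row and column, either by number or by name. This is a sketch of equivalent alternatives and is not required for the lab; the position arithmetic in the last line simply restates the column-order rule described above.

```r
## Rebuild the lab data so this example runs on its own
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

tab <- with(titanic, table(Sex, Survived))

tab[3]               # position 3, counting down the columns
tab[1, 2]            # row 1, column 2
tab["Male", "Yes"]   # the same cell, selected by name

## Position of row r, column c in column order: (c - 1) * nrow + r
(2 - 1) * nrow(tab) + 1
```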

To find the odds of a male dying, we then divide the two numbers in the first row; from above, we know these are positions 1 and 3

## odds of man dying
tab[1] / tab[3]
## [1] 3.7166

This tells us that 3.7 men died for each one that lived. For women, the computation is similar, selecting the second and fourth positions:

## odds of woman dying
tab[2] / tab[4]
## [1] 0.36628

Here we see that 0.37 women died for each one that lived. In order to see how the odds differ between groups, we then evaluate the odds ratio, the ratio between the odds for these two groups

man_odds <- tab[1] / tab[3]
woman_odds <- tab[2] / tab[4]

man_odds / woman_odds
## [1] 10.147

We find that the odds of a man dying aboard the Titanic were 10 times that of a woman dying, suggesting that there may be some association between these two variables. To facilitate our work for this lab, I have included a small function here that, given a contingency table, will report back the odds ratio for you.

## Function called oddsratio()
oddsratio <- function(x) {
  (x[1] * x[4]) / (x[2] * x[3])
}

## Create a 2x2 table
tab <- with(titanic, table(Sex, Survived))

## Pass table as argument to oddsratio()
oddsratio(tab)
## [1] 10.147
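
To see where the formula inside oddsratio() comes from, label the cells as in the contingency table from the start of this section. Because R counts down the columns, x[1] = A, x[2] = C, x[3] = B, and x[4] = D, so the ratio of the two odds simplifies as follows:

\[ \text{Odds Ratio} = \frac{A / B}{C / D} = \frac{A \times D}{B \times C} = \frac{x[1] \times x[4]}{x[3] \times x[2]} \]

For the Sex by Survived table this gives

\[ \frac{1364 \times 344}{367 \times 126} = \frac{469216}{46242} \approx 10.147 \]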

As a critical note about odds and tables – when reporting odds, we need to be conscientious of what is our “success” or “event” and what is considered “failure” or “no event”. Our formula assumes the first column of our table to be event. In the case of Titanic data, this refers to death, so when we speak of odds, we are talking about the odds of dying. If we are interested in discussing the odds of living, we need to change the order of the columns. We do that like this:

## This is our original table
tab
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344
## Here, we switch columns. Numbers placed after the comma
# refer to column numbers. In this case, we are
# saying that we want column 2 first, and then column 1
tab[, c(2, 1)]
##         Survived
## Sex       Yes   No
##   Male    367 1364
##   Female  344  126
## I can save these results to a new table, tab2, leaving tab unchanged
tab2 <- tab[, c(2, 1)]
tab2
##         Survived
## Sex       Yes   No
##   Male    367 1364
##   Female  344  126

This is important in how we report our results. For example, consider the odds ratio for each of the following tables. In the first, the “event” is “not surviving”, with an odds ratio of 10.15. This indicates that the odds of males (row 1) dying were 10.15 times that of females

tab
##         Survived
## Sex        No  Yes
##   Male   1364  367
##   Female  126  344
oddsratio(tab)
## [1] 10.147

By contrast, consider what happens when the columns are flipped and the “event” is now “survived”. Here the odds ratio is 0.10, meaning that the odds of a male (row 1) surviving were 0.10 times the odds of a female surviving.

tab2
##         Survived
## Sex       Yes   No
##   Male    367 1364
##   Female  344  126
oddsratio(tab2)
## [1] 0.098552

We need to be very careful in how we report our results; there are often several ways to say the same thing.
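
The two odds ratios above are in fact reciprocals of one another, which the following sketch checks. It reuses the oddsratio() function and the column swap from this section; nothing new is assumed beyond base R.

```r
## Rebuild the lab data so this example runs on its own
data(Titanic)
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL

oddsratio <- function(x) {
  (x[1] * x[4]) / (x[2] * x[3])
}

tab  <- with(titanic, table(Sex, Survived))  # event = "No" (died)
tab2 <- tab[, c(2, 1)]                       # event = "Yes" (survived)

## Swapping the columns inverts the odds ratio
oddsratio(tab2)
1 / oddsratio(tab)
all.equal(oddsratio(tab2), 1 / oddsratio(tab))
```

So "the odds of a man dying were about 10 times those of a woman" and "the odds of a man surviving were about one tenth those of a woman" are the same finding stated in two ways.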

Question 3

Create a contingency table showing the age of the passenger (Child or Adult) and their survival status. How do the odds of dying as an adult compare with the odds of dying as a child? Would you say there is an association between age and survival? Find and report the odds ratio in support of your conclusion.

Question 4

The following question involves the UCBAdmissions dataset collecting aggregate data on applications to graduate school at UC Berkeley for the six largest departments in 1973. Variables in this dataset include the department (A-F), the sex of the applicant, and the associated admissions decision. This data became famous in its use as evidence relating to claims of sex bias in admission practices. For this problem, we are going to investigate the veracity of these claims.

Start by copying the code block below to convert the data from its built-in form in R into a data.frame like the one we are used to working with. Then, answer the following questions.

## Create UC Berkeley data from built-in R dataset
data("UCBAdmissions")
ucb <- as.data.frame(UCBAdmissions)
ucb <- ucb[rep(1:nrow(ucb), times = ucb$Freq), ]
ucb$Freq <- NULL

Part A Create the appropriate table to show how many applicants there were of each gender and whether or not they were admitted. State what you find in terms of the relationship between gender and admission status, based on an odds ratio.

Part B Create a frequency table showing how many applicants of each sex applied to each department. Which two departments had the greatest number of male applicants? Which had the greatest number of female applicants?

Part C Create a table of proportions showing the rate at which each department accepted and rejected candidates. Which two departments had the highest acceptance rate? Which had the lowest?

Part D Create a bar chart showing for each sex what proportion of them were admitted or rejected. What do you see from this chart?

Part E To the plot you made in Part D, now add faceting to also show the proportion of each sex accepted by department. Do the admission rates seem similar in each department, or are there any disparities? Based on what you have seen, what statements can you make about the existence of sex-based discrimination in admissions at Berkeley?