For this lab we will be using the Titanic dataset built into R, providing information on the fate of the passengers on the fatal maiden voyage of the ocean liner ‘Titanic’ summarized according to economic status (class), sex, age, and survival. See ?Titanic for more details.
As the Titanic dataset in R is, by default, stored as a 4-dimensional array, we will start by transforming it into a data.frame.
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## Data for lab
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(1:nrow(titanic), times = titanic$Freq), ]
titanic$Freq <- NULL
## head() shows us the first 6 rows of a data.frame
head(titanic)
## Class Sex Age Survived
## 3 3rd Male Child No
## 3.1 3rd Male Child No
## 3.2 3rd Male Child No
## 3.3 3rd Male Child No
## 3.4 3rd Male Child No
## 3.5 3rd Male Child No
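The row expansion used above can be easier to follow on a toy example. The sketch below uses a small made-up data frame (counts and its columns are hypothetical, chosen only for illustration) to show how rep() with times = repeats each row number according to its count, so that indexing by the result gives one row per observation.

```r
## Hypothetical toy data: one row per category, with a count column
counts <- data.frame(Colour = c("red", "blue", "green"),
                     Freq = c(2, 0, 3))

## Repeat each row number Freq times:
## row 1 twice, row 2 never, row 3 three times
idx <- rep(seq_len(nrow(counts)), times = counts$Freq)
idx
## [1] 1 1 3 3 3

## Index by the repeated row numbers, then drop the count column
long <- counts[idx, ]
long$Freq <- NULL
nrow(long)
## [1] 5
```

This also suggests why the row names in head(titanic) start at 3: the first two combinations in the original array have a count of zero, so the first row that survives the expansion is row 3, repeated as 3, 3.1, 3.2, and so on.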
Our goal in this lab is to familiarize ourselves with the use of tables in R. To this end, there are two primary functions we will be concerning ourselves with:
- The table() function, which returns a count of categorical data
- The proportions() function, which takes a table as an argument and returns a (optionally conditional) table of proportions
In addition to these, we will introduce four helper functions to assist us:
- The pipe operator %>% (Ctrl+Shift+M), which takes the result of the left-hand side of the pipe and passes it to the right-hand side. This is helpful for “chaining together” a sequence of operations. We will use this extensively later in the semester
- The sort() function, which will sort either a table or numeric vector in increasing or decreasing order
- The addmargins() function, which will sum the margins of a table
- facet_grid(), used to facet two nested categorical variables in ggplot2
You can learn more about these functions, and see some additional examples, by using the help documentation (e.g., ?table).
Tables are ubiquitous in R.
The first function we will introduce is the table() function; in its most basic form, table() takes as its argument a single vector, returning a named numeric vector detailing the quantities of each category:
## How many people lived or died on the Titanic
table(titanic$Survived)
##
## No Yes
## 1490 711
## How many men and women were on the Titanic
table(titanic$Sex)
##
## Male Female
## 1731 470
This gives us an example of a frequency table. Note that this corresponds directly to the univariate bar chart that we saw previously:
ggplot(titanic, aes(Sex)) + geom_bar(fill = 'hotpink1')
Additionally, we see in both cases that all of our observations are included in the count, as both of the table totals sum up to 2201. We can verify this with the helper function addmargins(), which will include a “Sum” column adding up the table:
## Assign table to tab variable
tab <- table(titanic$Survived)
## Adds the Sum value
addmargins(tab)
##
## No Yes Sum
## 1490 711 2201
Notice how in this case we first assigned the result of table() to a variable and then passed that variable to addmargins(). This is an excellent example of a process that is facilitated with the pipe operator, %>% (Ctrl+Shift+M):
## This is equivalent to what we saw above
table(titanic$Survived) %>% addmargins()
##
## No Yes Sum
## 1490 711 2201
We may also be interested in identifying the proportion of individuals who survived or died on the Titanic. Similar to the addmargins() function, we can pass the results of table() to the function proportions():
table(titanic$Survived) %>% proportions()
##
## No Yes
## 0.67697 0.32303
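As a quick check on what proportions() is doing for a one-way table, the sketch below divides the counts by their total directly. It rebuilds the lab data using base R only; by_hand is a name introduced here for illustration, and the comparison assumes proportions() is available in your version of R (prop.table() is the older name for the same operation).

```r
## Rebuild the lab data from the built-in Titanic array
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), times = titanic$Freq), ]
titanic$Freq <- NULL

tab <- table(titanic$Survived)

## Divide each count by the grand total
by_hand <- tab / sum(tab)
by_hand

## Compare against proportions(); this should report TRUE
all.equal(as.vector(by_hand), as.vector(proportions(tab)))
```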
Here we see that 68% of the passengers aboard the Titanic died while 32% survived. This is the same information that we would have found had we computed the values by hand:
\[
\text{Proportion Dead} = \frac{\# \text{Dead}}{\text{Total Passengers}} = \frac{1490}{2201} = 0.677
\]
Finally, let’s introduce the sort() function, which takes either a one-way table or a vector and sorts the values from smallest to largest (or in alphabetical order if they are character strings):
## Unsorted table
table(titanic$Class)
##
## 1st 2nd 3rd Crew
## 325 285 706 885
## Table sorted smallest to largest
table(titanic$Class) %>% sort()
##
## 2nd 1st 3rd Crew
## 285 325 706 885
## Table sorted largest to smallest
table(titanic$Class) %>% sort(decreasing = TRUE)
##
## Crew 3rd 1st 2nd
## 885 706 325 285
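The same behaviour can be seen on a small named vector. In the sketch below, x is a hypothetical vector made up for illustration; it shows that sort() orders by the values and carries the names along, and one possible way to order by the names instead.

```r
## Hypothetical named vector: names are labels, values are counts
x <- c(b = 3, a = 10, c = 1)

## sort() orders by the values (1, 3, 10); the names come along as c, b, a
sort(x)

## To order by the names instead, index with the ordering of the names
x[order(names(x))]
```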
There are a few things to note about the sort() function:
- As with addmargins() and proportions(), we can pass a table to sort() using the pipe operator %>%
- By default, sort() reorders observations from smallest to largest. We can change this behavior by adding decreasing = TRUE as an argument to sort(), as we did above
- sort() will sort the vector according to its values, not its names.
Let’s practice using some of these functions on the Titanic dataset.
Question 1:
Create a frequency table using the titanic dataset to find how many children and adults were on board the Titanic.
Determine what percentage of the passengers on board the Titanic were adults.
Determine what percentage of the passengers on board the Titanic survived.
Below is an example of using the pipe operator twice in a row to both add proportions and margins to a table. Does it matter which order we include them in? Why? Which of these is correct?
table(titanic$Age) %>% addmargins() %>% proportions()
##
## Child Adult Sum
## 0.024761 0.475239 0.500000
table(titanic$Age) %>% proportions() %>% addmargins()
##
## Child Adult Sum
## 0.049523 0.950477 1.000000
The two-way table, as the name suggests, is a table of two categorical variables. In R, this can be done by passing each of the two vectors in as arguments, with the first vector becoming the rows and the second becoming the columns:
## Sex as row
table(titanic$Sex, titanic$Survived)
##
## No Yes
## Male 1364 367
## Female 126 344
## Sex as column
table(titanic$Survived, titanic$Sex)
##
## Male Female
## No 1364 126
## Yes 367 344
A simplified syntax for performing this operation can be done using with(), though this is completely optional. I will use it for clarity and brevity, though all of the examples shown will work with the syntax above. One minor benefit to using with() is that the table now includes the name of the variable along with its values (notice that the labels “Sex” and “Survived” are missing in the tables above):
## Simplified syntax using with()
with(titanic, table(Sex, Survived))
## Survived
## Sex No Yes
## Male 1364 367
## Female 126 344
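If you would rather not use with(), naming the arguments passed to table() is another way to get labelled dimensions. The sketch below is a minimal illustration using base R only; the first three lines simply repeat the data setup from the start of the lab.

```r
## Rebuild the lab data from the built-in Titanic array
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), times = titanic$Freq), ]
titanic$Freq <- NULL

## Naming the arguments labels the rows and columns of the table
tab <- table(Sex = titanic$Sex, Survived = titanic$Survived)
names(dimnames(tab))
## [1] "Sex"      "Survived"

## The labels can then be used to look up a single cell by name
tab["Female", "Yes"]
```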
The two-way table gives us frequencies of the cross-sections of groups: for example, in the one-way table we saw that there were 470 women on board the Titanic. By also including the Survived variable, I can see that of the 470 women, 126 died while 344 lived. Additionally, of the 711 individuals who survived, the two-way table shows me that 367 were men and 344 were women. Note that this table corresponds to the following plot (it is customary for the row variable to serve as the x-axis):
ggplot(titanic, aes(Sex, fill = Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set2")
We can recover the information from the one-way tables by using addmargins() to give us row and column totals:
## Add row and column margins to the table
with(titanic, table(Sex, Survived)) %>% addmargins()
## Survived
## Sex No Yes Sum
## Male 1364 367 1731
## Female 126 344 470
## Sum 1490 711 2201
We should see that the Sum row and column are exactly the one-way tables we created in the last section. In both cases, we see they sum to 2201 in the bottom right corner.
Just as before, we can pass a two-way table to the proportions() function to return a table of proportions:
with(titanic, table(Sex, Survived)) %>% proportions()
## Survived
## Sex No Yes
## Male 0.619718 0.166742
## Female 0.057247 0.156293
Note that these are absolute proportions, calculated by dividing each of the entries in the two-way table by its total of 2201. This table of proportions tells us, for example, that of all of the passengers who were on the Titanic, 62% of them were males who died.
We can specify conditional proportions by telling the proportions() function on which margin we wish to compute the proportions. In R, 1 refers to rows and 2 refers to columns: so, in order to compute a table of proportions conditional on the row variable (meaning that the proportions within each row sum to 1), we will pass the argument margin = 1 to the proportions() function:
# Compute row proportions
with(titanic, table(Sex, Survived)) %>% proportions(margin = 1)
## Survived
## Sex No Yes
## Male 0.78798 0.21202
## Female 0.26809 0.73191
From this, we see that given that an individual was male, 21% survived while 79% perished. Similarly, given that an individual was female, we see that 27% were cast to a watery grave while 73% went to normal graves at some other point in their lives. In this example, we are conditioning on sex, our explanatory variable, with survival status serving as our response.
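To see what margin = 1 is doing, the sketch below reproduces the row proportions by hand, dividing each row of the table by its own total. It uses base R only and rebuilds the lab data first; by_hand is a name introduced here for illustration.

```r
## Rebuild the lab data from the built-in Titanic array
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), times = titanic$Freq), ]
titanic$Freq <- NULL

tab <- with(titanic, table(Sex, Survived))

## Row proportions by hand: rowSums(tab) is recycled down the columns,
## so every entry in row i is divided by the total of row i
by_hand <- tab / rowSums(tab)

## Each row now sums to 1
rowSums(by_hand)

## Compare against proportions(); this should report TRUE
all.equal(as.vector(by_hand), as.vector(proportions(tab, margin = 1)))
```

For example, the Male/Yes entry is 367 / 1731, which is the 21% reported above.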
The variable we condition on should also serve as the x-axis in any conditional plots we make
ggplot(titanic, aes(Sex, fill = Survived)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set2")
Finding column proportions works the same way, passing margin = 2 into our function instead:
# Compute column proportions
with(titanic, table(Sex, Survived)) %>% proportions(margin = 2)
## Survived
## Sex No Yes
## Male 0.915436 0.516174
## Female 0.084564 0.483826
Doing so changes which variable we include as the x-axis on our plot:
ggplot(titanic, aes(Survived, fill = Sex)) +
geom_bar(position = "fill") +
scale_fill_brewer(palette = "Set2")
We will wrap up this section by showing the use of addmargins() for two-way tables. Just like proportions(), addmargins() also takes an argument telling it over which margin to take the sum, though it is unfortunately backwards from what we might expect based on the proportions() function:
## Row and column margins
with(titanic, table(Sex, Age)) %>% addmargins()
## Age
## Sex Child Adult Sum
## Male 64 1667 1731
## Female 45 425 470
## Sum 109 2092 2201
## Adds a row of sums (one total for each column)
with(titanic, table(Sex, Age)) %>% addmargins(margin = 1)
## Age
## Sex Child Adult
## Male 64 1667
## Female 45 425
## Sum 109 2092
## Adds a column of sums (one total for each row)
with(titanic, table(Sex, Age)) %>% addmargins(margin = 2)
## Age
## Sex Child Adult Sum
## Male 64 1667 1731
## Female 45 425 470
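One way to keep the margin argument straight is to read it as the dimension being summed over. The sketch below checks this reading against colSums() and rowSums() using base R only; with_row and with_col are names introduced here for illustration.

```r
## Rebuild the lab data from the built-in Titanic array
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), times = titanic$Freq), ]
titanic$Freq <- NULL

tab <- with(titanic, table(Sex, Age))

## margin = 1 sums over the rows, so the new "Sum" row holds column totals
with_row <- addmargins(tab, margin = 1)
with_row["Sum", ]
colSums(tab)

## margin = 2 sums over the columns, so the new "Sum" column holds row totals
with_col <- addmargins(tab, margin = 2)
with_col[, "Sum"]
rowSums(tab)
```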
Let’s conclude this section with a little bit of practice.
Question 2:
How many children were included in second class?
What percentage of the crew survived? How about children?
What proportion of individuals who survived were members of the crew? Construct the plot associated with the table you create.
A natural extension of the two-way table is the three-way table (and so on). These differ from the two-way and one-way tables in that switching the order of the variables is no longer as simple as changing out the rows and columns. We won’t be asked to do much with three-way tables, but it is worth considering what information can be gained from them. Consider for example the two tables below:
## Table 1
with(titanic, table(Class, Sex, Survived))
## , , Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 422 106
## Crew 670 3
##
## , , Survived = Yes
##
## Sex
## Class Male Female
## 1st 62 141
## 2nd 25 93
## 3rd 88 90
## Crew 192 20
## Table 2
with(titanic, table(Sex, Class, Survived))
## , , Survived = No
##
## Class
## Sex 1st 2nd 3rd Crew
## Male 118 154 422 670
## Female 4 13 106 3
##
## , , Survived = Yes
##
## Class
## Sex 1st 2nd 3rd Crew
## Male 62 25 88 192
## Female 141 93 90 20
Complementary to the three-way table, we introduce here the ggplot2 function facet_grid(); it works just as facet_wrap() does, except it takes two categorical variables and creates a grid of facets:
ggplot(titanic, aes(Class)) +
geom_bar() +
facet_grid(Survived ~ Sex)
There are no questions for this section, but you should be aware of the facet_grid() function and how it works.
In class, we introduced the contingency table, a special two-way table made up of binary variables, common in fields of epidemiology and biostatistics:
 | Event | Non-Event |
---|---|---|
Exposure | A | B |
No Exposure | C | D |
Given a table such as this, we are often interested in asking if the response variable (columns) is associated with the explanatory variable (rows). While there are a multitude of methods for doing so, we have limited our discussion for now to the topic of odds and odds ratios.
While there are several packages available in R to facilitate this, most of them include machinery beyond what we currently need. As such, we will briefly detail how to do it manually here.
First, we should understand that data in R (including matrices and tables) are stored in column-major order, meaning that if we were to index all of the elements from 1 to the last, we would begin with the first element in the top left and then iteratively proceed down the column. For example, consider this matrix with multiples of ten as its entries:
mm <- matrix(10*1:4, nrow = 2)
mm
## [,1] [,2]
## [1,] 10 30
## [2,] 20 40
The “first” element of this matrix is 10 since it is in the top left, and we can access it similar to how we selected elements of a vector in the previous lab:
## The first element is 10
mm[1]
## [1] 10
Moving vertically down the column, we see that our second should be 20. This ends the first column so we start again at the next, giving us our third as 30, and finally the fourth as 40.
mm[2]
## [1] 20
mm[3]
## [1] 30
mm[4]
## [1] 40
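To connect the single-number index with the more familiar row and column positions, the short sketch below checks that index 3 refers to the same entry as row 1, column 2. It uses only base R; arrayInd() is a base function that converts a single index into a (row, column) position.

```r
mm <- matrix(10 * 1:4, nrow = 2)

## Single-number index 3 is the same entry as row 1, column 2
mm[3] == mm[1, 2]
## [1] TRUE

## arrayInd() reports the (row, column) position of a single index
arrayInd(3, dim(mm))
```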
Now consider a table from the titanic data looking at survival for men and women:
tab <- with(titanic, table(Sex, Survived))
tab
## Survived
## Sex No Yes
## Male 1364 367
## Female 126 344
The indexing works the same. If I wanted to find the total number of men who survived, I would start counting down from the top left and see that it is the third position. As such, I would extract it with the index 3
tab[3]
## [1] 367
To find the odds of a male dying, we then divide the two numbers in the first row; from above, we know these are positions 1 and 3
## odds of man dying
tab[1] / tab[3]
## [1] 3.7166
This tells us that 3.7 men died for each one that lived. For women, the computation is similar, selecting the second and fourth positions:
## odds of woman dying
tab[2] / tab[4]
## [1] 0.36628
Here we see that 0.37 women died for each one that lived. In order to see how the odds differ between groups, we then evaluate the odds ratio, the ratio between these two odds:
man_odds <- tab[1] / tab[3]
woman_odds <- tab[2] / tab[4]
man_odds / woman_odds
## [1] 10.147
We find that the odds of a man dying aboard the Titanic were 10 times that of a woman dying, suggesting that there may be some association between these two variables. To facilitate our work for this lab, I have included a small function here that, given a contingency table, will report back the odds ratio for you.
## Function called or()
or <- function(x) {
(x[1] * x[4]) / (x[2] * x[3])
}
## Create a 2x2 table
tab <- with(titanic, table(Sex, Survived))
## Pass table as argument to or()
or(tab)
## [1] 10.147
As a critical note about odds and tables: when reporting odds, we need to be careful about what we treat as our “success” or “event” and what is considered “failure” or “no event”. Our formula assumes the first column of our table to be the event. In the case of the Titanic data, this refers to death, so when we speak of odds, we are talking about the odds of dying. If we are interested in discussing the odds of living, we need to change the order of the columns. We do that like this:
## This is our original table
tab
## Survived
## Sex No Yes
## Male 1364 367
## Female 126 344
## Here, we switch columns. When we include numbers after
# a comma, we are referring to column numbers. In this case, we are
# saying that we want column 2 first, and then column 1
tab[, c(2, 1)]
## Survived
## Sex Yes No
## Male 367 1364
## Female 344 126
## I can save these results to my table to override
tab <- tab[, c(2, 1)]
tab
## Survived
## Sex Yes No
## Male 367 1364
## Female 344 126
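To see why the column order matters, the sketch below applies the or() function from earlier to both versions of the table. Under the cross-product formula, swapping the columns should give the reciprocal of the original odds ratio. The sketch uses base R only and rebuilds the data first.

```r
## Rebuild the lab data from the built-in Titanic array
titanic <- as.data.frame(Titanic)
titanic <- titanic[rep(seq_len(nrow(titanic)), times = titanic$Freq), ]
titanic$Freq <- NULL

## Same odds ratio function as defined earlier in the lab
or <- function(x) {
  (x[1] * x[4]) / (x[2] * x[3])
}

tab <- with(titanic, table(Sex, Survived))

or(tab)             # first column is "No": odds ratio for dying
or(tab[, c(2, 1)])  # first column is "Yes": odds ratio for surviving
1 / or(tab)         # matches the line above
```

In other words, an odds ratio of about 10 for dying corresponds to an odds ratio of about 0.1 for surviving; both describe the same association, just from opposite directions.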
Question 3:
Create a contingency table showing the age of the passenger (Child or Adult) and their survival. How do the odds of dying as an adult compare with the odds of dying as a child? Would you say there is an association between age and survival? Find the odds ratio in support of your conclusion.
The following set of exercises involves the UCBAdmissions dataset, which collects aggregate data on applications to graduate school at UC Berkeley for the six largest departments in 1973. Variables in this dataset include the department (A-F), the sex of the applicant, and the associated admissions decision. This data became famous in its use as evidence relating to claims of sex bias in admission practices. For this problem, we are going to investigate the veracity of these claims.
Start by copying the code block below to convert the data from its built-in form in R into something we are used to working with. Then, answer the following questions.
## Create UC Berkeley data from built-in R dataset
ucb <- as.data.frame(UCBAdmissions)
ucb <- ucb[rep(1:nrow(ucb), times = ucb$Freq), ]
ucb$Freq <- NULL
Create the appropriate table to illustrate the frequency of male and female applicants and whether or not their application was accepted.
Using your table from the previous part, compute the odds ratio of acceptance for male applicants relative to female applicants. Does it appear that there is any evidence of sex bias?
Which departments appear to have the greatest total number of male applicants? Which departments had the greatest total number of female applicants?
Which two departments appear to have the highest proportion of applications accepted? Of applications rejected?
Based on what we have seen in the first parts of this question, what do you suspect is happening? Create two plots to help tell this story: the first plot should be suggestive of gender bias at UCB, while the second should demonstrate how the question is resolved, either concluding with evidence of gender bias or providing evidence to refute it. Consider which attributes of the problem you find most compelling in support of your argument (number of students, proportions, etc). Finally, in 3-4 sentences, summarize your findings and justify the two plots selected to illustrate the problem.