This lab is intended to introduce the concept of principal components analysis, or PCA. PCA is a method for reducing the dimensions of a dataset, transforming a large number of variables into a smaller number while still retaining most of the information contained in the larger set.
Broadly speaking, while the specific details of this method are highly technical and unintuitive, correctly using principal components is not particularly difficult. The essence of the process is this: the method takes a collection of \(p\) covariates (\(x_1, x_2, \dots, x_p\)) and creates \(p\) new variables, called the principal components, where each component is a weighted combination of the originals. In addition to this, there are two other properties that will be useful to keep in mind:
Principal components are perpendicular to one another in the same sense that the x-axis and the y-axis are perpendicular, meaning that we can use our principal components as a new axis system on which to plot our variables. Instead of describing our points in terms of \((x, y)\), we can describe them in an alternate axis system, \((PC1, PC2)\).
Principal components are ordered by the amount of variability they capture: the first component accounts for the greatest share of the total variance, the second the next greatest, and so on.
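As a small preview of the "weighted combination" idea, here is a sketch using the prcomp() function and the built-in USArrests data, both of which are covered properly later in this lab: each principal component score can be reproduced by multiplying the scaled original variables by a matrix of weights.
pc_demo <- prcomp(USArrests, scale = TRUE)  # PCA on a built-in dataset
Z <- scale(USArrests)                       # the centered and scaled originals
head(Z %*% pc_demo$rotation)                # weighted sums of the originals...
head(pc_demo$x)                             # ...match the principal component scores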
In what follows, we will briefly cover how PCA is used as a form of variable reduction and some of the key properties associated with the analysis. We will conclude with several examples utilizing PCA in our own analyses.
A general goal in the construction of statistical models is that of model parsimony: including only as many relevant variables as are needed to adequately model the outcome. This was mentioned in passing in a previous lab, alluding to the penalty typically associated with additional variability in predictions for our outcomes.
All things being equal, model covariates with more variability are preferable to those with less, reflecting the fact that the amount of information contained within a variable is directly associated with its variability. To consider an extreme example of this phenomenon, imagine that we were conducting a study on the current students at Grinnell and we included a variable indicating whether or not the participant was enrolled in the college. As this study would be limited only to current students, this enrollment variable would be identical for all participants. Consequently, it would have no variability and no value in differentiating the participants with respect to whatever outcome we are studying.
Suppose, then, that we are interested in modeling some outcome and that we have two uncorrelated variables, \(x_1\) and \(x_2\), where \(x_1\) is normally distributed and \(x_2\) is a constant covariate with no variability, i.e., it is identically zero.
Even though we have two separate variables (which together make a 2D plane), the total lack of variability in the second makes these functionally a 1D line that corresponds to the values of \(x_1\). In this case, we receive the exact same amount of information if we only consider \(x_1\) as if we considered \(x_1\) and \(x_2\) together. In other words, we can perform data reduction here without the loss of any information.
Suppose now that, instead of being exactly constant, \(x_2\) is given a small amount of variability. In this case, \(x_1\) alone no longer gives us all of the information, but it does give us a lot. One way to quantify the amount of information retained is to consider how much of the total variability between \(x_1\) and \(x_2\) is retained if we keep only \(x_1\).
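The variables x1 and x2 used below were simulated behind the scenes; a minimal sketch of how such data might be generated is shown here. The seed and parameters are assumptions, so the variances will not exactly match the output that follows.
set.seed(1)                            # hypothetical seed
x1 <- rnorm(100, mean = 0, sd = 3.6)   # substantial variability (variance near 13)
x2 <- rnorm(100, mean = 0, sd = 1)     # only a small amount of variability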
# Variance of x_1
var(x1)
## [1] 13.332
# Variance of x_2
var(x2)
## [1] 0.93506
# Proportion of total variance for x_1
var(x1) / (var(x1) + var(x2))
## [1] 0.93446
Here, we see that keeping only \(x_1\) retains about 93% of the total variability. For illustration, let’s plot a red line demonstrating what it is that we are going to keep. We will return to this idea in the next section.
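A sketch of the plot described above, assuming the x1 and x2 simulated earlier; the red line runs along the \(x_1\) direction, which is the piece we are keeping.
library(ggplot2)  # if not already loaded
ggplot(data.frame(x1, x2), aes(x1, x2)) + geom_point() +
  geom_hline(yintercept = mean(x2), color = "red")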
In this example, \(x_1\) and \(x_2\) were uncorrelated, reducing our problem to retaining the most variability with the fewest variables. When the variables are correlated, however, simply dropping one of them is no longer the best we can do.
Consider now a situation in which our two covariates are correlated, as is the case with systolic and diastolic blood pressure.
bp <- read.csv("https://collinn.github.io/data/bp.csv")

ggplot(bp, aes(ss, dd)) + geom_point() +
  labs(y = "Diastolic Blood Pressure",
       x = "Systolic Blood Pressure")
In this case, one of our variables is redundant in that much of the information contained within diastolic blood pressure (DBP) can also be found in systolic blood pressure (SBP). Said another way, if we know the value of SBP, say 175, then we have a pretty good idea of the range of possible values for DBP.
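One quick way to quantify this redundancy is the correlation between the two measurements (using the bp data loaded above):
# Correlation between systolic and diastolic blood pressure
cor(bp$ss, bp$dd)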
As before, we could simply keep the covariate that captures most of the variability, in this case SBP
# Systolic
var(bp$ss)
## [1] 623.33
# Diastolic
var(bp$dd)
## [1] 368.1
# Total variance
(var(bp$ss) + var(bp$dd))
## [1] 991.44
# Proportion
var(bp$ss) / (var(bp$ss) + var(bp$dd))
## [1] 0.62872
but doing so only retains at most 63% of the original variance.
Here, we introduce the principal components. As we have two variables we will also have two principal components, both of which are combinations of our original variables.
This plot is analogous to the plot we saw above comparing the relationship between \(x_1\) and \(x_2\). However, instead of our red line representing just one of the variables, we now see it is a combination of SBP and DBP, a 1D line taken from a 2D plane. This red line, this combination of variables, makes up our first principal component.
In a similar vein, we can plot the second PC in blue. Although the scaling of the plot distorts the angle, this line is perpendicular to the red one.
How is this useful? Well, instead of considering our observations in terms of the original variables (SBP, DBP), we can now consider them in terms of their principal component coordinates. Let’s plot this again using PC1 and PC2 as our x and y axes
# Principal components of the blood pressure data
pc <- prcomp(bp)

library(gridExtra)
p1 <- ggplot(bp, aes(ss, dd)) + geom_point() +
  labs(x = "Systolic Blood Pressure", y = "Diastolic Blood Pressure") +
  geom_abline(aes(slope = 0.697, intercept = -4.812), color = "red") +
  geom_abline(aes(slope = -1 / 0.697, intercept = 314.95), color = "blue")
p2 <- ggplot(data.frame(pc$x), aes(PC1, PC2)) + geom_point()
grid.arrange(p1, p2, nrow = 1)
Now we have the exact same observations, plotted in a different coordinate system. Unlike before, these variables are no longer correlated, meaning that there is no redundant information in PC2 that is already contained in PC1. We can also consider the amount of variability within each of our PCs
# The variability for each PC
pc$sdev^2
## [1] 948.382 43.053
# Same total variance as before
sum(pc$sdev^2)
## [1] 991.44
# Proportion of total variance in PC1
pc$sdev[1]^2 / (sum(pc$sdev^2))
## [1] 0.95658
While the total variability contained within both components is the same as in the original variables, we now see that over 95% of this variability is contained within the first PC. Consider this in contrast to our original variables, where the most variance we could retain with a single variable was about 63%.
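We can also verify directly the claim that the two components are uncorrelated (a quick check using the pc object created above):
# Correlation between the two principal components (essentially zero)
cor(pc$x[, 1], pc$x[, 2])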
From this, we summarize with the following: each principal component is a weighted combination of the original variables; the components are uncorrelated (perpendicular) with one another; and while the total variance across all components equals the total variance of the original variables, that variance is concentrated in the first few components.
For this example, we will be using the USArrests dataset in R.
Principal components analysis is typically done using the prcomp() function in R. Although we did not do this in the previous section, it is usually necessary to scale our data first using prcomp(, scale = TRUE).
The prcomp() function returns three things: a rotation matrix of loadings, the standard deviations of the components, and the matrix of principal component scores themselves.
Let’s look at this with USArrests, which gives summary information on murder, assault, rape, and urban population for each of the 50 states.
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
pc <- prcomp(USArrests, scale = TRUE)
# Rotation matrix
pc$rotation
## PC1 PC2 PC3 PC4
## Murder -0.53590 -0.41818 0.34123 0.649228
## Assault -0.58318 -0.18799 0.26815 -0.743407
## UrbanPop -0.27819 0.87281 0.37802 0.133878
## Rape -0.54343 0.16732 -0.81778 0.089024
First, we have the rotation matrix, whose values are called loadings; these give the weight of each of our original variables within each principal component (the overall sign of a component is arbitrary). Loadings over 0.4 in absolute value are generally considered relevant. It’s important to note that, most often, principal components do not have any direct interpretation, so it’s not common to perform inference on them directly. However, looking at the loadings does sometimes give us useful information. In the case above, we see that for PC1, the variables with loadings greater than 0.4 are those associated with murder, assault, and rape. As this information is likely to be highly correlated, it should make sense to us that these variables show up together. We might call this first PC an index of violent crime.
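If we want to pull these out programmatically, one option (a small convenience, not part of the lab's required workflow) is to filter the loadings by the 0.4 cutoff:
# Variables with PC1 loadings above 0.4 in absolute value
rownames(pc$rotation)[abs(pc$rotation[, "PC1"]) > 0.4]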
Question 1: Looking at the rotation matrix above, what sort of information seems to be primarily captured in the second principal component?
The second item returned by prcomp() is the collection of standard deviations of each of the principal components
# Standard deviation of PCs
pc$sdev
## [1] 1.57488 0.99487 0.59713 0.41645
As the variables were scaled before the analysis, the component variances will no longer add up to the total variance of our original data; instead, the squared standard deviations should sum to the total number of variables (since each scaled variable has a variance of 1). This is useful for a number of things. First, we might be interested in seeing the proportion of total variance explained by each component (squaring first to get the variance)
pc$sdev^2 / sum(pc$sdev^2)
## [1] 0.620060 0.247441 0.089141 0.043358
Here, we see 62% of the total variance is in PC1, 25% in PC2, then 9% and 4% in PC3 and PC4, respectively. Similarly, we might be interested in viewing the cumulative proportion of variance explained as each subsequent PC is included
cumsum(pc$sdev^2) / sum(pc$sdev^2)
## [1] 0.62006 0.86750 0.95664 1.00000
Again, the first alone is 62%, but we now see that the first two together make up about 87% and the first three make up 96%. In other words, nearly all of the information in our four-variable dataset can be represented with only three variables, and a considerable amount with only two.
Finally, we have the principal components themselves. This is the matrix of new coordinates for each of our observations in terms of the PCs
# Matrix of Principal components
head(pc$x)
## PC1 PC2 PC3 PC4
## Alabama -0.97566 -1.12200 0.43980 0.1546966
## Alaska -1.93054 -1.06243 -2.01950 -0.4341755
## Arizona -1.74544 0.73846 -0.05423 -0.8262642
## Arkansas 0.14000 -1.10854 -0.11342 -0.1809736
## California -2.49861 1.52743 -0.59254 -0.3385592
## Colorado -1.49934 0.97763 -1.08400 0.0014502
This can be immensely useful in identifying groups or patterns latent within our data. Here, we plot the first two PCs, using state names rather than points
df <- as.data.frame(pc$x)
df$State <- rownames(df)
ggplot(df, aes(PC1, PC2, label = State)) + geom_text()
Question 2: Consider again the rotation matrix above which gave the loadings of each of the original variables for each of the PCs. Considering their positions on this plot, comment on the following:
Question 3: The code below creates a subset of the “colleges2019” data set with no missing data in the variables that are selected:
library(dplyr)  # for select() and the %>% pipe

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

cd <- colleges %>%
  select(Name, State, Enrollment, Adm_Rate, ACT_median,
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
         Debt_median, Salary10yr_median)

cd <- cd[complete.cases(cd), ] ## Keeps only complete cases (no missing values)
In the questions that follow you will progress through the basic steps of PCA:
Part A: Principal components only work for numeric values, so we will have to remove Name and State from the dataset before using prcomp(). However, we will need Name and State later in the problem, so consider an adequate way to address this.
Part B: Perform PCA using the prcomp() function, not forgetting to use scale = TRUE. Then, considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the first principal component.
Part C: Considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the second principal component.
Part D: Using geom_text(), create a graphic that displays the \(PC_1\) and \(PC_2\) scores of all colleges in Iowa. Briefly comment on the position of Grinnell College in this graph, mentioning why you think its location is so far from the other Iowa colleges.
As mentioned previously, principal components are constructed in such a way that the first PC accounts for the greatest proportion of variance, followed by the second, the third, and so on. When a set of \(p\) variables are scaled such that the standard deviation of each variable is 1, the total variability within the data is equal to \(p\) (i.e., \(var(Z_1) + var(Z_2) + \dots + var(Z_p) = 1 + 1 + \dots + 1 = p\)). That our principal components can have variances greater or less than 1 reflects the relative variability captured by each component.
For example, we saw in the USArrests data that the first PC had a variance of about 2.5, while the last PC had a variance of less than 0.2
pca <- prcomp(USArrests, scale = TRUE)
pca$sdev^2
## [1] 2.48024 0.98977 0.35656 0.17343
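Consistent with the discussion above, these variances sum to the number of scaled variables:
# With scaled variables, the PC variances sum to the number of variables (4 here)
sum(pca$sdev^2)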
A popular way to quickly visualize the degree of variability associated with each PC is with a scree plot. The screeplot() function in R can make a quick and dirty version
screeplot(pca)
Here, the y-axis is a measure of total variability, while the bars along the x-axis correspond to each subsequent PC. With a little effort, we can make our scree plots look a little nicer, replacing the total variability on the y-axis with a percentage of the total variability, which is more readily interpretable
scree <- function(x) {
  # Find variance as proportion of total
  vv <- x$sdev^2 / sum(x$sdev^2)

  # Create data frame to plot
  df <- data.frame(PC = paste0("PC", 1:length(vv)),
                   Percentage = vv)
  df$PC <- factor(df$PC, levels = df$PC) # preserve order so that PC2 is before PC10

  ggplot(df, aes(x = PC, y = Percentage)) + geom_bar(stat = "identity")
}

scree(pca)
This allows us to see more readily that the first PC accounts for over 60% of the total variance, a more meaningful metric than simply saying that the variance of the first PC is about 2.5.
We can make this function even more useful by adding an option to plot the cumulative sums instead
scree <- function(x, cum = FALSE) {
  if (cum) {
    vv <- cumsum(x$sdev^2) / sum(x$sdev^2)
    yl <- "Cumulative Percentage"
  } else {
    vv <- x$sdev^2 / sum(x$sdev^2)
    yl <- "Percentage"
  }

  # Create data frame to plot
  df <- data.frame(PC = paste0("PC", 1:length(vv)),
                   Percentage = vv)
  df$PC <- factor(df$PC, levels = df$PC) # preserve order so that PC2 is before PC10

  ggplot(df, aes(x = PC, y = Percentage)) + geom_bar(stat = "identity") +
    ylab(yl)
}

p1 <- scree(pca, cum = FALSE)
p2 <- scree(pca, cum = TRUE)
gridExtra::grid.arrange(p1, p2, nrow = 1)
Here, we see that the first PC still accounts for over 60% of the total variability, but we also see that the first two combined account for about 87%.
Scree plots are particularly useful in a regression context. For example, consider a dataset containing 47 patients with acute lymphoblastic leukemia (ALL) and 25 patients with acute myeloid leukemia (AML). For each participant, a bone marrow sample was collected at the time of diagnosis and gene expression measurements were taken using Affymetrix Hgu6800 chips, resulting in 7129 gene expressions for each patient.
leukemia <- readRDS(url("https://collinn.github.io/data/Golub1999.rds"))
# Outcome data
y <- leukemia$y
table(y)
## y
## ALL AML
## 47 25
# Genetic data
X <- leukemia$X
dim(X)
## [1] 72 7129
# First 10 patients, first 5 genes
X[1:10, 1:5]
## AFFX-BioB-5_at AFFX-BioB-M_at AFFX-BioB-3_at AFFX-BioC-5_at AFFX-BioC-3_at
## 39 4.5287 5.1492 6.7787 8.3746 5.0282
## 40 5.8727 4.9657 8.0803 8.2198 5.0692
## 42 6.6042 5.3040 6.5633 8.2321 4.9799
## 47 4.5664 4.6889 4.9934 7.8153 4.3623
## 48 5.4351 5.1918 6.0592 7.8813 5.2263
## 49 5.2654 5.2947 4.7121 6.7217 4.3642
## 41 5.7011 6.0832 6.2497 7.6929 4.5128
## 43 7.1845 6.1377 5.3334 8.2256 4.9711
## 44 5.3899 5.9681 7.9704 7.9940 6.3511
## 45 5.3021 5.3021 8.5825 7.7785 7.6051
In some cases, the expression of a particular gene can serve as an indicator for risk of a particular cancer, for example with the BRCA1 gene associated with breast cancer. In other cases, such as this, there is no clear marker and we are called upon to rely on the genetic data we have been given. However, with only 72 total participants, there is no way to effectively create a model utilizing all genes. A common solution to this is to use PCA to reduce our 7129 genes into a handful of meaningful components.
pca <- prcomp(X, scale = TRUE)
p1 <- scree(pca, cum = FALSE)
p2 <- scree(pca, cum = TRUE)
gridExtra::grid.arrange(p1, p2, nrow = 1)
Remarkably, we find that a single component captures over 10% of the total variability across all 7129 genes.
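We can confirm this by computing the proportion of total variance captured by the first component directly:
# Proportion of the total (scaled) variance captured by PC1
pca$sdev[1]^2 / sum(pca$sdev^2)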
In addition to scree plots, we can plot our outcome against the first two PCs.
df <- data.frame(y = y, pca$x)
ggplot(df, aes(PC1, PC2, color = y)) + geom_point()
Even here, we see a strong association with our outcome: AML appears to be associated with smaller values of PC1 and larger values of PC2, while ALL tends to be associated with larger values of PC1 and smaller values of PC2. (Note the similarity between the "groups" appearing in this plot and the "groups" that appeared in the PC plot illustrating violent crime in US states.)
Finally, we see the fruits of our labor when we use our components as covariates in regression
df <- mutate(df, outcome = 1*(y == "AML"))
fit1 <- glm(outcome ~ PC1, data = df, family = binomial)
fit2 <- glm(outcome ~ PC1 + PC2, data = df, family = binomial)
library(plotROC)
p1 <- ggplot() + geom_roc(aes(d = df$outcome, m = fitted(fit1))) +
  ggtitle("Model 1")
p2 <- ggplot() + geom_roc(aes(d = df$outcome, m = fitted(fit2))) +
  ggtitle("Model 2")
gridExtra::grid.arrange(p1, p2, nrow = 1)
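If a single-number summary of each curve is wanted, the plotROC package also provides calc_auc(), which computes the area under the curve from these plot objects (an optional extra, not required for the lab):
# Area under the ROC curve for each model
calc_auc(p1)
calc_auc(p2)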
Problem 4: (Bonus Problem, No Credit) This problem will use data collected from an online personality test which measures the big-five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism.
The personality test asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers designed 10 statements to target each of the big five traits. For example, questions tagged with “E” were designed to measure extraversion, while questions tagged with “A” were designed to measure agreeableness.
The exact questions used, along with a description of the data, can be found here.
big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")

# Exclude demographics; Male is 0, Female is 1
big5 <- select(big5, -race, -age, -engnat, -hand, -source, -country) %>%
  filter(gender %in% 1:2) %>%
  mutate(gender = gender - 1)
head(big5)
## gender E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 A1 A2 A3
## 1 0 4 2 5 2 5 1 4 3 5 1 1 5 2 5 1 1 1 1 1 1 1 5 1
## 2 1 2 2 3 3 3 3 1 5 1 5 2 3 4 2 3 4 3 2 2 4 1 3 3
## 3 1 5 1 1 4 5 1 1 5 5 1 5 1 5 5 5 5 5 5 5 5 5 1 5
## 4 1 2 5 2 4 3 4 3 4 4 5 5 4 4 2 4 5 5 5 4 5 2 5 4
## 5 1 3 1 3 3 3 1 3 1 3 5 3 3 3 4 3 3 3 3 3 4 5 5 3
## 6 1 1 5 2 4 1 3 2 4 1 5 1 5 4 5 1 4 4 1 5 2 2 2 3
## A4 A5 A6 A7 A8 A9 A10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 O1 O2 O3 O4 O5 O6 O7 O8
## 1 5 2 3 1 5 4 5 4 1 5 1 5 1 4 1 4 5 4 1 3 1 5 1 4 2
## 2 4 4 4 2 3 4 3 4 1 3 2 3 1 5 1 4 4 3 3 3 3 2 3 3 1
## 3 5 1 5 1 5 5 5 4 1 5 1 5 1 5 1 5 5 4 5 5 1 5 1 5 5
## 4 4 3 5 3 4 4 3 3 3 4 5 1 4 5 4 2 3 4 3 5 2 4 2 5 2
## 5 5 1 5 1 5 5 5 3 1 5 3 3 1 1 3 3 3 3 1 1 1 3 1 3 1
## 6 4 3 4 3 5 5 3 2 5 4 3 3 4 5 3 5 3 4 2 1 3 3 5 5 4
## O9 O10
## 1 5 5
## 2 3 2
## 3 5 5
## 4 5 5
## 5 5 3
## 6 5 3
The big-five personality traits are constructed in such a way that the measures tend to be orthogonal. For example, a measure of how open someone is to new experiences tells us very little about how agreeable they may be. As such, it is little surprise that the first five PCs contain a great deal of the total variance
# 2:ncol(big5) because column 1 is sex and we don't want that in our PCs
# since it's the outcome
pc <- prcomp(big5[, 2:ncol(big5)], scale = TRUE)
scree(pc)
Our goal for this problem is two-fold. The first is to determine if we can reasonably predict a participant’s sex given the information contained in their personality responses, while the second is to determine if any of the big five personality traits tends to be more associated with one sex or the other.
Part A: First, use the first five principal components (computed above) to construct a logistic model with sex as the outcome. Looking at the exponentiated coefficients, are we able to determine with any certainty which traits are associated with which sex? Why or why not?
Part B: The dplyr function starts_with() allows us to select all columns whose names begin with a given character string. For example, to grab all of the columns starting with "A" (for agreeableness), we could use
acol <- select(big5, starts_with("A", ignore.case = FALSE))
head(acol)
## A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
## 1 1 5 1 5 2 3 1 5 4 5
## 2 1 3 3 4 4 4 2 3 4 3
## 3 5 1 5 5 1 5 1 5 5 5
## 4 2 5 4 4 3 5 3 4 4 3
## 5 5 5 3 5 1 5 1 5 5 5
## 6 2 2 3 4 3 4 3 5 5 3
With this, we can create a set of principal components made up only of those questions associated with agreeableness. For now, we won’t worry about the standard deviations, but we will consider the rotation matrix. In particular, we consider the first column of the rotation matrix, which is associated with the greatest amount of variability.
pcA <- prcomp(acol, scale = TRUE)
pcA$rotation[, 1]
## PC1
## A1 0.23896
## A2 -0.31872
## A3 0.21314
## A4 -0.38763
## A5 0.35264
## A6 -0.30272
## A7 0.35367
## A8 -0.32416
## A9 -0.36590
## A10 -0.25610
Whereas previously we used a rough cutoff of 0.4 to determine which loadings are relevant, here we instead consider the signs. Looking at PC1, from A1 to A10 the signs go \(+-+-+-+---\). Comparing this with the questions from the index, we find that the positive loadings are associated with questions of disagreeability (e.g., A1 is "I feel little concern for others"), while those with negative loadings are associated with agreeability (e.g., A2 is "I am interested in people"). From this, we could conclude that the first PC of the agreeableness columns is associated with disagreeability.
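A small convenience for inspecting these signs (optional; sign() simply extracts +1 or -1 from each loading):
# Signs of the PC1 loadings for the agreeableness questions
sign(pcA$rotation[, 1])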
Repeat this process for each set of questions, with letters “E”, “O”, “N”, “C”, and “A”, and determine for each set of questions the sentiment of the first principal component, i.e., high pcA is disagreeable, high pcE is (…)
Part C: The code below creates a matrix using the first PC from each set of questions and then fits a logistic regression model. Copy and run the code, then look at the exponentiated coefficients. What do we see? Is there any indication that the traits you identified in Part B are associated with one sex or the other?
pcaLetter <- function(v) {
  tmp <- select(big5, starts_with(v, ignore.case = FALSE))
  prcomp(tmp, scale = TRUE)$x[, 1]
}

ll <- c("E", "O", "N", "C", "A")
m <- lapply(ll, pcaLetter)
m <- Reduce(cbind, m)
colnames(m) <- paste0("PC_", ll)

fit2 <- glm(big5$gender ~ m, family = binomial)
Part D: Create ROC plots for the model fit with the first five principal components of all of the questions as well as the one fit using the individual principal components from each question. Does one appear significantly better than the other? Explain. (Hint: Recall that the big five traits were constructed as to be uncorrelated with one another. What information, then, should we expect to find in the first five principal components of all the questions?)