library(ggplot2)
library(dplyr)
## Less uggo plots
theme_set(theme_bw())
## Copy this to turn off scientific notation (also uggo)
options(scipen = 9999)
## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")
This lab is intended to be a first introduction to linear regression in R. In pursuit of this, we will begin with a brief introduction to the use of Latex in R Markdown, a convenient typesetting language that will help us express mathematical ideas notationally. Following this, we will investigate the use of the correlation function, cor(), introduce how linear models are represented in ggplot, and finally conclude by showcasing a variety of linear modeling techniques, demonstrating how to extract and interpret useful information, along with a collection of helper functions to extend their utility.
One of the benefits of Markdown (and R Markdown in particular) is the ability to include various Latex expressions alongside written text and code. Latex, similar to markdown, is a typesetting language that can help us represent various mathematical symbols. These expressions are not necessary, but I have included below a number that you may find useful in this course:
Symbol | Latex Expression | Output |
---|---|---|
Beta coefficient | $\beta$ | \(\beta\) |
“Hat” notation | $\hat{y}$ | \(\hat{y}\) |
“Wide Hat” notation | $\widehat{hello}$ | \(\widehat{hello}\) |
Overline | $\overline{x}$ | \(\overline{x}\) |
Subscript | $b_0$ | \(b_0\) |
Superscript | $X^2$ | \(X^2\) |
Mu, sigma, epsilon, rho | $\mu$, $\sigma$, $\epsilon$, $\rho$ | \(\mu\), \(\sigma\), \(\epsilon\), \(\rho\) |
Indicator Variables | $\mathbb{1}_{\text{Public}}$ | \(\mathbb{1}_{\text{Public}}\) |
Linear Model | $\hat{y} = \hat{\beta}_0 + x \hat{\beta}_1$ | \(\hat{y} = \hat{\beta}_0 + x \hat{\beta}_1\) |
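Each of these can be used inline by wrapping the expression in single dollar signs within your text. For example, writing `the slope estimate $\hat{\beta}_1$` in an R Markdown document renders the symbol directly inside the sentence.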
You can write an equation in “equation” form by using double dollar signs:
$$
\hat{y} = \hat{\beta}_0 + x \hat{\beta}_1
$$
will render as
\[ \hat{y} = \hat{\beta}_0 + x \hat{\beta}_1 \]
Question 0: Recreate the following expression in latex:
\[ \overline{\beta} = x^2 - \hat{9}_9 \]
Correlation analysis in R is done using the cor() function, which takes two variables as its arguments:
# cor(college$Adm_Rate, college$ACT_median) also works
with(college, cor(Adm_Rate, ACT_median))
## [1] -0.43393
By default, the method used is Pearson’s, but we can change this with the method argument:
## This is the default, notice it returns same value as above
with(college, cor(Adm_Rate, ACT_median, method = "pearson"))
## [1] -0.43393
## We can find spearman's just as easily
with(college, cor(Adm_Rate, ACT_median, method = "spearman"))
## [1] -0.21997
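As an aside, cor() can also take a data frame of numeric columns and return a matrix of every pairwise correlation. A quick sketch using three of the numeric columns in college (the use = "complete.obs" argument drops rows with missing values):
## Correlation matrix for several numeric columns at once
cor(college[, c("Adm_Rate", "ACT_median", "Salary10yr_median")],
    use = "complete.obs")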
Question 1: This problem uses the mpg dataset with the variables displ and cty.
Fortunately for us, ggplot includes a geom layer designed for the express purpose of visualizing mathematical relationships between two quantitative variables: geom_smooth(). By default, geom_smooth() fits what is called a “Loess” curve with standard error bars, though we will not be using this much in class:
## Using geom_smooth
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth()
We can remove the standard error bars with the argument se = FALSE:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(se = FALSE)
Critical for our mission today, however, is the fact that we can modify this smoother to add a simple linear regression line to our plot by passing in the argument method = lm:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
This will be a useful tool for visualization as we continue in the lab.
While correlation is a useful way of quantitatively describing the (linear) relationship between two variables, simple linear regression extends this utility by allowing us to make predictions according to the line
\[ \begin{aligned} y = \beta_0 + x \beta_1 + \epsilon \end{aligned} \] where \(y\) is our outcome variable, \(x\) is our predictor, and \(\epsilon\) (epsilon) represents the error. We include the error term now for completeness, but we will not worry about it any further for now.
Note that this is a theoretical model; in practice, we need to use our data to estimate the values of \(\beta_0\) and \(\beta_1\). We do this by fitting a linear model. We denote our estimated values (also known as fitted values) using “hat” notation. For example, if \(y\) represents our observed outcome, then \(\hat{y}\) represents our predicted or fitted values. We thus represent the fitted regression line in terms of its estimates:
\[ \hat{y} = \hat{\beta}_0 + x \hat{\beta}_1 \] Note that this formula does not include the error term, \(\epsilon\), as this equation represents the line itself rather than our theoretical model.
Let’s run through an example together to build a linear model using the mpg dataset with the variables hwy and cty, which represent the miles per gallon when driving on the highway or in the city, respectively. We will make hwy our outcome and cty our predictor variable. Our regression model is then:
\[ \text{Highway mpg} = \beta_0 + \text{City mpg} \times \beta_1 \]
We estimate the coefficients of this model using the lm() function in R:
fit <- lm(hwy ~ cty, data = mpg)
The first argument, hwy ~ cty, is called a formula in R, where our outcome is on the left and our predictor is on the right. We can extract the estimated coefficients with the coef() function, which returns values for the intercept and slope:
coef(fit)
## (Intercept) cty
## 0.89204 1.33746
Note that this is a vector, so we can save it to a variable and extract its elements if we wish:
## Assign coefficients to a variable
model_coef <- coef(fit)
## Extract the intercept, the first element
model_coef[1]
## (Intercept)
## 0.89204
## Extract the slope
model_coef[2]
## cty
## 1.3375
\[ \widehat{\text{Highway mpg}} = 0.89 + 1.33 \times \text{City mpg} \] Thus, we see that for every one mile-per-gallon increase in city mpg, we can expect to see a 1.33 increase in highway mpg. Further, if I wanted to predict the highway miles per gallon for a vehicle with 18 city miles per gallon, I could do so with
\[ \widehat{\text{Highway mpg}} = 0.89 + 1.33 \times 18 = 24.83 \]
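We can do this same arithmetic in R with the coefficient vector we saved earlier. A quick sketch (the result differs slightly from the 24.83 above, which used rounded coefficients):
## Predicted highway mpg for a vehicle with 18 city mpg,
## using the full-precision coefficients
model_coef[1] + model_coef[2] * 18
## (Intercept)
##      24.966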
Question 2: This question uses the college dataset from above. Fit a linear model predicting FourYearComp_Males using the variable FourYearComp_Females.
In addition to extracting coefficients from our model, there is quite a bit more that we are able to do. The summary() function provides a lot of useful information that we will be using later in the course. For now, see if you can identify where \(R^2\) (the proportion of explained variance) is reported:
## Fit the model to variable called fit
fit <- lm(hwy ~ cty, data = mpg)
## Use summary to print output
summary(fit)
##
## Call:
## lm(formula = hwy ~ cty, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.341 -1.279 0.021 1.034 4.046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.892 0.469 1.9 0.058 .
## cty 1.337 0.027 49.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.75 on 232 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.913
## F-statistic: 2.46e+03 on 1 and 232 DF, p-value: <0.0000000000000002
If we look near the bottom, we see a value of \(R^2 = 0.914\), indicating that about 91% of the total variation in highway mpg can be explained using the city mpg.
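If we want this number itself rather than reading it off the printout, it is stored in the summary object and can be extracted directly (a quick sketch):
## Pull R^2 directly out of the model summary
## (returns approximately 0.914, matching the printout above)
summary(fit)$r.squared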
Question 3: This question involves a penguins dataset containing measurements for 344 penguins, including their species, size, and sex.
penguins <- read.csv("https://collinn.github.io/data/penguins.csv")
## Remove penguins with missing sex values
penguins <- filter(penguins, sex %in% c("male", "female"))
Part A Using the data above, create two different ggplots, both having body mass as the outcome variable, but with plot one using bill length as the predictor, the other using flipper length. For each plot, also include a visual representation of the linear model. Based on these, which predictor seems to have the least amount of unexplained variability? (Note: there are some missing values so you will get warnings – this is totally fine)
Part B: Create linear models based on each of the plots above. From the summaries, what is the \(R^2\) value for each model? Is this consistent with what you found in Part A? Explain
As we saw in class, we can also perform simple linear regression on a quantitative outcome using a categorical predictor: the result tells us the mean value of the quantitative variable within each of the categorical groups. Models with a categorical predictor will typically report a value for each of the categories except one: the variable missing in the output is called the reference variable and it is included in the intercept.
For example, consider the case using our college dataset in which we wish to estimate the mean enrollment for all public and private schools:
## Using dplyr
college %>% group_by(Type) %>%
summarize(meanEnrollment = mean(Enrollment))
## # A tibble: 2 × 2
## Type meanEnrollment
## <chr> <dbl>
## 1 Private 2720.
## 2 Public 11325.
Here, we see the mean enrollment for private colleges is 2,720 students, while the mean enrollment for public colleges is 11,325. Now consider how this plays out when constructing our linear model with Enrollment as the outcome and Type as our predictor variable.
lm(Enrollment ~ Type, college)
##
## Call:
## lm(formula = Enrollment ~ Type, data = college)
##
## Coefficients:
## (Intercept) TypePublic
## 2720 8605
Looking at the output, there are a few things we should notice. Writing out the fitted model, we have:
\[ \begin{aligned} \widehat{\text{Enrollment}} &= \hat{\beta}_0 + \hat{\beta}_1 \times \mathbb{1}_{\text{Public}}\\ &= 2720 + 8605 \times \mathbb{1}_{\text{Public}} \end{aligned} \] where \(\mathbb{1}_{\text{Public}}\) is our indicator variable. Recall that if a school is a public school then \(\mathbb{1}_{\text{Public}} = 1\), and if a school is private, \(\mathbb{1}_{\text{Public}} = 0\):
\[ \mathbb{1}_{\text{Public}} = \begin{cases} 1 \quad \text{Type = Public} \\ 0 \quad \text{Type = Private} \end{cases} \]
For a private school, then, our model predicts \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 0 \\ &= 2720 \end{aligned} \] This is precisely the mean enrollment for private schools that we found above.
For a public school, we instead have \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 1 \\ &= 11325 \end{aligned} \] Again, this is precisely what we found when we computed the mean value directly above.
In general, we should recall that whenever a categorical variable with \(p\) different categories is included in our model, our model output will give us coefficient estimates for \(p-1\) different categories, with the one omitted being included in the intercept as the reference variable.
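To see that the reference variable is just a choice of baseline, we can ask R to use a different one. A quick sketch using relevel(), which resets the reference level of a factor (college2 is a temporary copy so the original data is left alone):
## Make Public the reference level, then refit the model
college2 <- mutate(college, Type = relevel(factor(Type), ref = "Public"))
lm(Enrollment ~ Type, college2)
## The intercept becomes the public mean (11325) and
## TypePrivate becomes the difference, -8605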
Question 4: First, with the mpg dataset, use the appropriate dplyr functions to find the average city mpg (cty) for each of the categories in drv. Then use the lm() function to fit a linear model, again with cty as the outcome variable and with drv as the predictor. What becomes the reference variable in this model? Which drive-train appears to have the best city mpg? Which has the worst?
Multiple linear regression (MLR) is a natural extension of the ideas covered in simple linear regression, and this is represented in the formula syntax in R. For example, to include an additional covariate, all we need to do is use + in our lm() function:
## Let's treat cylinder as a categorical variable
mpg <- mutate(mpg, cyl = factor(cyl))
## Now regression with two categorical variables
fit <- lm(cty ~ drv + cyl, mpg)
coef(fit)
## (Intercept) drvf drvr cyl5 cyl6 cyl8
## 18.8758 2.9838 1.6505 -1.3596 -4.3683 -6.8422
Using these, we should be able to write out our regression formulas and use them to predict values for the response variable, given values of our categorical variables.
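For example, consider a front-wheel drive vehicle with 6 cylinders. The intercept represents the reference categories (here, 4-wheel drive with 4 cylinders), so we add the drvf and cyl6 coefficients to it. A quick check by hand:
## Predicted city mpg for a front-wheel drive, 6-cylinder vehicle
18.8758 + 2.9838 + (-4.3683)
## [1] 17.4913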
There is a way to predict values of a response variable, given the predictors, that does not involve writing things out by hand. This is the predict() function. predict() requires two things: a fitted linear model object (returned by lm()) and a data.frame that contains values of our predictors.
For example, suppose I am creating a model to predict vehicle miles per gallon based on displacement and horsepower. First I would create a model, giving me my lm object; then I would create a new data frame with my explanatory variables. Critically, the names of my new data frame must match exactly those in my model.
Below, I show how to predict the miles per gallon of two different vehicles:
## First, I create my model and save the object
fit <- lm(mpg ~ disp + hp, mtcars)
## I create a data.frame with variable names disp and hp
df <- data.frame(disp = c(170, 300),
hp = c(125, 185))
## Use predict() to estimate the mpg of each
predict(fit, df)
## 1 2
## 22.472 17.037
This tells me that my first vehicle has a predicted mpg of 22.472 while the second is estimated at 17.037 mpg.
This next problem will ask you to use the mean() function to find an average. One thing we should be aware of, however, is that if there are any NA values in our dataset, the mean() function will return NA by default. We can correct this by using an additional argument, na.rm = TRUE:
x <- c(2, 4, NA, 6)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 4
This will be a generally useful thing to know.
Question 5: For this problem, we will again be using the penguins dataset included above.
Part A Create a linear model using the bill length (mm) as our response variable and including both sex and flipper length (mm) as explanatory variables. Interpret the intercept and slope. Is the intercept meaningful?
Part B Recreate the model from Part A, this time also including species as an explanatory variable. Interpret the intercept in this model and explain why it has changed.
Part C Using your dplyr functions, find the average flipper length for each of the species in the dataset.
Part D Subset your penguin data to only include Chinstrap and Gentoo penguins. Using your subset data, recreate the same model from Part B. Why has the indicator variable for Gentoo changed so drastically?
Part E Using your linear model from Part D, find the predicted bill length (mm) of a male Gentoo penguin with a flipper length of 208mm