library(ggplot2)
library(dplyr)
## Less uggo plots
theme_set(theme_bw())
## Copy this to turn off scientific notation (also uggo)
options(scipen = 9999)
## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")
This lab is intended to be a first introduction to linear regression in R. In pursuit of this, we will begin with a brief introduction to the use of Latex in R Markdown, a convenient typesetting language that will help us express mathematical ideas notationally. Following this, we will investigate the use of the correlation function, cor(), introduce how linear models are represented in ggplot, and finally conclude by showcasing a variety of linear modeling techniques, demonstrating how to extract and interpret useful information, along with a collection of helper functions to extend their utility.
One of the benefits of Markdown (and R Markdown in particular) is the ability to include various Latex expressions alongside written text and code. Latex, similar to markdown, is a typesetting language that can help us represent various mathematical symbols. These expressions are not necessary, but I have included below a number that you may find useful in this course:
Symbol | Latex Expression | Output |
---|---|---|
Beta coefficient | $\beta$ | \(\beta\) |
“Hat” notation | $\hat{y}$ | \(\hat{y}\) |
“Wide Hat” notation | $\widehat{hello}$ | \(\widehat{hello}\) |
Overline | $\overline{x}$ | \(\overline{x}\) |
Subscript | $b_0$ | \(b_0\) |
Superscript | $X^2$ | \(X^2\) |
Mu, sigma, epsilon, rho | $\mu$, $\sigma$, $\epsilon$, $\rho$ | \(\mu\), \(\sigma\), \(\epsilon\), \(\rho\) |
Indicator Variables | $\mathbb{1}_{\text{Public}}$ | \(\mathbb{1}_{\text{Public}}\) |
Linear Model | $\hat{y} = \hat{\beta}_0 + x \hat{\beta}_1$ | \(\hat{y} = \hat{\beta}_0 + x \hat{\beta}_1\) |
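Each of these can be used inline by wrapping the expression in single dollar signs within your text. For example, writing `the slope estimate $\hat{\beta}_1$` in an R Markdown document renders the symbol directly inside the sentence.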
You can write an equation in “equation” form by using double dollar signs:
$$
\hat{y} = \hat{\beta}_0 + x \hat{\beta}_1
$$
will render as
\[ \hat{y} = \hat{\beta}_0 + x \hat{\beta}_1 \]
Question 0: Recreate the following expression in latex:
\[ \overline{\beta} = x^2 - \hat{9}_9 \]
Correlation analysis in R is done using the cor() function, which takes two variables as its arguments:
# cor(college$Adm_Rate, college$ACT_median) also works
with(college, cor(Adm_Rate, ACT_median))
## [1] -0.43393
By default, the method used is Pearson’s, but we can change this with the method argument:
## This is the default, notice it returns same value as above
with(college, cor(Adm_Rate, ACT_median, method = "pearson"))
## [1] -0.43393
## We can find spearman's just as easily
with(college, cor(Adm_Rate, ACT_median, method = "spearman"))
## [1] -0.21997
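As an aside, cor() can also take a data frame of numeric columns and return a matrix of every pairwise correlation. A quick sketch using three of the numeric columns in college (the use = "complete.obs" argument drops rows with missing values):
## Correlation matrix for several numeric columns at once
cor(college[, c("Adm_Rate", "ACT_median", "Salary10yr_median")],
    use = "complete.obs")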
Question 1: This problem uses the mpg dataset with the variables displ and cty.
Fortunately for us, ggplot includes a geom layer designed for the express purpose of visualizing mathematical relationships between two quantitative variables: geom_smooth(). By default, geom_smooth() fits what is called a “Loess” curve with standard error bars, though we will not be using this much in class:
## Using geom_smooth
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth()
We can remove the standard error bars with the argument se = FALSE:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(se = FALSE)
Critical for our mission today, however, is the fact that we can modify this smoother to add a simple linear regression line to our plot by passing in the argument method = lm:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
This will be a useful tool for visualization as we continue in the lab.
While correlation is a useful way of quantitatively describing the (linear) relationship between two variables, simple linear regression extends this utility by allowing us to make predictions according to the line
\[ \begin{aligned} y = \beta_0 + x \beta_1 + \epsilon \end{aligned} \] where \(y\) is our outcome variable, \(x\) is our predictor, and \(\epsilon\) (epsilon) represents the error. We include the error term now for completeness, but we will not worry about it any further for now.
Note that this is a theoretical model; in practice, we need to use our data to estimate the values of \(\beta_0\) and \(\beta_1\). We do this by fitting a linear model. We denote our estimated values (also known as fitted values) using “hat” notation. For example, if \(y\) represents our observed outcome, then \(\hat{y}\) represents our predicted or fitted values. We thus represent the fitted regression line in terms of its estimates:
\[ \hat{y} = \hat{\beta}_0 + x \hat{\beta}_1 \] Note that this formula does not include the error term, \(\epsilon\), as this equation represents the line itself rather than our theoretical model.
Let’s run through an example together to build a linear model using the mpg dataset with the variables hwy and cty, which represent the miles per gallon when driving on the highway or in the city, respectively. We will make hwy our outcome and cty our predictor variable. Our regression model is then:
\[ \text{Highway mpg} = \beta_0 + \text{City mpg} \times \beta_1 \]
We estimate the coefficients of this model using the lm() function in R:
fit <- lm(hwy ~ cty, data = mpg)
The first argument, hwy ~ cty, is called a formula in R, where our outcome is on the left and our predictor is on the right. We can extract the estimated coefficients with the coef() function, which returns values for the intercept and slope:
coef(fit)
## (Intercept) cty
## 0.89204 1.33746
Note that this is a vector, so we can save it to a variable and extract its elements if we wish:
## Assign coefficients to a variable
model_coef <- coef(fit)
## Extract the intercept, the first element
model_coef[1]
## (Intercept)
## 0.89204
## Extract the slope
model_coef[2]
## cty
## 1.3375
\[ \widehat{\text{Highway mpg}} = 0.89 + 1.33 \times \text{City mpg} \] Thus, we see that for every one mile-per-gallon increase in city mpg, we can expect to see a 1.33 increase in highway mpg. Further, if I wanted to predict the highway miles per gallon for a vehicle with 18 city miles per gallon, I could do so with
\[ \widehat{\text{Highway mpg}} = 0.89 + 1.33 \times 18 = 24.83 \]
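We can do this same arithmetic in R with the coefficient vector we saved earlier. A quick sketch (the result differs slightly from the 24.83 above, which used rounded coefficients):
## Predicted highway mpg for a vehicle with 18 city mpg,
## using the full-precision coefficients
model_coef[1] + model_coef[2] * 18
## (Intercept)
##      24.966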
Question 2: This question uses the college dataset from above. Fit a linear model predicting FourYearComp_Males using the variable FourYearComp_Females.
In addition to extracting coefficients from our model, there is quite a bit more that we are able to do. The summary() function provides a lot of useful information that we will be using later in the course. For now, see if you can identify where \(R^2\) (the proportion of explained variance) is reported:
## Fit the model to variable called fit
fit <- lm(hwy ~ cty, data = mpg)
## Use summary to print output
summary(fit)
##
## Call:
## lm(formula = hwy ~ cty, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.341 -1.279 0.021 1.034 4.046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.892 0.469 1.9 0.058 .
## cty 1.337 0.027 49.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.75 on 232 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.913
## F-statistic: 2.46e+03 on 1 and 232 DF, p-value: <0.0000000000000002
If we look near the bottom, we see a value of \(R^2 = 0.914\), indicating that about 91% of the total variation in highway mpg can be explained using the city mpg.
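If we want this number itself rather than reading it off the printout, it is stored in the summary object and can be extracted directly (a quick sketch):
## Pull R^2 directly out of the model summary
## (returns approximately 0.914, matching the printout above)
summary(fit)$r.squared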
Question 3: This question involves a penguins dataset containing measurements for 344 penguins, including their species, size, and sex.
penguins <- read.csv("https://collinn.github.io/data/penguins.csv")
## Remove penguins with missing sex values
penguins <- filter(penguins, sex %in% c("male", "female"))
Part A Using the data above, create two different ggplots, both having body mass as the outcome variable, but with plot one using bill length as the predictor, the other using flipper length. For each plot, also include a visual representation of the linear model. Based on these, which predictor seems to have the least amount of unexplained variability? (Note: there are some missing values so you will get warnings – this is totally fine)
Part B: Create linear models based on each of the plots above. From the summaries, what is the \(R^2\) value for each model? Is this consistent with what you found in Part A? Explain
As we saw in class, we can also perform simple linear regression on a quantitative outcome using a categorical predictor: the result tells us the mean value of the quantitative variable within each of the categorical groups. Models with a categorical predictor will typically report a value for each of the categories except one: the variable missing in the output is called the reference variable and it is included in the intercept.
For example, consider the case using our college dataset in which we wish to estimate the mean enrollment for all public and private schools:
## Using dplyr
college %>% group_by(Type) %>%
summarize(meanEnrollment = mean(Enrollment))
## # A tibble: 2 × 2
## Type meanEnrollment
## <chr> <dbl>
## 1 Private 2720.
## 2 Public 11325.
Here, we see the mean enrollment for private colleges is 2,720 students, while the mean enrollment for public colleges is 11,325. Now consider how this plays out when constructing our linear model with Enrollment as the outcome and Type as our predictor variable.
lm(Enrollment ~ Type, college)
##
## Call:
## lm(formula = Enrollment ~ Type, data = college)
##
## Coefficients:
## (Intercept) TypePublic
## 2720 8605
Looking at the output, there are a few things we should notice. Writing out the fitted model, we have:
\[ \begin{aligned} \widehat{\text{Enrollment}} &= \hat{\beta}_0 + \hat{\beta}_1 \times \mathbb{1}_{\text{Public}}\\ &= 2720 + 8605 \times \mathbb{1}_{\text{Public}} \end{aligned} \] where \(\mathbb{1}_{\text{Public}}\) is our indicator variable. Recall that if a school is a public school then \(\mathbb{1}_{\text{Public}} = 1\), and if a school is private, \(\mathbb{1}_{\text{Public}} = 0\):
\[ \mathbb{1}_{\text{Public}} = \begin{cases} 1 \quad \text{Type = Public} \\ 0 \quad \text{Type = Private} \end{cases} \]
For a private school, then, our model predicts \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 0 \\ &= 2720 \end{aligned} \] This is precisely the mean enrollment for private schools that we found above.
For a public school, we instead have \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 1 \\ &= 11325 \end{aligned} \] Again, this is precisely what we found when we computed the mean value directly above.
In general, we should recall that whenever a categorical variable with \(p\) different categories is included in our model, our model output will give us coefficient estimates for \(p-1\) different categories, with the one omitted being included in the intercept as the reference variable.
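To see that the reference variable is just a choice of baseline, we can ask R to use a different one. A quick sketch using relevel(), which resets the reference level of a factor (college2 is a temporary copy so the original data is left alone):
## Make Public the reference level, then refit the model
college2 <- mutate(college, Type = relevel(factor(Type), ref = "Public"))
lm(Enrollment ~ Type, college2)
## The intercept becomes the public mean (11325) and
## TypePrivate becomes the difference, -8605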
Question 4: First, with the mpg dataset, use the appropriate dplyr functions to find the average city mpg (cty) for each of the categories in drv. Then use the lm() function to fit a linear model, again with cty as the outcome variable and with drv as the predictor. What becomes the reference variable in this model? Which drive-train appears to have the best city mpg? Which has the worst?
Multiple linear regression (MLR) is a natural extension of the ideas covered in simple linear regression, and this is represented in the formula syntax in R. For example, to include an additional covariate, all we need to do is use + in our lm() function:
## Let's treat cylinder as a categorical variable
mpg <- mutate(mpg, cyl = factor(cyl))
## Now regression with two categorical variables
fit <- lm(cty ~ drv + cyl, mpg)
coef(fit)
## (Intercept) drvf drvr cyl5 cyl6 cyl8
## 18.8758 2.9838 1.6505 -1.3596 -4.3683 -6.8422
Using these, we should be able to write out our regression formulas and use them to predict values for the response variable, given values of our categorical variables.
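For example, consider a front-wheel drive vehicle with 6 cylinders. The intercept represents the reference categories (here, 4-wheel drive with 4 cylinders), so we add the drvf and cyl6 coefficients to it. A quick check by hand:
## Predicted city mpg for a front-wheel drive, 6-cylinder vehicle
18.8758 + 2.9838 + (-4.3683)
## [1] 17.4913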
There is a way to predict values of a response variable, given the predictors, that does not involve writing things out by hand. This is the predict() function. predict() requires two things: a fitted linear model object (returned by lm()) and a data.frame that contains values of our predictors.
For example, suppose I am creating a model to predict vehicle miles per gallon based on displacement and horsepower. First I would create a model, giving me my lm object; then I would create a new data frame with my explanatory variables. Critically, the names of my new data frame must match exactly those in my model.
Below, I show how to predict the miles per gallon of two different vehicles:
## First, I create my model and save the object
fit <- lm(mpg ~ disp + hp, mtcars)
## I create a data.frame with variable names disp and hp
df <- data.frame(disp = c(170, 300),
hp = c(125, 185))
## Use predict() to estimate the mpg of each
predict(fit, df)
## 1 2
## 22.472 17.037
This tells me that my first vehicle has a predicted mpg of 22.472 while the second is estimated at 17.037 mpg.
This next problem will ask you to use the mean() function to find an average. One thing we should be aware of, however, is that if there are any NA values in our dataset, the mean() function will return NA by default. We can correct this by using an additional argument, na.rm = TRUE:
x <- c(2, 4, NA, 6)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 4
This will be a generally useful thing to know.
Question 5: For this problem, we will again be using the penguins dataset included above.
Part A Create a linear model using the bill length (mm) as our response variable and including both sex and flipper length (mm) as explanatory variables. Interpret the intercept and slope. Is the intercept meaningful?
Part B Recreate the model from Part A, this time also including species as an explanatory variable. Interpret the intercept in this model and explain why it has changed.
Part C Using your dplyr functions, find the average flipper length for each of the species in the dataset.
Part D Subset your penguin data to only include Chinstrap and Gentoo penguins. Using your subset data, recreate the same model from Part B. Why has the indicator variable for Gentoo changed so drastically?
Part E Using your linear model from Part D, find the predicted bill length (mm) of a male Gentoo penguin with a flipper length of 208mm