library(ggplot2)
library(dplyr)
## Less uggo plots
theme_set(theme_bw())
## Copy this to turn off scientific notation (also uggo)
options(scipen = 9999)
## Copy this to install and load the openintro package
## This works by checking if it is installed and, if not, installing it for you
if (!require(openintro)) {
  install.packages("openintro")
}
## Modify the mpg dataset in ggplot2
mpg$cyl <- factor(mpg$cyl) # turn cyl into categorical
mpg <- as.data.frame(mpg) # turn mpg into regular data.frame
## College data
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
This lab is broken into sections with various questions embedded in each.
Both correlation and linear models are relatively straightforward operations in R, utilizing only the two functions cor() and lm() (for correlation and (l)inear (m)odel, respectively). Before moving on to these, it will be worthwhile to briefly introduce the geom_smooth() layer in ggplot2 to help us visualize the relationships between variables.
Consider first this plot from the college dataset looking at the relationship between median ACT score and median 10-year post-graduation salary:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point()
We can get a sense of the general relationship between these two variables using a smoother which, by default, creates a loess curve (a form of moving average that we need not worry about in detail). Note that we add the argument se = FALSE so that the plot does not include estimates of standard error:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(se = FALSE)
Critical for our mission today, however, is the fact that we can modify this smoother to add a simple linear regression line to our plot by passing in the argument method = lm:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
This will be a useful tool for visualization as we continue in the lab.
Correlation analysis in R is done using the cor() function, which takes two variables as its arguments:
cor(college$Adm_Rate, college$ACT_median)
## [1] -0.43393
By default, the method used is Pearson’s, but we can change this with the method argument:
## This is the default
cor(college$Adm_Rate, college$ACT_median, method = "pearson")
## [1] -0.43393
## We can find Spearman's just as easily
cor(college$Adm_Rate, college$ACT_median, method = "spearman")
## [1] -0.21997
Just as with the table()
function, we can optionally use
a cleaner syntax to avoid a bunch of typing:
# The general syntax is with(data, f(var1, var2))
with(college, cor(Adm_Rate, ACT_median))
## [1] -0.43393
Question 1: This problem uses the mpg dataset with the variables displ and cty.
While correlation is a useful way of quantitatively describing the (linear) relationship between two variables, simple linear regression extends this utility by allowing us to make predictions according to the line
\[ y = \beta_0 + x \beta_1 + \epsilon \] where \(y\) is our outcome variable, \(x\) is our predictor, and \(\epsilon\) (epsilon) represents the error.
Note that this is a theoretical model; in practice, we need to use our data to estimate the values of \(\beta_0\) and \(\beta_1\). We do this by fitting a linear model. We denote our estimated values (also known as fitted values) using “hat” notation: for example, if \(y\) represents our observed outcome, then \(\hat{y}\) represents our predicted or fitted values. We thus represent the fitted regression line in terms of its estimates:
\[ \hat{y} = \hat{\beta_0} + x \hat{\beta_1} \] Note that this formula does not have the error term, \(\epsilon\), as this equation represents the line itself rather than our theoretical model.
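To make the distinction between the theoretical coefficients and their estimates concrete, here is a small simulation (the true values \(\beta_0 = 2\) and \(\beta_1 = 3\) are made up purely for illustration). We generate data from the model above and find that lm() recovers estimates that are close to, but not exactly equal to, the true values:
## Simulate data from the theoretical model y = 2 + 3x + epsilon
set.seed(42)
x <- runif(100)       # 100 random predictor values
eps <- rnorm(100)     # the error term, epsilon
y <- 2 + 3 * x + eps  # illustrative true values: beta0 = 2, beta1 = 3
## The fitted coefficients estimate, but do not exactly equal, 2 and 3
coef(lm(y ~ x))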
Let’s run through an example together to build a linear model using the mpg dataset with the variables hwy and cty, which represent the miles per gallon when driving on the highway or in the city, respectively. We will make hwy our outcome and cty our predictor variable. Our regression model is then:
\[ \text{Highway mpg} = \beta_0 + \text{City mpg} \times \beta_1 + \epsilon \]
We fit this model using lm() in R:
fit <- lm(hwy ~ cty, data = mpg)
The first argument to lm(), hwy ~ cty, is called a formula in R, where our outcome is on the left and our predictor is on the right. We can extract the estimated coefficients with the coef() function, which returns values for the intercept and slope:
coef(fit)
## (Intercept) cty
## 0.89204 1.33746
Note that this is a vector, so we can save it to a variable and extract its elements if we wish:
## Assign coefficients to a variable
model_coef <- coef(fit)
## Extract the intercept, the first element
model_coef[1]
## (Intercept)
## 0.89204
## Extract the slope
model_coef[2]
## cty
## 1.3375
\[ \widehat{\text{Highway mpg}} = 0.89 + 1.34 \times \text{City mpg} \] Thus, we see that for every one mpg increase in city mileage, we can expect to see about a 1.34 mpg increase in highway mileage.
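We can also use the fitted line to make predictions. For example, for a hypothetical car getting 20 mpg in the city (a value chosen just for illustration), we can compute the predicted highway mpg by hand using our saved coefficients or with the predict() function:
## Prediction by hand using the saved coefficients
model_coef[1] + model_coef[2] * 20
## The same prediction using predict(), which takes new data as a data.frame
predict(fit, newdata = data.frame(cty = 20))
Both approaches give roughly 27.6 highway mpg.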
Question 2: This question uses the college dataset from above. Fit a linear model predicting FourYearComp_Males using the variable FourYearComp_Females.
In addition to extracting coefficients from our model, there is quite a bit more that we can do in R. The summary() function provides a lot of useful information that we will be using later in the course. For now, see if you can identify where \(R^2\) (the proportion of explained variance) is reported:
## Fit the model to variable called fit
fit <- lm(hwy ~ cty, data = mpg)
## Use summary to print output
summary(fit)
##
## Call:
## lm(formula = hwy ~ cty, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.341 -1.279 0.021 1.034 4.046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.892 0.469 1.9 0.058 .
## cty 1.337 0.027 49.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.75 on 232 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.913
## F-statistic: 2.46e+03 on 1 and 232 DF, p-value: <0.0000000000000002
If we look near the bottom, we see a value of \(R^2 = 0.914\), indicating that about 91% of the total variation in highway mpg can be explained using the city mpg.
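If we want the \(R^2\) value by itself rather than the full printout, we can extract it from the object that summary() returns:
## Extract R-squared directly; this matches the 0.914 reported above
summary(fit)$r.squared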
Question 3: This question will use the cherry dataset that comes with the openintro package, which details the height (in feet), the diameter (in inches, measured 54 inches above ground), and the total volume of wood (in cubic feet) of a sample of cherry trees:
library(openintro)
head(cherry)
## # A tibble: 6 × 3
## diam height volume
## <dbl> <int> <dbl>
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
Below are two plots, each showing a fitted regression line predicting total volume using either diameter or height as the predictor.
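For reference, plots like these can be produced with the same geom_smooth() approach introduced earlier in the lab:
## Volume against diameter, with a fitted regression line
ggplot(cherry, aes(diam, volume)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
## Volume against height, with a fitted regression line
ggplot(cherry, aes(height, volume)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)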
Which of these models do you anticipate has a larger \(R^2\)? Explain your reasoning. Verify your answer by creating two simple linear regression models, each using volume as the outcome, and report the \(R^2\) of each.
As we saw in class, we can also perform simple linear regression on a quantitative outcome using a categorical predictor; the result tells us the mean value of the quantitative variable within each of the categorical groups. Models with a categorical predictor will typically report a value for each of the categories except one: the category missing from the output is called the reference variable, and it is included in the intercept.
For example, consider the case using our college
dataset
in which we wish to estimate the mean enrollment for all public and
private schools:
## Subset the private and public schools separately
c_priv <- subset(college, Private == "Private")
c_pub <- subset(college, Private == "Public")
mean(c_priv$Enrollment)
## [1] 2720.4
mean(c_pub$Enrollment)
## [1] 11325
Here, we see the mean enrollment for private colleges is 2,720 students, while the mean enrollment for public colleges is 11,325. Now consider how this plays out when constructing our linear model with Enrollment as the outcome and Private as our predictor variable.
lm(Enrollment ~ Private, college)
##
## Call:
## lm(formula = Enrollment ~ Private, data = college)
##
## Coefficients:
## (Intercept) PrivatePublic
## 2720 8605
Looking at the output, there are a few things we should notice. First, we can write the estimated model as:
\[ \begin{aligned} \widehat{\text{Enrollment}} &= \hat{\beta_0} + \hat{\beta_1} \times \mathbb{1}_{\text{Public}}\\ &= 2720 + 8605 \times \mathbb{1}_{\text{Public}} \end{aligned} \] where \(\mathbb{1}_{\text{Public}}\) is our indicator variable. Recall that if a school is a public school then \(\mathbb{1}_{\text{Public}} = 1\), and if a school is private, \(\mathbb{1}_{\text{Public}} = 0\).
For a private school, we have \(\mathbb{1}_{\text{Public}} = 0\), so the model predicts \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 0 \\ &= 2720 \end{aligned} \] This is precisely the mean value that we found above for private schools.
For a public school, \(\mathbb{1}_{\text{Public}} = 1\), giving \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 1 \\ &= 11325 \end{aligned} \] Again, this is precisely what we found when we computed the mean value directly above.
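Rather than plugging the indicator values in by hand, we can ask R to do this arithmetic for us with predict(), saving the model to a variable (called enroll_fit here just for this example) and supplying one school of each type:
## Fit the model and save it so we can predict from it
enroll_fit <- lm(Enrollment ~ Private, data = college)
## Fitted enrollment for a private and a public school;
## these match the two group means computed above
predict(enroll_fit, newdata = data.frame(Private = c("Private", "Public")))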
In general, we should recall that whenever a categorical variable with \(p\) different categories is included in our model, our model output will give us coefficient estimates for \(p-1\) different categories, with the one omitted being included in the intercept as the reference variable.
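As a quick illustration with more than two categories, recall that we converted cyl in the mpg dataset to a factor at the top of this lab; it has four levels (4, 5, 6, and 8 cylinders), so a model using it as a predictor reports three coefficients, with 4 cylinders serving as the reference:
## Four cylinder categories yield an intercept plus three coefficients
coef(lm(hwy ~ cyl, data = mpg))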
Question 4: Using the lm() function and the mpg dataset, determine the average city mpg (cty) for each of the categories in the drv variable (drv is drive-train, with values 4, f, and r for 4-wheel drive, front-wheel drive, and rear-wheel drive, respectively). What becomes the reference variable in this model? Which drive-train appears to have the best city mpg? Which has the worst?