library(ggplot2)
library(dplyr)
## Less uggo plots
theme_set(theme_bw())
## Copy this to turn off scientific notation (also uggo)
options(scipen = 9999)
## Copy this to install and load the openintro package
## This works by checking if it is installed and, if not, installing it for you
if (!require(openintro)) {
  install.packages("openintro")
}
## Modify the mpg dataset in ggplot2
mpg$cyl <- factor(mpg$cyl) # turn cyl into categorical
mpg <- as.data.frame(mpg) # turn mpg into regular data.frame
## College data
college <- read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
This lab is broken into sections with various questions embedded in each.
Both correlation and linear models are relatively straightforward operations in R, utilizing only the two functions cor() and lm() (for correlation and (l)inear (m)odel, respectively). Before moving on to these, it will be worthwhile to briefly introduce the geom_smooth() layer in ggplot2 to help us visualize the relationships between variables.
Consider first this plot from the college dataset looking at the relationship between median ACT score and median 10-year post-graduation salary:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point()
We can get a sense of the general relationship between these two variables using a smoother which, by default, creates a loess curve (a form of moving average that we need not worry about in detail). Note that we add the argument se = FALSE so that the plot does not include estimates of standard error:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(se = FALSE)
Critical for our mission today, however, is the fact that we can modify this smoother to add a simple linear regression line to our plot by passing in the argument method = lm:
ggplot(college, aes(ACT_median, Salary10yr_median)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
This will be a useful tool for visualization as we continue in the lab.
Correlation analysis in R is done using the cor() function, which takes two variables as its arguments:
cor(college$Adm_Rate, college$ACT_median)
## [1] -0.43393
By default, the method used is Pearson’s, but we can change this with the method argument:
## This is the default
cor(college$Adm_Rate, college$ACT_median, method = "pearson")
## [1] -0.43393
## We can find Spearman's just as easily
cor(college$Adm_Rate, college$ACT_median, method = "spearman")
## [1] -0.21997
Just as with the table()
function, we can optionally use
a cleaner syntax to avoid a bunch of typing:
# The general syntax is with(data, f(var1, var2))
with(college, cor(Adm_Rate, ACT_median))
## [1] -0.43393
Question 1: This problem uses the mpg dataset with the variables displ and cty.
While correlation is a useful way of quantitatively describing the (linear) relationship between two variables, simple linear regression extends this utility by allowing us to make predictions according to the line
\[ y = \beta_0 + x \beta_1 + \epsilon \] where \(y\) is our outcome variable, \(x\) is our predictor, and \(\epsilon\) (epsilon) represents the error.
Note that this is a theoretical model; in practice, we need to use our data to estimate the values of \(\beta_0\) and \(\beta_1\). We do this by fitting a linear model. We denote our estimated values (also known as fitted values) using “hat” notation: for example, if \(y\) represents our observed outcome, then \(\hat{y}\) represents our predicted or fitted values. We thus represent the fitted regression line in terms of its estimates:
\[ \hat{y} = \hat{\beta_0} + x \hat{\beta_1} \] Note that this formula does not have the error term, \(\epsilon\), as this equation represents the line itself rather than our theoretical model.
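To make the distinction between the theoretical coefficients and their estimates concrete, here is a small simulation (the true values \(\beta_0 = 2\) and \(\beta_1 = 3\) are made up purely for illustration). We generate data from the model above and find that lm() recovers estimates that are close to, but not exactly equal to, the true values:
## Simulate data from the theoretical model y = 2 + 3x + epsilon
set.seed(42)
x <- runif(100)       # 100 random predictor values
eps <- rnorm(100)     # the error term, epsilon
y <- 2 + 3 * x + eps  # illustrative true values: beta0 = 2, beta1 = 3
## The fitted coefficients estimate, but do not exactly equal, 2 and 3
coef(lm(y ~ x))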
Let’s run through an example together to build a linear model using the mpg dataset with the variables hwy and cty, which represent the miles per gallon when driving on the highway or in the city, respectively. We will make hwy our outcome and cty our predictor variable. Our regression model is then:
\[ \text{Highway mpg} = \beta_0 + \text{City mpg} \times \beta_1 + \epsilon \]
We fit this model using lm() in R:
fit <- lm(hwy ~ cty, data = mpg)
The first argument to lm(), hwy ~ cty, is called a formula in R, where our outcome is on the left and our predictor is on the right. We can extract the estimated coefficients with the coef() function, which returns values for the intercept and slope:
coef(fit)
## (Intercept) cty
## 0.89204 1.33746
Note that this is a vector, so we can save it to a variable and extract its elements if we wish:
## Assign coefficients to a variable
model_coef <- coef(fit)
## Extract the intercept, the first element
model_coef[1]
## (Intercept)
## 0.89204
## Extract the slope
model_coef[2]
## cty
## 1.3375
\[ \widehat{\text{Highway mpg}} = 0.89 + 1.34 \times \text{City mpg} \] Thus, we see that for every one mpg increase in city mileage, we can expect to see about a 1.34 mpg increase in highway mileage.
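We can also use the fitted line to make predictions. For example, for a hypothetical car getting 20 mpg in the city (a value chosen just for illustration), we can compute the predicted highway mpg by hand using our saved coefficients or with the predict() function:
## Prediction by hand using the saved coefficients
model_coef[1] + model_coef[2] * 20
## The same prediction using predict(), which takes new data as a data.frame
predict(fit, newdata = data.frame(cty = 20))
Both approaches give roughly 27.6 highway mpg.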
Question 2: This question uses the college dataset from above. Fit a linear model predicting FourYearComp_Males using the variable FourYearComp_Females.
In addition to extracting coefficients from our model, there is quite a bit more that we can do in R. The summary() function provides a lot of useful information that we will be using later in the course. For now, see if you can identify where \(R^2\) (the proportion of explained variance) is reported:
## Fit the model to variable called fit
fit <- lm(hwy ~ cty, data = mpg)
## Use summary to print output
summary(fit)
##
## Call:
## lm(formula = hwy ~ cty, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.341 -1.279 0.021 1.034 4.046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.892 0.469 1.9 0.058 .
## cty 1.337 0.027 49.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.75 on 232 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.913
## F-statistic: 2.46e+03 on 1 and 232 DF, p-value: <0.0000000000000002
If we look near the bottom, we see a value of \(R^2 = 0.914\), indicating that about 91% of the total variation in highway mpg can be explained using the city mpg.
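If we want the \(R^2\) value by itself rather than the full printout, we can extract it from the object that summary() returns:
## Extract R-squared directly; this matches the 0.914 reported above
summary(fit)$r.squared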
Question 3: This question will use the cherry dataset that comes with the openintro package, which details the height (in feet), the diameter (in inches, measured 54 inches above ground), and the total volume of wood (in cubic feet) of a sample of cherry trees:
library(openintro)
head(cherry)
## # A tibble: 6 × 3
## diam height volume
## <dbl> <int> <dbl>
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7
Below are two plots, each showing a fitted regression line predicting total volume using either diameter or height as the predictor.
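For reference, plots like these can be produced with the same geom_smooth() approach introduced earlier in the lab:
## Volume against diameter, with a fitted regression line
ggplot(cherry, aes(diam, volume)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
## Volume against height, with a fitted regression line
ggplot(cherry, aes(height, volume)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)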
Which of these models do you anticipate has a larger \(R^2\)? Explain your reasoning. Verify your answer by creating two simple linear regression models, each using volume as the outcome, and report the \(R^2\) of each.
As we saw in class, we can also perform simple linear regression on a quantitative outcome using a categorical predictor; the result tells us the mean value of the quantitative variable within each of the categorical groups. Models with a categorical predictor will typically report a value for each of the categories except one: the category missing from the output is called the reference variable, and it is included in the intercept.
For example, consider the case using our college
dataset
in which we wish to estimate the mean enrollment for all public and
private schools:
## Subset the private and public schools separately
c_priv <- subset(college, Private == "Private")
c_pub <- subset(college, Private == "Public")
mean(c_priv$Enrollment)
## [1] 2720.4
mean(c_pub$Enrollment)
## [1] 11325
Here, we see the mean enrollment for private colleges is 2,720 students, while the mean enrollment for public colleges is 11,325. Now consider how this plays out when constructing our linear model with Enrollment as the outcome and Private as our predictor variable.
lm(Enrollment ~ Private, college)
##
## Call:
## lm(formula = Enrollment ~ Private, data = college)
##
## Coefficients:
## (Intercept) PrivatePublic
## 2720 8605
Looking at the output, there are a few things we should notice. First, we can write the estimated model as:
\[ \begin{aligned} \widehat{\text{Enrollment}} &= \hat{\beta_0} + \hat{\beta_1} \times \mathbb{1}_{\text{Public}}\\ &= 2720 + 8605 \times \mathbb{1}_{\text{Public}} \end{aligned} \] where \(\mathbb{1}_{\text{Public}}\) is our indicator variable. Recall that if a school is a public school then \(\mathbb{1}_{\text{Public}} = 1\), and if a school is private, \(\mathbb{1}_{\text{Public}} = 0\).
For a private school, we have \(\mathbb{1}_{\text{Public}} = 0\), so the model predicts \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 0 \\ &= 2720 \end{aligned} \] This is precisely the mean value that we found above for private schools.
For a public school, \(\mathbb{1}_{\text{Public}} = 1\), giving \[ \begin{aligned} \widehat{\text{Enrollment}} &= 2720 + 8605 \times 1 \\ &= 11325 \end{aligned} \] Again, this is precisely what we found when we computed the mean value directly above.
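Rather than plugging the indicator values in by hand, we can ask R to do this arithmetic for us with predict(), saving the model to a variable (called enroll_fit here just for this example) and supplying one school of each type:
## Fit the model and save it so we can predict from it
enroll_fit <- lm(Enrollment ~ Private, data = college)
## Fitted enrollment for a private and a public school;
## these match the two group means computed above
predict(enroll_fit, newdata = data.frame(Private = c("Private", "Public")))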
In general, we should recall that whenever a categorical variable with \(p\) different categories is included in our model, our model output will give us coefficient estimates for \(p-1\) different categories, with the one omitted being included in the intercept as the reference variable.
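As a quick illustration with more than two categories, recall that we converted cyl in the mpg dataset to a factor at the top of this lab; it has four levels (4, 5, 6, and 8 cylinders), so a model using it as a predictor reports three coefficients, with 4 cylinders serving as the reference:
## Four cylinder categories yield an intercept plus three coefficients
coef(lm(hwy ~ cyl, data = mpg))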
Question 4: Using the lm() function and the mpg dataset, determine the average city mpg (cty) for each of the categories in the drv variable (drv is drive-train, with values 4, f, and r for 4-wheel drive, front-wheel drive, and rear-wheel drive, respectively). What becomes the reference variable in this model? Which drive-train appears to have the best city mpg? Which has the worst?