General Regression Notes

Linear Regression with Continuous Predictor

The first type of simple linear regression (SLR) is that with a single quantitative predictor. It always has the form

\[ Y = \beta_0 + \beta_1X \]

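Where \(\beta_0\) is the intercept (the predicted value of \(Y\) when \(X = 0\)) and \(\beta_1\) is the slope (the predicted change in \(Y\) for a one-unit increase in \(X\)).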

Example

Here we construct a model predicting penguin body mass from bill length:

penguin <- read.csv("https://collinn.github.io/data/penguins.csv")

lm(body_mass_g ~ bill_length_mm, penguin)
## 
## Call:
## lm(formula = body_mass_g ~ bill_length_mm, data = penguin)
## 
## Coefficients:
##    (Intercept)  bill_length_mm  
##          388.8            86.8
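
The printed output shows only the coefficients. As a minimal sketch of how the fitted pieces can be pulled out programmatically, the block below fits the same kind of model on a small made-up data frame (`toy`, with hypothetical columns `x` and `y`; this is not the penguin data) and uses base R's `coef()`, `predict()`, and `summary()`.

```r
# Toy data, invented for illustration only (not the penguin data)
toy <- data.frame(x = c(0, 1, 2, 3),
                  y = c(1, 3, 4, 6))

fit <- lm(y ~ x, data = toy)

# Named vector of estimates: intercept (beta_0) and slope (beta_1)
b <- coef(fit)

# Prediction at a new x value, computed two equivalent ways
by_hand  <- unname(b["(Intercept)"] + b["x"] * 2.5)
by_model <- unname(predict(fit, newdata = data.frame(x = 2.5)))

# R^2: proportion of the variability in y explained by the model
r2 <- summary(fit)$r.squared
```

The same calls should work on a model fit to the penguin data once the result of `lm()` is stored in a variable.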

Question 0: Write the equation for this model based on the output above

Question 1: Based on this, does there appear to be a positive or negative relationship between bill length and body mass?

Question 2: If a penguin’s bill length increased by 2mm, what would be the projected increase in body mass?

Question 3: What is the predicted body mass of a penguin that has a bill length of 44mm?

Question 4: What does it mean if the \(R^2\) value for this model is \(R^2 = 0.347\)?

Linear Regression with Categorical Predictor

This is just like SLR with a quantitative variable, except now our explanatory variable is categorical. For a variable with three categories A, B, and C, for example, it takes the form

\[ Y = \beta_0 + \beta_1 \mathbb{1}_{B} + \beta_2 \mathbb{1}_{C} \]

Where:

\[ \mathbb{1}_B = \begin{cases} 1 & \text{in category B} \\ 0 & \text{not in category B} \end{cases} \]

Remember: any category can serve as the reference category.
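
To make the indicator variables concrete, here is a small sketch of how R builds them from a factor. The data frame `toy` and its column `group` are made up for illustration. `model.matrix()` shows the 0/1 columns that `lm()` would use, and `relevel()` changes which category is treated as the reference.

```r
# Toy categorical variable with three levels, invented for illustration
toy <- data.frame(group = factor(c("A", "B", "C", "B", "A", "C")))

# By default the first factor level ("A") is the reference:
# it gets no indicator column and is absorbed into the intercept
X_default <- model.matrix(~ group, data = toy)
colnames(X_default)

# relevel() picks a different reference; here "B" loses its column instead
toy$group_b <- relevel(toy$group, ref = "B")
X_releveled <- model.matrix(~ group_b, data = toy)
colnames(X_releveled)
```

Changing the reference changes the individual coefficients but not the model's predictions, since the same group differences are simply expressed relative to a different baseline.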

Example 1

First consider the values for the variables species and sex:

with(penguin, table(species, sex))
##            sex
## species     female male
##   Adelie        73   73
##   Chinstrap     34   34
##   Gentoo        58   61

Additionally, consider two separate linear models

Model 1:

lm(body_mass_g ~ species, penguin)
## 
## Call:
## lm(formula = body_mass_g ~ species, data = penguin)
## 
## Coefficients:
##      (Intercept)  speciesChinstrap     speciesGentoo  
##           3706.2              26.9            1386.3

Model 2:

lm(body_mass_g ~ species + sex, penguin)
## 
## Call:
## lm(formula = body_mass_g ~ species + sex, data = penguin)
## 
## Coefficients:
##      (Intercept)  speciesChinstrap     speciesGentoo           sexmale  
##           3372.4              26.9            1377.9             667.6
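
For a model with a single categorical predictor (the situation in Model 1), the coefficients have a direct reading in terms of group means. The sketch below checks this on made-up data; `toy`, `group`, and `y` are hypothetical names and the numbers are invented. Once a second predictor is added, as in Model 2, the coefficients are differences adjusted for the other variable and need not match raw group means exactly.

```r
# Toy data, invented for illustration only (not the penguin data)
toy <- data.frame(
  group = factor(c("A", "A", "B", "B", "C", "C")),
  y     = c(10, 12, 15, 17, 30, 34)
)

fit <- lm(y ~ group, data = toy)
b <- coef(fit)

# Observed mean of y within each group
means <- tapply(toy$y, toy$group, mean)

# Intercept = mean of the reference group ("A");
# each other coefficient = that group's mean minus the reference mean
all.equal(unname(b["(Intercept)"]), unname(means["A"]))
all.equal(unname(b["groupB"]), unname(means["B"] - means["A"]))
all.equal(unname(b["groupC"]), unname(means["C"] - means["A"]))
```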

Question 1: What is the reference category in Model 1?

Question 2: From Model 1, which two species appear to be the closest in mass? How can you tell?

Question 3: Rewrite the equation for Model 1 so that Chinstrap penguins are the reference category.

Question 4: What are the reference categories in Model 2?

Question 5: Based on Model 2, what is the predicted difference in mass between a male chinstrap penguin and a female gentoo penguin?

Question 6: The \(R^2\) value for Model 1 is \(R^2 = 0.67\), while for Model 2 it is \(R^2 = 0.84\). What does this suggest about adding sex as a variable in Model 2?