library(ggplot2)
library(dplyr)
This lab, and the next several labs, focus on statistical modeling generally and regression in particular.
We say that a dependent variable \(y\) and an independent variable \(X\) have a functional relationship if we can define their relation in the form
\[ y = f(X) \] That is, if we know our function \(f\) and the data \(X\), we will know exactly the value of \(y\). There are many examples of functional relationships between variables, with a standard example being the relationship between temperature measured in Fahrenheit and Celsius:
# Set seed to randomly sample values from the New Haven, CT temperature data (nhtemp)
set.seed(123)
faren <- sample(as.numeric(nhtemp), size = 15)
cels <- (faren - 32) * (5/9)
df <- data.frame(Fahrenheit = faren,
                 Celsius = cels)
ggplot(df, aes(Fahrenheit, Celsius)) + geom_point(size = 3) + geom_line()
We see that the entire relationship between these variables is contained within the line given by
\[
y = \frac{5}{9} (X - 32)
\] In contrast, variables \(y\) and \(X\) are said to have a statistical relationship when they cannot be defined perfectly by a function. Consider, for example, the women dataset that we had seen previously, detailing the relationship between height and weight:
ggplot(women, aes(height, weight)) + geom_point() +
geom_smooth(method = "lm", se = FALSE, color = 'black')
While height and weight clearly suggest some type of relationship, it is by no means a perfect one.
Our goal in this lab will be to try to determine good linear approximations of a functional relationship when a statistical relationship exists. Specifically, we will be considering approximations that are simple lines with functions of the form
\[ y = \alpha + X \beta + \epsilon \] where \(\alpha\) is the intercept, \(\beta\) is the slope coefficient on \(X\), and \(\epsilon\) is an error term.
Typically in regression, we assume that the error term \(\epsilon\) is a random variable, normally distributed with a mean value of \(0\) and a variance of \(\sigma^2\)
\[ \epsilon \sim N(0, \sigma^2) \] This will be important to keep in mind as \(\sigma^2\) is an estimate of the variability that exists in our model.
This lab is focused on linear regression with a single variable and a numeric outcome. As such, it is a good candidate for illustrating at a high level much of the guiding theory for regression models. In the next lab, we will introduce regression for multivariate models where all of the information presented here will remain relevant. Finally, we will end our discussion of regression with an introduction of logistic regression, which extends our methods to categorical outcomes.
By the end of this lab, we should have a good idea of how to fit these regression lines to our data, how to interpret the results of our models, and how to verify our assumptions to be sure that a linear model was appropriate for the data collected.
The method of regression, or the fitting of a line, primarily serves two purposes: inference about the relationship between the independent and dependent variables, and prediction of new values of the dependent variable based on the independent variable.
The end result of a regression analysis is the estimation of the covariates, \(\alpha\) and \(\beta\), included in the linear equation or linear model given above. Based on the value of these estimates, as well as the estimates of their errors, we can make informed decisions as to the probable relationship of our independent and dependent variables and construct ranges of reasonable predictions for one value based on another.
It’s not enough to say that we wish to fit a line to our data; rather, we should seek to fit the best line. To do that, we will begin by introducing a little bit of notation.
We can begin by enumerating our \(n\) observations, labeled \(i = 1, \dots, n\), as pairs of data \((x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\). For each data point \((x_i, y_i)\), let \(\hat{y}_i\) denote the corresponding point of our (currently hypothetical) best line. This will be our fitted value. And finally, we have the error term associated with each observation:
\[ e_i = y_i - \hat{y}_i \] which represents the vertical distance between our observed point and the corresponding point on the line. This term is also known as the residual.
Without getting into the technical details, it is sufficient to say that the line we will consider to be the line of best fit will be that which minimizes the squared error of the residuals. Most commonly, we use the metric of mean squared error, or MSE, which is defined as
\[ \text{MSE} = \frac{1}{n-p} \sum_{i=1}^n (\hat{y}_i - y_i)^2 = \frac{1}{n-p} \sum_{i=1}^n e_i^2 \] where \(p\) is the number of covariates we are estimating (2 in the case of slope and intercept).
A similar metric is the residual standard error, or root mean squared error (RMSE), which is simply defined as \(\sqrt{\text{MSE}}\). It's further worth noting that the MSE is an estimate of \(\sigma^2\) and the RMSE is an estimate of \(\sigma\).
Computing a linear model in R is simple enough. We will provide a formula of the form y ~ x, as well as a dataset containing the variables. For example, the women dataset contains the variables height and weight; letting height be the independent variable, we use the function lm() (linear model) to construct estimates of our covariates.
fit <- lm(weight ~ height, data = women)
This, in effect, creates a linear model of the form
\[
\text{weight} = \alpha + \text{height} \times \beta.
\] We can find estimated values for our coefficients by simply printing the fit object:
fit
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Coefficients:
## (Intercept) height
## -87.52 3.45
We will discuss interpretation more in the next section, but note that this is telling us that our fitted line has an intercept of -87.52, and that for every one-inch increase in height, the average weight increases by 3.45 pounds.
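To make this interpretation concrete, we can plug two heights one inch apart into the fitted equation; the predicted weights differ by exactly the slope. This is simple arithmetic with the printed coefficients, not part of the original lab code.
# Predicted weight at 60 inches and at 61 inches
-87.52 + 3.45 * 60 # about 119.5 pounds
-87.52 + 3.45 * 61 # about 122.9 pounds; the difference is the slope, 3.45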
Using these coefficients to create a line with geom_abline(), we see that we get something identical to geom_smooth(method = "lm"):
ggplot(women, aes(height, weight)) + geom_point() +
geom_abline(intercept = -87.52, slope = 3.45, color = 'red', size = 2) +
geom_smooth(method = 'lm', color = 'blue', size = 0.5, se = FALSE)
We can see that the class of the fit object (class(fit)) is "lm", and we briefly introduce a number of helpful functions for working with "lm" objects.
lm() functions
To extract only the coefficients from the fit object, we can use coef():
coef(fit)
## (Intercept) height
## -87.517 3.450
which could alternatively be used for our ggplot:
# Not run
cc <- coef(fit)
ggplot(women, aes(height, weight)) + geom_point() +
geom_abline(intercept = cc[1], slope = cc[2], color = 'red', size = 2) +
geom_smooth(method = 'lm', color = 'blue', size = 0.5, se = FALSE)
To get more detailed information, we can use the summary() function:
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.733 -1.133 -0.383 0.742 3.117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.5167 5.9369 -14.7 0.000000001711082 ***
## height 3.4500 0.0911 37.9 0.000000000000011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.99
## F-statistic: 1.43e+03 on 1 and 13 DF, p-value: 0.0000000000000109
We will investigate the output of summary more in the next section.
We know that the \(y\) values associated with our data were the weights from the women dataset. We can get the fitted values \(\hat{y}_i\) with fitted() and the residual values with residuals():
(y <- women$weight)
## [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
(yhat <- fitted(fit))
## 1 2 3 4 5 6 7 8 9 10 11
## 112.58 116.03 119.48 122.93 126.38 129.83 133.28 136.73 140.18 143.63 147.08
## 12 13 14 15
## 150.53 153.98 157.43 160.88
(res <- residuals(fit))
## 1 2 3 4 5 6 7 8
## 2.416667 0.966667 0.516667 0.066667 -0.383333 -0.833333 -1.283333 -1.733333
## 9 10 11 12 13 14 15
## -1.183333 -1.633333 -1.083333 -0.533333 0.016667 1.566667 3.116667
## Compare them side by side in data.frame
data.frame(diff = y - yhat,
res = res)
## diff res
## 1 2.416667 2.416667
## 2 0.966667 0.966667
## 3 0.516667 0.516667
## 4 0.066667 0.066667
## 5 -0.383333 -0.383333
## 6 -0.833333 -0.833333
## 7 -1.283333 -1.283333
## 8 -1.733333 -1.733333
## 9 -1.183333 -1.183333
## 10 -1.633333 -1.633333
## 11 -1.083333 -1.083333
## 12 -0.533333 -0.533333
## 13 0.016667 0.016667
## 14 1.566667 1.566667
## 15 3.116667 3.116667
We can use these values directly to estimate the mean squared error, as well as the root mean squared error:
# divide by n - p
(mse <- (1 / (nrow(women) - 2) * sum(res^2)))
## [1] 2.3256
sqrt(mse) # rmse
## [1] 1.525
# sigma() also returns rmse
sigma(fit)
## [1] 1.525
Question 1: Using the faithful dataset built in to R (data("faithful")), construct a linear model with eruptions as the dependent variable and waiting time as the independent variable. Create a scatter plot demonstrating the relationship of these two variables and then use the coefficients from your fitted model, along with geom_abline(), to construct a line of best fit.
Inference with linear models is closely associated with hypothesis testing. For our variables \(y\) and \(X\), and the equation
\[ y = \alpha + X \beta \] we are asking if there is any evidence in the data in support of a linear relationship. We can express this as a simple hypothesis test, where the null hypothesis, \(H_0\), is that there is no relationship between the variables. Another way to say this is that any change in \(X\) results in no change in \(y\) (because they are not related); this is equivalent to our slope being zero.
\[ H_0: \beta = 0 \qquad H_A: \beta \not= 0 \]
Before getting into the inference, let’s first consider the meaning of the model coefficients themselves.
Let’s begin by trying to build some intuition about the situation.
First, suppose that you were given only the data on women's weight from the women dataset and were asked to give your best estimate of the value of their weight. That is, if you were given a random woman and were asked to guess her weight, what value would you give? Here, the average weight in the entire dataset seems like a reasonable estimate, as you have nothing else to go on. However, if you were then given the height of each woman in the dataset, and were then asked to estimate the weight of a particular woman of a given height, do you suspect that the average weight in the dataset would still be the best answer you could give? Hold this thought as we move onto the next example.
Now suppose you were given a collection of values \(y\) with no additional information or variables and were asked to give your best estimate of the value of \(y\). Again, the mean value, \(\overline{y}\) seems like a reasonable estimate. If you were then given an additional variable \(X\), which you knew beforehand has absolutely nothing to do with \(y\), should our estimate of \(y\) change?
set.seed(123)
x <- runif(200, -10, 10)
y <- rnorm(200)
df <- data.frame(x, y)
ggplot(df, aes(x,y)) + geom_point() +
geom_abline(aes(intercept = mean(y), slope = 0, color = "Mean")) +
geom_smooth(method = "lm", se = FALSE, mapping = aes(color = "LM"))
As we know from algebra, the equation
\[ y = \alpha + X \beta \]
tells us that a unit change in \(X\) results in a \(\beta\) change in \(y\). When the variable \(X\) has no linear relationship with our outcome \(y\), we expect that \(\beta = 0\) and that \(\alpha = \overline{y}\).
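As a quick check of our own (the object name fit0 is just for illustration), fitting a model to the unrelated data above should give a slope near zero and an intercept near the sample mean of y:
fit0 <- lm(y ~ x, df)
coef(fit0) # the slope should be close to 0
mean(df$y) # and the intercept should be close to this value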
As we saw in the previous section, when we have \(\beta = 0\), the value of \(\alpha\) is equal to the mean of \(y\). When this is not the case, we have
\[ \hat{\alpha} = \overline{y} - \overline{X} \hat{\beta} \] In terms of inference, we find ourselves in one of two scenarios: either \(\beta = 0\), in which case the estimated intercept is simply the mean of \(y\), or \(\beta \not= 0\), in which case the intercept is the predicted value of \(y\) when \(X = 0\), a quantity that may or may not be meaningful on its own.
A nice illustration of this comes from a reconsideration of the coefficients found in the women data. There, we had an estimated intercept of \(\hat{\alpha} = -87.5167\), indicating that a woman who was 0 inches tall would weigh about -87 pounds.
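We can also verify the identity \(\hat{\alpha} = \overline{y} - \overline{X} \hat{\beta}\) directly with the women data (a quick sketch, not part of the original code):
fit <- lm(weight ~ height, data = women)
cc <- coef(fit)
# mean(y) minus mean(x) times the slope reproduces the intercept of about -87.52
mean(women$weight) - mean(women$height) * cc["height"]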
When discussing the significance of a covariate, we are specifically asking whether or not there is evidence of a linear relationship between the dependent and independent variable. This is done using a \(t\)-test, similar to what you may have seen in other statistics courses. For example, the estimate of the covariate \(\hat{\beta}\) follows a \(t\)-distribution, with
\[ \frac{(\hat{\beta} - \beta)}{s_{\hat{\beta}}} \sim t_{n-2} \] where \(\beta\) is taken to be the true value. As our null hypothesis is typically \(\beta = 0\), this reduces to
\[ \frac{\hat{\beta}}{s_{\hat{\beta}}} \sim t_{n-2} \] This gives us a useful way to think about our statistic: if the variability associated with our estimate (\(s_{\hat{\beta}}\)) is large relative to \(\hat{\beta}\), then this value will be closer to 0, resulting in a large p-value. Conversely, if \(\hat{\beta}\) is large relative to its standard error, we should expect the p-value to be much smaller.
We can get the estimate of \(\beta\), its standard error, and its p-value with the summary() function:
fit <- lm(weight ~ height, women)
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.733 -1.133 -0.383 0.742 3.117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.5167 5.9369 -14.7 0.000000001711082 ***
## height 3.4500 0.0911 37.9 0.000000000000011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.53 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.99
## F-statistic: 1.43e+03 on 1 and 13 DF, p-value: 0.0000000000000109
From this, we see we have an estimate of \(\hat{\beta} = 3.45\), \(s_{\hat{\beta}} = 0.091\), and \(p = 0.000000000000011\). In this case, we would say that our estimate of \(\beta\) is statistically significant, with evidence to reject our null hypothesis that \(\beta = 0\).
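As a check on where these numbers come from, we can reproduce the t value and p-value for height by hand from the estimate and its standard error, using a t distribution with \(n - 2 = 13\) degrees of freedom (a small sketch using the values printed above):
# t statistic: the estimate divided by its standard error
(tval <- 3.4500 / 0.0911) # roughly 37.9, matching the t value above
# two-sided p-value from the t distribution with 13 degrees of freedom
2 * pt(-abs(tval), df = 13) # extremely small, in line with the p-value above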
Our inference is not limited to only the individual parameters, however; it is also important that we assess how well the model fits the data overall. A simple metric for doing this is with the \(R^2\) value, which we introduce now.
First, some terms. The total sum of squares (SST) is defined as the total amount of variation that exists in \(y\), relative to its mean value. It is given as
\[ SST = \sum_{i=1}^{n} (y_i - \overline{y})^2 \] In effect, this is considering the case in which \(\beta = 0\) and \(\alpha = \overline{y}\): if we had no values for \(X\) and were simply using the mean value, how much error would our model contain?
The second value, the residual sum of squares (SSR), is asking for the sum of squared errors, similar to the MSE but without dividing by \(n - p\):
\[ SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] This is demonstrating how much error exists in our model after considering values for \(X\). A natural thing to consider, then, would be the ratio of these two values,
\[ \frac{SSR}{SST} \]
If the fitted values \(\hat{y}_i\) are nearly the same as \(\overline{y}\) (meaning that adding the variable \(X\) did little to improve the fit), then this ratio will be close to 1.
\(R^2\) is computed as:
\[ R^2 = 1 - \frac{SSR}{SST} \] meaning that when this ratio is close to 1, the value of \(R^2\) will be near zero: this indicates a poor fit. Conversely, values of \(R^2\) closer to 1 indicate a better fit. Below, we see examples of two models, one with an \(R^2\) closer to zero, the second closer to one.
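Before turning to those examples, we can connect these formulas back to the women fit from earlier; computing SST, SSR, and \(R^2\) by hand should reproduce the Multiple R-squared of about 0.991 reported by summary() (a quick sketch, not part of the original lab code):
fit <- lm(weight ~ height, data = women)
sst <- sum((women$weight - mean(women$weight))^2) # total variation in weight around its mean
ssr <- sum(residuals(fit)^2) # variation remaining after the fit
1 - ssr / sst # R^2, roughly the 0.991 reported by summary()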
It’s important to note here that assessing the quality of a model is distinct from assessing the linear relationship between our variables. When assessing the linear relationship, we are asking, “is there evidence that there is a linear relationship between the variable \(X\) and \(y\)”, whereas assessing model fit is asking, “does our covariate \(X\) adequately explain the variation that exists in \(y\).”
To illustrate this further, we construct two datasets, both of the form
\[ y = 1 + 3X + \epsilon \] where in the first model, the amount of variability is large, with
\[ \epsilon_1 \sim N(0, 5^2) \] whereas the second is much smaller,
\[ \epsilon_2 \sim N(0, 0.5^2) \]
yet in both cases, the relationship between \(X\) and \(y\) is the same.
N <- 400
set.seed(123)
x <- rnorm(N)
y <- 1 + 3*x + rnorm(N, sd = 5)
df1 <- data.frame(x, y)
x <- rnorm(N)
y <- 1 + 3*x + rnorm(N, sd = 0.5)
df2 <- data.frame(x, y)
## Create plots of each
p1 <- ggplot(df1, aes(x, y)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("High variability") + ylim(-5, 5) + xlim(-1, 1)
p2 <- ggplot(df2, aes(x, y)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Low variability")+ ylim(-5, 5)+ xlim(-1, 1)
gridExtra::grid.arrange(p1, p2, nrow = 1)
Even looking at the summary output, we see that the estimates of \(\hat{\beta}\) are similar, but notice near the bottom the estimate of \(R^2\), the first with \(R^2 = 0.273\), the second at \(R^2 = 0.978\).
lm(y ~ x, df1) %>% summary()
##
## Call:
## lm(formula = y ~ x, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.101 -3.185 0.173 3.344 13.422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.022 0.249 4.11 0.000048 ***
## x 3.141 0.257 12.22 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.98 on 398 degrees of freedom
## Multiple R-squared: 0.273, Adjusted R-squared: 0.271
## F-statistic: 149 on 1 and 398 DF, p-value: <0.0000000000000002
lm(y ~ x, df2) %>% summary()
##
## Call:
## lm(formula = y ~ x, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.417 -0.319 0.020 0.330 1.665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0052 0.0236 42.5 <0.0000000000000002 ***
## x 3.0681 0.0230 133.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.472 on 398 degrees of freedom
## Multiple R-squared: 0.978, Adjusted R-squared: 0.978
## F-statistic: 1.77e+04 on 1 and 398 DF, p-value: <0.0000000000000002
Interestingly, this can occur in the other direction as well, where a model fit improves by including a covariate, even though the covariate doesn’t show (or in this case, doesn’t have) any linear relationship with the outcome.
set.seed(123)
x <- rnorm(100, sd = 2)
y <- rep(1, 100)
lm(y ~ x) %>% summary()
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median
## -0.000000000000018588 0.000000000000000095 0.000000000000000189
## 3Q Max
## 0.000000000000000273 0.000000000000000546
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.9999999999999978906 0.0000000000000001901 5259261438382945.00
## x 0.0000000000000000745 0.0000000000000001042 0.71
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## x 0.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00000000000000189 on 98 degrees of freedom
## Multiple R-squared: 0.502, Adjusted R-squared: 0.497
## F-statistic: 98.6 on 1 and 98 DF, p-value: <0.0000000000000002
In summary, the progression of this section matches neatly with the goals of inference in regression: we wish to construct models that explain the variability of our dependent variable in terms of the independent variables we have collected. By reviewing the value of our estimated \(\beta\), we can determine how likely it is that our variables have a linear relationship, and through examining our \(R^2\), we can determine how much of the variation in \(y\) is explained through its relationship with \(X\). We will investigate this more in the next lab where we introduce additional covariates into our model.
Question 2: Although this will be discussed in the next section, use only your intuition for this question. Suppose that you have fit a model with strong evidence for a linear relationship between \(X\) and \(y\), but with an \(R^2 < 0.2\). Would you feel comfortable using this model to predict values of \(y\) based on new observations in \(X\)? Why or why not?
In addition to performing inference, regression models can also be used for prediction using the predict() function. This function takes the fitted model object, a data.frame with the original covariate names, and a number of additional arguments. See ?predict.lm for more.
The data.frame that we pass into predict() will include the independent variable values for which we wish to predict the dependent variable. You may remember using this in the plotly lab, where we used grid() to create an array of points in two-dimensional space. In this example, we use the seq() function to create a vector of 20 evenly spaced values from -2 to 2.
set.seed(123)
x <- rnorm(100)
y <- 2 - 3*x + rnorm(100)
df <- data.frame(x, y)
fit <- lm(y ~ x, df)
# However, we do need a data.frame for predict
newdf <- data.frame(x = seq(-2, 2, length.out = 20))
# This returns new values for each x
(yhat <- predict(fit, newdata = newdf))
## 1 2 3 4 5 6 7 8
## 8.00214 7.35951 6.71689 6.07426 5.43164 4.78901 4.14639 3.50376
## 9 10 11 12 13 14 15 16
## 2.86114 2.21851 1.57588 0.93326 0.29063 -0.35199 -0.99462 -1.63724
## 17 18 19 20
## -2.27987 -2.92250 -3.56512 -4.20775
## Let's add the prediction points to new data frame
newdf$y <- yhat
## The original data in black, predicted data in red
ggplot(df, aes(x, y)) + geom_point() +
geom_point(data = newdf, mapping = aes(x, y), color = 'red', size = 2)
The predict() function also takes an argument for the construction of confidence intervals.
(yhat <- predict(fit, newdata = newdf, interval = "confidence"))
## fit lwr upr
## 1 8.00214 7.51873 8.48555
## 2 7.35951 6.91670 7.80233
## 3 6.71689 6.31381 7.11997
## 4 6.07426 5.70978 6.43874
## 5 5.43164 5.10421 5.75906
## 6 4.78901 4.49651 5.08151
## 7 4.14639 3.88582 4.40695
## 8 3.50376 3.27091 3.73661
## 9 2.86114 2.65011 3.07216
## 10 2.21851 2.02145 2.41556
## 11 1.57588 1.38322 1.76855
## 12 0.93326 0.73478 1.13174
## 13 0.29063 0.07696 0.50431
## 14 -0.35199 -0.58844 -0.11555
## 15 -0.99462 -1.25947 -0.72977
## 16 -1.63724 -1.93452 -1.33997
## 17 -2.27987 -2.61241 -1.94733
## 18 -2.92250 -3.29234 -2.55265
## 19 -3.56512 -3.97374 -3.15650
## 20 -4.20775 -4.65624 -3.75925
## Let's turn this into a data.frame that we can plot
# First, let's add x to this new data.frame
yhat <- as.data.frame(yhat)
yhat$x <- newdf$x
# Then pivot so that the lower and upper bounds are stacked in a single column
library(tidyr)
yhat <- pivot_longer(yhat, cols = c("lwr", "upr"),
names_to = "se", values_to = "values")
p1 <- ggplot(df, aes(x, y)) + geom_point() +
geom_point(data = yhat, aes(x, fit), color = 'red', size = 2) +
geom_line(data = yhat, aes(x, values, group = se), color = 'red',
linetype = "dashed")
# Compare this with geom_smooth
p2 <- ggplot(df, aes(x, y)) + geom_point() +
geom_smooth(method = "lm")
gridExtra::grid.arrange(p1, p2, nrow = 1)
Question 3: Repeat the experiment we ran above, this time drawing 20 random variables instead of 100. Then, recreate the plot with the confidence intervals. What has changed? Why do you think this is?
Question 4: Modify the code from the previous question to draw 100 random variables, but now increase the standard deviation of the error term associated with y to be 5. How does this change the intervals? Thinking of your solution to this and the previous question, what appear to be the two main sources of variability in our model estimates?
The validity of a linear model rests on satisfying the assumptions that underlie the model. These assumptions include: a linear relationship between \(X\) and \(y\), constant variance of the error term (homoscedasticity), and normally distributed errors. We will also consider the impact of outliers.
While there are a number of tests that can be used to verify these assumptions, the simplest way is to visually investigate the data itself. Here, we will briefly cover each.
The question of linearity may seem obvious, but it is important to still investigate visually. An excellent way to explore this is with residual analysis, that is, creating plots and investigating the residuals. Our assumption is that the error term \(\epsilon\) is normally distributed with a mean value of zero. When a linear relationship exists, plotting the residuals against our \(X\) values should show the errors normally distributed around 0.
set.seed(123)
x <- rnorm(100)
y <- -2 + 3*x + rnorm(100)
df <- data.frame(x, y)
fit <- lm(y ~ x, df)
# Get the residual values
df$res <- residuals(fit)
p1 <- ggplot(df, aes(x, y)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Regression Line")
p2 <- ggplot(df, aes(x, res)) + geom_point() +
geom_abline(slope = 0, intercept = 0, color = 'red', linetype = "dashed") +
ggtitle("Residual Plot")
gridExtra::grid.arrange(p1, p2, nrow = 1)
It's possible for our model to still show a strong linear relationship when in fact the relationship isn't linear at all. This stems primarily from the fact that the strength of the linear relationship lies in the correlation between the \(X\) and \(y\) values:
set.seed(123)
x <- runif(100, 0, 10)
y <- x^(4/2) + rnorm(100)
df <- data.frame(x, y)
fit <- lm(y ~ x, df)
# Strong evidence of linear relationship
summary(fit)
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.47 -6.46 -1.24 6.03 16.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.307 1.481 -11.7 <0.0000000000000002 ***
## x 10.059 0.258 39.0 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.32 on 98 degrees of freedom
## Multiple R-squared: 0.939, Adjusted R-squared: 0.939
## F-statistic: 1.52e+03 on 1 and 98 DF, p-value: <0.0000000000000002
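Relatedly, the strength reported here really is just correlation: for a simple linear regression with one predictor, the Multiple R-squared equals the squared correlation between x and y (a quick check, not in the original code).
# R^2 in simple linear regression is the squared correlation
cor(df$x, df$y)^2 # should match the Multiple R-squared of about 0.939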
df$res <- residuals(fit)
# This is not linear at all
p1 <- ggplot(df, aes(x, y)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Regression Line")
p2 <- ggplot(df, aes(x, res)) + geom_point() +
geom_abline(slope = 0, intercept = 0, color = 'red', linetype = "dashed") +
ggtitle("Residual Plot")
gridExtra::grid.arrange(p1, p2, nrow = 1)
Here, the values of the residuals clearly change with \(X\), suggesting that our assumptions of linearity have failed.
We could also investigate this using QQ plots, which compare the quantiles of the residuals against their theoretical values under a normal distribution. If the residuals are normally distributed, we should expect to see roughly a straight line:
x <- rnorm(100)
y <- -2 + 3*x + rnorm(100)
df <- data.frame(x, y)
fit <- lm(y ~ x, df)
# This is mostly a line
qqnorm(residuals(fit))
# ----------------
x <- runif(100, 0, 10)
y <- x^(4/2) + rnorm(100)
df <- data.frame(x, y)
fit <- lm(y ~ x, df)
# This is mostly not a line
qqnorm(residuals(fit))
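One optional aid, not used in the original code, is qqline(), which adds a reference line through the first and third quartiles so that departures from a straight line are easier to judge by eye:
# Redo the first (linear) example and add a reference line to its QQ plot
x <- rnorm(100)
y <- -2 + 3*x + rnorm(100)
fit <- lm(y ~ x, data.frame(x, y))
qqnorm(residuals(fit))
qqline(residuals(fit)) # reference line through the 1st and 3rd quartiles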
We should also keep in mind the converse of this statement: just because there is no evidence of a linear relationship does not at all mean that \(X\) and \(y\) are not related. Here, for example, there is very clearly a relationship between \(X\) and \(y\); it just happens not to be linear:
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100, sd = 0.2)
df <- data.frame(x, y)
# This is not linear at all
ggplot(df, aes(x, y)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
As is typically the case, there is no one-size-fits-all solution to these types of problems. Sometimes they indicate that variables are missing from our models, and sometimes they suggest that transformations of our data may be appropriate:
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100, sd = 0.2)
df <- data.frame(x, y)
# Add an x^2 variable
df$x2 <- df$x^2
# This is much better on the x^2 scale
ggplot(df, aes(x2, y)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
Heteroscedasticity is a fancy term for the phenomenon in which the amount of variability observed in \(y\) depends on the value of \(X\); this is in contrast to our assumption that the variability of \(y\) is the same for all observations. In this case, we do indeed have a linear relationship, but the variability increases along with \(X\):
x <- runif(100, 0, 20)
y <- 1.5*x + rnorm(100, sd = 1.5*x) # sd is dependent on x
df <- data.frame(x, y)
fit <- lm(y ~ x, df)
df$res <- residuals(fit)
p1 <- ggplot(df, aes(x, y)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Regression Line")
p2 <- ggplot(df, aes(x, res)) + geom_point() +
geom_abline(slope = 0, intercept = 0, color = 'red', linetype = "dashed") +
ggtitle("Residual Plot")
gridExtra::grid.arrange(p1, p2, nrow = 1)
The main consequence that heteroscedasticity has on our model is an increase in the variance of our parameter estimates, meaning we need even more evidence to properly reject the null hypothesis.
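To see this effect, here is a small simulation sketch of our own (not part of the original lab): we repeatedly generate data with either a constant error standard deviation or one that grows with x, and then compare how much the estimated slopes vary across simulations.
set.seed(123)
# Slope estimates when the error sd is constant (15, the average of 1.5 * x)
slope_const <- replicate(500, {
  x <- runif(100, 0, 20)
  y <- 1.5 * x + rnorm(100, sd = 15)
  coef(lm(y ~ x))[2]
})
# Slope estimates when the error sd grows with x, as in the example above
slope_het <- replicate(500, {
  x <- runif(100, 0, 20)
  y <- 1.5 * x + rnorm(100, sd = 1.5 * x)
  coef(lm(y ~ x))[2]
})
sd(slope_const) # spread of the slope estimates under constant variance
sd(slope_het) # noticeably larger spread under heteroscedasticity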
Finally, we address the case of outliers. The primary issue with outliers is that they have the potential to drastically shift our model estimates:
set.seed(14)
x <- runif(50, 0, 4)
y <- 2 - 3*x + rnorm(50, sd = 0.5)
#y[50] <- 5 * y[50]
y <- c(y, -40)
x <- c(x, 4)
df <- data.frame(x, y)
df2 <- df[1:50, ]
ggplot(df, aes(x, y)) + geom_point() +
geom_smooth(data = df2, method = "lm", aes(color = "Without Outlier"),
se = FALSE) +
geom_smooth(data = df, method = "lm", aes(color = "With Outlier"),
se = FALSE)
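We can also quantify the shift by comparing the fitted coefficients with and without the outlying point, using the two data frames defined above (a quick check of our own):
# Coefficients with the outlier included versus excluded
coef(lm(y ~ x, data = df)) # pulled away from the true values by the outlier
coef(lm(y ~ x, data = df2)) # close to the true intercept of 2 and slope of -3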
As before, there is no general rule for addressing outliers. More often than not, we should investigate the presence of outliers to determine if there might be an error in our data, or whether an observation has been included that is not a part of our intended sample. At any rate, discarding an outlier without further investigation is often frowned upon.