library(ggplot2)
library(dplyr)
# install.packages("gridExtra")
library(gridExtra)

theme_set(theme_bw())

\(\chi^2\) Goodness of Fit

With a \(\chi^2\) goodness of fit test, we compare what we have observed with what we would expect under the null hypothesis. This is helpful for checking whether observed data follow a particular distribution. For example, suppose we have a coin and we wish to investigate whether or not it is fair, i.e., has equal probability of heads and tails.

To perform this experiment, we will flip the coin twice and record the number of heads. We will repeat this experiment 100 times.

set.seed(123)
coin <- rbinom(100, 2, p = 0.5)
table(coin)
## coin
##  0  1  2 
## 26 47 27

If the coin is fair, we should get 0 heads 1/4 of the time, 1 head 1/2 of the time, and 2 heads 1/4 of the time.
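
These are just the Binomial(2, 0.5) probabilities, which we can check directly with dbinom:

## Null probabilities for 0, 1, and 2 heads
dbinom(0:2, size = 2, prob = 0.5)
## [1] 0.25 0.50 0.25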

## Probabilities under the null
p <- c(0.25, 0.5, 0.25)
obs <- c(26, 47, 27)

chisq.test(obs, p = p)
## 
##  Chi-squared test for given probabilities
## 
## data:  obs
## X-squared = 0.38, df = 2, p-value = 0.83

This gives us a test statistic of \(\chi^2 = 0.38\) with \(df = 2\) and a \(p\)-value of \(0.83\). Since the \(p\)-value is large, we fail to reject the null hypothesis that the coin is fair.
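
The statistic compares the observed counts \(O_i\) to the expected counts \(E_i = n p_i\) under the null:

\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]

As a quick sketch, we can reproduce the 0.38 and 0.83 reported above by hand:

## Reproduce the test statistic by hand
expected <- 100 * p
sum((obs - expected)^2 / expected)                   # 0.38
pchisq(0.38, df = length(p) - 1, lower.tail = FALSE) # roughly 0.83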

Regression

There are primarily three questions we address with regression:

  1. Can I interpret the model and make predictions?
  2. Can I relate predictive ability to tests for individual covariates?
  3. Can I use residuals to find missing variables?

This will review (2) and (3).

Variance Explained

We will create two separate lines, both of the form

\[ y = 1 + 3X + \epsilon \]

In the first case, the standard deviation of the error term will be \(\sigma = 5\), and in the second it will be \(\sigma = 15\).

## Create 100 observations
N <- 100
x <- rnorm(N, sd = 5) # values here do not matter

## First line (sd = 5)
y1 <- 1 + 3*x + rnorm(N, sd = 5)

## Second line (sd = 15)
y2 <- 1 + 3*x + rnorm(N, sd = 15)

## Put them all in a dataframe
df <- data.frame(x, y1, y2)

## Fit two separate models
fit1 <- lm(y1 ~ x, df)
fit2 <- lm(y2 ~ x, df)

First, let’s compare how they look.

p1 <- ggplot(df, aes(x, y1)) + 
  geom_point(size = 2) + ylim(-40, 65) +
  geom_smooth(method = lm, se = FALSE) + 
  ggtitle("Fit 1")

p2 <- ggplot(df, aes(x, y2)) + 
  geom_point(size = 2) + ylim(-40, 65) +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle("Fit 2")

## This lets me do a side-by-side
gridExtra::grid.arrange(p1, p2, nrow = 1)

We see that the slope for both fits is about 3. We can then examine the output from the linear models.

summary(fit1)
## 
## Call:
## lm(formula = y1 ~ x, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.908 -3.630 -0.997  2.642 15.723 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   1.0531     0.4687    2.25               0.027 *  
## x             2.9158     0.0975   29.90 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.68 on 98 degrees of freedom
## Multiple R-squared:  0.901,  Adjusted R-squared:   0.9 
## F-statistic:  894 on 1 and 98 DF,  p-value: <0.0000000000000002
summary(fit2)
## 
## Call:
## lm(formula = y2 ~ x, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.38  -9.28  -0.88  10.91  32.26 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)    2.873      1.497    1.92              0.058 .  
## x              2.951      0.311    9.47 0.0000000000000017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.9 on 98 degrees of freedom
## Multiple R-squared:  0.478,  Adjusted R-squared:  0.473 
## F-statistic: 89.8 on 1 and 98 DF,  p-value: 0.00000000000000168

Notice that the estimates for the X coefficient are both nearly 3, and both are statistically significant. However, compare the values for \(R^2\). Why are these different? Which plot above looks like the line explains most of the variability in \(y\)?
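
Recall that \(R^2 = 1 - SSE/SST\) is the proportion of the variability in \(y\) explained by the model. Because the slope is 3 and \(x\) was generated with standard deviation 5, the signal variance is roughly \(3^2 \times 5^2 = 225\), so we should expect \(R^2 \approx 225/(225 + 25) = 0.9\) for the first fit and \(225/(225 + 225) = 0.5\) for the second, which matches the summaries above. As a quick sketch, we can also compute \(R^2\) directly from the residuals:

## R-squared by hand: 1 - SSE/SST
1 - sum(resid(fit1)^2) / sum((df$y1 - mean(df$y1))^2) # ~0.90
1 - sum(resid(fit2)^2) / sum((df$y2 - mean(df$y2))^2) # ~0.48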

Residuals

A plot of the residuals can help us find missing variables. Consider here a regression line of the form

\[ y = -2X_1 + 3X_2 + \epsilon \]

We will start by creating a model that only includes \(X_1\).

## Create fit and get residuals
N <- 100
x1 <- rnorm(N, sd = 2)
x2 <- rnorm(N)

y <- -2*x1 + 3*x2 + rnorm(N)

df <- data.frame(y, x1, x2)
fit <- lm(y ~ x1, df)

Look at these plots side-by-side. They are the same plot, but the second is colored according to the missing variable \(X_2\).

p1 <- ggplot(df, aes(x1, y)) + 
  geom_point(size = 2) + 
  geom_smooth(method = lm, se = FALSE)

p2 <- ggplot(df, aes(x1, y, color = x2)) + 
  geom_point(size = 2) + 
  geom_smooth(method = lm, se = FALSE) + 
  scale_color_continuous(type = "viridis", option = "H")

gridExtra::grid.arrange(p1, p2, nrow = 1)

From the color, I can see that all of the dots above the regression line are orange/red, corresponding to larger values of \(X_2\). This means that if I plot \(X_2\) against the residuals, I should see that larger \(X_2\) values are associated with larger (positive) residuals. Again, we do a side-by-side, with the second plot using the same coloring as above.

p1 <- ggplot(df, aes(x2, fit$residuals)) + 
  geom_point()

p2 <- ggplot(df, aes(x2, fit$residuals, color = x2)) + 
  geom_point() + 
  scale_color_continuous(type = "viridis", option = "H")

gridExtra::grid.arrange(p1, p2, nrow = 1)

In fact, if we regress the residuals on \(X_2\), we find nearly the same slope that we get when regressing \(y\) on \(X_2\) directly. In other words, we can see in the residual plot exactly how this missing predictor is “mopping up” the extra variability in \(y\).

## Regression against residuals
lm(fit$residuals ~ x2)
## 
## Call:
## lm(formula = fit$residuals ~ x2)
## 
## Coefficients:
## (Intercept)           x2  
##      -0.485        3.186
## Regression against y
lm(y ~ x2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Coefficients:
## (Intercept)           x2  
##        0.11         3.22
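
Finally, as a sanity check, we can fit the full model with both predictors; the coefficient on \(X_2\) should again be close to 3 and the coefficient on \(X_1\) close to -2, the values used to generate the data (exact numbers will vary with the simulation):

## Full model with both predictors
lm(y ~ x1 + x2, df)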