library(ggplot2)
library(dplyr)
# install.packages("gridExtra")
library(gridExtra)
theme_set(theme_bw())
With a \(\chi^2\) goodness of fit test, we compare what we have observed with what we would expect to see under the null hypothesis. This is helpful for checking whether observed data follow a particular distribution.
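Concretely, for observed counts \(O_i\) and expected counts \(E_i\) in each of \(k\) categories, the test statistic is
\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \]
which is compared against a \(\chi^2\) distribution with \(k - 1\) degrees of freedom.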
For example, suppose we have a coin and we wish to investigate whether or not it is fair, i.e., whether heads and tails are equally likely. To perform this experiment, we will flip the coin twice and record the number of heads. We will repeat this experiment 100 times.
set.seed(123)
coin <- rbinom(100, 2, p = 0.5)
table(coin)
## coin
## 0 1 2
## 26 47 27
If the coin is fair, we should get 0 heads 1/4 of the time, 1 heads 1/2 of the time, and 2 heads 1/4 of the time.
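These probabilities come from the binomial distribution with 2 trials and success probability 0.5, which we can confirm with dbinom():
## Probability of 0, 1, or 2 heads in two flips of a fair coin
dbinom(0:2, size = 2, prob = 0.5)
## [1] 0.25 0.50 0.25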
## Probabilities under the null
p <- c(0.25, 0.5, 0.25)
obs <- c(26, 47, 27)
chisq.test(obs, p = p)
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 0.38, df = 2, p-value = 0.83
This gives us a test statistic of \(\chi^2 = 0.38\) with \(df = 2\) and a \(p\)-value of \(0.83\), so we have no evidence that the coin is unfair.
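As a quick check, we can compute this statistic by hand using the obs and p defined above (the expected counts are 100 times the null probabilities):
## Expected counts under the null hypothesis
expected <- 100 * p
## Chi-squared statistic: matches the 0.38 reported above
sum((obs - expected)^2 / expected)
## p-value from a chi-squared distribution with 2 df, about 0.83
pchisq(sum((obs - expected)^2 / expected), df = 2, lower.tail = FALSE)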
There are primarily three issues to watch for with regression; this section reviews two of them: the amount of variability in \(y\) that the model explains (\(R^2\)), and using residual plots to detect a missing variable.
We will create two separate lines, both of the form
\[ y = 1 + 3X + \epsilon \] In the first case, the error standard deviation will be \(\sigma = 5\), and in the second it will be \(\sigma = 15\).
## Create 100 observations
N <- 100
x <- rnorm(N, sd = 5) # values here do not matter
## First line (sd = 5)
y1 <- 1 + 3*x + rnorm(N, sd = 5)
## Second line (sd = 15)
y2 <- 1 + 3*x + rnorm(N, sd = 15)
## Put them all in a dataframe
df <- data.frame(x, y1, y2)
## Fit two separate models
fit1 <- lm(y1 ~ x, df)
fit2 <- lm(y2 ~ x, df)
First, let’s compare how they look:
p1 <- ggplot(df, aes(x, y1)) +
geom_point(size = 2) + ylim(-40, 65) +
geom_smooth(method = lm, se = FALSE) +
ggtitle("Fit 1")
p2 <- ggplot(df, aes(x, y2)) +
geom_point(size = 2) + ylim(-40, 65) +
geom_smooth(method = lm, se = FALSE) +
ggtitle("Fit 2")
## This lets me do a side-by-side
gridExtra::grid.arrange(p1, p2, nrow = 1)
We see that the slope for both is about 3. We can then examine the output from the two linear models:
summary(fit1)
##
## Call:
## lm(formula = y1 ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.908 -3.630 -0.997 2.642 15.723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0531 0.4687 2.25 0.027 *
## x 2.9158 0.0975 29.90 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.68 on 98 degrees of freedom
## Multiple R-squared: 0.901, Adjusted R-squared: 0.9
## F-statistic: 894 on 1 and 98 DF, p-value: <0.0000000000000002
summary(fit2)
##
## Call:
## lm(formula = y2 ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.38 -9.28 -0.88 10.91 32.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.873 1.497 1.92 0.058 .
## x 2.951 0.311 9.47 0.0000000000000017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.9 on 98 degrees of freedom
## Multiple R-squared: 0.478, Adjusted R-squared: 0.473
## F-statistic: 89.8 on 1 and 98 DF, p-value: 0.00000000000000168
Notice that the estimates of the \(X\) coefficient are both nearly 3, and both are statistically significant. However, compare the values of \(R^2\). Why are these different? In which plot above does the line appear to explain most of the variability in \(y\)?
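A back-of-the-envelope calculation with the simulation settings above makes the difference concrete: the variance of the signal \(3X\) is \(3^2 \times 5^2 = 225\) (since \(X\) was drawn with sd = 5), so the population \(R^2\) is \(225 / (225 + \sigma^2)\).
## Approximate R^2 = signal variance / (signal variance + error variance)
signal_var <- 3^2 * 5^2 # var(3x) when x is drawn with sd = 5
signal_var / (signal_var + 5^2) # sigma = 5: 0.9, close to fit1's 0.901
signal_var / (signal_var + 15^2) # sigma = 15: 0.5, close to fit2's 0.478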
A plot of the residuals can help us find missing variables. Consider here a regression line of the form
\[ y = -2X_1 + 3X_2 + \epsilon \]
We will start by fitting a model that only includes \(X_1\).
## Create fit and get residuals
N <- 100
x1 <- rnorm(N, sd = 2)
x2 <- rnorm(N)
y <- -2*x1 + 3*x2 + rnorm(N)
df <- data.frame(y, x1, x2)
fit <- lm(y ~ x1, df)
Look at these plots side-by-side. They show the same data, but the second is colored according to the missing variable \(X_2\).
p1 <- ggplot(df, aes(x1, y)) +
geom_point(size = 2) +
geom_smooth(method = lm, se = FALSE)
p2 <- ggplot(df, aes(x1, y, color = x2)) +
geom_point(size = 2) +
geom_smooth(method = lm, se = FALSE) +
scale_color_continuous(type = "viridis", option = "H")
gridExtra::grid.arrange(p1, p2, nrow = 1)
From the color, I can see that all of the dots above the regression line are orange/red, corresponding to larger values of \(X_2\). This means that if I then plot \(X_2\) against the residuals, I should see that larger \(X_2\) values are associated with larger (positive) residuals. Again, we do a side-by-side, with the second panel colored by \(X_2\) as in the plot above.
p1 <- ggplot(df, aes(x2, fit$residuals)) +
geom_point()
p2 <- ggplot(df, aes(x2, fit$residuals, color = x2)) +
geom_point() +
scale_color_continuous(type = "viridis", option = "H")
gridExtra::grid.arrange(p1, p2, nrow = 1)
In fact, if we regress the residuals on \(X_2\), we find that the slope is nearly the same as the slope from regressing \(y\) on \(X_2\) directly. In other words, we can see in the residual plot exactly how this missing predictor is “mopping up” the extra variability in \(y\).
## Regression against residuals
lm(fit$residuals ~ x2)
##
## Call:
## lm(formula = fit$residuals ~ x2)
##
## Coefficients:
## (Intercept) x2
## -0.485 3.186
## Regression against y
lm(y ~ x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Coefficients:
## (Intercept) x2
## 0.11 3.22
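Relatedly, if we fit the full model with both predictors, the \(X_2\) coefficient should again come out near 3 (and the \(X_1\) coefficient near \(-2\)); a quick check with the df created above:
## Full model with both predictors; estimates should be near -2 and 3
lm(y ~ x1 + x2, df)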