library(ggplot2)
library(dplyr)
# Prettier graphs
theme_set(theme_bw())
For each of the scenarios below, you should (1) state the null hypothesis and (2) identify the correct statistical test for that hypothesis.
Model 1:
lm(formula = Net_Tuition ~ Enrollment, data = college)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14225.3137 272.8034 52.1 <0.0000000000000002 ***
Enrollment -0.0820 0.0265 -3.1 0.002 **
Residual standard error: 7180 on 1093 degrees of freedom
Multiple R-squared: 0.00869, Adjusted R-squared: 0.00779
F-statistic: 9.58 on 1 and 1093 DF, p-value: 0.00201
Model 2:
lm(formula = Net_Tuition ~ Enrollment + Type, data = college)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5746.1019 377.2481 15.2 <0.0000000000000002 ***
Enrollment 0.2533 0.0239 10.6 <0.0000000000000002 ***
TypePrivate 10808.5970 398.6370 27.1 <0.0000000000000002 ***
Residual standard error: 5550 on 1092 degrees of freedom
Multiple R-squared: 0.408, Adjusted R-squared: 0.406
F-statistic: 376 on 2 and 1092 DF, p-value: <0.0000000000000002
Part A: For this part, consider Model 1 from above. What is the null hypothesis in linear regression? Based on the summary output, how would you describe the relationship between enrollment and tuition?
Part B: Now consider Model 2, which includes an indicator for whether or not a college is private. How would you interpret the intercept in this model? Is this a meaningful value in this model?
Part C: Compare the coefficient for Enrollment between Model 1 and Model 2. What has changed? In other words, what impact has adding an indicator for Private had on this value, and why did it result in such a drastic change?
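The sketch below shows one way Models 1 and 2 might be refit and compared when thinking about Part C. It assumes a college data frame with Net_Tuition, Enrollment, and Type columns matching the output above.
# Refit both models (sketch; assumes the college data frame described above)
m1 <- lm(Net_Tuition ~ Enrollment, data = college)
m2 <- lm(Net_Tuition ~ Enrollment + Type, data = college)
# Compare how the Enrollment coefficient changes once Type is included
coef(m1)["Enrollment"]
coef(m2)["Enrollment"]
# If enrollment and tuition both differ by Type, that relationship helps explain the shift
college %>%
  group_by(Type) %>%
  summarize(mean_enroll = mean(Enrollment), mean_tuition = mean(Net_Tuition))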
We are interested in predicting the wing length of chinstrap hawks using sex and culmen (upper beak) length. The following plots illustrate the relationship between these variables, with fitted regression lines from Model 1 and Model 3, respectively. Three models are presented below.
Model 1 (Culmen only)
lm(Wing ~ Culmen, data = hwk) %>% summary()
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 90.06 22.63 3.98 0.00018 ***
Culmen 8.78 1.28 6.86 0.0000000032 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 25 on 64 degrees of freedom
Multiple R-squared: 0.424, Adjusted R-squared: 0.415
F-statistic: 47.1 on 1 and 64 DF, p-value: 0.00000000323
Model 2 (Sex only)
lm(Wing ~ Sex, data = hwk) %>% summary()
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 257.24 5.24 49.1 < 0.0000000000000002 ***
SexM -26.67 7.41 -3.6 0.00062 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.1 on 64 degrees of freedom
Multiple R-squared: 0.168, Adjusted R-squared: 0.155
F-statistic: 13 on 1 and 64 DF, p-value: 0.000624
Model 3 (Culmen and Sex)
lm(Wing ~ Culmen + Sex, data = hwk) %>% summary()
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.45 27.20 4.02 0.00016 ***
Culmen 7.93 1.44 5.50 0.00000073 ***
SexM -8.82 6.94 -1.27 0.20836
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.9 on 63 degrees of freedom
Multiple R-squared: 0.438, Adjusted R-squared: 0.421
F-statistic: 24.6 on 2 and 63 DF, p-value: 0.0000000128
Part A: Consider Models 1 and 2. What proportion of the total variance is explained by Culmen length in Model 1 and by Sex in Model 2?
Part B: Based on Model 2, what conclusion could we come to regarding wing length and sex?
Part C: Consider the Model 1 regression plot above, which includes only Culmen. Does there appear to be any relationship between culmen length and sex? Explain.
Part D: Now consider Model 3. Does sex appear to be statistically significant in this model? Why do you think this is happening? Explain. (Hint: How is the interpretation of the coefficient for sex in Model 3 different from its interpretation in Model 2? The plot may help.)
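As a sketch, the three models and a plot like the Model 1 regression plot could be reproduced as follows; a hwk data frame with Wing, Culmen, and Sex columns is assumed to match the output above.
# Fit the three candidate models (sketch; column names assumed from the output above)
m1 <- lm(Wing ~ Culmen, data = hwk)
m2 <- lm(Wing ~ Sex, data = hwk)
m3 <- lm(Wing ~ Culmen + Sex, data = hwk)
# Wing length against culmen length, colored by sex, with the Model 1 fit overlaid
ggplot(hwk, aes(x = Culmen, y = Wing)) +
  geom_point(aes(color = Sex)) +
  geom_smooth(method = "lm", se = FALSE)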
The following dataset includes observations investigating the relationship between ecological factors and sleep in mammals. The variables below include the average hours spent dreaming during sleep as well as a predation index from 1 to 5 indicating how likely the animal is to be preyed upon (1 = least likely to be preyed upon, 5 = most likely to be preyed upon).
Part A: In considering an ANOVA, state the null hypothesis relating hours spent dreaming to the predation index.
Part B: Consider the following ANOVA output. Based on this, what conclusion would you come to regarding the null hypothesis in Part A when testing at \(\alpha = 0.05\)?
> aov(Dreaming ~ Predation, sleep) %>% summary()
Df Sum Sq Mean Sq F value Pr(>F)
Predation 4 22.1 5.53 3.12 0.024 *
Residuals 45 79.8 1.77
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Part C: Below are the results of Tukey’s HSD test. In terms of statistical power and multiple testing, explain how it is possible that we are able to reject the null hypothesis in Part B while failing to find any significant pairwise differences between predation levels.
> aov(Dreaming ~ Predation, sleep) %>% TukeyHSD()
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Dreaming ~ Predation, data = sleep)
$Predation
diff lwr upr p adj
2-1 -0.027692 -1.6197 1.564321 1.00000
3-1 -0.575556 -2.3146 1.163486 0.87953
4-1 -1.420000 -3.2852 0.445219 0.21235
5-1 -1.538182 -3.1919 0.115560 0.07941
3-2 -0.547863 -2.1891 1.093379 0.87621
4-2 -1.392308 -3.1667 0.382081 0.18764
5-2 -1.510490 -3.0611 0.040083 0.05954
4-3 -0.844444 -2.7519 1.062966 0.71771
5-3 -0.962626 -2.6638 0.738560 0.50031
5-4 -0.118182 -1.9482 1.711794 0.99974
Part D: Below, the same data are presented as a linear model. Why do we see some coefficients marked as statistically significant here at \(\alpha = 0.05\) that were not identified in the Tukey test?
lm(Dreaming ~ Predation, data = sleep) %>% summary()
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6200 0.4212 6.22 0.00000015 ***
Predation2 -0.0277 0.5603 -0.05 0.961
Predation3 -0.5756 0.6120 -0.94 0.352
Predation4 -1.4200 0.6564 -2.16 0.036 *
Predation5 -1.5382 0.5820 -2.64 0.011 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.33 on 45 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.217, Adjusted R-squared: 0.147
F-statistic: 3.12 on 4 and 45 DF, p-value: 0.0239
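To see where the difference between the Tukey output and the linear model output comes from, one sketch is to run the same pairwise comparisons with and without a multiplicity adjustment; this assumes the sleep data frame with Dreaming and a factor Predation column as above.
# Keep only rows with observed Dreaming values so every comparison uses the same cases
sleep_cc <- sleep %>% filter(!is.na(Dreaming))
# Unadjusted pairwise t-tests: analogous to the lm coefficients, which each compare
# one predation level against the reference level with no multiple-testing correction
pairwise.t.test(sleep_cc$Dreaming, sleep_cc$Predation, p.adjust.method = "none")
# The same comparisons with a family-wise adjustment, closer in spirit to Tukey's HSD
pairwise.t.test(sleep_cc$Dreaming, sleep_cc$Predation, p.adjust.method = "bonferroni")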
Consider the linear regression model below comparing the blood pressure of a control group and a treatment group following a clinical intervention. A plot of the results (with means indicated in orange) is also given below.
Model 1:
lm(formula = BP ~ Group, data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 130.69 2.59 50.46 <0.0000000000000002 ***
GroupTreatment -7.76 3.66 -2.12 0.037 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.3 on 98 degrees of freedom
Multiple R-squared: 0.0438, Adjusted R-squared: 0.034
F-statistic: 4.49 on 1 and 98 DF, p-value: 0.0367
Part A: Which group had the lower average blood pressure following the study?
Part B: Consider the model output. How is it possible that our coefficients are statistically significant, but we still end up with an \(R^2\) value of less than 5%?
Part C: Based on this, would treatment group be a good candidate for predicting blood pressure? Why or why not? If not, what value is there in creating a model like the one above?
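As a sketch for thinking about Parts B and C, the treatment effect, its confidence interval, and the share of variance explained can be pulled directly from the fitted model; a df data frame with BP and Group columns is assumed to match the output above.
# Refit the model and extract the treatment effect with its 95% confidence interval
fit <- lm(BP ~ Group, data = df)
confint(fit, "GroupTreatment")
# The same comparison framed as a pooled two-sample t-test
t.test(BP ~ Group, data = df, var.equal = TRUE)
# Proportion of the variation in BP explained by group membership
summary(fit)$r.squared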