library(ggplot2)
library(dplyr)

# Prettier graphs
theme_set(theme_bw())

Question 1

For each of these scenarios, (1) state the null hypothesis and (2) identify the correct statistical test for that hypothesis.

Question 2

Model 1:

lm(formula = Net_Tuition ~ Enrollment, data = college)

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)
(Intercept) 14225.3137   272.8034    52.1 <0.0000000000000002 ***
Enrollment     -0.0820     0.0265    -3.1               0.002 **

Residual standard error: 7180 on 1093 degrees of freedom
Multiple R-squared:  0.00869,   Adjusted R-squared:  0.00779
F-statistic: 9.58 on 1 and 1093 DF,  p-value: 0.00201

Model 2:

lm(formula = Net_Tuition ~ Enrollment + Type, data = college)

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)
(Intercept)  5746.1019   377.2481    15.2 <0.0000000000000002 ***
Enrollment      0.2533     0.0239    10.6 <0.0000000000000002 ***
TypePrivate 10808.5970   398.6370    27.1 <0.0000000000000002 ***

Residual standard error: 5550 on 1092 degrees of freedom
Multiple R-squared:  0.408, Adjusted R-squared:  0.406
F-statistic:  376 on 2 and 1092 DF,  p-value: <0.0000000000000002

Part A: For this part, consider Model 1 from above. What is the null hypothesis in linear regression? Based on the summary output, how would you describe the relationship between enrollment and tuition?

Part B: Now consider Model 2, which includes an indicator for whether or not a college is private. How would you interpret the intercept in this model? Is this a meaningful value in this model?

Part C: Compare the coefficient for Enrollment between Model 1 and Model 2. What has changed? In other words, what impact has adding an indicator for Private had on this value, and why did it result in such a drastic change?
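The sign flip asked about in Part C can be reproduced with simulated data. This is a hedged sketch, not the course's `college` data: the variable names (`enroll`, `private`) and the generating model are assumptions, chosen so that private colleges are smaller but charge more, which drives the pooled Enrollment slope negative even though the within-group slope is positive.

```r
# Simulated illustration (not the real college data): an omitted group
# indicator can reverse the sign of a slope.
set.seed(1)
n <- 500
private <- rbinom(n, 1, 0.5)
# Private colleges: small enrollment; public colleges: large enrollment
enroll  <- ifelse(private == 1, rnorm(n, 3000, 1000), rnorm(n, 15000, 4000))
# Tuition rises with enrollment within each group, plus a private premium
tuition <- 5000 + 0.25 * enroll + 11000 * private + rnorm(n, 0, 3000)
sim <- data.frame(tuition, enroll, private = factor(private))

coef(lm(tuition ~ enroll, data = sim))["enroll"]            # negative: confounded
coef(lm(tuition ~ enroll + private, data = sim))["enroll"]  # near +0.25: adjusted
```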

Question 3

We are interested in predicting the wing length of chinstrap hawks using sex and culmen (upper beak) length. The following plots illustrate the relationships among these variables, with fitted regression lines from Model 1 and Model 3, respectively. Three models are presented beneath.

Model 1 (Culmen only)

lm(Wing ~ Culmen, data = hwk) %>% summary()

Coefficients:
            Estimate Std. Error t value     Pr(>|t|)    
(Intercept)    90.06      22.63    3.98      0.00018 ***
Culmen          8.78       1.28    6.86 0.0000000032 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 25 on 64 degrees of freedom
Multiple R-squared:  0.424, Adjusted R-squared:  0.415 
F-statistic: 47.1 on 1 and 64 DF,  p-value: 0.00000000323

Model 2 (Sex only)

lm(Wing ~ Sex, data = hwk) %>% summary()

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   257.24       5.24    49.1 < 0.0000000000000002 ***
SexM          -26.67       7.41    -3.6              0.00062 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 30.1 on 64 degrees of freedom
Multiple R-squared:  0.168, Adjusted R-squared:  0.155 
F-statistic:   13 on 1 and 64 DF,  p-value: 0.000624

Model 3 (Culmen and Sex)

lm(Wing ~ Culmen + Sex, data = hwk) %>% summary()

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept)   109.45      27.20    4.02    0.00016 ***
Culmen          7.93       1.44    5.50 0.00000073 ***
SexM           -8.82       6.94   -1.27    0.20836    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.9 on 63 degrees of freedom
Multiple R-squared:  0.438, Adjusted R-squared:  0.421 
F-statistic: 24.6 on 2 and 63 DF,  p-value: 0.0000000128

Part A: Consider Models 1 and 2. How much total variance is explained by Culmen length or Sex in each of these models?

Part B: Based on Model 2, what conclusion could we come to regarding wing length and sex?

Part C: Consider the Model 1 regression plot above, which includes only Culmen. Does there appear to be any relationship between culmen length and sex? Explain.

Part D: Now consider Model 3. Does sex appear to be statistically significant in this model? Why do you think this is happening? Explain. (Hint: How is the interpretation of the coefficient for sex in Model 3 different from that in Model 2? Using the plot may help.)
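The behavior in Part D can be mimicked with simulated data. This is a hedged sketch, not the `hwk` data: the effect sizes are assumptions, chosen so that sex is strongly associated with culmen length but has no effect on wing length once culmen is held fixed.

```r
# Simulated illustration (not the real hawk data): a predictor that only
# proxies another loses significance once both are in the model.
set.seed(2)
n <- 66
sexM   <- rbinom(n, 1, 0.5)
culmen <- 17 - 3 * sexM + rnorm(n)           # males have shorter culmens
wing   <- 90 + 8 * culmen + rnorm(n, 0, 20)  # wing depends on culmen only

# Alone, sex looks significant because it stands in for culmen length
summary(lm(wing ~ sexM))$coefficients["sexM", "Pr(>|t|)"]
# With culmen in the model, sex has little left to explain
summary(lm(wing ~ culmen + sexM))$coefficients["sexM", "Pr(>|t|)"]
```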

Question 4

The following dataset includes observations investigating the relationship between ecological factors and sleep patterns in mammals. The variables below include the average hours spent dreaming during sleep as well as an index from 1 to 5 indicating the degree to which the animal is subject to predation (1 = least likely to be preyed upon, 5 = most likely to be preyed upon).

Part A: In considering an ANOVA, state the null hypothesis relating hours dreaming to the predation index.

Part B: Consider the following ANOVA output. Based on this, what conclusion would you come to regarding the null hypothesis in Part A when testing at \(\alpha = 0.05\)?

> aov(Dreaming ~ Predation, sleep) %>% summary()

            Df Sum Sq Mean Sq F value Pr(>F)  
Predation    4   22.1    5.53    3.12  0.024 *
Residuals   45   79.8    1.77                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Part C: Below are the results of Tukey’s HSD test. As it relates to statistical power and multiple testing, explain how it is possible that we reject the null hypothesis in Part B while failing to reject any of the pairwise differences between predation levels.

> aov(Dreaming ~ Predation, sleep) %>% TukeyHSD()
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Dreaming ~ Predation, data = sleep)

$Predation
         diff     lwr      upr   p adj
2-1 -0.027692 -1.6197 1.564321 1.00000
3-1 -0.575556 -2.3146 1.163486 0.87953
4-1 -1.420000 -3.2852 0.445219 0.21235
5-1 -1.538182 -3.1919 0.115560 0.07941
3-2 -0.547863 -2.1891 1.093379 0.87621
4-2 -1.392308 -3.1667 0.382081 0.18764
5-2 -1.510490 -3.0611 0.040083 0.05954
4-3 -0.844444 -2.7519 1.062966 0.71771
5-3 -0.962626 -2.6638 0.738560 0.50031
5-4 -0.118182 -1.9482 1.711794 0.99974

Part D: Below is the same data presented as a linear model. Why are some coefficients marked as statistically significant here at \(\alpha = 0.05\) when they were not identified in the Tukey test?

lm(Dreaming ~ Predation, data = sleep)

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept)   2.6200     0.4212    6.22 0.00000015 ***
Predation2   -0.0277     0.5603   -0.05      0.961    
Predation3   -0.5756     0.6120   -0.94      0.352    
Predation4   -1.4200     0.6564   -2.16      0.036 *  
Predation5   -1.5382     0.5820   -2.64      0.011 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.33 on 45 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.217, Adjusted R-squared:  0.147 
F-statistic: 3.12 on 4 and 45 DF,  p-value: 0.0239
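The contrast between Parts C and D comes down to multiplicity adjustment: `summary(lm(...))` reports unadjusted p-values for each level against the baseline, while `TukeyHSD()` controls the family-wise error rate across all ten pairwise comparisons, so its adjusted p-value for the same contrast is always at least as large. A hedged sketch with simulated data (the group means here are assumptions, not the `sleep` data):

```r
# Simulated illustration: the same 5-vs-1 contrast gets an unadjusted
# p-value from lm() but a family-wise adjusted one from TukeyHSD().
set.seed(3)
g <- factor(rep(1:5, each = 10))
y <- rnorm(50, mean = rep(c(2.6, 2.6, 2.2, 1.8, 1.6), each = 10))
fit <- aov(y ~ g)

TukeyHSD(fit)$g["5-1", "p adj"]                    # adjusted for 10 comparisons
summary(lm(y ~ g))$coefficients["g5", "Pr(>|t|)"]  # unadjusted, 5 vs baseline
```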

Question 5

Consider the linear regression model below comparing the blood pressure of a control group and a treatment group following a clinical intervention. A plot of the results (with means indicated in orange) is also given below.

Model 1:

lm(formula = BP ~ Group, data = df)

Coefficients:
               Estimate Std. Error t value            Pr(>|t|)    
(Intercept)      130.69       2.59   50.46 <0.0000000000000002 ***
GroupTreatment    -7.76       3.66   -2.12               0.037 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.3 on 98 degrees of freedom
Multiple R-squared:  0.0438,    Adjusted R-squared:  0.034 
F-statistic: 4.49 on 1 and 98 DF,  p-value: 0.0367

Part A: Which group had the lower average blood pressure following the study?

Part B: Consider the model output. How is it possible that our coefficients are statistically significant, but we still end up with an \(R^2\) value of less than 5%?

Part C: Based on this, would treatment group be a good candidate for predicting blood pressure? Why or why not? If not, what value is there in building a model like the one above?
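The situation in Parts B and C is easy to reproduce: a real but modest group difference buried in large individual variation yields a significant coefficient alongside a small \(R^2\). A hedged sketch with simulated values (the means and spread loosely echo the output above but are assumptions, not the study data):

```r
# Simulated illustration: significant group effect, tiny R-squared.
set.seed(4)
n <- 200
bp  <- c(rnorm(n, 131, 18), rnorm(n, 123, 18))  # 8 mmHg true difference
grp <- rep(c("Control", "Treatment"), each = n)
fit <- summary(lm(bp ~ grp))

fit$coefficients["grpTreatment", "Pr(>|t|)"]  # small: the effect is real
fit$r.squared                                 # small: noise dominates BP
```

The significance test asks only whether the group effect is nonzero; \(R^2\) asks how much of the person-to-person variation it accounts for, and those are different questions.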