## Copy and paste the code chunk below into the top of your R Markdown document
library(ggplot2)
library(dplyr)

theme_set(theme_bw())

## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")

Introduction

This lab is an opportunity to practice some old skills and learn some new ones. In particular, our goals are to practice:

  • Using qnorm() and qt() to find critical values
  • Using dplyr to create variables for analysis
  • Using R as a calculator (order of operations)

Quantile functions

Briefly, recall that qnorm() and qt() are quantile functions: given a probability, each returns the value of the distribution that has that much area to its left. We use them to find the critical values for confidence intervals.

## Quantiles for 80% CI for a standard normal
qnorm(c(0.1, 0.9))
## [1] -1.2816  1.2816
## For t-distribution, need to include df
qt(c(0.1, 0.9), df = 15)
## [1] -1.3406  1.3406
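As a small illustration of why qt() needs df, the sketch below compares the upper critical value for an 80% interval across several degrees of freedom. The df values and the names dfs, t_crit and z_crit are arbitrary choices for this example. The t value is larger than the normal value and moves toward it as df grows.

```r
## Upper critical value for an 80% interval at several (arbitrarily chosen) df
dfs <- c(5, 15, 30, 100, 1000)
t_crit <- qt(0.9, df = dfs)

## The matching value for a standard normal
z_crit <- qnorm(0.9)

## Side-by-side comparison
data.frame(df = dfs, t_crit = round(t_crit, 4), z_crit = round(z_crit, 4))
```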

Because both distributions are symmetric about zero, the two critical values are always negatives of each other, so we can shorten this and ask only for the positive (or negative) one:

qnorm(0.95)
## [1] 1.6449

This tells me that my critical values for a 90% confidence interval (5% in each tail) are approximately 1.65 and -1.65.
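The step from a confidence level to the probability passed to qnorm() can be written out explicitly. This is a minimal sketch; the names conf_level, alpha and z_star are illustrative and not part of the lab.

```r
## Confidence level (an illustrative choice; change as needed)
conf_level <- 0.90

## Total area left over in the two tails combined
alpha <- 1 - conf_level

## Half of that area goes in each tail, so the upper critical value
## has 1 - alpha/2 of the area to its left
z_star <- qnorm(1 - alpha / 2)

## The lower critical value is the negative of the upper one
c(-z_star, z_star)
```

The same pattern works for qt() by adding the df argument.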

dplyr

For a refresher on creating variables with dplyr, see the dplyr reference lab.
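As a quick reminder of the general pattern, here is a sketch that creates a new variable with mutate() and ifelse(). The data frame toy, its columns, and the cutoff of 70 are all made up for illustration and have nothing to do with the college data.

```r
library(dplyr)

## Made-up data for illustration only (not from the college dataset)
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  score = c(71, 88, 64, 93)
)

## mutate() adds a column; ifelse() picks a label for each row
toy <- toy %>%
  mutate(Result = ifelse(score >= 70, "Pass", "Fail"))

toy
```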

CalculatoR

This may be obvious, but it is worth knowing convenient ways to use R as a calculator. For example, we saw in the lecture slides (and homework) that the Johns Hopkins preterm study had 39 total babies, of which 31 survived at 6 months. If I wanted a 90% confidence interval for the proportion that survived, I may be tempted to type:

## With n = 39, the t critical values use df = n - 1 = 38
(31/39) + qt(c(0.05, 0.95), df = 38) * sqrt((31/39) * (1 - 31/39) / 39)
## [1] 0.68586 0.90388

However, typing everything on one line leaves tremendous room for error. More simply, we can take advantage of the fact that R allows us to assign variables:

## Assign p and n
p <- 31/39
n <- 39

## Standard error formula
se <- sqrt( p * (1-p) / n )

## Confidence interval, with df computed from n
p + qt(c(0.05, 0.95), df = n - 1) * se
## [1] 0.68586 0.90388

Note that R follows the usual order of operations (PEMDAS), so an expression such as p * (1-p) / n is evaluated as intended without extra parentheses.
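A few small cases show where parentheses change the answer. This is only a sketch, and the variable names are chosen for this example.

```r
## Division happens before subtraction, so these two are different
a <- 1 - 31/39       # 1 minus the fraction 31/39
b <- (1 - 31)/39     # (1 minus 31), then divided by 39
c(a, b)

## Exponents are applied before a leading minus sign
neg_sq <- -2^2       # read as -(2^2)
sq_neg <- (-2)^2     # the negative number is squared
c(neg_sq, sq_neg)
```

When in doubt, adding parentheses costs nothing and makes the intent clear.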

Problem 1

The following code takes a sample of size \(n = 20\) from the college dataset. The function set.seed() takes an integer value and fixes the state of the random number generator, so that any random operations we run afterwards (such as sampling) give the same result each time. Here, we set our seed to be 123. Copy all of the code below into your own document.

## Controls the randomness in the sampling step
set.seed(123)

## Sample 20 row numbers
idx <- sample(1:nrow(college), size = 20)

## Only keep those rows that were sampled
c_samp <- college[idx, ]

Done this way, we can all be sure that we are working with an identical sample.
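To see the effect of set.seed() in isolation, the sketch below draws from the numbers 1 to 100 rather than from the college data. The range and the names first_draw, second_draw and third_draw are illustrative only.

```r
## Same seed, same draw
set.seed(123)
first_draw <- sample(1:100, size = 5)

set.seed(123)
second_draw <- sample(1:100, size = 5)

## TRUE: resetting the seed reproduces the sample exactly
identical(first_draw, second_draw)

## Without resetting the seed, the next draw continues from where the
## generator left off and will generally be different
third_draw <- sample(1:100, size = 5)
```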

Part A What is the sampling distribution of \(\overline{X}\) for the ACT_median variable, based on the values you found in your sample c_samp?

Part B Construct a 90% confidence interval for the true mean of the median ACT.

Part C Treating the full college dataset as our entire population, find the true mean of the median ACT. Use this to construct a \(t\)-statistic to see how far our sample mean is from the true mean.

Problem 2

Here, we will again start by taking a random sample from our college dataset, this time of size 50.

## Controls the randomness in the sampling step
set.seed(1)

## Sample 50 row numbers
idx <- sample(1:nrow(college), size = 50)

## Only keep those rows that were sampled
c_samp <- college[idx, ]

Use the sample to answer the following questions.

Part A First, use the appropriate dplyr function, along with ifelse(), to construct a new variable called Size that takes the value “Small” for colleges with enrollment less than 2500 and “Large” for colleges with enrollment greater than or equal to 2500.

Part B Create a table using the table() function to show the number of public and private schools that are considered large and small.

Part C From your table, how many colleges in your sample are Private? How many are Public? What proportion of each is considered a large school?

Part D Construct a confidence interval, at a confidence level of your choosing and with the appropriate critical values, that provides a range of reasonable values for the true difference in proportions between public and private schools. Based on this interval, does it seem plausible that the true difference in proportions is zero?

Part E Find the \(t\)-statistic associated with the observed difference in proportions.