## Copy and paste the code chunk below into the top of your R Markdown document
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
## College data
college <- read.csv("https://collinn.github.io/data/college2019.csv")
This lab is intended to serve as an opportunity to practice some old skills and learn some new ones. In particular, our goals are to practice using:
- qnorm() and qt()
- dplyr to create variables for analysis
- R as a calculator (order of operations)

Briefly, recall that the qnorm() and qt() functions are used to find the critical values associated with particular quantiles of a distribution.
## Quantiles for 80% CI for a standard normal
qnorm(c(0.1, 0.9))
## [1] -1.2816 1.2816
## For t-distribution, need to include df
qt(c(0.1, 0.9), df = 15)
## [1] -1.3406 1.3406
Because both distributions are symmetric about zero, the two critical values will always be negatives of each other, so we can shorten this and simply ask for the positive (or negative) one:
qnorm(0.95)
## [1] 1.6449
This tells me that my critical values for a 90% confidence interval are 1.65 and -1.65: asking for the 0.95 quantile leaves 5% in each tail and 90% in the middle.
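The link between a confidence level and the quantile passed to qnorm() can be written out step by step. The following is only a small sketch; the names conf_level, alpha, and z_star are chosen here for illustration and are not used elsewhere in the lab.

```r
## Confidence level we want (90% here)
conf_level <- 0.90
## Total probability left over for the two tails
alpha <- 1 - conf_level
## Upper critical value: leaves alpha / 2 in the upper tail
z_star <- qnorm(1 - alpha / 2)
## By symmetry, the lower critical value is the negative of the upper one
c(-z_star, z_star)
```

Changing conf_level to 0.80 gives the same critical values as qnorm(c(0.1, 0.9)).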
dplyr
The reference lab can be found here
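As a brief refresher, here is a minimal sketch of creating a new variable with dplyr. The data frame toy and its columns are hypothetical and made up for this illustration; they are not part of the college dataset.

```r
library(dplyr)

## Hypothetical toy data, invented for illustration only
toy <- data.frame(name = c("a", "b", "c", "d"),
                  score = c(55, 72, 90, 64))

## mutate() adds a new column; ifelse() chooses a label for each row
toy <- toy %>%
  mutate(Result = ifelse(score >= 65, "Pass", "Fail"))

toy
```

The first argument to ifelse() is the condition, the second is the value used when the condition is TRUE, and the third is the value used when it is FALSE.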
This may be obvious, but it is worthwhile understanding convenient ways to use R as a calculator. For example, we saw in the lecture slides (and homework) that the Johns Hopkins preterm study had 39 total babies, of which 31 survived at 6 months. If I wanted a confidence interval for this proportion, I may be tempted to type:
(31/39) + qt(c(0.05, 0.95), df = 28) * sqrt((31/39) * (1 - 31/39) / 39)
## [1] 0.68488 0.90487
However, typing everything on one line leaves tremendous room for error. More simply, we can take advantage of the fact that R allows us to assign variables:
## Assign p and n
p <- 31/39
n <- 39
## Standard error formula
se <- sqrt( p * (1-p) / n )
## Confidence interval
p + qt(c(0.05, 0.95), df = 28) * se
## [1] 0.68488 0.90487
Note that R follows the usual order of operations (PEMDAS): in the last line, the multiplication by se is carried out before the result is added to p.
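A few one-line checks can help when you are unsure how R will group an expression. These are generic arithmetic examples, not taken from the lab data.

```r
## Multiplication happens before addition
2 + 3 * 4
## [1] 14

## Parentheses force the addition to happen first
(2 + 3) * 4
## [1] 20

## Exponents are applied before the leading minus sign
-2^2
## [1] -4

## Parentheses make the negative number the base
(-2)^2
## [1] 4
```

When in doubt, adding parentheses costs nothing and makes the intended grouping explicit.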
The following code takes a sample of size \(n = 20\) from the college dataset. The function set.seed() takes an integer value and ensures that any random operations we perform afterward (such as sampling) will give the same result each time the code is run. Here, we set our seed to be 123. Copy all of the code below into your own document.
## Controls the randomness in the sampling step
set.seed(123)
## Sample 20 row numbers
idx <- sample(1:nrow(college), size = 20)
## Only keep those rows that were sampled
c_samp <- college[idx, ]
Done this way, we can all be sure that we are working with an identical sample.
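To see the effect of set.seed() directly, the sketch below draws the same kind of sample twice after resetting the seed. The numbers 1 to 100 and the names first_draw and second_draw are placeholders for illustration and are unrelated to the college data.

```r
## Set the seed, then draw 5 numbers at random
set.seed(123)
first_draw <- sample(1:100, size = 5)

## Reset to the same seed and draw again
set.seed(123)
second_draw <- sample(1:100, size = 5)

## The two draws match exactly
identical(first_draw, second_draw)
## [1] TRUE
```

Without the second call to set.seed(), the two draws would generally differ.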
Part A What is the sampling distribution of \(\overline{X}\) for the ACT_median variable, based on the values you found in your sample c_samp?
Part B Construct a 90% confidence interval for the true average value of the median ACT.
Part C Treating the full college dataset as our entire population, find the true mean of the median ACT. Use this to construct a \(t\)-statistic to see how far our sample mean is from the true mean.
Here, we will again start by taking a random sample of our college dataset, this time of size 50.
## Controls the randomness in the sampling step
set.seed(1)
## Sample 50 row numbers
idx <- sample(1:nrow(college), size = 50)
## Only keep those rows that were sampled
c_samp <- college[idx, ]
Use the sample to answer the following questions.
Part A First, use the appropriate dplyr function, along with ifelse(), to construct a new variable called Size that takes the value “Small” for colleges with enrollment less than 2500 and “Large” for colleges with enrollment greater than or equal to 2500.
Part B Create a table using the table() function to show the number of public and private schools that are considered large and small.
Part C From your table, how many colleges in your sample are Private? How many are Public? What proportion of each is considered a large school?
Part D Choose a confidence level and, using the appropriate critical values, create a confidence interval that provides a range of reasonable values for the true difference in proportions between public and private schools. Based on this interval, does it seem plausible that the true difference in proportions is equal to zero?
Part E Find the \(t\)-statistic associated with the observed difference in proportions.