Introduction

Run the code chunk below. It does three things:

  • Loads the necessary packages and datasets
  • Creates the gh() function, a very convenient shortcut for creating histograms
  • Sources R code from a URL that contains the functions we will use. After you run this, you should see the functions in your environment
library(ggplot2)
library(dplyr)

theme_set(theme_bw())

college <- read.csv("https://collinn.github.io/data/college2019.csv")

## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Source functions
source("https://collinn.github.io/f24/labs/clt_lab_functions.R")

For an illustration of how gh() is used, see:

## Default way
p1 <- ggplot(college, aes(ACT_median)) + 
  geom_histogram(color = 'black', fill = 'gray80', bins = 15)

## Superior awesome gh() way
p2 <- ggplot(college, aes(ACT_median)) + gh(bins = 15)

## They are identical
gridExtra::grid.arrange(p1, p2, nrow = 1)

Sampling Distribution and CLT

sampleNormalData()

Here, we introduce the sampleNormalData() function. This function takes three main arguments (the example call below also sets samples = 1000, the number of repetitions):

  • n The number of observations we want in our sample
  • mu The true population mean
  • sig The true population standard deviation

This function works by

  1. Drawing n observations from a normal distribution with mean mu and standard deviation sig
  2. Taking the mean value (\(\overline{x}\)) from this sample
  3. Repeating this process 1000 times

It returns a data.frame in which each row contains a sample mean and its sample number. Note that this dataset forms a distribution of \(\overline{x}\): it shows what we would see if we could draw a sample repeatedly and record the value of \(\overline{x}\) each time.
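
For intuition, the three steps above can be sketched in a few lines of base R. This is only a sketch: sketchSampleNormalData() is a hypothetical name, and the lab's own sampleNormalData() may be written differently. The last three lines compare the spread of the sample means with \(\sigma / \sqrt{n}\), the value the CLT leads us to expect.

```r
## Hypothetical stand-in for sampleNormalData(); the lab's real code may differ
sketchSampleNormalData <- function(n, mu, sig, samples = 1000) {
  ## Draw n normal observations, take their mean, and repeat `samples` times
  xbar <- replicate(samples, mean(rnorm(n, mean = mu, sd = sig)))
  data.frame(xbar = xbar, sample = seq_len(samples))
}

set.seed(1)  # only so the sketch is reproducible
df_sketch <- sketchSampleNormalData(n = 15, mu = 100, sig = 15)

## The sample means center on mu, with spread close to sig / sqrt(n)
mean(df_sketch$xbar)
sd(df_sketch$xbar)
15 / sqrt(15)
```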

df <- sampleNormalData(n = 15, mu = 100, sig = 15, samples = 1000)

## Example of first 10 rows
head(df, 10)
##       xbar sample
## 1   99.952      1
## 2  104.601      2
## 3   99.632      3
## 4   98.150      4
## 5  102.565      5
## 6   97.524      6
## 7  101.808      7
## 8  104.354      8
## 9   96.213      9
## 10 100.061     10
## Plot distribution of xbar with gh()
ggplot(df, aes(xbar)) + gh(bins = 10)

getSampleMean()

Here, we introduce the getSampleMean() function. This function takes 3 arguments:

  • data The dataset we wish to sample
  • variable The name of the variable we want to sample
  • n The number of observations we want in each sample

This function works similarly to sampleNormalData() in that n tells us the sample size. However, instead of sampling from a normal distribution, we provide two other arguments: data tells us which dataset we wish to use, and variable tells us which variable we want to sample and compute a sample mean for.

For example, the following code collects 1000 samples from the variable FourYearComp_Males, each with 20 observations. It then returns a data.frame where each row includes the sample number as well as the mean for that sample.

## Sample from the variable FourYearComp_Males
df_males <- getSampleMean(college, FourYearComp_Males, n = 20)

## Example of first 10 rows
head(df_males, 10)
##       xbar sample
## 1  0.47904      1
## 2  0.48849      2
## 3  0.46714      3
## 4  0.45684      4
## 5  0.48785      5
## 6  0.48852      6
## 7  0.43937      7
## 8  0.49809      8
## 9  0.52052      9
## 10 0.50882     10
## Plot distribution of xbar with gh()
ggplot(df_males, aes(xbar)) + gh(bins = 10)
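
To see roughly what happens inside a function like this, here is a minimal sketch that resamples a column of a data.frame. Several things here are assumptions rather than the lab's actual code: the name sketchGetSampleMean() is hypothetical, the variable is passed as a string instead of a bare name, missing values are dropped, and each sample is drawn without replacement. It uses the built-in mtcars dataset so that it runs without downloading anything.

```r
## Hypothetical stand-in for getSampleMean(); takes the variable name as a string
sketchGetSampleMean <- function(data, variable, n, samples = 1000) {
  x <- data[[variable]]
  x <- x[!is.na(x)]  # assumption: missing values are dropped
  ## Assumption: each sample of size n is drawn without replacement
  xbar <- replicate(samples, mean(sample(x, size = n)))
  data.frame(xbar = xbar, sample = seq_len(samples))
}

set.seed(1)  # only so the sketch is reproducible
df_mpg <- sketchGetSampleMean(mtcars, "mpg", n = 20)

## Compare the center of the sample means with the mean of the full variable
mean(df_mpg$xbar)
mean(mtcars$mpg)
```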

simulateConfInt()

Here, we introduce the simulateConfInt() function. This function takes 3 arguments:

  • n Number of observations in sample
  • m Multiplier for the standard error (e.g., m = 2 gives a CI of \(\pm\) 2 standard errors)
  • sd The population standard deviation

The function works by simulating the construction of 25 confidence intervals. The value n determines the number of observations in each sample. The value m determines the width of our confidence interval, as in the equation

\[ \overline{x} \pm m \times s \]

where \(s\) is the standard error (i.e., \(s = \sigma / \sqrt{n}\)). Finally, sd sets the population standard deviation \(\sigma\).
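
The interval formula can be tried out directly with a short simulation. The sketch below is not the lab's simulateConfInt(): the name sketchConfInt() is hypothetical, and the population mean mu = 0 and the argument reps are assumptions added for illustration. It builds 25 intervals of the form \(\overline{x} \pm m \times \sigma / \sqrt{n}\) and records whether each one contains the population mean.

```r
## Hypothetical sketch; simulateConfInt() itself may work differently
sketchConfInt <- function(n, m, sd, mu = 0, reps = 25) {
  se <- sd / sqrt(n)  # standard error of the sample mean
  ## One sample mean per simulated interval
  xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sd)))
  ci <- data.frame(xbar = xbar, lower = xbar - m * se, upper = xbar + m * se)
  ## TRUE when the interval contains the population mean
  ci$covers <- ci$lower <= mu & mu <= ci$upper
  ci
}

set.seed(1)  # only so the sketch is reproducible
ci <- sketchConfInt(n = 15, m = 1, sd = 15)

## Number of the 25 intervals that contain mu
sum(ci$covers)
```

Rerunning the last two lines with a larger m or a different n shows how the count of intervals containing the mean changes.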

The following example constructs 25 confidence intervals with \(n = 15\) observations in each sample. The population has a standard deviation of \(sd = 15\), and we construct each interval as plus or minus 1 standard error (i.e., this is roughly a 68% confidence interval). For now, we won’t concern ourselves with the true percentage that our confidence interval should be; rather, we will focus on how changing the different arguments either increases or decreases the number of times our confidence interval contains the population mean.

simulateConfInt(n = 15, m = 1, sd = 15)

Although it will not be necessary, I will provide this illustration again here for reference. We can think of the \(m\) term as being \(m = 0, 1, 2, 3\) below (i.e., \(\mu \pm m \times \sigma\)) to give context for where these values should lie.
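
As a numeric companion to the illustration, the share of a normal distribution that lies within \(m\) standard deviations of its mean can be computed with pnorm(). This is a small sketch that assumes the distribution of \(\overline{x}\) is approximately normal; the variable names are only for illustration.

```r
## Proportion of a normal distribution within m standard deviations of the mean
m <- 0:3
coverage <- 2 * pnorm(m) - 1

data.frame(m = m, coverage = round(coverage, 3))
```

The value for m = 1 is where the "roughly 68%" figure mentioned above comes from.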