Introduction

Run the chunk below. This will do three things:

  • Load necessary packages and datasets
  • Create the gh() function, a very convenient shortcut for creating histograms
  • Source R code from a URL; this contains the functions we will use. After you run the chunk, you should see these functions in your environment.

The rest of this lab illustrates how we will use the three sourced functions to complete our CLT Funsheet.

library(ggplot2)
library(dplyr)

theme_set(theme_bw())

college <- read.csv("https://collinn.github.io/data/college2019.csv")

## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}

## Source functions
source("https://collinn.github.io/s25/209/labs/clt_lab_functions.R")

The gh() function allows us to create nice histograms with shading and outline without having to pass these in as arguments every time. Given the number of histograms we will be creating, this will hopefully prove very useful. For an illustration of how gh() is used, see:

## Godless and without any style
p0 <- ggplot(college, aes(ACT_median)) + 
  geom_histogram(bins = 15)

## Lame default way
p1 <- ggplot(college, aes(ACT_median)) + 
  geom_histogram(color = 'black', fill = 'gray80', bins = 15)

## Superior awesome gh() way
p2 <- ggplot(college, aes(ACT_median)) + gh(bins = 15)

## The last two are identical
gridExtra::grid.arrange(p0, p1, p2, nrow = 1)

Sampling Distribution and CLT

sampleNormalData()

Here, we introduce the sampleNormalData() function. This function takes 3 arguments:

  • n The number of observations we want in our sample
  • mu The true population mean
  • sig The true population standard deviation

Essentially, this function pretends that there is a population that is normally distributed with a mean value of \(\mu\) and a standard deviation of \(\sigma\) from which we can draw samples as often as we wish. This function works by:

  1. Drawing n observations from a normal distribution with mean mu and standard deviation sig
  2. Taking the mean value (\(\overline{x}\)) from this sample
  3. Repeating this process 1000 times

It returns a data.frame where each row contains the sample number and the mean of that sample. An important point: this dataset makes up a distribution of \(\overline{x}\), in the sense that it shows what we would see if we could draw a sample repeatedly and record each value of \(\overline{x}\). In other words, it allows us to simulate a sampling distribution for the statistic \(\overline{x}\).
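The three steps above can be sketched in a few lines of base R. This is only an illustration of the idea, not the code in clt_lab_functions.R; the name sample_normal_sketch() and its reps argument are made up for this example.

```r
## Hypothetical stand-in for sampleNormalData(); the sourced function may differ
sample_normal_sketch <- function(n, mu, sig, reps = 1000) {
  ## Steps 1 and 2: draw n normal observations and keep only their mean
  ## Step 3: replicate() repeats this reps times
  xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sig)))
  data.frame(xbar = xbar, sample = seq_len(reps))
}

set.seed(209)  # arbitrary seed so the sketch is reproducible
df_sketch <- sample_normal_sketch(n = 20, mu = 30, sig = 10)
head(df_sketch, 3)
```

Each call produces a fresh set of 1000 sample means, so the individual values change from run to run unless a seed is set.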

## Creates 1,000 samples of size n = 20
df <- sampleNormalData(n = 20, mu = 30, sig = 10)

## Example of first 10 rows
## Each row represents the sample mean from one sample
head(df, 10)
##      xbar sample
## 1  28.884      1
## 2  29.982      2
## 3  27.139      3
## 4  29.016      4
## 5  36.390      5
## 6  31.894      6
## 7  32.425      7
## 8  35.840      8
## 9  29.500      9
## 10 31.689     10
## Plot distribution of xbar with gh()
## This represents the sampling distribution
ggplot(df, aes(xbar)) + gh(bins = 10)

From this histogram, I can see that it is centered around 30 (the value of \(\mu\)) and most of the values appear to fall between 25 and 35.
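The 25 to 35 range is about what we should expect. As a quick check (plain arithmetic, not one of the sourced functions), the standard deviation of \(\overline{x}\) is \(\sigma / \sqrt{n}\), and roughly 95% of sample means should fall within two of these standard errors of \(\mu\):

```r
mu  <- 30
sig <- 10
n   <- 20

## Standard error: the standard deviation of the sampling distribution
se <- sig / sqrt(n)
se

## Range that should cover roughly 95% of the sample means
c(lower = mu - 2 * se, upper = mu + 2 * se)
```

Here the standard error is about 2.24, so the range runs from roughly 25.5 to 34.5, in line with the histogram.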

getSampleMean()

Here, we introduce the getSampleMean() function. This function takes 3 arguments:

  • data The dataset we wish to sample
  • variable The name of the variable we want to sample
  • n The number of observations we want in each sample

This function works similarly to sampleNormalData() in that n tells us the sample size. However, instead of sampling from a normal distribution, we provide two other arguments: data tells us which dataset we wish to use, and variable tells us which variable we want to sample and compute a sample mean for.

For example, the following code collects 1000 samples from the variable FourYearComp_Males, each with 20 observations. It then returns a data.frame where each row includes the sample number as well as the mean for that sample.

## Sample from the variable FourYearComp_Males
df_males <- getSampleMean(college, FourYearComp_Males, n = 20)

## Example of first 10 rows
head(df_males, 10)
##       xbar sample
## 1  0.44626      1
## 2  0.45752      2
## 3  0.47678      3
## 4  0.42341      4
## 5  0.49308      5
## 6  0.54315      6
## 7  0.48093      7
## 8  0.48998      8
## 9  0.48464      9
## 10 0.46162     10
## Plot distribution of xbar with gh()
ggplot(df_males, aes(xbar)) + gh(bins = 10)

This gives us an estimate of the sampling distribution for \(\overline{x}\) from the four year completion rate of males when the sample size is \(n = 20\).
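For intuition, resampling from a dataset can be sketched in base R as follows. This is not the sourced getSampleMean(): the name get_sample_mean_sketch() is hypothetical, it takes the variable name as a string, and it assumes that missing values are dropped and that each sample is drawn without replacement. The built-in mtcars data is used so the example runs without downloading anything.

```r
## Hypothetical stand-in for getSampleMean(); the sourced function may differ
get_sample_mean_sketch <- function(data, variable, n, reps = 1000) {
  x <- data[[variable]]
  x <- x[!is.na(x)]  # assumption: missing values are dropped before sampling
  ## assumption: each sample of size n is drawn without replacement
  xbar <- replicate(reps, mean(sample(x, size = n)))
  data.frame(xbar = xbar, sample = seq_len(reps))
}

set.seed(209)  # arbitrary seed so the sketch is reproducible
df_mpg <- get_sample_mean_sketch(mtcars, "mpg", n = 20)
head(df_mpg, 3)
```

The only change from the normal-data sketch is where the observations come from: sample() draws from an observed variable instead of rnorm() drawing from a theoretical distribution.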

simulateConfInt()

Here, we introduce the simulateConfInt() function. This function takes 3 arguments:

  • n Number of observations in sample
  • C Multiplier for standard error, also called the critical value. This is the value that relates standard deviations to percentiles
  • sd The population standard deviation

The function works by simulating the construction of 25 confidence intervals. The value n will determine the number of observations in our sample. The value for C determines the range of our confidence interval, as in the equation

\[ \overline{x} \pm C \times \frac{\sigma}{\sqrt{n}} \]

Finally, sd determines the standard deviation from the population, \(\sigma\).
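To make the formula concrete, here is a minimal sketch of one way 25 such intervals could be simulated. It is not the sourced simulateConfInt(); the function name, the mu argument, and the choice of a normal population with mean 0 are assumptions made for illustration.

```r
## Hypothetical sketch; simulateConfInt() itself may work differently
conf_int_sketch <- function(n, C, sd, mu = 0, reps = 25) {
  ## One sample mean per interval
  xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sd)))
  ## Margin of error: C standard errors
  moe <- C * sd / sqrt(n)
  out <- data.frame(lower = xbar - moe, upper = xbar + moe)
  ## Does each interval contain the true mean?
  out$covers <- out$lower <= mu & mu <= out$upper
  out
}

set.seed(209)  # arbitrary seed so the sketch is reproducible
ci <- conf_int_sketch(n = 15, C = 0.5, sd = 50)
sum(ci$covers)  # number of the 25 intervals that contain mu
```

Every interval has the same width, because the margin of error depends only on C, sd, and n. Only the center \(\overline{x}\) changes from sample to sample.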

The following example constructs 25 confidence intervals with \(n = 15\) observations in each sample. The population has a standard deviation of \(sd = 50\), and we are constructing our interval as being plus or minus 0.5 standard errors (i.e., this is roughly a 38% confidence interval). For now, we won’t concern ourselves with the true percentage that our confidence interval should be; rather, we will focus on how changing the different variables either increases or decreases the number of times our confidence interval contains the population mean.

simulateConfInt(n = 15, C = 0.5, sd = 50)

Although it will not be necessary, I will provide this illustration again here for reference. We can think of the \(C\) term as being \(C = 0, 1, 2, 3\) below (i.e., \(\mu \pm C \times \sigma\)) to give context for where these values should lie.
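For a normal distribution, the share of values within \(C\) standard deviations of the center can be computed with base R's pnorm(). The helper name coverage() is used only for this example:

```r
## Proportion of a normal distribution within C standard deviations of its mean
coverage <- function(C) 2 * pnorm(C) - 1

C_values <- c(0, 0.5, 1, 2, 3)
round(coverage(C_values), 3)
```

For instance, \(C = 1\) gives about 0.68 and \(C = 2\) gives about 0.95.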

In a later lab, we will take the time to more directly associate percentiles with standard errors.