Run the code below. This will do three things, including creating the gh() function, a very convenient shortcut for creating histograms.
library(ggplot2)
library(dplyr)
theme_set(theme_bw())
college <- read.csv("https://collinn.github.io/data/college2019.csv")
## Better histograms
gh <- function(bins = 8) {
  geom_histogram(color = 'black', fill = 'gray80', bins = bins)
}
## Source functions
source("https://collinn.github.io/f24/labs/clt_lab_functions.R")
For an illustration of how gh() is used, see:
## Default way
p1 <- ggplot(college, aes(ACT_median)) +
  geom_histogram(color = 'black', fill = 'gray80', bins = 15)
## Superior awesome gh() way
p2 <- ggplot(college, aes(ACT_median)) + gh(bins = 15)
## They are identical
gridExtra::grid.arrange(p1, p2, nrow = 1)
sampleNormalData()
Here, we introduce the sampleNormalData() function. This function takes 3 arguments:
n: The number of observations we want in our sample
mu: The true population mean
sig: The true population standard deviation
This function works by drawing n observations from a normal distribution with mean mu and standard deviation sig.
It returns a data.frame in which each row contains a sample number and the mean of that sample. Importantly, this dataset makes up a distribution of \(\overline{x}\): it is what we would see if we could draw a sample repeatedly and record the value of \(\overline{x}\) each time. The example below also passes a samples argument (here 1000), which presumably sets how many samples are drawn.
df <- sampleNormalData(n = 15, mu = 100, sig = 15, samples = 1000)
## Example of first 10 rows
head(df, 10)
## xbar sample
## 1 99.952 1
## 2 104.601 2
## 3 99.632 3
## 4 98.150 4
## 5 102.565 5
## 6 97.524 6
## 7 101.808 7
## 8 104.354 8
## 9 96.213 9
## 10 100.061 10
## Plot distribution of xbar with gh()
ggplot(df, aes(xbar)) + gh(bins = 10)
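To make the idea concrete, here is a minimal base-R sketch of what a function like sampleNormalData() might do internally. The name sample_normal_sketch() and its body are assumptions for illustration only; the actual lab function lives in the sourced file and may differ. The sketch also checks that the spread of the sample means is close to the theoretical standard error \(\sigma / \sqrt{n}\).

```r
## Hypothetical stand-in for sampleNormalData(); not the lab's actual code
sample_normal_sketch <- function(n, mu, sig, samples = 1000) {
  ## Draw `samples` independent samples of size n and keep each sample mean
  xbar <- replicate(samples, mean(rnorm(n, mean = mu, sd = sig)))
  data.frame(xbar = xbar, sample = seq_len(samples))
}

set.seed(1)
df_sketch <- sample_normal_sketch(n = 15, mu = 100, sig = 15, samples = 1000)

## The sample means should center near mu, with spread near sig / sqrt(n)
mean(df_sketch$xbar)
sd(df_sketch$xbar)
15 / sqrt(15)  # theoretical standard error, about 3.87
```

Because every row is one sample mean, increasing n in this sketch should narrow the histogram of xbar, while increasing samples only makes its shape smoother.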
getSampleMean()
Here, we introduce the getSampleMean() function. This function takes 3 arguments:
data: The dataset we wish to sample
variable: The name of the variable we want to sample
n: The number of observations we want in each sample
This function works similarly to sampleNormalData() in that n tells us the sample size. However, instead of sampling from a normal distribution, we provide two other arguments: data tells us which dataset we wish to use, and variable tells us which variable we want to sample and create a sample mean for.
For example, the following code collects 1000 samples from the variable FourYearComp_Males, each with 20 observations. It then returns a data.frame where each row includes the sample number as well as the mean for that sample.
## Sample from the variable FourYearComp_Males
df_males <- getSampleMean(college, FourYearComp_Males, n = 20)
## Example of first 10 rows
head(df_males, 10)
## xbar sample
## 1 0.47904 1
## 2 0.48849 2
## 3 0.46714 3
## 4 0.45684 4
## 5 0.48785 5
## 6 0.48852 6
## 7 0.43937 7
## 8 0.49809 8
## 9 0.52052 9
## 10 0.50882 10
## Plot distribution of xbar with gh()
ggplot(df_males, aes(xbar)) + gh(bins = 10)
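The same idea can be sketched for sampling from an existing variable. The function get_sample_mean_sketch() below is hypothetical and is not the lab's getSampleMean(): it assumes the variable is passed as a string (the lab function accepts a bare name) and that observations are drawn without replacement, neither of which is confirmed by the lab materials. It uses the built-in mtcars data so it runs without downloading anything.

```r
## Hypothetical stand-in for getSampleMean(); not the lab's actual code.
## Assumptions: `variable` is a string; rows are drawn without replacement.
get_sample_mean_sketch <- function(data, variable, n, samples = 1000) {
  x <- data[[variable]]
  x <- x[!is.na(x)]  # drop missing values before sampling
  xbar <- replicate(samples, mean(sample(x, size = n)))
  data.frame(xbar = xbar, sample = seq_len(samples))
}

set.seed(1)
## mtcars ships with R, so no download is needed
df_mpg <- get_sample_mean_sketch(mtcars, "mpg", n = 20)
head(df_mpg, 3)

## The sample means cluster around the mean of the full variable
mean(df_mpg$xbar)
mean(mtcars$mpg)
```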
simulateConfInt()
Here, we introduce the simulateConfInt() function. This function takes 3 arguments:
n: Number of observations in the sample
m: Multiplier for the standard error (e.g., do we want our CI to be \(\pm\) 2 standard errors?)
sd: The population standard deviation
The function works by simulating the construction of 25 confidence intervals. The value n determines the number of observations in our sample. The value for m determines the range of our confidence interval, as in the equation
\[
\overline{x} \pm m \times s
\]
where \(s\) is the standard error (i.e., \(s = \sigma / \sqrt{n}\)). Finally, sd determines the standard deviation of the population (the \(\sigma\) in the standard error).
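As a worked instance of the formula \(\overline{x} \pm m \times s\), the short base-R sketch below computes a single interval by hand. All of the numbers, including the sample mean xbar, are made up for illustration and do not come from the lab data.

```r
## Illustrative values only; xbar is a made-up sample mean
n     <- 15     # observations in the sample
m     <- 2      # multiplier for the standard error
sigma <- 15     # population standard deviation (the sd argument)
xbar  <- 98.2   # hypothetical sample mean

se    <- sigma / sqrt(n)   # standard error, s = sigma / sqrt(n)
lower <- xbar - m * se
upper <- xbar + m * se
c(lower = lower, upper = upper)
```

Note that the width of the interval, 2 × m × s, depends only on m, sigma, and n; the sample mean only shifts where the interval is centered.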
The following example constructs 25 confidence intervals with \(n = 15\) observations in each sample. The population has a standard deviation of \(sd = 15\), and we construct each interval as plus or minus 1 standard error (i.e., this is roughly a 68% confidence interval). For now, we won’t concern ourselves with the true percentage that our confidence interval should be; rather, we will focus on how changing the different arguments either increases or decreases the number of times our confidence interval contains the true mean.
simulateConfInt(n = 15, m = 1, sd = 15)
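For readers who want to see the mechanics without the plot, the sketch below counts how often intervals of the form \(\overline{x} \pm m \times s\) contain the true mean. The function coverage_sketch() is hypothetical and is not how simulateConfInt() is implemented. It assumes a normal population with mean 0 and reports the proportion of intervals that cover that mean.

```r
## Hypothetical sketch of the idea behind simulateConfInt(); not the lab's code.
## Assumption: the population is normal with true mean mu = 0.
coverage_sketch <- function(n, m, sd, mu = 0, reps = 25) {
  se <- sd / sqrt(n)                                   # standard error
  xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sd)))
  covered <- (xbar - m * se <= mu) & (mu <= xbar + m * se)
  mean(covered)                                        # proportion containing mu
}

set.seed(1)
coverage_sketch(n = 15, m = 1, sd = 15)                # 25 intervals, as in the lab
coverage_sketch(n = 15, m = 1, sd = 15, reps = 10000)  # long-run proportion
coverage_sketch(n = 15, m = 2, sd = 15, reps = 10000)  # wider intervals cover more often
```

With only 25 intervals the proportion can vary noticeably from run to run, which is why the long-run versions use many more repetitions.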
Although it will not be necessary, I will provide this illustration again here for reference. We can think of the \(m\) term as being \(m = 0, 1, 2, 3\) below (i.e., \(\mu \pm m \times \sigma\)) to give context for where these values should lie.
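The proportions behind that picture can also be computed directly. Assuming a normal distribution, the probability of landing within \(m\) standard deviations of the mean is \(2\Phi(m) - 1\), where \(\Phi\) is the standard normal CDF (pnorm() in R). The snippet below is a small aside and not part of the lab functions.

```r
## Probability of falling within m standard deviations of the mean (normal case)
m <- c(0, 1, 2, 3)
coverage <- 2 * pnorm(m) - 1
round(data.frame(m = m, coverage = coverage), 4)
```

This gives roughly 0.68 for m = 1, 0.95 for m = 2, and 0.997 for m = 3, which is where the "roughly 68%" figure for a multiplier of 1 comes from.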