6 Sampling Distributions and the
Central Limit Theorem

“While nothing is more uncertain than a single life, nothing is more certain than the average duration of a thousand lives.” - Elizur Wright

“Everything that can be counted does not necessarily count; everything that counts cannot necessarily be counted.” - Albert Einstein

Learning objectives
  1. Learn to differentiate between statistics and parameters
  2. Understand the concept of sampling distributions
  3. Conceptual understanding of Central Limit Theorem and how it applies to sample means

6.1 Introduction to Sampling

Suppose policy makers and public health experts in Iowa are concerned with Iowans’ fast food intake. To understand this further, officials want to determine the average monthly spending on fast food for all Iowans. It would be very expensive, however, to have all 3.2 million Iowans track and report their total monthly fast food expenses. Whatsmore, many Iowans would not be willing to share personal financial details with public health researchers or government officials. These are but a few of the challenges making it practically impossible for officials to ever determine the true average spent by Iowans each month on fast food.

A true numeric quantity about the population, here assumed to be the mean, is known as a population parameter. While we may never know the exact value of a population parameter, we have tools at our disposal to estimate it. For example, officials can determine a representative sample of Iowans and have those consenting to share their information report their monthly fast food spending. The average spending of this group is the sample mean, which can then be used approximate the population mean. The sample mean is an example of a sample statistic, which is a numerical quantity about the sample. In other words, statistics are what investigators know, and parameters are what investigators want to know.

The statistical framework (pictured below) demonstrates the process of making inference about population parameters with the use of sample statistics, as was done in the example above. The numerical quantities of interest can be many different things – means, proportions, standard deviations, etc.,. While the methods of estimation are similar in each case, the present chapter will simplify our discussion with a focus specifically on estimations of the mean.

Statistical Framework

Figure 6.1: Statistical Framework

The statistical framework outlines the general process of making inference using sample statistics to estimate population parameters. In trying to make proper inference, we want to ensure our that our calculated statistic is a valid estimate of the population parameter of interest. To that end, there are two major statistical issues we concern ourselves with:

  • On average, does our estimate tend to be centered around the true answer, or is it biased?
  • How much variability is there likely to be in our sample?

The difference between our estimate and our parameter is called bias. Variance (or “noise”) is a description of the spread of our data. Sometimes we talk about spread in terms of precision, which is the reciprocal of variance. The more precision we have, the less variance and vice-versa. A good statistic will have little or no bias and would have minimal variability. We can visualize both bias and variability using a dart board analogy:

Bias and Variability Illustration

Figure 6.2: Bias and Variability Illustration

In terms of priority, having a low bias is generally more desirable than low variance, and this is represented in the top row above. Having low bias indicates that our sample statistic is correctly estimating the population parameter of interest, even if our statistic is considered “noisy”. Variance, illustrated in the columns, gives an idea of how “consistent” our estimate is. One way to think of it is to consider the throwing of a dart to the collection of a sample; comparing the top right and bottom left corners, we might say it is better throw generally close to the target than to consistently miss it entirely. Statisticians work hard to develop sampling and inferential procedures that allow us to estimate population parameters with low variability and little to no bias.


Definition 6.1

 

Population Parameter: The true numeric quantity describing a population (i.e., the population mean)

Sample Statistic: A numerical quantity about the sample, often used to estimate a population parameter

Inference: The process of making generalizations about the population based on sampling statistics

Bias: The systematic difference between our estimate and the true parameter

Variability: The spread or noise in the data

Precision: Inverse of variance, or how “narrow” the data is

6.2 Sampling Distributions

In order to make inference, we must be able to quantify our uncertainty about the population parameter based on our sample data. To illustrate this uncertainty, consider again our example above in estimating the average expenditures of Iowans on fast food, and suppose that researchers decide to sample 500 individuals from the population and calculate the sample mean, denoted \(\overline{x}\). Now suppose that those same researchers go out and collect a second sample of 500 individuals from the population and calculate the sample mean again, which me might call \(\overline{x}_2\). Should we expect \(\overline{x}\) and \(\overline{x}_2\) to be exactly the same? What if they did this again and calculated \(\overline{x}_3\)? Or \(\overline{x}_4\)?

As we might expect, it is very unlikely that any of these will be exactly equal to any of the others, though they might be close. But if none of these sample means are equal to each other, how accurate might we expect any of them to be with regards to estimating our population parameter? Herein lies the question of quantifying our uncertainty. Fortunately, statisticians have created powerful methods for being able to do so accurately, even when only a single sample statistic has been obtained.

We might first begin by recognizing that the process of sampling from a population is itself a random experiment, with the sample mean \(\overline{x}\) being a numeric quantity which assumes a value based on the outcome of that experiment. This should sound familiar – this is precisely the definition of a random variable! Just as with other random variables, sample statistics themselves also follow a probability distribution. When the random variable itself is a statistic, we call the probability distribution a sampling distribution. Sampling distributions reflect which values of the statistic are likely if we were to repeat the sampling process (that is, likely values for \(\overline{x}_2\), \(\overline{x}_3\), etc.,), and which values are improbable. We might then expect that the value of the true population parameter will fall somewhere within the range of likely values.

Exercise 6.1

 

The applet below is designed to help you get familiar with the concept of random sampling. The plot on the left side gives a population distribution consisting of 25 people, where each subject in the population is represented by one box. The x-axis represents the value of the outcome variable for each individual, with stacked boxes indicating that multiple people have the same value. For example, we have one subject in the population with a value of zero, three subjects with values of one, and so on.

The plot on the right displays values for the subjects collected in our random sample, with the same subjects being highlighted in green on the left. We can take samples from our population of various sizes by changing the value of the slider at the top and clicking on the “Sample” button.

  1. Set the sample size to 2 and click the “Sample” button five times. Fill out the following table with the sample means observed in your five samples.
    Sample Number Sample mean
    1
    2
    3
    4
    5
    1. Did you get the same sample mean for any of your five samples?
    2. Was the sample mean ever equal to the population mean in any of your five samples? If not, how close were your sample means to the true population mean?
  2. Now set the sample size to 15 and click the “Sample” button five times. Fill out the following table with the sample means observed in your five samples.
    Sample Number Sample mean
    1
    2
    3
    4
    5
    1. Did any of your five simulations give you a sample mean equal to the population mean?
    2. How close were your sample means to the true population mean?
  3. Compare your results from problems 1 and 2.
    1. In either case were you able to obtain a sample mean equal to the population mean?
    2. Which one had sample means with more variability? What does this indicate about the relationship between the sample size and the distribution of sample means in the context of repeated sampling?

Following the conventions used previously, we will use upper-case letters to denote random variables and lower-case letters to denote values that have been observed (that is, they are no longer random). Regarding the sampling distribution of the mean, \(\overline{X}\) represents the random variable which arises from the repeated random sampling from the population, while \(\overline{x}\) represents an observed sample mean from an already collected sample.

In describing the sampling distribution, we are often interested in the mean and the standard deviation which, together, describe the size of the interval of probable values for our population parameter. When describing statistics with a sampling distribution, we refer to the sample mean as the expected value of our statistic, while the standard deviation is referred to as the standard error. For the random variable describing the sample mean, these are denoted as \(E(\overline{X})\) and \(SE(\overline{X}) = \sqrt{Var(\overline{X})}\), respectively.

Let’s return to the fast food scenario to motivate our development, but for now let’s assume there are only five individuals in the entire population. The monthly spending for those five individuals (rounded to the nearest dollar) is as follows:

Person Monthly Spending
A 8
B 22
C 22
D 36
E 50

The population mean can be found by taking the average monthly spending of these five individuals, as they are the only people in this population. This means the population mean is 27.6. Now suppose we are taking samples of three individuals from this population. There are \(\binom{5}{3}\) = 10 ways we can sample three people from this population. Since this is a small population, we can enumerate all possible samples, the values we would obtain for the monthly spending in each sample, and then the sample mean monthly fast food spending for each sample:

Sample Values \(\overline{x}\)
(A, B, C) (8, 22, 22) 17.33
(A, B, D), (A, C, D) (8, 22, 36), (8, 22, 36) 22.00
(A, B, E), (A, C, E), (B, C, D) (8, 22, 50), (8, 22, 50), (22, 22, 36) 26.67
(A, D, E), (B, C, E) (8, 36, 50), (22, 22, 50) 31.33
(B, D, E), (C, D, E) (22, 36, 50), (22, 36, 50) 36.00

One thing to note is that none of the sample means are actually equal to the population mean. This is often the case! Based on this table, we can construct the sampling distribution of \(\overline{X}\) for a sample of size three.

Why is our sample mean often not equal to the population mean?

Probability
P(\(\overline{x}\) = 17.33) 0.1
P(\(\overline{x}\) = 22) 0.2
P(\(\overline{x}\) = 26.67) 0.3
P(\(\overline{x}\) = 31.33) 0.2
P(\(\overline{x}\) = 36) 0.2

With this probability distribution, we can get the expected value of the sampling distribution. Whenever we think about probability, we are thinking about long-run frequencies, so when we think about the expected value, we are thinking about the average sample mean if we were to take repeated samples of size three from this population. The probability distribution indicates that if we took samples of size three from this population over and over again (say 1,000 times), we would end up with 10% with a mean of 17.33, 20% with a mean of 22, 30% with a mean of 26.67; and so on. So the expected value we would observe if we continued collecting samples over and over would be:

(0.1 \(\times\) 17.33) + (0.2 \(\times\) 22) + (0.3 \(\times\) 26.67) + (0.2 \(\times\) 31.33) + (0.2 \(\times\) 36) = 27.6

But wait! That is the population mean. This illustrates an extremely powerful property of the sample mean – the expected value of the sample mean \(\overline{X}\) is equal to the population parameter \(\mu\)! Because of this, we say that the sample mean is an unbiased estimator of the population mean. While we illustrated the property with a small example, this holds regardless of the population or sample size.


Definition 6.2

 

Sampling Distribution: A probability distribution associated with a sample statistic

Expected Value: The mean value of a statistic in repeated sampling

Standard Error: The standard deviation of a statistic (including the mean) in repeated sampling

Unbiased Estimator: The statistic with an expected value equal to the true population parameter

6.3 Central Limit Theorem

In the last section, we used a toy example to conceptualize repeated sampling and the property of the sample mean being an unbiased estimator. Now we will formalize these properties. For any population distribution which can be described by a random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\), the sample distribution of the mean based on sample of size \(n\), denoted \(\overline{X}_n\), has the following properties:

  1. The expected value of the sampling distribution is \(\mu\) \[E(\overline{X}_n) = \mu\]
  2. The variance of the sampling distribution is \[Var(\overline{X}_n) = \frac{\sigma^2}{n}\]

Together, these expressions are used to describe the typical value of the sample mean as well as the expected amount of variability in the sample mean under repeated sampling. From this, there are two further observations that are worth noting:

  1. The expected value of the sample mean is always equal to \(\mu\), regardless of the size of the sample: this property makes the sample mean an unbiased estimator of the population mean.

  2. Whereas the expected value of the sample mean does not depend on \(n\), the variance of the sampling distribution does. Indeed, this should match our intuition; as the size of our sample increases, the variability of \(\overline{X}_n\) decreases, giving us a more precise estimate of the population mean.

As we’ve seen already, it is more common to discuss the standard deviation of a statistic than the variance. In the context of sampling distributions, this quantity is called the standard error (to emphasize that we are discussing a sampling distribution) and is denoted

\[SE(\overline{X}_n) = \sqrt{Var(\overline{X}_n)} = \sigma / \sqrt{n}\]

Note that while all standard errors could be referred to as standard deviations, the converse does not hold – standard deviations not associated with a sampling distribution are never referred to as standard errors.

Until now, we have discussed the mean and standard error of a random statistic without having described the sampling distribution itself. Why might having the distribution of a statistic be useful? Recall from the previous chapter that we defined a distribution as a method of associating an event or outcome with its corresponding probability. As we shall see, the sample mean, along with its standard error, will help us in constructing a range of “likely” values that may contain the true parameter. With the use of a sampling distribution, we will now be able to assign a probability to the range of these likely values.

As it turns out, the process of randomly collecting a sample from a population and computing its sample means has a truly amazing property – the sampling distribution will always be normally distributed. This brings us to what is largely considered the “fundamental theorem of statistics”.

Central Limit Theorem

What the Central Limit Theorem (CLT) states is this: so long as the size of our sample is “large enough”, the sampling distribution of \(\overline{X}\) will be approximately normal, regardless of the characteristics of the underlying distribution. That is, we have

\[\overline{X}_n \sim N(\mu, \sigma^2/n).\]

While the term “large enough” will inevitably depend on the context of the problem at hand, a typical rule of thumb puts this value at \(n \geq 30\). Because the variance of \(\overline{X}\) is inversely related to the sample size, as the \(n\) increases, the variance decreases and the distribution becomes more concentrated around its mean value. This is one of the most important, remarkable, and powerful results in all of statistics. In the real world, we rarely know the underlying population distribution of our data, yet the CLT assures us that this is not an issue. In the next chapter, we will investigate how the consequences of the CLT create an immensely powerful tool in performing statistical inference.

Exercise 6.2

 

The applet below is designed to help your conceptual understanding of the CLT. You can change the underlying population distribution, the size of the samples taken (\(n\)), and the number of experiments performed. Each experiment consists of taking a sample of the specified size from the population and calculating the sample mean. The top left panel (blue) shows the underlying population distribution. The top right panel (red) shows the data from the most recent sample, with the sample mean from that specific sample indicated by the dashed black line. The bottom panel (gold) shows us the distribution of sample means from all the experiments, with the observed mean of the sample means indicated by the dashed black line and the true underlying population mean indicated by the solid blue line. Once you select the parameters to the values you want, click “Run Simulation” to tabulate the results.

  1. Set the population distribution to Normal, the sample size to 30 and perform 100 experiments to collect 100 sample means.
    1. Describe the histogram of the the data from the last experiment. Comment on the modality and skew.
    2. Describe the histogram of the distribution of sample means from all experiments.
    3. What is the range of values observed for the distribution of sample means? How does this compare the the range of values observed in the data from the last experiment?
  2. Now adjust the sample size to 100. Perform 100 experiments.
    1. What is the range of values observed for the distribution of sample means? How does this compare to the range observed when the sample size was 30?
    2. What about the range of values observed in the data from the last experiment – did this change much when the sample size was increased?
    3. Play around with various sample sizes. How does changing the sample size effect the spread of the distribution of sample means?
  3. Return to a sample size of 30, but change the population distribution to be right skewed. Perform 1,000 experiments.
    1. Describe the histogram of the the data from the last experiment. Comment on the modality and skew.
    2. Describe the histogram of the distribution of sample means from all experiments. Does it resemble the population distribution?
    3. Re-run the simulation using a sample size of 10. How does this effect the distribution of the sample means? How does this compare to what you observed when taking samples of size 10 with a normal population distribution?
  4. Change the population distribution to be left skewed, set the sample size to 80, and perform 1,000 experiments.
    1. Describe the histogram of the the data from the last experiment. Comment on the modality and skew.
    2. Describe the histogram of the distribution of sample means from all experiments. Does it resemble the population distribution?
    3. Run the simulation with these parameters a few times and pay attention to how the mean of the distribution of sample means compares to the population mean. Is the population mean similar or different from the mean of the sample means? What property does this illustrate?
  5. Change the parameters of the simulations as needed to answer the following true/false questions. Explain your answers.
    1. CLT tells us that the distribution of data from any experiment will be normally distributed.
    2. With a larger sample size, the data from the last experiment will resemble the population distribution.
    3. Performing 10 experiments is enough to see the effect of CLT.
    4. Regardless of the underlying population distribution and the sample size, the distribution of the sample means will be normally distributed.
    5. Increasing the sample size causes the data from a single experiment to look more and more normally distributed.

Definition 6.3

 

Central Limit Theorem: A mathematical theorem that provides us with the sampling distribution of the mean