Brett Melbourne 2024-09-17
Note: The .md version of this document is best for viewing on GitHub.
See the .Rmd version for the latex equation markup. The .Rmd version
does not display the plots in GitHub and is best for viewing within
RStudio.
The sampling distribution is a frequentist conceptualization of our big idea in data science of counting all the ways the data could have happened. It is the key to understanding frequentist inference. The sampling distribution is the distribution of a sample statistic from repeatedly sampling the population. There are two critical features of this frequentist conceptualization:
-
There is a real entity, which is called the population. Precisely because the population is reality, the characteristics of the population are not random variables. They are said to be “fixed”. An example of a non-random characteristic of a population might be its mean or its variance or, more completely, its distribution as represented by a histogram that includes every single individual. A summary characteristic of a population is called a population statistic, or parameter. Examples include the population mean or variance, the slope,
$\beta_1$ , of$y$ on$x$ , or the t-statistic, or F-statistic. Be careful here to realize we are talking about the true value of these statistics for the population not a sample. -
We imagine all the ways we could have sampled from the population, each time calculating a sample statistic (e.g. mean, variance, slope, t-statistic). The sample statistic is a random variable and it will be different from the population statistic (which is not random because there is only one true value). The idea is that if we were to repeat the sampling in all the possible ways that a sample could be drawn from the population, there would be a frequency distribution for the sample statistic, called the sampling distribution. This distribution is real in the sense that since the population is real there exists an exact sampling distribution that corresponds to the real population. However, we don’t usually do the sampling more than once, so this repeated sampling process is imaginary. In hand (i.e. our dataset), we have only the one sample and the one sample statistic.
Thus, much of frequentist inference is based on the following underlying algorithm of an imaginary sampling process:
for each possible combination of n sample units
sample n units from the population
calculate the sample statistic
plot sampling distribution (histogram) of the sample statistic
Due to the law of large numbers, which states that the average of a large number of trials will converge on the expected value, the following algorithm is equivalent:
repeat very many times
sample n units from the population
calculate the sample statistic
plot sampling distribution (histogram) of the sample statistic
You can alternatively think of this latter variant of the sampling algorithm as corresponding to the idea that the sampling distribution is the expected distribution under repeated random sampling.
In any scientific situation, we can never know reality exactly, we can
only infer it or hope to get closer. However, in a simulation, we can
know reality because we program it in, so simulation is very useful for
gaining better intuition about the concept of the sampling distribution.
So let’s explore the sampling distribution by simulation. We will
consider two cases, estimating the prevalence of a pathogen in a
population (below), and an hypothesis test for the slope in simple
linear regression (see the next installment: lm_ssq_inference.md).
Let’s say we have a population of 132 orange-spotted warblers (a fictional endangered species; 132 individuals are all that are left) and we are interested in the prevalence of a certain pathogen because this pathogen will put the population at a greater risk of extinction. In reality (the reality inside our simulation), the population is as follows (individuals infected with the pathogen are coded as 1, while pathogen free individuals are coded as 0) listed by individual i = 1 … 132:
pathogen <- c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,0,1,1,0,1,1,0,0,0,1,1,0,0,1,1,0,1,0,0,0,0,
1,1,1,0,1,1,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,1,
0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,1,1,0,1,1,0,0,0,0,0,0,0,0,1,0,1,1,1,0,
1,0,1,0,0,0,0,0,0,1,1,0,0,0,1,1,1,1,0,0,1)This population has several characteristics. One characteristic is the population size
n_pop <- length(pathogen)
n_pop## [1] 132
This is a population statistic. The prevalence of the pathogen is another characteristic of this population - it is also a population statistic. The pathogen prevalence of the population is
pathogen_p_pop <- sum(pathogen) / n_pop #equivalently mean(pathogen)
pathogen_p_pop## [1] 0.4242424
Let’s give this true population statistic the symbol
hist(pathogen, breaks=seq(-0.5,1.5,1), xlab="Pathogen status (not infected=0, infected=1)",
main=NA, col="#56B4E9", xaxt="none", freq=FALSE)
axis(1, at=0:1)Now, let’s assume we are in the typical scientific situation of incomplete knowledge. For example, we do not have data on the infection status of each bird and we can’t check the infection status of a bird unless we catch it and take a blood draw. As a precaution, we only want to handle a small number of birds to obtain blood in case handling them will put them at greater risk. We’ve decided that we can risk catching 10 birds to determine their infection status. We go ahead and randomly sample 10 birds from the population:
set.seed(470) #so this example is reproducible
data <- sample(pathogen, 10)
data## [1] 0 0 0 0 0 0 1 0 0 1
pathogen_p_sample <- mean(data)
pathogen_p_sample## [1] 0.2
We find we have 2 infected birds among the 10 birds we sampled, so the
pathogen prevalence of the sample is 0.2. This is our sample statistic.
We’ll take this sample statistic to be an estimate of the population
statistic and because it’s an estimate we’ll give it the same symbol as
the population but with a little hat over it,
Frequentists assess the reliability of this estimate for the pathogen prevalence by invoking the concept of the sampling distribution. What is the sampling distribution in this case? The sampling distribution is the distribution of the pathogen prevalence built from all the possible ways we could have drawn a sample of 10 birds from the population of 132 birds. Remember that we don’t actually take these samples in our scientific study, we imagine taking them. For example, if we number the individuals in the population from 1 … 132, in one imaginary sample we might have birds 12, 13, 15, 23, 40, 45, 60, 92, 94, 114, whereas in another imaginary sample we might have birds 2, 24, 41, 51, 59, 63, 93, 94, 111, 127. How many possible ways are there to take a sample of 10 birds from a population of 132 birds?
choose(132, 10)## [1] 3.120587e+14
That’s a very large number and it would take us a very long time to try all the samples. Using R on a typical laptop it takes about 0.01 milliseconds to try one sample, so it would take about 100 years. If we had access to a large supercomputer, say one capable of processing 10,000 samples simultaneously, we could do it in about 4 days. Instead, we will take a very large number of random samples, say 100,000, which will only take a second. The more random samples we take, the closer we will come to the distribution of the actual combinations. In practice, because of the law of large numbers, 100,000 random samples of 10 birds will be extremely close to the distribution for all the possible combinations of birds. Indeed, we don’t really need to consider all the possible sample combinations. If, in our study, we take a random sample from our population, then that random sampling procedure corresponds to the idea that the sampling distribution consists of the expected distribution under repeated random sampling. Here is the algorithm for the sampling process for our example of the bird population experiencing a pathogen:
reps <- 100000
pathogen_p_imagine_samples <- rep(NA, reps) #to hold prevalence from imagined samples
for ( i in 1:reps ) {
imagine_sample <- sample(pathogen, 10) #randomly sample 10 birds from true population
pathogen_p_imagine_samples[i] <- mean(imagine_sample) #keep prevalence from sample
}What does this sampling distribution for the pathogen prevalence look like?
hist(pathogen_p_imagine_samples, breaks=seq(-0.05, 1.05, 0.1), col="#56B4E9", freq=FALSE,
xlab="Pathogen prevalence", main=NA, ylim=c(0,2.7))
abline(v=pathogen_p_pop, col="#E69F00", lwd=2)
abline(v=pathogen_p_sample, lwd=2, lty=2)
text(pathogen_p_pop, 2.68, expression(italic(theta)), pos=4)
text(pathogen_p_sample, 2.68, expression(hat(italic(theta))), pos=2)Remember this histogram is the “true” sampling distribution for the
pathogen prevalence given a sample of 10 individuals from the
population. That’s because we sampled from the true population, which we
know exactly because the truth is coded into our simulation. Because we
are drawing samples of 10, the possible prevalences are from 0 to 1 in
steps of 0.1. The true prevalence,
mean(pathogen_p_imagine_samples)## [1] 0.424042
One thing that sometimes confuses people is that we have encountered two different distributions: the distribution of infected and uninfected individuals in the population and the sampling distribution of the prevalence. Notice that the sampling distribution (the second histogram) is completely different to the distribution of infected and uninfected individuals in the population (the first histogram). To obtain the sampling distribution for the prevalence, we have sampled from the distribution of infected and uninfected individuals in the population.
How can we use this concept of the sampling distribution to judge the
reliability of an estimate? Let’s consider our one sample of 10
individual birds that we actually took. The pathogen prevalence for this
sample happens to lie at 0.2 in this sampling distribution (the dashed
black line
A 95% confidence interval is defined as:
An interval calculated by some procedure that would contain (or cover) the true population value 95% of the time, if sampling and calculating an interval were repeated a very large number of times.
This is often abbreviated to something like “there is a 0.95 probability that the interval contains the true value”, which is true only if we keep in mind the precise definition of probability as an imagined repeated procedure. It is common to reword it subtly as “there is a 0.95 probability that the true value is in the interval”, which seems the same as the previous abbreviated definition to most people (including me) since if “the interval contains the true value” then “the true value is in the interval”. However, it can be argued that once an interval is calculated, since the true value is fixed, the true value is either in the interval or it is not, so the probabilities of being in the interval are respectively 1 when it is in the interval and 0 when it is outside the interval (Pratt 1961). A lot of arguments ensue around these semantics. Nevertheless, the important point is that a confidence interval is about the probability associated with sampling and calculating an interval (i.e. the procedure) and not directly about the probability for the value of the statistic or parameter (the pathogen prevalence in this case).
In our example for the pathogen prevalence, how could we construct an interval with these properties? One approach would recognize that the sampling distribution is well-approximated by a Normal distribution. This is because the pathogen prevalence is the result of summing up the number of infected individuals, so the sampling distribution of the prevalence fairly quickly approaches a Normal distribution as the sample size (10 in this case) increases due to the central limit theorem (see Big Ideas in Data Science). It’s not a perfect match (notably, the possible values of the pathogen prevalence are discrete) but it’s a very good approximation. Here they are side by side and overlaid:
par(mfrow=c(1,3))
hist(pathogen_p_imagine_samples, breaks=seq(-0.45, 1.35, 0.1), col="#56B4E9", freq=FALSE,
xlab="Pathogen prevalence", ylim=c(0,2.7), main="True sampling distribution")
sd_samp <- sd(pathogen_p_imagine_samples) #standard deviation of the true sampling distribution
pathogen_p_normal_samples <- rnorm(1e6, pathogen_p_pop, sd_samp)
hist(pathogen_p_normal_samples, breaks=seq(-0.45,1.35,0.05), col="#56B4E9", freq=FALSE,
xlab="Pathogen prevalence", ylim=c(0,2.7), main="Approximating Normal")
hist(pathogen_p_imagine_samples, breaks=seq(-0.45,1.35,0.1), col="#56B4E9", freq=FALSE,
xlab="Pathogen prevalence", ylim=c(0,2.7), main="Normal overlaid on true")
lines(seq(-0.45, 1.35, 0.01), dnorm(seq(-0.45, 1.35, 0.01), mean=pathogen_p_pop,
sd=sd_samp), col="#E69F00")Here, the parameters of the approximating Normal distribution are the mean and standard deviation of the true sampling distribution. One issue is apparent: the Normal distribution allows the pathogen prevalence to be outside the bounds of 0 and 1, which is not actually possible, but we’ll live with that.
To form a confidence interval we can ask “How far could the true
pathogen prevalence
par(mfrow=c(1,1))
hist(pathogen_p_normal_samples, breaks=seq(-0.45, 1.35, 0.05), col="#56B4E9", freq=FALSE,
xlab="Pathogen prevalence", ylim=c(0,2.9), main=NA)
abline(v=pathogen_p_pop, col="#E69F00", lwd=4)
text(pathogen_p_pop,2.9, "True", pos=4)
text(pathogen_p_pop,2.7, "prevalence", pos=4)
abline(v=pathogen_p_pop+2*sd_samp, col="#E69F00", lty=2)
abline(v=pathogen_p_pop-2*sd_samp, col="#E69F00", lty=2)
#upper unusual value
points(pathogen_p_pop+2*sd_samp, 1.5, col="#D55E00", pch=16)
arrows(pathogen_p_pop, 1.5, pathogen_p_pop+2*sd_samp-0.01, 1.5, code=1, angle=90,
length=0.12, lwd=2)
text(pathogen_p_pop+sd_samp, 1.5, expression(2*sigma), pos=3)
text(pathogen_p_pop+2*sd_samp, 1.57, "Unlikely high", pos=4)
text(pathogen_p_pop+2*sd_samp, 1.43, "sample prevalence", pos=4)
#lower unusual value
points(pathogen_p_pop-2*sd_samp, 1.9, col="#D55E00", pch=16)
arrows(pathogen_p_pop, 1.9, pathogen_p_pop-2*sd_samp+0.01, 1.9, code=1, angle=90,
length=0.12, lwd=2)
text(pathogen_p_pop-sd_samp, 1.9, expression(2*sigma), pos=3)
text(pathogen_p_pop-2*sd_samp-0.45, 1.97, "Unlikely low", pos=4)
text(pathogen_p_pop-2*sd_samp-0.45, 1.83, "sample prevalence", pos=4)Now, in our scientific scenario we don’t know the true pathogen
prevalence, so we don’t know where the prevalence of our one sample lies
in relation to the true prevalence. What if we happened to draw a very
unlikely sample with a high prevalence way out in the upper tail of the
sampling distribution? Anything beyond
This is all logical. But there is a problem. When we know the true
sampling distribution we know its standard deviation
- The sampling distribution may be far from Normal
- The estimated
$\sigma$ might be wrong
The sampling distribution is more likely to be Normal enough and
Coverage is an important frequentist concept. How often do confidence intervals calculated from a procedure cover/contain the true value? So, let’s measure the coverage for the confidence interval procedure we just described above for the pathogen prevalence. We’ll randomly sample and calculate an interval repeatedly. How often does the calculated interval cover the true value? Here is the general algorithm to examine coverage:
repeat very many times
sample n units from the population
calculate the sample statistic
calculate the interval for the sample statistic
calculate the frequency with which the true value is in the interval
You can see this algorithm is similar to the sampling distribution algorithm but now includes calculating an interval as well. Here is the coverage algorithm in R for our specific example of pathogen prevalence in our bird population:
reps <- 100000
interval <- matrix(NA, nrow=reps, ncol=2)
for ( i in 1:reps ) {
imagine_sample <- sample(pathogen, 10) #randomly sample 10 birds from true population
pathogen_p_sample <- mean(imagine_sample)
se_sample <- sd(imagine_sample) / sqrt(10)
interval[i,] <- c(pathogen_p_sample - 2*se_sample, pathogen_p_sample + 2*se_sample)
}
in_interval <- pathogen_p_pop > interval[,1] & pathogen_p_pop < interval[,2]
sum(in_interval) / reps## [1] 0.95616
which is pretty close to the desired coverage of 0.95 and is indeed a bit conservative since it is greater than 0.95 (i.e. our procedure produces a 95.6% confidence interval). Let’s see the first 100 of these intervals compared to the true pathogen prevalence (in blue):
nd <- 100 #number of intervals to draw
plot(NA, NA, ylim=c(-0.1,1.1), xlim=c(1,nd), type="n",
ylab="Pathogen prevalence", xlab="", axes=FALSE)
abline(h=pathogen_p_pop, col="#56B4E9")
intcol <- ifelse(in_interval, "gray", "#E69F00")
segments(1:nd, interval[1:nd,1], 1:nd, interval[1:nd,2], col=intcol)
axis(2)
box()Several things are apparent.
- The location of the intervals varies a lot. Confidence intervals are random variables.
- The intervals are of different widths.
- As expected, about 5 out of 100 (here 6) of the intervals (orange) don’t cover the true value.
- The true value is often not near the center of the interval.
- Many of the intervals include prevalences that are impossible, either negative, or greater than one. This is a consequence of substituting the Normal distribution for the true sampling distribution.
- We have seen how the core concept of frequentist inference is the sampling distribution of a statistic arising from an imagined repeated sampling from the population. The algorithm for this imagined sampling process is:
repeat very many times
sample n units from the population
calculate the sample statistic
plot sampling distribution (histogram) of the sample statistic
- A confidence interval is:
An interval calculated by some procedure that would contain (or cover) the true population value 95% of the time, if sampling and calculating an interval were repeated a very large number of times.
- Confidence intervals quantify the reliability of the procedure and can be constructed from the sampling distribution.
- Since we cannot know the true sampling distribution, we make an assumption for what the true sampling distribution is. Furthermore, we don’t know the values of the sampling distribution’s parameters, so we plug in estimates of those from the sample.
- The coverage of an interval procedure calibrates the degree of confidence in the procedure.
Pratt JW (1961) Review of “Testing Statistical Hypotheses. E. L. Lehmann. New York: John Wiley and Sons, Inc., 1959. Pp. xiii, 369. $11.00.” Journal of the American Statistical Association 56, 163-167.




