Suppose we draw independent observations from a population with mean μ and standard deviation σ – i.e. we take a sample from the population. Before we draw the sample, each observation is a random variable – i.e. a mapping from the probability space to a number on the real line representing a unit of measurement – since we aren’t sure what specific value it will take on. The sample is simply a collection of multiple random variables and is therefore itself random. Note that while sample values are independent from one another, they are all drawn from a common distribution. Sometimes we know what that distribution is beforehand, and often we don’t.

We use samples to compute **statistics**, which are functions of random variables that approximate parameters of the distribution in question. A statistic like the sample mean, which estimates the true mean (or average) of the distribution, is computed from a draw of just n observations – a subset of the population. Clearly, the sample mean is different from the mean: the sample mean is computed from n random observations, whereas the mean is computed from every observation in the distribution. Thus we call the mean a **parameter**, meaning it’s a characteristic of the *entire* population, and we call the sample mean a statistic, to indicate that its computation relies on a finite number of observations from the population.

A statistic, specifically the sample mean, is a function of random variables and therefore is a random variable itself, endowed with all of the requisite characteristics – measures of location like the mean and median, measures of dispersion like the variance, a probability distribution, and so forth. We call the probability distribution of a statistic the **sampling distribution** of that statistic. The probability distribution of the sample mean is therefore called the *sampling distribution of the sample mean*.

There are 50 students in a histology class in the first year of medical school. On the first exam, the mean score was 72 and the standard deviation was 8.5 points. However, only the teacher knows this. Suppose you’re a student in the class. From your perspective, the true mean μ is an unknown parameter. However, you know the scores of three classmates because you shamelessly asked them what they scored, which gives you three observations (test scores) from the distribution of histology test scores. You have a random sample of size 3.

You can compute the average of this sample and get the sample mean, which estimates the true mean of all histology test scores in your class. Using R, I computed a sample mean of 67.48 from a random sample of observations consisting of 74.3, 56.6, and 71.6. In the bar plot below, each bar is a sample value (one draw from the distribution when I randomly sampled 3 scores), the blue line is the mean of these observations (i.e. the sample mean), and the red dotted line is the true mean. Clearly, the means don’t match up. However, they’re pretty close.
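That arithmetic is easy to check in R. Using the rounded values quoted above (so the result differs slightly from the 67.48 computed from the unrounded draws):

```r
# Three classmates' scores, rounded as reported above
scores <- c(74.3, 56.6, 71.6)

# The sample mean is the arithmetic average of the n = 3 observations
mean(scores)  # 67.5
```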

If you took another sample, i.e. asked 3 different students how they scored, you wouldn’t get the same result. Sampling again from the same distribution, I get values of roughly 89, 66, and 76, whose mean is 76.8 – almost 10 points higher than my last sample:

You can see that we came up with totally different samples in the previous two examples even though we sampled the same way and from the same distribution. There are as many possible samples as there are ways to choose a subset of 3 from a population of 50 – the binomial coefficient ’50 choose 3’, or 19,600. So, there are almost 20,000 ways to draw a sample of 3 from the distribution of histology test scores by asking 3 classmates their scores. Hence we shouldn’t expect to get the same sample often.
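R’s `choose()` confirms the count:

```r
# Number of distinct samples of size 3 from a class of 50
choose(50, 3)  # 19600
```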

The sampling distribution of the sample mean is the distribution of the sample means of all possible samples of size 3 from the test score distribution – i.e. the 19,600 cases I just mentioned. We can think of each sample mean we compute from a sample of size 3 – obtained by asking classmates their scores – as a *random variable* from this distribution. While we don’t know what exactly we will compute before sampling occurs, we do know some statistical properties of the sampling distribution of the sample mean:

- The mean or expected value of the sampling distribution of the sample mean is μ, the true mean
- The standard deviation of the sampling distribution of the sample mean is σ/√n, where σ is the standard deviation of the distribution from which we’re taking a sample of size n
- If a random variable X is normal, then so is the sampling distribution of its sample mean
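These properties are easy to check by simulation. The sketch below (my own, not the author’s code) draws 100,000 samples of size 3 from a normal distribution with the class’s parameters and summarizes the resulting sample means:

```r
set.seed(1)
mu    <- 72    # true mean (known only to the teacher)
sigma <- 8.5   # true standard deviation
n     <- 3     # sample size

# Draw many samples of size n and record each sample mean
sample.means <- replicate(100000, mean(rnorm(n, mu, sigma)))

mean(sample.means)  # close to mu = 72
sd(sample.means)    # close to sigma / sqrt(3), about 4.91
```

Since the underlying distribution is normal here, a histogram of `sample.means` is also bell-shaped, illustrating the third bullet.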

Below is a chart illustrating the points above. Given a distribution of some variable X, we first take a sample of size n from the distribution. With these n randomly drawn sample values, we compute the sample mean (S) as the arithmetic average of the sample. We call the probability distribution of all sample means calculated based on all possible samples of size n from the distribution of X the Sampling Distribution of S, which is on the far right. The sampling distribution is similar to the distribution of X in that they’re peaked around the same average value, but X represents the value of an observation, whereas S represents the mean of n observations.

*I know the stars are corny; the point is that only a few of the items in the population are selected for the sample. Perhaps we’re interested in the average mass of a star in the distribution, given their varying sizes. With sampling, we can account for the variation in size by sampling appropriately, which will give us the tools to make probabilistic statements about the average mass of a star even though we can’t see the whole distribution in practice.*

To reiterate, we are no longer talking about the distribution of a random variable X, or, in our example, histology test scores. We are talking about the distribution of the sample means of random samples of size 3 from the distribution of histology test scores. On average, we expect the sample of size 3 to produce a sample mean equal to the population mean, with a typical deviation of σ/√3, i.e. the standard deviation of the actual test scores divided by the square root of 3. Clearly, the spread of the distribution of sample means from samples of size 3 is smaller, as measured by the standard deviation, than that of the distribution of test scores themselves. Intuitively, this makes sense – one could draw an unlikely value from the distribution of test scores, but the effect of this value is muted when it’s averaged with two others. By the same reasoning, the standard deviation of the sampling distribution of the sample mean decreases as the sample size n increases.

To illustrate, below I simulated histology test scores based on the mean and standard deviation info:

The distribution is bell-shaped with a mean of 72. What about the distribution of the results you obtain when asking three classmates their scores and taking the average? What is the spread and central tendency? Below is the distribution of the mean score in samples of size 3 from the class, also called the sampling distribution of the sample mean:

They share the same expected value, but the sampling distribution is much more tightly centered on the mean. This is because there is less dispersion in the distribution of sample means than there is in the distribution of test scores themselves. We can corroborate this by computing the standard deviation of the sampling distribution, which is just over 4. Compare this to the standard deviation of the simulated test scores themselves, which was 7.5 (the sample estimate from my simulation; the true σ is 8.5). For the sake of comparison, here’s the density of each distribution:

The blue line is the test score distribution. The red line is the distribution of the sample mean of all samples of size 3 taken from the test score distribution. The sampling distribution is far less variable, as evidenced by its much thinner tails. Values less than 60 or greater than 85 are virtually never observed in the sampling distribution – that is, it’s unlikely that you would ask three classmates their scores, compute the average, and wind up with a number below 60 or above 85. However, it’s not uncommon to observe values below 60 or above 85 in the distribution of test scores themselves. This is basically common sense – relatively extreme values don’t have as big of an impact when you’re averaging the observations, and when you’re randomly drawing more than one value it’s unlikely that they’re all extreme in the same direction. (Note – this is contingent upon truly random sampling!)

We have seen that the sampling distribution of the sample mean has the same expected value as the distribution from which samples are drawn – in our example, test scores and the average of three classmates’ test scores. Why is this? It’s because the sample mean is an unbiased estimator of μ. In other words, E(X̄) = μ. This is easy to prove: just calculate the expected value of the sample mean, substituting in the formula (1/n) Σ Xᵢ for X̄.
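Spelled out, the proof is one application of linearity of expectation:

```latex
\mathbb{E}[\bar{X}]
= \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]
= \frac{1}{n}\cdot n\mu
= \mu
```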

However, this is not true for the standard deviation, which we already know because we saw that the sampling distribution is much narrower than the population of individual observations. It’s not much more complex than the derivation of the mean, but we end up with a standard deviation that is different from that of the original distribution. The standard deviation of the sampling distribution of the sample mean, also called the standard error, is the standard deviation of the underlying distribution divided by the square root of n, the sample size.
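The derivation mirrors the one for the mean, with the extra fact that the variance of a sum of independent random variables is the sum of their variances:

```latex
\operatorname{Var}(\bar{X})
= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{n\sigma^2}{n^2}
= \frac{\sigma^2}{n}
\quad\Longrightarrow\quad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}
```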

So, if X is normal, the sample mean is distributed normally with a mean of μ and a standard deviation of σ divided by the square root of the sample size, n. The sample size corresponds to the number of observations used to calculate each sample mean in the sampling distribution. A different number for n corresponds to a different sampling distribution of the sample mean, because it has a different variance. The standard deviation of the sampling distribution of the sample mean is usually called the **standard error** of the mean. Intuitively, it tells us by how much we deviate from the true mean, μ, on average. The previous information tells us that when we take a sample of size n from the population and compute the sample mean, we know two things: we expect, on average, to compute the true population mean, μ; and the average amount by which we will deviate from the parameter μ is σ/√n, the standard error of the mean. Clearly, the standard error of the mean decreases as n increases, which implies that larger samples lead to greater precision (i.e. less variability) in estimating μ. Below you can see that for larger sample sizes the distributions are more centered on the expected value and exhibit less spread:

The graph shows that while the distribution of test scores is relatively flat, the distribution of the sample mean in a sample of size n from the class gets more and more peaked as the sample size increases. Every distribution is expected to take on the true mean value, hence they all peak at about 72, but they differ in their spread around the mean. Basically, the window of values that are typically taken on shrinks as the sample size increases. You can see that a huge sample isn’t really necessary to improve accuracy considerably.

If we employ the ‘randomly ask three students their scores’ method, about two-thirds of the time our sample mean will fall between 67.1 and 76.9 – the range within one standard error of the mean. We can make this range more precise by sampling more people. Suppose you ask just two more people their test scores. The distribution of the sample mean has shrunk: the lower limit is now 68.2, and the upper limit 75.8. We still expect an average value of 72, but we can rule out some values by shortening the range. Below is the new interval, based on a sample of 5 classmates:
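These ranges are just μ ± σ/√n, the one-standard-error band around the mean (a quick check, assuming the true class parameters):

```r
mu    <- 72
sigma <- 8.5

# Mean +/- one standard error covers roughly two-thirds of sample means
range3 <- mu + c(-1, 1) * sigma / sqrt(3)  # about 67.1 to 76.9
range5 <- mu + c(-1, 1) * sigma / sqrt(5)  # about 68.2 to 75.8
```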

By now it’s clear that a sample of three won’t be too precise due to its relatively large standard error. This means we won’t be able to say with much certainty that the true mean, i.e. the average of all 50 tests, lies within a narrow range. Conversely, if we want to be almost certain, say 95% sure, that the mean falls within a range, that range will necessarily be fairly wide. In fact, upon calculating the confidence interval for the mean test score given a sample of 3 students, I find that my results are basically meaningless: we can be 95% sure that the mean is between 51 and 96, which is such an imprecise estimate that it does me no good. Below, I simulate 50 confidence intervals plotted against the true mean of 72 and highlight the confidence intervals that omit the true mean. Since we used a 95% confidence level, only about 5% of the 50 simulations – 2 or 3 intervals – should be highlighted:
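The coverage logic behind a simulation like this can be sketched in R (my own sketch, not the author’s plotting code), using `t.test()` to build each 95% interval and counting how often the interval captures μ:

```r
set.seed(42)
mu    <- 72
sigma <- 8.5
n     <- 3

# For each simulated sample of size 3, ask whether the 95% t-interval
# around its sample mean contains the true mean
covers <- replicate(10000, {
  x  <- rnorm(n, mu, sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

mean(covers)  # close to 0.95 by construction
```

Over 50 intervals, as in the plot, about 2 or 3 misses are expected on average, though the exact count varies run to run.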

Two intervals don’t include the true mean of 72, represented by the red dotted line. This is fine; we only expect our confidence intervals to include the true mean 95% of the time. The fatal flaw with this is the length of the confidence intervals – a sample size of 3 simply can’t provide enough precision to be useful. With a sample size of 15, we get useful ranges from which we can construct estimates of the mean:

Note the y-axis labels on this graph compared to the previous. A sample of 15 students gives much more useful insight, as the typical confidence interval has a much narrower range. The averages, represented by the circles on each bar, are also generally pretty close to the dotted line representing the true mean. In fact, if I average these averages, I get 71.86, which is very close to the true value of 72. Below is the distribution of the means of the confidence intervals with the average of 71.86 highlighted:

The sample of 15 students was pretty effective – sample means that deviate far from the true mean of 72 are rare, as shown above. This tells us that taking a sample of size 15 and computing the sample mean is a fairly accurate approximation of the true mean test score in the histology class.

We’ve seen 50 simulated confidence intervals in graph form and computed the mean of their means, which lined up well with the true mean. We can also calculate the 95% confidence interval explicitly, which will give us a range within which the true mean is expected to lie with 95% certainty. First, I’ll use simulation. I compute the sample mean from a randomly drawn sample of 15 observations from a distribution with mean 72 and sd 8.5. I repeat this 10,000 times, order the resulting sample means from smallest to largest, and store them in a vector m. From there I can very simply extract values corresponding to outcomes smaller or larger than x% of the simulated means, which gives us an empirical quantile.

Suppose I want to know the 2.5% quantile; I simply grab the element in m that is lower than 97.5% of the others. Since they’re ordered and there are 10,000 total, I want the 250th observation, since 250/10000 = 2.5%. The command m[250] gives me my answer, which is 67.7. So we say that 97.5% of the simulated values of the sample mean were greater than 67.7. Similarly, the same percentage were less than 76.4. Therefore, our 95% CI is 67.7 to 76.4. Based on sampling 15 classmates, we’re 95% certain that the true mean of the test scores is between 67.7 and 76.4. Here is the result of the simulation, graphically:
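A sketch of that simulation (assuming `rnorm()` with the class parameters; the exact quantiles wobble slightly from run to run):

```r
set.seed(7)

# 10,000 sample means from samples of size 15, sorted ascending
m <- sort(replicate(10000, mean(rnorm(15, mean = 72, sd = 8.5))))

m[250]   # empirical 2.5% quantile, near 67.7
m[9750]  # empirical 97.5% quantile, near 76.3
```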

This is roughly the same result you would get by solving for the 95% CI using the laws of probability. First we use the standard error to construct a range around the mean, which is just the standard deviation of test scores divided by the square root of n. We also need the values of Student’s t distribution that correspond to our stated probabilities: we want the t value that’s lower than 97.5% of observations, and the one that’s greater than 97.5% of observations, in a Student’s t distribution with 14 df, which we can find with the qt() function in R. Here is the code to generate the confidence interval:

```r
# Simulate a sample of 15 test scores from the class distribution
x15 <- rnorm(15, mean = 72, sd = 8.5)

n     <- length(x15)
m     <- mean(x15)
alpha <- 0.05

# 95% t-interval: sample mean +/- t(alpha/2, n - 1) * standard error
lower.ci <- m + sd(x15) / sqrt(n) * qt(alpha / 2, df = n - 1, lower.tail = TRUE)
upper.ci <- m + sd(x15) / sqrt(n) * qt(alpha / 2, df = n - 1, lower.tail = FALSE)
```
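As a sanity check (my addition, not part of the original code), R’s built-in `t.test()` returns the same interval; `x15` is regenerated here with an arbitrary seed so the snippet runs on its own:

```r
set.seed(123)
x15 <- rnorm(15, mean = 72, sd = 8.5)

# Manual 95% t-interval, as above
manual  <- mean(x15) + c(-1, 1) * sd(x15) / sqrt(15) * qt(0.975, df = 14)

# Built-in equivalent
builtin <- t.test(x15, conf.level = 0.95)$conf.int

manual
builtin  # the two intervals agree
```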

I think that’s it for the intro! Quick recap: the probability distribution of a statistic, which is just a function of sample values, is called the sampling distribution of that statistic. Statistics and sampling distributions help us make inferences about population parameters, which we don’t usually know in practice. The sampling distribution of the sample mean is the distribution of the means of all samples of size n. As n increases, the variability of the sampling distribution, the standard error, decreases.