The Sampling Distribution of The Sample Mean – An Introduction using R

Suppose we draw independent observations from a population with mean μ and standard deviation σ – i.e. we take a sample from the population.  Before we draw the sample, each observation is a random variable – i.e. a mapping from the probability space to a number on the real line representing a unit of measurement – since we aren’t sure what specific value it will take on.  The sample is simply a collection of multiple random variables and is therefore random itself.  Note that while sample values are independent of one another, they are all drawn from a common distribution.  Sometimes we know what that distribution is beforehand, and often we don’t.

We use samples to compute statistics – functions of the sample values that estimate parameters of the distribution in question.  A statistic like the sample mean, which estimates the true mean (or average) of the distribution, is computed from a draw of just n observations – a subset of the population.  Clearly, the sample mean is different from the population mean; the sample mean is computed from n random observations, whereas the population mean is computed from every observation in the distribution.  Thus we call the mean a parameter, meaning it’s a characteristic of the entire population, and we call the sample mean a statistic, to indicate that its computation relies on a finite number of observations from the population.

A statistic, specifically the sample mean, is a function of random variables and therefore is a random variable itself, endowed with all of the requisite characteristics – measures of location like the mean and median, measures of dispersion like the variance, a probability distribution, and so forth.  We call the probability distribution of a statistic the sampling distribution of that statistic.  The probability distribution of the sample mean is therefore called the sampling distribution of the sample mean.

There are 50 students in a histology class in the first year of medical school.  On exam one, the mean was a score of 72 and the standard deviation was 8.5 points.  However, only the teacher knows this.  Suppose you’re a student in the class.  From your perspective, the true mean μ is an unknown parameter.  However, you know the scores of three classmates because you shamelessly asked them what they scored, which gives you three observations (test scores) from the distribution of histology test scores.  You have a random sample of size 3.

You can compute the average of this sample and get the sample mean, which estimates the true mean of all histology test scores in your class.  Using R, I computed a sample mean of 67.48 from a random sample of observations consisting of 74.3, 56.6, and 71.6.  In the bar plot below, each bar is a sample value (a random variable drawn from the distribution when I randomly sampled 3 values), the blue line is the mean of these observations (i.e. the sample mean), and the red dotted line is the true mean.  Clearly, the means don’t match up.  However, they’re pretty close.
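The computation above is a one-liner in R.  (A sketch using the rounded values quoted in the text, so the result differs slightly from the 67.48 computed from the unrounded draws.)

```r
# Sample of three reported test scores (rounded to one decimal in the text)
scores <- c(74.3, 56.6, 71.6)

# The sample mean is just the arithmetic average of the sample
xbar <- mean(scores)
xbar  # 67.5 with these rounded values
```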


If you took another sample, i.e. asked 3 different students how they scored, you wouldn’t get the same result.  Sampling again from the same distribution, I get values of 89, 66, and 76, the mean of which is 76.8 – almost 10 points higher than my last sample:


You can see that we came up with totally different samples in the previous two examples even though we sampled the same way and from the same distribution.  There are as many possible samples as there are ways to choose subsets of 3 from a population of 50 – which is the binomial coefficient ’50 choose 3’, or 19,600.  So, there are almost 20,000 ways to draw a sample of 3 from the distribution of histology test scores by asking 3 classmates their scores.  Hence we shouldn’t expect to get the same sample often.
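That count comes straight from the binomial coefficient, which R computes with choose():

```r
# Number of distinct samples of size 3 from a population of 50
choose(50, 3)  # 19600
```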

The sampling distribution of the sample mean is the distribution of the sample means of all possible samples of size 3 from the test score distribution – i.e. the 19,600 cases I just mentioned.  We can think of each sample mean we compute by asking 3 classmates their scores as a random draw from this distribution.  While we don’t know exactly what we will compute before sampling occurs, we do know some statistical properties of the sampling distribution of the sample mean:

  • The mean or expected value of the sampling distribution of the sample mean is μ, the true mean
  • The standard deviation of the sampling distribution of the sample mean is σ/√n, where σ is the standard deviation of the distribution from which we’re taking a sample of size n
  • If a random variable X is normal, then so is the sampling distribution of its sample mean
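A quick simulation illustrates the first two properties for the histology example, where μ = 72, σ = 8.5, and n = 3.  (A sketch; the seed is arbitrary.)

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Draw 10,000 samples of size 3 from the score distribution
# and compute each sample mean
xbars <- replicate(10000, mean(rnorm(3, mean = 72, sd = 8.5)))

mean(xbars)  # close to the true mean, 72
sd(xbars)    # close to sigma/sqrt(n) = 8.5/sqrt(3), about 4.91
```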

Below is a chart illustrating the points above.  Given a distribution of some variable X, we first take a sample of size n from the distribution.  With these n randomly drawn sample values, we compute the sample mean (S) as the arithmetic average of the sample.  We call the probability distribution of all sample means calculated based on all possible samples of size n from the distribution of X the Sampling Distribution of S, which is on the far right.  The sampling distribution is similar to the distribution of X in that they’re peaked around the same average value, but X represents the value of an observation, whereas S represents the mean of n observations.


*I know the stars are corny; the point is that only a few of the items in the population are selected for the sample.  Perhaps we’re interested in the average mass of a star in the distribution, given their varying sizes.  With sampling, we can account for the variation in size by sampling appropriately, which will give us the tools to make probabilistic statements about the average mass of a star even though we can’t see its whole distribution in practice.

To reiterate, we are no longer talking about the distribution of a random variable X, or, in our example, histology test scores.  We are talking about the distribution of the sample means of random samples of size 3 from the distribution of histology test scores.  On average, we expect the sample of size 3 to produce a sample mean equal to the population mean, with a typical deviation of σ/√3, i.e. the standard deviation of the actual test scores divided by the square root of 3.  Clearly, the spread of the distribution of the sample means from samples of size 3 is smaller, as measured by the standard deviation, than the distribution of test scores themselves.  Intuitively, this makes sense – one could draw an unlikely value from the distribution of test scores, but the effect of this value is muted when it’s averaged with two others.  By the same reasoning, the standard deviation of the sampling distribution of the sample mean decreases as the sample size n increases.

To illustrate, below I simulated histology test scores based on the mean and standard deviation info:
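A minimal sketch of that simulation – one normal draw per student, using the class size of 50 and the stated mean and standard deviation (the seed is mine):

```r
set.seed(7)  # arbitrary seed

# One simulated score per student in the 50-person class
class_scores <- rnorm(50, mean = 72, sd = 8.5)

hist(class_scores, main = "Simulated histology test scores", xlab = "Score")
```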


The distribution is bell-shaped with a mean of 72.  What about the distribution of the results you obtain when asking three classmates their scores and taking the average?  What is the spread and central tendency?  Below is the distribution of the mean score in samples of size 3 from the class, also called the sampling distribution of the sample mean:


They share the same expected value, but the sampling distribution is much more tightly centered on the mean.  This is because there is less dispersion in the distribution of sample means than there is in the distribution of test scores themselves.  We can corroborate this by computing the standard deviation of the sampling distribution, which is just over 4.  Compare this to the standard deviation of the simulated test scores, which was 7.5 (the population value is 8.5) – note that 7.5/√3 ≈ 4.3, consistent with the σ/√n rule.  For the sake of comparison, here’s the density of each distribution:


The blue line is the test score distribution.  The red line is the distribution of the sample mean of all samples of size 3 taken from the test score distribution.  The sampling distribution is far less variable, as evidenced by its much thinner tails.  Values less than 60 or greater than 85 are virtually never observed in the sampling distribution – that is, it’s unlikely that you would ask three classmates their scores, compute the average, and wind up with a number below 60 or above 85.  However, it’s not uncommon to observe values below 60 or above 85 in the distribution of test scores themselves.  This is basically common sense – relatively extreme values don’t have as big of an impact when you’re averaging the observations, and when you’re randomly drawing more than one value it’s unlikely that they’re all extreme in the same direction.  (Note – this is contingent upon truly random sampling!)

We have seen that the sampling distribution of the sample mean has the same expected value as the distribution from which samples are drawn – in our example, test scores and the average of three classmates’ test scores.  Why is this?  It’s because the sample mean is an unbiased estimator of μ – in other words, E(X̄) = μ.  This is easy to prove: calculate the expected value of the sample mean, substituting the formula (1/n) Σ Xᵢ for the sample mean.

However, this is not true for the standard deviation, which we already know because we saw that the sampling distribution is much narrower than the population of individual observations.  It’s not much more complex than the derivation of the mean, but we end up with a standard deviation that is different from that of the original distribution.  The standard deviation of the sampling distribution of the sample mean, also called the standard error, is the standard deviation of the underlying distribution divided by the square root of n, the sample size.

So, the sample mean is distributed normally with a mean of μ and a standard deviation of σ divided by the square root of the sample size, n. The sample size corresponds to the number of observations used to calculate each sample mean in the sampling distribution. A different value of n corresponds to a different sampling distribution of the sample mean, because it has a different variance. The standard deviation of the sampling distribution of the sample mean is usually called the standard error of the mean. Intuitively, it tells us by how much we deviate from the true mean, μ, on average. The previous information tells us that when we take a sample of size n from the population and compute the sample mean, we know two things: we expect, on average, to compute the true population mean, μ; and the average amount by which we will deviate from the parameter μ is σ/√n, the standard error of the mean. Clearly, the standard error of the mean decreases as n increases, which implies that larger samples lead to greater precision (i.e. less variability) in estimating μ.  Below you can see that for larger sample sizes the distributions are more centered on the expected value and exhibit less spread:
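The shrinking standard error is easy to tabulate for a few sample sizes with σ = 8.5:

```r
# Standard error of the mean, sigma/sqrt(n), for several sample sizes
n  <- c(3, 5, 15, 30, 50)
se <- 8.5 / sqrt(n)
round(se, 2)  # 4.91 3.80 2.19 1.55 1.20
```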


The graph shows that while the distribution of test scores is relatively flat, the distribution of the sample mean in a sample of size n from the class gets more and more peaked as the sample size increases.  Every distribution is expected to take on the true mean value, hence they all peak at about 72, but they differ in their spread around the mean.  Basically, the window of values that are typically taken on shrinks as the sample size increases.  You can see that a huge sample isn’t really necessary to improve accuracy considerably.

If we employ the ‘randomly ask three students their scores’ method, about two-thirds of the time our outcome will be between 67.1 and 76.9.  (i.e. 2/3 of the time we will get a sample mean in this range).  We can make this range more precise by sampling more people.  Suppose you ask just two more people their test scores.  The interval has shrunk: the lower limit is now 68.2, and the upper limit 75.8.  We still expect an average value of 72, but we can rule out some values by shortening the range.  Below is the new interval, based on a sample of 5 classmates:
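These one-standard-error ranges can be checked directly (about 68% of a normal distribution lies within one standard error of the mean):

```r
mu <- 72; sigma <- 8.5

mu + c(-1, 1) * sigma / sqrt(3)  # about 67.1 and 76.9 (sample of 3)
mu + c(-1, 1) * sigma / sqrt(5)  # about 68.2 and 75.8 (sample of 5)
```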


By now it’s clear that a sample of three won’t be too precise due to its relatively large standard deviation.  This means we won’t be able to say with certainty that the true mean, i.e. the average of all 50 tests, is within a given range.  Conversely, if we are to be almost certain, say 95% sure, that the mean falls within a range, that range will necessarily be fairly large.  In fact, upon calculating the confidence interval for the sample mean test score given a sample of 3 students, I find that my results are basically meaningless: we can be 95% sure that the mean is between 51 and 96, an estimate so imprecise it tells me nothing useful.  Below, I simulate 50 confidence intervals plotted against the true mean of 72 and highlight the confidence intervals that omit the true mean.  Since we used a 95% confidence level, only about 5% of the 50 simulations – two or three intervals – should be highlighted:
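A sketch of that simulation (the seed is arbitrary; I use the t-based interval since σ is treated as unknown when sampling):

```r
set.seed(3)  # arbitrary seed

# 50 t-based 95% confidence intervals from samples of size 3
covers <- replicate(50, {
  s  <- rnorm(3, mean = 72, sd = 8.5)                       # ask 3 classmates
  ci <- mean(s) + c(-1, 1) * qt(.975, df = 2) * sd(s) / sqrt(3)
  ci[1] <= 72 && 72 <= ci[2]                                # does the CI contain mu?
})

sum(!covers)  # number of the 50 intervals that miss the true mean
```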


Two intervals don’t include the true mean of 72, represented by the red dotted line.  This is fine; we only expect our confidence intervals to include the true mean 95% of the time.  The fatal flaw with this is the length of the confidence intervals – a sample size of 3 simply can’t provide enough precision to be useful.  With a sample size of 15, we get useful ranges from which we can construct estimates of the mean:


Note the y-axis labels on this graph compared to the previous.  A sample of 15 students gives much more useful insight, as the typical confidence interval has a much narrower range.  The averages, represented by the circles on each bar, are also generally pretty close to the dotted line representing the true mean. In fact, if I average these averages, I get 71.86, which is very close to the true value of 72.  Below is the distribution of the means of the confidence intervals with the average of 71.86 highlighted:


The sample of 15 students was pretty effective – sample means that deviate far from the true mean of 72 are rare, as shown above.  This tells us that taking a sample of size 15 and computing the sample mean is a fairly accurate approximation of the true mean test score in the histology class.

We’ve seen 50 simulated confidence intervals in graph form and computed the mean of their means, which lined up well with the true mean.  We can also calculate an explicit 95% confidence interval, which gives us a range within which the true mean is expected to lie with 95% certainty.  First, I’ll use simulation.  I compute the sample mean from a randomly drawn sample of 15 observations from a distribution with mean 72 and sd 8.5.  I repeat this 10,000 times, order the resulting sample means from smallest to largest, and store them in a vector m.  From there I can very simply extract values corresponding to outcomes smaller or larger than x% of the simulated means, which gives us an empirical quantile.

Suppose I want to know the 2.5% quantile; I simply grab the element in m that is lower than 97.5% of the others.  Since they’re ordered and there are 10,000 total, I want the 250th observation, since 250/10000=2.5%.  The command m[250] gives me my answer, which is 67.7.  So we say that 97.5% of the simulated values of the sample mean were greater than 67.7.  Similarly, the same percentage were less than 76.4.  Therefore, our 95% CI is 67.7 to 76.4.  Based on sampling 15 classmates, we’re 95% certain that the true mean of the test scores is between 67.7 and 76.4.  Here is the result of the simulation, graphically:
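The simulation just described can be sketched in a few lines (the seed is mine, so the exact endpoints will vary slightly from the figures quoted above):

```r
set.seed(15)  # arbitrary seed

# 10,000 sample means from samples of size 15, sorted ascending
m <- sort(replicate(10000, mean(rnorm(15, mean = 72, sd = 8.5))))

c(m[250], m[9750])  # empirical 95% interval, close to 67.7 and 76.4
```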


This is roughly the same result you would get by solving for the 95% CI using the laws of probability. First we consider the standard error in constructing a range around the mean, which is just the standard deviation of test scores divided by the square root of n.  We also need the values of Student’s t distribution that correspond to our stated probabilities: we want the t value that’s lower than 97.5% of observations and the one that’s greater than 97.5% of observations in a Student’s t distribution with 14 df, which we can find with the qt() function in R.  Here is the code to generate the confidence interval:
x15 <- rnorm(15,mean=72, sd=8.5)
n <- length(x15)
m <- mean(x15)

alpha <- .05
# The names 'lower' and 'upper' below are editorial; the original
# assignment targets were lost in formatting
lower <- m + sd(x15)/sqrt(15)*qt(alpha/2, df=n-1, lower.tail=T)
upper <- m + sd(x15)/sqrt(15)*qt(alpha/2, df=n-1, lower.tail=F)

I think that’s it for the intro!  Quick recap: the probability distribution of a statistic, which is just a function of sample values, is called the sampling distribution of that statistic.  Statistics and sampling distributions help us infer about population parameters, which we don’t usually know in practice.  The sampling distribution of the sample mean is the distribution of the means of all samples of size n.  As n increases, the variability of the sampling distribution, the standard error, decreases.

KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!

Data Until I Die!

The challenge from the KDD Cup this year was to use their data relating to student enrollment in online MOOCs to predict who would drop out vs who would stay.

The short story is that using H2O and a lot of my free time, I trained several hundred GBM models before settling on the final one, which got me an AUC score of 0.88127 on the KDD Cup leaderboard and, at the time of this writing, landed me in 120th place. My score is 2.6% away from 1st place, but there are 119 people above me!

Here are the main characters of this story:

MySQL Workbench

It started with my obsessive drive to find an analytics project to work on. I happened upon the KDD Cup 2015 competition and decided to give it a go. It had the characteristics of a project that I wanted to get…

View original post 1,843 more words

Forecasting Exchange Rates Using R Time Series

Benny Austin

A time series is a historical record of data points collected at periodic intervals of time. Statistical tools like R use forecasting models to analyse historical time series data and predict future values with reasonable accuracy. In this post I will be using R time series to forecast the exchange rate of the Australian dollar using the daily closing rate of the dollar collected over a period of two years.

The data for this post is sourced from AdventureWorksDW which is a SQL Server database that’s available for download from CodePlex. Within AdventureWorksDW database, the table FactCurrencyRate contains daily closing rate (EndOfDayRate) of different currencies (CurrencyKey) pegged against the US dollar.

Screen Capture 1 – FactCurrencyRate Dataset

R Time Series

Data from SQL Server is first loaded into an R dataset using functions from RODBC package.

The above code does the following:

  • Loads the RODBC package using library() function
  • Establishes connection to…

View original post 605 more words

Approaches to Extremes: Visualisation, Tests and The Local Outlier Factor

some real numbers

For most modelling exercises, extreme values pose a real problem. Whether these points are considered outliers* or influential points, it is good practice to identify, visualise and possibly exclude them for the sake of your model. Extreme values can really mess up your estimates and consequently your conclusions!

Of course in some fields, such as marketing or retail analytics, extreme values may represent a significant opportunity. Perhaps there are some customers that are simply off the charts in terms of purchase size and frequency. It would then be a good idea to identify them and do everything you can to retain them!

An extreme value may come to be via a mistake in data collection, or it could have been legitimately generated – however unlikely – by the statistical process that generated your data. The decision on whether a point is extreme is inherently subjective. Some people believe outliers should never be…

View original post 546 more words

Analyzing Medical Expenditures with R – Part 1

Using a sample from the Medical Expenditures Panel Survey (MEPS), I examined the distributions of inpatient and outpatient medical expenditures with R.

I used the truehist() function in the MASS library to create the histogram below, which depicts outpatient expenditures.  You’ll notice that the graph looks like shit and communicates virtually no useful information.  But here are some key points:

Distribution of outpatient medical expenditures in MEPS.
  • The Expected Value of outpatient medical expenditures is $1,253, but interpret this carefully because only 68% of those surveyed incurred any expenses at all
  • Intuitively, the Average Cost of Outpatient Treatment is more consistent with the conditional expected value of outpatient treatment given that you’re one of the roughly 2/3 who will need it – this number is $1,853
  • The data is heavily skewed to the right (i.e. positively skewed)
  • Positively skewed distributions are associated with the relatively frequent occurrence of extreme values in the positive direction
  • Data that are denominated in currency (Income, GDP, etc.) are often right-skewed, partly due to the exponential nature of currency appreciation (i.e. compound interest)

The last 3 bullet points are why my graph sucks so much (and, technically, they’re behind the issues that the first two bullets highlight).

The distribution of inpatient expenditures tells a similar but more extreme story – not surprising given they occur less frequently and at a higher cost than outpatient services.  The Expected Value is $1,008, but the average cost incurred by the 7.5% of people who actually utilized inpatient treatment was $4,628.

I’m not doomed to make useless graphs with this data, but in order to visually analyze any aspect of it I need to transform it somehow.  A log transform will dampen the effect of extreme values without fundamentally changing the data’s relationship with explanatory variables, which is primarily what I’m interested in.
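Since the MEPS microdata isn’t reproduced here, here’s a sketch of the idea on simulated right-skewed data (the rlnorm parameters are arbitrary; adding 1 before taking logs handles the many $0 records):

```r
set.seed(1)  # arbitrary seed

# Simulated stand-in for expenditures: ~32% zeros, long right tail
expend <- ifelse(runif(1000) < 0.32, 0, rlnorm(1000, meanlog = 6, sdlog = 1.5))

# log(x + 1) compresses the extreme right tail while keeping zeros at 0
hist(log(expend + 1), main = "Log-transformed expenditures",
     xlab = "log(dollars + 1)")
```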


Voila!  Way less fucked up.  Thanks, Euler.

A unique (and potentially confusing) quality of healthcare as a service is the variety involved.  The treatment a patient receives depends on, to name a few: his/her insurance coverage (or lack thereof), personal health characteristics, geography, performance of medical professionals, and post-treatment medical complications.  All of this variation will necessarily impact total expenditures and the required number of trips to medical centers.  With that in mind, I plot the frequency versus the cost of outpatient health services reported by MEPS survey respondents to try and get a somewhat complete view of healthcare use (even though I’m primarily interested in expenditures):

Again I’ve run into an issue with distributions that have long right tails.  The frequency/count variable – the number of outpatient visits – ranges from 0 to 150, but there are only 8 observations over 100, meaning that less than 2% of my data take up one-third of my plot.  There are also so many data points in the bottom left that I basically end up with a high-density clusterfuck (HDC) rather than an elegant representation of relative densities within the data.

Using the quantile() function in R, I exclude the top 5% of values of the count variable and set my horizontal scale accordingly with xlim().  In general, omitting outliers can be problematic, but this is just informal visualization.  To avoid overplotting in areas with lots of observations, I shade the data according to its frequency with jitter in ggplot2:
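A base-R sketch of the trimming step (the data here is simulated as a stand-in for MEPS, and the variable names are mine, not the actual MEPS column names):

```r
set.seed(2)  # arbitrary seed

visits <- rpois(1000, lambda = 4)        # hypothetical visit counts
expend <- visits * rlnorm(1000, 5, 1)    # hypothetical expenditures

cutoff <- quantile(visits, 0.95)         # drop the top 5% of counts
keep   <- visits <= cutoff

plot(expend[keep], visits[keep],
     xlab = "Expenditures", ylab = "Outpatient visits")
```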


Hmmm.  This isn’t “bad”, and it’s definitely “better”, but it’s more a representation of those who used outpatient healthcare than of outpatient healthcare usage, because it’s hard to see the very high density at 0 in the bottom left corner.  It doesn’t really convey what proportion of people actually incurred healthcare expenditures, but rather how much they spent given that they did.

Again, my reasoning behind plotting expenditures and the number of visits is just to view the cost of healthcare in a broad sense – visits take time, and time is essentially money – not “explain” expenditures by using the number of outpatient visits as an explanatory variable, as doing so would probably lead to confounding.  Confounding is the bias that results when an extraneous variable is correlated with both the dependent variable and an independent variable – a likely scenario for healthcare expenditures and number of visits because the two will probably be driven by similar underlying factors.  Though it is worth noting that it looks like there’s a lot of variability in expenditures even within groups of observations where the number of outpatient visits is the same.

Using color is an interesting way to add a third variable to a scatterplot.  In R, all you have to do is set the color of the plot equal to a factor variable, which I’ve done below to indicate which of the survey respondents were insured (blue) and uninsured (red):


The problem with this plot (again) is that it doesn’t show the bulk of the data, which is in the bottom left corner, too well.  It does sort of look like red dots (i.e. uninsured patients) get less common as you move northeast (increase cost and number of visits), but it’s not the least bit conclusive.

Producing a boxplot of expenditures for each outcome of the binary variable – insured and uninsured – and arranging them side-by-side is an easier way to compare the data.  If the empirical distributions aren’t the same-ish, then we know there’s a difference between the typical outpatient expenditures of insured vs. uninsured people in the MEPS data set.  Also, boxplots are “robust”, i.e. not sensitive to distributional assumptions, which means extreme values won’t affect their validity (unlike the mean, for instance). The line in the middle of the plot is the median, a measure of central tendency that doesn’t care about the shape of the distribution.  Here’s the boxplot of outpatient expenditures by insurance status, which is interactive and linked to my plotly account (username:schapshow) so that anyone can access and manipulate my data:
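A sketch of the side-by-side boxplot on simulated stand-in data (the MEPS microdata isn’t included here, and the variable names and distribution parameters are my assumptions):

```r
set.seed(4)  # arbitrary seed

# Hypothetical insurance status: roughly 80% insured
status <- factor(c("Uninsured", "Insured")[rbinom(500, 1, 0.8) + 1],
                 levels = c("Uninsured", "Insured"))

# Hypothetical expenditures: many zeros plus a long right tail
expend <- ifelse(runif(500) < 0.4, 0, rlnorm(500, 6, 1.2))

# Plot on the log scale, as in the post
boxplot(log(expend + 1) ~ status, ylab = "log(expenditures + 1)")
```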

Dist. of Outpatient Expenditures in MEPS

Note that there’s no central line – i.e. no median, or 50th percentile – on the boxplot for uninsured respondents.  This isn’t an error; the median is just 0, which tells you that more than half of the 445 uninsured individuals spent $0 on outpatient care.  Since outpatient expenditures are right-skewed, though, we should expect a mean that’s much higher than the median – in this case, about $450.  Compare this to a median and mean of $330 and almost $1,500, respectively, for insured people (on the log-scale boxplot the median shows as 5.8; exp(5.8) ≈ 330). We can make boxplots according to any factor variable, not just binary ones.  Here are boxplots of outpatient expenditures based on the educational attainment of the respondent, broken up into college graduates, high school graduates, and people who didn’t finish high school:

Dist. of Outpatient Expenditures by Education Level in MEPS

Interestingly, the medians aren’t that different – $322 for college grads, $217 for high school grads, and $110 for non-high-school grads.  The means are even closer: $1,419, $1,222, and $1,415.  The big difference is the 25th percentile – the expenditure that 25% of people spend less than and 75% spend more than – which is positive for college graduates and 0 for the other groups.  So more than 75% of college graduates spend at least some money on outpatient healthcare, a larger proportion than in the other two groups.  College graduates are more likely to have positive outpatient expenditures, then, but not to spend considerably more.

In the following post, I’ll look at other variables’ impact on outpatient expenditures and perform a few different types of regressions.

Markov Chain Monte Carlo Without all the Bullshit

Math ∩ Programming

I have a little secret: I don’t like the terminology, notation, and style of writing in statistics. I find it unnecessarily complicated. This shows up when trying to read about Markov Chain Monte Carlo methods. Take, for example, the abstract to the Markov Chain Monte Carlo article in the Encyclopedia of Biostatistics.

Markov chain Monte Carlo (MCMC) is a technique for estimating by simulation the expectation of a statistic in a complex model. Successive random selections form a Markov chain, the stationary distribution of which is the target distribution. It is particularly useful for the evaluation of posterior distributions in complex Bayesian models. In the Metropolis–Hastings algorithm, items are selected from an arbitrary “proposal” distribution and are retained or not according to an acceptance rule. The Gibbs sampler is a special case in which the proposal distributions are conditional distributions of single components of a vector parameter. Various special cases…

View original post 2,908 more words

Macro Matters and Orthodoxy


Quantivity disliked undergrad macroecon, as it was largely a waste of time: fancy theory lacking compelling evidence, amplified by no consensus within the field à la saltwater versus freshwater.

Few folks could be blamed for such flippancy, as it was mostly harmless throughout the great moderation. In fact, traders took apparent pride in their ignorance of macro – except the global macro guys, obviously. Then, along came a credit crisis.

With that swan, Quantivity concluded it was high time to formulate a systematic macro perspective: a “top down” complement to calibrate “bottom up” quant models. Quantivity brought great humility to this effort, due to both intrinsic complexity and comparatively weaker background in macro.

This post kicks off a few thoughts derived from this effort; hopefully a welcome addition alongside micro analysis. Two caveats are worth noting. First, confirmation bias is particularly dangerous with macro, and thus emphasis is placed…

View original post 465 more words

Guest post: Open-Source Loan-Level Analysis of Fannie and Freddie


This is a guest post by Todd Schneider. You can read the full post with additional analysis on Todd’s personal site.

[M]ortgages were acknowledged to be the most mathematically complex securities in the marketplace. The complexity arose entirely out of the option the homeowner has to prepay his loan; it was poetic that the single financial complexity contributed to the marketplace by the common man was the Gordian knot giving the best brains on Wall Street a run for their money. Ranieri’s instincts that had led him to build an enormous research department had been right: Mortgages were about math.

The money was made, therefore, with ever more refined tools of analysis.

—Michael Lewis, Liar’s Poker (1989)

Fannie Mae and Freddie Mac began reporting loan-level credit performance data in 2013 at the direction of their regulator, the Federal Housing Finance Agency. The stated purpose of releasing the data was…

View original post 635 more words

Hypothesis Tests on Financial Asset Returns

In my previous post I computed some descriptive statistics on ETFs (i.e. estimates of parameter values calculated from a sample of 126 monthly returns on SPY and VNQ from 2005-01-03 until 2015-06-01), including the sample mean (X̄), sample standard deviation (s), and sample covariance.  Though I didn’t go into why or how, I used Maximum Likelihood Estimates – an estimation method that chooses the parameter values that maximize the likelihood of the observed sample.  I made a short slideshow about the process of maximum likelihood estimation on my Statistics Resources page.

Calculating a point estimate only gives us so much information, however.  We should also answer the question: “how certain can I be that my point estimate reflects the population parameter I’m trying to estimate?”  While the sample mean is an unbiased estimator of the population mean (µ), there’s still a possibility that you drew a really unrepresentative, unlikely sample from the population, and calculated the sample mean from it.  If that is in fact the case, your sample mean will not reflect the parameter you’re trying to estimate, µ.  That’s (mostly) the rationale behind hypothesis tests – we can, for example, compute how likely our sample would be if the population mean µ were 0, or create a range of values that contains the true mean with a predetermined degree of certainty, rather than just accepting the sample mean as our “best guess”.

(For a detailed explanation of the significance of the test statistic, p-value, and confidence interval for a one sample T test, refer to my statistics resources page, where the first section starts with a discussion of this test.  If you don’t have a good idea of what these terms mean, or if you’d like to know the statistical reasoning behind comparing the test statistic to the Student’s t distribution with n-1 degrees of freedom, I’d suggest reading it.)

We have already determined that the sample mean of SPY in my dataset is 0.606%, which is a monthly return corresponding to an approximate annual return of 7.27%.  For VNQ, the monthly and approximate annual sample means are 0.615% and 7.38%, respectively.  (Just multiply the monthly mean return by 12 to get an approximate annual return; the square-root-of-12 rule is for annualizing standard deviations, not means.)  As mentioned above, these are point estimates of the true values of µ; they reflect our best guess given the sample.  Now we want to conduct a hypothesis test.  Since I have stored the returns in a data frame called ‘returns.df’, I use $ and the column name to extract the appropriate data series to perform the test.  For SPY, the code t.test(returns.df$SPY) performs the one sample T test against the null hypothesis that µ, the population mean, is 0.  R gives me the following output:

  • t = 1.6086, the test statistic computed
  • df = 125, the degrees of freedom, i.e. the # of observations -1
  • p-value = 0.1102, the probability of observing a test statistic at least as extreme as the one computed, given the null hypothesis is true
  • Confidence Interval: (-0.1396%, 1.352%)
  • Mean of x = 0.60607, the sample mean is 0.60607%; i.e. the point estimate of μ
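The test statistic R reports is just the sample mean divided by its standard error.  As a sanity check, here’s a minimal sketch that reproduces t.test’s statistic by hand – note the vector x below is simulated stand-in data, not the actual SPY series:

```r
# Sketch: reproduce the one-sample t statistic by hand.
# 'x' is simulated stand-in data, NOT the post's actual SPY returns.
set.seed(1)
x <- rnorm(126, mean = 0.006, sd = 0.04)

n        <- length(x)
t.manual <- mean(x) / (sd(x) / sqrt(n))  # (x.bar - mu0) / standard error, with mu0 = 0
fit      <- t.test(x, mu = 0)

all.equal(unname(fit$statistic), t.manual)  # the two agree
```

The same check works on returns.df$SPY; the reported df is always n-1.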

In the context of SPY, our sample of 126 monthly returns would be considered a bit unusual/unlikely if the true mean return on SPY were in fact 0.   If we abandon the null hypothesis and instead conclude that the mean return on SPY is nonzero, we’d be making an error about 11% of the time in repeated samples where the null is actually true.  We can be 95% certain that the true mean is somewhere between -0.1396% and 1.352%, which doesn’t rule out the possibility that the mean return on SPY is 0 or even negative.

For VNQ, we get the following output:

  • t = 0.9186
  • df = 125
  • p-value = 0.3601
  • Confidence Interval: (-0.7104%, 1.9412%)
  • Mean of x = 0.6154

The lower test statistic and higher p-value compared to SPY correspond to weaker evidence that the true mean monthly return on VNQ is nonzero.  That is, assuming the mean return on VNQ is actually 0, the 126 months of data in our data set aren’t as unusual as they were for SPY.  A mean return of 0 is well within the bounds of the 95% confidence interval.

At the 95% confidence level, we can’t reject the claim that the mean return is 0 for either financial asset.  This might seem odd, since the S&P500 (which SPY tracks) is known for offering positive returns in the long run.  There are a few potential explanations.  First, the fact that SPY generates positive returns in the long run doesn’t mean that an investor can’t lose by buying at a time when SPY is relatively expensive and selling when it’s cheap.  An upward trend in the price of the index, which is well-documented, necessarily includes peaks and valleys.  Perhaps our data set corresponds to one such unlucky time frame (i.e. purchased at a peak and sold at a valley).  Second, this time frame, which I chose out of necessity because VNQ hasn’t been around for too long, includes a major financial crisis.  However, we can dismiss both of these explanations by examining the cumulative product of the simple returns on SPY and VNQ, which both indicate that an investor who bought at the beginning of 2005 and sold last month would have made money.  The third explanation is a statistical one: the distribution of the test statistic, which we used as the basis for determining how likely our sample was for a hypothesized parameter value (µ), assumes that the underlying data is normal and iid.  If this is not the case, then the test statistic may not follow the t distribution with n-1 degrees of freedom, and our conclusions might be invalid.

In general, statistical tests that assume the random variables drawn in a sample belong to a specific distribution are called parametric.  The one sample T test is a parametric test: it assumes that the sample data follow the Gaussian (normal) distribution.  Non-parametric tests don’t make these distributional assumptions and are often termed “robust”; the one sample T test is not robust.  If my data set of 126 monthly returns on SPY and VNQ since 2005 isn’t Gaussian, it is most likely more dispersed than the normal distribution, in which case the test will yield a p-value that is too high and a confidence interval that is too wide.  (Technically, if data is less dispersed than the normal distribution, the Student’s t test will instead yield a p-value that is too small and a confidence interval that is too narrow – for example, if the underlying data were actually uniformly distributed.  But this would be very unusual for financial data; the issue is generally that returns are more widely dispersed than the normal distribution, with excess kurtosis and a higher incidence of outliers.  The data may instead be distributed like the Cauchy distribution, or the Student’s t distribution.)
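To see why dispersion matters here, we can simulate it: a heavy-tailed sample (Student’s t with 2 degrees of freedom, a common stand-in for fat-tailed returns) typically produces a much wider t interval than a normal sample of the same size.  This sketch uses purely simulated data, not the ETF series:

```r
# Sketch: heavy tails vs. normal tails and the width of the t interval.
# Purely simulated data -- not SPY or VNQ.
set.seed(7)
normal.x <- rnorm(126)       # thin-tailed sample
heavy.x  <- rt(126, df = 2)  # heavy-tailed sample (excess kurtosis, outliers)

diff(t.test(normal.x)$conf.int)  # width of the 95% interval, normal sample
diff(t.test(heavy.x)$conf.int)   # typically much wider for the heavy-tailed sample
```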

We can check whether our data is normal using the Shapiro-Wilk test of normality, implemented with shapiro.test() in R.  The Shapiro-Wilk test takes as its null hypothesis that the data come from a normal distribution.  As discussed in this stackoverflow thread, defining the null and alternative hypotheses this way necessarily “favors” the proposition that the data are normally distributed, since we only declare the data non-normal when the p-value is small enough to reject the null (usually < 0.05).  So while we can firmly posit that data is not normal when we reject the null, we may often fail to reject the null hypothesis despite moderate non-normality in the data.  That being said, the test produces incredibly small p-values for both SPY and VNQ (5.281e-05 and 4.688e-09, respectively), hence we can conclude that neither was sampled from a normal distribution.
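As a quick illustration of how the test behaves (again on simulated data, not the ETF series), Shapiro-Wilk usually fails to reject on normal draws and rejects decisively on heavy-tailed draws:

```r
# Sketch: shapiro.test on simulated normal vs. heavy-tailed data.
set.seed(123)
p.norm  <- shapiro.test(rnorm(500))$p.value       # usually well above 0.05
p.heavy <- shapiro.test(rt(500, df = 2))$p.value  # usually far below 0.05

c(normal = p.norm, heavy.tailed = p.heavy)
```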

Since our data is not normally distributed, we should use a robust test, like the one-sample Wilcoxon signed-rank test, instead of the one sample T test to analyze whether SPY and VNQ generally produce positive returns.  In R, I type the following to perform the Wilcoxon signed-rank test on SPY with the null hypothesis that µ = 0, specifying that I want R to compute an exact p-value and a 95% confidence interval:

wilcox.test(returns.df$SPY, mu=0, exact=TRUE, conf.int=TRUE, conf.level=0.95)

We get a p-value of 0.008139 and a 95% confidence interval of (0.2523%, 1.589%).  This result is starkly different from the one sample T test.  The lower end of the 95% confidence interval is positive and the p-value is small.  Now we repeat the test for VNQ and get a p-value of 0.02557 with a confidence interval of (0.137%, 2.1957%).  Clearly, our conclusions are much different when we choose a test that doesn’t make assumptions that our data set violates.


SPY, 95% confidence intervals:

  • Student’s t: (-0.1396%, 1.352%)
  • Wilcoxon signed-rank: (0.2523%, 1.589%)

VNQ, 95% confidence intervals:

  • Student’s t: (-0.7104%, 1.9412%)
  • Wilcoxon signed-rank: (0.137%, 2.1957%)

Another nonparametric approach we could take to estimating the mean return µ is bootstrapping.  Note that not all bootstrapped estimates are robust – there are a bunch of different bootstrapping methods, some of which depend on distributional assumptions.  However, it’s possible to bootstrap an estimate without making these assumptions.  One such method involves repeatedly sampling, with replacement, from the sample data, computing a sample mean in each iteration, and averaging the values of the sample mean computed during each successive iteration.

First, we want to randomly draw a sample from our data with the same number of observations as our data, making sure we replace sampled values.  In R, we type sample(returns.df$SPY, n, replace=TRUE), where n = 126, to get 126 random values from the SPY monthly return data.  Taking the mean gives us an estimate of the mean based on one random resample.  We want to repeat this a bunch of times (here, Z = 1000 times) and average the result.  To do this, I used the following R code:

Z <- 1000                      # number of bootstrap iterations
n <- length(returns.df$SPY)    # number of observations (126)
sample.mean.boot <- rep(0, Z)  # pre-allocate the vector of bootstrapped means
for (i in 1:Z) {
      boot.sample <- sample(returns.df$SPY, n, replace=TRUE)
      sample.mean.boot[i] <- mean(boot.sample)
}

Running this in R gives us sample.mean.boot, which is a vector of 1000 estimates of the sample mean obtained by bootstrapping.  We take the mean of sample.mean.boot to get our estimate, which turns out to be 0.5857%.  We can do more with this output than compute a single point estimate – for example, we can select and count the outcomes in sample.mean.boot that are negative:

negative <- sample.mean.boot < 0
neg.return <- sample.mean.boot[negative]
length(neg.return)   # how many of the 1000 bootstrapped means are negative

I get 54, which means only 5.4% of the 1000 iterated values of the sample mean are negative.  We can also create a histogram of the vector sample.mean.boot to visualize the distribution of the simulated sample means.  In the graphic below, I plotted a vertical red line at the sample mean computed from the original data set over the histogram:

hist(sample.mean.boot, col="slateblue1", main="Bootstrapped SPY Returns", xlab="Sample Mean")
abline(v=mean(returns.df$SPY), col="red", lwd=2)   # vertical line at the original sample mean



This method of estimation suggests that a population mean µ equal to 0 is unlikely; it agrees more with our other nonparametric test, the Wilcoxon signed-rank test, than with the parametric Student’s t test.  Bootstrapping the mean of VNQ produces a result similar to the Wilcoxon test – the mean is probably nonzero, but the evidence isn’t quite as strong as it is for SPY.  Clearly, it’s important to think about the assumptions underlying various statistical tests before we draw conclusions from them.  When assumptions are in doubt, choose a nonparametric approach.
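The vector of bootstrapped means also supports a confidence interval directly: the percentile method just takes the 2.5th and 97.5th percentiles via quantile().  A minimal sketch on simulated stand-in data (the actual analysis would use returns.df$SPY):

```r
# Sketch: a 95% percentile bootstrap confidence interval for the mean.
# 'x' is simulated stand-in data, not the actual SPY returns.
set.seed(99)
x <- rnorm(126, mean = 0.006, sd = 0.04)

Z <- 1000
boot.means <- replicate(Z, mean(sample(x, length(x), replace = TRUE)))

quantile(boot.means, c(0.025, 0.975))  # 95% percentile interval
```

This is the same loop as above written with replicate(); the interval it produces requires no normality assumption.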