Simulating A Fund’s Value Over Time in R

Suppose you have $1 million  invested in a trust fund at 8% annually with a standard deviation of 30%.  Assume that inflation averages 3% over the term with a standard deviation of 9%.   You plan to withdraw $80,000 a year in monthly installments of $6,666.67.  What are the chances that the $1 million lasts you ten years?

This problem can be modeled using simulation.  The fund will lose and gain money according to a process that has both deterministic and random components.  The interest rate will fluctuate in a similar fashion and impact the fund’s bottom line.  The goal is to model how likely the fund is to last the desired 10 years.

I’ve seen problems like this approached by simulating random price paths in excel, but they almost always look like shit because each solitary trial involves periodic iterations, quickly turning the spreadsheet into a clusterfuck. Excel also makes it incredibly hard to make a simple histogram or Kernel Density Estimate, which I personally think are the most effective approaches to model the fund’s survival.  Instead of Excel, I’ll use R to quickly visualize the behavior of the fund and compute descriptive and inferential statistics using the same (admittedly questionable) Brownian Motion approach.  First set the parameters:

# Fund balance and rates
fund.start = 1000000
annual.mean.return = 8 / 100
annual.ret.std.dev = 30 / 100
annual.inflation = 3 / 100
annual.inf.std.dev = 9 / 100


monthly.withdrawals = 80000/12
n.obs = 10*12

# Number of simulations

n.sim = 100

It's easiest to treat each iteration of the simulation as a sequence of 120 months and convert our annual return and inflation measures into monthly ones:

# Monthly investment and inflation assumptions

monthly.mean.return = annual.mean.return / 12
monthly.ret.std.dev = annual.ret.std.dev / sqrt(12)
monthly.inflation = annual.inflation / 12
monthly.inf.std.dev = annual.inf.std.dev / sqrt(12)

With just one line of code, R can perform a wide range of simulations.  Some are parametric, meaning their validity depends on an assumption(s) that must be made about the shape of the underlying distribution.  Other techniques, like bootstrapping and subsampling, are called robust, meaning distributional assumptions aren’t necessary for sound results.

Brownian motion is a process characterized by periodic returns that are normally distributed and independent of each other.  To simulate it, we first generate normally distributed, independent returns and inflation rates for each of the 120 periods over which we model the fund's value.  That completes one simulated ten year period.  Since we can't learn shit from just one simulated outcome, we repeat the process many times; the more iterations, the more accurate the model.  First we generate the returns and inflation rates needed to simulate 100 outcomes:
monthly.invest.returns = matrix(0, n.obs, n.sim)
monthly.inflation.returns = matrix(0, n.obs, n.sim)

monthly.invest.returns[] = rnorm(n.obs * n.sim, mean = monthly.mean.return, sd = monthly.ret.std.dev)
monthly.inflation.returns[] = rnorm(n.obs * n.sim, mean = monthly.inflation, sd = monthly.inf.std.dev)

So now we've generated a matrix monthly.invest.returns with 120 rows, corresponding to months, and 100 columns, corresponding to iterations of the entire 120-month random process.  The matrix monthly.inflation.returns is the same thing but for inflation.

The fund starts out with $1 million, and each period three adjustments are made to the net asset value (NAV): it is multiplied by a factor of 1 plus the random return, it’s discounted by the random inflation rate, and a withdrawal is made.  So, to get to NAV at time x+1, the previous NAV is multiplied by:

(1 + monthly.invest.returns[x, ] - monthly.inflation.returns[x, ])

Notice I left the column index blank when subsetting the matrices because I want to calculate values for month x (row x) across all columns.  With a simple loop I can handle all 100 columns efficiently for each row.  Putting everything together, I have:

nav = matrix(1000000, n.obs + 1, n.sim)
for (j in 1:n.obs) {
      nav[j + 1, ] = nav[j, ] * (1 + monthly.invest.returns[j, ] - monthly.inflation.returns[j, ]) - monthly.withdrawals
}

Since negative fund values don’t make any sense, I tell R to replace negative fund values with NA.  Characterizing all such funds this way also gives me an easy way to count how many cases there are:

nav[ nav < 0 ] = NA

p.alive = 1 - rowSums(is.na(nav)) / ncol(nav)

The vector p.alive is the percentage of funds that haven’t yet run out of money.  Plotting the probability of fund survival by period gives us an idea of the time at which funds start to experience solvency troubles:
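A minimal sketch of that survival plot (the labels and styling here are my own choices, not necessarily those of the original figure):

```r
# Fraction of simulated funds still solvent in each month
plot(p.alive, type = 'l', las = 1, ylim = c(0, 1),
     xlab = 'Month', ylab = 'Proportion of funds alive',
     main = 'Probability of Fund Survival')
```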


About halfway through the 10 years the percentage of living funds turns downward and continues on a roughly linear decline until the end of the term, though the proportion of living funds remains larger than the proportion of insolvent funds throughout.

I can get a snapshot of how strongly the 100 funds finished by printing out the final row of the matrix:
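In R that's one line (the rounding is my addition, for readability):

```r
# Final NAV of each simulated fund; nav has n.obs + 1 = 121 rows
# because row 1 holds the starting balance
round(nav[nrow(nav), ])
```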


If I wanna know how big a typical surviving fund's financial cushion is, I'm going to want to sort the final NAVs into groups. Aesthetically, the NA values everywhere just kind of fuck me up.  Use the ! operator to keep only the final NAV values that aren't NA (i.e. all the surviving funds):


So of the 100 funds, 56 survived the entire 120 months.  Formally, they represent the conditional distribution of final fund values given survival to month 120.  This might be of interest in a more detailed analysis, but for now I'll just point out one thing: only about half of the funds survived, yet more than a few of the survivors have balances well above $1 million.  That suggests to me that the issue stems from too much risk (variability), not insufficient return.

Finally, R makes it very easy to plot the density of the surviving funds, which is hard and annoying to do in Excel:

final.nav = nav[nrow(nav), ]
final.nav = final.nav[ !is.na(final.nav) ]
stopifnot(length(final.nav) > 0)  # bail out if no fund survived
plot(density(final.nav, from = 0, to = max(final.nav)), las = 1, xlab = 'Final Capital',
     main = paste('Distribution of Final Balance,', 100 * p.alive[nrow(nav)], '% still solvent'))


Ta-da.  Personally, I think a density estimate of the final fund value for surviving funds, appropriately labeled with the proportion of the total 100 funds that belong to this group, gives a clearer, more precise, and more complete summary of the simulation results than could be obtained by just simulating and plotting price paths in Excel.

The Raw and Central Moments of a Probability Distribution

Moments are expectations of quantities involving a random variable X that reveal characteristics of X's distribution, most of which deal with its shape.  In the real world, we often can't justify making assumptions about an unknown distribution's shape, which means we can't enjoy the simplicity and predictability of assuming a bell-shaped curve.  As such, questions about the distribution's symmetry and tail mass are critical.  We answer them using the skewness, a function of the third central moment, and the kurtosis, a function of the fourth central moment, respectively.  The word document below mathematically derives general formulas for a distribution's moments, provides the mathematical relationship between raw and central moments, and gives formulas to calculate the skewness and kurtosis.
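In symbols, with mean μ, standard deviation σ, and central moments defined as expectations of powers of (X − μ), the key quantities are:

```latex
\mu_k = E\left[(X-\mu)^k\right] = \sum_{j=0}^{k} \binom{k}{j} (-\mu)^{k-j}\, E\!\left[X^j\right],
\qquad
\text{skewness} = \frac{\mu_3}{\sigma^3},
\qquad
\text{kurtosis} = \frac{\mu_4}{\sigma^4}
```

The middle expression is the binomial expansion that relates central moments to the raw moments E[X^j].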

The Standard Error of The Mean

The standard deviation of the sample means is known as the standard error of the mean (SE).  When we use sampling to estimate the true mean of the population, we necessarily incur some error.  We should expect this – sampling depends on the presence of randomness, after all.  A natural question to ask is how well the sample mean approximates the true mean.  We can't answer this question exactly, but we can measure how different the sample mean could potentially be from the true mean with the standard error of the mean.  What the standard deviation is to a population, the standard error is to a sampling distribution: it's the same concept applied to a different distribution.

So, the SE is our measure of spread for sampling distributions.  It represents the same principle as the SD, but it will always be smaller than the SD.  The SE also decreases as the sample size increases, but not linearly – reducing variability by increasing the sample size quickly becomes expensive due to diminishing returns.  Intuitively, the SE is smaller than the SD because averaging dampens the effect of outliers, which thins the tails of the distribution and creates a large centralized peak at the mean.

Suppose I open a bakery and sell 24 muffins per day on average, with a SD of 4.  On any given day, the probability that I sell more than 28 muffins is easy to compute: take Z = (28-24)/4 = 1, compute the cumulative probability of the standard normal at Z = 1, and subtract this value from 1.  In Excel, the formula is 1-NORMDIST(28,24,4,1) = 15.866%.

Now suppose I want to look at the average muffins sold per day during a 4 day period, because I don't want a fluke to fuck up my conclusions.  The average over the 4 day period is distributed totally differently than single-day sales.  It follows the sampling distribution of the sample mean, and thus its spread is defined by the standard error SE = SD/sqrt(4) instead of the SD.  This is much less variable than the single day case.  The probability that the average muffin sales over 4 days are greater than 28 is obtained from the normal dist with Z = (28-24)/(4/sqrt(4)) = 2, which turns out to be just over 2%.
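Both probabilities are one-liners in R, with pnorm playing the role of Excel's NORMDIST:

```r
# P(single-day sales > 28): Z = (28 - 24)/4 = 1
1 - pnorm(28, mean = 24, sd = 4)           # about 0.159
# P(4-day average > 28): SE = 4/sqrt(4) = 2, so Z = 2
1 - pnorm(28, mean = 24, sd = 4/sqrt(4))   # about 0.023
```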

We used the distribution of muffin sales to find out that the prob of selling more than 28 muffins in a day is 15.9%, and we used the sampling dist of the sample mean to compute the prob that average muffin sales over a 4 day period are greater than 28, which is only 2%.  Here are the graphs, which show that the average over 4 days is much less variable, as we should expect:


To compute the second probability, we used the fact that for a sample of size n from a population distributed normally with mean µ and standard deviation σ, the Z score of the sample mean is distributed as a standard normal.

  • Z = (X̄ – µ)/(σ/sqrt(n)) is distributed as a standard normal

But if we don't know the standard deviation, which is the more likely scenario, this is not necessarily the case.  If we don't know the population SD, we approximate it using the sample standard deviation, S, and the resulting Z score follows a student's T distribution with n-1 degrees of freedom.

  • Z = (X̄ – µ)/(S/sqrt(n)) is distributed as a student's T with n-1 degrees of freedom

Realistically, you’ll never know the population SD.  If you did, you’d almost definitely know the mean as well, in which case there’s no point in using sampling distributions and constructing confidence intervals to estimate its value.  If the SD is already known there’s also probably not much interesting new information to be learned from the data.

Let's suppose we don't know for sure that my muffin sales have a SD of 4, and instead assume that 4 is a sample estimate (S = 4).  Our Z score doesn't change, but to compute the probabilities of outcomes we now use the student's T dist with 3 degrees of freedom, not the normal dist.  In R, we compute the probability that the 4 day average sales are greater than 28 by finding the cumulative probability with the command pt(2, df=3, lower.tail=TRUE) = 93% and then subtracting it from 1.   So the chance that I'll sell more than 28 over the 4 days on average is about 7%, which is more than the probability implied by the normal distribution.
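In R:

```r
# P(T > 2) with 3 degrees of freedom
1 - pt(2, df = 3)                   # about 0.07
pt(2, df = 3, lower.tail = FALSE)   # equivalent form
```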

Conveniently, the T distribution tends towards the normal distribution as the degrees of freedom, which are determined by the sample size, get larger.  So, even though Z is only normally distributed when the population SD is known, we can actually use the normal distribution if it’s unknown so long as the sample size, n, is large enough.  Therefore, if we’re sampling about 30 or more observations from a distribution, we can say its Z score is normally distributed even if the SD is estimated with the sample standard deviation S.  The plot below shows that when the df is 1, the student’s T dist. is considerably different from the Gaussian distribution; however, as df increases the T distribution approximates the Gaussian very well:


We should also want to know how well S approximates the population SD.  This is a trickier question, and the result isn't very neat.  Basically, the sample standard deviation is a biased estimator, so we have to make some adjustments to fit its distribution to a suitable probability distribution.  So, to gauge how well S approximates the true SD, we use the fact that for a sample of size n, the quantity (n-1)S^2/σ^2 is distributed chi-squared with n-1 df.  This distribution doesn't resemble the normal or student's T:
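A quick simulation check of that fact (the seed, sample size, and population parameters here are arbitrary choices of mine):

```r
set.seed(1)              # arbitrary seed for reproducibility
n <- 15; sigma <- 8.5    # arbitrary sample size and population SD
# simulate the sampling distribution of (n-1)S^2 / sigma^2
stat <- replicate(10000, (n - 1) * var(rnorm(n, mean = 72, sd = sigma)) / sigma^2)
mean(stat)               # should land near n - 1 = 14, the chi-squared mean
```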


Conclusion: sampling distributions are super important in a lot of applied statistics.  It’s not really the type of thing you want to memorize formulas about, but here are, in my opinion, the big points with respect to sampling distributions:

  • The mean of the sample means will be the mean of the population
  • The variance of the sample means will be the variance of the population divided by the sample size, n
  • The standard deviation of the sample means is called the standard error of the mean and it tells us how precise our estimate of the population mean is
  • The standard error of the mean will be smaller than the standard deviation of the population
  • The standard error of the mean decreases as the sample size increases, but to a diminishing extent as the sample size increases
  • Given a population has a normal distribution, the sample means will also have a normal distribution
  • If a population is not normally distributed, but the sample size is sufficiently large (>30), then the sample means will have an approx. normal distribution
  • The quantity (n-1)S^2/σ^2, a function of the sample standard deviation S, is distributed as a chi-square random variable with n-1 degrees of freedom

The Sampling Distribution of The Sample Mean – An Introduction using R

Suppose we draw independent observations from a population with mean μ and standard deviation σ – i.e. we take a sample from the population.  Before we draw the sample, each observation is a random variable – i.e. a mapping from the probability space to a number on the real line representing a unit of measurement – since we aren’t sure what specific value it will take on. The sample is simply a collection of multiple random variables and is therefore a random variable itself.  Note that while sample values are independent from one another, they are all drawn from a common distribution.  Sometimes we know what that distribution is beforehand, and often times we don’t.

We use samples to compute statistics, which are functions of random variables that approximate parameters of the distribution in question.  A statistic like the sample mean, which is an estimate of the true mean (or average) of the distribution, is computed from a draw of just n observations – obviously a subset of the population.  Clearly, the sample mean is different from the mean: the sample mean is computed with n random observations, whereas the mean is computed with every observation in the distribution.  Thus we call the mean a parameter, meaning it's a characteristic of the entire population, and we call the sample mean a statistic, to indicate that its computation relies on a finite number of observations from the population.

A statistic, specifically the sample mean, is a function of random variables and therefore is a random variable itself, endowed with all of the requisite characteristics – measures of location like the mean and median, measures of dispersion like the variance, a probability distribution, and so forth.  We call the probability distribution of a statistic the sampling distribution of that statistic.  The probability distribution of the sample mean is therefore called the sampling distribution of the sample mean.

There are 50 students in a histology class in the first year of medical school.  On exam one, the mean was a score of 72 and the standard deviation was 8.5 points.  However, only the teacher knows this.  Suppose you’re a student in the class.  From your perspective, the true mean μ is an unknown parameter.  However, you know the scores of three classmates because you shamelessly asked them what they scored, which gives you three observations (test scores) from the distribution of histology test scores.  You have a random sample of size 3.

You can compute the average of this sample and get the sample mean, which estimates the true mean of all histology test scores in your class.  Using R, I computed a sample mean of 67.48 from a random sample of observations consisting of 74.3, 56.6, and 71.6.  In the bar plot below, each bar is a sample value (a random variable drawn from the distribution when I randomly sampled 3 values), the blue line is the mean of these observations (i.e. the sample mean), and the red dotted line is the true mean.  Clearly, the means don’t match up.  However, they’re pretty close.
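A sketch of that draw, assuming scores are roughly normal with the teacher's parameters (mean 72, sd 8.5); the exact values you get will of course differ from mine:

```r
scores <- rnorm(3, mean = 72, sd = 8.5)  # ask three classmates their scores
mean(scores)                             # the sample mean of the sample of 3
```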


If you took another sample, i.e. asked 3 different students how they scored, you wouldn’t get the same result.  Sampling again from the same distribution, I get values of 89, 66, and 76, the mean of which is 76.8 – almost 10 points higher than my last sample:


You can see that we came up with totally different samples in the previous two examples even though we sampled the same way and from the same distribution.  There are as many possible samples as there are ways to choose subsets of 3 from a population of 50 – which is the binomial coefficient '50 choose 3', or 19,600.  So, there are almost 20,000 ways to draw a sample of 3 from the distribution of histology test scores by asking 3 classmates their scores.  Hence we shouldn't expect to get the same sample often.
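The count is easy to verify in R:

```r
choose(50, 3)  # 19600 possible samples of size 3
```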

The sampling distribution of the sample mean is the distribution of the sample means of all possible samples of size 3 from the test score distribution – i.e. the 19,600 cases I just mentioned.  We can think of each sample mean of size 3 we compute by asking classmates their scores as being a random variable from this distribution.  While we don’t know what exactly we will compute before sampling occurs, we do know some statistical properties of the sampling distribution of the sample mean:

  • The mean or expected value of the sampling distribution of the sample mean is μ, the true mean
  • The standard deviation of the sampling distribution of the sample mean is σ/√n, where σ is the standard deviation of the distribution from which we’re taking a sample of size n
  • If a random variable X is normal, then so is the sampling distribution of its sample mean

Below is a chart illustrating the points above.  Given a distribution of some variable X, we first take a sample of size n from the distribution.  With these n randomly drawn sample values, we compute the sample mean (S) as the arithmetic average of the sample.  We call the probability distribution of all sample means calculated based on all possible samples of size n from the distribution of X the Sampling Distribution of S, which is on the far right.  The sampling distribution is similar to the distribution of X in that they’re peaked around the same average value, but X represents the value of an observation, whereas S represents the mean of n observations.


*I know the stars are corny; the point is that only a few of the items in the population are selected for the sample.  Perhaps we’re interested in the average mass of a star in the distribution, given their varying sizes.  With sampling, we can account for the variation in size by sampling appropriately, which will give us the tools to make probabilistic statements about the average mass of a star even though we can’t see its whole distribution in practice.

To reiterate, we are no longer talking about the distribution of a random variable X, or, in our example, histology test scores.  We are talking about the distribution of the sample means of random samples of size 3 from the distribution of histology test scores.  On average, we expect the sample of size 3 to produce a sample mean equal to the population mean, with a typical deviation of σ/√3, i.e. the standard deviation of the actual test scores divided by the square root of 3.  Clearly, the spread of the distribution of the sample means from samples of size 3 is smaller, as measured by the standard deviation, than the distribution of test scores themselves.  Intuitively, this makes sense – one could draw an unlikely value from the distribution of test scores, but the effect of this value is muted when it’s averaged with two others.  By the same reasoning, the standard deviation of the sampling distribution of the sample mean decreases as the sample size n increases.

To illustrate, below I simulated histology test scores based on the mean and standard deviation info:


The distribution is bell-shaped with a mean of 72.  What about the distribution of the results you obtain when asking three classmates their scores and taking the average?  What is the spread and central tendency?  Below is the distribution of the mean score in samples of size 3 from the class, also called the sampling distribution of the sample mean:


They share the same expected value, but the sampling distribution is much more tightly centered on the mean.  This is because there is less dispersion in the distribution of sample means than there is in the distribution of test scores themselves.  We can corroborate this by computing the standard deviation of the sampling distribution, which is just over 4.  Compare this to the standard deviation of test scores themselves, which was 7.5.  For the sake of comparison, here’s the density of each distribution:


The blue line is the test score distribution.  The red line is the distribution of the sample mean of all samples of size 3 taken from the test score distribution.  The sampling distribution is far less variable, as evidenced by the relatively flatter tails.  Values less than 60 or greater than 85 are virtually never observed in the sampling distribution – that is, it’s unlikely that you would ask three classmates their scores, compute the average, and wind up with a number below 60 or above 85.  However, it’s not uncommon to observe values below 60 or above 85 in the distribution of test scores themselves.  This is basically common sense – relatively extreme values don’t have as big of an impact when you’re averaging the observations, and when you’re randomly drawing more than one value it’s unlikely that they’re all extreme in the same direction.  (Note – this is contingent upon truly random sampling!)

We have seen that the sampling distribution of the sample mean has the same expected value as the distribution from which samples are drawn – in our example, test scores and the average of three classmates' test scores.  Why is this?  It's because the sample mean is an unbiased estimator of μ. In other words, E(X̄) = μ. This is easy to prove: just calculate the expected value of the sample mean, subbing in the formula (1/n) Σ xᵢ for the sample mean.
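Sketching that proof, which needs nothing more than linearity of expectation:

```latex
E[\bar{X}]
= E\left[\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i\right]
= \tfrac{1}{n}\textstyle\sum_{i=1}^{n} E[X_i]
= \tfrac{1}{n}\, n\mu
= \mu
```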

However, this is not true for the standard deviation, which we already know because we saw that the sampling distribution is much narrower than the population of individual observations.  It’s not much more complex than the derivation of the mean, but we end up with a standard deviation that is different from that of the original distribution.  The standard deviation of the sampling distribution of the sample mean, also called the standard error, is the standard deviation of the underlying distribution divided by the square root of n, the sample size.
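The corresponding variance calculation, which does require the observations to be independent:

```latex
\operatorname{Var}(\bar{X})
= \operatorname{Var}\left(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i\right)
= \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \tfrac{\sigma^2}{n},
\qquad
SE = \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}
```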

So, the sample mean is distributed normally with a mean of μ and a standard deviation of σ divided by the square root of the sample size, n. The sample size corresponds to the number of observations used to calculate each sample mean in the sampling distribution. A different number for n corresponds to a different sampling distribution of the sample mean, because it has a different variance. The standard deviation of the sampling distribution of the sample mean is usually called the standard error of the mean. Intuitively, it tells us by how much we deviate from the true mean, μ, on average.

The previous information tells us that when we take a sample of size n from the population and compute the sample mean, we know two things: we expect, on average, to compute the true population mean, μ; and the average amount by which we will deviate from the parameter μ is σ/sqrt(n), the standard error of the mean. Clearly, the standard error of the mean decreases as n increases, which implies that larger samples lead to greater precision (i.e. less variability) in estimating μ.  Below you can see that for larger sample sizes the distributions are more centered on the expected value and exhibit less spread:


The graph shows that while the distribution of test scores is relatively flat, the distribution of the sample mean in a sample of size n from the class gets more and more peaked as the sample size increases.  Every distribution is expected to take on the true mean value, hence they all peak at about 72, but they differ in their spread around the mean.  Basically the window of values that are typically taken on shrinks as the sample increases.  You can see that a huge sample isn’t really necessary to improve accuracy considerably.

If we employ the "randomly ask three students their scores" method, about two-thirds of the time our outcome will be between 67.1 and 76.9 (i.e. 2/3 of the time we will get a sample mean in this range).  We can make this range more precise by sampling more people.  Suppose you ask just two more people their test scores.  The distribution of the sample mean has shrunk: the lower limit is now 68.2, and the upper limit 75.8.  We still expect an average value of 72, but we can rule out some values by shortening the range.  Below is the new interval, based on a sample of 5 classmates:
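Those ranges are just the mean plus or minus one standard error:

```r
# one-standard-error band around the true mean of 72
72 + c(-1, 1) * 8.5 / sqrt(3)  # roughly 67.1 to 76.9 for n = 3
72 + c(-1, 1) * 8.5 / sqrt(5)  # roughly 68.2 to 75.8 for n = 5
```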


By now it's clear that a sample of three won't be too precise due to its relatively large standard error.  This means we won't be able to say with certainty that the true mean, i.e. the average of all 50 tests, is within a given range.  Conversely, if we are to be almost certain, say 95% sure, that the mean falls within a range, that range will necessarily be fairly large.  In fact, upon calculating the confidence interval for the sample mean test score given a sample of 3 students, I find that my results are basically meaningless: we can be 95% sure that the mean is between 51 and 96, which is such an imprecise estimate that it doesn't do me any good.  That the mean is probably somewhere between 51 and 96 tells me literally nothing useful.  Below, I simulate 50 confidence intervals plotted against the true mean of 72 and highlight the confidence intervals that omit the true mean.  Since we used a 95% confidence level, only about 5% of the 50 simulations should be highlighted:


Two intervals don’t include the true mean of 72, represented by the red dotted line.  This is fine; we only expect our confidence intervals to include the true mean 95% of the time.  The fatal flaw with this is the length of the confidence intervals – a sample size of 3 simply can’t provide enough precision to be useful.  With a sample size of 15, we get useful ranges from which we can construct estimates of the mean:


Note the y-axis labels on this graph compared to the previous.  A sample of 15 students gives much more useful insight, as the typical confidence interval has a much narrower range.  The averages, represented by the circles on each bar, are also generally pretty close to the dotted line representing the true mean. In fact, if I average these averages, I get 71.86, which is very close to the true value of 72.  Below is the distribution of the means of the confidence intervals with the average of 71.86 highlighted:


The sample of 15 students was pretty effective – sample means that deviate far from the true mean of 72 are rare, as shown above.  This tells us that taking a sample of size 15 and computing the sample mean is a fairly accurate approximation of the true mean test score in the histology class.

We've seen 50 simulated confidence intervals in graph form and computed the mean of their means, which lined up well with the true mean.  We can also calculate concrete numbers for the 95% confidence interval, which will give us a range within which the true mean is expected to lie with 95% certainty.  First, I'll use simulation.  I compute the sample mean from a randomly drawn sample of 15 observations from a distribution with mean 72 and sd 8.5.  I repeat this 10,000 times, order the resulting sample means from smallest to largest, and store them in a vector, m.  From there I can very simply extract values corresponding to outcomes smaller or larger than x% of the population, which gives us an empirical quantile.

Suppose I want to know the 2.5% quantile; I simply grab the element in m that is lower than 97.5% of the others.  Since they’re ordered and there are 10,000 total, I want the 250th observation, since 250/10000=2.5%.  The command m[250] gives me my answer, which is 67.7.  So we say that 97.5% of the simulated values of the sample mean were greater than 67.7.  Similarly, the same percentage were less than 76.4.  Therefore, our 95% CI is 67.7 to 76.4.  Based on sampling 15 classmates, we’re 95% certain that the true mean of the test scores is between 67.7 and 76.4.  Here is the result of the simulation, graphically:
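A sketch of that simulation (exact outputs depend on the random draws):

```r
# 10,000 sample means of samples of size 15, sorted ascending
m <- sort(replicate(10000, mean(rnorm(15, mean = 72, sd = 8.5))))
m[250]    # empirical 2.5% quantile; should land near 67.7
m[9750]   # empirical 97.5% quantile; should land near 76.4
```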


This is roughly the same result you would get by solving for the 95% CI using the laws of probability. First we consider the standard error in constructing a range around the mean, which is just the standard deviation of test scores divided by the square root of n.  We also need to use values of the student's t distribution that correspond to our stated probabilities; we want the t value that's lower than 97.5% of observations and the one that's greater than 97.5% of observations in a student's t distribution with 14 df, which we can find with the qt() function in R.  Here is the code to generate the confidence interval:
x15 <- rnorm(15,mean=72, sd=8.5)
n <- length(x15)
m <- mean(x15)

alpha <- .05
ci.lower <- m + sd(x15)/sqrt(n) * qt(alpha/2, df = n - 1, lower.tail = T)
ci.upper <- m + sd(x15)/sqrt(n) * qt(alpha/2, df = n - 1, lower.tail = F)

I think that’s it for the intro!  Quick recap: the probability distribution of a statistic, which is just a function of sample values, is called the sampling distribution of that statistic.  Statistics and sampling distributions help us infer about population parameters, which we don’t usually know in practice.  The sampling distribution of the sample mean is the distribution of the means of all samples of size n.  As n increases, the variability of the sampling distribution, the standard error, decreases.

Analyzing Medical Expenditures with R – Part 1

Using a sample from the Medical Expenditures Panel Survey (MEPS), I examined the distributions of inpatient and outpatient medical expenditures with R.

I used the truehist() function in the MASS library to create the histogram below, which depicts outpatient expenditures.  You’ll notice that the graph looks like shit and communicates virtually no useful information.  But here are some key points:

Distribution of outpatient medical expenditures in MEPS.
  • The Expected Value of outpatient medical expenditures is $1,253, but interpret this carefully because only 68% of those surveyed incurred any expenses at all
  • Intuitively, the Average Cost of Outpatient Treatment is more consistent with the conditional expected value of outpatient treatment given that you’re one of the roughly 2/3 who will need it – this number is $1,853
  • The data is heavily skewed to the right (i.e. positively skewed)
  • Positively skewed distributions are associated with the relatively frequent occurrence of extreme values in the positive direction
  • Data that are denominated in currency (Income, GDP, etc.) are often right-skewed, partly due to the exponential nature of currency appreciation (i.e. compound interest)

The last 3 bullet points are why my graph sucks so much (and, technically, they’re behind the issues that the first two bullets highlight).

The distribution of inpatient expenditures tells a similar but more extreme story – not surprising, given that inpatient stays occur less frequently and at a higher cost than outpatient services.  The Expected Value is $1,008, but the average cost incurred by the 7.5% of people who actually utilized inpatient treatment was $4,628.

I’m not doomed to make useless graphs with this data, but in order to visually analyze any aspect of it I need to transform it somehow.  A log transform will dampen the effect of extreme values without fundamentally changing the data’s relationship with explanatory variables, which is primarily what I’m interested in.


Voilà!  Way less fucked up.  Thanks, Euler.
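Here’s roughly what that transform does in R – a sketch on simulated log-normal draws standing in for the real MEPS values:

```r
library(MASS)  # for truehist()

# Simulated stand-in for right-skewed expenditure data (not the real MEPS values)
set.seed(1)
expenditures <- rlnorm(1000, meanlog = 6, sdlog = 1.5)

# Raw scale: a handful of huge values crush everything into the first bar
truehist(expenditures, xlab = "Outpatient expenditures ($)")

# Log scale: the right tail is tamed and the shape becomes visible
truehist(log(expenditures), xlab = "log(outpatient expenditures)")
```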

A unique (and potentially confusing) quality of healthcare as a service is the variety involved.  The treatment a patient receives depends on, to name a few: his/her insurance coverage (or lack thereof), personal health characteristics, geography, performance of medical professionals, and post-treatment medical complications.  All of this variation will necessarily impact total expenditures and the required number of trips to medical centers.  With that in mind, I plot the frequency versus the cost of outpatient health services reported by MEPS survey respondents to try and get a somewhat complete view of healthcare use (even though I’m primarily interested in expenditures):

Number of outpatient visits vs. log(expenditures) in MEPS.

Again I’ve run into an issue with distributions that have long right tails.  The frequency/count variable – the number of outpatient visits – ranges from 0 to 150, but there are only 8 observations over 100, meaning that less than 2% of my data take up one-third of my plot.  There are also so many data points in the bottom left that I basically end up with a high-density clusterfuck (HDC) rather than an elegant representation of relative densities within the data.

Using the quantile() function in R, I exclude the top 5% of values of the count variable and set my horizontal scale accordingly with xlim().  In general, omitting outliers can be problematic, but this is just informal visualization.  To avoid overplotting in areas with lots of observations, I jitter the points and shade them according to the local frequency of the data in ggplot2:
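A sketch of that recipe on simulated data (the data frame and its columns are made-up stand-ins for the survey variables):

```r
library(ggplot2)

# Made-up stand-in for the MEPS visits/expenditures variables
set.seed(1)
meps <- data.frame(visits = rpois(2000, lambda = 3),
                   expend = rlnorm(2000, meanlog = 6, sdlog = 1.5))

# Cap the horizontal axis at the 95th percentile of the count variable
visit.cap <- quantile(meps$visits, 0.95)

ggplot(meps, aes(x = visits, y = log(expend))) +
  geom_jitter(width = 0.3, alpha = 0.2) +  # jitter + transparency fight overplotting
  xlim(0, visit.cap) +
  labs(x = "Number of outpatient visits", y = "log(expenditures)")
```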


Hmmm.  This isn’t “bad”, and it’s definitely “better”, but it reads more like a representation of those who used outpatient healthcare than of outpatient healthcare usage, because the very high density at 0 in the bottom left corner is hard to see.  It doesn’t really convey what proportion of people actually incurred healthcare expenditures, but rather how much they spent given that they did.

Again, my reasoning behind plotting expenditures against the number of visits is just to view the cost of healthcare in a broad sense – visits take time, and time is essentially money – not to “explain” expenditures by using the number of outpatient visits as an explanatory variable, as doing so would probably lead to confounding.  Confounding is the bias that results when an extraneous variable is correlated with both the dependent variable and an independent variable – a likely scenario for healthcare expenditures and number of visits, because the two are probably driven by similar underlying factors.  It is worth noting, though, that there looks to be a lot of variability in expenditures even within groups of observations where the number of outpatient visits is the same.

Using color is an interesting way to add a third variable to a scatterplot.  In R, all you have to do is set the color of the plot equal to a factor variable, which I’ve done below to indicate which of the survey respondents were insured (blue) and uninsured (red):


The problem with this plot (again) is that it doesn’t show the bulk of the data, which sits in the bottom left corner, very well.  It does sort of look like red dots (i.e. uninsured patients) get less common as you move northeast (increasing cost and number of visits), but it’s not the least bit conclusive.
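In ggplot2, the whole trick is mapping the factor to colour.  A sketch with a hypothetical insured flag on made-up stand-in data:

```r
library(ggplot2)

# Made-up stand-in data with a hypothetical insurance factor (not real MEPS data)
set.seed(1)
meps <- data.frame(
  visits  = rpois(2000, lambda = 3),
  expend  = rlnorm(2000, meanlog = 6, sdlog = 1.5),
  insured = factor(sample(c("insured", "uninsured"), 2000,
                          replace = TRUE, prob = c(0.8, 0.2)))
)

# Mapping the factor to colour adds the third variable to the scatterplot
ggplot(meps, aes(x = visits, y = log(expend), colour = insured)) +
  geom_jitter(width = 0.3, alpha = 0.3) +
  scale_colour_manual(values = c(insured = "blue", uninsured = "red"))
```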

Producing a boxplot of expenditures for each outcome of the binary variable – insured and uninsured – and arranging them side-by-side is an easier way to compare the data.  If the empirical distributions aren’t the same-ish, then we know there’s a difference between the typical outpatient expenditures of insured vs. uninsured people in the MEPS data set.  Boxplots are also “robust”, i.e. not sensitive to distributional assumptions, which basically means extreme values won’t affect their validity (this isn’t the case with the mean, for instance).  The line in the middle of each box is the median, a measure of central tendency that doesn’t care about the shape of the distribution.  Here’s the boxplot of outpatient expenditures by insurance status, which is interactive and linked to my plotly account (username: schapshow) so that anyone can access and manipulate my data:

Dist. of Outpatient Expenditures in MEPS
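A base-R sketch of that kind of side-by-side boxplot, again on simulated stand-in data rather than the real survey file (the interactive version above uses plotly, but the grouping logic is the same):

```r
# Simulated log expenditures by a hypothetical insurance factor (not real MEPS data)
set.seed(1)
log.expend <- c(rnorm(1500, mean = 5.8, sd = 1.2),          # "insured" group
                pmax(rnorm(500, mean = 2.0, sd = 3.0), 0))  # "uninsured", floored at 0
status <- factor(rep(c("insured", "uninsured"), times = c(1500, 500)))

# The formula interface draws one box per level of the factor
boxplot(log.expend ~ status, ylab = "log(outpatient expenditures)")
```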

Note that there’s no central line – i.e. no median, or 50th percentile – on the boxplot for uninsured respondents.  This isn’t an error; the median is just 0, which tells you that more than half of the 445 uninsured individuals spent $0 on outpatient care.  Since outpatient expenditures are right-skewed, though, we should expect a mean that’s much higher than the median – in this case, it’s about $450.  Compare this to a median and mean of $330 and almost $1,500, respectively, for insured people (on the boxplot the median shows as 5.8 because the scale is logged: exp(5.8) ≈ 330).

We can make boxplots according to any factor variable, not just binary ones.  Here are boxplots of outpatient expenditures based on the educational attainment of the respondent, which is broken up into factors representing college graduates, high school graduates, and people who didn’t finish high school:

Dist. of Outpatient Expenditures by Education Level in MEPS

Interestingly, the medians aren’t that different – $322 for college grads, $217 for high school grads, and $110 for those who didn’t finish high school.  The means are even closer: $1,419, $1,222, and $1,415.  The big difference is the 25th percentile – the expenditure that 25% of people spend less than and 75% of people spend more than – which is positive for college graduates and 0 for the other two groups.  So more than 75% of college graduates spend at least some money on outpatient healthcare, a larger proportion than in the other two groups.  In other words, college graduates are more likely to have positive outpatient expenditures, but not to spend considerably more.

In the following post, I’ll look at other variables’ impact on outpatient expenditures and perform a few different types of regressions.