The Poisson Model in Practice – Car Insurance Claims

Even in an introductory stats class, you’ve probably at least heard of the Poisson Model – a mathematical framework describing the relationship between Poisson events and the elapsed time between their occurrences (usually called the “waiting time”).  Under this model, the Poisson Distribution gives the probability that a specified number of random events will occur in a given time period, based on one parameter (lambda) representing the average rate at which events occur per unit time.  The Exponential Distribution defines the waiting time until a Poisson event occurs.  More importantly, it can be generalized to apply to the time between successive events, not just the occurrence of the first.  Its single parameter, the mean waiting time, is the reciprocal of the Poisson rate lambda.

The topic of this post is just to show that many real-world occurrences of random events do in fact follow a Poisson-ish model.  Textbooks tend to use outdated or uninteresting examples that students can’t easily adapt to current data, making them feel irrelevant – for example, the number of phone calls received at a call center in an x-minute interval or the number of typos on a page.  Sorry but I don’t give a shit about call centers and spell-checking software is so good now that if the frequency of your spelling errors concerns you, your issue probably isn’t statistical in nature.  One classic example I actually find pretty interesting, though, is the application of the Poisson distribution to model the number of flying-bomb hits in 576 equally-sized, small plots of land in South London during WWII, which first appeared in a textbook by William Feller in 1957 (more here).

More relevant is the fact that the number of car accidents at a given location tends to follow a Poisson distribution.  Some assumptions of the Poisson model are unlikely to be totally satisfied in empirical data; for instance, Poisson events are assumed to be independent, which may not be true for car accidents due to road conditions and other factors.  However, using data collected by the General Insurance Association of Singapore (website), I’ll show that car insurance claims made by a group of insured drivers tend to follow a distribution resembling the Poisson in practice.

The dim() command in R tells me that the data set is comprised of 7,483 observations of 15 variables reflecting characteristics of the insured drivers (sex, age, driving history, etc.) and their vehicles (type, age, etc.).  Here are the first 8 rows of the data frame:

[Figure: the first 8 rows of the Singapore data frame]

Detailed variable descriptions are available here (page 21), but for now I’m only going to consider the variable “Clm_Count” which is the number of claims the insured driver made during the year.  Type attach(Singapore) so that you can work with the variables in the data frame without using the super fucking annoying and tedious dataset$variable format.  Also, remember that R is case sensitive and both C’s are capitalized in Clm_Count.

By typing summary(Clm_Count) we see that the minimum number of claims made is 0, the maximum is 3, the median is 0, and the average is 0.06989.  Clearly, the number of claims made by insured drivers is right-skewed, with claims occurring at an average rate of 0.06989 per insured.  If we assume car accidents occur independently over time at a constant rate, we should be able to model insurance claims as Poisson events – that is, the probability of k insurance claims being filed by an insured driver would be Poi(k, lambda=0.06989).  Mathematically, this is equal to: [(e^-0.06989)*(0.06989^k)]/k!, k=0, 1, 2, 3, … .  The mean and variance are both equal to lambda, the average rate of 0.06989 claims.

So if the insurance claims in the Singapore data set are Poisson events, we’d expect the following percentage breakdown in number of claims:

  • 0 claims – 93.25%
  • 1 claim – 6.52%
  • 2 claims – 0.23%
  • 3 claims – 0.01%
  • 4 claims – 0%

Which I obtained via the following code:

x = c(0:4)

poisson.model <- dpois(x, 0.06989)*100

poisson.model <- round(poisson.model, 2)

poisson.model <- paste(poisson.model, "%", sep="")

poisson.model

This corresponds to the following expected claim frequency in a sample of 7,483 insured drivers in Singapore:

  • 0 claims – 6,977.9
  • 1 claim – 487.7
  • 2 claims – 17
  • 3 claims – 0.4
  • 4 claims – 0
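
These expected counts are just the Poisson percentages above scaled up to the 7,483 insureds – a one-liner along these lines reproduces them:

# expected number of insureds filing 0-4 claims under the Poisson model
round(dpois(0:4, 0.06989) * 7483, 1)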

So if the insurance claims are Poisson events, that’s what we’d expect.  Here is the actual claim breakdown in the data set, or the observed frequency:

[Figure: barplot of the empirical claim frequency]

  • 6,996 insureds made 0 claims
  • 455 insureds made 1 claim
  • 28 insureds made 2 claims
  • 4 insureds made 3 claims

observed <- as.data.frame(table(Singapore$Clm_Count))  # observed claim counts, used for the legend
barplot(table(Singapore$Clm_Count), col=c("darkblue","red","green","slateblue"),
        xlab="Number of Claims Filed by Policyholder",
        main="Empirical Claim Frequency (n=7483)", legend=observed$Freq)

So, the observed frequency doesn’t differ all that much from the expected frequency suggested by the Poisson model.  That is, car insurance claims made by a pool of insureds (specifically, just over 7,000 demographically diverse drivers in Singapore) could possibly be viewed as Poisson events.  But to see where the model fails us, here’s a plot of the excess claims observed in the empirical data (i.e. observed number of claims less the expected number of claims under the Poisson model):

[Figure: excess claims – observed minus expected under the Poisson model]

  • The number of insureds who made zero claims was 18.1 more than predicted by the Poisson model
  • The number of insureds who made one claim was 32.7 less than predicted by the Poisson model
  • The number of insureds who made two claims was 11 more than predicted by the Poisson model
  • The number of insureds who made three claims was 3.6 more than predicted by the Poisson model
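
For reference, these excess figures come straight from subtracting the expected counts from the observed table (reusing the observed data frame defined for the barplot above):

# observed minus expected claim counts for 0-3 claims
expected <- dpois(0:3, 0.06989) * 7483
round(observed$Freq - expected, 1)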

 

Basic Exploratory Data Analysis of Medical Expenditures in R

The data I’m working with is just one small component of the Medical Expenditure Panel Survey (MEPS).  In its entirety, MEPS offers probably the most complete look at the current state of healthcare spending in the U.S. from a range of relevant perspectives including individuals, insurers, and the government.  About 2000 MEPS survey respondents with diabetes were sampled for a public health study to create the data set I’ll be using.  It’s much more manageable in terms of size, plus diabetes is fascinating to me for personal reasons – a statistically unusual number of my close friends are diabetic, and my favorite commercial of all time is the one where the old southern guy pronounces diabetes as “diabeetus”.

First I want to get a clear picture of my sample demographic; knowing who comprises my sample, in terms of gender, race, age, etc., is necessary if I am to isolate and uncover relationships between variables within the data set.  Using subset I can split the data by gender and illustrate the proportions with a pie chart (the details of which I’ll omit cause it’s really simple):

[Figure: pie chart of MEPS respondents by gender]

 

Slightly more women were surveyed in MEPS – 57% of total respondents.  Overall, 65% of respondents are white, 25% are black, and 6% are Asian.  A very small number are Hawaiian.

Male respondents are more likely to be white: 70% of male MEPS respondents are white, 19% black, and 7% Asian.  Correspondingly, only 61% of females are white, while 30% are black and 5% are Asian.

While none of the demographic facts I just listed seem capable of causing issues, it’s probably a good idea to keep note of the main findings: MEPS contains more women than men; men are disproportionately white, women are disproportionately black.

There are three numeric variables of interest in the data set – expenditures, income, and BMI.  Below I make a chart demonstrating just how easy it is to compute descriptive statistics like the mean, median, and standard deviation for multiple variables and report the results in R.  It takes the same amount of effort (probably less) as using an Excel spreadsheet, and I can even export my computations directly to an Excel file if I want:

attach(diabetes)
options(digits=5)
Xymat = data.frame(cbind(EXPEND, INCOME, BMI))
mean.stat = sapply(Xymat, mean, na.rm=TRUE)
sd.stat = sapply(Xymat, sd, na.rm=TRUE)
med.stat = sapply(Xymat, median, na.rm=TRUE)
summ.stat = cbind(mean.stat, med.stat, sd.stat)
summ.stat   # print the summary table

             mean   median         sd
EXPEND  11161.743   4803.5   18331.25
INCOME    47816.2    34518    43510.9
BMI     31.765467     30.7   7.259151
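
And if I actually did want that Excel file, one more line writes the summary table to a CSV that Excel opens directly (the file name here is just an example):

# export the summary statistics to a CSV file
write.csv(summ.stat, "meps_summary_stats.csv")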

It’s important to note that the low medians relative to the means of expenditures and income indicate that both are right-skewed (positively skewed), so you don’t waste time trying to create standard visualizations that will simply never pan out.  This shouldn’t come as a surprise, though, as almost all data dealing with money tends to be right-skewed, with income being a particularly common and well-known example.  For exploratory analysis and visualization it makes more sense for me to work with the logarithm of expenditures and income, so I’ll do that going forward.  The sample stats computed on BMI don’t suggest glaring distributional issues, so let’s start with some basics:

I. One-dimensional Density Plots

While not exactly thrilling or stimulating, histograms are an effective tool to represent the distribution of sample data.  Karl Pearson introduced them over a century ago as a means to estimate the probability distribution of a population from which a sample was drawn.  Thanks to Karl, we know that by splitting sample observations into bins along the x axis and vertically plotting their frequency within each bin, we get a relative frequency distribution that should be similar in shape to the theoretical probability distribution.  But histograms aren’t a silver bullet, mainly because their shape (and thus your estimate of the distribution) depends on you selecting an appropriate bin width, a task that’s surprisingly easy to fuck up.  There also isn’t a whole lot you can do with super skewed data and histograms – usually your best bet is a variable transformation.  The MASS package in R makes really good histograms with the truehist() command:
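
Something along these lines produces the plot below (assuming, as above, that BMI is a column of the diabetes data frame):

# density-scaled histogram of BMI using MASS::truehist
library(MASS)
truehist(na.omit(diabetes$BMI), col="grey", xlab="BMI", main="BMI in MEPS")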

[Figure: truehist of BMI]

The details are technical, but broadly speaking we can overcome the bin width issue posed by histograms by estimating the shape of the population directly from our sampled values – a technique called Kernel Density Estimation (KDE).  Rather than arranging data in bins, a kernel density estimate smooths the discrete sample observations to obtain a continuous curve that fits the data and satisfies the mathematical conditions of a probability density function.  The most intuitive rationale for KDE is that your data is noisy – i.e. subject to random error and whatnot – and if you could smooth away this noise you’d be left with the probability density function from which the sample was drawn.  A KDE gives a more refined estimate than a histogram; it just wasn’t computationally feasible at the time histograms were developed.  Below, I plotted both distribution estimates – a histogram and a density estimate – on the same plot.  Forgive the terrible formatting, I’m just trying to illustrate that a KDE is basically a smoothed histogram:
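
A rough sketch of how I’d build that overlay (again assuming BMI lives in the diabetes data frame):

# histogram on the density scale with a kernel density estimate drawn on top
library(MASS)
bmi <- na.omit(diabetes$BMI)
truehist(bmi, col="grey", xlab="BMI", main="Histogram vs. KDE of BMI")
lines(density(bmi), lwd=2)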

[Figure: histogram of BMI with an overlaid kernel density estimate]

To compute a KDE in R, any of the following code will work.  Number two requires the lattice package and number three requires ggplot2, which is undoubtedly the best way to make aesthetically pleasing R graphs.  Rather than embed four plots of the same statistical concept, I posted one generic, barely formatted density estimate of BMI in MEPS below, followed by code that will produce very similar results but will look a little different formatting-wise:

[Figure: kernel density estimate of BMI]

  1. Directly transform a vector of values into a density using density() and plot the result using one of R’s standard base plotting capabilities such as plot() or lines()

density.bmi <- density(diabetes$BMI)

plot(density.bmi, type="l", lwd=3, main="BMI in MEPS")

  2. Use the lattice package for a slightly more visually pleasing plot that includes a ‘rug’, i.e. a display of actual data points as they occur in the sample plotted on the x axis below the density curve:

library(lattice)
densityplot(~diabetes$BMI, main="BMI in MEPS")

  3. Use one of two ggplot2 methods – the simpler qplot() or the more involved ggplot():

library(ggplot2)
z <- ggplot(diabetes, aes(x=BMI))
z + geom_density()

or

qplot(BMI, data=diabetes, geom="density",
      main="Distribution of BMI in MEPS", xlab="BMI",
      ylab="Density")

II. Exploratory Data Analysis with Density Plots

A fundamental principle of exploratory data analysis is to show relevant and conceptually sound comparisons between variables or data in general.  Just plotting data without much forethought very rarely emphasizes the key relationships and characteristics, and quite often it misleads viewers and invites them to make arbitrary and meaningless comparisons.  To demonstrate, I’ll be analyzing BMI with respect to factor variables and binary variables.

BMI vs Gender

To illustrate the difference in the BMIs of male vs. female diabetics, I need more than sample statistics from each cohort; a point estimate is not a useful quantity to visualize (as ‘point’ alludes to).  Breaking up a density plot of BMI into groups by gender could be a good approach, but only if the groups are scaled appropriately!  For instance, the histogram below is NOT an effective visualization because the y axis is ‘counts’ and our data set isn’t split equally between men and women – thus the female histogram dwarfs the male one.  I also can’t really tell if the female histogram legitimately has more spread or if it’s just bigger because it has more observations:

[Figure: overlaid count histograms of BMI by gender]

If we instead plot the relative densities of the variables, we can compare them one-to-one, which is what we want.  Also, I think since I’m overlapping the densities on the same plot I should use a smooth line density estimate instead of an opaque histogram, just to be able to see and highlight any differences:
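
Here’s a sketch of one way to do that with ggplot2, assuming FEMALE is the 0/1 gender indicator in the data set:

# overlaid density estimates of BMI by gender
library(ggplot2)
ggplot(diabetes, aes(x=BMI, colour=factor(FEMALE))) +
  geom_density() +
  labs(title="BMI in MEPS by Gender", colour="Female")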

[Figure: overlaid density estimates of BMI by gender (ggplot2)]

Now it’s clear from the right tails of the density estimates that the distribution of BMIs for diabetic men is narrower than the distribution of diabetic women’s BMIs.  The height of the density line at a given x value can be interpreted as the relative likelihood of observations occurring near that BMI.  For every BMI value greater than the curves’ intersection at approximately 35, the positions of the density lines tell us that observations are rarer for men.  Based on this, we should expect the BMI of men to have a lower mean and a lower standard deviation.

To annotate a plot with arbitrary text at any (x,y) coordinate I simply use the text() command, as I did below to report the mean and standard deviation of men vs. women.  I used the base plotting system instead of ggplot2 just to show some variety – I think it looks way worse, personally.  First I subset the data set by gender, and to get both densities on the same chart area I use plot() and add the second density with lines():

female.data <- with(diabetes, subset(BMI, FEMALE==1))
female.density <- density(female.data)
male.data <- with(diabetes, subset(BMI, FEMALE==0))
male.density <- density(male.data)

plot(female.density, type="l", lwd=3, col="red", ylim=c(0,0.081),
     main="BMI in MEPS")
lines(male.density, col="blue", lwd=3)
legend("topright", inset=.05,
       c("Female","Male"), fill=c("red","blue"), horiz=TRUE)
text(60, .05, "Mean=32.5", col="red")
text(75, .05, "Mean=30.8", col="blue")
text(60, .04, "SD=7.9", col="red")
text(75, .04, "SD=6.1", col="blue")

 

[Figure: base R density plot of BMI by gender, annotated with means and SDs]

 

A boxplot is another visual representation of a distribution, but its shape is dictated by the empirical quantiles observed in the sample.  Personally, I hate boxplots because I think they look childish.  However, they admittedly offer a pretty effective visual summary of the distribution of data.  More importantly, boxplots are built from quantiles, which aren’t sensitive to extreme values, so they’re not thrown off by anomalies.  Even very infrequent extreme values, on the other hand, have a strong impact on statistics like the sample mean, so you generally can’t be certain that the mean reflects the central tendency of a distribution and not some anomaly in the sample.  So I think they look like shit, but since they remind us to consider prudent, nonparametric analysis, and since I used distribution-sensitive metrics in my density plots, here are boxplots broken up by gender:
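
Something along these lines generates them (again treating FEMALE as a 0/1 indicator):

# side-by-side boxplots of BMI by gender
boxplot(BMI ~ FEMALE, data=diabetes,
        names=c("Male","Female"),
        ylab="BMI", main="BMI by Gender")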

[Figure: boxplots of BMI by gender]

The median for females looks a little bit higher.  The outliers make the graphic kinda shitty – the box is almost too small, but I can’t make my y range narrower without omitting observations.  I could try imposing a jitter, which will fill in some raw data and hopefully show me the outlier situation more clearly.
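
One way to sketch that jittered version with ggplot2 (hiding the boxplot’s own outlier points so they aren’t drawn twice):

# boxplots with the raw BMI values jittered on top
library(ggplot2)
ggplot(diabetes, aes(x=factor(FEMALE), y=BMI)) +
  geom_boxplot(outlier.shape=NA) +
  geom_jitter(width=0.2, alpha=0.3) +
  labs(x="Gender (0 = male, 1 = female)", y="BMI")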

[Figure: jittered boxplots of BMI by gender]

Next post I’ll consider ways to represent the statistical relationship between independent variables.  I’ll also create factor variables to look for patterns in a continuous dependent variable and extend density plots to more than two variables – a preview of which, using the logged expenditures and mental health score of MEPS respondents, is posted below.

[Figures: density plots of logged expenditures by mental health score]

Some Friendly Advice for Washed-up Celebs Staging a Comeback (Part 1)

This is probably the only post I’ll ever write about celebrities (of the Hollywood variety, I guess I should say) since I’m mostly oblivious to what they do and happy to be so.  Don’t get me wrong, I’ll tune in for a public mental breakdown and fan the flames when it happens, but I don’t really care what Britney Spears and Amanda Bynes are up to if they aren’t wreaking havoc.

The other day my sister excitedly told me that Aaron Carter followed her on twitter.  Yes the Aaron Carter of “Aaron’s party” fame who hasn’t done shit (that I’m aware of) in a decade.  His fame was buoyed by his brother’s high profile and super abusive relationship with Paris Hilton, another celebrity we might be tempted to classify as washed up.  Anyway, I genuinely assumed that Aaron Carter was, I don’t fucking know, an accountant or something now – I mean for how long can you chase the fame you really only ever caught a glimpse of for a few months in the early 2000s?  And how does he earn a living now if he’s irrelevant to most people?  As a crude ‘litmus test’ of his fame as of late, I analyzed the daily number of searches he generates on Wikipedia over time, starting with 2011:

[Figure: daily Wikipedia searches for Aaron Carter, 2011 onward]

He’s had quite a few short-lived spikes in popularity, evidenced by the solitary data points above the clustered trend.  So if these are publicity stunts orchestrated by his agent, he needs to reconsider his approach as they’re not helping him long term.  His max took place in Feb 2012, and since 2013 there’s been a fairly pronounced downtrend.

I’ll get back to measuring Aaron’s relevancy, but first consider the careers of two much more talented singers – Adele and Lady Gaga.  They both fell off the map (my map anyway) for what feels like a few years at least only to make high profile returns to the limelight this year.  Adele did so with her first single off of her new album, and Lady Gaga kills and eats people on American Horror Story now.  Adele’s video for “Hello” smashed Vevo’s view record (sorry T Swift), setting the stage for what’ll probably be another successful album.  While Lady Gaga is known for her music mostly, landing a permanent role on a high-profile TV show is an impressive and lucrative career move.  “Hello” is a great song and video, and AHS is great if you’re into super fucked up terrifying shit, but I wonder if the comebacks of Adele and Lady Gaga have been bolstered by their absences in recent years.  They say absence makes the heart grow fonder, and maybe Aaron Carter would be better off in terms of long-term relevancy and popularity by adopting a similar approach.

From the plot below, you can see that Lady Gaga’s slow descent into irrelevancy started in 2012 – probably because the Fame Monster’s smash success was hard to top, not to mention her follow-up, Artpop, is garbage in its own right so it never stood a chance.  There’s definitely a pronounced downtrend, but she’s actually stayed quite relevant during the past five years in terms of absolute number of searches.  She must have had a lot of success outside of the U.S. because I haven’t heard shit from her or about her.  And while she isn’t near her 2012 peak, she’s no Aaron Carter:

[Figure: daily Wikipedia searches for Lady Gaga]

Adele, who historically generates just over 5000 searches a day, set a personal record on October 23 immediately following the release of “Hello” with 127,000, over 25 times her average.  You can see how unusual the hits recorded in the past few weeks are given her five year average, but, more importantly, you can see a sequence of unusually high values emerging on the far right side rather than just one blip.  It’s too early to say so decisively, but this might suggest a pronounced uptrend in average searches for Adele – i.e. a true ‘comeback’ as opposed to a fluke.

[Figure: daily Wikipedia searches for Adele]

That the popularity of Adele and Lady Gaga follows fairly pronounced trends isn’t surprising – their jobs involve periods of time spent working on music/TV somewhat privately, followed by periods during which they promote their work with press tours and shit.  The fundamental changes in activities and behavior that take place produce an obvious trend, not random fluctuations around a mean value.  For that, we’d need to analyze a celebrity who doesn’t have a job or any talent to cultivate – for instance, Kylie Jenner:

[Figure: daily Wikipedia searches for Kylie Jenner]

Just for the sake of comparison, here are all the celebrities plotted on the same axis.  When making comparisons, keep in mind Wikipedia probably isn’t the preferred destination for diehard Kylie Jenner fans – if you’re not careful you might run into something academic on there.

[Figure: Wikipedia searches for all four celebrities on the same axis]

Simulating A Fund’s Value Over Time in R

Suppose you have $1 million invested in a trust fund returning 8% annually with a standard deviation of 30%.  Assume that inflation averages 3% over the term with a standard deviation of 9%.  You plan to withdraw $80,000 a year in monthly installments of $6,666.67.  What are the chances that the $1 million lasts you ten years?

This problem can be modeled using simulation.  The fund will lose and gain money according to a process that has both deterministic and random components.  The interest rate will fluctuate in a similar fashion and impact the fund’s bottom line.  The goal is to model how likely the fund is to last the desired 10 years.

I’ve seen problems like this approached by simulating random price paths in excel, but they almost always look like shit because each solitary trial involves periodic iterations, quickly turning the spreadsheet into a clusterfuck. Excel also makes it incredibly hard to make a simple histogram or Kernel Density Estimate, which I personally think are the most effective approaches to model the fund’s survival.  Instead of Excel, I’ll use R to quickly visualize the behavior of the fund and compute descriptive and inferential statistics using the same (admittedly questionable) Brownian Motion approach.  First set the parameters:

#Fund balance and rates
start.capital = 1000000
annual.mean.return = 8 / 100
annual.ret.std.dev = 30 / 100
annual.inflation = 3 / 100
annual.inf.std.dev = 9 / 100

# Withdrawals

monthly.withdrawals = 80000/12
n.obs = 10*12

# Number of simulations

n.sim = 100

It’s easiest to treat each iteration of the simulation as a sequence of 120 months and convert our annual return and inflation measures into monthly ones:

# Monthly investment and inflation assumptions

monthly.mean.return = annual.mean.return / 12
monthly.ret.std.dev = annual.ret.std.dev / sqrt(12)
monthly.inflation = annual.inflation / 12
monthly.inf.std.dev = annual.inf.std.dev / sqrt(12)

With just one line of code, R can perform a wide range of simulations.  Some are parametric, meaning their validity depends on assumptions that must be made about the shape of the underlying distribution.  Other techniques, like bootstrapping and subsampling, are called robust, meaning distributional assumptions aren’t necessary for sound results.

Brownian motion is a process characterized by periodic returns that are normally distributed and independent of each other.  To simulate it, we first generate normally distributed, independent returns and inflation rates for each of the 120 periods over which we will model the fund’s value.  This completes one simulated ten-year period.  Since we can’t learn shit from just one simulated outcome, we repeat the process as many times as we want – the more iterations, the more accurate the model.  First we generate the returns and inflation rates needed to simulate 100 outcomes:
monthly.invest.returns = matrix(0, n.obs, n.sim)
monthly.inflation.returns = matrix(0, n.obs, n.sim)

monthly.invest.returns[] = rnorm(n.obs * n.sim, mean = monthly.mean.return, sd = monthly.ret.std.dev)
monthly.inflation.returns[] = rnorm(n.obs * n.sim, mean = monthly.inflation, sd = monthly.inf.std.dev)

So now we’ve generated a matrix monthly.invest.returns with 120 rows, corresponding to months, and 100 columns, corresponding to iterations of the entire 120-month random process.  The matrix monthly.inflation.returns is the same thing but for inflation.

The fund starts out with $1 million, and each period three adjustments are made to the net asset value (NAV): it is multiplied by a factor of 1 plus the random return, it’s discounted by the random inflation rate, and a withdrawal is made.  So, to get the NAV at time x+1, the previous NAV is multiplied by:

(1 + monthly.invest.returns[x, ] - monthly.inflation.returns[x, ])

and then the monthly withdrawal is subtracted.

Notice I left the column index blank when indexing the matrices because I want to calculate values for month x (row x) across all 100 columns at once – a simple loop over the rows is all it takes, since R’s arithmetic is vectorized across the columns.  Putting everything together, I have:

nav = matrix(start.capital, n.obs + 1, n.sim)
for (j in 1:n.obs) {
  nav[j + 1, ] = nav[j, ] * (1 + monthly.invest.returns[j, ] - monthly.inflation.returns[j, ]) - monthly.withdrawals
}

Since negative fund values don’t make any sense, I tell R to replace negative fund values with NA.  Characterizing all such funds this way also gives me an easy way to count how many cases there are:

nav[ nav < 0 ] = NA

p.alive = 1 - rowSums(is.na(nav)) / ncol(nav)

The vector p.alive gives, month by month, the percentage of funds that haven’t yet run out of money.  Plotting the probability of fund survival by period gives us an idea of when funds start to experience solvency troubles:
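
A quick sketch of that plot:

# proportion of funds still solvent at the start and after each of the 120 months
plot(0:n.obs, p.alive, type="l", lwd=2,
     xlab="Month", ylab="Proportion of funds still solvent",
     main="Fund Survival Over Time")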

[Figure: proportion of funds still solvent by month]

About halfway through the 10 years the percentage of living funds turns downward and continues on a roughly linear trend until the end of the term, though the proportion of living funds remains larger than the proportion of insolvent funds throughout.

I can get a snapshot of how strongly the 100 funds finished by printing out the last row of the matrix (the NAV after month 120):
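
Which is just:

# final NAVs for all 100 funds (insolvent funds show up as NA)
nav[n.obs + 1, ]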

[Output: final NAVs of the 100 simulated funds]

If I wanna know how big a typical surviving fund’s financial cushion is, I’m going to want to sort the final NAVs into groups.  Aesthetically, the NA values everywhere just kind of fuck me up.  Use the ! operator with is.na() to keep the final NAV values that aren’t NA (i.e. all the surviving funds):
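
The final.nav vector used in the plotting code below comes from exactly this step – something like:

# keep only the final balances of funds that survived all 120 months
final.nav <- nav[n.obs + 1, ]
final.nav <- final.nav[ !is.na(final.nav) ]
length(final.nav)   # how many funds made it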

[Output: final NAVs of the surviving funds]

So of the 100 funds, 56 survived the entire 120 months.  Formally, they represent the conditional distribution of the funds given that they survived to month 120.  This might be of interest in a more detailed analysis, but for now I’ll just point out one thing – only about half of the funds survived, yet more than a few of the survivors have balances well above $1 million, which suggests to me that the issue stems from too much risk (variability), not insufficient return.

Finally, R makes it very easy to plot the density of the surviving funds, which is hard and annoying to do in Excel:

# nothing to plot if no funds survived
if (length(final.nav) == 0) stop("no surviving funds")
plot(density(final.nav, from=0, to=max(final.nav)), las=1, xlab="Final Capital",
     main=paste("Distribution of Final Balance,", 100 * p.alive[120], "% still solvent"))

[Figure: density of final balances for the surviving funds]

Ta-da.  Personally, I think a density estimate of the final fund value for surviving funds, appropriately labeled with the proportion of the total 100 funds that belong to this group, gives a clearer, more precise, and more complete summary of the simulation results than could be obtained by just simulating and plotting price paths in Excel.

The Raw and Central Moments of a Probability Distribution

Moments are expectations of quantities involving a random variable X that reveal characteristics of X’s distribution, most of which deal with its shape.  In the real world, we often can’t justify making assumptions about an unknown distribution’s shape, which means we can’t enjoy the simplicity and predictability of assuming a bell-shaped curve.  As such, questions about the distribution’s symmetry and tail mass are critical.  We answer them using the skewness, a function of the third central moment, and the kurtosis, a function of the fourth central moment, respectively.  The word document below mathematically derives general formulas for all of a distribution’s moments, provides a mathematical relationship between raw and central moments, and gives formulas to calculate the skewness and kurtosis.
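
For reference, the standard definitions (with mean $\mu$ and variance $\sigma^2$) are:

$$\mu'_k = E\left[X^k\right] \;\text{(k-th raw moment)}, \qquad \mu_k = E\left[(X-\mu)^k\right] \;\text{(k-th central moment)},$$

$$\text{skewness} = \frac{\mu_3}{\sigma^3}, \qquad \text{kurtosis} = \frac{\mu_4}{\sigma^4}.$$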

Limitations of the Sample Average as a Measure of Central Tendency

On his blog Understanding Uncertainty, David talks about the limitations of the sample mean as a descriptive statistic for data related to health and income.  Of course the same could be said about countless other data sources, but health and income (which I’m using broadly to include shit like the income of a country, i.e. GDP, as well as personal income and other data series related to money) are especially “high profile” you could say.   Hardly a day goes by where a journalist or politician doesn’t make a statistical claim related to one or both.

Income tends to be very positively skewed, making the median a better statistic than the mean to estimate the central tendency of the distribution.  Intuitively, while the sample mean serves as a good estimate of average income, the sample median more accurately reflects the income of an average person.  The latter is probably more consistent with what most people think of when they hear “average income” – the income of the average person in the economy.  This number doesn’t care about extremes at either end of the distribution, whereas the mean is highly susceptible to outliers.  Because income is positively skewed, higher incomes pull up the average, making it look as though the typical person earns more than he/she actually does.  In the U.S., for example, the bottom quintile of workers earns about 4% of the total wealth, compared to the top quintile’s approx. 38% share (source), which tells us that extreme values, in terms of distance from the mean, tend to lie above the mean (and potentially far, far above it).  Basically, relatively low incomes are commonplace, but super, crazy high incomes (like Beyonce or Bill Gates), which are outliers in terms of the typical deviation, aren’t unheard of.

Life expectancy is a similarly misleading statistic – it is calculated as the average number of future years an individual is expected to live at his or her current age.  The probability distribution of survival, however, is left-skewed, so it probably isn’t a great idea to use the sample mean to measure central tendency.

To show that survival probability is skewed, David used mortality rates to plot the number of deaths expected each year in a cohort of 100,000 people.  Below are mortality rates, which vary over time and by gender:

[Figure: mortality rates by age and gender]

The initial sharp decrease in mortality exhibited in the graph (at age 0) reflects the reality that complications leading to death are fairly common in the first few years of life, especially for newborns not yet one year old.  The rest of the graph shows a roughly linear upward trend in mortality, which means that, unsurprisingly, the probability of survival decreases with age.

To see that the age-mortality relationship suggests a left-skewed distribution of life expectancy, David plots the expected number of deaths each year of a cohort of 100,000 individuals (separated by gender) implied by current mortality rates:

[Figure: expected number of deaths at each age in a cohort of 100,000 (by gender)]

The mean of the distribution is 83, but the median is 86 – so a person lives to be 83 on average, but the person at the center of the survival distribution lives to be 86.  The mode, or most common value, is 90.

It’s worth noting that over the past 40 years, mortality has improved by about 1.5% a year for women and 2% a year for men, so our assumption that mortality won’t improve is probably wrong.  However, attempts to account for future expected improvements in mortality are sometimes met with criticism as well, because experts disagree on the degree to which mortality will continue to improve in the future.  Women still live longer than men on average, but the gap has narrowed.

Another way to show the relationship between mortality and age is by using mortality rates to plot the number of people still surviving in a cohort of 100,000 at each age:

[Figure: number of survivors at each age in a cohort of 100,000 (by gender)]

Personally, visualizing the demise of the cohort in this way makes me feel weird – it’s just kind of fucked up to watch the cohort of 100,000 people get picked off one by one (even though they’re hypothetical people).  Also, relatively small differences in mortality rates lead to significant differences in the size of the cohorts for each gender.  At their most divergent point (corresponding to an age on the x axis), the female cohort has over 15,000 more surviving members than the male cohort.  This gap narrows as both cohorts get super old, though.

LOST CAUSES IN STATISTICS I: Finite Additivity

Well said!

Normal Deviate


I decided that I’ll write an occasional post about lost causes in statistics. (The title is motivated by Streater (2007).) Today’s post is about finitely additive probability (FAP).

Recall how we usually define probability. We start with a sample space $S$ and a $\sigma$-algebra of events $\mathcal{A}$. A real-valued function $P$ on $\mathcal{A}$ is a probability measure if it satisfies three axioms:

(A1) $P(A) \geq 0$ for each $A \in \mathcal{A}$.

(A2) $P(S) = 1$.

(A3) If $A_1, A_2, \ldots$ is a sequence of disjoint events then

$$P\Bigl(A_1 \cup A_2 \cup \cdots\Bigr) = \sum_{i=1}^{\infty} P(A_i).$$

The third axiom, countable additivity, is rejected by some extremists. In particular, Bruno de Finetti was a vocal opponent of (A3). He insisted that probability should only be required to satisfy the additivity rule for finite unions…


The Standard Error of The Mean

The standard deviation of the sample means is known as the standard error of the mean (SE).  When we use sampling to estimate the true mean of the population, we necessarily incur some error.  We should expect this – sampling depends on the presence of randomness, after all.  A natural question to ask is how well the sample mean approximates the true mean.  We can’t answer this question exactly, but we can measure how different the sample mean could potentially be from the true mean with the standard error of the mean.  What the standard deviation is to a population, the standard error is to a sampling distribution – it’s the same concept applied to a different distribution.

So, the SE is our measure of spread for sampling distributions.  What it represents is the same principle as the SD, but it will always be smaller than the SD.  The SE also decreases as the sample size increases, but not linearly – reducing variability by increasing the sample size quickly becomes expensive due to diminishing returns.  Intuitively, the SE is smaller than the SD because each sample mean dampens the effect of outliers, which tends to flatten the tails of the distribution and create a large centralized peak at the mean.

Suppose I open a bakery and sell 24 muffins per day on average, with a SD of 4.  On any given day, the probability that I sell more than 28 muffins is easy to compute: take Z = (28-24)/4 = 1, compute the cumulative probability of the standard normal at Z=1, and subtract this value from 1.  In Excel, the formula is 1-NORMDIST(28,24,4,1) = 15.866%.

Now suppose I want to look at the average muffins sold per day during a 4 day period, because I don’t want a fluke to fuck up my conclusions.  The average over the 4 day period is distributed totally differently than the number of muffins sold on a random single day.  It follows the sampling distribution of the sample mean, and thus its spread is defined by the standard error SE = SD/sqrt(4) instead of the SD.  This is much less variable than the single day case.  The probability that the average muffin sales over 4 days are greater than 28 is obtained from the normal distribution with Z = (28-24)/(4/sqrt(4)) = 2, which turns out to be just over 2%.

We used the distribution of daily muffin sales to find that the probability of selling more than 28 muffins in a day is 15.9%, and we used the sampling distribution of the sample mean to compute the probability that average muffin sales over a 4 day period are greater than 28, which is only about 2%.  Here are the graphs, which show that the average over 4 days is much less variable, as we should expect:

[Figure: distribution of daily muffin sales vs. the 4-day average]
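
The same two probabilities fall out of a couple of pnorm() calls in R:

# single day: P(sales > 28) with mean 24 and SD 4
1 - pnorm(28, mean=24, sd=4)             # about 0.159
# 4-day average: the SD shrinks to the standard error, 4/sqrt(4) = 2
1 - pnorm(28, mean=24, sd=4/sqrt(4))     # about 0.023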

To compute the second probability, we used the fact that, for a sample of size n from a population distributed normally with mean µ and standard deviation σ, the Z score is distributed as a standard normal.

  • Z = (X - µ)/(σ/sqrt(n)) is distributed as a standard normal, where X is the sample mean

But if we don’t know the standard deviation, which is the more likely scenario, this is not necessarily the case.  If we don’t know the population SD we approximate it using the sample standard deviation, S, and the resulting Z score is distributed as a Student’s t with n-1 degrees of freedom.

  • Z = (X - µ)/(S/sqrt(n)) is distributed as a Student’s t with n-1 degrees of freedom, where X is again the sample mean

Realistically, you’ll never know the population SD.  If you did, you’d almost definitely know the mean as well, in which case there’s no point in using sampling distributions and constructing confidence intervals to estimate its value.  If the SD is already known there’s also probably not much interesting new information to be learned from the data.

Let’s suppose we don’t know for sure that my muffin sales have a SD of 4, and instead assume that 4 is a sample estimate (S = 4).  Our Z score doesn’t change, but to compute the probabilities of outcomes we now use the Student’s t distribution with 3 degrees of freedom, not the normal distribution.  In R, we compute the probability that the 4 day average sales exceed 28 by finding the cumulative probability with the command pt(2, df=3, lower.tail=TRUE) = 93% and then subtracting it from 1.  So the chance that I’ll sell more than 28 over the 4 days on average is about 7%, which is more than the probability implied by the normal distribution.
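
Or, skipping the subtraction by asking R for the upper tail directly:

# t-based tail probability vs. the normal-based one
pt(2, df=3, lower.tail=FALSE)     # about 0.070
pnorm(2, lower.tail=FALSE)        # about 0.023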

Conveniently, the T distribution tends towards the normal distribution as the degrees of freedom, which are determined by the sample size, get larger.  So, even though Z is only normally distributed when the population SD is known, we can actually use the normal distribution if it’s unknown so long as the sample size, n, is large enough.  Therefore, if we’re sampling about 30 or more observations from a distribution, we can say its Z score is normally distributed even if the SD is estimated with the sample standard deviation S.  The plot below shows that when the df is 1, the student’s T dist. is considerably different from the Gaussian distribution; however, as df increases the T distribution approximates the Gaussian very well:
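
A sketch of the code behind that comparison plot:

# standard normal density with t densities for increasing degrees of freedom
curve(dnorm(x), from=-4, to=4, lwd=2, ylab="Density", main="Student's t vs. Normal")
for (df in c(1, 3, 10, 30)) curve(dt(x, df=df), from=-4, to=4, add=TRUE, lty=2)
legend("topright", c("Normal", "t (df = 1, 3, 10, 30)"), lty=c(1, 2), lwd=c(2, 1))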

[Figure: Student’s t densities for increasing df compared to the standard normal]

We should also want to know how well S approximates the population SD.  This is a trickier question, and the result isn’t very neat.  Basically, the sample standard deviation doesn’t follow a nice distribution on its own, so it has to be rescaled before it matches up with a standard probability distribution.  To gauge how well S approximates the true SD, we use the fact that for a sample of size n, the quantity (n-1)S²/σ² is distributed chi-squared with n-1 df.  This distribution doesn’t resemble the normal or the Student’s t:
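
A quick simulation sketch (using the muffin numbers as a stand-in population) shows the fit:

# (n-1)*S^2/sigma^2 should look chi-squared with n-1 df
set.seed(1)
n <- 4; sigma <- 4
stat <- replicate(10000, (n - 1) * var(rnorm(n, mean=24, sd=sigma)) / sigma^2)
hist(stat, breaks=50, freq=FALSE, xlab="Statistic",
     main="(n-1)S^2/sigma^2 vs. chi-squared with 3 df")
curve(dchisq(x, df=n - 1), add=TRUE, lwd=2)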

[Figure: chi-squared density with n-1 degrees of freedom]

Conclusion: sampling distributions are super important in a lot of applied statistics.  It’s not really the type of thing you want to memorize formulas about, but here are, in my opinion, the big points with respect to sampling distributions:

  • The mean of the sample means will be the mean of the population
  • The variance of the sample means will be the variance of the population divided by the sample size, n
  • The standard deviation of the sample means is called the standard error of the mean and it tells us how precise our estimate of the population mean is
  • The standard error of the mean will be smaller than the standard deviation of the population
  • The standard error of the mean decreases as the sample size increases, but with diminishing returns
  • Given a population has a normal distribution, the sample means will also have a normal distribution
  • If a population is not normally distributed, but the sample size is sufficiently large (>30), then the sample means will have an approx. normal distribution
  • For a sample of size n, the quantity (n-1)S²/σ² is distributed as a chi-square random variable with n-1 degrees of freedom