Even in an introductory stats class, you’ve probably at least heard of the Poisson Model – a mathematical framework relating Poisson events to the elapsed time between their occurrences (usually called the “waiting time”). Under this model, the Poisson Distribution gives the probability that a specified number of random events will occur in a given time period, based on one parameter (lambda) representing the average rate at which events occur per unit time. The Exponential Distribution describes the waiting time until a Poisson event occurs. More importantly, it applies to the time between successive events, not just the wait for the first. Its rate is the same lambda, so the mean waiting time is the reciprocal of the Poisson parameter (1/lambda).
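If you want to see that relationship rather than take it on faith, here is a minimal simulation sketch. The rate of 2 events per unit time, the seed, and the variable names are all my own arbitrary choices for illustration, not anything from the data used later in this post.

```r
# Sketch: draw exponential waiting times, then count events per unit interval.
set.seed(1)                            # arbitrary seed, only for reproducibility
lambda <- 2                            # assumed rate: 2 events per unit of time
waits <- rexp(100000, rate = lambda)   # waiting times between successive events
arrival <- cumsum(waits)               # clock time at which each event occurs
horizon <- floor(max(arrival))         # number of complete unit-length intervals

# Number of events falling in each interval [0,1), [1,2), ..., [horizon-1, horizon)
counts <- tabulate(floor(arrival[arrival < horizon]) + 1, nbins = horizon)

mean(waits)    # should land near 1/lambda
mean(counts)   # should land near lambda
var(counts)    # should also land near lambda
```

If the waiting times are exponential with rate lambda, the per-interval counts should have mean and variance both close to lambda, which is the Poisson signature.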
The point of this post is simply to show that many real-world random events do in fact follow a Poisson-ish model. Textbooks tend to use outdated or uninteresting examples that students can’t easily adapt to current data, which makes them feel irrelevant – for example, the number of phone calls received at a call center in an x-minute interval or the number of typos on a page. Sorry, but I don’t give a shit about call centers, and spell-checking software is so good now that if the frequency of your spelling errors concerns you, your issue probably isn’t statistical in nature. One classic example I actually find pretty interesting, though, is the application of the Poisson distribution to model the number of flying-bomb hits in 576 equally-sized, small plots of land in South London during WWII, which first appeared in a textbook by William Feller in 1957 (more here).
More relevant is the fact that the number of car accidents at a given location tends to follow a Poisson distribution. Some assumptions of the Poisson model are unlikely to be totally satisfied in empirical data; for instance, the Poisson events are assumed to be independent, which may not be true for car accidents due to road conditions and other factors. However, using data collected by the General Insurance Association of Singapore (website), I’ll show that car insurance claims for an independent group of insured drivers tend to follow a distribution resembling the Poisson in practice.
The dim() command in R tells me that the data set consists of 7,483 observations of 15 variables reflecting characteristics of the insured drivers (sex, age, driving history, etc.) and their vehicles (type, age, etc.). Here are the first 8 rows of the data frame:
Detailed variable descriptions are available here (page 21), but for now I’m only going to consider the variable “Clm_Count” which is the number of claims the insured driver made during the year. Type attach(Singapore) so that you can work with the variables in the data frame without using the super fucking annoying and tedious dataset$variable format. Also, remember that R is case sensitive and both C’s are capitalized in Clm_Count.
By typing summary(Clm_Count) we see that the minimum number of claims made is 0, the maximum is 3, the median is 0, and the average is 0.06989. Clearly, the number of claims made by insured drivers is right-skewed, and claims occur at an average rate of 0.06989 per insured per year. If we assume car accidents occur independently over time at a constant rate, we should be able to model insurance claims as Poisson events – that is, the probability of k insurance claims being filed by an insured driver would be Poi(k, lambda=0.06989). Mathematically, this is equal to: [(e^-0.06989)*(0.06989^k)]/k!, k=0, 1, 2, 3, … . The mean and variance are both equal to lambda, the average rate of 0.06989 claims.
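As a quick sanity check on the formula, here is a small sketch that evaluates the Poisson probability e^(-lambda) * lambda^k / k! by hand and compares it with R’s built-in dpois(). The names by_hand and built_in are just ones I made up for this example.

```r
lambda <- 0.06989   # average claims per insured, as reported by summary(Clm_Count)
k <- 0:3            # the claim counts that actually show up in the data

# Poisson probability mass function written out term by term
by_hand <- exp(-lambda) * lambda^k / factorial(k)

# The same probabilities from R's built-in Poisson density
built_in <- dpois(k, lambda)

all.equal(by_hand, built_in)   # TRUE if the two agree up to floating-point error
```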
So if the insurance claims in the Singapore data set are Poisson events, we’d expect the following percentage breakdown in number of claims:
- 0 claims – 93.25%
- 1 claim – 6.52%
- 2 claims – 0.23%
- 3 claims – 0.01%
- 4 claims – 0%
Which I obtained via the following code:
x <- 0:4  # number of claims to evaluate
poisson.model <- dpois(x, 0.06989)*100
poisson.model <- round(poisson.model,2)
poisson.model <- paste(poisson.model, "%", sep="")
This corresponds to the following expected claim frequency in a sample of 7,483 insured drivers in Singapore:
- 0 claims – 6,977.9
- 1 claim – 487.7
- 2 claims – 17
- 3 claims – 0.4
- 4 claims – 0
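The expected frequencies are just the Poisson probabilities scaled up by the sample size. A minimal sketch, where n and expected are names I picked for illustration:

```r
n <- 7483            # number of insured drivers in the data set
lambda <- 0.06989    # average claims per insured

# Expected number of insureds with 0, 1, 2, 3 and 4 claims under the Poisson model
expected <- n * dpois(0:4, lambda)
names(expected) <- 0:4

round(expected, 1)
```

Since the expected number of insureds with 4 claims rounds to zero, a maximum of 3 observed claims is about what the model predicts.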
So if the insurance claims are Poisson events, that’s what we’d expect. Here is the actual claim breakdown in the data set, or the observed frequency:
- 6,996 insureds made 0 claims
- 455 insureds made 1 claim
- 28 insureds made 2 claims
- 4 insureds made 3 claims
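observed <- as.data.frame(table(Singapore$Clm_Count))  # columns: Var1 (claim count), Freq (number of insureds)
barplot(table(Singapore$Clm_Count),col=c("darkblue","red","green","slateblue"),xlab="Number of Claims Filed by Policyholder",
main="Empirical Claim Frequency (n=7483)",legend.text=observed$Freq)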
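If you don’t have the Singapore data frame handy, you can rebuild the claim counts from the observed frequencies listed above and check the summary statistics yourself. This is only a sketch: claims is a stand-in vector I’m constructing from those four counts, not the original Clm_Count column.

```r
# Rebuild one claim count per insured from the observed frequency table
claims <- rep(0:3, times = c(6996, 455, 28, 4))

length(claims)   # total number of insureds
table(claims)    # should reproduce the observed frequencies
mean(claims)     # sample mean, the estimate of lambda
var(claims)      # sample variance; under a Poisson model this should be close to the mean
```

Comparing mean(claims) with var(claims) is a quick way to eyeball whether the "mean equals variance" property of the Poisson roughly holds.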
So, the observed frequency doesn’t differ all that much from the expected frequency suggested by the Poisson model. That is, car insurance claims made by a pool of insureds (specifically, just over 7,000 demographically diverse drivers in Singapore) could possibly be viewed as Poisson events. But to see where the model fails us, here’s a plot of the excess claims observed in the empirical data (i.e. observed number of claims less the expected number of claims under the Poisson model):
- The number of insureds who made zero claims was 18.1 more than predicted by the Poisson model
- The number of insureds who made one claim was 32.7 fewer than predicted by the Poisson model
- The number of insureds who made two claims was 11 more than predicted by the Poisson model
- The number of insureds who made three claims was 3.6 more than predicted by the Poisson model
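The excess figures above are simply observed minus expected. Here is a rough sketch of that calculation plus a bar plot of the result; the vector names and the plot labels are my own and may not match the original figure.

```r
# Observed number of insureds by claim count, taken from the frequencies listed earlier
observed <- c("0" = 6996, "1" = 455, "2" = 28, "3" = 4)

# Expected number of insureds under a Poisson model with lambda = 0.06989
expected <- 7483 * dpois(0:3, 0.06989)

# Positive values mean more insureds than the model predicts, negative means fewer
excess <- observed - expected
round(excess, 1)

barplot(excess, xlab = "Number of Claims Filed by Policyholder",
        ylab = "Observed minus expected insureds",
        main = "Excess Claims vs. Poisson Model")
```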