
Simulating Waiting Times with R

In my last post, I used Mathematica to show the connection between events that follow a Poisson process (i.e. occur independently at a constant rate per unit time/space) and the waiting time between successive events, which is exponentially distributed.  By the memoryless property, the waiting time between successive events is also the time until the first event occurs – making the exponential distribution a reasonable model of the time until failure of batteries, electronic components, machines, and so forth.  Using the exponential model, I showed that the probability that a component lasts longer than average (i.e. exceeds its expected lifetime of 15 months in the example) is only 36.79%, which tells us the distribution is skewed (i.e. the mean and the median aren’t equal).  Below is the equation to compute this probability, and below that are plots of the theoretical probability density function and cumulative distribution function.


P(X > 15) = e^(−(1/15)·15) = e^(−1) ≈ 0.3679

[Plots: exponential pdf and cdf with rate 1/15]

In this post I’m going to simulate the previous result using R.  I’ve provided the R code that generates the results, but I’m not a polished programmer, so look elsewhere for programming methodology – my code does, nonetheless, produce the intended result.  Before starting, I loaded the MASS library because it’s a quick way to make plots look nicer if you’re not an R pro.

First, I simulated 1000 random draws from an exponential distribution with parameter 1/15 and assign the result to a variable called sample:

n <- 1000

sample <- rexp(n, rate=1/15)

Then I made a histogram of ‘sample’ using the truehist command (which requires the MASS library):

truehist(sample)


Now I want to count how many of the random waiting times are greater than the average, 15.  The graph gives us an idea – it doesn’t look like more than a third – but it’s not precise.  To know for sure, I created a vector called upper_sample containing only the values that exceed 15, with the help of a logical vector that is TRUE where the corresponding value in ‘sample’ is greater than 15 and FALSE otherwise:

above_avg <- sample > 15

upper_sample <- sample[above_avg]

truehist(upper_sample, main="Sample Waiting Times > 15")


The number of elements in the upper_sample vector divided by the number of trials gives me the probability that a component lasts more than 15 months, which I named psuccess:

psuccess <- length(upper_sample)/n

There were 377 such observations in my sample, so the simulated probability that a component lasts more than 15 months is 37.7%, about 1% off from the true value of 36.79%.
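The same simulation is easy to reproduce outside of R as a sanity check.  Here’s a quick Python sketch (the seed and sample size are my own choices, not from the post above):

```python
import random

random.seed(42)
n = 100_000
mean = 15  # expected lifetime in months, as in the post

# Draw n exponential waiting times with rate 1/15 (so the mean is 15)
sample = [random.expovariate(1 / mean) for _ in range(n)]

# Fraction of components lasting longer than the mean lifetime
p_success = sum(x > mean for x in sample) / n
print(p_success)  # should be close to e^-1 ≈ 0.3679
```

With a much larger n than the 1000 draws above, the simulated probability lands even closer to the theoretical e^(−1).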

The Maximum Likelihood Estimate of an Unknown Parameter


The Maximum Likelihood Estimate is one estimate of a parameter’s value.  It answers the question: “Given my data, what’s the most likely value of the unknown parameter?”  To use it, we compute the likelihood function: take a sample of n observations from the same distribution, evaluate each of the n observations (x1, …, xn) at the pdf pX(x; θ), and take the product of the results over all n.

L(θ) = pX(x1; θ) · pX(x2; θ) ··· pX(xn; θ)

Note: The expression pX(x; θ) is just the distribution of the population from which we’re sampling.  The parameter is usually written after a semicolon to emphasize that the distribution is characterized by the unknown parameter θ.

The likelihood function is just the joint density of the random sample.  Since sample observations are independent and identically distributed (iid), the joint pdf of all n observations is the product of the individual densities.  This is the same principle that lets us multiply P(A), P(B), and P(C) together to find P(A ∩ B ∩ C) when events A, B, and C are independent.  Suppose we take a sample of size n = 3 from the distribution, and the resulting values are x1, x2, and x3.  What’s the probability associated with the three sample values?  That is, what’s the joint density of the three sample values, pX(x1, x2, x3)?  By independence, it’s the product of the individual densities:

pX(x1, x2, x3) = pX(x1; θ) · pX(x2; θ) · pX(x3; θ)


Generalizing this case for all n gives the result that the joint density of n randomly drawn sample values is the product of the individual densities: the likelihood function is nothing more than the joint pdf of the sample – a multivariate probability density function of the values taken on by the n random variables in the sample.

The likelihood function is a function of the unknown parameter.  The Maximum Likelihood Estimate for the unknown parameter is the parameter value that maximizes the likelihood function:

We use calculus to find this value: take the derivative of the likelihood function with respect to the unknown parameter, set it equal to 0, and solve for the parameter.  Don’t forget to check the second-order condition to make sure you’ve actually found a maximum.


This will usually involve complicated, messy math.  To mitigate this, we often work with the logarithm of the likelihood function and use properties of logs to simplify computations.  This won’t change our answer – because the logarithm is strictly increasing, taking the log of a function doesn’t change the point at which its maximum is achieved.


The value of the parameter that you end up with maximizes the probability of your sample values x1, …,xn.  You could say it’s the value “most consistent” with the observed sample – the Maximum Likelihood Estimate.
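To make the recipe concrete, here’s a hedged Python sketch (the exponential model, seed, and grid bounds are my choices, not from the post).  For exponential data the log-likelihood is Σ(log λ − λxi), and maximizing it – by calculus or by brute force – recovers the closed-form MLE λ̂ = 1/x̄:

```python
import math
import random

random.seed(1)
data = [random.expovariate(1 / 15) for _ in range(5000)]

# Log-likelihood of an exponential(rate) model: sum of log(rate) - rate*x
def log_likelihood(rate, xs):
    return sum(math.log(rate) - rate * x for x in xs)

# Closed-form MLE from setting the derivative to zero: rate_hat = 1 / sample mean
rate_hat = len(data) / sum(data)

# A brute-force grid search over candidate rates should agree with the calculus
grid = [r / 1000 for r in range(1, 301)]
best = max(grid, key=lambda r: log_likelihood(r, data))
```

Both routes point at the same parameter value – the one “most consistent” with the observed sample.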

The Central Limit Theorem

Suppose we’re sampling from a population with mean μ and variance σ². Formally, the independent random variables {X1, X2, …, Xn} comprise our sample of size n from the distribution of interest. The random variables observed in the sample will likely all be different, but they all come from the same distribution. The stipulation that sample observations are independent is important.

Though it’s often stated in other terms, the Central Limit Theorem is a statement about the sum of the independent random variables observed in the sample. What can we say about the quantity below?

Sn = X1 + X2 + ⋯ + Xn

We could start by standardizing the quantity Sn, in the same manner one standardizes a test statistic while testing a hypothesis: subtract the mean and divide by the standard deviation. The standardized quantity can be interpreted as the number of (signed) standard deviations by which the quantity differs from its mean – some people call this a z-score. Because the sample random variables are independent, we can derive E(Sn) = nμ and Var(Sn) = nσ², which is why the independence assumption is so important. So, our standardized sum of the random variables in the sample is going to be of the form:

(Sn − E(Sn)) / SD(Sn)

If we substitute in the values for the expected value and standard deviation of Sn that we derived above, we have the expression:

(Sn − nμ) / (σ√n)

The CLT is a statement about this standardized sum of observations from a population. Intuitively, the theorem states that as the sample size n grows arbitrarily large, the distribution of the standardized sum of the sample values (the Xi’s) tends toward the standard normal distribution. Mathematically, this means:

lim(n→∞) P(a ≤ (Sn − nμ)/(σ√n) ≤ b) = ∫[a, b] (1/√(2π)) e^(−x²/2) dx

In the expression above, the sample values are {X1, X2, …, Xn}, the expected value of their sum is nμ, and the standard deviation of their sum is (√n)σ. In words, the expected value of the sum of all n sample observations is n times the population mean μ, and the standard deviation of the sum is the square root of n times the population standard deviation σ. I know I’m beating a dead horse, but it’s important that the CLT is a statement about the sum of observations in a sample. The quantity between the numbers a and b in the inequality is therefore the standardized sum of sample observations – a z-score, if you want to call it that. On the right side is the probability density function (pdf) of the standard normal distribution. Integrating any continuous pdf over a region (a, b) gives the probability of attaining a value greater than a and less than b, so the right-hand side of the equality can be interpreted as the probability that a standard normal random variable takes on a value between the numbers a and b.

Putting everything together, the theorem states that if we take the sum of n observations from any distribution and standardize it, the probability that this quantity lies in a given range can be approximated with the normal distribution, and the approximation improves as n increases. That is, the distribution of the standardized sum of sample values approaches the normal distribution as the size of the sample increases.
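A quick simulation makes the statement tangible. This Python sketch (the uniform(0, 1) population, seed, and sample size are arbitrary choices of mine) standardizes sums of n draws and checks that roughly 68% of them land within one standard deviation of zero, as the standard normal predicts:

```python
import math
import random

random.seed(0)

def standardized_sum(n):
    # Sum of n uniform(0,1) draws, standardized: mean n/2, sd sqrt(n/12)
    s = sum(random.random() for _ in range(n))
    return (s - n / 2) / math.sqrt(n / 12)

zs = [standardized_sum(30) for _ in range(20_000)]

# For a standard normal, about 68.3% of values fall within one sd of 0
share_within_1 = sum(abs(z) < 1 for z in zs) / len(zs)
```

Any population with a finite variance would do here – that’s the whole point of the theorem.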

The best way to illustrate this is probably with the uniform distribution. Consider rolling a die – it’s a trivial example, but it’s a familiar concept.  The outcome of rolling the die is uniformly distributed over {1, 2, …, 6}, because the probability that you roll a 1, 2, 3, 4, 5, or 6 is 1/6, or approx. .167.  It’s easy to show that the random variable X, where X corresponds to the dots showing after rolling one die, has a mean of E(X) = 3.5 and a standard deviation of approximately 1.7078.


Now let the random variable X be the sum of the dots showing when you roll two dice.  X can take on a minimum of 2 (if you roll ‘snake eyes’) and a maximum of 12 (if you roll two sixes).  X has a mean of 7 and a standard deviation of approximately 2.4152, and is distributed like the graph below:


To demonstrate the CLT, we want to increase the number of dice we roll when we compute the sum.  Below is the pdf for n = 3, that is, rolling three dice and computing their sum.


And as we let n get larger and larger, the histogram looks a lot like the normal curve:



So when we consider the sum of more rolls, the curve looks normal.  This is the CLT working!  In another post, I’ll talk about ways that we can test the normality of a given distribution and apply the theorem to the sample mean.
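The dice example is easy to simulate.  This Python sketch (the seed and number of trials are my own choices) checks that the sum of three dice has the mean and standard deviation the theory predicts: 3 × 3.5 = 10.5 and √(3 · 35/12) ≈ 2.958:

```python
import math
import random

random.seed(7)

def dice_sum(n_dice):
    # Sum of the dots showing on n_dice fair six-sided dice
    return sum(random.randint(1, 6) for _ in range(n_dice))

rolls = [dice_sum(3) for _ in range(50_000)]
mean = sum(rolls) / len(rolls)                       # theory: 3 * 3.5 = 10.5
var = sum((r - mean) ** 2 for r in rolls) / len(rolls)
sd = math.sqrt(var)                                  # theory: sqrt(3 * 35/12) ≈ 2.958
```

Plotting a histogram of `rolls` would reproduce the bell-ish shape in the n = 3 figure above.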

The Sampling Distribution of the Sample Mean

Before talking about the Central Limit Theorem (CLT), it’s necessary to distinguish between the probability distribution of the population of interest and the sampling distribution of the sample mean. Suppose we draw n independent observations from a population with mean μ and standard deviation σ. Each observation is a random variable, so the sample that we draw from the population is a collection of n random variables, all of which have the same probability distribution.

We compute the sample mean, X̄, from the observed sample values as a way of approximating the true mean μ, which we don’t know. The true mean μ is a parameter, because it is a characteristic of the entire population. It is not feasible (or even possible, in most cases) to determine the parameter exactly, so we develop mathematical functions with the purpose of approximating it. The inputs to such a function are the sample values – the n observations described previously. So the sample mean is a function of the random variables we observe in a given sample, and any function of random variables is itself a random variable, complete with all the properties that random variables are endowed with.

The sample mean – a function of random variables or, using proper terminology, a statistic – gives us a point estimate of the true population mean μ. It is an unbiased estimator of the population mean, so the expected value of the sample mean is equal to the true population mean. All this tells us is that our guesses will be “centered around” μ; it fails to give us any indication of dispersion about the population parameter. To address this concern, we appeal to the idea that the sample mean is a random variable, and is thus governed by a probability distribution that can tell us about its dispersion.
We call the probability distribution of the sample mean the sampling distribution of the sample mean – but it’s really nothing more than the probability distribution of X̄. Like I mentioned earlier, the sample mean is an unbiased estimator of μ; in other words, E(X̄) = μ. This is easy to prove. The expected value of each of the observations {X1, X2, …, Xn} is the unknown parameter μ, so the expected value of their sum is:

E(X1 + X2 + ⋯ + Xn) = nμ

Notice we’re NOT talking about the sample mean yet – the last expression is simply the sum of n independent observations from the population with mean μ and variance σ². Therefore:

E(X̄) = E[(1/n)(X1 + ⋯ + Xn)] = (1/n)(nμ) = μ

We’ve shown that the mean of the sampling distribution of the sample mean is the parameter μ. However, the same is not true of the standard deviation:

Var(X̄) = Var[(1/n)(X1 + ⋯ + Xn)] = (1/n²)(nσ²) = σ²/n, so SD(X̄) = σ/√n

So the sample mean has a mean of μ and a standard deviation of σ divided by the square root of the sample size n (and when the population is normal, the sample mean is normally distributed as well). The sample size here is not how many sample means we take from the sampling distribution; it corresponds to the number of observations used to calculate each sample mean. A different n corresponds to a different sampling distribution of the sample mean, because it has a different variance.

The standard deviation of the sampling distribution of the sample mean is usually called the standard error of the mean. Intuitively, it tells us by how much we deviate from the true mean, μ, on average. So when we take a sample of size n from the population and compute the sample mean, we know two things: we expect, on average, to compute the true population mean μ; and the average amount by which we will deviate from μ is σ/√n, the standard error of the mean.
Clearly, the standard error of the mean decreases as n increases, which implies that larger samples lead to greater precision (i.e. less variability) in estimating μ.

Suppose repair costs for a TV are normally distributed with a population mean of $100 and a standard deviation of $20. To simplify notation, we can write: Repair Cost ~ N(100, 20). The sampling distribution of the sample mean for a fixed sample of size n is then Cost ~ N(100, 20/√n). If you want to visualize the distribution of repair costs for the TV in your bedroom, you’d plot a normal curve with a mean of 100 and a standard deviation of 20. If, however, you’d like to visualize the average repair cost of the four TVs in your house, the correct curve would still be normal, but it would have a mean of 100 and a standard deviation (in this case, more correctly called a standard error) of 20/√4. Now include your neighbor’s TVs as well and your sample size is eight; the probability density function describing the average cost of the eight TV repairs is N(100, 20/√8). [Plot: the three normal curves for n = 1, 4, and 8]

To demonstrate the idea that the variance of the sample mean and the sample size are inversely related, consider the probability that a TV will require more than $110 worth of repairs. The probability that one TV requires repairs in excess of $110 (i.e. sample size = 1) is P(Z > (110 − 100)/20) = P(Z > 0.5) = 30.85%. The probability that the average repair cost of a sample of four TVs exceeds $110 is P(Z > (110 − 100)/(20/√4)) = P(Z > 1) = 15.87%. For the sample of size n = 8, we get 7.86%. What’s happening is that the value $110 is becoming more ‘extreme’ relative to the distribution as we increase the sample size. Since we’ve decreased the standard error when we take a sample of size 8 instead of 4, for example, an average of $110 is less likely to be observed.
[Plot: P(average repair cost > $110) shrinking as n grows]

We can use the sample size n in a more deliberate way if we consider its impact on the precision of the sample mean. We know a larger n corresponds to a tighter distribution of the sample mean around the true mean, so we can achieve a desired level of precision with our point estimator (the sample mean) by manipulating n. Since increasing n generally costs money in practice (think about surveying more respondents, etc.), we want to find the minimum value of n that gives us the desired condition. Let’s say I want the sample mean to be within 10 units of the true parameter with probability 0.97 – in other words, I want to be 97% confident that the value I compute for the sample mean is within 10 units of the true mean, using the smallest (cheapest) possible n. Solving P(|X̄ − 100| ≤ 10) ≥ 0.97 for n in Mathematica gives n ≥ 18.84, so n = 19 is the smallest sample size that achieves the desired precision. [Plots: sample size vs. the probability of being within 10 units of μ, and the sampling distribution of the sample mean for n = 19 against the N(100, 20) curve]

In the next post, I’ll talk about how the properties of the sampling distribution of the sample mean relate to one of the most important theorems in all of mathematics, the Central Limit Theorem.
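The TV-repair arithmetic can be checked directly with the standard normal CDF. This Python sketch uses only math.erf from the standard library (the function names are mine) and reproduces the three tail probabilities and the minimum sample size:

```python
import math

def normal_cdf(z):
    # Standard normal CDF, written in terms of the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 100, 20

def p_mean_exceeds(threshold, n):
    # P(sample mean of n repair costs > threshold); standard error = sigma/sqrt(n)
    z = (threshold - mu) / (sigma / math.sqrt(n))
    return 1 - normal_cdf(z)

p1 = p_mean_exceeds(110, 1)   # ≈ 0.3085
p4 = p_mean_exceeds(110, 4)   # ≈ 0.1587
p8 = p_mean_exceeds(110, 8)   # ≈ 0.0786

def within_10(n):
    # P(sample mean is within 10 units of mu)
    z = 10 / (sigma / math.sqrt(n))
    return normal_cdf(z) - normal_cdf(-z)

# Smallest n with at least 97% probability of being within 10 units
n_min = next(n for n in range(1, 100) if within_10(n) >= 0.97)  # 19
```

The search confirms the Mathematica result: n = 18 falls just short of 97%, and n = 19 clears it.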

Binomial Distribution Basics

The Binomial Distribution is a discrete probability distribution of the number of successes in n independent trials with a binary outcome and probability of success p.  There are three important conditions that must be met for the binomial distribution to be used correctly:

  • Trials are independent
  • The probability of success p is constant for all trials
  • Each trial has only two possible outcomes (binary)

One way to describe a Binomial random variable is in terms of Bernoulli Trials.  A Bernoulli trial is a random experiment in which there are only two possible outcomes – success and failure.  If we let success = 1 and failure = 0, then a Bernoulli Random Variable looks like this:

P(X = 1) = p

P(X = 0) = 1 – p

So really a Binomial experiment is just a sequence of n repeated Bernoulli trials.  Formally, a random variable X that counts the number of successes, k, in n trials, has a Binomial Distribution with parameters n and p, written X~Bin(n, p).  Let p = Probability that success occurs at any given trial.  Then the probability of k successes is:

P(X = k) = C(n, k) · p^k · (1 − p)^(n−k)

  • The first part of the equation is “n choose k” or a combination of k objects from a population of n objects.  This expression is the number of ways we can arrange k successes in n trials, where the order of the k successes is unimportant.  This is called a Binomial Coefficient.
  • The second part of the equation is p raised to the power k, where p is the probability of success in any given trial.  This expression takes advantage of the independence assumption – independent probabilities multiply, so the probability of success occurring k times is p*p*…*p (k times) = p^k.
  • The third part of the equation is the probability of realizing n-k failures where trials are independent and the probability of failure is (1-p).  Intuitively, if there are n trials and k of those are successes, the remaining trials, of which there are n-k, must result in failure.

It might help to think backwards and ignore the binomial coefficient for a moment.  If k independent events occur with probability p, and n − k independent events occur with probability (1 − p), then the probability that all mentioned events occur is the product p^k · (1 − p)^(n−k).  While this is accurate for one specific sequence of k successes and n − k failures, there are a number of orderings that yield k successes and n − k failures.  If we flip a coin three times and define success as landing on heads, there is more than one outcome for k = 2: we could have HHT, HTH, or THH.  The exact number of outcomes that yield k successes and n − k failures is the binomial coefficient “n choose k”, which is why it multiplies the probabilities of success and failure.
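You can verify this logic by brute force: enumerate all 2³ coin sequences, add up the probabilities of those with exactly two heads, and compare with the formula.  A Python sketch (p = 0.6 is an arbitrary biased coin of my choosing):

```python
from itertools import product
from math import comb

p = 0.6       # hypothetical probability of heads (need not be fair)
n, k = 3, 2   # three tosses, exactly two heads

# Enumerate all 2^3 sequences and sum the probabilities of those with k heads
total = 0.0
for seq in product("HT", repeat=n):
    heads = seq.count("H")
    prob = (p ** heads) * ((1 - p) ** (n - heads))
    if heads == k:
        total += prob

# The binomial formula: C(3,2) * p^2 * (1-p) = 3 * 0.36 * 0.4 = 0.432
formula = comb(n, k) * p**k * (1 - p) ** (n - k)
```

The enumeration and the formula agree exactly – the coefficient C(3, 2) = 3 is just counting the sequences HHT, HTH, and THH.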

Ex) Three coins are tossed, each with probability p that the coin lands on heads (it may not be a fair coin, so p may not be 0.5).  Since coin tosses are independent, we can write:

P(toss 1 ∩ toss 2 ∩ toss 3) = P(toss 1) · P(toss 2) · P(toss 3)

The table below shows the probabilities associated with each sequence:

Outcome   # Heads   Probability
HHH       3         p³
HHT       2         p²(1 − p)
HTH       2         p²(1 − p)
THH       2         p²(1 − p)
HTT       1         p(1 − p)²
THT       1         p(1 − p)²
TTH       1         p(1 − p)²
TTT       0         (1 − p)³

The table above gives the probability for each outcome – that is, for each specific sequence of heads and tails.  It would be incorrect to look at the entry for a single sequence, say THH, and assume it gives the probability of obtaining two heads in general.  That entry is the probability of obtaining exactly the sequence THH, not the more general case of obtaining two heads.

But if our main interest is the number of heads that occur, the sequence HHT is the same as HTH.  Note that there are three outcomes with exactly two heads, each with probability p²(1 − p).  This is where the “n choose k” term becomes necessary: exactly k successes may be achieved in a number of different ways.  In the context of this question, the binomial probability is:

P(k heads) = (# ways to arrange k heads and n-k tails)*(prob. of any specific sequence w/ k heads and n-k tails)

P(k heads) = (# ways to arrange k heads and n-k tails) · p^k · (1 − p)^(n−k)

Implicit in the operations just performed is that a random variable was used to map outcomes in a sample space to the real numbers.  Basically, we used a random variable to assign a real number (0, 1, 2, or 3 in this case) to each outcome (a sequence of H and T).  A random variable is neither random nor a variable – it is a function that maps sample outcomes to the real numbers.  This is a very simplified definition that won’t always suffice in a mathematical statistics course, but that’s basically what’s going on here.  The outcomes were grouped in a way that facilitates problem solving – in this case, to give us the number of heads rather than a specific sequence.

Just as the probability function assigns probabilities to sample outcomes in the sample space, a probability density function assigns probabilities to values that a random variable takes on.  Let X be the number of heads that occur in a sequence of 3 coin tosses.  Then the probability density function is:

pX(k) = P(X = k) = C(3, k) · p^k · (1 − p)^(3−k),  k = 0, 1, 2, 3

For example:

P(X = 2) = C(3, 2) · p²(1 − p) = 3p²(1 − p)

And here is a chart of the actual probabilities:

[Chart: pX(k) for k = 0, 1, 2, 3]

It should be clear now why independence of trials is such an important criterion in using the Binomial distribution.  If trials are not independent, i.e. the outcome of one trial affects the probability of subsequent outcomes, then the binomial model is not applicable.  When the dependence arises from sampling without replacement from a finite population, the Hypergeometric Distribution should be used instead.

Paradoxical expected values

We tend to think of the expected value of a random variable as a standard, practical, and simple measure of the variable’s central tendency.  However, there are situations in which the expected value is misleading or even nonexistent.  One famous case, commonly called the “St. Petersburg Paradox”, illustrates perfectly the limitations of the expected value as a measure of central tendency.

Suppose a fair coin is flipped until the first tail appears.  You win $2 if a tail appears on the first toss, $4 if a tail appears on the second toss, $8 if a tail appears on the third toss, etc.  The sample space for this experiment is infinite: S = {T, HT, HHT, HHHT, …}.  A game is called fair if the ante, or amount you must pay to play the game, is exactly equal to the expected value E(x) of the game.  Casino games are obviously never fair, because casinos stand to earn a profit (on average) from the games they offer – otherwise, they would not have a viable business model and they couldn’t afford to give you free drinks and erect fancy statues (though, while we’re on the subject, craps is the “fairest” as far as casino games go in the sense that the odds are skewed least heavily in the dealer’s favor).

Let the random variable W denote the winnings from playing in the game described above.   We want to categorize W such that we can mathematically evaluate its expected value.  Once we find this expected value, we know how much we’d have to ante for this game to be considered fair.  We know how much we’ll win on the first, second, and third toss, but let’s generalize for k tosses:

  • T on 1st toss -> $2
  • T on 2nd toss -> $4
  • T on 3rd toss -> $8
  • T on kth toss -> $2^k

So we win $2^k if the first tail appears on the kth toss.  But what is the probability of each of those outcomes?  For toss one, there is a ½ chance of getting a tail.  Since trials are independent, the probability of the first tail appearing on the second toss is (1/2)*(1/2) = 1/4.  For the third toss, the probability is 1/8.  So the probability of winning $2^k is the probability of getting the first tail on the kth toss which is:

P(first tail on toss k) = (1/2)^k

To find the expected value of the random variable W, we need to sum, over all values of k (where k is the toss on which the first tail appears), the payoff times the probability of that payoff.  It’s evident from the sample space that k can take on infinitely many values.  The expected value is:

E(W) = Σ (k = 1 to ∞) 2^k · P(first tail on toss k)

So the expected value of the random variable W (winnings) is the sum over all k of the payoff (2^k dollars) times the probability of realizing that payoff, (1/2)^k.  Plugging in what we know about the probability of k and the infinite nature of the sample space:

E(W) = Σ (k = 1 to ∞) 2^k · (1/2)^k = Σ (k = 1 to ∞) 1 = ∞

The sum diverges!  There is no finite expected value.  That is, the expected value of the game is infinite, which means we’d have to pay an infinite amount of money for the game to be fair.

The point is not that you should wager an infinite amount of money in order to play the aforementioned game.  The point is that the expected value is often an inappropriate measure of central tendency that leads to an inaccurate characterization of a distribution.
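One way to see the divergence numerically is to cap the game at K tosses: each term 2^k · (1/2)^k contributes exactly $1, so the truncated expected value is K dollars and grows without bound as the cap grows.  A Python sketch (the cap values are my own choices):

```python
# Truncated St. Petersburg game: allow at most K tosses.
# Each term 2^k * (1/2)^k equals exactly 1, so the truncated
# expected value is K dollars -- it never levels off.
def truncated_expected_value(K):
    return sum((2 ** k) * (0.5 ** k) for k in range(1, K + 1))

e10 = truncated_expected_value(10)    # 10.0
e100 = truncated_expected_value(100)  # 100.0
```

Doubling the allowed number of tosses doubles the fair ante – the partial sums march off to infinity rather than converging.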

Criminology and Combinatorial Probability

I just read a really interesting application of combinatorics to criminology in my stats book, An Introduction to Mathematical Statistics and Its Applications, by Larsen and Marx.  Alphonse Bertillon was a French criminologist who developed a system to identify criminals.  He chose 11 anatomical characteristics that supposedly remain somewhat unchanged throughout a person’s adult life – such as ear length – and divided each characteristic into three classifications: small, medium, and large.  Each of the 11 anatomical variables is categorized according to its size, and the ordered sequence of 11 letters (s for ‘small’, m for ‘medium’, and l for ‘large’) comprises a “Bertillon configuration”.

This seems rudimentary and imprecise, perhaps, but keep in mind this was before the discovery of fingerprinting and so forth.  One glaring issue with this system is the obvious subjectivity involved with assigning a classification to a given anatomical variable.  Also, I’d imagine it’s kind of hard to get all of this information about a person, especially if they’re suspected of having committed a crime but haven’t ever been in custody.

Issues aside, the Bertillon configuration is an interesting idea I think.  But the obvious question to ask is how likely is it that two people have the exact same Bertillon configuration?  The implications are two people who, from a legal classification standpoint, are exactly the same (and you can use your imagination from there).  So, how many people must live in a municipality before two of them necessarily share a Bertillon configuration?

The question is actually very simple to answer via the multiplication rule:

If Operation A can be performed in m different ways, and Operation B can be performed in n different ways, the sequence {Operation A, Operation B} can be performed in (m)(n) different ways.

Corollary: If Operation Ai can be performed in ni ways, i = 1, 2, …, k, respectively, then the ordered sequence {Operation A1, Operation A2, …, Operation Ak} can be performed in (n1)(n2)···(nk) ways.

Think of Operation A as the anatomical variable and Operation B as assigning it a value (s,m,l).

[Diagram: the 11 anatomical variables, each assigned s, m, or l]

Each operation can be performed in three different ways, namely s, m, and l.  Since there are 11 anatomical variables, by the multiplication rule there are 3·3·…·3 (11 times) = 3^11 = 177,147 possible Bertillon configurations.  Therefore, by the pigeonhole principle, once 3^11 + 1 = 177,148 people live in a town, at least two of them must have the same Bertillon configuration.
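The count is trivial to verify in Python (the pigeonhole step – adding one extra person to force a repeat – is my framing):

```python
# Multiplication rule: 11 traits, each classified one of 3 ways
configurations = 3 ** 11          # 177,147 distinct Bertillon configurations

# Pigeonhole principle: one more person than there are configurations
# guarantees that at least two people share a configuration
people_needed = configurations + 1
```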

Birthday Statistics

Today is my roommate’s birthday and three of my friends’ birthdays.  There are 365 days in a year, so what are the chances that so many birthdays are on the same day?  More importantly, what are the chances that I have to spend so much time and money on gifts?

The probability is higher than you’d expect – so high, in fact, the result has a special name: the birthday paradox.  Informally, the birthday paradox is the probabilistic phenomenon that for a relatively small group of people, the probability that at least two people share the same birthday is surprisingly high.  For example, if you have 23 people in your stats class, there’s roughly a 50% chance that at least two people have the same birthday.

One way to think about this result is that, at first, it seems like there are only 22 comparisons to be made – essentially 22 ‘chances’ that one person shares his or her birthday with someone else in the room.  This is true for the first person, who we’ll call Bob, but what happens when Bob is done comparing his birthday to everyone else’s?  Well, then the second person, who we’ll call Mary, compares her birthday to everyone in the room except for Bob, who she has already compared birthdays with.  So Mary has 21 people to compare with.  The third person has 20 potential matches, the fourth has 19, and so forth.  So what you really have is 22+21+…+1 = 253 comparisons, not 22.

Each individual comparison has only a 1/365 chance of being a match – equivalently, each trial has a 99.726% chance of not being a match.  This is probably why we’re so surprised by this paradox – we’re conditioned to think it’s rare to have the same birthday as someone, because when you ask any given person when their birthday is there is only a .274% chance you’ll have the same birthday as him or her.  But each time we compare, there is one less possibility – i.e. there is one more birth date that isn’t available, so to speak.  Below is the probability that no two people in the room share a birthday.  Notice that the first term is 365/365 because the first person has 365 birthdays to choose from without ‘taking’ someone else’s.  The second person, however, has one less choice, because the first person’s birthday has already been taken.  The third person has one less choice than the second person.  This continues until the last person, person number 23, has 22 fewer choices (365-22=343) of birthdays if he is to avoid having the same birthday as another person in the room.

P(no shared birthday) = (365/365) · (364/365) · (363/365) ··· (343/365) ≈ 0.4927

The probability that two people in the room do share a birthday is obviously the complement of this, so it’s equal to 1 – 0.4927 = 50.73%.  You can write the expression above with factorials etc., but I just wanted to sort of explain why (though not actually justify in a rigorously mathematical sense) the birthday paradox is a thing.  For a more hardcore explanation, you can visit Wolfram Mathworld here.
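The product above is easy to compute directly; here’s a short Python sketch (the function name is mine):

```python
# Probability that all k birthdays are distinct:
# (365/365) * (364/365) * ... * ((365-k+1)/365)
def p_all_distinct(k):
    prob = 1.0
    for i in range(k):
        prob *= (365 - i) / 365
    return prob

# Complement: at least two of 23 people share a birthday
p_shared = 1 - p_all_distinct(23)  # ≈ 0.5073
```

Running the same function for other group sizes shows how quickly the probability climbs – it passes 99% by around 57 people.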

Basic Intro to Poisson Random Variables

One distribution that Basic Intro Statistics/Business Statistics classes often miss is the Poisson distribution.  This is unfortunate, because it comes up quite frequently in a number of applications.  It looks a little bit intimidating (probably because of the lambda and the factorial) but it’s really not very complex.

First of all, Poisson random variables are discrete, so the Poisson distribution is discrete, i.e. a Poisson RV can only take on non-negative integer values (0, 1, 2, …).

Let’s assume that some independent event occurs, on average, L times over a time interval (I’m going to denote lambda as L instead of the appropriate symbol because it’s easier and I don’t know if the symbol will translate to WordPress).

  • Independence just means that one occurrence of said event doesn’t affect other occurrences
    • If A and B are independent events, then the probability of A given B has already occurred, Pr(A|B), is equal to the probability that A occurs:
      • Pr(A|B) = Pr(A); the occurrence of event B doesn’t affect event A
      •  Similarly, Pr(A and B) = Pr(A)Pr(B)

Formally, a random variable X is a Poisson random variable (with the stipulation that L > 0) if its probability mass function (pmf) has the form:

Screen shot 2013-08-25 at 8.09.05 AM

L indicates the average number of successes per time interval.  Successes are just specified outcomes that may or may not be considered successful in everyday life – for example, we might measure instances of cancer, in which case we could call getting cancer a success (for study purposes) though getting cancer is, objectively, not a success.  Basically what we’re going to be able to model with this random variable is the number of occurrences of some event or phenomenon in a predetermined time interval.  Sometimes stats books will call the time interval a “unit of space” or something fancy – don’t get confused.  Examples of Poisson random variables:

  • The number of patrons who enter a store every 5 minutes
    • More generally, the number of patrons who enter a store every ___ minutes/hours/days
  • The number of phone calls received by a call center every 10 minutes
    • More generally, the number of phone calls received by a call center every ____ minutes/hours/days

Hopefully I’ve demonstrated that the time interval or unit of space is generic and can be adapted to suit the needs of the researcher.

Since the distribution is discrete, we’re going to use the pmf to find the probability of exactly x occurrences of an independent event that occurs, on average, L times over some interval:

Screen shot 2013-08-25 at 8.11.30 AM
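The pmf translates directly into code.  Here’s a minimal Python version (the function name is mine, not anything standard):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences, given an average of lam per interval."""
    return exp(-lam) * lam**x / factorial(x)

# e.g. the chance of exactly 2 occurrences when the average is 6 per interval
print(round(poisson_pmf(2, 6), 4))  # 0.0446
```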

We couldn’t find the probability of exactly x occurrences if the random variable were continuous – the probability that a continuous random variable takes on any specific value is 0.  For more information, see The Continuum Paradox.  Essentially, if a variable is continuous we can measure it in smaller and smaller units until a difference appears between two approximately equal observations.  For example, suppose two observations in a dataset are both recorded as 68 inches tall.  Height is a continuous variable – if we measure more precisely, down to the centimeter, we might see that the two observations aren’t exactly equivalent.  If not, we can measure down to the millimeter.  We can continue, down to fractions of nanometers, and the two observations will surely differ in height to some degree.  Therefore, we can only find the probability of a continuous random variable taking on a range of values (via integration).  With a discrete random variable, however, this is not the case – the discrete pmf below clearly illustrates this:


As you can see, each discrete value on the x-axis is associated with a probability on the y-axis, and values in between the integers on the x-axis are not associated with any probability at all, as represented by the gaps in the pmf (for example, there’s no mass above x = 10.5, because 10.5 is not a value the random variable can take on).  If this were a continuous pdf, the dots would be connected by a smooth (continuous) line.  That’s why we can integrate a continuous pdf from 1 to 5 to find the probability of the random variable taking on a value between 1 and 5, but we have to sum the individual probabilities associated with x = 1, 2, 3, 4, 5 to find the probability of a discrete random variable taking on a value between 1 and 5 (inclusive).  (And thanks to JohnDCook for the picture.)

Suppose a store gets, on average, 2 customer complaints every 3 minutes.  This store clearly sucks and it’s probably unrealistic, but whatever.  How could we find the probability that 4 or fewer customers complain during a 9-minute interval?

  • Given our data (2 complaints per 3 minutes) we’d expect 2×3 = 6 complaints per 9 minutes, on average
  • The probability that 4 or fewer customers complain is:
    • Pr(0 customers) + Pr(1 customer) + Pr(2 customers) + Pr(3 customers) + Pr(4 customers)
    • Pr(0;6) + Pr(1;6) + Pr(2;6) + Pr(3;6) + Pr(4;6)
    • Using the formula for f(x;L) with x = 0, 1, 2, 3, and 4 and L = 6:

Screen shot 2013-08-25 at 8.12.45 AM

  • Recall that 0! = 1 by definition, so you’re not dividing by 0 in the 1st term

So, the chance that 4 or fewer customers complain during a 9-minute interval, given an average of 2 customer complaints every 3 minutes, is 28.5%.  Intuitively, this is reasonable – given the information the result should be relatively unlikely but not extremely rare, so 28.5% fits that description.
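You can check the 28.5% figure by summing the five pmf terms numerically.  A short Python sketch (the blog’s usual language is R, where `ppois(4, 6)` does the same thing):

```python
from math import exp, factorial

lam = 6  # 2 complaints per 3 minutes, so 6 expected per 9 minutes
# Sum the pmf over x = 0, 1, 2, 3, 4
p_at_most_4 = sum(exp(-lam) * lam**k / factorial(k) for k in range(5))

print(round(p_at_most_4, 3))  # 0.285
```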

Intro to Value at Risk

Value at risk (VaR) is a way to statistically measure the worst expected loss on an investment given certain parameters.  Specifically, the confidence level (with corresponding significance level, α) and time horizon must be specified to compute it.  We must also assume that returns on the asset in question are approximately normally distributed, because the value at risk model calculates the worst expected loss over a given horizon at a given level of confidence under normal market conditions1.  It’s a well-established fact that, in general, stock prices are lognormally distributed while stock returns are normally distributed, but it may be a good idea to verify this first using a goodness of fit test – the Anderson-Darling, Shapiro-Wilk, and Kolmogorov-Smirnov tests are all good examples.  When we say normal, we mean that the returns are normally distributed; we are making a judgment regarding the distribution of returns, not the likelihood of realizing a return, as the dictionary definition of normal would suggest.  As we’ll see, the VaR turns out to be a relatively unusual result in most cases, and it becomes rarer as the level of confidence we choose increases.  Putting this together, a formal definition of value at risk is a measure of the worst expected loss over a given time frame, at a given confidence level, under normal market conditions.

The actual VaR number should be the worst expected loss on an asset in terms of currency (not probability).  If we let f represent the probability density function of the return2 of the asset and c the confidence level, we have the following formula:

Screen shot 2013-08-13 at 4.16.32 PM

In words, this formula sets the integral from -∞ to the negative of our VaR value (some real number measured as a return in dollars) of the probability density function (pdf) equal to 1 minus the confidence level.  The Fundamental Theorem of Calculus tells us that integrating over a range gives us the area between the x-axis and the function.  In the context of a probability density function, this means that by integrating from A to B, we’re finding the probability that the random variable takes on a value between A and B.  In the context of VaR, when we integrate from -∞ to –VaR, we’re finding the probability of realizing a value between -∞ and –VaR – more simply, the probability of realizing a value less than –VaR.  So for a given confidence level c, we’re looking for the value –VaR such that the integral from -∞ to –VaR is equal to (1 – c).

It might be confusing that the negative of the VaR is used in this computation.  This arises because of the symmetry of the normal distribution.  Each value, however extreme (extreme meaning far from the mean in terms of standard deviations), has a complementary value just as extreme in the opposite direction.  The standard normal distribution is simply a normal distribution with a mean of 0 and a standard deviation of 1.  On that scale, a relatively extreme value, say 3, has the complementary value -3: 3 is a value 3 standard deviations above the mean, and -3 is 3 standard deviations below it.  But in VaR models we’re only interested in a potential loss – risk is, by definition, a quantitative measure of the chance of realizing a loss – so we’re only interested in –VaR.  The corresponding positive VaR is a value just as extreme but positive – in this case, a good thing, since it means higher returns.

Every potential portfolio value corresponds to a z score – a standardized measure of how many standard units (standard deviations) a value lies from the mean of the distribution, as was mentioned above.  Ordinarily, we transform a random variable X into a standardized Z score with the formula Z = (x – μ)/σ, where, in this case, x = VaR and Z is the Z value corresponding to the confidence level.  However, since we’re only interested in the negative tail, we substitute x = –VaR and Z = –Z.  Rearranging gives the simpler equation VaR = Zσ – μ, which standardizes our VaR value.

So if you’d like to be 95% certain you won’t lose more than the VaR, you set the left hand side of the equation equal to (1 – .95) = .05 and solve for VaR, which can then be interpreted as the 95% value at risk.  What you’ve really found is the return on the x-axis such that the area under the pdf to its left – all the way out to x = -∞ – is equal to .05.

If we use the 99% confidence level, we have α = (1 – .99) = .01.  We have to find a z score that divides the normal distribution into a section with 1% of the area to the left and 99% of the area to the right.  The picture below has 1% of the area in each tail of the distribution, and thus 98% shaded in the middle.

Screen shot 2013-08-13 at 1.48.48 PM

We’re interested in the z score that divides the lower white region from the rest of the graph, because it corresponds to the value that 99% of the x values are greater than.  Since the normal curve is symmetrical, that value is just the negative of the z value that has 99% of observations below it.

Screen shot 2013-08-13 at 2.24.26 PM

The pictures reveal that the z value in question is z = -2.33.  To reiterate, standardizing the value of any random variable x (i.e. assigning it a z value) uses the relationship Z = (x – mean)/(StanDev).  This Z value is the VaR in standardized form.  Therefore, we can substitute –VaR for Z in computations, because what we’re really finding when we find the VaR is the left tail (z score) of the distribution.
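You don’t have to read the z value off a picture or a table.  Python’s statistics.NormalDist will produce it (R’s qnorm(0.01) does the same job):

```python
from statistics import NormalDist

# The z score with 1% of the area to its left (99% above it)
z = NormalDist().inv_cdf(0.01)
print(round(z, 2))  # -2.33
```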

If the VaR is 200 at the 99% confidence level, you could say that the worst expected loss on your asset at the 99% confidence level is 200, given normal market conditions.  Alternatively, you could be 99% certain your losses won’t exceed 200.

Since the cumulative distribution function (F) is the integral of the probability density function (f), we can use it to simplify this expression (sorry for the bullshit above the equation – I had to screenshot it because WordPress hates equations):

Screen shot 2013-08-13 at 4.20.37 PM

An example should make this clearer.  Suppose the daily return on a portfolio follows a normal distribution with a mean of $1000 and a standard deviation of $500.  At the 99% confidence level, we have α = .01, which corresponds to a z score of 2.33 – i.e. the CDF of the standard normal distribution is equal to .99 when the z value is 2.33; F(2.33) = 0.99.  But we want the lower end of the distribution, -2.33.  Therefore:

VaR = Zσ – μ = 2.33(500) – 1000 = 165

So with 99% confidence we can say that our worst expected loss in one day is $165.  The value of -$165, i.e. a $165 loss, is 2.33 standard deviations (z scores) below our mean return of $1000.  So although we still expect (on average) to earn $1000 with a standard deviation of $500, we will lose more than $165 in a single day less than 1% of the time.  The PDF and CDF are below.



So the value -165 is exactly equivalent to the standardized z score of -2.33 as they both represent the value that 1% of observations fall below and 99% exceed.  The only difference is that with the number -2.33 we’re talking about a standardized random variable with a mean of 0 and a standard deviation of 1, whereas the VaR number of -165 represents a portfolio return with a mean of $1000 and a standard deviation of $500.  Alternatively, -$165 is the value 2.33 standard deviations to the left of the $1000 mean portfolio return.
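The whole example fits in a few lines of Python.  Note that the exact z value (≈2.326) gives a VaR of about $163 – the $165 figure comes from rounding z to 2.33:

```python
from statistics import NormalDist

mu, sigma = 1000, 500                  # mean and std. dev. of daily return
z = NormalDist().inv_cdf(0.99)         # ~2.326 (rounded to 2.33 above)
var_99 = z * sigma - mu
print(round(var_99, 2))                # ~163.17

# Equivalently: the 1st percentile of the return distribution is -VaR
print(round(NormalDist(mu, sigma).inv_cdf(0.01), 2))  # ~-163.17
```

Either way you compute it, the answer is the return that 99% of days will beat.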

