# Tag Archives: probability

## Simulating Waiting Times with R

In my last post, I used Mathematica to show the connection between events that follow a Poisson process (i.e. occur independently at a constant rate per unit time/space) and the waiting time between successive events, which is exponentially distributed.  By the memoryless property, the waiting time between successive events is also the time until the first event occurs – making the exponential distribution a reasonable model of the time until failure of batteries, electronic components, machines, and so forth.  Using the exponential model, I showed that the probability that a component lasts longer than average (i.e. exceeds its expected lifetime of 15 months in the example) is only 36.79%, which tells us the distribution is skewed (i.e. the mean and the median aren't equal).  Below is the equation to compute this probability, followed by plots of the theoretical probability density function and cumulative distribution function.
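Written out, for an exponential lifetime X with mean 15 months (rate 1/15), the probability in question is:

```latex
P(X > 15) \;=\; \int_{15}^{\infty} \frac{1}{15}\, e^{-t/15}\,dt \;=\; e^{-15/15} \;=\; e^{-1} \;\approx\; 0.3679
```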

In this post I'm going to reproduce that result by simulation in R.  I've provided the R code to generate the results, but I'm a pretty bad programmer, so I'd encourage you to look elsewhere for programming methodology – nonetheless, my code does produce the intended result.  Before starting I loaded the MASS library because it's a quick way to make plots look less shitty if you're not an R pro.

First, I simulated 1000 random draws from an exponential distribution with rate parameter 1/15 and assigned the result to a variable called sample:

n <- 1000

sample <- rexp(n, rate=1/15)

Then I made a histogram of 'sample' using the truehist command (which requires the MASS library):

Now I want to count how many of the random waiting times are greater than the average, 15.  The graph gives us an idea – it doesn't look like more than a third – but it's not precise.  To know for sure, I create a vector called upper_sample containing only the sample values that exceed 15, with the help of a logical vector that is TRUE where the corresponding value in 'sample' is greater than 15 and FALSE otherwise:

above_avg <- sample > 15

upper_sample <- sample[above_avg]

truehist(upper_sample, main="Sample Waiting Times > 15")

The number of elements in the upper_sample vector divided by the number of trials gives me the simulated probability that a component lasts more than 15 months, which I named psuccess:

psuccess <- length(upper_sample)/n
psuccess

There were 377 observations in upper_sample, so the simulated probability that a component lasts more than 15 months is 37.7%, about 1% off from the true value.
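Putting the pieces together, here's the whole simulation as one runnable script.  The set.seed call is my addition so the script is reproducible; with a different seed you'll get a count near, but not exactly, the 377 reported above.

```r
# Simulate exponential waiting times and estimate P(X > 15)
library(MASS)   # for truehist

set.seed(1)
n <- 1000
sample <- rexp(n, rate = 1/15)          # 1000 draws, mean waiting time 15 months

truehist(sample, main = "Sample Waiting Times")

above_avg <- sample > 15                # logical vector: TRUE where value exceeds the mean
upper_sample <- sample[above_avg]       # keep only the waiting times > 15
truehist(upper_sample, main = "Sample Waiting Times > 15")

psuccess <- length(upper_sample) / n    # simulated P(X > 15)
psuccess                                # should land near exp(-1), about 0.3679
```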

## Birthday Statistics

Today is my roommate’s birthday and three of my friends’ birthdays.  There are 365 days in a year, so what are the chances that so many birthdays are on the same day?  More importantly, what are the chances that I have to spend so much time and money on gifts?

The probability is higher than you'd expect – so high, in fact, that it has a special name: the birthday paradox.  Informally, the birthday paradox is the phenomenon that in a relatively small group of people, the probability that at least two people share a birthday is surprisingly high.  For example, if you have 23 people in your stats class, there's roughly a 50% chance that at least two of them have the same birthday.

One way to think about this result is that, at first, it seems like there are only 22 comparisons to be made – essentially 22 ‘chances’ that one person shares his or her birthday with someone else in the room.  This is true for the first person, who we’ll call Bob, but what happens when Bob is done comparing his birthday to everyone else’s?  Well, then the second person, who we’ll call Mary, compares her birthday to everyone in the room except for Bob, who she has already compared birthdays with.  So Mary has 21 people to compare with.  The third person has 20 potential matches, the fourth has 19, and so forth.  So what you really have is 22+21+…+1 = 253 comparisons, not 22.

Each individual comparison has only a 1/365 chance of being a match – equivalently, each trial has a 99.726% chance of not being a match.  This is probably why we’re so surprised by this paradox – we’re conditioned to think it’s rare to have the same birthday as someone, because when you ask any given person when their birthday is there is only a .274% chance you’ll have the same birthday as him or her.  But each time we compare, there is one less possibility – i.e. there is one more birth date that isn’t available, so to speak.  Below is the probability that no two people in the room share a birthday.  Notice that the first term is 365/365 because the first person has 365 birthdays to choose from without ‘taking’ someone else’s.  The second person, however, has one less choice, because the first person’s birthday has already been taken.  The third person has one less choice than the second person.  This continues until the last person, person number 23, has 22 fewer choices (365-22=343) of birthdays if he is to avoid having the same birthday as another person in the room.
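Written out, that product of probabilities is:

```latex
P(\text{no shared birthday}) \;=\; \frac{365}{365}\cdot\frac{364}{365}\cdot\frac{363}{365}\cdots\frac{343}{365} \;=\; \prod_{k=0}^{22}\frac{365-k}{365} \;\approx\; 0.4927
```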

The probability that at least two people in the room do share a birthday is the complement of this, so it's equal to 1 – 0.4927 = 50.73%.  You can write the expression above more compactly with factorials, but I just wanted to explain (though not rigorously justify) why the birthday paradox is a thing.  For a more hardcore explanation, you can visit Wolfram MathWorld.
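If you'd rather let R do the arithmetic, the product of 23 shrinking fractions (and its complement) is a one-liner:

```r
# P(no two of 23 people share a birthday): 365/365 * 364/365 * ... * 343/365
p_no_match <- prod((365:343) / 365)
p_no_match       # about 0.4927

# P(at least one shared birthday) is the complement
1 - p_no_match   # about 0.5073
```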

## Basic Intro to Poisson Random Variables

One distribution that Basic Intro Statistics/Business Statistics classes often miss is the Poisson distribution.  This is unfortunate, because it comes up quite frequently in a number of applications.  It looks a little bit intimidating (probably because of the lambda and the factorial) but it’s really not very complex.

First of all, Poisson random variables are discrete, so the Poisson distribution is discrete – a Poisson RV can only take on non-negative integer values (0, 1, 2, …).

Let's assume that some independent event occurs, on average, lambda times over a given time interval (I'm going to denote lambda as L instead of the Greek symbol because it's easier and I don't know if the symbol will render on WordPress).

• Independence just means that one occurrence of said event doesn't affect other occurrences
• If A and B are independent events, then the probability of A given that B has already occurred, Pr(A|B), is equal to the probability that A occurs: Pr(A|B) = Pr(A); the occurrence of event B doesn't affect event A
• Similarly, Pr(A and B) = Pr(A)Pr(B)

Formally, a random variable X is a Poisson random variable (with the stipulation that L > 0) if its probability mass function (pmf) has the form:
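Written out (using L in place of lambda, as above), the pmf is:

```latex
f(x; L) \;=\; \frac{e^{-L}\, L^{x}}{x!}, \qquad x = 0, 1, 2, \ldots
```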

L is the average number of successes per time interval.  Successes are just specified outcomes that may or may not be considered successful in everyday life – for example, we might measure instances of cancer, in which case we could call getting cancer a success (for study purposes) even though getting cancer is, objectively, not a success.  Basically, what we're going to be able to model with this random variable is the number of occurrences of some event or phenomenon in a predetermined time interval.  Sometimes stats books will call the time interval a "unit of space" or something fancy – don't get confused.  Examples of Poisson random variables:

• The number of patrons who enter a store every 5 minutes
• The number of patrons who enter a store every ___ minutes/hours/days
• The number of phone calls received by a call center every 10 minutes
• The number of phone calls received by a call center every ____ minutes/hours/days

Hopefully I’ve demonstrated that the time interval or unit of space is generic and can be adapted to suit the needs of the researcher.

Since the distribution is discrete, we're going to use the pmf to find the probability of exactly x occurrences of an independent event that occurs, on average, L times over some interval:

We couldn't find the probability of exactly x occurrences if the random variable were continuous – the probability that a continuous random variable takes on any specific value is 0.  For more information, see The Continuum Paradox.  Essentially, if a variable is continuous we can measure it in smaller and smaller units until a difference appears between two approximately equal observations.  For example, suppose two observations in a dataset are both recorded as 68 inches tall.  Height is a continuous variable – if we measure at a finer resolution, we might see that the two observations aren't exactly equal.  If not, we can measure finer still, down to fractions of nanometers, and the two observations will surely differ in height to some degree.  Therefore, we can only find the probability of a continuous random variable taking on a range of values (via integration).  With a discrete random variable, however, this is not the case – the discrete pmf below clearly illustrates this:

As you can see, each discrete value on the x-axis is associated with a probability on the y-axis, and values in between the integers aren't associated with any probability at all, as represented by the gaps in the pmf (for example, there's no mass above x=10.5, because 10.5 is not a value the random variable can take on).  If this were a continuous distribution, the dots would be connected by a smooth (continuous) curve.  That's why we can integrate a continuous pdf from 1 to 5 to find the probability of the random variable taking on a value between 1 and 5, but we have to sum the individual probabilities associated with x=1, 2, 3, 4, 5 to find the probability of a discrete random variable taking on a value between 1 and 5 (inclusive).  (Thanks to John D. Cook for the picture.)

Suppose a store gets, on average, 2 customer complaints every 3 minutes.  This store clearly sucks and it's probably unrealistic, but whatever.  How could we find the probability that 4 or fewer customers complain during a 9-minute interval?

• Given our data (2 complaints per 3 minutes), and since a 9-minute interval contains three 3-minute intervals, we'd expect L = 2×3 = 6 complaints per 9 minutes, on average
• The probability that 4 or fewer customers complain is:
• Pr(0 customers) + Pr(1 customer) + Pr(2 customers) + Pr(3 customers) + Pr(4 customers)
• f(0;6) + f(1;6) + f(2;6) + f(3;6) + f(4;6)
• Using the formula for f(x;L) with x = 0, 1, 2, 3, and 4 and L = 6:

• Recall that 0! = 1 by definition, so you’re not dividing by 0 in the 1st term
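Written out with L = 6, the five terms sum to:

```latex
P(X \le 4) \;=\; \sum_{x=0}^{4} \frac{e^{-6}\, 6^{x}}{x!} \;=\; e^{-6}\left(1 + 6 + 18 + 36 + 54\right) \;=\; 115\, e^{-6} \;\approx\; 0.2851
```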

So, the chance that 4 or fewer customers complain during a 9-minute interval, given an average of 2 customer complaints every 3 minutes, is 28.5%.  Intuitively, this is reasonable – given the information the result should be relatively unlikely but not extremely rare, so 28.5% fits that description.
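You can check the hand computation with R's built-in Poisson functions – dpois gives the individual pmf terms and ppois gives the cumulative probability directly:

```r
# The five individual terms f(0;6), ..., f(4;6)
dpois(0:4, lambda = 6)

# Their sum: P(X <= 4) when L = 6
sum(dpois(0:4, lambda = 6))   # about 0.2851

# Same result straight from the cumulative distribution function
ppois(4, lambda = 6)
```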
