The Central Limit Theorem

Suppose we’re sampling from a population with mean μ and variance σ². Formally, the independent random variables {X1, X2, …, Xn} constitute our sample of size n from the distribution of interest. The values observed in the sample will likely all be different, but they all come from the same distribution. The stipulation that sample observations are independent is important.

Though it’s often stated in other terms, the Central Limit Theorem is a statement about the sum of the independent random variables observed in the sample. What can we say about the quantity below?

$$S_n = X_1 + X_2 + \cdots + X_n = \sum_{i=1}^{n} X_i$$

We could start by standardizing the quantity Sn; this is done in the same manner that one would standardize a test statistic while testing a hypothesis: subtract the mean and divide by the standard deviation. The standardized quantity can be interpreted as the number of standard deviations (with sign) that Sn lies from its mean; some people call this a Z score. We will make use of the fact that the sample random variables are independent when deriving the mean and variance of Sn, which is why the independence assumption is so important. So, our standardized sum of the random variables in the sample is of the form:

$$Z_n = \frac{S_n - E(S_n)}{\sqrt{\mathrm{Var}(S_n)}}$$
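
To fill in that step: expectations add, and because the Xi are independent, their variances add as well, so

$$E(S_n) = \sum_{i=1}^{n} E(X_i) = n\mu, \qquad \mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n\sigma^2, \qquad \mathrm{SD}(S_n) = \sigma\sqrt{n}.$$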

If we substitute in the values for the expected value and standard deviation of Sn that we derived above, we have the expression:

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$

The CLT is a statement about this standardized sum of observations from a population. Intuitively, the theorem says that as the sample size n grows arbitrarily large, the distribution of the standardized sum of the sample values (the Xi’s) tends toward the standard normal distribution. Mathematically, this means:

$$\lim_{n \to \infty} P\!\left(a < \frac{S_n - n\mu}{\sigma\sqrt{n}} < b\right) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$$

In the expression above, the sample values are {X1, X2, …, Xn}, the expected value of their sum is nμ, and the standard deviation of their sum is (√n)σ. In words, the expected value of the sum of all n sample observations is n times the population mean μ, and the standard deviation of the sum is the square root of n times the population standard deviation σ. I know I’m beating a dead horse, but it’s important that the CLT is a statement about the sum of observations in a sample. The quantity between the numbers a and b in the inequality is therefore the standardized sum of sample observations, or a z score if you want to call it that. On the right side is the probability density function (pdf) of the standard normal distribution. Integrating a continuous pdf over a region (a, b) gives the probability that the corresponding random variable takes a value greater than a and less than b, so the right-hand side of the equality above can be interpreted as the probability that a standard normal random variable takes on a value between the numbers a and b.

Putting everything together, the theorem states that if we take the sum of n observations from any distribution (with finite mean μ and variance σ²) and standardize it, the probability that this quantity lies in a given range can be approximated with the standard normal distribution, and the approximation improves as n increases. That is, the distribution of the standardized sum of sample values approaches the standard normal distribution as the size of the sample increases.
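
To make that concrete, here is a minimal simulation sketch in Python (the exponential population, the sample size n = 200, and the interval (a, b) = (−1, 1) are just illustrative choices, and it assumes numpy and scipy are available): it draws many samples, standardizes each sample’s sum, and compares the empirical frequency of landing in (a, b) with the corresponding standard normal probability.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative population: exponential with rate 1, so mu = 1 and sigma = 1
# (a deliberately skewed, non-normal choice).
mu, sigma = 1.0, 1.0
n = 200            # sample size
reps = 100_000     # number of simulated samples
a, b = -1.0, 1.0   # interval for the standardized sum

samples = rng.exponential(scale=1.0, size=(reps, n))
S = samples.sum(axis=1)                      # S_n for each simulated sample
Z = (S - n * mu) / (sigma * np.sqrt(n))      # standardized sum

empirical = np.mean((Z > a) & (Z < b))       # estimate of P(a < Z_n < b) by simulation
theoretical = norm.cdf(b) - norm.cdf(a)      # integral of the standard normal pdf over (a, b)

print(f"empirical  P({a} < Z < {b}) = {empirical:.4f}")
print(f"normal approximation       = {theoretical:.4f}")
```

With n this large the two printed numbers should agree to a couple of decimal places, even though the underlying population is strongly skewed.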

The best way to illustrate this is probably with the discrete uniform distribution. Consider rolling a die – it’s a trivial example, but it’s a familiar concept. The outcome of rolling the die is uniformly distributed over {1, 2, 3, 4, 5, 6}, because each of the six outcomes has probability 1/6, or approximately .167. It’s easy to show that the random variable X, where X is the number of dots showing after rolling one die, has a mean of E(X) = 3.5 and a standard deviation of √(35/12), or approximately 1.708.
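
If you want to verify those two numbers, here’s a quick check in Python (assuming numpy is available):

```python
import numpy as np

faces = np.arange(1, 7)   # the six equally likely outcomes of one die
mean = faces.mean()       # E(X) = 3.5
sd = faces.std()          # population standard deviation, sqrt(35/12) ≈ 1.708
print(mean, sd)
```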

[Figure: probability distribution of the outcome of a single die roll]

Now let the random variable X be the sum of the dots showing when you roll two dice. X can take on a minimum of 2 (if you roll ‘snake eyes’) and a maximum of 12 (if you roll two sixes). X has a mean of 7 and a standard deviation of √(35/6), or approximately 2.415, and is distributed like the graph below:

[Figure: probability distribution of the sum of two dice]

To demonstrate the CLT, we increase the number of dice we roll when we compute the sum. Below is the distribution for n = 3, that is, rolling three dice and computing their sum.

[Figure: probability distribution of the sum of three dice]

And as we let n get larger and larger, the histogram looks a lot like the normal curve:

[Figures: probability distributions of the sum of 20 dice and of 30 dice]

So when we consider the sum of more and more rolls, the distribution of the sum looks increasingly normal. This is the CLT at work! In another post, I’ll talk about ways to test the normality of a given distribution and how to apply the theorem to the sample mean.
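
If you’d like to reproduce plots like the ones above, here’s a sketch in Python (matplotlib, the number of repetitions, and the particular values of n are my own choices for illustration): it simulates many rolls of n dice, tallies the sums, and overlays the normal curve with mean nμ and standard deviation σ√n.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(42)
mu, sigma = 3.5, np.sqrt(35 / 12)   # mean and SD of a single die roll

reps = 100_000
for n in (1, 2, 3, 20, 30):         # illustrative choices of n
    # Simulate `reps` rolls of n dice and sum each roll
    sums = rng.integers(1, 7, size=(reps, n)).sum(axis=1)
    values, counts = np.unique(sums, return_counts=True)

    plt.figure()
    plt.bar(values, counts / reps, width=0.9, label=f"sum of {n} dice")

    # Normal curve with the same mean and standard deviation as the sum
    x = np.linspace(values.min() - 1, values.max() + 1, 400)
    plt.plot(x, norm.pdf(x, loc=n * mu, scale=sigma * np.sqrt(n)), label="normal pdf")
    plt.legend()
    plt.title(f"n = {n}")

plt.show()
```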
