Category Archives: Math

The Maximum Likelihood Estimate of an unknown parameter


The Maximum Likelihood Estimate is one way to estimate a parameter’s value.  It answers the question: “Given my data, what’s the most likely value of the unknown parameter?”  To use it, we need to compute the likelihood function: take a sample of n observations (x1, …, xn) from the same distribution, evaluate each observation at the pdf pX(x), and take the product of the results for all n.

Likelihood Function:

L(θ) = pX(x1; θ) · pX(x2; θ) · · · pX(xn; θ)

Note: The expression pX(x; θ) is just the distribution of the population from which we’re sampling – the parameter θ is usually written after a semicolon to emphasize that the distribution is characterized by the unknown parameter.

The likelihood function is just the joint density of the random sample.  Since sample observations are independent and identically distributed (iid), the joint pdf of all n of them is the product of the individual densities.  This is the same principle that lets us multiply P(A), P(B), and P(C) together to find P(A ∩ B ∩ C) when events A, B, and C are independent.  Suppose we take a sample of size n = 3 from the distribution, and the resulting values are x1, x2, and x3.  What’s the probability associated with the three sample values?  That is, what’s the joint density of the three sample values, pX(x1, x2, x3)?  By independence,

pX(x1, x2, x3) = pX(x1) · pX(x2) · pX(x3)
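As a quick numerical check of this product rule, here is a sketch in Python.  The standard normal population and the three sample values are assumptions for illustration only:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """pdf of a N(mu, sigma^2) random variable."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# A hypothetical sample of n = 3 iid observations
sample = [0.5, -1.2, 2.0]

# Because the observations are independent, the joint density
# factors into the product of the individual densities.
joint = 1.0
for x in sample:
    joint *= normal_pdf(x)

print(joint)
```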


Generalizing this case to arbitrary n gives us the result that the joint density of n randomly drawn sample values is the product of the individual densities, and the likelihood function is nothing more than the joint pdf of the sample – a multivariate probability density function of the values taken on by the n random variables in the sample.

The likelihood function is a function of the unknown parameter.  The Maximum Likelihood Estimate for the unknown parameter is the parameter value that maximizes the likelihood function:

θ̂ = arg max over θ of L(θ)

We use calculus to find this value: take the derivative of the likelihood function with respect to the unknown parameter, set it equal to 0, and solve for the parameter.  Don’t forget to verify the second-order conditions to make sure you are indeed finding a maximum.


This will usually involve complicated, messy math.  To mitigate this, we often work with the logarithm of the likelihood function and use properties of logs to simplify computations.  This won’t change our answer – because the logarithm is monotonically increasing, taking the log of a function doesn’t change the point at which its maximum is achieved.


The value of the parameter that you end up with maximizes the probability of your sample values x1, …,xn.  You could say it’s the value “most consistent” with the observed sample – the Maximum Likelihood Estimate.
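The whole procedure can be sketched numerically.  Here is a hypothetical example (an exponential population with unknown rate λ and a made-up sample): a brute-force grid search over the log-likelihood lands on the calculus answer, λ̂ = 1/x̄.

```python
import math

# Hypothetical sample assumed drawn from an exponential distribution
sample = [0.8, 1.3, 0.2, 2.1, 0.9]
n = len(sample)

def log_likelihood(lam):
    # log L(lam) = n*ln(lam) - lam * sum(x_i) for the exponential pdf
    return n * math.log(lam) - lam * sum(sample)

# Maximize by grid search over candidate values of lambda
lams = [i / 1000 for i in range(1, 5001)]
lam_hat = max(lams, key=log_likelihood)

# Calculus gives the closed form: lam_hat = 1 / (sample mean)
analytic = n / sum(sample)
print(lam_hat, analytic)
```

The grid maximum agrees with the analytic MLE to the grid’s resolution.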

Level Payment vs. Sinking Fund Loans

Below is a document explaining how to derive formulas for the most basic level payment and sinking fund loans. This is a simple introduction, as I’m currently working on a more detailed analysis of the benefits/drawbacks to various types of loans (including installment, variable rate, etc.) using empirical data and considering various scenarios, like the option to refinance and varying interest rates.  I used the results from my post on annuity formulas to simplify the derivation, so if you’re confused about how I got from one step to the next, check there!

Level Payment and Sinking Fund Loans

The Central Limit Theorem

Suppose we’re sampling from a population with mean μ and variance σ2. Formally, the independent random variables {X1, X2, …, Xn} comprise our sample of size n from the distribution of interest. The random variables observed in the sample will likely all be different, but they all come from the same distribution. The stipulation that sample observations are independent is important.

Though it’s often stated in other terms, the Central Limit Theorem is a statement about the sum of the independent random variables observed in the sample. What can we say about the quantity below?

Sn = X1 + X2 + … + Xn

We could start by standardizing the quantity Sn; this is done in the same manner that one would standardize a test statistic while testing a hypothesis. To standardize the sum, we subtract the mean and divide by the standard deviation. The standardized quantity can be interpreted as the number of standard deviations (with sign) the quantity lies from its mean. Some people call this a z score. We will make use of the fact that the sample random variables are independent in deriving the mean and variance of Sn, which is why the independence assumption is so important. So, our standardized sum of the random variables in the sample is of the form:

(Sn − E(Sn)) / SD(Sn)

If we substitute in the expected value and standard deviation of Sn, which follow from independence (E(Sn) = nμ and SD(Sn) = (√n)σ), we have the expression:

(Sn − nμ) / ((√n)σ)

The CLT is a statement about this standardized sum of observations from a population. Intuitively, the theorem states that as the sample size (n) grows arbitrarily large, the distribution of the sum of the sample values (the Xi’s) tends towards the normal distribution. Mathematically, this means:

lim (n → ∞) P( a ≤ (Sn − nμ) / ((√n)σ) ≤ b ) = ∫ab (1/√(2π)) e^(−z²/2) dz

In the expression above, the sample values are {X1, X2, …, Xn}, the expected value of their sum is nμ, and the standard deviation of their sum is (√n)σ. In words, the expected value of the sum of all n sample observations is n times the population mean μ, and the standard deviation of the sum of all n sample observations is the square root of n times the population standard deviation σ. I know I’m beating a dead horse, but it’s important that the CLT is a statement about the sum of observations in a sample. Therefore, the quantity between the numbers a and b in the inequality is the standardized sum of sample observations, or a z score if you want to call it that. On the right side is the probability density function (pdf) of the standard normal distribution. Integrating any continuous pdf over a region (a, b) gives the probability of attaining a value greater than a and less than b, so the right-hand side of the equality above can be interpreted as the probability that a standard normal random variable takes on a value between the numbers a and b.

Putting everything together, the theorem states that if we take the sum of n observations from any distribution and standardize it, the probability that this quantity lies in a given range can be approximated with the normal distribution as n increases. That is, the distribution of the standardized sum of sample values approaches the normal distribution as the size of the sample increases.

The best way to illustrate this is probably with a uniform distribution. Consider rolling a die – it’s a trivial example, but it’s a familiar concept.  The outcome of rolling the die is uniformly distributed over {1, 2, 3, 4, 5, 6}, because the probability of rolling each of the six faces is 1/6, or approximately .167.  It’s easy to show that the random variable X, where X is the number of dots showing after rolling one die, has a mean of E(X) = 3.5 and a standard deviation of √(35/12), approximately 1.7078.


Now let the random variable X be the sum of the dots showing when you roll two dice.  X can take on a minimum of 2 (if you roll ‘snake eyes’) and a maximum of 12 (if you roll two sixes).  X has a mean of 7 and a standard deviation of approximately 2.4152, and its distribution is the familiar triangular shape peaked at 7.
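These moments can be checked directly from the pmfs (note that a die roll is discrete uniform, so its standard deviation is √(35/12) ≈ 1.708, not the continuous-uniform value); a short sketch:

```python
from itertools import product

# pmf of one fair die: each face 1..6 with probability 1/6
one_die = [1, 2, 3, 4, 5, 6]
mean1 = sum(one_die) / 6
var1 = sum((x - mean1) ** 2 for x in one_die) / 6

# All 36 equally likely outcomes for the sum of two dice
sums = [a + b for a, b in product(one_die, repeat=2)]
mean2 = sum(sums) / 36
var2 = sum((s - mean2) ** 2 for s in sums) / 36

print(mean1, var1 ** 0.5)   # 3.5 and ~1.7078
print(mean2, var2 ** 0.5)   # 7.0 and ~2.4152
```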


To demonstrate the CLT, we want to increase the number of dice we roll when we compute the sum.  With n = 3 dice, the pmf of the sum already starts to look bell-shaped.


And as we let n get larger and larger, the histogram looks more and more like the normal curve.



So when we consider the sum of more rolls, the curve looks normal.  This is the CLT working!  In another post, I’ll talk about ways that we can test the normality of a given distribution and apply the theorem to the sample mean.
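A quick simulation sketch of this idea (the choices of n, the number of trials, and the seed are arbitrary): standardize the sum of n dice and check how often it lands within one standard deviation of its mean, which for a standard normal is about 68.3%.

```python
import math
import random

random.seed(0)

def standardized_dice_sum(n):
    """Roll n dice, sum them, and standardize by the exact mean and sd."""
    s = sum(random.randint(1, 6) for _ in range(n))
    mu, var = 3.5 * n, (35 / 12) * n
    return (s - mu) / math.sqrt(var)

# Fraction of standardized sums falling in (-1, 1)
n, trials = 30, 20000
inside = sum(abs(standardized_dice_sum(n)) < 1 for _ in range(trials)) / trials
print(inside)
```

The printed fraction sits near 0.683, as the CLT predicts.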

The Sampling Distribution of the Sample Mean

Before talking about the Central Limit Theorem (CLT), it’s necessary to distinguish between the probability distribution of the population of interest and the sampling distribution of the sample mean. Suppose we draw n independent observations from a population with mean μ and standard deviation σ. Each observation is a random variable, so the sample that we draw from the population is a collection of n random variables, all of which have the same probability distribution. We can compute the sample mean, X̄, from the observed sample values. We use the sample values drawn from the population of interest to compute the sample mean as a way of approximating the true mean μ, which we don’t know. The true mean μ is a parameter, because it is a characteristic of the entire population. It is not feasible (or even possible, in most cases) to determine the parameter in question, so we develop mathematical functions with the purpose of approximating it.

The inputs to such a function are the sample values – the n observations described previously. So the sample mean is a function of the random variables we observe in a given sample. Any function of random variables is a random variable itself, so the sample mean is a random variable, complete with all the properties that random variables are endowed with. The sample mean – a function of random variables or, using proper terminology, a statistic – gives us a point estimate of the true population mean μ. It is an unbiased estimator of the population mean, so we know that the expected value of the sample mean is equal to the true population mean. All this tells us is that our guesses will be “centered around” μ; by itself, the sample mean gives us no indication of dispersion about the population parameter. To address this concern, we appeal to the idea that the sample mean is a random variable, and is thus governed by a probability distribution that can tell us about its dispersion.
We call the probability distribution of the sample mean the sampling distribution of the sample mean – but it’s really nothing more than the probability distribution of X̄. Like I mentioned earlier, the sample mean is an unbiased estimator of μ. In other words, E(X̄) = μ. This is easy to prove:

E(X1 + X2 + … + Xn) = E(X1) + E(X2) + … + E(Xn) = nμ

The expected value of each of the observations we draw from the population {X1, X2, …, Xn} is the unknown population parameter μ. Notice we’re NOT talking about the sample mean right now – the last expression is simply the expected value of the sum of n independent observations from the population with mean μ and variance σ². Therefore:

E(X̄) = E( (X1 + X2 + … + Xn) / n ) = (1/n) · nμ = μ

We’ve shown that the mean of the sampling distribution of the sample mean is the parameter μ. However, the same is not true for the standard deviation, as demonstrated below:

Var(X̄) = Var( (X1 + … + Xn) / n ) = (1/n²)(Var(X1) + … + Var(Xn)) = nσ²/n² = σ²/n,  so  SD(X̄) = σ/√n

So the sample mean is distributed with a mean of μ and a standard deviation of σ divided by the square root of the sample size, n (and when the population is normal, the sample mean is itself normal). The sample size is not how many sample values we take from the sampling distribution of the sample mean. Instead, it corresponds to the number of observations used to calculate each sample mean in the sampling distribution. A different value of n corresponds to a different sampling distribution of the sample mean, because it has a different variance. The standard deviation of the sampling distribution of the sample mean is usually called the standard error of the mean. Intuitively, it tells us by how much we deviate from the true mean, μ, on average. So when we take a sample of size n from the population and compute the sample mean, we know two things: we expect, on average, to compute the true population mean, μ; and the typical amount by which we will deviate from the parameter μ is σ/√n, the standard error of the mean.
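A simulation sketch of these two facts (the N(100, 20) population and n = 4 are hypothetical choices): the average of many sample means should sit near μ = 100, and their spread near σ/√n = 10.

```python
import math
import random

random.seed(1)

mu, sigma, n, reps = 100, 20, 4, 50000

# Draw many samples of size n and record each sample mean
means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

avg = sum(means) / reps
se = math.sqrt(sum((m - avg) ** 2 for m in means) / reps)
print(avg, se)   # avg near 100, se near 20/sqrt(4) = 10
```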
Clearly, the standard error of the mean decreases as n increases, which implies that larger samples lead to greater precision (i.e. less variability) in estimating μ. Suppose repair costs for a TV are normally distributed with a population mean of $100 and a standard deviation of $20. To simplify notation, we can write: Repair Cost ~ N(100, 20).  The sampling distribution of the sample mean is the probability distribution of all sample means computed from a fixed sample of size n. In this case, Average Cost ~ N(100, 20/√n).

If you want to visualize the distribution of repair costs for the TV in your bedroom, you’d plot a normal curve with a mean of 100 and a standard deviation of 20. If, however, you’d like to visualize the average repair cost of the four TVs in your house, the correct curve would still be normal, but it would have a mean of 100 and a standard deviation (in this case, more correctly called a standard error) of 20/√4. Now if you include your neighbor’s TVs as well, your sample size will be eight. The probability density function describing the average cost of the eight TV repairs is N(100, 20/√8).

[Plot: the three normal curves for n = 1, 4, and 8]

To demonstrate the idea that the variance and the sample size are inversely related, consider the probability that a TV will require more than $110 worth of repairs. The probability that one TV requires repairs in excess of $110 (i.e. sample size = 1) is P(Z > (110 − 100)/20) = P(Z > 0.5) = 30.85%. The probability that the average repair cost of a sample of four TVs exceeds $110 is P(Z > (110 − 100)/(20/√4)) = P(Z > 1) = 15.87%. For the sample of size n = 8, we get 7.86%. What’s happening is that the value $110 is becoming more ‘extreme’ relative to the distribution parameters as we increase the sample size. Since we’ve decreased the standard error when we take a sample of size 8 instead of 4, for example, the value $110 is less likely to be exceeded.
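The three probabilities can be reproduced from the standard normal tail; a sketch using Python’s `math.erfc`:

```python
import math

def normal_sf(x, mu, sigma):
    """P(X > x) for X ~ N(mu, sigma): the upper tail of the normal cdf."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

cost, mu, sigma = 110, 100, 20
for n in (1, 4, 8):
    se = sigma / math.sqrt(n)            # standard error for a sample of size n
    print(n, round(100 * normal_sf(cost, mu, se), 2))
```

This prints 30.85, 15.87, and 7.86 percent for n = 1, 4, and 8.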
We can use the sample size n in a more useful way if we consider its impact on the precision of the sample mean. We know a larger n corresponds to a tighter distribution of the sample mean around the true mean.  Therefore, we can achieve a desired level of precision with our point estimator (the sample mean) by manipulating n. Since increasing n generally costs money in practice (think about surveying more respondents, etc.), we want to find the minimum value of n that gives us the desired condition. Let’s say I want the sample mean to be within 10 units of the true parameter with probability 0.97. To do this, I’ll have to decrease the variability by increasing n, and I’d like to find the minimum value that achieves this result, since I want to spend as little money as possible. In other words, I want the smallest n such that

P( |X̄ − 100| ≤ 10 ) ≥ 0.97,  where X̄ ~ N(100, 20/√n)

Plotting sample sizes against the probability of being within 10 units of the parameter μ and solving for n gives 18.84, so n = 19 is the smallest sample size that gives us the desired precision. In the next post, I’ll talk about how the properties of the sampling distribution of the sample mean relate to one of the most important theorems in all of mathematics, the Central Limit Theorem.
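The minimum sample size can be found without Mathematica; a sketch:

```python
import math

def prob_within(n, tol=10, sigma=20):
    """P(|Xbar - mu| <= tol) when Xbar ~ N(mu, sigma/sqrt(n))."""
    z = tol / (sigma / math.sqrt(n))
    return math.erf(z / math.sqrt(2))   # P(|Z| <= z) for standard normal Z

# Walk n upward until the precision requirement is met
n = 1
while prob_within(n) < 0.97:
    n += 1
print(n)  # 19
```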

Modeling Stock Market Behavior

In the finance world, there’s some debate about whether or not the daily closing prices for various stock market indices convey  useful information.  Some financiers subscribe to the belief that the daily close price reflects market trends and impacts the probability of realizing a good return.  Others disagree, claiming that the day-to-day movements in the stock market are completely random and convey no useful information.  If this is true, then the closing price changes in the stock market should mirror a geometric random variable.  In this post I’ll explain why the geometric model would imply that stock market fluctuations are random and then test the validity of the model empirically.

Suppose the outcome of some event is binary, and success occurs with probability p.   Obviously failure must occur with probability 1 − p.  A geometric random variable models the number of trials up to and including the first success: it takes on the value k when the first success occurs on the kth trial.  Trials are assumed to be independent, so we can write the probability density function of the random variable X as follows:

P(X = k) = P(failure on trials 1, …, k − 1) · P(success on trial k)

We used the independence assumption to rewrite the event “k-1 failures and a success on trial k” as the product of two distinct groups of events, namely k -1 failures and then 1 success.  Now we use the fact that success occurs with probability p (and the independence assumption, again) to write the following:

pX(k) = P(X = k) = (1 − p)^(k − 1) · p,  k = 1, 2, 3, …

To model the behavior of the stock market as a geometric random variable, assume that on day 1 the market has fallen from the previous day.  We’ll call this fall in the closing price a “failure” that occurs with probability 1 − p.  Let the geometric random variable X count the consecutive declines, beginning with day 1’s, that occur until the stock market rises (“success”).  For example, if on the second day the stock market rises, the random variable X takes on the value 1, because there was only one decline (failure) before the rise (success) that occurred on the second day.  Similarly, if the market declines on days 2, 3, and 4 and rises on day 5, then it has declined on four occasions before rising on the fifth day, and thus the random variable X takes on the value 4.  Keep in mind that it is stipulated in the formulation of the random variable that the market declined on day 1, and therefore a fall on days 2, 3, and 4 is a sequence of four failures, not three.

To determine whether a geometric model fits the daily behavior of the stock market, we have to estimate the parameter p.  In our model, we are addressing the question of whether stock market price fluctuations are geometric.  A geometric random variable is defined for any p between 0 and 1, so our model doesn’t address the probability with which the stock market rises and falls; the geometric model describes the behavior of the random variable for a given p.  The value p takes on may be of interest in formulating other questions, but here its job is to create a realistic geometric model that we can compare to empirical stock market data.  If the stock market data fits the geometric model, the implication is that stock markets tend to rise and fall randomly, with a constant probability of success.  This suggests that daily stock market quotes are meaningless in the sense that today’s price does not reflect historical prices.  One could say that under this model stock markets don’t “remember” yesterday; in fact, the geometric distribution has exactly this memoryless property – it is the discrete analogue of the exponential distribution, which is memoryless in continuous time.

Once we get some empirical data, we’re going to estimate the probability of success p.  So let’s solve the general case and then compute an actual value with data afterwards.  There is no one way to estimate the value of a parameter, but one good way to do so is to use the maximum likelihood estimator of the parameter.  The idea is simple, but sometimes computationally difficult.  To estimate the value of p with the maximum likelihood estimator, we find the value of p for which the observed sample is most likely to have occurred.  We are basically maximizing the “likelihood” that the sample data come from a distribution with parameter p.   To do this, we take the likelihood function, which is the product of the probability density function of the underlying distribution evaluated at each sample value:

L(θ) = pX(k1; θ) · pX(k2; θ) · · · pX(kn; θ)

For our model, we just need to substitute in the pdf of a geometric random variable for the generic pdf above and replace theta with p, the probability of success:

L(p) = (1 − p)^(k1 − 1) p · (1 − p)^(k2 − 1) p · · · (1 − p)^(kn − 1) p = p^n (1 − p)^(Σki − n)

To find the maximum likelihood estimate for p, we maximize the likelihood function with respect to p.  That is, we take its derivative with respect to p and set it equal to 0.  However, it’s computationally simpler to work with the natural logarithm of the likelihood function.  This won’t affect the value of p that maximizes L(p), since the natural logarithm is a monotonically increasing function of its argument.  Sometimes you’ll hear of “log-likelihood functions”, and this is precisely what they are – just the log of a likelihood function, which facilitates easier calculus.

ln L(p) = n ln(p) + (Σki − n) ln(1 − p)

Taking the derivative of this function is a lot easier than the likelihood function we had before:

d/dp ln L(p) = n/p − (Σki − n)/(1 − p) = 0   ⇒   p̂ = n / Σki = 1 / k̄

So our maximum likelihood estimate of p (the probability of success) is one divided by the sample average or, equivalently, n divided by the sum of all the k values in our sample.  This gives us the value of p that is most consistent with the n observations k1, …, kn.  Below is a table of k values derived from closing data for the Dow Jones over the course of the year 2006–2007.
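Using the two totals from the dataset (128 observations whose k values sum to 221, both stated later in the post), the estimate can be sketched as:

```python
# From the Dow Jones dataset: n = 128 observed runs whose k values sum to 221
n_obs, k_sum = 128, 221

p_hat = n_obs / k_sum    # MLE: n divided by the sum of the k values
k_bar = k_sum / n_obs    # sample mean; equivalently p_hat = 1 / k_bar

print(round(p_hat, 4))   # 0.5792
```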

Recall that the random variable X takes on the value k when k − 1 failures (market price decreases) occur before a success (price increase) on trial k.  For example, X takes on the value k = 1 on 72 occasions in our dataset, which means that on 72 occasions over the course of the year the market declined on day 1 (by definition) and subsequently rose on day 2.  Similarly, there were 35 occasions where two declines were realized before a rise, because the random variable X took on the value k = 2 on 35 occasions.

K  Observed Freq.
1  72
2  35

[remaining rows of the table are not recoverable; the 128 observed k values sum to 221]
We now have the necessary data to compute p̂.  We have 128 observations (values of k), so n = 128.  There are two ways we can compute p̂.  First, we could take the sample mean of the dataset as we normally would for a discrete random variable and then use formula 1 above:

k̄ = 221/128 ≈ 1.7266,  so  p̂ = 1/k̄ = 128/221 ≈ 0.5792

The second formula obviously yields the same result, as you directly compute 128/221 instead of first computing its reciprocal.  So we now have a maximum likelihood estimate for the parameter p.  We can use this to model the stock price movement as a geometric random variable.  First let’s assume that the stock market can in fact be modeled this way.  Given our value of p̂, what would we expect for the values of k?  That is, with what proportion or frequency do we expect X to take on the values k = 1, 2, …?  First we’ll compute this, and then compare to the empirical data.

P(X = 1) = (1 − p̂)^0 · p̂ = p̂ ≈ 0.5792

The probability that X takes on the value one is equal to the probability of success, which is to be expected since X = 1 corresponds to the situation in which success is realized on the day immediately following the initial failure.

P(X = 2) = (1 − p̂) · p̂ ≈ (0.4208)(0.5792) ≈ 0.2437

And the rest are computed the same way.  Now since we have 128 observations, we can multiply each expected percentage by the number of observations to come up with an expected frequency.  Then, we can compare these to the observed frequencies and judge how well the model fits.
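The expected percentages and frequencies can be sketched in a few lines, using p̂ = 128/221 from above:

```python
p = 128 / 221   # MLE of the success probability from the sample

# Geometric pmf: P(X = k) = (1 - p)^(k - 1) * p; scale by 128 observations
expected = {k: 128 * (1 - p) ** (k - 1) * p for k in range(1, 7)}
for k, freq in expected.items():
    print(k, round(freq, 1))
```

This gives roughly 74.1, 31.2, 13.1, 5.5, 2.3, and 1.0 expected occurrences for k = 1 through 6.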

K  N  Expected %  Expected Frequency
1  128  57.92%  74.1
2  128  24.37%  31.2
3  128  10.26%  13.1
4  128  4.32%  5.5
5  128  1.82%  2.3
6  128  0.76%  1.0

Now that we know what we should expect if the geometric model is a valid representation of the stock market, let’s compare the expected frequencies to the observed frequencies:

Expected Frequency  Observed Frequency
74.1  72
31.2  35

[observed frequencies for k ≥ 3 are not recoverable from the original table]

The geometric model appears to be a very good fit, which suggests that daily fluctuations in stock market prices are random.  Furthermore, stock indices don’t ‘remember’ yesterday – the probability of the market rising or falling is constant, and whether it actually rises or falls on a given day is subject to random chance.

Immunization: A buzzword-free introduction

In a previous post, I talked about a few popular measures of interest rate risk: Macaulay Duration, Modified Duration, and Convexity.  However, I didn’t mention the practical implementation of these metrics or their relationship with the concept of Immunization.  Broadly, the task of managing the interest rate risk associated with a given portfolio of financial assets comes down to minimizing the impact of a specific case of rate fluctuations: the decrease in asset value that results from an increase in interest rates.  It’s hard to come up with a good definition of immunization, and rather than copy and paste cookie-cutter bullshit I’ll just say that a portfolio is “immunized” when its value is guarded against said interest rate fluctuations.  It is not difficult to mathematically derive the conditions that are necessary for this to be the case, and once they’ve been derived they can be expressed in terms of the familiar interest rate risk metrics Macaulay Duration, Modified Duration, and Convexity.

First we need to generalize the concept of the Duration of a single asset to the Duration of a portfolio.  In the simplest case, we have two assets, A and B.  The change in the portfolio value when interest rates change is just the sum of the changes in value to assets A and B individually, assuming that the change in the interest rate is the same for assets A and B.  If you want to be fancy, this uniform interest rate fluctuation across all assets can be called a parallel shift in the yield curve.  Below, an expression is derived for the change in portfolio value:

ΔP = ΔPA + ΔPB ≈ −(PA · DM,A + PB · DM,B) · Δi


This expression doesn’t say a whole lot about the underlying process.  Since we assumed that the assets are affected by the same change in interest rates, we could factor the Δi term out of the above expression.  We could also manipulate the expression so as to express the change in price as the product of P and some other term containing the relevant duration and price metrics for each asset.  It takes some algebraic simplification and clever factoring, but you can pretty easily show that the total change in portfolio value is simply the weighted average of the changes in assets A and B, which implies that the Modified Duration of the portfolio is the weighted average of the Modified Durations of the individual assets in the portfolio, where an asset’s weight is its proportion of total portfolio value.  The formula for the duration of a portfolio consisting of m investments (X1, …, Xm) is below.

DM,portfolio = Σ (Pj / P) · DM,j,  where P = P1 + … + Pm


It’s important to remember that we made a simplifying assumption in deriving this formula, namely that interest rate fluctuations are the same for each individual asset that comprises the portfolio.  In other words, when interest rates change (in our simplified model) it’s due to a parallel shift in the yield curve.  This won’t usually be the case, but the above model is still a useful approximation.
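A minimal sketch with two hypothetical assets (the prices and durations are made up):

```python
# Hypothetical two-asset portfolio: market prices and modified durations
assets = [
    {"price": 60000.0, "dm": 4.2},
    {"price": 40000.0, "dm": 7.5},
]

total = sum(a["price"] for a in assets)
# Portfolio modified duration = value-weighted average of the asset DMs
dm_portfolio = sum(a["price"] / total * a["dm"] for a in assets)

# Approximate price change for a parallel shift of +0.5% in rates
delta_i = 0.005
delta_p = -dm_portfolio * total * delta_i
print(dm_portfolio, delta_p)   # 5.52 and -2760.0
```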

I’m convinced that a concise definition of Immunization doesn’t really exist, but I’ll explain the steps involved in “immunizing” against interest rate fluctuations in mathematical terms and then try to express the result in terms of duration and convexity to bridge the gap between the math and portfolio management.  Suppose at any given time you have some assets and some liabilities.  The liabilities are to be paid out at future dates, but you’ve estimated the present value of said payments given an assumption about the current interest rate.  Ideally you’d like to match those liabilities with assets in order to cover them; for example, you would like to match a liability of $P at time t by purchasing an asset today for some price that yields exactly $P at time t, a situation sometimes referred to as an exact match.  But that definition is of little practical use, because the idea that a portfolio of any realistic size could be matched exactly, given the enormous number of possible combinations of assets out there, is absurd.  Maybe it’s “possible”, but I’m using the word lightly: a firm could theoretically hire someone, title him “exact matcher”, and he would need an incredible amount of time to (maybe) eventually find an answer.  So exact matching is conceptually only mildly unreasonable, but in practice it is wildly impractical.

So let’s stop thinking in finance buzzwords for a minute and just think about the conditions necessary for you to not get burned by an increase in interest rates.  Actually, first, let’s figure out if and why you’d get burned if interest rates change.  You have some liabilities with a present value that you’ve estimated, and we’ll assume that you’ve picked assets so as to match them at the current time.  (Note – this is not “exact matching”, because I’m talking about the present.)  It’s not unreasonable to say that at the present time assets are equal to liabilities.  Suppose interest rates increase; the present value of your assets will decrease.  This sucks, but won’t the present value of your liabilities decrease too?  Yes.  So what are we worried about, then?  Immunization deals with guarding against an interest rate change that disproportionately affects the present values of assets and liabilities; specifically, the case in which an interest rate variation results in the PV of liabilities exceeding the PV of assets.

Naturally, our first condition for immunizing a portfolio is that a small (this is important, but we’ll come back to it later) change in interest rates, which we’ll denote as a change from i0 to some nearby i, leaves the PV of assets at least as large as the PV of liabilities:

PVassets(i) ≥ PVliabilities(i)  for all i near i0


Inequalities suck, so let’s define a new function h(i) as the difference between the PV of assets and the PV of liabilities.  The statement above is equivalent to saying h(i) ≥ 0 near i0, and the condition mentioned earlier, that the present value of assets should equal that of liabilities at the current rate, is equivalent to saying h(i0) = 0.  So we have the following:

h(i0) = 0,  and  h(i) ≥ 0 for all i near i0


What do we know about h(i)?  The first condition is pretty obvious and uninteresting; all that is said in requiring h(i0) = 0 is that PV assets = PV liabilities initially, which has already been stated more than enough.  The next condition is more interesting: regardless of the direction, any small change in the independent variable must not decrease the function h.  This makes i0 a local minimum of h, which, by calculus, is a stationary point with a positive second derivative.  The result is summarized below in terms of the function h and also in terms of the actual assets and liabilities.

h(i0) = 0,  h′(i0) = 0,  h″(i0) > 0

Equivalently:  PVA(i0) = PVL(i0),  PVA′(i0) = PVL′(i0),  PVA″(i0) > PVL″(i0)


If we wanted to, we could write the preceding expressions in terms of duration and convexity, since they are derived from the first and second derivative respectively.  Deriving these expressions is computational (involves a lot of substituting and rearranging and dealing with negatives but nothing actually hard, just annoying) so I will leave that part out.  If you don’t believe me you can practice your computational high school algebra skills and try it on your own.

PVA = PVL,  DM,A = DM,L,  CA > CL


I think there are finance-y terms for each of the three conditions above, but I can’t remember and a quick Internet search didn’t yield any helpful results.  I don’t think buzzwords matter anyway, but if I had to explain them in words, I’d say something like the following (you can put a buzzword-y spin on it):

1)    The present values of Assets and Liabilities are equal

2)    The Durations of Assets and Liabilities are equal

3)    The Convexity of Assets is greater than the Convexity of Liabilities

4)    You’re Immunized!
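To make the three conditions concrete, here’s a minimal Python sketch with made-up numbers: a single liability of 1000 due at t = 5, funded by hypothetical zero-coupon assets at t = 1 and t = 9 (splitting the asset PV equally between those two dates happens to match the duration too, since (1 + 9)/2 = 5).

```python
# Numeric sketch of the three immunization conditions; all cash flows are hypothetical.
def pv(cashflows, i):
    """Present value of (time, amount) cash flows at rate i."""
    return sum(cf * (1 + i) ** -t for t, cf in cashflows)

def duration(cashflows, i):
    """Macaulay duration: PV-weighted average payment time."""
    return sum(t * cf * (1 + i) ** -t for t, cf in cashflows) / pv(cashflows, i)

def convexity(cashflows, i):
    """PV''(i) / PV(i)."""
    return sum(t * (t + 1) * cf * (1 + i) ** -(t + 2) for t, cf in cashflows) / pv(cashflows, i)

i0 = 0.05
v = 1 / (1 + i0)
liabilities = [(5, 1000.0)]
p = pv(liabilities, i0)
# Split the asset PV equally between t = 1 and t = 9: matches both PV and duration.
assets = [(1, (p / 2) / v), (9, (p / 2) / v ** 9)]

print(abs(pv(assets, i0) - pv(liabilities, i0)) < 1e-9)              # condition 1
print(abs(duration(assets, i0) - duration(liabilities, i0)) < 1e-9)  # condition 2
print(convexity(assets, i0) > convexity(liabilities, i0))            # condition 3
# The payoff: h(i) > 0 when rates move a little in either direction.
print(pv(assets, 0.04) > pv(liabilities, 0.04), pv(assets, 0.06) > pv(liabilities, 0.06))
```

Bumping the rate 1% in either direction leaves the assets worth slightly more than the liability, which is exactly what the three conditions promise for small moves.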


Why the formulas for Duration and Convexity make perfect sense.

*Disclaimer:  The title is somewhat misleading.  While the mathematical derivation of said formulae is straightforward, the fact that Modified Duration is abbreviated by DM instead of MD makes no fucking sense.

The Macaulay duration is just the weighted average time at which an investment pays, where the weight on each payment time is that payment’s share of the present value of all the cash flows that determine the price of the asset. (Note: v = 1/(1+i) in the equations below for simplicity.)

D = Σ_t t · CF_t · v^t / P,  where the price P = Σ_t CF_t · v^t
So the Macaulay duration basically considers how large the present value of a given payment is in comparison to the sum of the present values of all payments, which is the price. This weights (i.e. assigns a proportion to) each payment time, so the result is the weighted average time at which the investment pays.

The modified duration captures the change in the price of an asset that results from a change in interest rates, which is, mathematically, the derivative of price with respect to interest. The value of an asset is a function of (among other things) interest rates, so we can write P(i), meaning the price of the asset at interest rate i, to emphasize that interest is the independent variable and the asset price is the dependent variable. P′(i) is then the rate of change of the price with respect to interest rates; in other words, the change in value that results from a change in rates. You can think of this as “sensitivity” to interest rates. P′(i) will always be negative, because the price of an investment is a decreasing function of interest rates. (If you need to convince yourself of this, consider what happens to the value of a series with (1+i) in the denominator as i increases, and vice versa.) Whoever came up with the definition of modified duration (DM) chose to multiply the expression by −1 so as to “normalize”, if you will, the effect of the negative derivative: since we know the sign will be negative by the laws of finance, carrying it around is useless and only begs the person using it to make a computational error. Formally, modified duration is the rate of change of the price of an asset with respect to interest rates, expressed as a percentage of the asset’s price: DM = −P′(i)/P(i).
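That sensitivity is easy to sanity-check numerically. Below is a minimal Python sketch (the 3-year, 5%-coupon bond and the finite-difference step size are made-up for illustration): it estimates P′(i) with a central difference, confirms it is negative, and shows that DM = −P′(i)/P(i) comes out positive.

```python
# Hypothetical bond: 3-year, 5% annual coupons, face value 100.
CASHFLOWS = [(1, 5.0), (2, 5.0), (3, 105.0)]

def price(i):
    """Price = sum of discounted cash flows at rate i."""
    return sum(cf * (1 + i) ** -t for t, cf in CASHFLOWS)

i, h = 0.04, 1e-6
p_prime = (price(i + h) - price(i - h)) / (2 * h)   # central-difference estimate of P'(i)
dm = -p_prime / price(i)                            # modified duration
print(p_prime < 0, dm > 0)   # price falls as rates rise; the -1 flips the sign
```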


Now let’s consider how the total change of a function, f(x), that results from a change in the independent variable x can be approximated from a purely mathematical standpoint. The change in the independent variable is not of any interest because it can be freely manipulated and we immediately know what its value is. The dependent variable, however, is more interesting. We don’t automatically know what happens to f(x) when we change x by some small amount dx (d = delta in this case; I couldn’t find it in MS word). The values of f at f(x) and f(x + dx) could be exactly the same or wildly different. The Taylor Series is a way to approximate the value of f(x + dx).

f(x + dx) = f(x) + f′(x)·dx + f″(x)·(dx)^2/2! + … = Σ_{n=0}^∞ f^(n)(x)·(dx)^n / n!

f(x + dx) − f(x) = f′(x)·dx + f″(x)·(dx)^2/2! + f‴(x)·(dx)^3/3! + …
The second expression is the first expression rewritten to represent the total change in f instead of the value of f(x + dx). It is derived by noting that the total change of f equals f(x + dx) − f(x) and making the appropriate substitution. It’s not necessary to calculate derivatives up to large values of n in most cases, and many functions have nonzero derivatives of only a few orders (the exponential and trig functions, whose derivatives never vanish, are obvious exceptions). So this expression is often truncated to a Taylor polynomial of a given degree, where the degree is the highest-order derivative considered.
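As a sanity check on truncating the series, here’s a tiny Python sketch using f = exp (chosen arbitrarily): the second-degree Taylor polynomial approximates f(x + dx) more closely than the first-degree one.

```python
import math

# First- and second-degree Taylor approximations of f(x + dx) for f = exp,
# using the convenient fact that every derivative of exp is exp.
x, dx = 1.0, 0.1
exact = math.exp(x + dx)
first = math.exp(x) + math.exp(x) * dx            # f(x) + f'(x) dx
second = first + math.exp(x) * dx ** 2 / 2        # ... + f''(x) (dx)^2 / 2!
print(abs(exact - first) > abs(exact - second))   # the degree-2 polynomial is closer
```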


This method of approximation can be (and has been) applied to the value of financial assets. The definitions of Macaulay duration, modified duration, and convexity are direct results of the Taylor series. The expression below is a first-degree approximation of the change in price P(i) that results from a change in interest rates equal to Δi:

ΔP ≈ P′(i) · Δi
How is this related to the modified duration DM? We can rewrite the expression by multiplying it by P(i)/P(i), since that’s equal to 1:

ΔP ≈ [P′(i)/P(i)] · P(i) · Δi = −DM · P(i) · Δi
So the first-degree Taylor series approximation of the change in value of the asset is the negative modified duration, multiplied by the price before interest rates changed, multiplied by the change in interest rates. We can do the same thing with the second-degree Taylor polynomial to get an expression in terms of modified duration and convexity:

ΔP ≈ −DM · P(i) · Δi + (C/2) · P(i) · (Δi)^2,  where the convexity C = P″(i)/P(i)
This is not a coincidence! This is why these formulas make sense. But while the mathematical validity of these formulas is important, it doesn’t by itself show why they effectively measure interest rate risk. To see that, first note the following relationship between the Macaulay duration and the modified duration:

DM = D · v = D / (1 + i)
We’ve already derived an expression for the 1st degree approximation of the change in asset value; rewrite this expression both in terms of Duration and Modified Duration:

ΔP ≈ −DM · P(i) · Δi = −[D / (1 + i)] · P(i) · Δi
The clear implication of the expression above is that longer-duration investments undergo greater price changes than shorter-duration investments when interest rates change, which suggests assets with longer durations are more volatile. This shouldn’t be surprising – think about what happens to a bond holder when interest rates rise. His coupon payment is fixed (generally speaking), but his cash flows are discounted by a larger number, so the ratio in each term, payment to discount factor, decreases. Since the discount factor grows exponentially with time, payments further in the future are disproportionately affected by the change in rates. Therefore, holders of longer-term bonds will be worse off than short-term bondholders.
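A quick numeric illustration of this point, under assumed numbers (two hypothetical zero-coupon bonds with face 100, and a 1% rate shock): the longer bond’s relative price change is several times larger, and the first-order estimate −DM · Δi tracks both reasonably well.

```python
# Two hypothetical zero-coupon bonds, face 100; a zero's Macaulay duration equals
# its term t, so comparing t = 2 and t = 10 compares short vs. long duration.
def price(t, i):
    return 100.0 * (1 + i) ** -t

i0, di = 0.05, 0.01
for t in (2, 10):
    rel_change = (price(t, i0 + di) - price(t, i0)) / price(t, i0)
    approx = -(t / (1 + i0)) * di      # first-order estimate: -DM * delta_i
    print(t, round(rel_change, 4), round(approx, 4))
```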

Paradoxical expected values

We tend to think of the expected value of a random variable as a standard, practical, and simple measure of the variable’s central tendency.  However, there are situations in which the expected value is misleading or even nonexistent.  One famous case, commonly called the “St. Petersburg Paradox”, illustrates perfectly the limitations of the expected value as a measure of central tendency.

Suppose a fair coin is flipped until the first tail appears.  You win $2 if a tail appears on the first toss, $4 if a tail appears on the second toss, $8 if a tail appears on the third toss, etc.  The sample space for this experiment is infinite: S = {T, HT, HHT, HHHT, …}.  A game is called fair if the ante, or the amount you must pay to play, is exactly equal to the expected winnings from the game.  Casino games are obviously never fair, because casinos stand to earn a profit (on average) from the games they offer – otherwise, they would not have a viable business model and they couldn’t afford to give you free drinks and erect fancy statues (though, while we’re on the subject, craps is the “fairest” casino game in the sense that the odds are skewed least heavily in the house’s favor).

Let the random variable W denote the winnings from playing the game described above.   We want to characterize W so that we can mathematically evaluate its expected value.  Once we find this expected value, we know how much we’d have to ante for the game to be considered fair.  We know how much we’ll win on the first, second, and third toss, but let’s generalize to k tosses:

  • T on 1st toss -> $2
  • T on 2nd toss -> $4
  • T on 3rd toss -> $8
  • T on kth toss -> $2^k

So we win $2^k if the first tail appears on the kth toss.  But what is the probability of each of those outcomes?  For toss one, there is a ½ chance of getting a tail.  Since trials are independent, the probability of the first tail appearing on the second toss is (1/2)*(1/2) = 1/4.  For the third toss, the probability is 1/8.  So the probability of winning $2^k is the probability of getting the first tail on the kth toss which is:

P(first tail on toss k) = (1/2)^k

To find the expected value of the random variable W, we need to sum, over all values of k (remember that k is the toss on which the first tail appears), the payoff times the probability of that payoff.  It’s evident from the sample space that k can take on infinitely many values.  The expected value is:

E(W) = Σ_{k=1}^∞ 2^k · p_W(2^k)

So the expected value of the random variable W (winnings) is the sum over all k of the payoff (2^k dollars) times the probability of realizing that payoff (p(2^k)).  So the first term is the actual payoff, the second is the probability of that payoff.  Plugging in what we know about the probability of k and the infinite nature of the sample space:

E(W) = Σ_{k=1}^∞ 2^k · (1/2)^k = Σ_{k=1}^∞ 1 = ∞

The sum diverges!  There is no finite expected value.  That is, the expected value of the game is infinite, which means we’d have to pay an infinite amount of money for the game to be fair.
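A simulation sketch of the game (seeded for reproducibility; the payout scheme is exactly the one described above). The sample mean never settles down as you average over more and more games, which is the divergence showing up empirically:

```python
import random

def play(rng):
    """Flip until the first tail; pay $2^k if it lands on toss k."""
    k = 1
    while rng.random() < 0.5:   # heads with probability 1/2: flip again
        k += 1
    return 2 ** k

rng = random.Random(0)
for n in (100, 10_000, 200_000):
    mean = sum(play(rng) for _ in range(n)) / n
    print(n, mean)   # the running average keeps drifting instead of converging
```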

The point is not that you should wager an infinite amount of money in order to play the aforementioned game.  The point is that the expected value is often an inappropriate measure of central tendency that leads to an inaccurate characterization of a distribution.

Criminology and Combinatorial Probability

I just read a really interesting application of combinatorics to criminology in my stats book, An Introduction to Mathematical Statistics and Its Applications, by Larsen and Marx.  Alphonse Bertillon was a French criminologist who developed a system to identify criminals.  He chose 11 anatomical characteristics that supposedly remain somewhat unchanged throughout a person’s adult life – such as ear length – and divided each characteristic into three classifications: small, medium, and large.  So each of the 11 anatomical variables is categorized according to its respective size, and the ordered sequence of 11 letters (s for ‘small’, m for ‘medium’, and l for ‘large’) comprises a “Bertillon configuration”.

This seems rudimentary and imprecise, perhaps, but keep in mind this was before the discovery of fingerprinting and so forth.  One glaring issue with this system is the obvious subjectivity involved with assigning a classification to a given anatomical variable.  Also, I’d imagine it’s kind of hard to get all of this information about a person, especially if they’re suspected of having committed a crime but haven’t ever been in custody.

Issues aside, the Bertillon configuration is an interesting idea, I think.  But the obvious question to ask is: how likely is it that two people have the exact same Bertillon configuration?  The implication would be two people who, from an identification standpoint, are exactly the same (and you can use your imagination from there).  So, how many people must live in a municipality before two of them necessarily share a Bertillon configuration?

The question is actually very simple to answer via the multiplication rule:

If Operation A can be performed in m different ways, and Operation B can be performed in n different ways, the sequence {Operation A, Operation B} can be performed in (m)(n) different ways.

Corollary: If Operation Ai can be performed in ni ways, i = 1, 2, …, k, respectively, then the ordered sequence {Operation A1, Operation A2, …, Operation Ak} can be performed in (n1)(n2)···(nk) ways.

Think of Operation A as the anatomical variable and Operation B as assigning it a value (s,m,l).

Operation_i (anatomical variable i) → one of {s, m, l},  for i = 1, 2, …, 11

Each operation can be performed in three different ways, namely s, m, and l.  Since there are 11 anatomical variables, by the multiplication rule there are (3)(3)···(3) = 3^11 = 177,147 possible Bertillon configurations.  Therefore, once more than 177,147 people live in a town, two of them must share a Bertillon configuration.
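The count itself is a one-liner (variable names are mine, just for illustration):

```python
# Multiplication rule: 11 anatomical variables, 3 classifications each.
n_variables, n_sizes = 11, 3
n_configs = n_sizes ** n_variables
print(n_configs)   # 177147 distinct Bertillon configurations
# By the pigeonhole principle, any town with n_configs + 1 residents has a repeat.
```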

Continuously Compounded Returns

I’m in a financial “math” class that I really shouldn’t have taken (it’s in the business school) and our lecture notes the other day included some incredibly basic properties about interest calculations.  Anyway, I told one of my friends that it was super fucking easy to derive the formula for continuous interest by taking the limit as the number of compounding periods approaches infinity.  Upon further review, I stand by the fact that it is conceptually easy to do this, but there are some computational issues that one could run into – for example, I had to use a relatively simple substitution in order to avoid using L’Hopital’s rule or something else messy like the definition of a difference quotient, etc.  But, regardless, below is the derivation, starting with the definition of the effective annual rate for an interest rate, r, compounded n periods per year for t years.  Suppose there are n compounding periods per year and r is the interest rate.

EAR = (1 + r/n)^n − 1,  with accumulated value (1 + r/n)^{nt} after t years

How do we show that the effective annual rate under continuously compounded interest (i.e. effective interest with arbitrarily large n) is e^r − 1?  We need to show the equation below, which equates the limit, as the number of compounding periods approaches infinity, to the formula we claimed represents continuous compounding (e^{rt} − 1):

lim_{n→∞} (1 + r/n)^{nt} − 1 = e^{rt} − 1

If I were being tested on this in a math class for whatever reason, I might not show the continuously compounded interest rate this way.  This is more of an intuitive justification than a real proof.  To see a real proof, you can check out notes from Wharton here – it’s not really any “harder” per se, but some of the steps might be less obvious.  For example, the proof involves taking the natural logarithm of both sides of the equation (as shown below) which serves a purpose, though it may not seem like it at first.

ln A(t) = ln[A(0) · (1 + r/n)^{nt}] = ln A(0) + nt · ln(1 + r/n)

This was done in order to take advantage of a useful property of logarithms: the log of a product is the sum of the logs.  Using this, the right-hand side splits into a sum whose first term no longer depends on the limit, because it has no n in it.  You’re left with a somewhat messy, implicitly defined equation involving logs on both sides; to rectify that, just remember that e and the natural log are inverses: e raised to the power ln(x) gives back x, and ln(e) = 1.
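A quick numeric check of the limit, with an assumed rate of r = 5%: the effective annual rate climbs toward e^r − 1 as compounding gets more frequent.

```python
import math

# (1 + r/n)^n - 1 approaches e^r - 1 as the number of compounding periods n grows.
r = 0.05
for n in (1, 12, 365, 1_000_000):
    print(n, (1 + r / n) ** n - 1)   # EAR with n compounding periods per year
print("limit:", math.exp(r) - 1)
```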

Note: just so you don’t sound like an asshole, e is Euler’s number, and it is pronounced like “oiler”.

Another note:  There are tons of definitions of e, but the one I think of first when I think of e is the one below:

∫_1^e (1/t) dt = 1

In words, e is the positive real number such that the integral from 1 to e of the function 1/t equals 1; equivalently, the area under the curve 1/t from 1 to e is 1.  Relatedly, ln(x) crosses the x-axis at x = 1 (since ln(1) = 0), and the area between ln(x) and the x-axis from that x-intercept to x = e is also 1.  And of course ln(x) takes the value 1 at x = e, since ln(e) = 1.  I’m not good at making fancy graphs online, but below is a heinous picture of the point I’m trying to convey.  Sorry Euler 😦

[graph: y = ln(x), with the unit area between the curve and the x-axis from x = 1 to x = e shaded]
