Lecture 41-42 – Maximum Likelihood

What is maximum likelihood?

A method of estimating parameters by finding the values that maximize the ‘likelihood’ of generating the observed data.

Why the ‘likelihood’? Why not the actual probability of getting the observed data?

Think of a normal distribution! What is the probability of getting any one exact value from a normal distribution? Zero! The distribution is continuous, so only ranges of values have a non-zero probability.

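Then what are the values (heights) of the normal curve? They are probability densities, and when we read them as a function of the parameters for fixed data we call them the Likelihood!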

1-dimensional data – How to find parameter estimates for a normal distribution

What if one value? What values of mu and sigma will create the distribution that maximizes the likelihood of getting that x value? Mu will equal x and sigma will approach zero, which creates an infinitely tall normal distribution centered on x. Such a distribution maximizes the likelihood of generating a sample with value x.
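A minimal Python sketch of the single-value case. The observation 3.7 and the sigma values are made up for illustration; `normal_pdf` is just the standard normal density written out by hand.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x (a density, not a probability)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

x = 3.7  # hypothetical single observation
sigmas = (1.0, 0.1, 0.01)

# Centre the curve on x and shrink sigma: the likelihood keeps growing.
likelihoods = [normal_pdf(x, mu=x, sigma=s) for s in sigmas]
for s, lik in zip(sigmas, likelihoods):
    print(f"sigma={s}: likelihood={lik:.3f}")

# For a fixed sigma, moving mu away from x lowers the likelihood.
off_centre = normal_pdf(x, mu=x + 1.0, sigma=1.0)
print(f"mu one unit away from x, sigma=1.0: likelihood={off_centre:.3f}")
```

Note that the likelihood climbs above 1 once sigma is small, which is one more sign that it is not a probability.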

What if more than one value? What is the probability of getting two heads on two flips of a fair coin? 0.5 x 0.5 = 0.25, because each coin flip is independent. In the same way, we want to find the values of mu and sigma that maximize the likelihood of getting ALL the observed values, so we multiply the likelihoods of all the values together (under the assumption that the values are independent).

Equation (Probability density function; PDF) of Normal distribution -
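For reference, the standard form of the normal PDF, written in LaTeX with x the observed value, mu the mean and sigma the standard deviation:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```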

Equation represents the likelihood of getting a given x value for any value of mu and sigma.

The likelihood of getting a set (of size n) of x-values is:
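In standard notation, assuming the n values are independent, the likelihood is the product of the normal densities of the individual values:

```latex
L(\mu, \sigma \mid x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
```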

So we want to find the values of mu and sigma that maximize this value

Before we do, we usually modify the likelihood equation to make the solution easier to find. We usually don’t find the maximum of the likelihood; instead we find the minimum of the negative log likelihood. We work with the log likelihood because the log converts the product into a summation, which is much easier to take the derivative of. Taking the log does change the shape of the curve (flattens it out), but because the log is a strictly increasing function it doesn’t change the values of mu and sigma that maximize the likelihood. We take the negative of the equation, which doesn’t affect the shape of the curve at all, it just makes a mirror image of it, because optimization routines have traditionally been written to find minimums rather than maximums.
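A small Python sketch of both points, using made-up data and a fixed sigma of 1. It checks that the mu with the highest likelihood is also the mu with the lowest negative log likelihood. It also shows a practical reason for the log that the paragraph above does not mention: on a large data set the raw product underflows to zero in floating point, while the sum of logs stays finite.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    """Product of the densities (assumes independent observations)."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma)
    return result

def neg_log_likelihood(data, mu, sigma):
    """Negative of the summed log densities."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return n * math.log(sigma) + 0.5 * n * math.log(2 * math.pi) + sum_sq / (2 * sigma ** 2)

small = [9.1, 10.4, 11.2, 8.7, 10.9]            # hypothetical data
mu_grid = [8.0 + 0.01 * i for i in range(401)]  # candidate mu values, 8.00 to 12.00

# Same answer either way: maximize the likelihood or minimize its negative log.
best_by_likelihood = max(mu_grid, key=lambda mu: likelihood(small, mu, 1.0))
best_by_nll = min(mu_grid, key=lambda mu: neg_log_likelihood(small, mu, 1.0))
print(best_by_likelihood, best_by_nll)

# With 2000 observations the product underflows; the sum of logs does not.
big = small * 400
underflowed = likelihood(big, 10.06, 1.0)
finite_nll = neg_log_likelihood(big, 10.06, 1.0)
print(underflowed, finite_nll)
```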

Prepping equation for analysis

  1. Take log
  2. Simplify
  3. Take negative

Log rules:

ln(y)+ln(x) = ln(y*x)

ln(y)-ln(x) = ln(y/x)

ln(x^y) = y*ln(x)
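Applying the three steps and the log rules above to the likelihood of n independent normal observations gives the following. This is a reconstruction in standard notation; the last line is the negative log likelihood.

```latex
\begin{align*}
\ln L &= \sum_{i=1}^{n} \left[ -\ln\sigma - \tfrac{1}{2}\ln(2\pi)
          - \frac{(x_i - \mu)^2}{2\sigma^2} \right] \\
      &= -n\ln\sigma - \frac{n}{2}\ln(2\pi)
          - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \\
-\ln L &= n\ln\sigma + \frac{n}{2}\ln(2\pi)
          + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2
\end{align*}
```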

How do we find the value of mu that minimizes the negative log likelihood (and so maximizes the likelihood)? Think back to calculus! Take the partial derivative with respect to mu, set it equal to zero and solve for mu!

The derivative with respect to mu of any term that doesn’t contain mu is zero. Therefore:
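In standard notation, the terms n ln(sigma) and (n/2) ln(2 pi) of the negative log likelihood drop out, leaving only the derivative of the squared-deviation term:

```latex
\frac{\partial (-\ln L)}{\partial \mu}
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} 2\,(x_i - \mu)\,(-1)
  = -\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)
```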

We can’t simplify any more, so we set the right hand side of the equation equal to zero and solve for mu.
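Written out in standard notation, the solution steps are:

```latex
\begin{align*}
-\frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) &= 0 \\
\sum_{i=1}^{n} x_i - n\mu &= 0 \\
\hat{\mu} &= \frac{1}{n} \sum_{i=1}^{n} x_i
\end{align*}
```

Sigma cancels out, so the estimate of mu does not depend on the value of sigma.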

So the estimate of mu that maximizes the likelihood of generating any given data set is the sum of all x-values divided by the sample size. What is that? The sample mean!
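A quick numerical check of this result in Python, on made-up data. The sketch confirms that the derivative of the negative log likelihood with respect to mu is zero at the sum of the x-values divided by n, whatever value sigma takes, and that nudging mu either way makes the negative log likelihood larger.

```python
import math

data = [4.2, 5.1, 3.8, 6.0, 5.4]  # hypothetical observations
n = len(data)
mu_hat = sum(data) / n            # sum of x-values divided by sample size

def neg_log_likelihood(mu, sigma):
    sum_sq = sum((x - mu) ** 2 for x in data)
    return n * math.log(sigma) + 0.5 * n * math.log(2 * math.pi) + sum_sq / (2 * sigma ** 2)

def slope_wrt_mu(mu, sigma):
    """Partial derivative of the negative log likelihood with respect to mu."""
    return -sum(x - mu for x in data) / sigma ** 2

for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: slope at mu_hat = {slope_wrt_mu(mu_hat, sigma):.2e}")
```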

Regression data

Slightly different, as now we are trying to find the coefficient estimates for a line that maximize the likelihood of generating the data around that line (we assume the scatter around the line is normally distributed).

So, in our likelihood equation, mu is now yHat, xi is actually yi and sigma is our residual standard error.

Prepping equation for analysis

  1. Take log
  2. Simplify
  3. Take negative

Analysis is the same as above, but we don’t take the derivative (Don’t have to).

Our simplified negative log likelihood equation is:
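Substituting yHat for mu and y_i for x_i in the normal negative log likelihood gives, in standard notation (sigma is the residual standard error):

```latex
-\ln L = n\ln\sigma + \frac{n}{2}\ln(2\pi)
         + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```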

Just as before when we took the derivative, if we want to minimize this equation with respect to yHat, the first parts of the equation are just constants that don’t have any real effect. Thus if you want to find the value of yHat that minimizes the negative log likelihood, all you have to do is minimize:
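In standard notation, with y_i the observed response and \hat{y}_i the value predicted by the line, the term that remains is:

```latex
\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

The factor 1/(2 sigma^2) in front of it is a positive constant, so it does not move the location of the minimum.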

That is, you minimize the sum of squared errors! Closed-form formulas exist to do this for simple linear regression; for more complex models, the computer does it iteratively.
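A minimal Python sketch of the simple linear regression case, on made-up data. It uses the textbook least-squares formulas for the slope and intercept, then checks that moving either coefficient away from those values increases both the sum of squared errors and the normal negative log likelihood (sigma is held at an arbitrary fixed value of 1).

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical predictor
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical response
n = len(x)

# Textbook closed-form least-squares estimates for a straight line.
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

def sse(b0, b1):
    """Sum of squared errors around the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def neg_log_likelihood(b0, b1, sigma=1.0):
    return n * math.log(sigma) + 0.5 * n * math.log(2 * math.pi) + sse(b0, b1) / (2 * sigma ** 2)

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print(f"SSE at estimates:     {sse(intercept, slope):.4f}")
print(f"SSE with slope + 0.1: {sse(intercept, slope + 0.1):.4f}")
```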

For Poisson regression, logistic regression, and other GLMs, traditionally you had to do all the stuff I just showed you. Not any more – those common GLMs are now built into many statistics packages. There are still some distributions you have to do by hand. Let’s explore exactly how you would do that in R.

Example, Poisson regression.

PDF (strictly a probability mass function, since counts are discrete) is:
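The standard form of the Poisson distribution, in LaTeX:

```latex
P(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}
```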

Where x is the observed ‘count’ and lambda is the mean count

So the likelihood for a Poisson regression is:
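In standard notation, following the earlier regression convention of writing the observed values as y_i, and with lambda_i the mean count the model predicts for observation i (independence assumed):

```latex
L = \prod_{i=1}^{n} \frac{\lambda_i^{\,y_i} e^{-\lambda_i}}{y_i!}
```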

Don’t forget to take the negative!
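Taking the log of the product of Poisson probabilities, simplifying and taking the negative gives, in standard notation (y_i is the observed count and lambda_i the predicted mean count for observation i):

```latex
-\ln L = \sum_{i=1}^{n} \left[ \lambda_i - y_i \ln\lambda_i + \ln(y_i!) \right]
```

The ln(y_i!) term does not involve the coefficients, so it has no effect on where the minimum is.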

This function is to be minimized in R. Check out Code!
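The lecture's R code is not reproduced in these notes. As a stand-in, here is a minimal Python sketch of the same idea: write the Poisson negative log likelihood as a function of the coefficients and hand it to a minimizer. Everything specific is an assumption made for illustration: the data are made up, the mean is linked to the predictor by lambda = exp(b0 + b1 * x) (the usual log link), and a hand-written Newton's method with step halving stands in for whatever general-purpose optimizer the R code uses.

```python
import math

predictor = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # hypothetical data
counts = [1, 2, 2, 4, 5, 9, 12, 20]                   # hypothetical counts

def neg_log_likelihood(b0, b1):
    """Poisson negative log likelihood, assuming lambda = exp(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(predictor, counts):
        log_lam = b0 + b1 * x
        total += math.exp(log_lam) - y * log_lam + math.lgamma(y + 1)  # lgamma(y+1) = ln(y!)
    return total

def gradient_and_hessian(b0, b1):
    """First and second partial derivatives of the negative log likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(predictor, counts):
        lam = math.exp(b0 + b1 * x)
        g0 += lam - y
        g1 += (lam - y) * x
        h00 += lam
        h01 += lam * x
        h11 += lam * x * x
    return g0, g1, h00, h01, h11

def fit(max_iter=50, tol=1e-10):
    # Start from a flat line at the mean count.
    b0 = math.log(sum(counts) / len(counts))
    b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = gradient_and_hessian(b0, b1)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: solve H d = g
        d1 = (h00 * g1 - h01 * g0) / det
        if max(abs(d0), abs(d1)) < tol:
            break
        # Halve the step if it would make the negative log likelihood worse.
        step = 1.0
        current = neg_log_likelihood(b0, b1)
        while step > 1e-6 and neg_log_likelihood(b0 - step * d0, b1 - step * d1) > current + 1e-9:
            step /= 2
        b0 -= step * d0
        b1 -= step * d1
    return b0, b1

b0_hat, b1_hat = fit()
print(f"b0={b0_hat:.4f}, b1={b1_hat:.4f}, "
      f"negative log likelihood={neg_log_likelihood(b0_hat, b1_hat):.4f}")
```

In practice you would usually pass `neg_log_likelihood` to a library optimizer rather than write the Newton loop yourself; the loop is spelled out here only to keep the sketch self-contained.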