Statistics 550 Notes 11

Reading: Section 2.2.

Take-home midterm: I will e-mail it to you by Saturday, October 14th. It will be due Wednesday, October 25th by 5 p.m.

I. Maximum Likelihood

The method of maximum likelihood is an approach for estimating parameters in a "parametric" model, i.e., a model in which the family of possible distributions

$\{P_\theta : \theta \in \Theta\}$

has a parameter space $\Theta$ that is a subset of $\mathbb{R}^d$ for some finite $d$.

Motivating Example: A box of Dunkin Donuts munchkins contains 12 munchkins. Each munchkin is either glazed or not glazed. Let $\theta$ denote the number of glazed munchkins in the box. To gain some information on $\theta$, you are allowed to select five of the munchkins from the box randomly without replacement and view them. Let X denote the number of glazed munchkins in the sample. Suppose X = 3 of the munchkins in the sample are glazed. How should we estimate $\theta$?

Probability model: Imagine that the munchkins are numbered 1-12. A sample of five munchkins thus consists of five distinct numbers, and all $\binom{12}{5}$ samples are equally likely. The distribution of X is hypergeometric:

$P(X = x \mid \theta) = \dfrac{\binom{\theta}{x}\binom{12-\theta}{5-x}}{\binom{12}{5}}, \qquad x = 0, 1, \ldots, 5$

(with the convention that $\binom{a}{b} = 0$ when $b > a$).

The following table shows the probability distribution of X given $\theta$ for each possible value of $\theta$.

P(X = x | theta), where X = number of glazed munchkins in the sample and
theta = number of glazed munchkins originally in the box:

theta    x = 0    x = 1    x = 2    x = 3    x = 4    x = 5
  0      1        0        0        0        0        0
  1      .5833    .4167    0        0        0        0
  2      .3182    .5303    .1515    0        0        0
  3      .1591    .4773    .3182    .0454    0        0
  4      .0707    .3535    .4243    .1414    .0101    0
  5      .0265    .2210    .4419    .2652    .0442    .0012
  6      .0076    .1136    .3788    .3788    .1136    .0076
  7      .0012    .0442    .2652    .4419    .2210    .0265
  8      0        .0101    .1414    .4243    .3535    .0707
  9      0        0        .0454    .3182    .4773    .1591
 10      0        0        0        .1515    .5303    .3182
 11      0        0        0        0        .4167    .5833
 12      0        0        0        0        0        1

Once we obtain the sample X = 3, what should we estimate $\theta$ to be?

It is not clear how to apply the method of moments. We have $E_\theta[X] = 5\theta/12$, but solving $5\theta/12 = 3$ gives $\theta = 7.2$, which is not in the parameter space $\{0, 1, \ldots, 12\}$.

Maximum likelihood approach: We know that it is impossible that $\theta = 0, 1, 2, 11$, or $12$. The set of possible values for $\theta$ once we observe X = 3 is $\theta = 3, 4, 5, 6, 7, 8, 9, 10$. Although both $\theta = 3$ and $\theta = 7$ are possible, the occurrence of X = 3 would be more "likely" if $\theta = 7$ [$P(X = 3 \mid \theta = 7) = .4419$] than if $\theta = 3$ [$P(X = 3 \mid \theta = 3) = .0454$]. Among $\theta = 3, 4, 5, 6, 7, 8, 9, 10$, the $\theta$ that makes the observed data X = 3 most "likely" is $\theta = 7$.

General definitions for maximum likelihood estimator

The likelihood function is defined by

$L_x(\theta) = p(x \mid \theta), \quad \theta \in \Theta$.

The likelihood function is just the joint probability mass function or probability density function of the data, except that we treat it as a function of the parameter $\theta$ with the observed data $x$ held fixed. The likelihood function is not a probability mass function or a probability density function in $\theta$: in general, it is not true that $L_x(\theta)$ sums or integrates to 1 with respect to $\theta$. In the motivating example, for X = 3, $\sum_{\theta=0}^{12} L_3(\theta) \approx 2.17 \ne 1$.

The maximum likelihood estimator (the MLE), denoted by $\hat{\theta}_{MLE}(x)$, is the value of $\theta$ that maximizes the likelihood:

$\hat{\theta}_{MLE}(x) = \arg\max_{\theta \in \Theta} L_x(\theta)$.

For the motivating example, $\hat{\theta}_{MLE}(3) = 7$.
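The numbers in the motivating example can be checked numerically. The following sketch (mine, not part of the original notes) uses scipy.stats.hypergeom to tabulate $L_3(\theta) = P(X = 3 \mid \theta)$ for $\theta = 0, \ldots, 12$, verifies that the values do not sum to 1 over $\theta$, and recovers $\hat{\theta}_{MLE} = 7$.

import numpy as np
from scipy.stats import hypergeom

N_BOX, N_DRAW, x_obs = 12, 5, 3          # box size, sample size, observed number of glazed

thetas = np.arange(N_BOX + 1)            # possible numbers of glazed munchkins: 0, ..., 12
# L_x(theta) = P(X = x | theta): the hypergeometric pmf viewed as a function of theta
likelihood = np.array([hypergeom.pmf(x_obs, N_BOX, t, N_DRAW) for t in thetas])

for t, L in zip(thetas, likelihood):
    print(f"theta = {t:2d}   L_3(theta) = {L:.4f}")

print("sum over theta:", likelihood.sum())    # about 2.17, so L_3 is not a pmf in theta
print("MLE:", thetas[np.argmax(likelihood)])  # 7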

Intuitively, the MLE is a reasonable choice for an estimator. The MLE is the parameter point for which the observed sample is most likely.

Equivalently, because log is an increasing function, the MLE maximizes the log likelihood function

$l_x(\theta) = \log L_x(\theta)$.

Example 2: Poisson distribution. Suppose $X_1, \ldots, X_n$ are iid Poisson($\lambda$), $\lambda \ge 0$. The likelihood is

$L(\lambda) = \prod_{i=1}^{n} \dfrac{e^{-\lambda} \lambda^{X_i}}{X_i!}$,

so the log likelihood is

$l(\lambda) = -n\lambda + \left(\sum_{i=1}^{n} X_i\right) \log \lambda - \sum_{i=1}^{n} \log X_i!$.

To maximize the log likelihood, we set the first derivative of the log likelihood equal to zero:

$\dfrac{d}{d\lambda} l(\lambda) = -n + \dfrac{\sum_{i=1}^{n} X_i}{\lambda} = 0$.

$\hat{\lambda} = \bar{X}$ is the unique solution to this equation. To confirm that $\hat{\lambda} = \bar{X}$ in fact maximizes $l(\lambda)$, we can use the second derivative test:

$\dfrac{d^2}{d\lambda^2} l(\lambda) = -\dfrac{\sum_{i=1}^{n} X_i}{\lambda^2} < 0$

as long as $\sum_{i=1}^{n} X_i > 0$, so that $\hat{\lambda} = \bar{X}$ in fact maximizes $l(\lambda)$. When $\sum_{i=1}^{n} X_i = 0$, it can be seen by inspection that $\hat{\lambda} = 0$ maximizes $l(\lambda)$.
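As a check on the algebra above, here is a short sketch (mine; the simulated rate 3.5 and the sample size are arbitrary choices) that maximizes the Poisson log likelihood numerically and compares the maximizer with the closed-form MLE $\bar{X}$.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=200)       # simulated Poisson data

def neg_log_lik(lam):
    # -l(lambda); scipy's logpmf includes the log(x_i!) terms, which do not affect the maximizer
    return -poisson.logpmf(x, lam).sum()

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
print("numerical maximizer of the log likelihood:", res.x)
print("sample mean (closed-form MLE):            ", x.mean())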

Example 3: Suppose $X_1, \ldots, X_n$ are iid Uniform$(0, \theta]$, $\theta > 0$. The likelihood is

$L(\theta) = \prod_{i=1}^{n} \dfrac{1}{\theta}\, 1\{0 < X_i \le \theta\} = \theta^{-n}\, 1\{\max_i X_i \le \theta\}$,

which is zero for $\theta < \max_i X_i$ and strictly decreasing in $\theta$ for $\theta \ge \max_i X_i$.

Thus, $\hat{\theta}_{MLE} = \max_i X_i = X_{(n)}$.

Recall that the method of moments estimator is $2\bar{X}$. In Notes 4, we showed that $X_{(n)}$ dominates $2\bar{X}$ for the squared error loss function (although $X_{(n)}$ is itself dominated by the rescaled estimator $\frac{n+1}{n}X_{(n)}$); a Monte Carlo illustration follows.
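The sketch below (mine; $\theta = 1$, $n = 10$, and the number of replications are arbitrary choices) estimates the mean squared error of $X_{(n)}$, $2\bar{X}$, and $\frac{n+1}{n}X_{(n)}$ by Monte Carlo.

import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 1.0, 10, 100_000

x = rng.uniform(0, theta, size=(reps, n))
mle = x.max(axis=1)                  # maximum likelihood estimator: max_i X_i
mom = 2 * x.mean(axis=1)             # method of moments estimator: 2 * Xbar
adj = (n + 1) / n * mle              # rescaled (unbiased) version of the MLE

for name, est in [("max X_i        ", mle), ("2 * Xbar       ", mom), ("(n+1)/n max X_i", adj)]:
    print(name, "estimated MSE:", np.mean((est - theta) ** 2))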

Key valuable features of maximum likelihood estimators:

1. The MLE is consistent.

2. The MLE is asymptotically normal: $\sqrt{n I(\theta)}\,(\hat{\theta}_{MLE} - \theta)$ converges in distribution to a standard normal distribution for a one-dimensional parameter $\theta$, where $I(\theta)$ denotes the Fisher information (a simulation sketch illustrating this appears after this list).

3. The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance for large samples.
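The following simulation sketch (mine) illustrates property 2 for the Poisson model of Example 2, where $\hat{\lambda}_{MLE} = \bar{X}$ and the Fisher information is $I(\lambda) = 1/\lambda$; the true rate, sample size, and number of replications are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 3.0, 500, 20_000

x = rng.poisson(lam, size=(reps, n))
mle = x.mean(axis=1)                        # MLE for each simulated data set
z = np.sqrt(n * (1 / lam)) * (mle - lam)    # sqrt(n * I(lambda)) * (MLE - lambda)

print("mean of z (should be near 0):", z.mean())
print("sd of z   (should be near 1):", z.std())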

Motivation for maximum likelihood as a minimum contrast estimate:

The Kullback-Leibler distance (information divergence) between two density (or mass) functions $p$ and $q$ for a random variable $X$ that have the same support is

$K(p, q) = E_p\left[\log \dfrac{p(X)}{q(X)}\right]$.

Note that by Jensen's inequality,

$K(p, q) = E_p\left[-\log \dfrac{q(X)}{p(X)}\right] \ge -\log E_p\left[\dfrac{q(X)}{p(X)}\right] = -\log \int \dfrac{q(x)}{p(x)}\, p(x)\, dx = -\log 1 = 0$,

where the inequality is strict if $p \ne q$ since $-\log$ is a strictly convex function. Also note that $K(p, p) = 0$. Thus, the Kullback-Leibler distance $K(p, q)$ between $q$ and a fixed $p$ is minimized over $q$ at $q = p$.

Suppose the family of models $\{p(x, \theta) : \theta \in \Theta\}$ has the same support for each $\theta$ and that $\theta$ is identifiable. Consider the function $\rho(x, \theta) = -\log p(x, \theta)$. The discrepancy for this function is

$D(\theta_0, \theta) = E_{\theta_0}[\rho(X, \theta)] = E_{\theta_0}[-\log p(X, \theta)]$.

Since $D(\theta_0, \theta) - D(\theta_0, \theta_0) = K\big(p(\cdot, \theta_0), p(\cdot, \theta)\big)$, the results of the above paragraph (together with identifiability) show that $D(\theta_0, \theta)$ is uniquely minimized at $\theta = \theta_0$, so that $\rho$ is a valid contrast function. The minimum contrast estimator associated with the contrast function $\rho$ is

$\hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} -\log p(X_i, \theta) = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log p(X_i, \theta) = \hat{\theta}_{MLE}$.

Thus, the maximum likelihood estimator is a minimum contrast estimator for a contrast that is based on the Kullback-Leibler distance.
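As an illustration of the minimum contrast argument (mine, using a Poisson family and a truncation of its support to compute the expectation), the sketch below evaluates the discrepancy $D(\theta_0, \theta) = E_{\theta_0}[-\log p(X, \theta)]$ on a grid and confirms that it is minimized at $\theta = \theta_0$.

import numpy as np
from scipy.stats import poisson

theta0 = 2.0
support = np.arange(0, 200)            # truncated Poisson support; the omitted tail mass is negligible
p0 = poisson.pmf(support, theta0)      # p(x, theta_0)

def discrepancy(theta):
    # D(theta_0, theta) = E_{theta_0}[-log p(X, theta)], computed by direct summation
    return -(p0 * poisson.logpmf(support, theta)).sum()

grid = np.linspace(0.5, 5.0, 91)       # grid containing theta_0 = 2.0
vals = [discrepancy(t) for t in grid]
print("minimizer of D(theta_0, .) on the grid:", grid[int(np.argmin(vals))])   # 2.0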

Consistency of maximum likelihood estimates:

A basic desirable property of estimators is that they are consistent, i.e., that they converge to the true parameter when there is a "large" amount of data. The maximum likelihood estimator is generally, although not always, consistent. We prove a special case of consistency here.

Theorem: Consider the model in which $X_1, \ldots, X_n$ are iid with pmf or pdf $p(x; \theta)$, $\theta \in \Theta$. Suppose (a) the parameter space $\Theta$ is finite; (b) $\theta$ is identifiable; and (c) the $p(x; \theta)$ have common support for all $\theta \in \Theta$. Then the maximum likelihood estimator $\hat{\theta}_{MLE}$ is consistent as $n \to \infty$.

Proof: Let $\theta_0$ denote the true parameter. First, we show that for every fixed $\theta \ne \theta_0$,

$P_{\theta_0}\left[L(\theta_0 \mid X_1, \ldots, X_n) > L(\theta \mid X_1, \ldots, X_n)\right] \to 1$ as $n \to \infty$.    (1.1)

The inequality $L(\theta_0 \mid X_1, \ldots, X_n) > L(\theta \mid X_1, \ldots, X_n)$ is equivalent to

$\dfrac{1}{n} \sum_{i=1}^{n} \log \dfrac{p(X_i; \theta)}{p(X_i; \theta_0)} < 0$.

By the law of large numbers, the left side tends in probability toward

$E_{\theta_0}\left[\log \dfrac{p(X; \theta)}{p(X; \theta_0)}\right]$.

Since $-\log$ is strictly convex, Jensen's inequality shows that

$E_{\theta_0}\left[\log \dfrac{p(X; \theta)}{p(X; \theta_0)}\right] < \log E_{\theta_0}\left[\dfrac{p(X; \theta)}{p(X; \theta_0)}\right] = \log \int \dfrac{p(x; \theta)}{p(x; \theta_0)}\, p(x; \theta_0)\, dx = \log 1 = 0$

(the inequality is strict because, by identifiability, $p(\cdot; \theta) \ne p(\cdot; \theta_0)$, so the ratio $p(X; \theta)/p(X; \theta_0)$ is not degenerate), and (1.1) follows.

For a finite parameter space, $\hat{\theta}_{MLE}$ is consistent if and only if $P_{\theta_0}(\hat{\theta}_{MLE} = \theta_0) \to 1$ as $n \to \infty$.

Denote the points other than $\theta_0$ in the finite parameter space by $\theta_1, \ldots, \theta_K$. Let $A_{j,n}$ be the event that, for $n$ observations, $L(\theta_j \mid X_1, \ldots, X_n) \ge L(\theta_0 \mid X_1, \ldots, X_n)$. The event $\{\hat{\theta}_{MLE} \ne \theta_0\}$ for $n$ observations is contained in the event $\bigcup_{j=1}^{K} A_{j,n}$. By (1.1), $P_{\theta_0}(A_{j,n}) \to 0$ as $n \to \infty$ for each $j = 1, \ldots, K$. Consequently,

$P_{\theta_0}\left(\bigcup_{j=1}^{K} A_{j,n}\right) \le \sum_{j=1}^{K} P_{\theta_0}(A_{j,n}) \to 0$

as $n \to \infty$, and since $\{\hat{\theta}_{MLE} \ne \theta_0\}$ for $n$ observations is contained in the event $\bigcup_{j=1}^{K} A_{j,n}$,

$P_{\theta_0}(\hat{\theta}_{MLE} \ne \theta_0) \to 0$

as $n \to \infty$.
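The conclusion of the theorem can be seen in a small Monte Carlo sketch (mine; the finite parameter space $\{0.5, 1, 2\}$ of Poisson means and the sample sizes are arbitrary choices): the estimated probability $P_{\theta_0}(\hat{\theta}_{MLE} = \theta_0)$ increases toward 1 as $n$ grows.

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
Theta = np.array([0.5, 1.0, 2.0])      # finite parameter space
theta0 = 1.0                           # true parameter
reps = 2000

for n in [5, 20, 100]:
    x = rng.poisson(theta0, size=(reps, n))
    # log likelihood of each candidate theta for each simulated data set, shape (3, reps)
    loglik = np.array([poisson.logpmf(x, t).sum(axis=1) for t in Theta])
    mle = Theta[np.argmax(loglik, axis=0)]
    print(f"n = {n:3d}   estimated P(MLE = theta_0): {(mle == theta0).mean():.3f}")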

For infinite parameter spaces, the maximum likelihood estimator can be shown to be consistent under conditions (b)-(c) of the theorem plus the following two assumptions: (1) the parameter space $\Theta$ contains an open set of which the true parameter $\theta_0$ is an interior point (i.e., the true parameter is not on the boundary of the parameter space); (2) $p(x; \theta)$ is differentiable in $\theta$.

The consistency theorem assumes that the parameter space does not depend on the sample size. Maximum likelihood can be inconsistent when the number of parameters increases with the sample size, e.g., the Neyman-Scott problem:

$X_{i1}, X_{i2} \sim N(\mu_i, \sigma^2), \quad i = 1, \ldots, n$, independent normals with means $\mu_1, \ldots, \mu_n$ and common variance $\sigma^2$. The MLE of $\sigma^2$, $\hat{\sigma}^2 = \frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{2}(X_{ij} - \bar{X}_i)^2$, converges in probability to $\sigma^2/2$ and is therefore inconsistent.
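A simulation sketch of this inconsistency (mine; the nuisance means and $\sigma^2 = 4$ are arbitrary choices): the MLE of $\sigma^2$ settles near $\sigma^2/2$ rather than $\sigma^2$, no matter how large $n$ is.

import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 50_000, 4.0
mu = rng.normal(0, 10, size=n)                    # nuisance means mu_1, ..., mu_n

# two observations per mean: X_{i1}, X_{i2} ~ N(mu_i, sigma^2)
x = rng.normal(loc=mu[:, None], scale=np.sqrt(sigma2), size=(n, 2))
xbar_i = x.mean(axis=1, keepdims=True)
sigma2_mle = ((x - xbar_i) ** 2).sum() / (2 * n)  # MLE of sigma^2

print("true sigma^2:                         ", sigma2)
print("MLE of sigma^2 (approx. sigma^2 / 2): ", sigma2_mle)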
