Statistics 550 Notes 15

Reading: Section 2.4, 3.1-3.2

I.Numerical Methods for Finding MLEs

Review from last class:

The bisection method is a method for finding the root of a one-dimensional function that is continuous on , for which f is monotone increasing or decreasing. It can be used when the likelihood equation is (or can be reduced to) a one-parameter equation.

The bisection method works by repeatedly dividing an interval in half and then selecting the subinterval in which the root exists.

Coordinate Ascent Method

The coordinate ascent method is an approach to finding the maximum likelihood estimate in a multidimensional family. The coordinate ascent method works by using the bisection method iteratively. Suppose we have a k-dimensional parameter . The coordinate ascent method is:

Choose an initial estimate

0. Set

1. Maximize over using the bisection method by solving (assuming the log likelihood is differentiable). Reset to the that maximizes .

2. Maximize over using the bisection method. Reset to the that maximizes .

....

K. Maximize over using the bisection method. Reset to the that maximizes .

K+1. Stop if the distance between and is less than some tolerance . Otherwise return to step 0.

The coordinate ascent method converges to the maximum likelihood estimate when the log likelihood function is strictly concave on the parameter space. See Figure 2.4.1 in Bickel and Doksum.

Example: Beta Distribution

for and

This is a two parameter full exponential family and hence the log likelihood is strictly concave.

We found the method of moments estimates in Homework 5.

For iid Beta(),

For

= (0.3184108, 0.3875947, 0.7411803, 0.4044642, 0.7240628, 0.7247060, 0.1091041, 0.1388588, 0.7347975 0.5138287, 0.2683177, 0.4685777, 0.1746448, 0.2779592, 0.2876237, 0.5833377, 0.5847999, 0.2530112, 0.5018544 0.5295680)

R code for finding the MLE:

# Code for beta distribution MLE

# xvec stores the data

# rhatcurr, shatcurr store current estimates of r and s

# Generate data from Beta(r=2,s=3) distribution)

xvec=rbeta(20,2,3);

# Set low and high starting values for the bisection searches

rhatlow=.001;

rhathigh=20;

shatlow=.001;

shathigh=20;

# Use method of moments for starting values

rhatcurr=mean(xvec)*(mean(xvec)-mean(xvec^2))/(mean(xvec^2)-mean(xvec)^2);

shatcurr=((1-mean(xvec))*(mean(xvec)-mean(xvec^2)))/(mean(xvec^2)-mean(xvec)^2);

rhatiters=rhatcurr;

shatiters=shatcurr;

derivrfunc=function(r,s,xvec){

n=length(xvec);

sum(log(xvec))-n*digamma(r)+n*digamma(r+s);

}

derivsfunc=function(s,r,xvec){

n=length(xvec);

sum(log(1-xvec))-n*digamma(s)+n*digamma(r+s);

}

dist=1;

toler=.0001;

while(dist>toler){

rhatnew=uniroot(derivrfunc,c(rhatlow,rhathigh),s=shatcurr,xvec=xvec)$root;

shatnew=uniroot(derivsfunc,c(shatlow,shathigh),r=rhatnew,xvec=xvec)$root;

dist=sqrt((rhatnew-rhatcurr)^2+(shatnew-shatcurr)^2);

rhatcurr=rhatnew;

shatcurr=shatnew;

rhatiters=c(rhatiters,rhatcurr);

shatiters=c(shatiters,shatcurr);

}

rhatmle=rhatcurr;

shatmle=shatcurr;

Newton’s Method

Newton’s method is a numerical method for approximating solutions to equations. The method produces a sequence of values that, under ideal conditions, converges to the MLE .

To motivate the method, we expand the derivative of the log likelihood around :

Solving for gives

This suggests the following iterative scheme:

Newton’s method can be extended to more than one dimension (usually called Newton-Raphson)

where denotes the gradient vector of the likelihood and denote the Hessian.

Comments on methods for finding the MLE:

1. The bisection method is guaranteed to converge if there is a unique root in the interval being searched over but is slower than Newton’s method.

2. Newton’s method:

A.The method does not work if .

B.The method does not always converge.

See attached pages from Numerical Recipes in C book.

3. For the coordinate ascent method and Newton’s method, a good choice of starting values is often the method of moments estimator.

4. When there are multiple roots to the likelihood equation, the solution found by the bisection method, the coordinate ascent method and Newton’s method depends on the starting value. These algorithms might converge to a local maximum (or a saddlepoint) rather than a global maximum.

5. The EM (Expectation/Maximization) algorithm (Section 2.4.4) is another approach to finding the MLE that is particularly suitable when part of the data is missing.

Example of nonconcave likelihood:

MLE for Cauchy Distribution

Cauchy model:

Suppose are iid Cauchy() and we observe .

Log likelihood is not concave and has two local maxima between 0 and 10. There is also a local minimum.

The likelihood equation is

The local maximum (i.e., the solution to the likelihood equation) that the bisection method finds depends on the interval searched over.

R program to use bisection method

derivloglikfunc=function(theta,x1,x2,x3){

dloglikx1=2*(x1-theta)/(1+(x1-theta)^2);

dloglikx2=2*(x2-theta)/(1+(x2-theta)^2);

dloglikx3=2*(x3-theta)/(1+(x3-theta)^2);

dloglikx1+dloglikx2+dloglikx3;

}

When the starting points for the bisection method are , the bisection method finds the MLE:

uniroot(derivloglikfunc,interval=c(0,5),x1=0,x2=1,x3=10);

$root

[1] 0.6092127

When the starting points for the bisection method are , the bisection method finds a local maximum but not the MLE: uniroot(derivloglikfunc,interval=c(0,10),x1=0,x2=1,x3=10);

$root

[1] 9.775498

II. Bayes Procedures (Chapter 3.2)

In Chapter 3, we return to the theme of Section 1.3 which is how to appraise and select among point estimators and decision procedures. We now discuss how the Bayes criteria for choosing decision procedures can be implemented.

Review of the Bayes criteria:

Suppose a person’s prior distribution about is and the probability distribution for the data is that has probability density function (or probability mass function) .

This can be viewed as a model in which is a random variable and the joint pdf of is .

The Bayes risk of a decision procedure for a prior distribution , denoted by, is the expected value of the loss function over the joint distribution of , which is the expected value of the risk function over the prior distribution of :

For a person with prior distribution, the decision procedure which minimizes minimizes the expected loss and is the best procedure from this person’s point of view. The decision procedure which minimizes the Bayes risk for a prior is called the Bayes rule for the prior .

A nonsubjective interpretation of the Bayes criteria: The Bayes approach leads us to compare procedures on the basis of

if is discrete with frequency function or

if is continuous with density .

Such comparisons make sense even if we do not interpret as a prior density or frequency, but only as a weight function that reflects the importance we place on doing well at the different possible values of .

Computing Bayes estimators: In Chapter 1.3, we found the Bayes decision procedure by computing the Bayes risk for each decision procedure. This is usually an impossible task. We now provide a constructive method for finding the Bayes decision procedure

Recall from Chapter 1.2 that the posterior distribution is the subjective probability distribution for the parameter after seeing the data x:

The posterior risk of an action a is the expected loss from taking action a under the posterior distribution .

The following proposition shows that a decision procedure which chooses the action that minimizes the posterior risk for each sample x is a Bayes decision procedure.

Proposition 3.2.1: Suppose there exists a function such that

(1.1)

where denotes the action space, then is a Bayes rule.

Proof: For any decision rule , we have

(1.2)

By (1.1),

Therefore,

and the result follows from (1.2).

Consider the oil-drilling example (Example 1.3.5) again.

Example 2: We are trying to decide whether to drill a location for oil. There are two possible states of nature,

location contains oil and location doesn’t contain oil. We are considering three actions, =drill for oil, =sell the location or =sell partial rights to the location.

The following loss function is decided on

(Drill)
/ (Sell)
/ (Partial rights)

(Oil) / 0 / 10 / 5
(No oil) / 12 / 1 / 6

An experiment is conducted to obtain information about resulting in the random variable X with possible values 0,1 and frequency function given by the following table:

Rock formation
X
0 / 1
(Oil) / 0.3 / 0.7
(No oil) / 0.6 / 0.4

Application to Example 1.3.5:

Consider the prior . Suppose we observe . Then the posterior distribution of is

which equals .

Thus, the posterior risks of the actions are

Therefore, has the smallest posterior risk and if is the Bayes rule,

Similarly,

and we conclude that