Statistics 512 Notes 25: Decision Theory

Decision Theoretic Approach to Statistics: Views statistics as a mathematical theory for making decisions in the face of uncertainty.

The Decision Theory Paradigm:

The decision maker chooses an action a from a set A of all possible actions based on the observation of a random variable, or data, X, which has a probability distribution that depends on a parameter θ called the state of nature. The set of all possible values of θ is denoted by Θ. The decision is made by a statistical decision function, d, which maps the sample space (the set of all possible data values) onto the action space A. Denoting the data by X, the action is random and is given as a = d(X).

By taking the action d(X), the decision maker incurs a loss l(θ, d(X)), which depends on both θ and X. The comparison of different decision functions is based on the risk function, which is the expected loss,

R(θ, d) = E[l(θ, d(X))].

Here, the expectation is taken with respect to the probability distribution of X, which depends on θ. Note that the risk function depends on the true state of nature, θ, and on the decision function, d. Decision theory is concerned with methods of determining "good" decision functions, that is, decision functions that have small risk.
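The paradigm can be made concrete with a small numerical sketch. The states, actions, losses, and distributions below are invented for illustration only; they are not from any example in these notes:

```python
# Risk of a decision rule: R(theta, d) = E_theta[ l(theta, d(X)) ].
# Hypothetical discrete setup: two states, two actions, X in {0, 1, 2}.
loss = {("theta1", "a1"): 0, ("theta1", "a2"): 1,
        ("theta2", "a1"): 5, ("theta2", "a2"): 0}
# P(X = x | theta) for x = 0, 1, 2:
pmf = {"theta1": [0.7, 0.2, 0.1], "theta2": [0.1, 0.3, 0.6]}

def d(x):
    # A simple decision rule: choose a1 for small observations.
    return "a1" if x <= 1 else "a2"

def risk(theta, rule):
    # Expected loss under the distribution of X given theta.
    return sum(p * loss[(theta, rule(x))] for x, p in enumerate(pmf[theta]))

print(risk("theta1", d))  # 0(.7) + 0(.2) + 1(.1) = 0.1
print(risk("theta2", d))  # 5(.1) + 5(.3) + 0(.6) = 2.0
```

The risk is a function of θ: the same rule d can look good under one state of nature and bad under another, which is exactly the difficulty taken up below.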

Examples:

(1) Sampling inspection: A manufacturer produces a lot consisting of N items, of which n are sampled randomly and determined to be either defective or nondefective. Let p denote the proportion of the items in the lot that are defective. Let Xi = 0 or 1 according to whether the ith sampled item is nondefective or defective, and let X = (X1, ..., Xn) denote the sample. Suppose that the lot is sold for a price, $M, with a guarantee that if the proportion of defective items exceeds p0, the manufacturer will pay a penalty, $P, to the buyer. For any lot, the manufacturer has two possible actions: either sell the lot or junk it at a cost, $C. The action space is therefore

A = {sell, junk}.

The data are X = (X1, ..., Xn) and the state of nature is p. The loss function depends on the action and the state of nature as shown in the following table:

State of Nature / Sell / Junk
p ≤ p0 / -$M / $C
p > p0 / $P / $C

Here a profit is expressed as a negative loss. Note that the decision rule d(X) depends on X, which is random; the risk function is the expected loss, and the expectation is computed with respect to the probability distribution of X. This distribution depends on p.
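A rough sketch of how the manufacturer's risk could be computed for a threshold decision rule. The numerical values of M, P, C, p0, and n are made up for illustration, and sampling is approximated as binomial:

```python
from math import comb

# Hypothetical numbers for the sampling-inspection example:
M, P, C = 1000.0, 5000.0, 200.0   # sale price, penalty, junking cost
p0 = 0.1                          # guaranteed maximum defective proportion
n = 20                            # sample size

def risk(p, cutoff):
    """Expected loss of the rule 'sell iff the number of sample
    defectives is <= cutoff', when the true defective proportion is p."""
    p_sell = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(cutoff + 1))
    loss_sell = -M if p <= p0 else P   # profit (negative loss) or penalty
    return p_sell * loss_sell + (1 - p_sell) * C

# Risk at a good lot (p = .05) and a bad lot (p = .25), selling iff <= 2 defectives:
print(risk(0.05, 2), risk(0.25, 2))
```

Note how the risk depends on the unknown p through both the loss entries and the distribution of the sample.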

(2) Classification: On the basis of several physiological measurements, X, a decision must be made concerning whether a patient has suffered a myocardial infarction (MI) and should be admitted to intensive care. Here A = {admit, do not admit} and Θ = {MI, no MI}. The probability distribution of X depends on θ, perhaps in a complicated way. Some elements of the loss function may be difficult to specify; the economic costs of admission can be calculated, but the cost of not admitting when in fact the patient has suffered a myocardial infarction is more subjective. To make this problem more realistic, the action space could be expanded to include actions such as "send home" and "hospitalize for further observation."

(3) Estimation: Suppose that we want to estimate some function g(θ) on the basis of a sample X = (X1, ..., Xn), where the distribution of the Xi depends on θ. Here the action d(X) is an estimator of g(θ). The quadratic loss function

l(θ, d) = (d - g(θ))^2

is often used. The risk function is then

R(θ, d) = E[(d(X) - g(θ))^2],

which is the familiar mean squared error. Note that, here again, the expectation is taken with respect to the distribution of X, which depends on θ.
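As a quick illustration of quadratic-loss risk (an assumed setup, not part of the notes): estimate θ = μ of a N(μ, 1) population by the sample mean, whose exact risk is Var(X̄) = 1/n since the sample mean is unbiased here, and check by simulation:

```python
import random

# Simulation sketch (assumed setup): estimate mu of a N(mu, 1)
# population by the sample mean.  The sample mean is unbiased, so its
# risk under quadratic loss is its variance, 1/n.
random.seed(0)
mu, n, reps = 3.0, 25, 20000

def sample_mean_estimate():
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    return sum(xs) / n

mse = sum((sample_mean_estimate() - mu) ** 2 for _ in range(reps)) / reps
print(mse)   # close to the exact risk 1/n = 0.04
```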

Bayes Rules and Minimax Rules

Decision theory is concerned with choosing a "good" decision function, that is, one that has a small risk R(θ, d). We have to face the difficulty that R(θ, d) depends on θ, which is not known. For example, there might be two decision rules, d1 and d2, and two values of θ, θ1 and θ2, such that

R(θ1, d1) < R(θ1, d2)

but

R(θ2, d1) > R(θ2, d2).

Thus d1 is better if the state of nature is θ1, but d2 is better if the state of nature is θ2. The two most widely used methods for confronting this difficulty are to use either a minimax rule or a Bayes rule.

The minimax method proceeds as follows. First, for a given decision function d, consider the worst that the risk could be:

max_θ R(θ, d).

Then choose a decision function, d*, that minimizes this maximum risk:

max_θ R(θ, d*) = min_d max_θ R(θ, d).

Such a decision rule is called a minimax rule.

The weakness of the minimax rule is intuitively apparent. It is a very conservative procedure that places all its emphasis on guarding against the worst possible case. In fact, this worst case might not be very likely to occur. To make this idea more precise, we can assign a probability distribution to the state of nature θ; this distribution is called the prior distribution of θ and is denoted by π. Given such a prior distribution, we can calculate the Bayes risk of a decision function d:

B(d) = E[R(θ, d)],

where the expectation is taken with respect to the prior distribution π. The Bayes risk is the average of the risk function with respect to the prior distribution of θ. A decision function that minimizes the Bayes risk is called a Bayes rule.

Example: As part of the foundation of a building, a steel section is to be driven down to a firm stratum below ground. The engineer has two choices (actions):

a1: drive a 40-ft section
a2: drive a 50-ft section

There are two possible states of nature:

θ1: the depth to the firm stratum is 40 ft
θ2: the depth to the firm stratum is 50 ft

If the 40-ft section is incorrectly chosen, an additional length of steel must be spliced on at a cost of $400. If the 50-ft section is incorrectly chosen, 10 ft of steel must be scrapped at a cost of $100. The loss function is therefore represented in the following table:

/ θ1 (40 ft) / θ2 (50 ft)
a1 (40-ft section) / 0 / $400
a2 (50-ft section) / $100 / 0

A depth sounding is taken by means of a sonic test. Suppose that the measured depth, X, has three possible values, x = 40, 45, and 50 ft, and that the probability distribution of X depends on θ as follows:

x / θ1 (40 ft) / θ2 (50 ft)
40 / .6 / .1
45 / .3 / .2
50 / .1 / .7

We will consider the following four decision rules:

d1(x) = a1 for all x
d2(x) = a1 if x = 40; a2 if x = 45 or 50
d3(x) = a1 if x = 40 or 45; a2 if x = 50
d4(x) = a2 for all x

We will first find the minimax rule. To do so, we need to compute the risk of each of the decision functions in the case where θ = θ1 and in the case where θ = θ2. For θ = θ1, each risk function is computed as

R(θ1, d) = Σ_x l(θ1, d(x)) P(X = x | θ1).

We thus have

R(θ1, d1) = 0
R(θ1, d2) = 100(.3 + .1) = 40
R(θ1, d3) = 100(.1) = 10
R(θ1, d4) = 100

Similarly, in the case where θ = θ2, we have

R(θ2, d1) = 400
R(θ2, d2) = 400(.1) = 40
R(θ2, d3) = 400(.1 + .2) = 120
R(θ2, d4) = 0

To find the minimax rule, we note that the maximum values of the risk of d1, d2, d3, and d4 are 400, 40, 120, and 100, respectively. Thus, d2 is the minimax rule.
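These computations can be checked mechanically. The following sketch encodes the loss table, the sounding distributions, and the four rules as defined above:

```python
# Verify the risk computations for the pile-driving example.
# Losses l(theta, a); theta1 = depth 40 ft, theta2 = depth 50 ft.
loss = {("theta1", "a1"): 0, ("theta1", "a2"): 100,
        ("theta2", "a1"): 400, ("theta2", "a2"): 0}
# Sounding distributions P(X = x | theta) for x = 40, 45, 50:
pmf = {"theta1": {40: .6, 45: .3, 50: .1},
       "theta2": {40: .1, 45: .2, 50: .7}}
rules = {
    "d1": lambda x: "a1",
    "d2": lambda x: "a1" if x == 40 else "a2",
    "d3": lambda x: "a1" if x in (40, 45) else "a2",
    "d4": lambda x: "a2",
}

def risk(theta, d):
    # R(theta, d) = sum_x l(theta, d(x)) P(X = x | theta)
    return sum(p * loss[(theta, d(x))] for x, p in pmf[theta].items())

max_risk = {name: max(risk(t, d) for t in ("theta1", "theta2"))
            for name, d in rules.items()}
print(max_risk)                     # maximum risks 400, 40, 120, 100
minimax = min(max_risk, key=max_risk.get)
print(minimax)                      # d2, the minimax rule
```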

We now consider computation of a Bayes rule. Suppose that on the basis of previous experience and from large-scale maps, we take as the prior distribution π(θ1) = .8 and π(θ2) = .2. Using this prior distribution and the risk functions computed above, we find for each decision function its Bayes risk

B(d) = R(θ1, d)π(θ1) + R(θ2, d)π(θ2).

Thus, we have

B(d1) = 0(.8) + 400(.2) = 80
B(d2) = 40(.8) + 40(.2) = 40
B(d3) = 10(.8) + 120(.2) = 32
B(d4) = 100(.8) + 0(.2) = 80

Comparing these numbers, we see that d3 is the Bayes rule (among these four rules). Note that this Bayes rule is less conservative than the minimax rule in that it chooses action a1 (the 40-ft length) based on the observation x = 45 (a 45-ft sounding). That is because the prior distribution for this Bayes rule puts more weight on θ1. If the prior distribution were changed sufficiently, the Bayes rule would change.
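The Bayes risks follow directly from the risk values computed in the text, averaged with respect to the prior (here π(θ1) = .8, π(θ2) = .2):

```python
# Bayes risks B(d) = R(theta1, d) pi(theta1) + R(theta2, d) pi(theta2),
# using the risk values R(theta, d) computed in the text.
prior = {"theta1": 0.8, "theta2": 0.2}
risks = {"d1": {"theta1": 0, "theta2": 400},
         "d2": {"theta1": 40, "theta2": 40},
         "d3": {"theta1": 10, "theta2": 120},
         "d4": {"theta1": 100, "theta2": 0}}
bayes = {name: sum(prior[t] * r[t] for t in prior)
         for name, r in risks.items()}
print(bayes)                       # B(d1)=80, B(d2)=40, B(d3)=32, B(d4)=80
print(min(bayes, key=bayes.get))   # d3, the Bayes rule for this prior
```

Rerunning this with a prior that puts more weight on θ2 shows how the Bayes rule shifts toward the conservative rules.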

Posterior Analysis

We now develop a method for finding the Bayes rule. The Bayes risk for a prior distribution π is the expected loss of a decision rule d when the pair (θ, X) is generated from the following probability model:

First, the state of nature θ is generated according to the prior distribution π(θ).

Then, the data X is generated according to the conditional distribution of X given θ, which we will denote by f(x | θ).

Under this probability model (call it the Bayes model), the marginal distribution of X is (for the continuous case)

f(x) = ∫ f(x | θ)π(θ) dθ.

Applying Bayes rule, the conditional distribution of θ given X = x is

h(θ | x) = f(x | θ)π(θ) / f(x) = f(x | θ)π(θ) / ∫ f(x | θ)π(θ) dθ.

The conditional distribution h(θ | x) is called the posterior distribution of θ. The words prior and posterior derive from the facts that π is specified before (prior to) observing X and h(θ | x) is calculated after (posterior to) observing X = x. We will discuss the interpretation of prior and posterior distributions in more detail later.
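For a discrete state of nature, the posterior is obtained by normalizing f(x | θ)π(θ) over θ. A minimal sketch, using the pile-driving example's prior and sounding distributions:

```python
# Posterior over a discrete state of nature:
# h(theta | x) = f(x | theta) pi(theta) / sum_t f(x | t) pi(t).
def posterior(prior, likelihood, x):
    joint = {t: likelihood[t][x] * p for t, p in prior.items()}
    marginal = sum(joint.values())          # f(x), the normalizing constant
    return {t: j / marginal for t, j in joint.items()}

# Pile-driving example: prior .8/.2 and the sounding distributions.
prior = {"theta1": 0.8, "theta2": 0.2}
lik = {"theta1": {40: .6, 45: .3, 50: .1},
       "theta2": {40: .1, 45: .2, 50: .7}}
print(posterior(prior, lik, 45))   # theta1: 6/7 ~ 0.857, theta2: 1/7 ~ 0.143
```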

Suppose that we have observed X = x. We define the posterior risk of an action a as the expected loss, where the expectation is taken with respect to the posterior distribution of θ. For continuous random variables, we have

PR(a | x) = ∫ l(θ, a) h(θ | x) dθ.

Theorem: Suppose there is a function d(x) that, for each x, minimizes the posterior risk. Then d is a Bayes rule.

Proof: We will prove this for the continuous case. The discrete case is proved analogously. The Bayes risk of a decision function d is

B(d) = ∫∫ l(θ, d(x)) f(x | θ)π(θ) dx dθ = ∫ [∫ l(θ, d(x)) h(θ | x) dθ] f(x) dx.

(We have used the relation f(x | θ)π(θ) = h(θ | x)f(x).)

Now the inner integral is the posterior risk of the action d(x), and since f(x) is nonnegative, B(d) can be minimized by choosing d(x) to minimize the posterior risk for each x.

The practical importance of this theorem is that it allows us to use just the observed data x, rather than considering all possible values of X, to find the action d(x) taken by the Bayes rule given the data x. In summary, the algorithm for finding d(x) is as follows:

Step 1: Calculate the posterior distribution h(θ | x).

Step 2: For each action a, calculate the posterior risk, which is

PR(a | x) = E[l(θ, a) | X = x].

The action that minimizes the posterior risk is the Bayes rule action d(x).

Example: Consider again the engineering example. Suppose that we observe X = 45. In the notation of that example, the prior distribution is π(θ1) = .8, π(θ2) = .2. We first calculate the posterior distribution:

P(θ1 | X = 45) = (.3)(.8) / [(.3)(.8) + (.2)(.2)] = .24/.28 = 6/7.

Hence,

P(θ2 | X = 45) = 1 - 6/7 = 1/7.

We next calculate the posterior risk (PR) for a1 and a2:

PR(a1 | X = 45) = 0(6/7) + 400(1/7) = 400/7 ≈ 57.1

and

PR(a2 | X = 45) = 100(6/7) + 0(1/7) = 600/7 ≈ 85.7.

Comparing the two, we see that a1 has the smaller posterior risk, so the Bayes rule takes action a1 (the 40-ft section) when x = 45.
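The two-step algorithm for the observed sounding x = 45 can be sketched as follows, with the posterior values as computed above:

```python
# Two-step Bayes-rule computation at the observed sounding x = 45.
# Step 1 gave the posterior P(theta1 | 45) = 6/7, P(theta2 | 45) = 1/7.
post = {"theta1": 6 / 7, "theta2": 1 / 7}
loss = {("theta1", "a1"): 0, ("theta1", "a2"): 100,
        ("theta2", "a1"): 400, ("theta2", "a2"): 0}

def posterior_risk(a):
    # Step 2: PR(a | x) = E[l(theta, a) | X = x]
    return sum(post[t] * loss[(t, a)] for t in post)

print(posterior_risk("a1"))   # 400/7 ~ 57.1
print(posterior_risk("a2"))   # 600/7 ~ 85.7
best = min(("a1", "a2"), key=posterior_risk)
print(best)                   # a1: take the 40-ft section
```

Note that this reproduces, from the single observed x, the same action that the Bayes rule chosen by comparing full Bayes risks would take at x = 45.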