Chapter 1 Basic Concepts

Thomas Bayes (1702-1761): two articles from his pen published posthumously in 1764 by his friend Richard Price.

Laplace (1774): stated the theorem on inverse probability in general form.

Jeffreys (1939): rediscovered Laplace’s work.

Example 1:

X₁, X₂, …, Xₙ: the lifetimes of batteries.

Assume X₁, …, Xₙ are i.i.d. N(μ, σ²). Then,

f(x₁, …, xₙ | μ, σ²) = ∏ᵢ (2πσ²)^(-1/2) exp{-(xᵢ - μ)²/(2σ²)}.

To obtain the information about the values of μ and σ², two methods are available:

(a)  Sampling theory (frequentist):

μ and σ² are the hypothetical true values. We can use

point estimation: finding some statistics μ̂(X₁, …, Xₙ) and σ̂²(X₁, …, Xₙ) to estimate μ and σ², for example,

μ̂ = X̄ = (1/n) ∑ᵢ Xᵢ,  σ̂² = s² = (1/(n-1)) ∑ᵢ (Xᵢ - X̄)².

interval estimation: finding interval estimates [a(X), b(X)] for μ and σ², for example, the interval estimate for μ,

[X̄ - t_(α/2, n-1) s/√n,  X̄ + t_(α/2, n-1) s/√n].

(b) Bayesian approach:

Introduce a prior density π(μ, σ²) for μ and σ². Then, after some manipulation, the posterior density π(μ, σ² | y) (the conditional density given the data y) can be obtained. Based on the posterior density, inferences about μ and σ² can be obtained.

Example 2:

X: the number of wins for some gambler in 10 bets, where p is the probability of winning.

Then,

X ~ b(10, p),  f(x|p) = C(10, x) p^x (1-p)^(10-x),  x = 0, 1, …, 10.

(a) Sampling theory (frequentist):

To estimate the parameter p, we can employ the maximum likelihood principle. That is, we try to find the estimate p̂ to maximize the likelihood function

L(p) = f(x|p) = C(10, x) p^x (1-p)^(10-x).

For example, as x = 10,

L(p) = p¹⁰.

Thus, p̂ = 1. It is a sensible estimate. Since we win all the time, the sensible estimate of the probability of winning should be 1. On the other hand, as x = 0,

L(p) = (1-p)¹⁰.

Thus, p̂ = 0. Since we lose all the time, the sensible estimate of the probability of winning should be 0. In general, as X = x,

p̂ = x/10

maximizes the likelihood function.
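As a quick numerical check, here is a minimal Python sketch (the grid search is purely illustrative; the closed form p̂ = x/10 is the point of the example):

```python
import math

def likelihood(p, x, n=10):
    """Binomial likelihood L(p) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

grid = [i / 1000 for i in range(1001)]   # candidate values of p
for x in (10, 0, 7):
    p_hat = max(grid, key=lambda p: likelihood(p, x))
    print(f"x = {x:2d}: grid MLE = {p_hat:.3f}, closed form x/n = {x / 10:.3f}")
```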

(b) Bayesian approach:

π(p): the prior density for p, i.e., prior beliefs in terms of probabilities of various possible values of p being true.

Let

π(p) = [Γ(a+b)/(Γ(a)Γ(b))] p^(a-1) (1-p)^(b-1),  0 < p < 1,  a, b > 0,

the beta density with parameters a and b.

Thus, if we know the gambler is a professional gambler, then we can use a beta density function with a > b, for instance

π(p) = 4p³  (a = 4, b = 1),

to describe the winning probability p of the gambler.

(Plot of the density function omitted.)

Since a professional gambler is likely to win, higher probability is assigned to large values of p.

If we know the gambler is a gambler with bad luck, then we can use a beta density function with a < b, for instance

π(p) = 4(1-p)³  (a = 1, b = 4),

to describe the winning probability p of the gambler. (Plot of the density function omitted.)

Since a gambler with bad luck is likely to lose, higher probability is assigned to small values of p.

If we feel the winning probability is more likely to be around 0.5, then we can use a beta density function with a = b > 1, for instance

π(p) = 6p(1-p)  (a = b = 2),

to describe the winning probability p of the gambler. (Plot of the density function omitted.)

If we don't have any information about the gambler, then we can use the uniform beta density function,

π(p) = 1,  0 ≤ p ≤ 1  (a = b = 1),

to describe the winning probability p of the gambler. (Plot of the density function omitted.)

Thus, the posterior density of p given x is

π(p|x) = f(x|p)π(p) / ∫₀¹ f(x|p)π(p) dp ∝ p^(x+a-1) (1-p)^(10-x+b-1).

In fact,

p | x ~ Beta(x + a, 10 - x + b).

Then, we can use some statistic based on the posterior density, for example, the posterior mean

E(p|x) = (x + a)/(10 + a + b).

As x = 10 under the uniform prior (a = b = 1),

E(p|x) = 11/12,

which is different from the maximum likelihood estimate p̂ = 1.
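A minimal sketch of the beta-binomial update (the uniform prior a = b = 1 follows the example; the other priors above can be swapped in):

```python
from fractions import Fraction

def posterior_mean(x, n=10, a=1, b=1):
    """Mean of the Beta(a + x, b + n - x) posterior under a Beta(a, b) prior."""
    return Fraction(a + x, a + b + n)

print(posterior_mean(10))            # 11/12 under the uniform prior (a = b = 1)
print(Fraction(10, 10))              # MLE x/n = 1
print(posterior_mean(10, a=4, b=1))  # 14/15 under the a = 4, b = 1 prior above
```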

Note:

Properties of Bayesian Analysis:

1. Precise assumptions lead to precise consequent inferences.

2. Bayesian analysis automatically makes use of all the information in the data.

3. Unacceptable inferences must come from inappropriate assumptions, not from inadequacies of the inferential system.

4. Awkward problems encountered in sampling theory do not arise.

5. Bayesian inference provides a satisfactory way of explicitly introducing and keeping track of assumptions about prior knowledge or ignorance.

1.1 Introduction

Goal: Statistical decision theory is concerned with making decisions in the presence of statistical knowledge, which sheds light on some of the uncertainties involved in the decision problem.

3 Types of Information:

1. Sample information: the information from observations.

2. Decision information: the information about the possible consequences of the decisions, for example, the loss due to a wrong decision.

3. Prior information: the information about the parameter from sources other than the data, such as past experience.

1.2 Basic Elements

θ: the parameter.

Θ: the parameter space consisting of all possible values of θ.

a: a decision or action (or some statistic used to estimate θ).

A: the set of all possible actions.

L(θ, a): the loss function,

the loss when the parameter value is θ and the action a is taken.

X₁, …, Xₙ: independent observations from a common distribution F(x|θ).

𝒳: the sample space (all the possible values of X = (X₁, …, Xₙ), usually a subset of ℝⁿ).

E_θ[h(X)] = ∫_𝒳 h(x) dF(x|θ),

where F(x|θ) is the cumulative distribution of X.

Example 2 (continue):

Let

θ = p ∈ Θ = [0, 1] and A = [0, 1].

Then,

X ~ b(10, p) and 𝒳 = {0, 1, …, 10}.

Let

L(p, a) = (p - a)².

Also, let

a = δ(X) = X/10.

Then,

E_p[L(p, δ(X))] = E_p(p - X/10)² = Var_p(X)/100 = p(1-p)/10.

Example 3:

Let X ~ N(θ, 1) and L(θ, a) = (θ - a)². Then,

E_θ[L(θ, X)] = E_θ(X - θ)² = 1.

Example 4:

a₁: sell the stock.

a₂: keep the stock.

θ₁: stock price will go down.

θ₂: stock price will go up.

Let

L(θ₁, a₁) = -500,  L(θ₁, a₂) = 300,  L(θ₂, a₁) = 1000,  L(θ₂, a₂) = -300.

The above loss function can be summarized by

            a₁ (sell)   a₂ (keep)
  θ₁ (down)    -500         300
  θ₂ (up)      1000        -300

Note that there is no sample information from an associated statistical experiment in this example. We call such a problem a no-data problem.

1.3 Expected Loss, Decision Rules, and Risk

Motivation:

In the previous section, we introduced the loss of making a decision (taking an action). In this section, we consider the “expected” loss of making a decision. Two types of expected loss are considered:

- Bayesian expected loss

- Frequentist risk

(a) Bayesian Expected Loss:

Definition:

The Bayesian expected loss of an action a is

ρ(π, a) = E^π[L(θ, a)] = ∫_Θ L(θ, a) dF^π(θ),

where π and F^π are the prior density and cumulative distribution of θ, respectively.

Example 4 (continue):

Let

π(θ₁) = π₁ and π(θ₂) = 1 - π₁,  0 ≤ π₁ ≤ 1.

Then,

ρ(π, a₁) = -500π₁ + 1000(1 - π₁) = 1000 - 1500π₁

and

ρ(π, a₂) = 300π₁ - 300(1 - π₁) = 600π₁ - 300.
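Since the example fixes only the loss table, the prior weight p1 on θ₁ is left as an input in the following illustrative sketch:

```python
# Loss table from Example 4: rows theta_1 (down) / theta_2 (up),
# columns a_1 (sell) / a_2 (keep).
LOSS = {("theta1", "a1"): -500, ("theta1", "a2"): 300,
        ("theta2", "a1"): 1000, ("theta2", "a2"): -300}

def rho(action, p1):
    """Bayesian expected loss rho(pi, a) under the prior (p1, 1 - p1)."""
    return LOSS[("theta1", action)] * p1 + LOSS[("theta2", action)] * (1 - p1)

for p1 in (0.5, 13 / 21, 0.7):
    print(f"p1 = {p1:.3f}: rho(a1) = {rho('a1', p1):8.2f}, "
          f"rho(a2) = {rho('a2', p1):8.2f}")
# The two expected losses cross at p1 = 13/21.
```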

(b) Frequentist Risk:

Definition:

A (nonrandomized) decision rule δ is a function from 𝒳 into A. If X = x is observed, then δ(x) is the action that will be taken. Two decision rules, δ₁ and δ₂, are considered equivalent if

P_θ(δ₁(X) = δ₂(X)) = 1 for all θ.

Definition:

The risk function of a decision rule δ is defined by

R(θ, δ) = E_θ[L(θ, δ(X))] = ∫_𝒳 L(θ, δ(x)) dF(x|θ).

Definition:

If

R(θ, δ₁) ≤ R(θ, δ₂) for all θ ∈ Θ,

with strict inequality for some θ, then the decision rule δ₁ is R-better than the decision rule δ₂. A decision rule δ is admissible if there exists no R-better decision rule. On the other hand, a decision rule δ is inadmissible if there does exist an R-better decision rule.

Note:

A rule δ₁ is R-equivalent to δ₂ if

R(θ, δ₁) = R(θ, δ₂) for all θ.

Example 4 (continue):

In this no-data problem, R(θ, a) = L(θ, a). Then,

L(θ₁, a₁) = -500 < 300 = L(θ₁, a₂)

and

L(θ₂, a₂) = -300 < 1000 = L(θ₂, a₁).

Therefore, both a₁ and a₂ are admissible.

Example 5:

Let

X ~ N(θ, 1),  L(θ, a) = (θ - a)²,  δ₁(X) = X,  δ₂(X) = X/2.

Note that E_θ(X) = θ and Var_θ(X) = 1.

Then,

R(θ, δ₁) = E_θ(θ - X)² = 1

and

R(θ, δ₂) = E_θ(θ - X/2)² = Var_θ(X/2) + (θ - θ/2)² = 1/4 + θ²/4.

Neither rule is R-better than the other: R(θ, δ₂) < R(θ, δ₁) for |θ| < √3, while R(θ, δ₂) > R(θ, δ₁) for |θ| > √3.
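The two risk functions can be checked by simulation; a sketch assuming the N(θ, 1) model and squared-error loss above:

```python
import random

def mc_risk(delta, theta, reps=200_000, seed=0):
    """Monte Carlo estimate of R(theta, delta) = E_theta (theta - delta(X))^2."""
    rng = random.Random(seed)
    return sum((theta - delta(rng.gauss(theta, 1.0))) ** 2
               for _ in range(reps)) / reps

for theta in (0.0, 1.0, 3.0):
    print(f"theta = {theta}: "
          f"R1 ~ {mc_risk(lambda x: x, theta):.3f} (exact 1.000), "
          f"R2 ~ {mc_risk(lambda x: x / 2, theta):.3f} "
          f"(exact {0.25 + theta**2 / 4:.3f})")
```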

Definition:

The Bayes risk of a decision rule δ with respect to a prior distribution π on Θ is defined as

r(π, δ) = E^π[R(θ, δ)] = ∫_Θ R(θ, δ) dF^π(θ).

Example 5 (continue):

Let

θ ~ N(0, 1), i.e., the prior π is the N(0, 1) density.

Then,

r(π, δ₁) = E^π(1) = 1

and

r(π, δ₂) = E^π(1/4 + θ²/4) = 1/4 + 1/4 = 1/2.
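Continuing the sketch, drawing θ from the N(0, 1) prior as well gives the Bayes risks:

```python
import random

def mc_bayes_risk(delta, reps=200_000, seed=1):
    """Estimate r(pi, delta) with theta ~ N(0, 1) and X | theta ~ N(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        theta = rng.gauss(0.0, 1.0)
        x = rng.gauss(theta, 1.0)
        total += (theta - delta(x)) ** 2
    return total / reps

print(mc_bayes_risk(lambda x: x))      # ~1.0 = r(pi, delta_1)
print(mc_bayes_risk(lambda x: x / 2))  # ~0.5 = r(pi, delta_2)
```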

1.4 Decision Principles

The principles used to select a sensible decision are:

(a) Conditional Bayes Decision Principle

(b) Frequentist Decision Principle.

(a)  Conditional Bayes Decision Principle:

Choose an action a ∈ A minimizing ρ(π, a). Such an action will be called a Bayes action and will be denoted a^π.

Example 4 (continue):

Let π(θ₁) = π₁. Thus,

ρ(π, a₁) = 1000 - 1500π₁ and ρ(π, a₂) = 600π₁ - 300.

Therefore,

ρ(π, a₁) < ρ(π, a₂) ⟺ π₁ > 13/21, so a^π = a₁ when π₁ > 13/21 and a^π = a₂ when π₁ < 13/21.
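A minimal sketch of the principle for Example 4 (the prior weights passed to bayes_action are illustrative):

```python
LOSS = {("theta1", "a1"): -500, ("theta1", "a2"): 300,
        ("theta2", "a1"): 1000, ("theta2", "a2"): -300}

def rho(action, p1):
    """Bayesian expected loss under the prior (p1, 1 - p1)."""
    return LOSS[("theta1", action)] * p1 + LOSS[("theta2", action)] * (1 - p1)

def bayes_action(p1):
    """Conditional Bayes decision principle: minimize rho(pi, a) over actions."""
    return min(("a1", "a2"), key=lambda a: rho(a, p1))

print(bayes_action(0.7))  # a1 (sell), since 0.7 > 13/21
print(bayes_action(0.5))  # a2 (keep), since 0.5 < 13/21
```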

(b)  Frequentist Decision Principle:

The 3 most important frequentist decision principles are:

- Bayes risk principle

- Minimax principle

- Invariance principle

(1)  Bayes Risk Principle:

Let D be the class of the decision rules. Then, for δ₁, δ₂ ∈ D, a decision rule δ₁ is preferred to a rule δ₂ based on the Bayes risk principle if

r(π, δ₁) < r(π, δ₂).

A decision rule minimizing r(π, δ) among all decision rules in class D is called a Bayes rule and will be denoted as δ^π. The quantity

r(π) = r(π, δ^π)

is called the Bayes risk for π.

Example 5 (continue):

X ~ N(θ, 1),  θ ~ N(0, 1),  L(θ, a) = (θ - a)².

Let δ_c(X) = cX. Then,

R(θ, δ_c) = E_θ(θ - cX)² = c² + (1 - c)²θ²

and

r(π, δ_c) = E^π[c² + (1 - c)²θ²] = c² + (1 - c)².

Note that r(π, δ_c) is a function of c. r(π, δ_c) attains its minimum as c = 1/2,

(d/dc [c² + (1 - c)²] = 2c - 2(1 - c) = 0 ⟹ c = 1/2).

Thus,

δ^π(X) = δ_(1/2)(X) = X/2

is the Bayes estimator in the class D = {δ_c}.

In addition,

r(π) = r(π, δ_(1/2)) = 1/2.
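A quick grid check that c = 1/2 minimizes r(π, δ_c) under the model and prior above:

```python
def bayes_risk(c):
    """r(pi, delta_c) = c^2 + (1 - c)^2 for X ~ N(theta, 1), theta ~ N(0, 1)."""
    return c**2 + (1 - c) ** 2

grid = [i / 1000 for i in range(1001)]
c_star = min(grid, key=bayes_risk)
print(c_star, bayes_risk(c_star))  # 0.5 0.5
```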

Example 4 (continue):

Let δ_(a₁) ≡ a₁ and δ_(a₂) ≡ a₂ (in a no-data problem a decision rule is simply a choice of action). Then,

r(π, δ_(a₁)) = ρ(π, a₁) = 1000 - 1500π₁

and

r(π, δ_(a₂)) = ρ(π, a₂) = 600π₁ - 300.

Thus, δ^π = a^π, the Bayes action found above, is the Bayes estimator.

Note:

In a no-data problem, the Bayes risk (frequentist principle) is equivalent to the Bayesian expected loss (conditional Bayes principle). Hence, the Bayes risk principle will give the same answer as the conditional Bayes decision principle.

Definition:

Let X have the probability distribution function (or probability density function) f(x|θ), with prior density and prior cumulative distribution π(θ) and F^π(θ), respectively. Then, the marginal density or distribution of X is

m(x) = ∫_Θ f(x|θ) dF^π(θ).

The posterior density or distribution of θ given x is

π(θ|x) = f(x|θ)π(θ)/m(x).

The posterior expectation of θ given x is

E[θ|x] = ∫_Θ θ π(θ|x) dθ.

Very Important Result:

Let X have the probability distribution function (or probability density function) f(x|θ), with prior density and prior cumulative distribution π(θ) and F^π(θ), respectively. Suppose the following two assumptions hold:

(a)  There exists an estimator δ₀ with finite Bayes risk.

(b)  For almost all x, there exists a value δ^π(x) minimizing E[L(θ, a) | X = x] over a.

Then,

(a)

if

L(θ, a) = (θ - a)²,

then

δ^π(x) = E[θ|x], the posterior mean,

and, more generally, if

L(θ, a) = w(θ)(θ - a)²,

then

δ^π(x) = E[w(θ)θ | x] / E[w(θ) | x].

(b)

if

L(θ, a) = |θ - a|,

then δ^π(x) is the median of the posterior density or distribution of θ given x. Further, if

L(θ, a) = k₁(θ - a) if θ - a ≥ 0, and L(θ, a) = k₀(a - θ) if θ - a < 0,

then δ^π(x) is the k₁/(k₀ + k₁) percentile of the posterior density or distribution of θ given x.

(c)

if

L(θ, a) = 0 when |θ - a| ≤ c, and L(θ, a) = 1 when |θ - a| > c,

then δ^π(x) is the midpoint of the interval of length 2c which maximizes the posterior probability P(a - c ≤ θ ≤ a + c | x).

[Outline of proof]

(a)

E[(θ - a)² | x] = E[θ² | x] - 2aE[θ | x] + a²

is a quadratic in a, minimized at a = E[θ | x]. Thus

δ^π(x) = E[θ | x].

(b)

Without loss of generality, assume m is the median of π(θ|x). We want to prove

E[|θ - m| | x] ≤ E[|θ - c| | x],

for any c ≠ m (say c > m). Since

|θ - c| - |θ - m| = c - m if θ ≤ m;  = c + m - 2θ ≥ m - c if m < θ < c;  = m - c if θ ≥ c,

then

E[|θ - c| | x] - E[|θ - m| | x] ≥ (c - m)P(θ ≤ m | x) + (m - c)P(θ > m | x) = (c - m)[2P(θ ≤ m | x) - 1] ≥ 0.

[Intuition of the above proof:]

We want to find a point c such that E[|θ - c| | x] achieves its minimum. As c is moved to the right by a small amount Δ, the loss |θ - c| decreases by Δ for every θ above c and increases by Δ for every θ below c. As P(θ < c | x) < 1/2, more posterior mass lies above c than below, so moving c to the right decreases the expected loss. As P(θ < c | x) > 1/2, moving c to the left decreases the expected loss. Therefore, as c equals the posterior median, E[|θ - c| | x] achieves its minimum.
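A numerical illustration that the posterior median minimizes the expected absolute loss; the discretized Beta(4, 2)-shaped posterior is an arbitrary choice for the demo:

```python
# Discretized posterior on a grid; the Beta(4, 2)-shaped weights are arbitrary.
grid = [i / 1000 for i in range(1, 1000)]
weights = [t**3 * (1 - t) for t in grid]
total = sum(weights)
post = [w / total for w in weights]

def expected_abs_loss(c):
    return sum(p * abs(t - c) for t, p in zip(grid, post))

cum, median = 0.0, None
for t, p in zip(grid, post):
    cum += p
    if cum >= 0.5:
        median = t          # smallest grid point with posterior CDF >= 1/2
        break

best = min(grid, key=expected_abs_loss)
print(f"posterior median = {median:.3f}, argmin of E|theta - c| = {best:.3f}")
```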

(2) Minimax Principle:

A decision rule δ₁ is preferred to a rule δ₂ based on the minimax principle if

sup_θ R(θ, δ₁) < sup_θ R(θ, δ₂).

A decision rule δ^M minimizing sup_θ R(θ, δ) among all decision rules in class D is called a minimax decision rule, i.e.,

sup_θ R(θ, δ^M) = inf_(δ∈D) sup_θ R(θ, δ).

Example 5 (continue):

sup_θ R(θ, δ₁) = sup_θ 1 = 1

and

sup_θ R(θ, δ₂) = sup_θ (1/4 + θ²/4) = ∞.

Thus,

sup_θ R(θ, δ₁) < sup_θ R(θ, δ₂).

Therefore, δ₁ is the minimax decision rule (between δ₁ and δ₂).
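A crude grid check of the two suprema (the θ-range [-10, 10] is an arbitrary truncation; R(θ, δ₂) is in fact unbounded):

```python
def risk1(theta):
    return 1.0                     # R(theta, delta_1)

def risk2(theta):
    return 0.25 + theta**2 / 4     # R(theta, delta_2)

thetas = [t / 10 for t in range(-100, 101)]  # theta restricted to [-10, 10]
print(max(risk1(t) for t in thetas))         # 1.0
print(max(risk2(t) for t in thetas))         # 25.25 here; grows without bound
```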

Example 4 (continue):

D = {a₁, a₂}. Then,

sup_θ L(θ, a₁) = max(-500, 1000) = 1000

and

sup_θ L(θ, a₂) = max(300, -300) = 300.

Thus, δ^M = a₂.

(3) Invariance Principle:

If two problems have identical formal structures (i.e., the same sample space, parameter space, density, and loss function), the same decision rule should be obtained based on the invariance principle.

Example 6:

X: the decay time of a certain atomic particle (in seconds).

Let X be exponentially distributed with mean θ,

f(x|θ) = (1/θ) e^(-x/θ),  x > 0.

Suppose we want to estimate the mean θ. Thus, a sensible (scale-invariant) loss function is

L(θ, a) = (a/θ - 1)².

Suppose

Y: the decay time of a certain atomic particle (in minutes).

Then,

Y = X/60.

Thus,

f(y|η) = (1/η) e^(-y/η),  y > 0,

where η = θ/60.

Let

δ(x): the decision rule used to estimate θ

and

δ*(y): the decision rule used to estimate η.

Based on the invariance principle, the two estimation problems have identical formal structures, so δ* = δ; on the other hand, consistency of units requires δ*(y) = δ(60y)/60. Hence δ(60y) = 60δ(y).

The above argument holds for any transformation of the form Y = cX, c > 0, based on the invariance principle. Then,

δ(cx) = cδ(x) for all c, x > 0, so taking x = 1, δ(c) = cδ(1) = Kc, where K = δ(1).

Thus, δ(x) = Kx is the decision rule based on the invariance principle.
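A tiny check of the functional equation δ(cx) = cδ(x) (the constant K = δ(1) and the test points are arbitrary):

```python
K = 2.0                             # arbitrary positive constant
delta_inv = lambda x: K * x         # rule of the invariant form
delta_not = lambda x: x + 1.0       # a rule violating delta(c x) = c delta(x)

x = 3.0
for c in (0.5, 60.0):
    print(c,
          delta_inv(c * x) == c * delta_inv(x),   # True
          delta_not(c * x) == c * delta_not(x))   # False
```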

1.5 Foundations

There are several fundamental principles discussed in this section. They are:

(a) Misuse of Classical Inference Procedure

(b) Frequentist Perspective

(c) Conditional Perspective

(d) Likelihood Principle

(e) Choosing Decision Principle

(a)  Misuse of Classical Inference Procedure:

Example 7:

Let

X₁, …, Xₙ be i.i.d. N(θ, σ²), σ² known, with sample mean x̄ and z = √n x̄/σ.

In the classical inference problem,

H₀: θ = 0 vs. H₁: θ ≠ 0,

the rejection rule is

reject H₀ if |z| > 1.96,

as α = 0.05.

Assume the true mean θ is very close to, but not exactly, 0, and the sample size n is very large. Suppose the resulting |z| > 1.96;

then we reject H₀.

Intuitively, x̄ ≈ 0 seems to strongly indicate H₀ should be true. However, for a large sample size, even as x̄ is very close to 0, the classical inference method still indicates the rejection of H₀. The above result seems to contradict the intuition.

Note: it might be more sensible to test, for example,

H₀: |θ| ≤ ε vs. H₁: |θ| > ε, for some small ε > 0.

Example 8:

Let

X₁, …, Xₙ be i.i.d. N(θ, σ²), σ² known, with z = √n x̄/σ.

In the classical inference problem,

H₀: θ ≤ 0 vs. H₁: θ > 0,

the rejection rule is

reject H₀ if z > z_α = 1.645,

as α = 0.05.

If z = 1.64, then z < 1.645 and we do not reject H₀. However, as z = 1.65,

z > 1.645,

we then reject H₀.

(b)  Frequentist Perspective:

Example 9:

Let

X₁, …, Xₙ be i.i.d. N(θ, σ²).

In the classical inference problem,

H₀: θ = θ₀ vs. H₁: θ = θ₁,

the rejection rule is

reject H₀ if z > z_α,

as α = 0.05. By employing the above rejection rule, about 5% of the repetitive uses of the test in which H₀ is true will wrongly reject the null hypothesis. However, suppose the parameter values θ₀ and θ₁ occur equally often in repetitive use of the test. Thus, the chance of H₀ being true is 0.5. Therefore, correctly speaking, the 5% error rate is only correct for the 50% of repetitive uses in which H₀ holds. That is, one cannot make a useful statement about the actual error rate incurred in repetitive use without knowing π(θ) for all θ.

(c)  Conditional Perspective:

Example 10:

Frequentist viewpoints:

X₁ and X₂ are independent with identical distribution,

P_θ(Xᵢ = θ + 1) = P_θ(Xᵢ = θ - 1) = 1/2,  i = 1, 2.

Then,

δ(X₁, X₂) = (X₁ + X₂)/2 if X₁ ≠ X₂, and δ(X₁, X₂) = X₁ - 1 if X₁ = X₂,

can be used to estimate θ.

In addition,

P_θ(δ(X₁, X₂) = θ) = 0.75 for all θ.

Thus, a frequentist claims a 75% confidence procedure.

Conditional viewpoints:

Given X₁ ≠ X₂, δ is 100% certain to estimate θ correctly, i.e., P_θ(δ = θ | X₁ ≠ X₂) = 1.

Given X₁ = X₂, δ is 50% certain to estimate θ correctly, i.e., P_θ(δ = θ | X₁ = X₂) = 1/2.
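A simulation of both viewpoints (θ = 0 is an arbitrary choice; the coverage probabilities do not depend on it):

```python
import random

rng = random.Random(0)
theta = 0.0                  # arbitrary; coverage does not depend on theta
N = 100_000
hits = same = same_hits = diff = diff_hits = 0
for _ in range(N):
    x1 = theta + rng.choice((1, -1))
    x2 = theta + rng.choice((1, -1))
    est = (x1 + x2) / 2 if x1 != x2 else x1 - 1
    hit = est == theta
    hits += hit
    if x1 == x2:
        same += 1
        same_hits += hit
    else:
        diff += 1
        diff_hits += hit

print(f"overall:        {hits / N:.3f}")           # ~0.75
print(f"given X1 != X2: {diff_hits / diff:.3f}")   # ~1.00
print(f"given X1 == X2: {same_hits / same:.3f}")   # ~0.50
```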

Example 11:

  x                    1         2        3
  f(x|θ₁)          0.005     0.005     0.99
  f(x|θ₂)         0.0051    0.9849     0.01
  f(x|θ₂)/f(x|θ₁)   1.02    196.98     0.01

x = 1: some index (today) indicating the stock (tomorrow) will not go up or go down

x = 2: some index (today) indicating the stock (tomorrow) will go up

x = 3: some index (today) indicating the stock (tomorrow) will go down

θ₁: the stock (tomorrow) will not go up

θ₂: the stock (tomorrow) will go up

Frequentist viewpoints:

To test

H₀: θ = θ₁ vs. H₁: θ = θ₂,

by the most powerful test with α = 0.01, we reject H₀ as x = 1 or x = 2 since

P_(θ₁)(X ∈ {1, 2}) = 0.005 + 0.005 = 0.01 = α.

Thus, as x = 1, we reject H₀ and conclude the stock will go up. This conclusion might not be very convincing since the index x = 1 does not indicate the rise of the stock.

Conditional viewpoints:

As x = 1,

f(1|θ₁) = 0.005 and f(1|θ₂) = 0.0051.

Thus, f(1|θ₁) and f(1|θ₂) are very close to each other. Therefore, based on conditional viewpoints, there is about a 50% chance that the stock will go up tomorrow.
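The conditional claim can be made precise by a posterior computation; the equal prior weights below are an assumption suggested by the "about 50%" reading:

```python
# Likelihoods f(x | theta) from the table in Example 11.
f = {"theta1": {1: 0.005, 2: 0.005, 3: 0.99},
     "theta2": {1: 0.0051, 2: 0.9849, 3: 0.01}}

def prob_up(x, prior2=0.5):
    """P(theta_2 | x) under a two-point prior with P(theta_2) = prior2."""
    num = f["theta2"][x] * prior2
    return num / (num + f["theta1"][x] * (1 - prior2))

for x in (1, 2, 3):
    print(f"x = {x}: P(stock goes up | x) = {prob_up(x):.4f}")
# x = 1 gives ~0.5050, so rejecting H0 at x = 1 overstates the evidence.
```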

Example 12:

Suppose there are two laboratories, one in Kaohsiung and the other in Taichung. Then, we flip a coin to decide the laboratory we will perform an experiment at:

Heads: Kaohsiung; Tails: Taichung.

Assume the coin comes up tails. Then, the laboratory in Taichung should be used.

Question: do we need to perform another experiment in Kaohsiung in order to develop the report?

Frequentist viewpoints: we have to average over all possible data, including the data that could have been obtained in Kaohsiung.