Journeys of an Accidental Statistician

A partially anecdotal account of “A Unified Approach to the Classical Statistical Analysis of Small Signals,” GJF and Robert D. Cousins, Phys. Rev. D57, 3873 (1998) and further unpublished work.

A Simple Problem

Suppose you are searching for a rare process and have a well-known expected background of 3 events, and you observe 0 events. What 90% confidence limit can you set on the unknown rate for this rare process?

A classical (or frequentist) statistician makes a statement about the probability of data given theory. That is, given a hypothesis for the value of an unknown true value µ, he or she will give you the probability of obtaining a set of data x, P(x | µ).

A classical confidence interval (Jerzy Neyman, 1937) is a statement of the form: The unknown true value of µ lies in the region [µ1,µ2]. If this statement is made at the 90% confidence level, then it will be true 90% of the time, and false 10% of the time.

Poisson statistics: P(x = 0 | µ = 2.3) = 0.1. Therefore, in the “standard” classical approach, µ < 2.3 at 90% C.L. Since µ = s + b and b = 3.0, s < -0.7 at 90% C.L.

Thus, we are led to a statement that we know is a priori false.
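The arithmetic behind this statement can be checked in a few lines; a minimal sketch (the value 2.3 is just -ln 0.1, the Poisson mean at which P(x = 0) drops to 10%):

```python
import math

# P(x = 0 | mu_tot) for a Poisson process is exp(-mu_tot).
mu_tot = -math.log(0.1)     # total mean excluded at 90% C.L. by observing x = 0
print(round(mu_tot, 2))     # 2.3

b = 3.0                     # well-known expected background
s_limit = mu_tot - b        # implied "limit" on the signal
print(round(s_limit, 2))    # -0.7: an a priori false statement
```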

Bayesian Statistics

A Bayesian takes the opposite position from a classical statistician. He or she calculates the probability of theory given data. That is, given a set of data x, he or she will calculate the probability that the unknown true value is µ, P(µ | x).

This appears attractive because it is what you really want to know. However, it comes at a price:

P(x | µ) and P(µ | x) are related by Bayes’ Theorem, which follows from writing the probability that an element is in both A and B in two ways:

P(A | B) P(B) = P(B | A) P(A)

which for probabilities becomes

P(µ | x) = P(x | µ) P(µ)/P(x).

P(x) is just a normalization term, but Bayes’ Theorem transforms P(µ), the prior distribution of “degree of belief” in µ, into P(µ | x), the posterior distribution.

A “credible interval” or “Bayesian confidence interval” is formed by choosing an interval [µ1, µ2] that contains the desired fraction of the posterior probability: ∫ P(µ | x) dµ = 0.90 over [µ1, µ2].

An Example

Suppose you have a large number of marbles, which are either white or black, and you wish information on the fraction that are white, µ. You draw a single marble, and it is white. What can you say at 90% confidence?

Classical: µ ≥ 0.1

Bayesian:

Prior / 90% C.L. limit
flat / µ ≥ 0.316
1/µ / µ ≥ 0.1
1/(1-µ) / unnormalizable
µ / µ ≥ 0.464
(1-µ) / µ ≥ 0.196

Notice that most of the Bayesian priors do not cover, i.e., they do not yield true statements the stated fraction of the time (90% in this case). There is no requirement that credible intervals cover. However, Bob Cousins warns [Am. J. Phys. 63, 398 (1995)],

“…if a Bayesian method is known to yield intervals with frequentist coverage appreciably less than the stated C.L. for some possible value of the unknown parameters, then it seems to have no chance of gaining consensus acceptance in particle physics.”

The Role of Bayesian Statistics

Prosper [Phys. Rev. D 37, 1153 (1988)] argues for a 1/µ prior based on a scaling argument. I found it unsatisfactory for two reasons:

(1) It fails for x = 0. (Unnormalizable)

(2) In general, it undercovers.

I decided to continue my search for a solution to the simple problem in classical statistics.

To quote Bob’s prose from our paper:

“In our view, the attempt to find a non-informative prior within Bayesian inference is misguided. The real power of Bayesian inference lies in its ability to incorporate ‘informative’ prior information, not ‘ignorance.’”

Prosper wrote that he was using a Bayesian approach because

“…we are merely acknowledging the fact that a coherent solution to the small-signal problem is more easily achieved within a Bayesian framework than one which uses the methods of ‘classical’ statistics.”

By the end of this talk, I hope to convince you that this is no longer true.

Construction of Confidence Intervals

Neyman’s prescription: Before doing an experiment, for each possible value of the theory parameters determine a region of data that occurs C.L. of the time, say 90%. After doing the experiment, find all values of the theory parameters for which your data lies in their 90% region. This is the confidence interval.

Notice that there is complete freedom of choice of which 90% to choose. This will be the key to our solution.

Examples of Poisson Confidence Belts

90% C.L. upper limits for Poisson µ with background = 3

90% C.L. central limits for Poisson µ with background = 3

The Solution

For both the upper limit and central limit, x = 0 excludes the whole plane. But consider the problem from the point of view of the data. If one measures no events, then clearly the most likely value of µ is zero. Why should one rule out the most likely scenario?

Therefore, we propose a new ordering principle based on the ratio of the probability of the data for a given µ to its probability for the most likely physically allowed µ:

R = P(x | µ) / P(x | µbest),

where µbest is the physically allowed value of µ that maximizes P(x | µ).

Example for µ = 0.5 and b = 3:

x / P(x|µ) / µbest / P(x|µbest) / R / rank / U.L. / central
0 / 0.030 / 0.0 / 0.050 / 0.607 / 6 •
1 / 0.106 / 0.0 / 0.149 / 0.708 / 5 • / • / •
2 / 0.185 / 0.0 / 0.224 / 0.826 / 3 • / • / •
3 / 0.216 / 0.0 / 0.224 / 0.963 / 2 • / • / •
4 / 0.189 / 1.0 / 0.195 / 0.966 / 1 • / • / •
5 / 0.132 / 2.0 / 0.175 / 0.753 / 4 • / • / •
6 / 0.077 / 3.0 / 0.161 / 0.480 / 7 • / • / •
7 / 0.039 / 4.0 / 0.149 / 0.259 / • / •
8 / 0.017 / 5.0 / 0.140 / 0.121 / •
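The construction in the table can be sketched in a few lines of Python (a minimal, illustrative implementation of the likelihood-ratio ordering, not the code used for the paper):

```python
import math

def poisson(x, mean):
    return mean**x * math.exp(-mean) / math.factorial(x)

def fc_acceptance(mu, b, cl=0.90, xmax=50):
    """Acceptance region {x} for signal mu and background b under the
    ordering R = P(x|mu) / P(x|mu_best)."""
    rows = []
    for x in range(xmax):
        mu_best = max(0.0, x - b)          # most likely physical signal for this x
        r = poisson(x, mu + b) / poisson(x, mu_best + b)
        rows.append((r, x))
    rows.sort(reverse=True)                # admit x values in decreasing R...
    accepted, total = [], 0.0
    for r, x in rows:
        accepted.append(x)
        total += poisson(x, mu + b)
        if total >= cl:                    # ...until they hold at least 90% probability
            break
    return sorted(accepted)

print(fc_acceptance(0.5, 3.0))             # [0, 1, 2, 3, 4, 5, 6], as ranked in the table
```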

The Poisson Limits

90% C.L. unified limits for Poisson µ with background = 3

Excerpt from Table IV of the paper (90% C.L.):

x \ b / 0.0 / 1.0 / 2.0 / 3.0
0 / 0.00- 2.44 / 0.00- 1.61 / 0.00- 1.26 / 0.00- 1.08
1 / 0.11- 4.36 / 0.00- 3.36 / 0.00- 2.53 / 0.00- 1.88
2 / 0.53- 5.91 / 0.00- 4.91 / 0.00- 3.91 / 0.00- 3.04
3 / 1.10- 7.42 / 0.10- 6.42 / 0.00- 5.42 / 0.00- 4.42
4 / 1.47- 8.60 / 0.74- 7.60 / 0.00- 6.60 / 0.00- 5.60
5 / 1.84- 9.99 / 1.25- 8.99 / 0.43- 7.99 / 0.00- 6.99
6 / 2.21-11.47 / 1.61-10.47 / 1.08- 9.47 / 0.15- 8.47
7 / 3.56-12.53 / 2.56-11.53 / 1.59-10.53 / 0.89- 9.53
8 / 3.96-13.99 / 2.96-12.99 / 2.14-11.99 / 1.51-10.99
9 / 4.36-15.30 / 3.36-14.30 / 2.53-13.30 / 1.88-12.30

Examples of Gaussian Confidence Belts

90% C.L. upper limits for Gaussian µ ≥ 0 vs. x (total – background)

90 % C.L. central limits for Gaussian µ ≥ 0


Flip-flopping

How might a typical physicist use these plots?

(1) “If the result x < 3, I will quote an upper limit.”

(2) “If the result x > 3, I will quote a central confidence interval.”

(3) “If the result x < 0, I will pretend I measured zero.”

This results in the following:

In the range 1.36 ≤ µ ≤ 4.28, there is only 85% coverage!

Due to flip-flopping (deciding whether to use an upper limit or a central confidence region based on the data) these are not valid confidence intervals.
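The 85% figure is easy to check by Monte Carlo. In the sketch below, the one-sided 90% limit is x + 1.28 and the two-sided 90% interval is x ± 1.64 (the standard Gaussian quantiles; the text states the flip-flop policy only qualitatively, so these numbers are an assumption consistent with the plots):

```python
import random

def flip_flop_coverage(mu, n=200_000, seed=12345):
    """Fraction of experiments whose flip-flop interval contains the true mu.
    Measurement: x ~ N(mu, 1).  Policy:
      x < 3 : quote upper limit  mu < max(x, 0) + 1.28   (90% one-sided)
      x >= 3: quote central interval  x +/- 1.64         (90% two-sided)"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.gauss(mu, 1.0)
        if x < 3.0:
            covered = mu <= max(x, 0.0) + 1.28
        else:
            covered = abs(x - mu) <= 1.64
        hits += covered
    return hits / n

print(round(flip_flop_coverage(3.0), 3))   # about 0.85 inside the bad range
print(round(flip_flop_coverage(6.0), 3))   # about 0.90 outside it
```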

Unified Solution to the Gaussian Case

Notes:

(1) This approaches the central limits for x > 1.

(2) The upper limit for x = 0 is 1.64, the two-sided rather than the one-sided limit.

(3) From the defining 1937 paper of Neyman, this is the only valid confidence belt, since there are 4 requirements for a valid belt:

(a) It must cover.

(b) For every x, there must be at least one µ.

(c) No holes (only valid for single µ).

(d) Every limit must include its end points.

Including Errors on the Background

Our paper assumed perfect knowledge of the backgrounds. However, often backgrounds are estimated from an ancillary experiment (sidebands, target-out run, etc.) How does the error on the ancillary experiment affect the limits?

A Bayesian can solve this problem trivially by integrating over the background distribution. However, this is not permitted for a frequentist, who must provide coverage for all possible values of the unknown true background rate.

[The unknown true background rate is technically known as a “nuisance parameter,” because one must provide coverage for it, but you really do not care what its value is.]

The exact efficient solution to this problem is computationally taxing, and has never been published. This is a second reason why physicists have been driven to Bayesian statistics.

However, there is a simple solution to this problem within the unified approach, and Bob and I have agreed that we will publish it this summer.

Trick for Including Errors on Backgrounds

If one provides coverage for the “worst case” of the nuisance parameter, then perhaps one will have provided coverage for all possible values of the nuisance parameter.

Let µ = unknown true value of the signal parameter,

β = unknown true value of the background parameter,

x = measurement of µ + β, and

b = measurement of β in the ancillary experiment.

The rank R becomes

R = [P(x | µ + β̂(µ)) P(b | β̂(µ))] / [P(x | µ̂ + β̂) P(b | β̂)],

where µ̂ and β̂ maximize the denominator (as usual), and β̂(µ) maximizes the numerator for the fixed value of µ.

Since R is only a function of µ and the data, one proceeds to construct the confidence belt as before.
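A minimal sketch of this profiled ratio for the Poisson case, with x ~ Pois(µ + β) and b ~ Pois(β). The grid maximization is a crude stand-in for the exact maximization, purely for illustration:

```python
import math

def pois(n, mean):
    if mean <= 0.0:
        return 1.0 if n == 0 else 0.0
    return mean**n * math.exp(-mean) / math.factorial(n)

def profile_R(mu, x, b, r=1.0, hi=15.0, step=0.05):
    """Likelihood ratio R with the nuisance background beta profiled out:
    x ~ Pois(mu + beta), b ~ Pois(beta / r).  Grid search, illustrative only."""
    grid = [step * k for k in range(int(hi / step) + 1)]
    def lik(m, beta):
        return pois(x, m + beta) * pois(b, beta / r)
    num = max(lik(mu, beta) for beta in grid)               # beta-hat(mu): fixed mu
    den = max(lik(m, beta) for m in grid for beta in grid)  # mu-hat, beta-hat: global
    return num / den
```

For x = 0 the global maximum sits at µ̂ = 0, so R(µ = 0) = 1 and R falls off for larger µ, exactly the behavior that keeps µ = 0 from being excluded by a downward fluctuation.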

Examples and Test of the Trick

Let x be a Poisson measurement of µ + β, and

b be a Poisson measurement of β/r in an ancillary experiment (i.e., r = signal region/control region, so that rb estimates β),

r \ (n, rb) / 0, 3 / 3, 3 / 6, 3 / 9, 3
0.0 / 0.00- 1.08 / 0.00- 4.42 / 0.15- 8.47 / 1.88-12.30
0.5 / 0.00- 1.11 / 0.00- 4.42 / 0.00- 8.47 / 1.75-12.30
1.0 / 0.00- 1.49 / 0.00- 4.73 / 0.00- 8.70 / 1.32-12.55
3.0 / 0.00- 1.57 / 0.00- 4.85 / 0.00- 9.36 / 0.00-13.03

A test of coverage for r = 1.0, 0 ≤ µ ≤ 20, 0 ≤ β ≤ 15 at 10,000 randomly picked values of µ and β:

Median coverage: / 90.25%
Fraction that undercovered: / 4.10%
Median coverage for those that undercovered: / 89.96%
Worst undercoverage: / 89.46%

Comment from Harvard statisticians: “We have never seen a statistical approximation work this well.”

Back to Neutrinos

As I had hoped, once we had the solution to the simple problem, the solution to the neutrino problem was obvious, unique, and natural. (My postdoc, Fred Weber, was doing almost the correct thing instinctively.)

Probability of oscillation: P = sin²(2θ) sin²(1.27 ∆m² L/E)

Since -2 ln R is a difference of χ² values, the ordering principle becomes

∆χ² = χ²(data | point in plane) - χ²(data | best-fit point in plane).

The prescription: To calculate the acceptance region for each point in the plane, do a Monte Carlo simulation and determine what critical value of ∆χ² will contain 90% of the data.

Combining Experiments: The Higgs Mass

An example that may be of interest to LEP experimenters is how to use these techniques to combine different experimental measurements of the Higgs Mass.

Let µ be the unknown true value of the Higgs mass, which presumably determines the expected signal counting rate sᵢ(µ) in each experiment, where i is an index denoting the experiment.

In each experiment i, let βᵢ be the unknown true rate of background. Let bᵢ be a measure of the background in an ancillary experiment, say the sidebands.

Finally, let nᵢ be the measured rate of sᵢ(µ) + βᵢ.

The ordering principle becomes:

R = ∏ᵢ P(nᵢ | sᵢ(µ) + β̂ᵢ(µ)) P(bᵢ | β̂ᵢ(µ)) / ∏ᵢ P(nᵢ | sᵢ(µ̂) + β̂ᵢ) P(bᵢ | β̂ᵢ),

where µ̂ and the β̂ᵢ maximize the denominator, and the β̂ᵢ(µ) maximize the numerator for the fixed value of µ.

The Higgs Mass Example (continued)

The maximization of the numerator can be done analytically; in general, the maximization of the denominator requires a simple numerical solution.

For each value of µ, the critical value Rc(µ), the value of R exceeded by 90% of the possible data, is determined. If the explicit sum is computationally impractical, a Monte Carlo calculation can be substituted.

If the measured values of nᵢ and bᵢ are Nᵢ and Bᵢ, then the 90% confidence interval for µ is given by the set of µ for which R(µ; Nᵢ, Bᵢ) ≥ Rc(µ).

The Higgs Mass Example (continued)

Using the following information from CERN-EP/98-046,“Lower bound for the Standard Model Higgs boson mass from combining the results of the four LEP experiments”

Experiment / Expected Background / Seen
Aleph / 0.84 / 0
Delphi / 4.24 / 2
Opal / 4.07 / 2
Total / 9.15 / 4

(Not enough information given to include L3.)

and the figure giving expected signal vs. Higgs mass,

assuming backgrounds are perfectly known, I obtained

mH < 75 GeV at 95% C.L.,

but a “sensitivity” to the Higgs Mass of only 71 GeV.

(5% probability of observing 4 events when you expect 9.15.)

(CERN-EP/98-047 numbers: 77.5 GeV and 75.5 GeV)

Brand X Comparisons: Raster Scan

We built a toy model to compare this technique with three other classical techniques that have, or could have, been used in previous experiments.

Raster scan: For each ∆m², scan along sin²(2θ) to find the minimum χ² (including the unphysical region), and take the 90% confidence interval to be the region with χ² ≤ χ²min + 2.71 (the two-sided limit).

This technique (and the other two to be discussed) suffers from the problem we originally set out to solve, the possibility of excluding all physically allowed values. However, it does cover exactly; it is simply not powerful (i.e., it does not do an efficient job of rejecting false hypotheses).


Brand X Comparisons: Flip-flop Raster Scan

Flip-flop raster scan: Same as the raster scan, but use a one-sided limit (χ² ≤ χ²min + 1.64), unless the signal is more than 3σ.

This has all the features of the raster scan, but it significantly undercovers (85%) in a substantial region:

Brand X Comparisons: Global Scan

Global scan: Find the global minimum χ² and take χ² ≤ χ²min + 4.61, the two-sided 90% C.L. for χ² with 2 degrees of freedom.
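The thresholds 2.71 and 4.61 used here and in the raster scan are the standard 90% points of χ² with 1 and 2 degrees of freedom; a quick check using only the standard library:

```python
import math
from statistics import NormalDist

# chi^2 with 1 d.o.f. is the square of a standard normal, so its 90% point
# is the square of the 95% normal quantile:
q1 = NormalDist().inv_cdf(0.95) ** 2
print(round(q1, 2))              # 2.71

# chi^2 with 2 d.o.f. has CDF 1 - exp(-x/2), so its 90% point is -2 ln 0.1:
q2 = -2.0 * math.log(0.1)
print(round(q2, 2))              # 4.61
```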

This technique is powerful, but has significant regions of both undercoverage (76%) and overcoverage (94%).

Possible reasons:

(1) Sinusoidal nature of the oscillation function.

(2) One-dimensional regions.

Brand X Comparisons: Scorecard

Technique / Always gives useful information / Gives proper coverage / Is powerful
Raster scan / X / √ / X
Flip-flop R.S. / X / X / X
Global scan / X / X / √
This technique / √ / √ / √

Sensitivity

The main objection to this work has been that an experiment that observes fewer events than the expected background may report a lower upper limit than a (better designed ?) experiment that has no background.

To address this problem and to provide additional information for the reader’s assessment of the significance of the results, we suggest that experiments that have fewer counts than expected background also report their sensitivity, which we define as the average upper limit that would be obtained by an ensemble of experiments with the expected background and no true signal.
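For the simple problem, the sensitivity can be computed directly from the b = 3.0 column of Table IV above: weight each upper limit by the Poisson probability of observing that many background events. A minimal sketch (the n > 9 tail, about 0.1% of the probability, is neglected):

```python
import math

# 90% C.L. upper limits for b = 3.0, n = 0..9, read from Table IV.
upper = [1.08, 1.88, 3.04, 4.42, 5.60, 6.99, 8.47, 9.53, 10.99, 12.30]

def pois(n, mean):
    return mean**n * math.exp(-mean) / math.factorial(n)

b = 3.0
# Sensitivity = average upper limit over an ensemble with no true signal.
sensitivity = sum(pois(n, b) * upper[n] for n in range(10))
print(round(sensitivity, 2))    # 4.41
```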

For example, in the initial NOMAD publication of 1995 data, we reported an upper limit on high-mass oscillations of sin²(2θ) < … and a sensitivity of …. A few days ago, at the Neutrino ’98 conference, we reported preliminary results from a partial analysis of 1995–97 data yielding an upper limit of … and a sensitivity of ….

Visit to Harvard Statistics Department

Towards the end of this work, I decided to try it out on some professional statisticians whom I know at Harvard.

They told me that this was the standard method of constructing a confidence interval!

I asked them if they could point to a single reference of anyone using this method before, and they could not.

They explained that in statistical theory there is a one-to-one correspondence between a hypothesis test and a confidence interval. (The confidence interval is a hypothesis test for each value in the interval.) The Neyman-Pearson Theorem states that the likelihood ratio gives the most powerful hypothesis test. Therefore, it must be the standard method of constructing a confidence interval.

I decided to start reading about hypothesis testing…

Kendall and Stuart (1961)

At the start of chapter 24 of Kendall and Stuart’s The Advanced Theory of Statistics (chapter 23 of Stuart and Ord), I found 1 1/4 cryptic pages that contain all of the information, published and unpublished, in this lecture.

(See text)

1998 Particle Data Group Review

The 1998 PDG “Review of Particle Physics” [European Physical Journal C3 (1998) 1] has supported this approach strongly:

“The Bayesian methodology, while well adapted to decision-making situations, is not in general appropriate for the objective presentation of experimental data.”

“In order to correct this problem [undercoverage due to flip-flopping]…it is necessary to adopt a unified procedure…. The appropriate unified procedure and ordering principle are given in Ref. 11 [Feldman-Cousins].”

“If it is desired to report an upper limit in an objective way such that it has a well-defined statistical meaning in terms of limiting frequency, then report the Frequentist confidence bound(s) as given by the unified Feldman-Cousins approach.”

“…we suggest that… a measure of sensitivity should be reported whenever expected background is larger or comparable to the number of observed counts. The best such measure we know of is that proposed and tabulated in Ref. 11….”

Conclusion

I have yet to find a problem in the construction of classical confidence intervals and regions that is not solvable by the ordering principle suggested here.

For example, we have been using it with success in the NOMAD experiment to combine the results for different τ decay modes, including errors on the background determination.

Post script for LEP audiences:

The LEP Higgs paper (CERN-EP/98-046) states:

“There is no unique, exact, method to deal with this problem.”

• The method presented here is exact by construction.

• I believe that this method, due to its simplicity and its reliance on classical methods (the classical construction of the confidence interval and the use of the likelihood ratio), deserves to be called unique, if any method does.


Gary Feldman