
Bayesian versus Orthodox statistics: Which side are you on?

RUNNING HEAD: Bayesian versus orthodox statistics

Zoltan Dienes

School of Psychology

University of Sussex

Brighton BN1 9QH UK

Abstract

Some common situations are presented where Bayesian and orthodox approaches to statistics come to different conclusions; you can see where your intuitions initially lie. The approaches are placed in the context of different notions of rationality, and I accuse myself and others of having been irrational in the way we have used statistics, that is, as orthodox statisticians. One notion of rationality is having sufficient justification for one's beliefs. Assuming one can assign continuous numerical degrees of justification to beliefs, some simple minimal desiderata lead to the "likelihood principle" of inference. Hypothesis testing violates the likelihood principle, indicating that some of the most deeply held intuitions we train ourselves to have as orthodox users of statistics are irrational on a key intuitive notion of rationality. Finally, I consider practical matters, so that people can make a start at being Bayesian, if they so wish: If we want to, we really can change!

Keywords: Statistical inference, Bayes, significance testing, evidence

Bayesian versus Orthodox statistics: Which side are you on?

Introduction

Psychology and other disciplines have benefited enormously from having a rigorous procedure for extracting inferences from data. The question this article raises is whether we could be doing it better. Two main approaches are contrasted: orthodox statistics versus the Bayesian approach. Around the 1940s the heated debate between the two camps was momentarily won, in terms of what users of statistics did: Users followed the approach systematised by Jerzy Neyman and Egon Pearson (at least this approach defined the norms; in practice researchers followed the somewhat different advice of Ronald Fisher; see e.g. Gigerenzer, 2004). But the intellectual debate was not decisively won. It was more a practical matter: a matter of which approach had its mathematics most fully worked out for detailed application at the time, and which approach was conceptually easier for the researcher to apply. But now the practical problems have been largely solved; there is little to stop researchers being Bayesian in almost all circumstances. Thus the intellectual debate can be opened up again, and indeed it has been (e.g. Hoijtink, Klugkist, & Boelen, 2008; Howard, Maxwell, & Fleming, 2000; Rouder et al., 2007; Taper & Lele, 2004). It is time for researchers to consider foundational issues in inference. And it is time to consider whether the fact that it takes less thought to calculate p values is really an advantage, or whether it has led us astray in interpreting data (e.g. Harlow, Mulaik, & Steiger, 1997; Meehl, 1967; Royall, 1997; Ziliak & McCloskey, 2008), despite the benefits it has also provided. Indeed, I argue we would be most rational, under one intuitively compelling notion of rationality, to be Bayesians. To see which side your intuitions fall on, at least initially, we next consider some common situations where the approaches come to different conclusions.

Bayesian or Orthodox: Where do your intuitions fall?

Consider the following scenarios and see what your intuitions tell you. You might reject all the answers or feel attracted to more than one. Real research questions do not have pat answers. See if, nonetheless, you have clear preferences for one or a couple of answers over the others. Almost all the answers are consistent either with some statistical approach or with what a large section of researchers do in practice, so do not worry about picking out the one 'right' answer (though, given certain assumptions, I will argue that there is one right answer!).

1) You have run the 20 subjects you planned and obtain a p value of .08. Although you predicted a difference, you know this result will not convince any editor, so you run 20 more subjects. SPSS now gives a p of .01. (A simulation sketch follows the response options.) Would you:

a) Submit the study with all 40 participants and report an overall p of .01?

b) Regard the study as non-significant at the 5% level and stop pursuing the effect in question, as each individual 20-subject study had a p of .08?

c) Use a method of evaluating evidence that is not sensitive to your intentions concerning when you planned to stop collecting subjects, and base conclusions on all the data?
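To make the orthodox concern in this scenario concrete, here is a minimal simulation sketch (mine, not part of the scenario; the 10,000 replications and normally distributed scores are illustrative assumptions). It applies the scenario's procedure, testing at 20 subjects and extending to 40 only if non-significant, in a world where the null hypothesis is true:

```python
# Monte Carlo sketch: optional stopping inflates the long-run Type I error
# rate of a two-tailed one-sample t test. The null is true in every run.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 10_000
false_positives = 0

for _ in range(n_sims):
    data = rng.normal(0, 1, 20)               # the 20 subjects you planned; true effect is zero
    p = stats.ttest_1samp(data, 0).pvalue
    if p >= .05:                              # "not convincing": run 20 more subjects
        data = np.concatenate([data, rng.normal(0, 1, 20)])
        p = stats.ttest_1samp(data, 0).pvalue
    if p < .05:
        false_positives += 1

print(false_positives / n_sims)               # comes out near .08, not the nominal .05
```

On Neyman-Pearson logic the extra look changes the long-run error rate of the procedure, which is why answer (a) is problematic; answer (c) anticipates the Bayesian position developed below.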

2) After collecting data in a three-way design you find an unexpected partial two-way interaction; specifically, you obtain a two-way interaction (p = .03) for just the males and not the females. After talking to some colleagues and reading the literature you realise there is a neat way of accounting for these results: certain theories can be used to predict the interaction for the males, but they say nothing about females. Would you:

a) Write up the introduction based on the theories leading to a planned contrast for the males, which is then significant?

b) Treat the partial two-way as non-significant, as the three-way interaction was not significant, and the partial interaction won’t survive any corrections for post hoc testing?

c) Determine how strong the evidence of the partial two-way interaction is for the theory you put together to explain it, with no regard to whether you happen to think of the theory before seeing the data or afterwards, as all sorts of arbitrary factors could influence when you thought of a theory?

3) You explore five possible ways of inducing subliminal perception as measured with priming. Each method interferes with vision in a different way. The test for each method has a power of 80% for a 5% significance level to detect the size of priming produced by conscious perception. Of these methods, the results for four are non-significant and one, the Continuous Flash Suppression, is significant, p = .03, with a priming effect numerically similar in size to that found with conscious perception. (A short calculation follows the response options.) Would you:

a) Report the test as p = .03 and conclude there is subliminal perception for this method?

b) Note that when a Bonferroni-corrected significance value of .05/5 is used, all tests are non-significant, and conclude subliminal perception does not exist by any of these methods?

c) Regard the strength of evidence provided by these data for subliminal perception produced by Continuous Flash Suppression to be the same regardless of whether or not four other rather different methods were tested?
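A back-of-an-envelope calculation shows why answers (a) and (b) pull apart (my own sketch; it assumes the five tests are independent):

```python
# Familywise Type I error rate across five tests, and the Bonferroni
# threshold used in answer (b). Assumes the tests are independent.
alpha = 0.05
k = 5                                      # five induction methods, as in the scenario

familywise = 1 - (1 - alpha) ** k          # chance of >= 1 false positive if all nulls are true
bonferroni = alpha / k                     # per-test criterion capping the familywise rate near alpha

print(f"uncorrected familywise rate: {familywise:.3f}")   # ~0.226
print(f"Bonferroni per-test criterion: {bonferroni}")     # 0.01, which p = .03 fails
```

Answer (c), by contrast, holds that the evidence provided by the Continuous Flash Suppression data is whatever it is, regardless of what other tests happened to be run.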

4) A theory predicts a difference in reaction time between two conditions. A previous study finds a significant difference between the conditions of 20 seconds, with a Cohen's dz of 0.5. You wish to replicate the effect in your lab. In order to obtain a conventional power of 80%, you run 35 subjects and find a t of 1.0 and a p of .32. (A Bayes factor sketch follows the response options.) Would you:

a) conclude that under the conditions of your replication experiment there is no effect?

b) conclude that null results are never informative and withhold judgment about whether there is an effect or not?

c) realise that while 20 seconds is a likely value given the theory being tested, the difference could in fact be 15 seconds either side of this value and still be consistent with the theory. You treat the evidence as inconclusive; e.g. your certainty in the theory might go down modestly from being about 65% to a bit more than 50%, and so you decide to run more subjects until the evidence more strongly supports the null over the theory or the theory over the null?
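Answer (c) is, in effect, a Bayes factor calculation. The sketch below is illustrative only: the raw mean difference of 8 seconds and its standard error of 8 seconds are hypothetical values chosen merely to be consistent with t = 1.0, and the theory's prediction of 20 seconds plus or minus 15 is represented as a uniform prior over 5-35 seconds:

```python
# Sketch of a Bayes factor comparing the theory (effect somewhere in
# 5-35 s) against the null (effect exactly 0). Raw numbers hypothetical.
import numpy as np
from scipy import stats

mean_diff, se = 8.0, 8.0                     # hypothetical raw data consistent with t = 1.0

theta = np.linspace(5, 35, 3001)             # effect sizes the theory allows: 20 +/- 15 s
likelihood = stats.norm.pdf(mean_diff, loc=theta, scale=se)

evidence_theory = likelihood.mean()          # = integral of likelihood x uniform(5, 35) prior
evidence_null = stats.norm.pdf(mean_diff, loc=0, scale=se)

bf = evidence_theory / evidence_null
print(f"Bayes factor (theory over null): {bf:.2f}")   # ~0.7: inconclusive either way
```

With these numbers the Bayes factor is around 0.7, favouring neither model strongly; multiplying prior odds of 65:35 by such a factor leaves posterior confidence a bit above 50%, the modest drop described in answer (c).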

5) You look up the evidence for a new, expensive weight-loss pill. Use of the pill resulted in significant weight loss after 3 months' daily ingestion, with a before-after Cohen's dz of 1.0 (n = 10 subjects), giving a p of .01. In addition, you accept that there are no adverse side effects. (A short calculation follows the response options.) Would you:

a) Reject the null hypothesis of no change and buy a 3-month supply?

b) Decide 10 subjects does not provide enough evidence to base a decision on when it comes to taking a drug, withhold judgment for the time being, and help sponsor a further study?

c) Decide that over a 3-month period you would like to lose between 10 and 15 kg. In fact, despite the high standardised effect size, the raw mean weight loss in the study was 2 kg. The evidence that the pill engages a mechanism producing a 0-10 kg loss (which you are not interested in) rather than a 10-15 kg loss (which you are) is overwhelming. You have sufficient data to decide not to buy the pill?
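Two short computations underlie answer (c). This is a sketch with the scenario's numbers (dz = 1.0 with a 2 kg mean loss implies SD = 2 kg); representing each weight-loss range as a uniform prior is my own illustrative choice:

```python
# A large standardized effect (dz = 1.0, p ~ .01) can coexist with a raw
# effect that is useless for your purpose. Scenario numbers: n = 10,
# mean loss 2 kg, SD 2 kg.
import numpy as np
from scipy import stats

mean_loss, sd, n = 2.0, 2.0, 10
se = sd / np.sqrt(n)                          # ~0.63 kg

t = mean_loss / se                            # ~3.16
p = 2 * stats.t.sf(t, df=n - 1)               # ~.01: "significant", as reported

# Compare the loss range you want (10-15 kg) with the one you do not
# (0-10 kg), each represented as a uniform prior on the true mean loss.
like_lo = stats.norm.pdf(mean_loss, loc=np.linspace(0, 10, 2001), scale=se).mean()
like_hi = stats.norm.pdf(mean_loss, loc=np.linspace(10, 15, 2001), scale=se).mean()

print(f"t = {t:.2f}, p = {p:.3f}")
print(f"evidence ratio (0-10 kg over 10-15 kg): {like_lo / like_hi:.2e}")   # astronomically large
```

The significance test and the question you actually care about come apart: the data overwhelmingly indicate a mechanism in the 0-10 kg range.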

We will discuss answers to this quiz below. But first we need to establish what the rational basis for orthodox and Bayesian statistics consists in and why they can produce different answers to the above questions.

Rationality

What is it to be rational? One answer is that it is a matter of having sufficient justification for one's beliefs; another is that it is a matter of having subjected one's beliefs to critical scrutiny. Popper and others inspired by him took the second option, under the name of critical rationalism (e.g., Popper, 1963; Miller, 1994). On this view, there is never sufficient justification for a given belief, because knowledge has no absolute foundation. Propositions can only be provisionally accepted as having survived criticism, given other propositions that the people in the debate are conventionally and provisionally willing to accept. All we can do is set up (provisional) conventions for accepting or rejecting propositions. An intuition behind this approach is that irrational beliefs are just those that have not been subjected to sufficient criticism.

Critical rationalism bears some striking similarities to the orthodox approach to statistical inference, the Neyman-Pearson approach (an approach almost universally followed by users of statistics, though few would know it by that name, or any other). On this view, statistical inference cannot tell you how confident to be in different hypotheses; it only gives conventions for the behavioural acceptance or rejection of different hypotheses which, given a relevant statistical model (itself open to testing), ensure that pre-set long-run error rates are controlled. One cannot say how justified a particular decision is or how probable a hypothesis is; one cannot give a number to how much the data support a given hypothesis (i.e. how justified the hypothesis is, or how much its justification has changed); one can only say that the decision was made by a decision procedure that in the long run controls error probabilities, understood as objective probabilities in the sense of long-run frequencies (see Dienes, 2008, chapter 3, for a conceptual introduction to this approach; also Oakes, 1986, and Royall, 1997, for why p values do not provide such degrees of support). Note that probability here is a long-run relative frequency, so it does not apply to the truth of hypotheses, nor even to particular experiments; it is the long-run relative frequency of errors for a given decision procedure. It can be obtained from the tail area of test statistics (e.g. the tail area of a t-distribution), adjusted for factors that affect long-run error rates, such as how many other tests are being conducted. These error rates apply to decision procedures, not to individual experiments. An individual experiment is a one-off event, so it does not determine a unique long-run set of events; but a decision procedure can in principle be considered to apply over an indefinite long run of events (i.e. experiments).
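As a concrete illustration of error rates as tail areas (the numbers here are hypothetical and the sketch is my own, not the article's):

```python
# A p value is a tail area of the test statistic's sampling distribution:
# the long-run rate at which a result at least this extreme would occur
# if the null hypothesis were true. Hypothetical numbers throughout.
from scipy import stats

t_obs, df = 2.1, 19                 # e.g. a one-sample t test with n = 20 subjects
p = 2 * stats.t.sf(t_obs, df)       # two-tailed tail area of the t-distribution
print(f"p = {p:.3f}")               # ~.049: a property of the procedure, not of the hypothesis
```

The number printed describes the decision procedure's long-run behaviour; nothing in the calculation assigns a probability to the hypothesis itself.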

Now consider the other approach to rationality: that it is a matter of having sufficient justification for one's beliefs. If we want to assign numerical degrees of justification (i.e. of belief) to propositions, what are the rules for reasoning logically and consistently with them? Cox (1946; see Sivia & Skilling, 2006) took two minimal desiderata, namely that

1. If we specify degree of belief in P, we have implicitly specified degree of belief in not-P;

2. If we specify degree of belief in P and also specify degree of belief in (Q given P), then we have implicitly specified degree of belief in (P&Q).

Cox did not assume in advance what form this specification should take, nor what the relationships were, only that such relationships existed. Using deductive logic, Cox showed that degrees of belief must follow the axioms of probability if we wish to accept the above minimal constraints. Thus, if we want to determine by how much we should revise continuous degrees of belief, we need to make sure our system of inference obeys the axioms of probability. In my experience, researchers think all the time in terms of the degree of support data provide for a hypothesis. If they want to think that way, they should make sure their inferences obey the axioms of probability.
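In conventional probability notation (this summary is mine, not the article's), the two desiderata force any such degrees of belief to obey the familiar sum and product rules:

\[ P(A) + P(\neg A) = 1 \qquad \text{(sum rule, from desideratum 1)} \]

\[ P(A \wedge B) = P(B \mid A)\,P(A) \qquad \text{(product rule, from desideratum 2)} \]

A scheme for numerical degrees of belief that violates either rule is inconsistent in Cox's sense.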

One version of such continuous degrees of belief is subjective probability, i.e. personal conviction in an opinion (e.g., Howson & Urbach, 2006). One can home in on one's initial personal probabilities by various gambling games (see Dienes, 2008, chapter four, for an introductory review of these ideas). This can be a useful way of thinking about how one could have probabilities for different propositions when it is hard to specify clear and full reasons why the probabilities must have certain values. It is natural that people regard the same theory as more or less plausible, and so that probabilities can be personal. However, when the probabilities of different propositions form part of the inferential procedure we use to derive conclusions from data, we need to make sure the procedure is fair. Thus, there have been attempts to specify "objective probabilities" that follow from the informational specification of a problem (e.g. Jaynes, 2003). This will be a useful way of thinking about probabilities when evaluating how much data support different hypotheses. In this sense, probabilities are the normative convictions a person should have given the constraints and information made explicit in the statement of the problem. In this way, the probabilities become an objective part of the problem, whose values can be argued about given the explicit assumptions, and they do not depend in any further way on personal idiosyncrasies. Note that these sorts of probabilities can be regarded as consistent with critical rationalism (despite Popper's aversion to Bayes): the assumptions defining the problem are without absolute foundation; they are open to criticism, but can be debated until tentatively accepted. In any case, whatever probabilities one starts with (entirely subjective personal ones gathered by reaching deep into one's soul, or objectively specified ones given stated constraints), Bayesian inference insists that one must revise these initial probabilities in the light of data in ways consistent with the axioms of probability.
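Written out explicitly (standard notation, with H a hypothesis and D the data; the summary is mine), the revision rule follows from applying the product rule in both directions:

\[ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} \]

Whatever the source of the prior P(H), subjective or objective, consistency with the axioms requires that the posterior P(H | D) be formed in this way, i.e. that beliefs be revised in proportion to the likelihood P(D | H).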