Examples of the Use of Power Analysis in Actual Research Projects

Two Conditions, Within-Subjects Design

Here is a real-life example of an a priori power analysis done by a graduate student in our health psychology program. It illustrates nicely how power analysis is an essential part of planning research.

I have a within-subjects design. I am trying to determine the smallest sample size necessary to achieve adequate power (.80 is fine). My problem now is that I don't know the expected value for the correlation between baseline scores and post-test scores. Could you give me an idea about this, or where I would get it from?

I am using prior research to estimate how large the effect will be in my study. The stats from that prior research are:

Baseline: Mean = 41.7; SD = 9.93; Post: Mean = 32.8; SD = 4.94

The expected value for the correlation between baseline scores and post-test scores must be estimated. Look at the section “Correlated Samples T” in my document – the table on page 5 shows how the number of cases required to achieve 80% power varies with both the size of the effect and the correlation between conditions.

If the scores will be obtained from an instrument that has been used before, you should be able to find, in the literature or from the researchers who have used that instrument, an estimate of its reliability (Cronbach’s alpha, for example). You could then use that as an estimate of the baseline-posttest correlation. If others have used the same dependent variable in pre-post designs, you could estimate the pre-post correlation for your study as being about what it was in their studies. If you are still striking out, you can simply estimate the correlation as having a modest value, say .7, and then, after you have started collecting data, check what the correlation actually is. If it is much less than .7, then your study will be underpowered, and you will know that you need to increase the sample size beyond what you expected to need – unless, of course, the data show that the effect is large enough to be detected with the sample size you will obtain.
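
To make that interim check concrete, here is a minimal Python sketch; the interim scores and variable names are hypothetical, and .7 is just the planning value suggested above:

    import numpy as np

    # hypothetical interim paired observations (baseline and post-test)
    pre = np.array([41, 38, 45, 33, 40, 37, 44, 36])
    post = np.array([33, 30, 36, 29, 31, 30, 35, 28])

    # Pearson correlation between baseline and post-test scores
    r = np.corrcoef(pre, post)[0, 1]
    print(f"observed pre-post correlation: r = {r:.2f}")
    if r < .7:
        print("below the planning value of .7 -- the study may be underpowered")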

I suggest you obtain enough data to be able to detect an effect that is only medium in size (one-half standard deviation) or, if you expect the effect to be small but not trivial, small in size (one-fifth standard deviation). For the stats you provided, g is about 1.1: g = (41.7 - 32.8)/sqrt((9.93^2 + 4.94^2)/2) = 8.9/7.84 = 1.13.

If we were computing g to report as an effect-size estimate, I would probably use the baseline SD as the standardizer, that is, report Glass’ delta rather than Hedges’ g.
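
As a quick check on those effect-size figures, here is the arithmetic in Python; nothing is assumed beyond the means and SDs reported above:

    from math import sqrt

    m_base, sd_base = 41.7, 9.93    # baseline mean and SD from the prior research
    m_post, sd_post = 32.8, 4.94    # post-test mean and SD from the prior research

    sd_pooled = sqrt((sd_base ** 2 + sd_post ** 2) / 2)  # root mean square of the two SDs
    g = (m_base - m_post) / sd_pooled      # standardizer = pooled SD, about 1.13
    delta = (m_base - m_post) / sd_base    # standardizer = baseline SD, about 0.90
    print(f"g = {g:.2f}, Glass' delta = {delta:.2f}")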

For a medium effect and rho = .7, ddiff = .5/sqrt(2(1-.7)) = .645. Taking that to G*Power, we find that you need 22 cases (each measured pre and post) to get 80% power to detect a medium-sized effect.
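
Here is the same computation in Python, using the statsmodels package in place of G*Power; because the two programs make slightly different rounding decisions, the result may differ from G*Power's by a case or so:

    from math import sqrt
    from statsmodels.stats.power import TTestPower

    d, rho = .5, .7                    # medium effect; planning value for the correlation
    d_diff = d / sqrt(2 * (1 - rho))   # = .645, effect size for the difference scores
    # cases needed for 80% power, two-tailed paired t test, alpha = .05
    n = TTestPower().solve_power(effect_size=d_diff, alpha=.05, power=.80,
                                 alternative='two-sided')
    print(f"d_diff = {d_diff:.3f}, cases needed (round up): {n:.1f}")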

Two Groups, Independent Samples

Sylwia Mlynarska was working on her proposal (for her Science Fair project) and the Institutional Review Board (which had to approve the proposal before she would be allowed to start collecting data) requested that she conduct a power analysis to determine how many respondents she would need to recruit to answer her questionnaire. As her research mentor in this matter, I assisted her with the analysis. Below is a copy of my correspondence with her.

From: Wuensch, Karl L.
Sent: Friday, June 02, 2000 1:04 PM
To: 'Sylwia Mlynarska'
Subject: A Priori Power Analysis

Sylwia, it so happens that the question you ask concerns exactly the topic that we are covering in my undergraduate statistics class right now, so I am going to share my response with the class.

PSYC 2101 students, Sylwia is a sophomore at the Manhattan Center for Science and Mathematics High School in New York City. She is researching whether ethnic groups differ in their attitudes toward animals (animal rights), using an instrument of my construction. She is now in the process of obtaining approval (from the high school's Institutional Review Board) of her proposal to conduct this research. I am assisting her long-distance. Here is my response to her question about sample sizes:

Sylwia, the more subjects you have, the greater your power. Power is the probability that you will find a difference, assuming that one really exists. Power is also a function of the magnitude of the difference you seek to detect. If the difference is, in fact, large, then you don't need many subjects to have a good chance of detecting it. If it is small, then you do. Of course, you don't really know how large the difference between ethnic groups is, so that makes it hard to plan. We assume that you will be satisfied if your power is 80%. That is, if there really is a difference, you have an 80% chance of detecting it statistically. Put another way, the odds of your finding the difference are 4 to 1 in your favor. We also assume that you will be using the traditional 5% criterion (alpha) of statistical significance.

If the difference between two ethnic groups is small (defined by Cohen as differing by 1/5 of a standard deviation), then to have 80% power you would need to have 393 subjects in each ethnic group. If the difference is of medium size (1/2 of a standard deviation), then you need only 64 subjects in each ethnic group. If the difference is large (4/5 of a standard deviation), then you only need 26 subjects per ethnic group.
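
If you want to check these figures yourself, here is a short Python sketch using the statsmodels package rather than Cohen's tables; the exact computation can differ from the tabled values by a subject or so:

    import math
    from statsmodels.stats.power import TTestIndPower

    solver = TTestIndPower()
    for label, d in [("small", .2), ("medium", .5), ("large", .8)]:
        # subjects per group for 80% power, alpha = .05, two-tailed
        n = solver.solve_power(effect_size=d, alpha=.05, power=.80)
        print(f"{label} effect (d = {d}): {math.ceil(n)} subjects per group")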

It is typically difficult to get enough subjects to have a good chance of detecting a small effect, so we generally settle for getting enough to have a good chance of detecting a medium effect -- but if you can get high power even for small effects, there is this advantage: if your results fall short of statistical significance (you did not detect the difference), you can still make a strong statement, namely that the difference, if it exists, must be quite small. Without great power, when your result is statistically "nonsignificant," it is quite possible that a difference is present but your research just did not have sufficient power to detect it (this circumstance is referred to as a Type II error).

Post Hoc Power Analysis As Part of a Critical Evaluation of Published Research

Michelle Marvier wrote a delightful article for American Scientist (2001, Ecology of transgenic crops, 89, 160-167). She noted that before a transgenic (genetically modified) crop receives government approval, it must be shown to be relatively safe. She then went on to discuss an actual petition to the government. Calgene Inc. submitted a petition for approval of a variety of Bt cotton (cotton which contains genes from a bacterium that cause it to produce a toxin that kills insects which prey upon cotton). To test the effect of this transgenic crop on friendly invertebrates found in the soil around the plants (such as earthworms), they conducted research with a sample of four “subjects.” The test period lasted only 14 days, and in that period the earthworms exposed to the transgenic cotton plants gained 29.5% less weight than did control earthworms. The difference between the groups was not “statistically significant,” which the naive consumer might interpret to mean that the transgenic cotton had no influence on the growth of the earthworms. Of course, with a sample size of only 4, the chances of finding an undesirable effect of the transgenic cotton, assuming that such an effect does exist, are very small. Dr. Marvier calculated that, with the small sample sizes employed in this research, the effect of the transgenic cotton would need to be quite large (the exposed earthworms gaining less than half as much weight as the control earthworms) for the research to have a good (90%) chance (power) of detecting it (using the conventional .05 level of significance, which is, in this case, IMHO, too small given the relative risks of Type I versus Type II errors).
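
Dr. Marvier's calculation was of course based on the details of the Calgene data, but the general form of such a post hoc power analysis is easy to sketch in Python; here statsmodels solves for the smallest effect detectable with 90% power given only four subjects per group (the power and alpha criteria follow the article; the simple two-group t-test layout is my simplifying assumption):

    from statsmodels.stats.power import TTestIndPower

    # With only 4 subjects per group, how large must the effect be for the study
    # to have a 90% chance of detecting it at the .05 level (two-tailed)?
    d = TTestIndPower().solve_power(nobs1=4, alpha=.05, power=.90, effect_size=None)
    print(f"smallest detectable effect: d = {d:.2f}")  # roughly 2.8 SDs -- an enormous effect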

Karl L. Wuensch, Psychology, East Carolina University, December, 2007.
