Calculating Appropriate Sample Sizes Using Power Analysis
Objective
You're about to embark on a research project that will involve collecting data and performing tests of statistical inference, and you'd like to start your data collection. But as a savvy practitioner of statistics, you know how important it is to estimate your sample size before you start collecting data, to ensure that you will be able to generate results that are statistically sound. There is nothing worse than being all ready to compute your results and generate your conclusions and then realizing... oh no, I don't have enough data for my results to be valid! (This happened to me when I was preparing my first dissertation proposal. Trust me, it can be not only unpleasant, but soul crushing.)
For each of your research questions, you will perform one sample size calculation using R; then, select the largest number from this collection. That's the minimum sample size you should collect to ensure that the conclusions you draw from each research question are valid. You'll have to do some estimating (and make some educated guesses), but this is OK. Just be sure to articulate what assumptions you made when you were computing your appropriate sample size, and what trade-offs factored into your final decision. Feel free to consult the academic literature or do a pilot study to help support your estimates and guesses.
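As a rough sketch of that workflow, the snippet below computes one sample size per research question and then keeps the largest. It uses base R's power.t.test and power.prop.test rather than the pwr commands introduced later in this chapter, and the three research questions, effect sizes, and proportions are hypothetical placeholders, not recommendations.

```r
# Hypothetical study with three research questions; every effect size
# and proportion below is an illustrative guess.
n_needed <- c(
  rq1_one_sample_t = power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                                  power = 0.8, type = "one.sample")$n,
  rq2_two_sample_t = power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                                  power = 0.8, type = "two.sample")$n,
  rq3_two_props    = power.prop.test(p1 = 0.50, p2 = 0.65, sig.level = 0.05,
                                     power = 0.8)$n
)

# Round each requirement up to a whole participant.
n_rounded <- ceiling(n_needed)
print(n_rounded)

# The two-group functions report n per group, so convert to total
# participants before comparing with the one sample requirement.
groups  <- c(rq1_one_sample_t = 1, rq2_two_sample_t = 2, rq3_two_props = 2)
n_total <- n_rounded * groups

# The largest requirement is the minimum sample size for the whole study.
min_sample_size <- max(n_total)
print(min_sample_size)
```

Whether you compare per-group or total counts depends on how your groups are recruited, so treat the conversion step as one possible design choice and record it alongside your other assumptions.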
Background
Sample size calculation is an aspect of power analysis. If the population really is different than what you thought it was like when you set up your null hypothesis, then you want to be able to collect enough data to detect that difference! Similarly, sometimes you know you'll only have a small sample to begin with, so you can use power analysis to make sure you'll still be able to carry out a legitimate statistical study at a reasonable level of significance α.
The power of a statistical test, then, answers the question:
What's the probability that I will have enough data to determine that what I originally thought the population was like (as expressed by my null hypothesis) was incorrect?
Clearly, having a zero percent chance of being able to use your sample to detect whether the population is unlike what you originally thought would be bad. Similarly, aiming for a 100% chance of being able to detect a difference would also probably be bad: your sample size would have to be comparatively large, and collecting more data is usually more costly (in both time and effort). When determining an appropriate sample size for your study, look for a power of at least 0.80. Higher is better, but with the effect size and level of significance held fixed, higher power always requires a bigger sample. The standard of 0.80 or higher just reflects that the community of researchers is usually comfortable with studies where the power is at least at this level.
The smaller the difference between what you think the population is like (as described by the null hypothesis) and what the data tells you the population is like, the more data you'll need to actually detect that difference. This difference between your hypothesized mean and the sample mean, or the hypothesized proportion and the sample proportion, or the hypothesized distribution of values over a contingency table and the actual distribution of values in that contingency table, is called an effect size.
For example, if you were trying to determine whether there was a difference between the average age of freshmen and the average age of seniors at your university, it wouldn't require such a large sample size because the effect size is about three years in age. However, if you were trying to determine whether there was a difference between the average age of freshmen and the average age of sophomores at your university, it would require a much larger sample size because the effect size is less than one year in age. If there is a difference, you will need more data in your sample to know for sure.
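To put rough numbers on the age example, here is a minimal sketch using base R's power.t.test. The standard deviation of 1.5 years and the exact differences of 3 years and 1 year are assumptions made purely for illustration; your own estimates would come from the literature or a pilot study.

```r
# Assumed (hypothetical) standard deviation of student age, in years.
age_sd <- 1.5

# Freshmen vs. seniors: assumed true difference of about 3 years.
n_seniors <- power.t.test(delta = 3, sd = age_sd, sig.level = 0.05,
                          power = 0.8, type = "two.sample")$n

# Freshmen vs. sophomores: assumed true difference of about 1 year.
n_sophomores <- power.t.test(delta = 1, sd = age_sd, sig.level = 0.05,
                             power = 0.8, type = "two.sample")$n

# Students needed in each group, rounded up to whole people.
print(ceiling(c(seniors = n_seniors, sophomores = n_sophomores)))
```

Under these assumptions the freshmen vs. sophomores comparison needs several times as many students per group, which is the pattern the paragraph above describes.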
For tests involving means, effect size is represented in R as how many standard deviations the real estimate is away from the hypothesized estimate. Closer to zero, the effect size is small: a value of 0.1 or 0.2 is a small effect size. Around 0.5, the effect size is considered medium. An effect size of 0.8 to 1 is considered large, because you're saying that the true difference between the hypothesized value of your population parameter and the sample statistic from your data is approaching one standard deviation (Cohen, 1988). Your job is to estimate what you think the effect size is before you perform the computation of sample size. If you really have no way at all of knowing what the effect size is (and this is actually a pretty common dilemma), just use 0.5 as recommended by Bausell & Li (2002).
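If you have pilot data, one way to estimate the effect size is sketched below. The ten pilot measurements and the hypothesized mean of 50 are made up for illustration, and the cut points are only the rough benchmarks attributed to Cohen (1988) above, not hard rules.

```r
# Hypothetical pilot data and a hypothesized (null) mean of 50.
pilot <- c(52.1, 49.8, 53.4, 51.0, 50.6, 54.2, 48.9, 52.7, 51.5, 53.0)
mu0   <- 50

# Effect size d: distance between the sample mean and the hypothesized
# mean, expressed in standard deviations.
d <- (mean(pilot) - mu0) / sd(pilot)

# Label the magnitude using the rough benchmarks 0.2, 0.5, and 0.8.
size_label <- cut(abs(d), breaks = c(0, 0.2, 0.5, 0.8, Inf),
                  labels = c("below small", "small", "medium", "large"),
                  right = FALSE)

print(round(d, 2))
print(as.character(size_label))
```

An estimate from only ten observations is noisy, so it is worth rerunning the sample size calculation with a somewhat smaller d to see how sensitive your plan is to this guess.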
To increase or improve your power to detect an effect, do one or more of these:
· Increase your sample size. The more items you have in your sample, the better you will have captured a snapshot of the variation throughout the population... as long as your random sample is also a representative sample.
· Increase your level of significance, α, which will make your test less stringent.
· Increase the effect size. Of course, this is a characteristic of your data and your a priori knowledge about the population... so this one may be really hard to change.
· Use covariates or blocking variables. (I won't explain this in detail; just know that if you're designing an experiment with treatment and control groups, and you need to improve your experimental design to increase the power of your statistical test, you should look into this.)
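The first three levers in the list above can be checked numerically. This is a small sketch using base R's power.t.test for a one sample t-test; the baseline values (n = 30, d = 0.3, α = 0.05) are arbitrary choices for illustration.

```r
# Helper that returns the power of a one sample t-test for given inputs.
# With sd = 1, delta is the effect size in standard deviation units.
power_for <- function(n, d, alpha) {
  power.t.test(n = n, delta = d, sd = 1, sig.level = alpha,
               type = "one.sample")$power
}

base_power <- power_for(n = 30, d = 0.3, alpha = 0.05)  # baseline
more_n     <- power_for(n = 60, d = 0.3, alpha = 0.05)  # bigger sample
more_alpha <- power_for(n = 30, d = 0.3, alpha = 0.10)  # less stringent test
more_d     <- power_for(n = 30, d = 0.5, alpha = 0.05)  # bigger effect

print(round(c(baseline = base_power, bigger_n = more_n,
              bigger_alpha = more_alpha, bigger_effect = more_d), 3))
```

Each change raises the power relative to the baseline, but only the sample size is fully under your control without loosening the test or redefining the effect you care about.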
Now that you understand power and effect size, it will be much easier to get a sense of what Type I and Type II Errors are and what they really mean. You may have heard that Type I Error, α, reflects how likely you are to reject the null hypothesis when you shouldn't have done it. You know how sampling error means that on any given day, you might get a sample that's representative of the population, or you might get a sample that estimates your mean or proportion too high or too low?
Type I Error asks (and note that all of these questions ask exactly the same thing, just in different ways):
· How willing am I to reject the null hypothesis when in fact, it actually pretty accurately represents what's going on with the population?
· How willing am I to get a false positive, where I detected an effect but no effect actually exists?
· What's the probability of incorrectly rejecting the null hypothesis?
This probability, the Type I Error, is the level of significance α. If you choose an α of 0.05, that means you are willing to be wrong like this 1 out of every 20 times (1/20 = 0.05) you collect a sample and run the test of inference. If you choose an α of 0.01, that means you are willing to be wrong like this 1 out of every 100 times (1/100 = 0.01) you collect a sample and run the test of inference, making this choice of α a more stringent test. On the other hand, if you choose an α of 0.10, that means you are willing to be wrong like this 1 out of every 10 times (1/10 = 0.10) you collect a sample and run the test of inference, making this choice of α a much less stringent test.
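You can see the "1 out of every 20 times" idea for yourself with a quick simulation. This sketch draws many samples from a population where the null hypothesis is exactly true and counts how often a one sample t-test rejects it anyway; the sample size of 30, the 5000 repetitions, and the seed are arbitrary choices.

```r
set.seed(42)      # arbitrary seed so the run is reproducible
alpha  <- 0.05
n_sims <- 5000

# Each repetition samples 30 values from a population whose true mean is 0,
# then tests H0: mu = 0. Any rejection here is a false positive.
p_values <- replicate(n_sims,
                      t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value)

false_positive_rate <- mean(p_values < alpha)
print(false_positive_rate)
```

The observed rate should land close to α, with some wobble from simulation noise.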
Type I Error and Type II Error have to be balanced depending upon what your goals are in designing your study. Here is how they are related:
(Rows: the decision you make as a result of your statistical test. Columns: what's really going on with the population.)
Decision / H0 is True / H0 is False
Reject H0 / Type I Error (α): FALSE POSITIVES / Accurate Results! You rejected H0 and you were supposed to, because your data showed that the population was different than what you originally thought
Fail to Reject H0 / Accurate Results! You didn't reject H0 because it was an accurate description of the population / Type II Error (β): FALSE NEGATIVES
Power, 1 - β, is related to the Type II Error... it is:
The probability that you DON'T get a FALSE NEGATIVE
The probability that you DO detect an effect that's REAL
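One way to make these two statements concrete is to simulate a situation where the effect is real and count how often the test detects it. The sketch below compares that simulated detection rate with the power reported by base R's power.t.test; the sample size of 50, the effect size of 0.4, the 5000 repetitions, and the seed are all arbitrary choices for illustration.

```r
set.seed(123)     # arbitrary seed so the run is reproducible
n      <- 50
d      <- 0.4     # true effect, in standard deviation units
alpha  <- 0.05
n_sims <- 5000

# Each repetition samples from a population whose true mean is d (not 0),
# then tests H0: mu = 0. A rejection here is a correct detection.
rejections <- replicate(n_sims,
                        t.test(rnorm(n, mean = d, sd = 1), mu = 0)$p.value < alpha)
simulated_power <- mean(rejections)

# Power from the formula-based calculation, for comparison.
analytic_power <- power.t.test(n = n, delta = d, sd = 1, sig.level = alpha,
                               type = "one.sample")$power

print(round(c(simulated = simulated_power,
              analytic  = analytic_power,
              beta      = 1 - analytic_power), 3))
```

The share of repetitions that fail to reject H0 is the simulated Type II Error rate, so it should sit close to 1 minus the analytic power.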
Process
So now it's time to compute your appropriate sample sizes. For each of your research questions, you should have already selected the appropriate methodology for statistical inference that you'll use to draw your conclusions. The inferential tests that are covered in this chapter are:
· One sample t-test
· Two sample t-test
· Paired t-test
· One-proportion z-test
· Two proportion z-test
· Chi-Square Test of Independence
· Analysis of Variance (ANOVA)
Before we get started, here is a summary of some sample size calculation commands you can run in R. All of them are provided by the pwr package, except the last one, which is in the base R installation. Be sure to install the pwr package first, then load it into active memory using the library command, before you move on.
R Command / Statistical Methodology
pwr.t.test / One sample, two sample, and paired t-tests; also requires you to specify whether the alternative hypothesis will be one tailed or two
pwr.t2n.test / Two sample t-test where the sample sizes of the two groups are different
pwr.p.test / One proportion z-test
pwr.2p.test / Two proportion z-test
pwr.2p2n.test / Two proportion z-test where the sample sizes of the two groups are different
pwr.chisq.test / Chi-square Test of Independence
pwr.anova.test / Analysis of Variance (ANOVA)
power.t.test / Another way to perform power analysis for one sample, two sample, and paired t-tests; here, the advantage is that there is an easy method to plot Type I & Type II Errors and power vs. effect size
Calculating Sample Sizes for Tests Involving Means (T-Tests)
This section covers sample size calculations for the one sample, two sample, and paired t-tests. You are also required to specify whether your alternative hypothesis will be one tailed or two, so be sure you have defined your H0 and Ha prior to starting your calculations.
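The pwr.t.test command works with four related quantities (the sample size n, the effect size d, the level of significance sig.level, and the power), plus the type of t-test and the form of the alternative hypothesis. Leave exactly one of the four quantities as NULL and R will solve for it, which means you can use the command to compute power and effect size in addition to just the sample size (if you want). For the one sample t-test, we can use pwr.t.test like this to compute the required sample size: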
> pwr.t.test(n=NULL,sig.level=0.05,power=0.8,d=0.3,type="one.sample")
One-sample t test power calculation
n = 89.14936
d = 0.3
sig.level = 0.05
power = 0.8
alternative = two.sided
> pwr.t.test(n=NULL,sig.level=0.05,power=0.8,d=0.3,type="one.sample",
alternative="greater")
One-sample t test power calculation
n = 70.06793
d = 0.3
sig.level = 0.05
power = 0.8
alternative = greater
To get the sample size, we pass n=NULL to pwr.t.test. As expected, the one-tailed test (the second call) requires a smaller sample size than the two-tailed test (the first call). And always round your n's up!! We can't sample an extra 0.14936 of a person for the first test; we have to sample the entire person. So our correct sample size should be 90 for the two-tailed test, and 71 for the one-tailed test.
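If you want a second opinion on these numbers, the same one sample calculations can be cross-checked with power.t.test from base R. This is only a sketch: setting sd = 1 makes delta play the role of the effect size d, and base R names the one-tailed alternative "one.sided" rather than "greater".

```r
# Two-tailed one sample t-test, d = 0.3, alpha = 0.05, power = 0.8.
n_two_sided <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05,
                            power = 0.8, type = "one.sample",
                            alternative = "two.sided")$n

# Same settings, one-tailed alternative.
n_one_sided <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05,
                            power = 0.8, type = "one.sample",
                            alternative = "one.sided")$n

# Always round up to a whole participant.
print(ceiling(c(two_sided = n_two_sided, one_sided = n_one_sided)))
# two_sided one_sided
#        90        71
```

The unrounded values can differ from the pwr output in the last few decimal places because each function solves for n numerically, but the rounded-up sample sizes agree.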
We can use the same command to determine sample sizes for the two sample t-test:
> pwr.t.test(n=NULL,sig.level=0.05,power=0.8,d=0.3,type="two.sample",
alternative="greater")
Two-sample t test power calculation
n = 138.0715
d = 0.3
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: n is number in *each* group
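Because the n reported for a two sample test is per group, the total number of participants is twice the rounded-up value. Here is a minimal sketch of that bookkeeping, using base R's power.t.test (with sd = 1 so that delta acts as the effect size) as a stand-in for the pwr call above.

```r
# One-tailed two sample t-test, d = 0.3, alpha = 0.05, power = 0.8.
n_per_group <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05,
                            power = 0.8, type = "two.sample",
                            alternative = "one.sided")$n

# Round up within each group first, then double for the study total.
n_per_group_rounded <- ceiling(n_per_group)
total_participants  <- 2 * n_per_group_rounded

print(c(per_group = n_per_group_rounded, total = total_participants))
# per_group     total
#       139       278
```

Rounding before doubling matters: doubling 138.07 and then rounding up would give 277, which leaves one group a participant short.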
And the paired t-test:
> pwr.t.test(n=NULL,sig.level=0.05,power=0.8,d=0.3,type="paired",
alternative="greater")
Paired t test power calculation
n = 70.06793
d = 0.3
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: n is number of *pairs*
Observe that you can calculate any of the values if you know all of the other values. For example, if you know you can only get 28 pairs for your paired t-test, you can first see what the power would be if you kept everything else the same, and then you can check and see what would happen if the effect size were just a little bigger (and thus easier to detect with a smaller sample):
> pwr.t.test(n=28,power=NULL,sig.level=0.05,d=0.3,type="paired",
alternative="greater")
Paired t test power calculation
n = 28
d = 0.3
sig.level = 0.05
power = 0.4612366
alternative = greater
NOTE: n is number of *pairs*
> pwr.t.test(n=28,power=0.8,sig.level=0.05,d=NULL,type="paired",
alternative="greater")
Paired t test power calculation
n = 28
d = 0.4821407
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: n is number of *pairs*
In the first example, we used power=NULL to tell R that we wanted to compute a power value, given that we knew the number of pairs n, the estimated effect size d, the significance level of 0.05, and that we are using the "greater than" form of the alternative hypothesis. But a power of 0.46 is really not good, so we'll have to change something else about our study. If we force a power of 0.8, and instead use d=NULL to get R to compute the effect size, we find that using the 28 pairs of subjects we have available, we can detect an effect size that's about half a standard deviation from what we hypothesized at a power of 0.8 and a level of significance of 0.05. That's not so bad.
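Another way to explore this trade-off is to compute the power for a whole range of sample sizes and see where it crosses 0.8. The sketch below does this for the paired, one-tailed design with d = 0.3, using base R's power.t.test (with sd = 1 so that delta acts as the effect size); the range of 10 to 100 pairs is an arbitrary choice.

```r
# Candidate numbers of pairs to evaluate.
pair_counts <- 10:100

# Power of the paired, one-tailed t-test at each candidate sample size.
power_by_n <- sapply(pair_counts, function(n) {
  power.t.test(n = n, delta = 0.3, sd = 1, sig.level = 0.05,
               type = "paired", alternative = "one.sided")$power
})

# Power with the 28 pairs we actually have available.
power_at_28 <- power_by_n[pair_counts == 28]

# Smallest number of pairs whose power reaches at least 0.8.
smallest_n <- pair_counts[which(power_by_n >= 0.8)[1]]

print(round(power_at_28, 4))
print(smallest_n)
```

Calling plot(pair_counts, power_by_n, type = "l") afterwards draws the power curve, which makes it easy to see how quickly power falls off as the number of pairs shrinks toward 28.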
You can also access the sample size n directly, if that's all you're interested in, like this: