Chapter 7: Testing Hypotheses about the Difference between the Means of Two Populations

  1. The Standard Error of the Difference

A lot of research questions involve trying to decide whether two population means differ from one another, and we have to make this decision based on the data from two samples. For example, what if you wanted to know how much your test score would suffer if you went out until 3 a.m. on Sunday night instead of going to bed at 10 p.m.? To see if partying until the wee hours truly hurts your GPA, we could take a random sample of students who typically get As on their stats exams, break them randomly into two groups (one stays out until 3 a.m., the other goes to bed at 10 p.m.), and then find the mean of each group's test scores the following day. Then, we can use NHT to determine whether there is a significant difference, based on how much these two sample means differ from each other.

To decide whether your two sample means are significantly different, you need to find out the amount by which two sample means (the same sizes as yours) typically differ based on random sampling alone. This amount is called the standard error of the difference (SED), and we will show you exactly how to calculate it from the means, SDs, and sizes of the samples in your study. As you will see, the SED is larger when the variation within your groups is larger, but it gets smaller when using larger groups. The means of large samples do not tend to stray much from their population means, so when you are dealing with large groups, even a fairly modest difference between the means is likely to be significant.

  2. Pooling the Sample Variances

Just like the standard error of the mean, the standard error of the difference usually must be estimated based on the values you have from your samples. There are two main ways to do this, but the one that is much more common involves taking a weighted average of your sample variances (dubbed the pooled variance).

Here is the pooled variance formula:

s²pooled = [(N1 – 1)s1² + (N2 – 1)s2²] / (N1 + N2 – 2)

Then insert this value into the next formula to find the standard error of the difference:

SED = √(s²pooled/N1 + s²pooled/N2)

Last, it’s time to plug everything into the t test formula. Because we are now pooling the variances, we can use the t distribution that has df = N1 + N2 – 2. The most basic two-sample t test formula is:

t = [(X̄1 – X̄2) – (μ1 – μ2)] / SED

But given that we most often test the null hypothesis that the population means are equal to one another (i.e., μ1 – μ2 is equal to zero), we can just drop the μ1 – μ2 term from the formula. Also, if you want to skip the step of finding the standard error of the difference separately, you can first find the pooled variance, and then plug it directly into this bad boy:

t = (X̄1 – X̄2) / √[s²pooled (1/N1 + 1/N2)]

(For computational convenience, we have factored out the pooled variance instead of dividing it by each sample size.) Once you’ve found your t value, just go by the degrees of freedom to look up a critical value in your t distribution table, and figure out if your calculated t is greater or less than the critical t. But remember, finding a significant t simply informs you that the population means seem to be different. Moreover, the difference of your sample means is just a point estimate of that population difference, so we will show you how to make it much more informative as the center of an interval estimate.
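
If you would like to check your hand calculations, here is a minimal Python sketch of the whole procedure (the function and variable names below are ours, not your text's):

    from math import sqrt

    def pooled_t(m1, s1, n1, m2, s2, n2):
        """Two-sample pooled-variance t test from means, SDs, and sample sizes."""
        # Weighted average of the two sample variances (the pooled variance)
        sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        # Standard error of the difference (SED)
        sed = sqrt(sp2 / n1 + sp2 / n2)
        t = (m1 - m2) / sed
        df = n1 + n2 - 2
        return t, df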

Let’s try an example together:

Everyone has been complaining that taking statistics at 9 a.m. on Friday is ruining their GPA. Professor Salvatore doesn’t believe that there is much of a difference between his two sections (the other meets at 1 p.m. on Wednesdays), and so he wants to determine if the mean scores of the midterm exam for the two sections differ significantly from one another at the .05, two-tailed level. Here are the statistics he has on hand: X̄Wednesday = 88, sWednesday = 4.3, NWednesday = 18, X̄Friday = 84, sFriday = 5.5, NFriday = 14.

Step one: Determine the pooled variance.

s²pooled = [(18 – 1)(4.3²) + (14 – 1)(5.5²)] / (18 + 14 – 2) = (314.33 + 393.25)/30 = 707.58/30 = 23.586

Step two: Determine the standard error of the difference.

SED = √((23.586/18) + (23.586/14)) = √2.995 = 1.731

Step three: Figure out the t value.

t = (88 – 84) / 1.731 = 2.311

Step four: Make your statistical decision.

Figure out the critical t value you need, with df = N1 + N2 – 2 = 30, by looking at the t value table. You will see that t.05(30) = 2.042.

Because our calculated t = 2.311 > tcrit(30) = 2.042, Professor Salvatore is going to have to accept the fact that there is a statistically significant difference between the average midterm grades of the two class sections. Looks like it’s time to rethink the schedule for next semester. (And really, whether it’s statistics, psychology, economics, etc., all 9 a.m. classes on Fridays should be outlawed!)
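
If you happen to have scipy installed, you can also verify these four steps in one line from the summary statistics alone; this is just a checking sketch, not part of the hand calculation:

    from scipy import stats

    # Wednesday: M = 88, s = 4.3, N = 18; Friday: M = 84, s = 5.5, N = 14
    t, p = stats.ttest_ind_from_stats(88, 4.3, 18, 84, 5.5, 14, equal_var=True)
    print(round(t, 3))  # 2.311, and p < .05, matching our decision to reject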

Now you try an example:

1. Alexis is running a psychological experiment to determine whether classical music helps people concentrate on a cognitive task. She divides her 31 participants into two groups (N1 = 15 and N2 = 16) and finds that those who worked on the problem in silence have an average test score of 74 (M1) with s1 = 3.8, and those who listened to Mozart while working had an average test score of 79 (M2) with s2 = 4.1. Do the two groups differ significantly from one another?

  3. Confidence Interval for the Difference of Two Population Means

To gain more information from a two-sample study, we can revisit the confidence interval formula. The formula for the two-group case is very similar to the one we used in the prior chapter to determine the likely values of the mean of a single population, but we do need to rework the formula a bit when trying to estimate the difference between two population means. The CI formula for two samples looks like you’re staring at the one-sample formula and seeing double:

μ1 – μ2 = (X̄1 – X̄2) ± tcrit × SED

Keep in mind that when zero is not included in the 95% confidence interval range, you know that you can reject the null at the .05 level with a two-tailed test.

As usual, the 99% confidence interval is going to be somewhat larger than the 95% confidence interval. Remember, the larger the interval, the more confident we are that the true difference between the population means lies somewhere within that range.
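
Here is a small Python sketch of this interval (the helper name ci_diff is ours), which pools the variances the same way as before:

    from math import sqrt
    from scipy import stats

    def ci_diff(m1, s1, n1, m2, s2, n2, conf=0.95):
        """Confidence interval for mu1 - mu2, using pooled variances."""
        df = n1 + n2 - 2
        sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
        sed = sqrt(sp2 / n1 + sp2 / n2)
        tcrit = stats.t.ppf(1 - (1 - conf) / 2, df)  # two-tailed critical t
        d = m1 - m2
        return d - tcrit * sed, d + tcrit * sed

    # Professor Salvatore's sections: the 95% CI is about (0.47, 7.53);
    # zero is not inside the interval, consistent with rejecting the null at .05.
    print(ci_diff(88, 4.3, 18, 84, 5.5, 14))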

Try the following example, using your stellar expertise with confidence intervals from the last chapter:

2. Colin wants to determine whether there is a significant difference between the average ticket price for his band, Momma Lovin’, and his rival band, Smack Daddy. For the past eight shows for Momma Lovin’, the mean ticket price was $18.40, with s = 3.2. Smack Daddy’s sales figures showed an average ticket price of $19.60 for the past six shows, with s = 4.3. Since Smack Daddy think they are vastly better based on the $1.20 difference, Colin would like to estimate the true difference in ticket price with a 95% confidence level.

  4. Measuring the Size of an Effect

When the scale used to measure a certain variable isn’t familiar to you (e.g., a researcher arbitrarily made up a scale for his/her experiment), it helps immensely to standardize the measure to create an effect size that can be easily interpreted. And, just because a difference is deemed to be significant by NHT, it doesn’t necessarily mean that it has practical value. Looking at a standard effect-size measure can help you make that call. The most common measure of effect size for samples is the one that is often called d or Cohen’s d, but which we will call g just to confuse you (just kidding; your text calls it g to avoid confusion with its counterpart in the population). This effect size measure looks kinda like a z score:

g = (X̄1 – X̄2) / sp

Note that sp is just the square root of the pooled variance (√s²pooled). If you already have g, you have most of what goes into the two-sample t value; to calculate t from g you can use the following formula:

t = g √ (nh/2)

To use this formula, you must calculate the harmonic mean (nh) of your two sample sizes, unless the two samples are the same size, in which case you can substitute n (the size of either sample) for nh. The harmonic mean is used in a number of other statistical formulas, so it is worth learning. Here is a simplified version of the harmonic mean formula that works when you have only two numbers:

nharmonic = 2N1N2 / (N1 + N2)

For example, the harmonic mean of 10 and 15 is 12 [(2 × 10 × 15)/(10 + 15) = 300/25], whereas the arithmetic mean is 12.5.
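
A short Python sketch tying g, the harmonic mean, and t together (all of the names below are ours):

    from math import sqrt

    def harmonic_mean(n1, n2):
        """Harmonic mean of two sample sizes."""
        return 2 * n1 * n2 / (n1 + n2)

    def g_from_stats(m1, s1, n1, m2, s2, n2):
        """Effect size g = (M1 - M2) / sp, plus the t value recovered from g."""
        sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
        g = (m1 - m2) / sp
        t = g * sqrt(harmonic_mean(n1, n2) / 2)  # t = g * sqrt(nh / 2)
        return g, t

    print(harmonic_mean(10, 15))  # 12.0, as in the example above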

Since size matters quite a bit when it comes to effect size, it helps to have a general rule of thumb about how large is large. Jacob Cohen devised the following guidelines for psychological research:

.8 = large

.5 = moderate

.2 = small

Try these examples:

3. You are asked to assess the general contentment (based on results from the Subjective Happiness Scale) of psychology majors in comparison to biology majors at Michigan State by determining the effect size of the difference (i.e., “g”). The ratings given by a random sample of majors for each group are as follows:

Psychology / Biology
4.6 / 5.0
3.2 / 3.4
5.7 / 4.3
6.8 / 5.6
6.2 / 3.2
5.1 / 3.3
4.2 / 2.9
6.3 / 6.0
5.9 / 4.0
5.4 / 4.2
5.5 / 5.3
6.0 / 4.1

4. For the data in the previous example, compute the t value by using the value you just calculated for g, and determine whether this t value is significant at the .05 level. Are you surprised by your statistical determination, given the size of the samples?

  5. The Separate-Variances t Test

If the sample sizes are not equal AND one variance is more than twice the size of the other variance, you should not assume that you can pool the variances from the two groups. Instead, you should calculate what is sometimes called a separate-variances t test, for reasons that should be obvious from looking at the following formula:

t = (X̄1 – X̄2) / √(s1²/N1 + s2²/N2)

Note that when the sample sizes are equal, the pooled- and separate-variances tests will always produce exactly the same t value, and everyone just uses N1 + N2 – 2 degrees of freedom to look up the critical value. Unfortunately, when both the Ns and SDs differ a good deal, not only should the separate-variances formula be used to find the t value, but a complex formula should also be used to find the df for looking up the critical t value. The good news is that a lot of statistical programs will do that work for you. However, when the variances of the two samples are very different, there are usually other problems with the data, so researchers often use some more drastic alternative to the pooled-variance t test (e.g., a data transformation or a nonparametric test) rather than the separate-variances test.
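
For instance, scipy’s independent-samples test will compute the separate-variances (Welch) version, including the complicated df adjustment, if you set equal_var=False; the numbers below are made up purely for illustration:

    from scipy import stats

    # Hypothetical summary stats with unequal Ns and very unequal SDs
    t, p = stats.ttest_ind_from_stats(50.0, 4.0, 40, 46.0, 12.0, 12,
                                      equal_var=False)
    print(t, p)  # Welch's t, with the df adjusted automatically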

  6. The Matched-Pairs (aka Correlated or Dependent) t Test

Sometimes, you’re lucky enough to deal with samples that match up with one another. And the bottom line is, the better the matching, the better chance you have of finding significance with your t test. So, whether the matching is based on using the same people twice or on matching two groups of individuals with one another on some relevant characteristics, if you can match samples, you probably should!

Luckily for us, the matched t test formula is essentially the formula for a one-sample t test, which makes life abundantly easier, provided that you’ve learned the material from the previous chapter. The main change between the two formulas is that in the matched t test formula, you will use the mean of the difference scores in the numerator, and then use the standard deviation of the difference scores in the denominator of the formula.

To make it even clearer, look at the two formulas side by side:

One-sample t test: t = (X̄ – μ) / (s/√N)        Matched t test: t = D̄ / (sD/√N)

Keep in mind that, when you use the matched t test, your degrees of freedom correspond to the number of matched pairs you are using, not the total number of scores. For example, if you are matching 16 participants into 8 pairs, your df = 8 – 1 = 7, not 14 (i.e., 16 – 2). The fact that your df decreases means that your critical t increases when you perform a matched t test, so don’t take matching your participants lightly; there is a downside. If the matching doesn’t really work well, you’ve just tossed out a bunch of dfs for nothing!
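
Here is a minimal Python sketch of the matched t test computed from raw scores (the helper name is ours):

    from math import sqrt
    from statistics import mean, stdev

    def matched_t(before, after):
        """Matched-pairs t test: mean difference over its standard error."""
        diffs = [b - a for b, a in zip(before, after)]
        n = len(diffs)                    # number of pairs
        t = mean(diffs) / (stdev(diffs) / sqrt(n))
        return t, n - 1                   # t value and df = pairs - 1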

Let’s try an example together . . .

Jared wants to see if pulse rates differ significantly for students one hour after taking the GRE as compared to one hour before taking it. He can see from the Before and After means that the pulse rates are quite a bit higher before the test (most of his friends have been freaking out in the morning whenever they take the exam), but when he performs a two-sample t test on his data, he fails to get even close to significance. Then he finally notices that the variance within his groups is huge (since the students in his samples are at varying fitness levels), greatly inflating his denominator. That’s when he realizes that he was doing the wrong t test; he was not taking advantage of the reduction in error variance that occurs when you use the before-after difference scores for each student’s pulse rate. With the difference scores in place, his data set looks like this:

Pretest / Posttest / Difference
88 / 80 / 8
79 / 75 / 4
90 / 85 / 5
86 / 82 / 4
75 / 70 / 5
80 / 76 / 4
79 / 72 / 7
92 / 86 / 6
100 / 91 / 9
67 / 65 / 2
70 / 67 / 3
83 / 76 / 7
X̄ = 82.42 / X̄ = 77.083 / D̄ = 5.33
s = 9.434 / s = 7.971 / sD = 2.103

Using the data from the table, let’s try both t tests and see how much of a difference matching scores can make. First, let’s try the ordinary (i.e., independent-samples) t test, using the simplified formula for groups with the same sample size.

t = (82.42 – 77.083)/√[(9.434² + 7.971²)/12] = 1.496; df = 12 + 12 – 2 = 22; t.05(22) = 2.074; because 1.496 < 2.074, we are not even close to significance with this test.

Now, let’s see what happens when we take the matching into account and test the difference scores.

t = (5.33 – 0)/(2.103/√12) = 8.784 (note that D̄ stands for the mean of the difference scores). For this matched test, df = 11 (# of pairs – 1), so the critical t increases to t.05(11) = 2.201. But, as usual, the change in the critical t (2.074 to 2.201) is small compared to the change in the calculated t (1.496 to 8.784). Finally, Jared can back up his claim that his friends are in a physiological frenzy before taking the GRE!
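
You can confirm both results with scipy (again, just a verification sketch):

    from scipy import stats

    pre  = [88, 79, 90, 86, 75, 80, 79, 92, 100, 67, 70, 83]
    post = [80, 75, 85, 82, 70, 76, 72, 86, 91, 65, 67, 76]

    t_ind, p_ind = stats.ttest_ind(pre, post)  # independent: t ~ 1.496, n.s.
    t_rel, p_rel = stats.ttest_rel(pre, post)  # matched: t ~ 8.78, p < .001

The matched_t helper sketched in the previous section gives the same answer: matched_t(pre, post) returns roughly (8.784, 11).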

  7. Matched and Repeated-Measures Designs

So as you can see, matching individuals who have correlated data, or asking people to perform a task multiple times (repeated measures), can help immensely when you are trying to attain significance with a t test. Keep in mind, however, that sometimes there is no relevant basis for matching participants, and that repeating conditions on the same person can be seriously problematic (imagine trying to teach the same person a foreign language twice, in order to compare two different teaching methods!). And, when it is reasonable to test the same person twice (memorizing a list of words while listening to sad music and a similar list while listening to happy music), you will usually have to counterbalance the conditions to prevent practice or fatigue from affecting your results, and even counterbalancing can present its own problems (e.g., carry-over effects). Just remember: in research, unlike in a statistics course, practice isn’t always a good thing.

Now you try a matched example:

5. Emily is measuring the effect of cognitive behavioral therapy (CBT) on patients with panic disorder. She uses the number of panic attacks that occurred during the week before each patient began CBT and during the week following the completion of eight sessions of CBT as her Before and After measures, respectively. The data are as follows:

Before CBT / After CBT
14 / 13
8 / 4
6 / 1
14 / 8
20 / 10
13 / 12
9 / 8
12 / 6

a. Determine whether the difference between the two conditions is significant when using a dependent t test.

b. Perform an independent two-sample t test, and compare the results with those from the dependent t test.

c. Do you think there are any reasons not to use a dependent t-test design?

Additional t test examples:

Participants in a study were taught SPSS by either one-on-one tutoring or sitting in a 200-person lecture hall and were classified into one of two groups (undergraduates versus graduate students). Mean performance (along with SD and n) on the computer task is given for each of the four subgroups.

                 Undergraduate              Graduate
             Tutoring   Lecture Hall   Tutoring   Lecture Hall
Mean          36.74       32.14         29.63       26.04
SD             6.69        7.19          8.51        7.29
n                52          45            20          30

6. Calculate the pooled-variance t test for each of the four comparisons that make sense (i.e., compare the two academic levels for each method and compare the two methods for each academic level), and test for significance.

7. Calculate g for each of the t tests in Exercise #6. Comment on the size of g in each case.

8. a. Find the 95% CI for the difference of the two methods for the undergraduate participants.

b. Find the 99% CI for the difference of the two methods for the graduate participants.

Answers to Exercises

1. Yes, there is a significant difference at both the .05 and .01 levels, with t = 5/√2.0235 = 5/1.4225 = 3.515, which is larger than both t.05(29) = 2.045 and t.01(29) = 2.756.

2. s²pooled = 13.678, SED = 1.9973, t.05(12) = 2.179, so the 95% CI is –5.552 ≤ μ1 – μ2 ≤ 3.152. Because 0 is contained in the 95% CI, we must retain the null (at the .05 level, two-tailed) that there is no difference between the ticket prices of the two bands. So Colin can keep living the dream and tell Smack Daddy they still have a viable rival out there!

3. g = 1.125 (a rather large effect), which is based on: M1 = 5.408, s1 = 1.0059, M2 = 4.275, s2 = 1.0083, spooled = 1.0071.

4. t = 1.125 × √(12/2) = 2.756, which is larger than t.05(22) = 2.074. Therefore, the two majors differ significantly at the .05 level. Even with the smallish sample sizes, this result is not surprising, given that the effect size was so large.