Test of Hypotheses for More Than Two Independent Samples

Module III

Lecture 4

Suppose that you recently read an article that indicated that the tempo of music can affect a consumer’s behavior. In particular, it is conjectured that the faster the tempo (speed) of the music, the more likely a consumer is to make a purchase.

Since you are a V.P. at George Giant Food Stores, you pick a random sample of 15 stores. You divide the stores into three sets of five. At five stores you play no music. At five other stores you play pleasant but slow music. At the remaining five stores you play fast, light music. For randomly chosen days, you measure the volume of sales (in purchases, not dollars) at the 15 stores.

The resulting data is given below:

Is there evidence that the tempo of the music has an effect?

This is an example of a situation where we are interested in whether or not groups have different means but we have more than two groups.

One approach is to just look at the groups in pairs. That is, use the results of the previous lecture and compare the groups two at a time. In our case we would look at the No Music Group versus the Slow Music Group, the No Music Group versus the Fast Music Group, and the Fast Music Group versus the Slow Music Group. The results are shown below:

Using an alpha level of .05, this would indicate that there is a significant difference between sales in stores that play no music and stores that play slow music.

When comparing the No Music Group and the Fast Music Group, one obtains:

This indicates at the .05 alpha level, that there is a difference between the No Music Group and the Fast Music Group.

Finally, we compare the Slow Music Group and the Fast Music Group obtaining the output:

Again, at the .05 level, this indicates that there is a difference between the Fast Group and the Slow Group.

Unfortunately, things are not quite so simple. As the number of groups increases, the number of pairs of groups goes up quite rapidly. If there are k groups, the number of pairwise comparisons which must be made is:

k(k − 1)/2
This number increases quite quickly as the table below shows:

Now recall that in the logic of statistical testing, there is an α × 100 % chance that we will say that there is a difference when in fact there is none.

Consider performing two independent tests under circumstances where there are no differences. The correct conclusion is that both tests should accept the null hypothesis. The probability of this happening is

(1 − α)(1 − α) = (1 − α)²

Therefore the probability of making at least one error, that is rejecting either one or both of the true null hypotheses, would be:

1 − (1 − α)²

If α = .05, this probability is 1 − (.95)² = .0975. So if we do two tests at alpha level .05, we have almost a 1 in 10 chance of rejecting at least one true null hypothesis.
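This arithmetic is easy to check; the short Python sketch below (illustrative only, since the lecture's computations are done in EXCEL) reproduces the .0975 figure:

```python
# Probability of at least one false rejection when performing two
# independent tests, each at alpha = .05, when both nulls are true.
alpha = 0.05

p_both_accept = (1 - alpha) ** 2          # both tests correctly accept
p_at_least_one_error = 1 - p_both_accept  # at least one false rejection

print(p_both_accept)          # approximately 0.9025
print(p_at_least_one_error)   # approximately 0.0975
```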

Let E be the probability of rejecting at least one true null hypothesis when we make k(k-1)/2 independent tests, each at level .

Specifically, E is the probability of rejecting the null hypothesis in the following situation:

H0: 1 = 2 = . . . . = k

HA: at least one difference i - j different from zero

While α represents the probability of rejecting one of the k(k − 1)/2 hypotheses of the form:

H0: μi = μj versus HA: μi ≠ μj

Then from the basic rules of probability, one has that:

αE = 1 − (1 − α)^[k(k − 1)/2]
Although the pairwise tests are not independent, the following table gives an indication of how the probability of at least one error is related to the number of groups, k, if we apply the previous results:

Thus, even with three groups, as in our example, we have an unacceptably high probability of rejecting at least one true null hypothesis.
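Applying the previous results, a table like the one above can be reproduced with a small loop; this Python sketch assumes, as the text warns, the approximation that the pairwise tests are independent:

```python
# Familywise error probability 1 - (1 - alpha)^(k(k-1)/2) for k groups,
# under the (approximate) assumption that the pairwise tests are independent.
alpha = 0.05

for k in range(2, 11):
    m = k * (k - 1) // 2              # number of pairwise comparisons
    p_error = 1 - (1 - alpha) ** m    # P(at least one false rejection)
    print(k, m, round(p_error, 4))
```

Even at k = 3 the probability is already about .1426, nearly three times the nominal .05 level.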

One way around this problem is to use a result due to the Italian probabilist, Bonferroni. He showed that for m comparisons, whether independent or dependent, the following relationship always holds:

αE ≤ m α

This implies that if we pick

α = αE / m

we could control the overall error of rejecting at least one true null hypothesis.

The procedure is as follows:

1) Pick your overall alpha level, αE, of rejecting at least one true null hypothesis; I will typically choose αE = .05;

2) If I have k groups and will be comparing the means of the groups two at a time, use:

α = αE / [k(k − 1)/2]

(since m = k(k − 1)/2 in this case).

In our case we have k = 3 and we are comparing each group to each of the others. I would therefore use:

α = .05 / 3 = .01667
Recalling the p-values of the pairwise t-tests, the table below shows the results of testing the various pairwise hypotheses using the Bonferroni approach, which gives us only a .05 chance of making any pairwise error:

Any p-value less than .01667 would be declared significant. In this case the None vs Slow and None vs Fast comparisons are significant. This seems to support the argument that music makes a difference but there is no difference between slow and fast music.
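The Bonferroni decision rule itself is simple to automate. The sketch below is illustrative Python rather than the lecture's EXCEL workflow, and the p-values in it are hypothetical stand-ins (the actual t-test p-values appear in the table above), chosen only to mirror the pattern of conclusions:

```python
# Bonferroni procedure: declare a pairwise comparison significant only if
# its p-value is below alpha_E / m, where m = k(k-1)/2 comparisons.
alpha_E = 0.05
k = 3
m = k * (k - 1) // 2      # 3 pairwise comparisons
alpha = alpha_E / m       # .05 / 3 = .01667

# Hypothetical p-values (placeholders, not the lecture's actual output)
p_values = {"None vs Slow": 0.004, "None vs Fast": 0.001, "Slow vs Fast": 0.030}

for pair, p in p_values.items():
    verdict = "significant" if p < alpha else "not significant"
    print(pair, p, verdict)
```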

For the pairwise differences which show significance using the Bonferroni approach, a confidence interval on the difference in the means is given by the equation:

(x̄i − x̄j) ± t √(si²/ni + sj²/nj)
where the t value is chosen based on the computed degrees of freedom given earlier.
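As an illustration of the interval formula, here is a Python sketch using the two-sample (Welch) degrees of freedom from the previous lecture. The group summary statistics, and the t critical value of 2.365 (roughly the two-sided .05 value for 7 degrees of freedom), are hypothetical placeholders, not the store data:

```python
import math

# Confidence interval for the difference of two group means:
# (mean_i - mean_j) +/- t * sqrt(s_i^2/n_i + s_j^2/n_j)
# Hypothetical summary statistics (placeholders, not the lecture's data):
mean_i, s_i, n_i = 4500.0, 600.0, 5
mean_j, s_j, n_j = 3200.0, 400.0, 5

se2_i = s_i ** 2 / n_i
se2_j = s_j ** 2 / n_j

# Welch-Satterthwaite degrees of freedom, as computed in the prior lecture
df = (se2_i + se2_j) ** 2 / (se2_i ** 2 / (n_i - 1) + se2_j ** 2 / (n_j - 1))

t = 2.365  # approximate two-sided .05 critical value for ~7 df (from a t table)
half_width = t * math.sqrt(se2_i + se2_j)
ci = (mean_i - mean_j - half_width, mean_i - mean_j + half_width)
print(round(df, 2), ci)
```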

First I compute the mean and standard deviation of each group using the usual EXCEL functions “average” and “stdev”. This results in the table below:

Then I use the template from the EXCEL file “twosamp.xls”.

For the Slow vs No Music comparison, the confidence interval is:

The comparison of Fast Music and No Music is given as:

Finally, even though not significant, the comparison between Fast and Slow Music is:

Notice that the confidence interval contains the value of zero.

Sometimes when you have performed an analysis like the one above, you arrive at a set of seemingly conflicting conclusions. For example, you might accept the hypothesis that A = B. You might also accept the hypothesis that B = C. And then reject the hypothesis A = C! This seems to violate logic, but it violates mathematical logic, not statistical logic.

Consider the three probability distributions below:

Clearly Groups A and B overlap quite a bit. Groups B and C also overlap quite a bit. But Groups A and C overlap very little.

With the picture in mind, what the statistical results are saying is that the data does not provide enough evidence to say that Groups A and B are different. Similarly, the data does not provide enough evidence to say that Groups B and C are different, but the data does provide enough evidence to indicate that Groups A and C are different.

The seeming inconsistency then disappears.

The Analysis of Variance

The Bonferroni procedure described in the previous section is the most general approach to the multiple sample analysis of group differences in that it makes few assumptions on the data. If one is willing to make more assumptions about the data, other methods exist for analyzing the data which, if the assumptions are valid, are more “powerful” than the Bonferroni procedure. By more powerful we mean that they have a higher chance of rejecting the null hypothesis when it is false.

The most widely used of these alternative procedures is called the “Analysis of Variance”. It is used in exactly the same situation as the Bonferroni method, except that it applies only when the standard deviations in each of the groups can be assumed to be the same!

Examining our data below, we see that although the standard deviations in the Fast Music and Slow Music groups are approximately the same, the standard deviation of the No Music group is approximately 60% higher.

It is somewhat debatable as to whether this method can be used on this data, but we will use it to illustrate the procedure.

The basic data structure of this problem is given below:

           Group 1     Group 2     . . .     Group k

           x11         x21         . . .     xk1
           x12         x22         . . .     xk2
           .           .                     .
           .           .                     .
           x1n1        x2n2        . . .     xknk

Mean       x̄1          x̄2          . . .     x̄k

Standard
Deviation  s1          s2          . . .     sk

The basic hypothesis being tested is:

H0: 1 = 2 = . . . . = k

HA: at least one difference i - j different from zero

The basis of the Analysis of Variance Method (ANOVA) is the fact that one can estimate the assumed common variance in two ways.

For each group we can form an estimate of the standard deviation by using the formula:

si = √[ Σj (xij − x̄i)² / (ni − 1) ]

where,

x̄i = Σj xij / ni
Now since all groups are assumed to have the same variance, we can pool all k estimates of the sample variance to get one estimate of the common variance as:

sW² = [ (n1 − 1)s1² + (n2 − 1)s2² + . . . + (nk − 1)sk² ] / (n1 + n2 + . . . + nk − k)
We will call this the “Within Group” estimate of the common variance since it is based on the deviations, within each group, of the observations from the group mean.

The second estimate of the common variance is based on the central limit theorem. Remember that:

Var(x̄) = σ²/n
Since we have a mean for each group, it should be possible to estimate the variance by looking at the variability of the group means “between” the groups. One can show theoretically that:

sB² = [ n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + . . . + nk(x̄k − x̄)² ] / (k − 1)

Where,

x̄ = (n1x̄1 + n2x̄2 + . . . + nkx̄k) / (n1 + n2 + . . . + nk)
One can also show that if the null hypothesis that all the groups come from populations with the same mean is true, then

E(sB²) = σ²
On the other hand, if the null hypothesis is false, then:

E(sB²) = σ² + [ n1(μ1 − μ)² + n2(μ2 − μ)² + . . . + nk(μk − μ)² ] / (k − 1)

where,

μ = (n1μ1 + n2μ2 + . . . + nkμk) / (n1 + n2 + . . . + nk)
Define,

F = sB² / sW²
If the null hypothesis is true, then this F-ratio should be close to one. On the other hand if the null hypothesis is false, then this F-ratio should be shifted upward by an amount that increases as the means of the groups differ more from one another.

This forms the basis of the so-called “F-Test” of the null hypothesis that all the groups have the same mean against the alternative hypothesis that at least one pair of groups have means that differ (actually the alternative is a bit more complicated, but in practice the above alternative hypothesis will suffice).

The procedure for testing the basic null hypothesis is:

a) Pick E (usually .05);

b) Compute Fobs and find its one-sided p-value;

c) Reject the null hypothesis if the one-sided p-value is less than αE; otherwise accept the null hypothesis that all groups have the same mean.
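To make the mechanics concrete, the within-group and between-group variance estimates and the F-ratio can be computed by hand, as in this Python sketch with tiny made-up data (the lecture itself relies on EXCEL for these computations):

```python
# One-way ANOVA by hand: within-group and between-group variance estimates.
# The three tiny groups below are made-up illustration data.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]

k = len(groups)
ns = [len(g) for g in groups]
N = sum(ns)
means = [sum(g) / len(g) for g in groups]
grand_mean = sum(sum(g) for g in groups) / N

# Within-group ("pooled") estimate of the common variance
ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
s_within2 = ss_within / (N - k)

# Between-group estimate based on the variability of the group means
ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
s_between2 = ss_between / (k - 1)

F = s_between2 / s_within2
print(s_within2, s_between2, F)
```

Here the group means 2, 3, and 7 are far apart relative to the within-group spread, so the F-ratio comes out well above 1, just as the argument above predicts.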

Fortunately, almost all of the preceding computations are done automatically in EXCEL. To perform the analysis, first label each column (group) in the cell immediately above the first number in the group. Then click on “Tools”, “Data Analysis”, and then “ANOVA: Single Factor”. Your screen should look something like:

The data (including the group labels) is in cells B6:D11. Be sure to check the box “Labels in First Row” and finally enter alpha. Then hit “OK”.

The output will look something like:

I have highlighted the important variables:


The one-sided p-value is .000212. Since this is much smaller than .05, we reject the null hypothesis that all the groups have the same mean in favor of the alternative that at least one pair of groups have different means.

Since we suspect that there are differences between the groups, we need a procedure to find out which group means differ from which.

A confidence interval for the difference in means between groups i and j is given by:

(x̄i − x̄j) ± t √[ sW² (1/ni + 1/nj) ]
using the same value of alpha that we had used to perform the F-test. The degrees of freedom for the t distribution are given in the EXCEL ANOVA output in the row labeled “Within Groups”. In our case this value for the degrees of freedom is shown below:

In our case α = .05, so the appropriate value of t with 12 degrees of freedom is given by the EXCEL function “tinv” as:

t = TINV(.05, 12) = 2.1788
Further, in our case all the groups are the same size, so we only have to compute the +/− limits once as:

± 2.1788 √[ sW² (1/5 + 1/5) ]
which gives:


This is all summarized in the table below:

The ANOVA analysis thus indicates that all groups are different from one another, leading us to maximize sales by playing fast music, with the number of sales increasing by between 1,584 and 3,365 per day.

These results differ from the Bonferroni approach only in the comparison of the Fast vs Slow group. By making an assumption that seems dubious for the data, the analysis has been changed.

In this case I would stay with the Bonferroni analysis. Further study of the problem might indicate whether or not the difference between Fast and Slow Music is real.

The Binomial Distribution with Multiple Groups

The following data was collected by the marketing department at your company:

Here we have four forms of advertising (groups) and a binomial response of "buy" or "didn’t buy" for each group. We are interested in whether or not the probability of buying or not buying differs based on the form of advertising seen by the consumer. Before performing a formal analysis, I computed the probability of buying or not buying for the four groups with the results shown below:

The results look encouraging since any form of advertising is associated with a higher probability of buying. Further it looks like advertising both on TV and in the newspaper is the most effective technique.

Before rushing to judgment however, we need to apply statistical methods to see if these effects justify major advertising expenditures.

The structure of the data in this situation is:

              Group 1       Group 2       . . .     Group k

Successes     x1            x2            . . .     xk
Failures      n1 − x1       n2 − x2       . . .     nk − xk
              ______        ______                  ______
Total         n1            n2            . . .     nk

Estimate      p̂1 = x1/n1    p̂2 = x2/n2    . . .     p̂k = xk/nk

The null hypothesis being tested is:

H0 : p1 = p2 = . . . pk = p

HA: at least one pair pi and pj not equal

Under the null hypothesis, all the groups have the same probability of success, so the natural estimate of the common value of p is:

p̂ = (x1 + x2 + . . . + xk) / (n1 + n2 + . . . + nk)

Then for each group we can find the expected number of successes and expected number of failures by the following formulae:

Expected Successes in Group i if Null Hypothesis is true = ni p̂

Expected Failures in Group i if Null Hypothesis is true = ni (1 − p̂)

Now since ni is the total for group i (i.e. the column total), and p̂ is the total number of successes divided by the grand total of all observations, we arrive at the fact that, just as in the two sample case:

EXPij = (Total for Column i) x (Total for Row j) / (Grand Total)
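This row-total-times-column-total rule is easy to verify in code. The Python sketch below uses a hypothetical 2 x 4 observed table (the actual advertising counts appear in the tables below) to build the expected table and the contributions to chi-square:

```python
# Expected counts under the null: EXP[i][j] = (row total * column total) / grand total.
# Hypothetical 2 x 4 observed table: rows = (buy, don't buy), columns = 4 ad groups.
observed = [[30, 40, 45, 50],
            [70, 60, 55, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Contribution of each cell to the chi-square statistic
contributions = [[(o - e) ** 2 / e for o, e in zip(orow, erow)]
                 for orow, erow in zip(observed, expected)]
chi_sq = sum(sum(row) for row in contributions)
print(expected)
print(round(chi_sq, 4))
```

Note that the expected table has the same row and column totals as the observed one, which is a handy check on the arithmetic.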

Now if all of the expected counts are at least 5 for all i, then we can use the Chi-Square distribution with (2 − 1) x (k − 1) degrees of freedom to test the null hypothesis. Specifically, we would compute the test statistic:

χ²obs = Σ (OBSij − EXPij)² / EXPij, summed over all 2 x k cells
Then we compute the one-sided p-value using the EXCEL function "chidist" as:

one-sided p-value = chidist(χ²obs, k − 1)

If the one-sided p-value is less than α we reject the null hypothesis. Otherwise we accept the null hypothesis.

If the null hypothesis is rejected, we again examine the contributions to the chi-square statistic for values of magnitude greater than 3.5 as the cells providing the greatest deviation of observed values from expectation.

For our data let us work with alpha = .05. We begin with the original table:

Next we compute our Expected values as shown below:

The resultant computation gives the expected table as:

Next we compute for each cell the square difference between the observed and expected values, divided by the expected value. This gives the "Contributions" to chi-square values given below:

The one-sided p-value is computed as:

one-sided p-value = chidist(3.621213, 3) = .305378.

Since this value exceeds .05, we would accept the null hypothesis that the probability of purchasing does not change from group to group. That is, the data is insufficient to support what seemed to be a clear pattern in the data.
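As a check on the reported value, for 3 degrees of freedom the chi-square upper-tail probability chidist(x, 3) has the closed form erfc(√(x/2)) + √(2x/π)·e^(−x/2), which the Python sketch below applies to the statistic computed above:

```python
import math

# Chi-square upper-tail probability for df = 3 via its closed form:
# P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
def chidist_df3(x):
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p = chidist_df3(3.621213)  # the statistic computed from the advertising data
print(round(p, 6))          # close to the lecture's value of .305378
```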

Multiple Group Structural Hypotheses

Assume that you are responsible for choosing the health care plan for your company. You wish to choose a plan which provides excellent health care for your employees while at the same time minimizing the cost to the firm. In studying various health plans, you note that the percentage of a hospital bill covered by the plan is one of your major choices. In discussing this issue with your colleagues, one of them points out an article that showed how the length of the hospital stay varied with the percentage of the hospital bill paid, for a random sample of 311 patients. The article claims that the greater the percentage of the hospital bill paid by insurance coverage, the longer patients stayed in the hospital. They presented the following data:

Does this data indicate that the pattern of hospital stay changes depending on the coverage plan? Does it indicate that the greater the percentage of the hospital bill covered by the health plan, the longer patients tend to stay in the hospital?

The basic question is whether or not the proportionate distribution of patients by length of stay is the same for the four hospital coverage categories.

The basic structure of the problem is:

           Category 1    Category 2    . . .    Category c

Group 1    p11           p12           . . .    p1c
Group 2    p21           p22           . . .    p2c
.          .             .                      .
.          .             .                      .
Group k    pk1           pk2           . . .    pkc

The hypothesis testing situation is:

H0 : p1j = p2j = . . . . = pkj = pj, for j = 1, 2, . . . , c

HA: at least one pair pij not equal to pmj for some j

All the null hypothesis says is that the probability of falling in a category is the same for all groups. The alternative hypothesis indicates that there are at least two groups for which the probability of falling in some category differs.

Let me illustrate this for our data. Our original data gives the following probabilities of length of stay by percentage coverage:

Notice that in the less than 5 days stay column, the percentage varies from 38.81% to 15.49%. Overall 26.69% of all people had stays less than 5 days. Is the variability in this column consistent with chance or does it indicate that it is more likely to be changing in some way associated with the hospital coverage? The hypothesis simultaneously asks this question for the all the columns.