CG9_99_18.1

Nonparametric Tests

Parametric tests:

Require assumptions about population characteristics: normality of the underlying distribution, homogeneity of variance, known mean / variance.

Examples: F, z, t tests

Nonparametric tests:

Do not require assumptions about population characteristics.

Can be used with very skewed distributions or when the population variance is not homogeneous.

Can be used with ordinal or nominal data.

Examples: Chi-square, Wilcoxon, and Kruskal-Wallis tests

Nonparametric tests are less powerful than parametric tests, so we don’t use them when parametric tests are appropriate. But if the assumptions of parametric tests are violated, we use nonparametric tests.

One-factor Chi-Square (χ²) test

The chi-square test is used mainly when dealing with a nominal variable. The “levels” of the variable are discrete, mutually exclusive categories, and the data consist of frequencies or counts for each category.

The chi-square test is sometimes called a “goodness-of-fit” test, because it asks whether there is a good fit between obtained data and theoretical data.

Example:

When I flew home on Sunday, we had a choice of chicken or ravioli for dinner on the flight. Suppose that instead of watching the horrible in-flight movie, I had gone through the cabin counting how many people picked each dinner. I might have gotten the following counts (cell frequencies):

Chicken / Ravioli
83 / 117

It seems like there was a preference for ravioli. But is the difference significant, or is it just due to chance?

In order to test whether or not this difference is due to chance, we need to know what cell frequencies we would expect to see if we were sampling from a null hypothesis population in which there’s no preference for one dinner or the other.

If there’s really no preference for either dinner, then people should pick them equally often. The cell frequencies should be the same for both categories.

The cell frequencies we expect under the null hypothesis are called expected frequencies; the expected frequency for each cell is symbolized by fe.

The cell frequencies that we actually find in our data are called observed frequencies; the observed frequency for each cell is fo.

How do we determine fe?

If the null hypothesis is that the cell frequencies should all be the same, then the total number of counts should be equally divided among the cells.

In this case, the total number of counts is 200; the expected frequency for each cell is 100. So:

/ Chicken / Ravioli
fo / 83 / 117
fe / 100 / 100

In general, when the null hypothesis is that the cell frequencies should all be the same,

fe = N / k

where N = total number of counts for the variable

k = number of categories in the variable

We calculate a test statistic, χ², from fo and fe, to determine how much the observed frequencies deviate from the expected frequencies:

χ² = Σ [ (fo - fe)² / fe ]

For our example,

Cell Number / fo / fe / (fo - fe)²/fe
1 / 83 / 100 / 2.89
2 / 117 / 100 / 2.89

χ²obt = 2.89 + 2.89 = 5.78

This value is evaluated against a critical value, χ²crit, with k - 1 df (from Table H).

We have 1 df in this example, so χ²crit is 3.841 when α = 0.05. Our obtained value of 5.78 is greater than χ²crit, so we conclude that the difference is significant; people really did prefer ravioli to chicken.
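If we wanted to check these numbers by computer, here is a minimal Python sketch using SciPy (SciPy isn't part of these notes; stats.chisquare assumes equal expected frequencies, fe = N/k, unless told otherwise):

    # One-factor chi-square with equal expected frequencies.
    from scipy import stats

    f_obs = [83, 117]                  # chicken, ravioli
    chi2, p = stats.chisquare(f_obs)   # f_exp defaults to N/k = 100 per cell
    print(chi2, p)                     # chi2 = 5.78, p ~ 0.016

    # Critical value for alpha = 0.05 and k - 1 = 1 df; matches Table H's 3.841.
    crit = stats.chi2.ppf(0.95, df=1)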

In this example, we tested the null hypothesis that the cell frequencies would be equal. In fact, we can test any expected frequencies we want.

For example, suppose we knew from past experience that 55% of people will choose ravioli over chicken on a flight, and wanted to know if that was true on this flight as well.

If it were true, then 55% of the 200 passengers would have chosen ravioli, so the expected frequency in the ravioli category would have been (0.55)(200), or 110. The remaining 90 passengers would choose chicken. So, under this hypothesis:

/ Chicken / Ravioli
fo / 83 / 117
fe / 90 / 110

The calculations are the same:

Cell Number / fo / fe / (fo - fe)²/fe
1 / 83 / 90 / 0.54
2 / 117 / 110 / 0.45

χ²obt = 0.54 + 0.45 = 0.99

Again, χ²crit is 3.841. In this case, χ²obt is less than χ²crit, so we would not reject the null hypothesis; we conclude that the number of people choosing each dinner does not differ from what we would expect based on past results.
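The same SciPy sketch handles this case through the f_exp argument:

    from scipy import stats

    # Expected counts under the "55% choose ravioli" hypothesis.
    chi2, p = stats.chisquare([83, 117], f_exp=[90, 110])
    print(chi2, p)   # chi2 ~ 0.99, p ~ 0.32 -> do not reject H0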

Two-factor Chi-Square test

The 2 test is often used to test whether two nominal variables are independent from each other, or are related to each other. The two-factor chi-square test is often called a test of independence.

Example:

In Alberta, there are 4 major political parties: Liberal, Progressive Conservative (PC), New Democrat (ND), and Reform. I think that people in northern Alberta tend to vote differently than people in southern Alberta. We might test that by randomly sampling the party affiliations of 250 Albertans in the north and 250 in the south:

Party
/ Liberal / PC / ND / Reform / Marginal Totals
North / 80 / 95 / 25 / 50 / 250
South / 59 / 103 / 18 / 70 / 250
Marginal Totals / 139 / 198 / 43 / 120 / N = 500

This is called a contingency table: it shows the contingency between two variables.

The table gives us observed frequencies for each of the 8 cells. Now we need the expected frequencies.

Our H0 is that the variables are independent. That means that the number of people who support each party should be the same for both regions.

That means that the total number of people who support each party should be evenly divided between the two regions.

The marginal totals give us the total number of people who support each party.

Liberal: 139    PC: 198

ND: 43    Reform: 120

We split each of those up among the 2 regions to get our expected frequencies, if the variables are independent:

Party
/ Liberal / PC / ND / Reform / Marginal Totals
North / 69.5 / 99 / 21.5 / 60 / 250
South / 69.5 / 99 / 21.5 / 60 / 250
Marginal Totals / 139 / 198 / 43 / 120 / N = 500

The easier way to obtain each fe:

For each cell, multiply that cell's row and column marginal totals and divide by N:

fe = (row total)(column total) / N

E.g., for Liberal, North: fe = (250)(139) / 500 = 69.5
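In Python (a sketch using NumPy, which isn't part of these notes), the whole table of expected frequencies is an outer product of the marginal totals divided by N:

    import numpy as np

    observed = np.array([[80, 95, 25, 50],     # North
                         [59, 103, 18, 70]])   # South
    row_totals = observed.sum(axis=1)          # [250, 250]
    col_totals = observed.sum(axis=0)          # [139, 198, 43, 120]
    N = observed.sum()                         # 500
    expected = np.outer(row_totals, col_totals) / N
    # expected[0, 0] is (250)(139)/500 = 69.5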

Again, we calculate χ²obt using

χ² = Σ [ (fo - fe)² / fe ]

Cell Number / fo / fe / (fo - fe)²/fe
1 (Lib, N) / 80 / 69.5 / 1.586
2 (PC, N) / 95 / 99 / 0.162
3 (ND, N) / 25 / 21.5 / 0.570
4 (Ref, N) / 50 / 60 / 1.667
5 (Lib, S) / 59 / 69.5 / 1.586
6 (PC, S) / 103 / 99 / 0.162
7 (ND, S) / 18 / 21.5 / 0.570
8 (Ref, S) / 70 / 60 / 1.667

χ²obt = 7.97

Again, this is evaluated against χ²crit. We just need to know df.

For the two-factor chi-square,

df = (kA – 1)(kB – 1)

where kA = number of categories in Factor A

kB = number of categories in Factor B

In our case, df = (4 - 1)(2 - 1) = 3 × 1 = 3.

For 3 df and α = 0.05, χ²crit = 7.815.

χ²obt = 7.97 > χ²crit, so we reject H0.

The 2 variables are not independent; people in southern and northern Alberta differ in their support for the 4 major parties. I was right!
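For checking our work, SciPy's chi2_contingency bundles all of these steps: it computes the expected frequencies, χ²obt, df, and a p-value in one call (a sketch, with our table):

    from scipy import stats

    observed = [[80, 95, 25, 50],
                [59, 103, 18, 70]]
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    print(chi2, dof, p)   # chi2 ~ 7.97, dof = 3, p ~ 0.047 -> reject H0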

The independence test is essentially a correlation between nominal variables. In fact, we can use χ²obt to calculate a correlation coefficient.

For 2 × 2 designs, we calculate the phi coefficient, φ:

φ = √(χ² / N)

For 3 × 2 (or higher) designs, we calculate the contingency coefficient, C:

C = √(χ² / (N + χ²))

Both coefficients can have values that range from 0 to +1.0, with higher values indicating a higher correlation. Like r², we can determine the proportion of variance accounted for by computing φ² or C².

In our example:

C = √(7.97 / (500 + 7.97)) = √0.0157 = 0.125

This means that C² = (0.125)² ≈ 0.016: about 1.6% of the variance in political support is accounted for by region.
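Both coefficients are one-line computations; here is a Python sketch (the helper functions are our own, not from a library):

    import math

    def phi_coefficient(chi2, n):
        # Phi, for 2 x 2 designs: sqrt(chi2 / N).
        return math.sqrt(chi2 / n)

    def contingency_coefficient(chi2, n):
        # C, for larger designs: sqrt(chi2 / (N + chi2)).
        return math.sqrt(chi2 / (n + chi2))

    C = contingency_coefficient(7.97, 500)   # ~ 0.125
    print(C ** 2)                            # ~ 0.016, i.e. about 1.6%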

Assumptions of Chi-Square tests

1. The categories are mutually exclusive: a subject cannot be counted in more than one category in the table (e.g., a person cannot be counted as supporting two of the parties).

2. The expected frequency in each cell is at least 5 when kA or kB is greater than 2 (e.g., a 2 × 3 design), or at least 10 when kA and kB are both less than 3 (i.e., a 2 × 2 design). N must be sufficiently large to ensure that this is true.

Other Nonparametric Procedures

There are other nonparametric tests available, primarily in cases in which we are dealing with ranked data. There are two reasons we may have ranked scores:

1. We may use a dependent variable which is a rank ordering of subjects (i.e., the dependent variable may be an ordinal variable).

2. We may start with a dependent variable that is interval or ratio, but then transform it to an ordinal variable.

Because the assumptions of a parametric test may be violated, we may decide that we need to use one of the nonparametric tests.

Two main procedures for ranked data: Wilcoxon test and Kruskal-Wallis test. We’ll cover the Wilcoxon test only.

The Wilcoxon Test

Uses a correlated groups design.

Similar to correlated groups t-test, but with ordinal data instead of interval or ratio data.

Example:

Suppose we want to compare reaction times to bright and dim stimuli. We have 10 subjects perform the task using bright stimuli; the same 10 subjects also perform the task using dim stimuli. We have their RT in each condition:

Subject / Bright / Dim
1 / 540 / 760
2 / 580 / 710
3 / 600 / 1105
4 / 680 / 880
5 / 430 / 500
6 / 740 / 990
7 / 600 / 1050
8 / 690 / 640
9 / 605 / 595
10 / 520 / 520

Steps in Wilcoxon test:

1. Calculate a difference score, D, for each subject.

2. Determine the number of difference scores, N, ignoring subjects for whom D = 0.

3. Rank the difference scores from smallest to largest, based on their absolute values (i.e., ignore sign).

4. Assign a sign to each rank, equal to the sign of the corresponding difference score (signed rank).

5. Separate the positive ranks from the negative ranks.

6. Sum the positive ranks, then sum the negative ranks.

7. Calculate Wilcoxon Tobt.

8. Determine Tcrit and use it to evaluate Tobt.


Subject / Bright / Dim / D / Rank / Signed Rank / Negative Ranks / Positive Ranks
1 / 540 / 760 / -220 / 6 / -6 / 6 /
2 / 580 / 710 / -130 / 4 / -4 / 4 /
3 / 600 / 1105 / -505 / 9 / -9 / 9 /
4 / 680 / 880 / -200 / 5 / -5 / 5 /
5 / 430 / 500 / -70 / 3 / -3 / 3 /
6 / 740 / 990 / -250 / 7 / -7 / 7 /
7 / 600 / 1050 / -450 / 8 / -8 / 8 /
8 / 690 / 640 / +50 / 2 / +2 / / 2
9 / 605 / 595 / +10 / 1 / +1 / / 1
10 / 520 / 520 / 0 / / / /

N = 9 / Σ(negative ranks) = 42 / Σ(positive ranks) = 3


For a two-tailed test,

Tobt = the smaller of the two sums

For a one-tailed test, we are predicting an effect in one direction or the other. That means that we are predicting mostly negative ranks or mostly positive ranks.

In the former case, the sum of the negative ranks should be larger.

Tobt = the sum of the positive ranks

In the latter case, the sum of the positive ranks should be larger.

Tobt = the sum of the negative ranks

First let’s assume we had a non-directional hypothesis. Tobt = the smaller of the two sums = 3.

Tcrit is obtained from Table I. For N = 9 and α = 0.05, Tcrit = 5.

Very Important Point:

Unlike the parametric tests we know, Tobt is significant if it is equal to or less than Tcrit. In our case,

Tobt < Tcrit, so we conclude that the brightness of a stimulus affects RT.
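Here is a Python sketch of steps 1-8 applied to the RT data (NumPy and SciPy are our additions; scipy.stats.wilcoxon drops D = 0 scores by default, just as we did):

    import numpy as np
    from scipy import stats

    bright = np.array([540, 580, 600, 680, 430, 740, 600, 690, 605, 520])
    dim    = np.array([760, 710, 1105, 880, 500, 990, 1050, 640, 595, 520])

    d = bright - dim                      # step 1: difference scores
    d = d[d != 0]                         # step 2: drop D = 0, leaving N = 9
    ranks = stats.rankdata(np.abs(d))     # step 3: rank the absolute values
    signed = np.sign(d) * ranks           # step 4: attach the signs
    pos_sum = signed[signed > 0].sum()    # steps 5-6: 3.0
    neg_sum = -signed[signed < 0].sum()   # steps 5-6: 42.0
    T_obt = min(pos_sum, neg_sum)         # step 7, two-tailed: 3.0

    # Cross-check: for a two-sided test, SciPy reports the smaller rank sum.
    T, p = stats.wilcoxon(bright, dim)    # T = 3.0, p < 0.05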

Suppose we had a directional hypothesis that subjects would respond more quickly to the dim stimulus.

In this case, we’d expect more positive D scores, and thus more positive ranks. Because we have a one-tailed test,

Tobt = the sum of the negative ranks,

…which in our case is 42.

Tcrit for a one-tailed test when α = 0.05 and N = 9 is 8.

Tobt > Tcrit, so we'd retain the null hypothesis and conclude that subjects do not respond faster to dim stimuli.

One more example:

Suppose we’re interested in whether doing practice problems improves performance in statistics. We have 8 students take a quiz before and after practice.

H0: Practice does not improve performance.

H1: Practice improves performance.


Subject / Before Practice / After Practice / D / Rank / Signed Rank / Negative Ranks / Positive Ranks
1 / 12 / 16 / -4 / 5 / -5 / 5 /
2 / 11 / 18 / -7 / 8 / -8 / 8 /
3 / 13 / 19 / -6 / 7 / -7 / 7 /
4 / 17 / 16 / +1 / 1 / +1 / / 1
5 / 18 / 15 / +3 / 3.5 / +3.5 / / 3.5
6 / 11 / 14 / -3 / 3.5 / -3.5 / 3.5 /
7 / 13 / 15 / -2 / 2 / -2 / 2 /
8 / 14 / 19 / -5 / 6 / -6 / 6 /

N = 8 / Σ(negative ranks) = 31.5 / Σ(positive ranks) = 4.5


We have a directional hypothesis: we expect people to do better after practice. Because of the way we set up the difference scores, this means we expect more negative differences than positive differences.

This means the sum of negative ranks should be larger than the sum of positive ranks, so:

Tobt = the sum of positive ranks = 4.5

For α = 0.05, one-tailed, with N = 8,

Tcrit = 5

Tobt < Tcrit reject H0

Practice affects performance.
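The same SciPy call with a one-sided alternative reproduces this analysis; when the test is one-tailed, the statistic SciPy reports is the sum of the positive ranks:

    from scipy import stats

    before = [12, 11, 13, 17, 18, 11, 13, 14]
    after  = [16, 18, 19, 16, 15, 14, 15, 19]

    # alternative='less': we predict before - after is shifted below zero.
    T, p = stats.wilcoxon(before, after, alternative='less')
    print(T, p)   # T = 4.5; p ~ 0.03 -> reject H0
                  # (SciPy falls back to a normal approximation because of ties)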