AMS 572 Lecture Notes #11

October 31st, 2011

Ch.9. Categorical Data Analysis

Exact test on one population proportion:

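Data: a sample of size n, where X is the number of “Successes” and n-X is the number of “Failures”.

$X \sim \mathrm{Bin}(n, p)$, $H_0: p = p_0$

(1) $H_a: p > p_0$: p-value $= P(X \ge x \mid p = p_0) = \sum_{i=x}^{n} \binom{n}{i} p_0^i (1-p_0)^{n-i}$

(2) $H_a: p < p_0$: p-value $= P(X \le x \mid p = p_0) = \sum_{i=0}^{x} \binom{n}{i} p_0^i (1-p_0)^{n-i}$

(3) $H_a: p \ne p_0$: p-value $= 2\min\{P(X \ge x \mid p = p_0),\ P(X \le x \mid p = p_0)\}$.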
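The exact p-values for a test on one proportion are binomial tail probabilities, which a short DATA step can evaluate. The sketch below is only an illustration: the values n = 20, x = 15 and p0 = 0.5 are hypothetical and do not come from these notes, and the two-sided p-value uses the doubling convention, which is one of several in use.

```sas
/* Hypothetical data for illustration only: n trials, x successes, null value p0 */
data exact_binom;
  n = 20; x = 15; p0 = 0.5;
  /* CDF('BINOMIAL', m, p, n) returns P(X <= m) for X ~ Bin(n, p) */
  p_upper = 1 - cdf('BINOMIAL', x - 1, p0, n);   /* Ha: p > p0  */
  p_lower = cdf('BINOMIAL', x, p0, n);           /* Ha: p < p0  */
  p_two   = min(1, 2 * min(p_upper, p_lower));   /* Ha: p ne p0 */
run;

proc print data=exact_binom;
run;
```

The same test can also be requested from PROC FREQ with the BINOMIAL option on a one-way table; the DATA step is used here only because it makes the tail sums explicit.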

Inference on Two Population Proportions: $p_1$, $p_2$

Independent samples, both are large

e.g. Suppose we wish to compare the proportions of smokers among male and female students in SBU.

Two large independent samples:

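Population 1: $p_1$ Population 2: $p_2$

Sample 1: $n_1$, $x_1$ Sample 2: $n_2$, $x_2$

($n_1$ and $n_2$ are both large)

①Point estimator: $\hat p_1 - \hat p_2 = \frac{x_1}{n_1} - \frac{x_2}{n_2}$

②By CLT, $\hat p_1 \dot{\sim} N\left(p_1, \frac{p_1 q_1}{n_1}\right)$ and $\hat p_2 \dot{\sim} N\left(p_2, \frac{p_2 q_2}{n_2}\right)$, where $q_i = 1 - p_i$.

Two samples are independent $\Rightarrow \hat p_1 - \hat p_2 \dot{\sim} N\left(p_1 - p_2, \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}\right)$

③P.Q. for $p_1 - p_2$:

$Z = \frac{(\hat p_1 - \hat p_2) - (p_1 - p_2)}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}} \dot{\sim} N(0,1)$ is not a P.Q., because the denominator involves the unknown $p_1$ and $p_2$ separately, not only through $p_1 - p_2$.

$Z^* = \frac{(\hat p_1 - \hat p_2) - (p_1 - p_2)}{\sqrt{\frac{\hat p_1 \hat q_1}{n_1} + \frac{\hat p_2 \hat q_2}{n_2}}} \dot{\sim} N(0,1)$ is a P.Q.

④100(1-α)% large-sample CI for $p_1 - p_2$:

$(\hat p_1 - \hat p_2) \pm z_{\alpha/2}\sqrt{\frac{\hat p_1 \hat q_1}{n_1} + \frac{\hat p_2 \hat q_2}{n_2}}$

Here, $\hat q_i = 1 - \hat p_i$, i = 1, 2.

⑤Test

General case: $H_0: p_1 - p_2 = \Delta_0$ vs. $H_a: p_1 - p_2 > \Delta_0$

T.S.: $Z_0 = \frac{(\hat p_1 - \hat p_2) - \Delta_0}{\sqrt{\frac{\hat p_1 \hat q_1}{n_1} + \frac{\hat p_2 \hat q_2}{n_2}}} \dot{\sim} N(0,1)$ under $H_0$

p-value=P(Z≥$z_0$), where $z_0$ is the observed value of $Z_0$

At the significance level α, we reject $H_0$ if $z_0 \ge z_\alpha$ or, equivalently, if p-value<α.

When $\Delta_0$=0, one often uses the following test statistic

$Z_0 = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p \hat q\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

here $\hat p = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled estimate of the common proportion and $\hat q = 1 - \hat p$.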

Example 1. A random sample of Democrats and a random sample of Republicans were polled on an issue. Of 200 Republicans, 90 would vote yes on the issue; of 100 Democrats, 58 would vote yes. Let p1 and p2 denote, respectively, the proportions of all Democrats and all Republicans who would vote yes on this issue.

(a)Construct a 95% confidence interval for (p1 - p2)

(b)Can we say that more Democrats than Republicans favor the issue at the 1% level of significance? Please report the p-value.

(c)Please write up the entire SAS program necessary to answer question raised in (b). Please include the data step.

Solution:

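(a)Democrats: $n_1 = 100$, $x_1 = 58$, $\hat p_1 = 0.58$

Republicans: $n_2 = 200$, $x_2 = 90$, $\hat p_2 = 0.45$

The 100(1-α)% confidence interval for (p1 - p2) is

$(\hat p_1 - \hat p_2) \pm z_{\alpha/2}\sqrt{\frac{\hat p_1(1-\hat p_1)}{n_1} + \frac{\hat p_2(1-\hat p_2)}{n_2}}$

After plugging in $z_{0.025}$ = 1.96 etc., we find the 95% CI to be [0.01, 0.25].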

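(b)Hypotheses are $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 > 0$.

$\hat p = \frac{58 + 90}{100 + 200} \approx 0.493$, $z_0 = \frac{0.58 - 0.45}{\sqrt{0.493 \times 0.507 \times \left(\frac{1}{100} + \frac{1}{200}\right)}} \approx 2.12$.

p-value $= P(Z \ge 2.12) \approx 0.017$.

We cannot reject $H_0$ at α = 0.01, because the p-value exceeds 0.01 (equivalently, $z_0 = 2.12 < z_{0.01} = 2.33$). Therefore, at the 1% significance level there is not enough evidence to conclude that more Democrats than Republicans favor the issue.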
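The hand calculations in (a) and (b) can be checked with a short DATA step. This is a sketch using only the counts given in Example 1; the variable names are illustrative, the interval uses the unpooled standard error, and the test uses the pooled statistic for $H_0: p_1 - p_2 = 0$.

```sas
data two_prop;
  x1 = 58; n1 = 100;    /* Democrats   */
  x2 = 90; n2 = 200;    /* Republicans */
  p1 = x1 / n1;
  p2 = x2 / n2;
  /* (a) 95% large-sample CI, unpooled standard error */
  se    = sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2);
  lower = (p1 - p2) - probit(0.975) * se;
  upper = (p1 - p2) + probit(0.975) * se;
  /* (b) pooled test statistic and upper-tail p-value */
  p      = (x1 + x2) / (n1 + n2);
  z0     = (p1 - p2) / sqrt(p*(1 - p)*(1/n1 + 1/n2));
  pvalue = 1 - probnorm(z0);
run;

proc print data=two_prop;
  var lower upper z0 pvalue;
run;
```

The square of z0 computed this way should agree with the Pearson chi-square value reported by PROC FREQ for the same 2x2 table, and the one-sided p-value is half of the two-sided chi-square p-value.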

(c) SAS code:

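Data Poll;
Input Party $ outcome $ count;
Datalines;
Republican yes 90
Republican no 110
Democrats yes 58
Democrats no 42
;
Run;
/* The chi-square p-value is two-sided; halve it for the one-sided test in (b), */
/* because the observed difference is in the direction of the alternative. */
Proc freq data=poll;
Tables party*outcome/chisq;
Weight count;
Run;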

Output:

The SAS System

The FREQ Procedure

Table of Party by outcome

Party     outcome

Frequency|
Percent  |
Row Pct  |
Col Pct  |no      |yes     |  Total
---------+--------+--------+
Democrat |     42 |     58 |    100
         |  14.00 |  19.33 |  33.33
         |  42.00 |  58.00 |
         |  27.63 |  39.19 |
---------+--------+--------+
Republic |    110 |     90 |    200
         |  36.67 |  30.00 |  66.67
         |  55.00 |  45.00 |
         |  72.37 |  60.81 |
---------+--------+--------+
Total         152      148      300
            50.67    49.33   100.00

Statistics for Table of Party by outcome

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      4.5075    0.0337
Likelihood Ratio Chi-Square    1      4.5210    0.0335
Continuity Adj. Chi-Square     1      4.0024    0.0454
Mantel-Haenszel Chi-Square     1      4.4924    0.0340
Phi Coefficient                      -0.1226
Contingency Coefficient               0.1217
Cramer's V                           -0.1226

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        42
Left-sided Pr <= F          0.0226
Right-sided Pr >= F         0.9877

Table Probability (P)       0.0103
Two-sided Pr <= P           0.0377

Sample Size = 300

Inference on several proportions—the Chi-square test (large sample)

Def. Multinomial Experiment.

We have a total of n trials (sample size=n)

①For each trial, it will result in 1 of k possible outcomes.

②The probability of getting outcome i is $p_i$, and $\sum_{i=1}^{k} p_i = 1$.

③These trials are independent.

Example 2. Previous experience indicates that the probability of obtaining 1 healthy calf from a mating is 0.83. Similarly, the probabilities of obtaining 0 and 2 healthy calves are 0.15 and 0.02, respectively. If the farmer breeds 3 dams from the herd, find the probability of getting exactly 3 healthy calves.

Def. Multinomial Distribution

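Let $X_i$ be the number of trials resulting in the i-th category out of a total of n trials, and let $p_i$ be the probability of an i-th category outcome. Then

$P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$, where $x_1 + \dots + x_k = n$.

Solution: Let $X_1$, $X_2$, $X_3$ be the numbers of matings (out of n = 3) that produce 0, 1 and 2 healthy calves, with $p_1 = 0.15$, $p_2 = 0.83$, $p_3 = 0.02$.

P(exactly 3 healthy calves)=$P(X_1 = 1, X_2 = 1, X_3 = 1)$+$P(X_1 = 0, X_2 = 3, X_3 = 0)$

$= \frac{3!}{1!\,1!\,1!}(0.15)(0.83)(0.02) + \frac{3!}{0!\,3!\,0!}(0.83)^3$

=0.015+0.572=0.59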
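As a check on Example 2, the multinomial probabilities can be summed by brute force over every outcome of the 3 matings that yields exactly 3 healthy calves. This is a sketch; the variable names are illustrative and are indexed by the number of calves (0, 1, 2) rather than by category number.

```sas
data calves;
  /* probabilities of 0, 1 and 2 healthy calves from one mating */
  p0 = 0.15; p1 = 0.83; p2 = 0.02;
  n = 3;                      /* number of matings */
  prob = 0;
  /* x0, x1, x2 = number of matings producing 0, 1, 2 healthy calves */
  do x0 = 0 to n;
    do x1 = 0 to n - x0;
      x2 = n - x0 - x1;
      /* keep only the outcomes with exactly 3 healthy calves in total */
      if x1 + 2*x2 = 3 then
        prob + fact(n) / (fact(x0)*fact(x1)*fact(x2)) * p0**x0 * p1**x1 * p2**x2;
    end;
  end;
  put prob=;                  /* written to the SAS log */
run;
```

Only two outcomes satisfy the condition, (x0, x1, x2) = (1, 1, 1) and (0, 3, 0), which are the two terms in the solution above.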

*Relations to the Binomial Distribution (k=2)

Category / 1 / 2
Probability / $p_1 = p$ / $p_2 = 1-p$
# trials / $x_1 = x$ / $x_2 = n-x$

Chi-square goodness of fit test

Example 3. Gregor Mendel (1822-1884) was an Austrian monk whose genetic theory is one of the greatest scientific discoveries of all time. In his famous experiment with garden peas, he proposed a genetic model that would explain inheritance. In particular, he studied how the shape (smooth or wrinkled) and color (yellow or green) of pea seeds are transmitted through generations. His model shows that the second generation of peas from a certain ancestry should have the following distribution.

Category / wrinkled-green / wrinkled-yellow / smooth-green / smooth-yellow
Theoretical probabilities / $\frac{1}{16}$ / $\frac{3}{16}$ / $\frac{3}{16}$ / $\frac{9}{16}$

n=556

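General test: $H_0: p_1 = p_{10}, p_2 = p_{20}, \dots, p_k = p_{k0}$ vs. $H_a$: $H_0$ is not true

Test whether the theoretical probabilities are correct.

T.S.: $W_0 = \sum_{i=1}^{k} \frac{(x_i - e_i)^2}{e_i} \dot{\sim} \chi^2_{k-1}$ under $H_0$

where $x_i$ is the observed number of observations in category i

$e_i$ is the expected count of the i-th category, $e_i = n p_{i0}$

At the significance level α, reject $H_0$ iff $W_0 \ge \chi^2_{k-1, \alpha}$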

Solution:

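Category / wrinkled-green / wrinkled-yellow / smooth-green / smooth-yellow
Theoretical probabilities / $\frac{1}{16}$ / $\frac{3}{16}$ / $\frac{3}{16}$ / $\frac{9}{16}$
Observed count out of 556 / $x_1$=31 / $x_2$=102 / $x_3$=108 / $x_4$=315
Expected counts / $e_1$=34.75 / $e_2$=104.25 / $e_3$=104.25 / $e_4$=312.75

T.S.: $W_0 = \frac{(31-34.75)^2}{34.75} + \frac{(102-104.25)^2}{104.25} + \frac{(108-104.25)^2}{104.25} + \frac{(315-312.75)^2}{312.75} = 0.604$

$0.604 < \chi^2_{3, 0.05}$=7.815

At significance level 0.05, we cannot reject $H_0$.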

SAS Code:

DATA GENE;
/* NUMBER is read from columns 15-17, so each count must sit in those columns */
INPUT @1 COLOR $13. @15 NUMBER 3.;
DATALINES;
YELLOWSMOOTH  315
YELLOWWRINKLE 102
GREENSMOOTH   108
GREENWRINKLE   31
;
* HYPOTHESIZING A 9:3:3:1 RATIO;
PROC FREQ DATA=GENE ORDER=DATA;
WEIGHT NUMBER;
TITLE3 'GOODNESS OF FIT ANALYSIS';
TABLES COLOR / CHISQ NOCUM TESTP=(0.5625 0.1875 0.1875 0.0625);
RUN;

The SAS System

GOODNESS OF FIT ANALYSIS

The FREQ Procedure

                                        Test
COLOR            Frequency   Percent   Percent
----------------------------------------------
YELLOWSMOOTH          315     56.65     56.25
YELLOWWRINKLE         102     18.35     18.75
GREENSMOOTH           108     19.42     18.75
GREENWRINKLE           31      5.58      6.25

Chi-Square Test
for Specified Proportions
-------------------------
Chi-Square         0.6043
DF                      3
Pr > ChiSq         0.8954

Sample Size = 556

Example 4. A classic tale involves four car-pooling students who missed a test and gave a flat tire as their excuse. On the make-up test, the professor asked the students to identify the particular tire that went flat. If they really did not have a flat tire, would they be able to identify the same tire?

To mimic this situation, 40 other students were asked to identify the tire they would select.

The data are:

Tire / Left front / Right front / Left rear / Right rear
Frequency / 11 / 15 / 8 / 6

At α=0.05, please test whether each tire has the same chance of being selected.

Solution:

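$H_0: p_1 = p_2 = p_3 = p_4 = \frac{1}{4}$ vs. $H_a$: $H_0$ is not true

n=40, $e_i = n p_{i0} = 40 \times \frac{1}{4}$=10

Chi-square statistic: $\frac{(11-10)^2 + (15-10)^2 + (8-10)^2 + (6-10)^2}{10} = 4.6 < \chi^2_{3, 0.05} = 7.815$

Fail to reject $H_0$.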
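The notes give no SAS program for Example 4, so the following is a sketch of how it could be run with PROC FREQ, following the pattern of the Mendel example. The data set name, variable names and the two-letter tire codes are illustrative choices, not part of the original notes.

```sas
data tire;
  /* LF = left front, RF = right front, LR = left rear, RR = right rear */
  input position $ count;
  datalines;
LF 11
RF 15
LR 8
RR 6
;
run;

proc freq data=tire order=data;
  weight count;
  /* without a TESTP= option, CHISQ on a one-way table tests equal proportions */
  tables position / chisq nocum;
run;
```

The reported chi-square statistic should equal the hand-computed value 4.6 with 3 degrees of freedom.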

The chi-square goodness of fit test is an extension of the Z-test for one population proportion.

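Data: sample size n, x: successes with probability p

n-x: failures with probability 1-p

$H_0: p = p_0$ vs. $H_a: p \ne p_0$

T.S.: $Z_0 = \frac{\hat p - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{x - np_0}{\sqrt{np_0(1-p_0)}}$, where $\hat p = x/n$

At α, reject $H_0$ iff $|Z_0| \ge z_{\alpha/2}$

Category / Success / Failure
Expected / $e_1 = np_0$ / $e_2 = n(1-p_0)$
Observed / $x_1$=x / $x_2 = n-x$

The chi-square statistic is $W_0 = \frac{(x - np_0)^2}{np_0} + \frac{[(n-x) - n(1-p_0)]^2}{n(1-p_0)}$

$= (x - np_0)^2\left[\frac{1}{np_0} + \frac{1}{n(1-p_0)}\right]$

$= \frac{(x - np_0)^2}{np_0(1-p_0)} = Z_0^2$

Recall: If $Z_1, \dots, Z_k$ are i.i.d. N(0,1), then $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$.

When k=1, $Z^2 \sim \chi^2_1$.

Let Z~N(0,1), then

$\alpha = P(|Z| \ge z_{\alpha/2})$ = $P(Z^2 \ge z_{\alpha/2}^2)$, so $z_{\alpha/2}^2 = \chi^2_{1, \alpha}$.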

The two tests are identical.
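The equivalence can be confirmed numerically. The sketch below uses hypothetical values (n = 100, x = 60, p0 = 0.5, α = 0.05) that are not from the notes; it computes the squared Z statistic, the two-category chi-square statistic, and the two critical values.

```sas
data equiv;
  /* hypothetical data for illustration only */
  n = 100; x = 60; p0 = 0.5; alpha = 0.05;
  /* one-sample Z statistic for H0: p = p0 */
  z0 = (x/n - p0) / sqrt(p0*(1 - p0)/n);
  /* chi-square goodness-of-fit statistic with 2 categories */
  w0 = (x - n*p0)**2 / (n*p0) + ((n - x) - n*(1 - p0))**2 / (n*(1 - p0));
  z0_sq    = z0**2;                     /* should equal w0 */
  z_crit2  = probit(1 - alpha/2)**2;    /* squared upper alpha/2 normal point */
  chi_crit = cinv(1 - alpha, 1);        /* upper alpha point of chi-square(1) */
  put z0_sq= w0= z_crit2= chi_crit=;
run;
```

With these inputs both statistics equal 4, and the two critical values coincide, so the two rejection regions are the same.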

Exact Tests for Inference on Two Population Proportions

  1. Fisher’s exact test:

A little bit of history (Fisher’s Tea Drinker): R.A. Fisher described the following experiment. Muriel Bristol, his colleague, claimed that when drinking tea, she could distinguish whether milk or tea was added to the cup first (she preferred milk first). To test her claim, Fisher designed an experiment with 8 cups of tea – 4 with milk added first and 4 with tea added first. Muriel was told that there were 4 cups of each type, and was asked to predict which four had the milk added first. The order of presenting the cups to her was randomized. It turned out that Muriel correctly identified 3 from each type. Now we are testing the null hypothesis that she did so by pure guessing, versus the alternative hypothesis that she could do better than pure guessing. The p-value of the test is derived as follows:

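p-value $= P(\text{at least 3 of the 4 milk-first cups are identified correctly}) = \frac{\binom{4}{3}\binom{4}{1} + \binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = \frac{17}{70} \approx 0.243$

Since the p-value is large, we cannot reject the null hypothesis. It is quite possible that she identified 3 cups of each type correctly by pure guessing.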
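Under pure guessing, the number of milk-first cups among the 4 cups she labels as milk-first follows a hypergeometric distribution, so the p-value is a short sum of ratios of binomial coefficients. A minimal sketch with illustrative variable names:

```sas
data tea;
  /* k = number of milk-first cups correctly identified, out of 4 */
  pvalue = 0;
  do k = 3 to 4;
    /* choose k of the 4 milk-first cups and 4-k of the 4 tea-first cups */
    pvalue + comb(4, k) * comb(4, 4 - k) / comb(8, 4);
  end;
  put pvalue=;    /* 17/70, written to the SAS log */
run;
```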

Inference on 2 population proportions, 2 independent samples

Example 5. The result of a randomized clinical trial comparing the drugs Prednisone and Prednisone+VCR is summarized below. Test whether the success and failure probabilities are the same for the two drugs.

Drug / Success / Failure / Row total
Pred / 14 / 7 / $n_1$=21
PVCR / 38 / 4 / $n_2$=42
Column total / m=52 / n-m=11 / n=63

General setting:

Sample / “S” / “F” / Total
Sample 1 / x / $n_1$-x / $n_1$
Sample 2 / y / $n_2$-y / $n_2$
Total / m=x+y / n-m / n

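Solution: $H_0: p_1 = p_2$ vs. $H_a: p_1 \ne p_2$. Under $H_0$, given the column total m, the number of successes X in sample 1 follows a hypergeometric distribution:

$P(X = x \mid X + Y = m) = \frac{\binom{n_1}{x}\binom{n_2}{m-x}}{\binom{n}{m}}$

With $n_1$ = 21, $n_2$ = 42 and m = 52, the observed x = 14 lies in the lower tail: $P(X \le 14 \mid X + Y = 52) = \sum_{x=10}^{14} \frac{\binom{21}{x}\binom{42}{52-x}}{\binom{63}{52}} \approx 0.025$. Summing the probabilities of all tables that are no more likely than the observed one gives a two-sided p-value of about 0.032 <0.05

Reject $H_0$.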

SAS code:

Data trial;
input drug $ outcome $ count;
datalines;
pred S 14
pred F 7
PVCR S 38
PVCR F 4
;
run;
/* For a 2x2 table, the CHISQ option also prints Fisher's exact test */
proc freq data=trial;
tables drug*outcome/chisq;
weight count;
run;

  2. McNemar’s test

Inference on 2 population proportions- paired samples

Example6. A preference poll of a panel of 75 voters was conducted before and after a TV debate during the campaign for the 1980 presidential election between Jimmy Carter and Ronald Reagan. Test whether there was a significant shift from Carter as a result of the TV debate.

Preference before / Preference after: Carter / Preference after: Reagan
Carter / 28 / 13
Reagan / 7 / 27

General setting:

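Condition 1 response / Condition 2 response: Yes / Condition 2 response: No
Yes / A=a, $p_A$ / B=b, $p_B$
No / C=c, $p_C$ / D=d, $p_D$

$p_A + p_B + p_C + p_D$=1, A+B+C+D=n, (A, B, C, D)~Multinomial$(n; p_A, p_B, p_C, p_D)$

$p_1 = p_A + p_B$, $p_2 = p_A + p_C$ are the “Yes” probabilities under conditions 1 and 2, so $H_0: p_1 = p_2$ is equivalent to $H_0: p_B = p_C$.

Given B+C=m, B~Bin(m, p=$\frac{p_B}{p_B + p_C}$)

Under $H_0$, given B+C=m, B~Bin(m, p=1/2)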

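Solution: Let $p_B$ and $p_C$ be the probabilities of switching from Carter to Reagan and from Reagan to Carter. Here b = 13, c = 7 and m = b + c = 20. A shift away from Carter corresponds to $H_a: p_B > p_C$, so we test $H_0: p_B = p_C$ vs. $H_a: p_B > p_C$.

Exact p-value $= P(B \ge 13 \mid B + C = 20) = \sum_{k=13}^{20} \binom{20}{k}\left(\frac{1}{2}\right)^{20} = 0.1316$

Since the p-value is larger than the usual significance levels (e.g. 0.05), we cannot reject $H_0$: there is no significant shift from Carter to Reagan.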
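Both versions of McNemar's test depend only on the two discordant counts, b = 13 (Carter before, Reagan after) and c = 7 (Reagan before, Carter after), so they can be checked in a DATA step. This is a sketch with illustrative variable names: the exact p-value is the one-sided binomial tail, and the large-sample statistic is $(b-c)^2/(b+c)$.

```sas
data mcnemar;
  b = 13; c = 7;      /* discordant counts from the 2x2 table */
  m = b + c;
  /* exact one-sided p-value: P(B >= b | B + C = m), B ~ Bin(m, 1/2) under H0 */
  p_exact = 1 - cdf('BINOMIAL', b - 1, 0.5, m);
  /* large-sample McNemar statistic and its two-sided chi-square(1) p-value */
  chisq  = (b - c)**2 / (b + c);
  p_asym = 1 - probchi(chisq, 1);
  put p_exact= chisq= p_asym=;
run;
```

These values can be compared with the PROC FREQ output: the statistic should match "Statistic (S)", p_asym should match "Asymptotic Pr > S", and p_exact is half of the two-sided "Exact Pr >= S".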

SAS code:

Data election;
input before $ after $ count;
datalines;
Carter Carter 28
Carter Reagan 13
Reagan Reagan 27
Reagan Carter 7
;
run;
/* AGREE requests McNemar's test and kappa; EXACT AGREE adds their exact p-values */
proc freq data=election;
exact agree;
tables before*after/agree;
weight count;
run;

The SAS System 21:15 Saturday, November 6, 2010 2

The FREQ Procedure

Table of before by after

before    after

Frequency|
Percent  |
Row Pct  |
Col Pct  |Carter  |Reagan  |  Total
---------+--------+--------+
Carter   |     28 |     13 |     41
         |  37.33 |  17.33 |  54.67
         |  68.29 |  31.71 |
         |  80.00 |  32.50 |
---------+--------+--------+
Reagan   |      7 |     27 |     34
         |   9.33 |  36.00 |  45.33
         |  20.59 |  79.41 |
         |  20.00 |  67.50 |
---------+--------+--------+
Total          35       40       75
            46.67    53.33   100.00

Statistics for Table of before by after

McNemar's Test
----------------------------
Statistic (S)         1.8000
DF                         1
Asymptotic Pr >  S    0.1797
Exact      Pr >= S    0.2632 (= 2*0.1316)

Simple Kappa Coefficient
--------------------------------
Kappa (K)                 0.4700
ASE                       0.1003
95% Lower Conf Limit      0.2734
95% Upper Conf Limit      0.6666

Test of H0: Kappa = 0

ASE under H0              0.1140
Z                         4.1225
One-sided Pr >  Z         <.0001
Two-sided Pr > |Z|        <.0001

Exact Test
One-sided Pr >=  K     3.614E-05
Two-sided Pr >= |K|    5.847E-05

Sample Size = 75
