SPS 580 Lecture 2 Bivariate Chi Square

BIVARIATE ANALYSIS: CROSSTABULATION

ASSIGNMENT #1 ASKED YOU: For each dependent variable . . . Comment on the findings from the univariate percentage table and/or column chart and implications they might have for the test of your theory . . .

Here’s what I was thinking . . .

IDEA: People from Chicago are more likely to be in favor of a tax on Lakefront beach use for non-City residents

THEORY: Place of Residence → Opinion on Beach Tax

FREQUENCIES VARIABLES=bchtax

The majority oppose a beach tax for non-City residents

Overall, about 42% of the general population favor a beach tax

IF THE THEORY IS RIGHT WE EXPECT SOMETHING LIKE THIS . . .

(ABSTRACTION ALERT)

 A BIVARIATE PERCENTAGE TABLE

“MARGINAL” TOTAL

Frequencies = marginal distribution

BIVARIATE CROSSTABULATION: REAL DATA

RECODE X and Y variables so categories = the desired end product

RECODE bchtax (1 thru 2=1) (3 thru 4=2) (ELSE=9) INTO bchtax2.

VARIABLE LABELS bchtax2 'dichotomy'.

value labels bchtax2 1 'Favor tax' 2 'Oppose tax' 9 'other'.

missing values bchtax2 (9) .

RECODE region (1=1) (2=2) (3 thru 7=3) (ELSE=9) INTO region3.

VARIABLE LABELS region3 'region recoded'.

value labels region3 1 'Chicago' 2 'Suburban Cook Co' 3 'Collar Counties'.

missing values region3 (9).

PERFORM A CROSSTABULATION OF X and Y VARIABLES

ANALYZE / DESCRIPTIVE STATISTICS / CROSSTABS / ROWS: region3 / COLUMNS: bchtax2 / CELLS: Row percentages / CONTINUE / OK

region3 region recoded * bchtax2 dichotomy Crosstabulation
% within region3 region recoded

region3 region recoded     bchtax2 dichotomy                     Total
                           1.00 Favor tax    2.00 Oppose tax
1.00 Chicago               54.0%             46.0%               100.0%
2.00 Suburban Cook Co      33.1%             66.9%               100.0%
3.00 Collar Counties       37.0%             63.0%               100.0%
Total                      41.8%             58.2%               100.0%
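The row percentages in this table can be checked by hand from the cell counts. The sketch below uses Python purely as a calculator (it is not part of the SPSS workflow) and assumes the "favor" counts and row totals listed with the observed data in the chi square section of these notes. The function and variable names are illustrative only.

```python
# Counts of respondents who favor the tax, and row totals, taken from the
# observed-data figures in the chi square section of these notes.
favor = {"Chicago": 619, "Suburban Cook Co": 354, "Collar Counties": 353}
total = {"Chicago": 1147, "Suburban Cook Co": 1070, "Collar Counties": 954}

def row_percentages(favor, total):
    """Return {region: (% favor, % oppose)}, rounded to one decimal place."""
    table = {}
    for region, n in total.items():
        pct_favor = 100.0 * favor[region] / n
        table[region] = (round(pct_favor, 1), round(100.0 - pct_favor, 1))
    return table

pct = row_percentages(favor, total)

# Marginal ("Total") row: everyone pooled, ignoring region.
marginal = round(100.0 * sum(favor.values()) / sum(total.values()), 1)

for region, (f, o) in pct.items():
    print(f"{region:18s} {f:5.1f}% {o:5.1f}%")
print(f"{'Total':18s} {marginal:5.1f}% {100 - marginal:5.1f}%")
```

Each row is percentaged on its own total, which is why every row sums to 100% and the rows can be compared directly with the Total row.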

SPSS HINT: When you COMPUTE (or RECODE ... INTO) new variables, they appear at the end of the list of SPSS variables, not in alphabetical order

A BIVARIATE PERCENTAGE TABLE

PQ Bivariate column chart

Y = %Y given the value of X (conditional distribution): Y = %Y|X

X = independent variable, place of residence

MAKING A PRESENTATION QUALITY BIVARIATE GRAPH . . .

Highlight the Bivariate percentage table

INSERT / COLUMN / 2-D COLUMN

Delete gridlines

Delete legend

Add data labels

Re-set y-axis metric

Delete x-axis tick marks

Resize fonts

The theory is right . . . people from Chicago are more likely to support the beach tax (54%) than people from the suburbs (33% and 37%).
Is it also right to say that people from the Collar Counties are more likely (37%) to support the beach tax than people who live in Suburban Cook County (33%)?
Well . . . it doesn’t make much sense, and it is probably not a statistically significant difference.

Statistical Significance I

A. When we talk about statistical significance, we are talking about the significance of differences

B. When you say a difference is not statistically significant, that means you think . . .

It’s small . . . It’s too small
Smaller than what?

C. The pattern in the data could have arisen by chance, so don’t get all worked up about it

D. OPERATIONAL DEFINITION . . . “Chance” means the PERCENT in each place (each category of X) is the same as the TOTAL PERCENT plus or minus some sampling error

Observed Data

Sampling Error Model (SEM)

for testing statistical significance

Y% does not depend on the value of X

aka . . . Null Hypothesis . . . “no” causal relationship

E. Test of Statistical Significance Means . . . Compare the Observed Data to the SEM (null hypothesis) to see if the data are “statistically significant”

Null Hypothesis =>

conditional %s = marginal %

F. What you get from a significance test is the likelihood of seeing differences at least this large if the null is true. General rule for policy research: if that likelihood is 5% or less, then treat the null as NOT TRUE

G. So, what’s the likelihood that the % in favor everywhere is 42% and the differences are just sampling variation? = Likelihood the apparent differences could have arisen from sampling variation

The Chi Square Test

XTAB observed data

619 / 1147 = 54%

354 / 1070 = 33%

353 / 954 = 37%

1326 / 3171 = 42%

Calculate expected data if null is right

42% * 1147 = 480

42% * 1070 = 447

42% * 954 = 399

42% * 3171 = 1326

Calculate Observed minus Expected

The positive differences show where there are “too many” cases if null is right
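The expected counts and the observed-minus-expected differences can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python (used only as a calculator), assuming the "favor" counts and row totals from the XTAB figures above. It uses the unrounded marginal share (about 41.8%) rather than the rounded 42%, so the results differ from the slide figures only by rounding.

```python
# Observed "favor" counts and row totals from the XTAB figures above.
favor = [619, 354, 353]          # Chicago, Suburban Cook, Collar Counties
row_total = [1147, 1070, 954]

n = sum(row_total)               # 3171 respondents
p_favor = sum(favor) / n         # marginal share in favor, about 0.418

# Under the null every region has the same share in favor, so the expected
# count in each region is the marginal share times that region's total.
expected = [p_favor * t for t in row_total]
difference = [o - e for o, e in zip(favor, expected)]   # observed - expected

for o, e, d in zip(favor, expected, difference):
    print(f"observed {o:4d}  expected {e:6.1f}  O - E {d:+7.1f}")
```

The differences sum to zero: the "too many" cases in Chicago are exactly offset by "too few" in the two suburban groups.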

Calculate [ (O-E)^2 / E ]

139.37 ^ 2 = 19,422.65

19,422.65 / 480 = 40

Add it all up: SUM [ (O-E)^2 / E ] = 112
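The full sum runs over all six cells of the table (three regions by favor/oppose), not just the "favor" column. The sketch below is one way to check the total in Python. The "oppose" counts are not listed in the notes; they are derived here as row total minus favor, which is an assumption of this sketch. The function name chi_square is illustrative.

```python
# Observed counts. "Favor" counts come from the notes; "oppose" counts are
# derived here as row total minus favor (an assumption of this sketch).
observed = [
    [619, 1147 - 619],   # Chicago:          favor, oppose
    [354, 1070 - 354],   # Suburban Cook Co: favor, oppose
    [353, 954 - 353],    # Collar Counties:  favor, oppose
]

def chi_square(table):
    """Pearson chi square: sum over all cells of (O - E)^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected under the null
            total += (o - e) ** 2 / e
    return total

chi2 = chi_square(observed)
print(f"chi square = {chi2:.1f}")
```

The Chicago "favor" cell alone contributes about 40 of the total, which matches the single-cell calculation in the notes.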

Determine degrees of freedom (df) = (# rows – 1) * (# columns – 1) = 2*1 = 2

Look up the “critical value” of chi square, given the df. . .

. . . if chi square is greater than the critical value, then the likelihood of getting data like these when the null is true is less than 5%, which means we treat the null as NOT TRUE

So in this case: chi square = 112, df = 2, critical value = 5.991. Null is NOT TRUE

It is very unlikely that the data arose from sampling variation alone.

THE DIFFERENCES IN ATTITUDE TOWARD THE BEACH TAX BY LOCATION ARE STATISTICALLY SIGNIFICANT

What about the difference between Suburban Cook and the Collar counties?

So in this case: chi square = 3.41, df = 1, critical value = 3.841. Null is NOT REJECTED

The data could have arisen because of sampling variation.

THE DIFFERENCE BETWEEN SUBURBAN COOK AND THE COLLAR COUNTIES ON ATTITUDE TOWARD THE BEACH TAX IS NOT STATISTICALLY SIGNIFICANT
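The two-group comparison is the same calculation applied to a 2 x 2 table that drops the Chicago row. A minimal Python sketch follows, again deriving the "oppose" counts as row total minus favor (an assumption, since only the "favor" counts and totals appear in the notes). It computes the uncorrected Pearson statistic; SPSS also reports a continuity-corrected value for 2 x 2 tables, which this sketch does not compute.

```python
# Suburban Cook vs. Collar Counties only. "Oppose" counts are derived as
# row total minus favor, using the figures from the XTAB observed data.
observed = [
    [354, 1070 - 354],   # Suburban Cook Co: favor, oppose
    [353, 954 - 353],    # Collar Counties:  favor, oppose
]
CRITICAL_05_DF1 = 3.841  # chi square critical value, df = 1, 5% level

rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
n = sum(rows)

# Sum of (O - E)^2 / E over the four cells, with E = row total * column total / N.
chi2 = sum(
    (observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
    for i in range(2)
    for j in range(2)
)
df = (len(rows) - 1) * (len(cols) - 1)
significant = chi2 > CRITICAL_05_DF1

print(f"chi square = {chi2:.2f}, df = {df}, significant at 5%: {significant}")
```

Note that the expected counts change: they are now based on the pooled share in favor among the two suburban groups only, not the 42% for the whole sample.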

CHI SQUARE – A BLANKET TEST

Idea: Poor people are more likely to use public transit

THEORY: Income causes mode of transportation

 this is what’s in the data set

(NOT PRESENTATION QUALITY)

Let’s say you coded it into these four groups . . .

RECODE tranmt (1=1) (2=1) (6=1) (7=1) (11=1) (10=4) (12=4) (3 thru 5=2) (8 thru 9=3) (ELSE=SYSMIS).

value labels tranmt 1 'private' 2 'public' 3 'bike or walked' 4 'other'.

There’s a nice income variable already there; it just needs the DK (don’t know) codes assigned to missing

INCOME . . . missing values inc4gp (7,8,9).

BIVARIATE PERCENTAGE TABLE PQ

BIVARIATE CHARTS (Ugly, but PQ) …

Chi square = 188, df = 9

But this chi square tests the null hypothesis that ALL OF THE INCOME GROUPS MAKE ALL THE SAME TRANSPORTATION CHOICES i.e., . . .

Chi square is a BLANKET TEST of all possible differences

Likelihood < 5 %

Null is FALSE

All income groups did not make the same transportation choices

FOCUSING CHI SQUARE TESTS ON SPECIFIC HYPOTHESES

It might be that the differences in PUBLIC transportation use are NOT SIGNIFICANT, even though the overall pattern of differences is significant.

So you need to construct a chi square test that makes use of all the data available, but focuses on the set of differences you are most interested in . . .

Chi square = 99.6, df = 3

Likelihood < 5 %

Null is FALSE

Income groups not equally likely to use public transportation
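The blanket test and the focused test differ in table shape, so they differ in degrees of freedom and in critical value. The sketch below checks both reported chi squares against 5% critical values from a standard chi square table. It assumes four income groups (inferred from the variable name inc4gp and df = 9) and assumes the focused test collapses the transportation modes into two columns, public versus everything else (inferred from df = 3). Both table shapes are assumptions, not stated in the notes.

```python
# 5% critical values of chi square from a standard table, keyed by df.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 9: 16.919}

def degrees_of_freedom(n_rows, n_cols):
    """df = (# rows - 1) * (# columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def reject_null(chi2, df):
    """True when chi square exceeds the 5% critical value for this df."""
    return chi2 > CRITICAL_05[df]

# Blanket test: 4 income groups x 4 transportation modes (assumed shape).
blanket_df = degrees_of_freedom(4, 4)
# Focused test: 4 income groups x 2 columns, public vs. not public (assumed).
focused_df = degrees_of_freedom(4, 2)

print(f"blanket: df = {blanket_df}, reject null: {reject_null(188, blanket_df)}")
print(f"focused: df = {focused_df}, reject null: {reject_null(99.6, focused_df)}")
```

Both statistics are far above their critical values, so the null is rejected for the overall pattern and for public transportation use specifically.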

PHI, SENSITIVITY OF CHI SQUARE TEST TO N

THEORY: Smoking is correlated with health status

RECODE helthr (1=1) (2=2) (3=3) (4=3) (7 thru 9=9) INTO helthr3.

VARIABLE LABELS helthr3 '3 categories'.

value labels helthr3 1 'Excellent' 2 'Good' 3 'Fair or Poor'.

missing values helthr3 (7,8,9).

CHI SQUARE rule: can’t have cells with Expected value < 5.0

Rule of thumb: recode so there is always 10% or more per final category

Rule of thumb: rare categories OK for DEPENDENT VARIABLES only

Chi square = 152, df = 2

152 > 5.99, so the null is not true

Smoking is correlated with health status

But the difference is not very big (42% - 32% ) = 10%

Chi square is large because N is large (N = 18,381)

If there were 500 cases with all other distributions the same: chi sq = 4.13 < 5.99, so the null is not rejected

CHI SQUARE IS SENSITIVE TO SAMPLE SIZE

PHI = SQUARE ROOT [ CHI SQ / N ]

Case 1 . . . n = 18,381 . . . phi = .09 (weak)

Case 2 . . . n = 500 . . . phi = .09 (weak)
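The two cases can be checked with the phi formula above. The sketch also shows why chi square shrinks with N: if every cell percentage stays the same, each (O-E)^2 / E term is proportional to N, so the whole statistic is too. This is a minimal Python sketch; the helper names phi and rescale_chi2 are illustrative, not SPSS functions.

```python
import math

def phi(chi2, n):
    """Phi = square root of (chi square / N)."""
    return math.sqrt(chi2 / n)

def rescale_chi2(chi2, n_old, n_new):
    """Chi square for the same percentage table at a different sample size.

    With every cell percentage held fixed, chi square is proportional to N.
    """
    return chi2 * n_new / n_old

chi2_large = 152.0                                  # reported for N = 18,381
chi2_small = rescale_chi2(chi2_large, 18381, 500)   # same table, N = 500

print(f"N = 18381: chi square = {chi2_large:.2f}, phi = {phi(chi2_large, 18381):.2f}")
print(f"N = 500:   chi square = {chi2_small:.2f}, phi = {phi(chi2_small, 500):.2f}")
```

Dividing by N removes the sample-size effect, which is why phi is the same in both cases while the significance decision flips.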

PUTTING IT ALL TOGETHER

THEORY: Age causes likelihood of unemployment

missing values umemp2 (7,8,9).

NOTE: the variable name is UMEMP2 (typo in data entry?)

big table

somewhat large N = 2,032

2 expected values = 6.0

Chi sq = 74, df = 9

Phi = .191

Null is not true: the likelihood of being unemployed is not the same in all age groups

Chart/graph . . . have to choose what to focus on because the whole table is unwieldy