SPS 580 Lecture 2 Bivariate Chi Square
- BIVARIATE ANALYSIS: CROSSTABULATION
ASSIGNMENT #1 ASKED YOU: For each dependent variable . . . Comment on the findings from the univariate percentage table and/or column chart and implications they might have for the test of your theory . . .
Here’s what I was thinking . . .
IDEA: People from Chicago are more likely to be in favor of a tax on Lakefront beach use for non-City residents
THEORY: Place of Residence causes Opinion on Beach tax
FREQUENCIES VARIABLES=bchtax.
The majority oppose a beach tax for non-City residents
On average 41% of the general population favor a beach tax
IF THE THEORY IS RIGHT WE EXPECT SOMETHING LIKE THIS . . .
(ABSTRACTION ALERT)
A BIVARIATE PERCENTAGE TABLE
“MARGINAL” TOTAL
Frequencies = marginal distribution
- BIVARIATE CROSSTABULATION: REAL DATA
- RECODE X and Y variables so categories = the desired end product
RECODE bchtax (1 thru 2=1) (3 thru 4=2) (ELSE=9) INTO bchtax2.
VARIABLE LABELS bchtax2 'dichotomy'.
value labels bchtax2 1 'Favor tax' 2 'Oppose tax' 9 'other'.
missing values bchtax2 (9) .
RECODE region (1=1) (2=2) (3 thru 7=3) (ELSE=9) INTO region3.
VARIABLE LABELS region3 'region recoded'.
value labels region3 1 'Chicago' 2 'Suburban Cook Co' 3 'Collar Counties'.
missing values region3 (9).
- PERFORM A CROSSTABULATION OF X and Y VARIABLES
ANALYZE / DESCRIPTIVE STATISTICS / CROSSTABS / ROWS region3 / COLUMNS bchtax2 / CELLS row percentage / CONTINUE / OK
region3 region recoded * bchtax2 dichotomy Crosstabulation
% within region3 region recoded
region3 region recoded / bchtax2: 1.00 Favor tax / bchtax2: 2.00 Oppose tax / Total
1.00 Chicago / 54.0% / 46.0% / 100.0%
2.00 Suburban Cook Co / 33.1% / 66.9% / 100.0%
3.00 Collar Counties / 37.0% / 63.0% / 100.0%
Total / 41.8% / 58.2% / 100.0%
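Outside SPSS, the row percentages can be checked by hand. The sketch below is only an illustration in Python: it assumes the favor counts and row totals quoted in the chi square section of these notes (619 of 1147, 354 of 1070, 353 of 954), and the names `counts` and `row_percentages` are made up for this example.

```python
# Favor counts and row totals as quoted in the chi square section of the notes.
# The dictionary layout and function name are illustrative, not SPSS output.
counts = {
    "Chicago": {"favor": 619, "total": 1147},
    "Suburban Cook Co": {"favor": 354, "total": 1070},
    "Collar Counties": {"favor": 353, "total": 954},
}


def row_percentages(table):
    """Return {region: (% favor, % oppose)} plus a 'Total' row (the marginal)."""
    result = {}
    for region, cell in table.items():
        favor = 100.0 * cell["favor"] / cell["total"]
        result[region] = (round(favor, 1), round(100.0 - favor, 1))
    all_favor = sum(cell["favor"] for cell in table.values())
    all_total = sum(cell["total"] for cell in table.values())
    marginal = 100.0 * all_favor / all_total
    result["Total"] = (round(marginal, 1), round(100.0 - marginal, 1))
    return result


for region, (favor, oppose) in row_percentages(counts).items():
    print(f"{region}: {favor}% favor / {oppose}% oppose")
```

Each row is a conditional distribution (%Y given X); the Total row is the marginal distribution.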
SPSS HINT: When you COMPUTE new variables they appear at the end of the list of SPSS variables, not in alphabetical order
A BIVARIATE PERCENTAGE TABLE
PQ Bivariate column chart
Y = %Y given the value of X. Conditional distribution: Y = %Y|X
X = independent variable, place of residence
MAKING A PRESENTATION QUALITY BIVARIATE GRAPH . . .
Highlight the Bivariate percentage table
INSERT / COLUMN / 2-D COLUMN
Delete gridlines
Delete legend
Add data labels
Re-set y-axis metric
Delete x-axis tick marks
Resize fonts
- The theory is right . . . people from Chicago are more likely to support the beach tax (54%) than people from the suburbs (33% and 37%).
- Is it also right to say that people from the Collar Counties are more likely (37%) to support the beach tax than people who live in Suburban Cook County (33%)?
- Well . . . It doesn’t make much sense, and it is probably not a statistically significant difference.
- Statistical Significance I
A. When we talk about statistical significance, we are talking about significance of differences
B. When you say a difference is not statistically significant, that means you think . . .
- It’s small . . . it’s too small
- Smaller than what?
C. The pattern in the data could have arisen by chance, so don’t get all worked up about it
D. OPERATIONAL DEFINITION . . . “Chance” means the PERCENT in each place (each category of X) is the same as the TOTAL PERCENT plus or minus some sampling error
Observed Data
Sampling Error Model (SEM)
for testing statistical significance
Y% does not depend on the value of X
aka . . . Null Hypothesis . . . “no” causal relationship
E. Test of Statistical Significance Means . . . Compare the Observed Data to the SEM (null hypothesis) to see if the data are “statistically significant”
Null Hypothesis =>
conditional %s = marginal %
F. What you get from a significance test is the likelihood of seeing differences at least as large as the ones in your data if the null were true. General rule for policy research: if that likelihood is 5% or less, we treat the null as NOT TRUE (we reject it)
G. So, if the % in favor everywhere is really 42%, what’s the likelihood that differences this large could have arisen from sampling variation alone?
- The Chi Square Test
XTAB observed data
619 / 1147 = 54%
354 / 1070 = 33%
353 / 954 = 37%
1326 / 3171 = 42%
Calculate expected data if null is right
42% * 1147 = 480
42% * 1070 = 447
42% * 954 = 399
42% * 3171 = 1326
Calculate Observed minus Expected
The positive differences show where there are “too many” cases if null is right
Calculate [ (O-E)^2 / E ]
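139 ^ 2 = 19,222.65 (the 139 is rounded for display; the squared value comes from the unrounded difference, about 138.65)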
19,222.65 / 480 = 40
Add it all up SUM [ (O-E)^2 / E ]= 112
Determine degrees of freedom (df) = (# rows – 1) * (# columns – 1) = 2*1 = 2
Look up the “critical value” of chi square, given the df. . .
. . . if chi square is greater than the critical value, then data like these would turn up less than 5% of the time if the null hypothesis were right, which means we treat the null as NOT TRUE
So in this case, chi square = 112 df = 2 critical value = 5.991 112 > 5.991 Null is NOT TRUE
It is very unlikely (less than 5%) that the data arose from sampling variation alone.
THE DIFFERENCES IN ATTITUDE TOWARD THE BEACH TAX ACROSS LOCATIONS ARE STATISTICALLY SIGNIFICANT
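The chi square steps above (expected counts, O minus E, (O-E)^2 / E, sum, df, critical value) can be retraced in a short Python sketch. It assumes the favor counts and row totals listed under "XTAB observed data", with the oppose counts derived as total minus favor. Because it works from whole-number counts rather than the rounded figures on the slides, individual cells can differ slightly from the hand calculation. The function name `chi_square` is just for this illustration.

```python
import math

# Rows: Chicago, Suburban Cook Co, Collar Counties. Columns: favor, oppose.
# Oppose counts are derived here as row total minus favor count.
observed = [
    [619, 1147 - 619],
    [354, 1070 - 354],
    [353, 954 - 353],
]


def chi_square(table):
    """Return (chi square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            # Expected count if the null is right: row total * column share.
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = chi_square(observed)

# For df = 2 only, the chi square upper tail is exp(-x / 2),
# so the 5% critical value has the closed form -2 * ln(0.05).
critical = -2 * math.log(0.05)

print(f"chi square = {stat:.1f}, df = {df}, critical value = {critical:.3f}")
```

The closed-form critical value applies only when df = 2; for other df, use a printed table or SPSS.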
What about the difference between Suburban Cook and the Collar counties?
So in this case, chi square = 3.41 df = 1 critical value = 3.841 3.41 < 3.841 Null is NOT REJECTED
The data could have arisen because of sampling variation.
THE DIFFERENCE BETWEEN SUBURBAN COOK AND THE COLLAR COUNTIES ON ATTITUDE TOWARD THE BEACH TAX IS NOT STATISTICALLY SIGNIFICANT
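The two-row comparison can be checked the same way. This Python sketch assumes the Suburban Cook and Collar Counties counts from the observed data above (354 of 1070 and 353 of 954 in favor) and drops the Chicago row. For df = 1 the likelihood under the null can be written with the standard library's `math.erfc`; the helper name `chi_square` is illustrative.

```python
import math

# Rows: Suburban Cook Co, Collar Counties. Columns: favor, oppose (derived).
observed = [
    [354, 1070 - 354],
    [353, 954 - 353],
]


def chi_square(table):
    """Return (chi square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = sum(
        (o - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, o in enumerate(row)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = chi_square(observed)

# For df = 1 only: P(chi square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(f"chi square = {stat:.2f}, df = {df}, likelihood under the null = {p_value:.3f}")
```

The likelihood comes out a little above 5%, which matches the conclusion that 3.41 falls short of the 3.841 critical value.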
- CHI SQUARE – A BLANKET TEST
Idea: Poor people are more likely to use public transit
THEORY: Income causes mode of transportation
this is what’s in the data set
(NOT PRESENTATION QUALITY)
Let’s say you coded it into these four groups . . .
RECODE tranmt (1=1) (2=1) (6=1) (7=1) (11=1) (10=4) (12=4) (3 thru 5=2) (8 thru 9=3) (ELSE=SYSMIS).
value labels tranmt 1 'private' 2 'public' 3 'bike or walked' 4 'other'.
There’s a nice income variable already there, just needs DK assigned to missing
INCOME . . . missing values inc4gp (7,8,9).
BIVARIATE PERCENTAGE TABLE PQ
BIVARIATE CHARTS (Ugly, but PQ) …
Chi Square = 188 df = 9
But this chi square tests the null hypothesis that ALL OF THE INCOME GROUPS MAKE ALL THE SAME TRANSPORTATION CHOICES i.e., . . .
Chi square is a BLANKET TEST of all possible differences
Likelihood < 5 %
Null is FALSE
All income groups did not make the same transportation choices
- FOCUSING CHI SQUARE TESTS ON SPECIFIC HYPOTHESES
It might be that the differences in PUBLIC transportation use are NOT SIGNIFICANT, even though the overall pattern of differences is significant.
So you need to construct a chi square test that makes use of all the data available, but focuses on the set of differences you are most interested in . . .
Chi square = 99.6 df = 3
Likelihood < 5 %
Null is FALSE
Income groups are not equally likely to use public transportation
- PHI, SENSITIVITY OF CHI SQUARE TEST TO N
THEORY: Smoking is correlated with health status
RECODE helthr (1=1) (2=2) (3=3) (4=3) (7 thru 9=9) INTO helthr3.
VARIABLE LABELS helthr3 '3 categories'.
value labels helthr3 1 'Excellent' 2 'Good' 3 'Fair or Poor'.
missing values helthr3 (7,8,9).
CHI SQUARE rule: can’t have cells with an Expected value < 5.0
Rule of Thumb: recode so there is always 10% or more of the cases in each final category
Rule of thumb: rare categories OK for DEPENDENT VARIABLES only
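The expected-value rule can be checked before running the test. The Python sketch below is an illustration only: it reuses the beach tax counts from earlier in these notes (oppose counts derived as total minus favor), adds a small HYPOTHETICAL table to show a violation, and uses invented helper names.

```python
def expected_counts(table):
    """Expected count per cell if the null is right: row total * col total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def cells_below(table, minimum=5.0):
    """List (row index, column index, expected count) for cells under the minimum."""
    return [
        (i, j, e)
        for i, row in enumerate(expected_counts(table))
        for j, e in enumerate(row)
        if e < minimum
    ]


# Beach tax table from earlier in the notes: favor, oppose (derived).
beach_tax = [[619, 528], [354, 716], [353, 601]]

# HYPOTHETICAL small table with a rare category, for illustration only.
rare_category = [[3, 7], [2, 88]]

print("beach tax problem cells:", cells_below(beach_tax))
print("rare category problem cells:", cells_below(rare_category))
```

When the list is not empty, recoding to merge the rare category is the usual fix before trusting the chi square.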
Chi square = 152 df = 2
152 > 5.99, so null is not true
Smoking is correlated with health status
But the difference is not very big (42% - 32%) = 10 percentage points
Chi square is large because N is large (N = 18,381)
If there were only 500 cases, with all percentage distributions the same, chi sq = 4.13 < 5.99, so the null is not rejected
CHI SQUARE IS SENSITIVE TO SAMPLE SIZE
PHI = SQUARE ROOT [ CHI SQ / N ]
Case 1 . . . n = 18,381 . . . phi = .09 (weak)
Case 2 . . . n = 500 . . . phi = .09 (weak)
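The two cases can be reproduced with a few lines of arithmetic. This Python sketch assumes only the figures quoted above (chi square = 152, N = 18,381, and a smaller sample of 500). It relies on the fact that, with every cell percentage held fixed, each (O-E)^2 / E term grows in direct proportion to N. The function names are invented for this illustration.

```python
import math


def phi(chi_sq, n):
    """PHI = SQUARE ROOT [ CHI SQ / N ]."""
    return math.sqrt(chi_sq / n)


def rescale_chi_square(chi_sq, n_old, n_new):
    """Chi square for the same percentage distributions at a different N."""
    return chi_sq * n_new / n_old


chi_big, n_big = 152, 18381
n_small = 500
chi_small = rescale_chi_square(chi_big, n_big, n_small)

print(f"Case 1: n = {n_big}, chi sq = {chi_big}, phi = {phi(chi_big, n_big):.2f}")
print(f"Case 2: n = {n_small}, chi sq = {chi_small:.2f}, phi = {phi(chi_small, n_small):.2f}")
```

Chi square drops from 152 to about 4.13 when N shrinks, but phi stays put because the N in the denominator cancels the N built into chi square.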
- PUTTING IT ALL TOGETHER
THEORY: Age causes likelihood of unemployment
missing values umemp2 (7,8,9).
NOTE: the variable name in the data set is UMEMP2 (typo in data entry?)
big table
somewhat large N = 2,032
2 expected values = 6.0
Chi sq = 74 df = 9
Phi = .191
Null is not true: the likelihood of being unemployed is not the same in all age groups
Chart/graph . . . have to choose what to focus on because the whole table is unwieldy