A. Data Verdict;

HW8 solution

8.4

a. data verdict;

input v $ d $ p $ count;

datalines;

white white yes 53

white white no 414

white black yes 11

white black no 37

black white yes 0

black white no 16

black black yes 4

black black no 139

;

data verdict1;

set verdict;

count = count + 0.0001;

run;

proccatmoddata=verdict1;

weight count;

model d*v*p= _response_/param=ref;

loglin d|v|p@2;

run;

procgenmoddata=verdict1;

class v d p;

model count=v|d|p@2/dist = poi link = log;

outputout=verdictpred pred=fitted;

run;

procfreqdata=verdictpred;

weight fitted;

table v*d*p/relrisk;

run;

The odds ratio between D and P at each level of V is 0.42. It means that the odds of a no death penalty verdict for a black defendant is 0.42 times the odds for a white defendant for given race of the victim. Therefore, black defendants tend to get more death penalty verdict.

b. procsql;

createtable marginal as

select

d, p, sum (count) as count, sum(fitted) as fitted

from verdictpred

groupby d, p;

procfreqdata=marginal;

weight count;

table d*p/relrisk;

run;

procfreqdata=marginal;

weight fitted;

table d*p/relrisk;

run;

Both odds ratio are 1.45. They are equal because the loglinear model (DV, DP, PV) is a saturated model for the variables D and P. If we consider the marginal data formed by D and P, the loglinear model fit is exactly the same as the sample marginal table. This odds ratio 1.45 is greater than 1, however, if we look at the odds ratio conditional on each level of the victim's race, it is 0.42 which is less than 1. The Simpson's paradox occurs because the large imbalance between the sample sizes of the two victim's races. They are 515 white victims and 159 black victims. When aggregating data over victim's race, the imbalance in sample sizes can cause the association between D and P to reverse.

c. procgenmoddata=verdict;

class d v;

freq count;

modelp=d v/scale=1dist=bin link=logit aggregate=(d v);

run;

From the logistic model,

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 1 0.3781 0.3781

Scaled Deviance 1 0.3781 0.3781

Pearson Chi-Square 1 0.1976 0.1976

Scaled Pearson X2 1 0.1976 0.1976

Standard Wald 95% Confidence Wald

Parameter DF Estimate Error Limits Chi-Square Pr > ChiSq

Intercept 1 2.0595 0.1458 1.7736 2.3453 199.40 <.0001

d black 1 -0.8678 0.3671 -1.5872 -0.1483 5.59 0.0181

v black 1 2.4044 0.6006 1.2273 3.5816 16.03 <.0001

From the loglinear model

Criteria For Assessing Goodness Of Fit

Criterion DF Value Value/DF

Deviance 1 0.3798 0.3798

Scaled Deviance 1 0.3798 0.3798

Pearson Chi-Square 1 0.1978 0.1978

Scaled Pearson X2 1 0.1978 0.1978

Standard Wald 95% Confidence Wald

Parameter DF Estimate Error Limits Chi-Square Pr > ChiSq

p no 1 2.0595 0.1458 1.7736 2.3453 199.40 <.0001

v*p black no 1 2.4044 0.6006 1.2272 3.5816 16.03 <.0001

d*p black no 1 -0.8678 0.3671 -1.5872 -0.1483 5.59 0.0181

d. procgenmoddata=verdict;

class d v;

freq count;

modelp=v/scale=1dist=bin link=logit aggregate=(d v);

run;

procgenmoddata=verdict1;

class v d p;

model count=v|p d|v/dist = poi link = log;

run;

The G2 statistics for the reduced model with D removed is 5.45 on 2 df which has p-value 0.067. Therefore the reduced model is barely acceptable. The reduced model means D and P are conditionally independent given V. The corresponding log linear model is shown in the second genmod procedure above.

8.5

a. data accident;

input safety $ ejected $ injury $ count;

datalines;

belt yes nonfatal 1105

belt yes fatal 14

belt no nonfatal 411111

belt no fatal 483

none yes nonfatal 4624

none yes fatal 497

none no nonfatal 157342

none no fatal 1008

;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety|ejected|injury;

odsoutput predictedfreqs=tmp1 anova=temp1(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp1;

data temp1;

merge temp1 d;

model = 'safety|ejected|injury';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety|ejected|injury@2;

odsoutput predictedfreqs=tmp2 anova=temp2(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp2;

data temp2;

merge temp2 d;

model = 'safety|ejected|injury@2';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety|ejected ejected|injury;

odsoutput predictedfreqs=tmp3 anova=temp3(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp3;

data temp3;

merge temp3 d;

model = 'safety|ejected ejected|injury';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety|injury ejected|injury;

odsoutput predictedfreqs=tmp4 anova=temp4(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp4;

data temp4;

merge temp4 d;

model = 'safety|injury ejected|injury';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety|injury ejected|safety;

odsoutput predictedfreqs=tmp5 anova=temp5(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp5;

data temp5;

merge temp5 d;

model = 'safety|injury ejected|safety';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety|injury ejected;

odsoutput predictedfreqs=tmp6 anova=temp6(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp6;

data temp6;

merge temp6 d;

model = 'safety|injury ejected';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety injury|ejected;

odsoutput predictedfreqs=tmp7 anova=temp7(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp7;

data temp7;

merge temp7 d;

model = 'safety injury|ejected';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety|ejected injury;

odsoutput predictedfreqs=tmp8 anova=temp8(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp8;

data temp8;

merge temp8 d;

model = 'safety|ejected injury';

run;

proccatmoddata=accident;

weight count;

model safety*ejected*injury= _response_/pred=freq ;

loglin safety ejected injury;

odsoutput predictedfreqs=tmp9 anova=temp9(where=(source='Likelihood Ratio'));

run;

procsql;

createtable d as

select sum( abs(obsfunction-predfunction) ) / (2*sum(obsfunction)) as d

from tmp9;

data temp9;

merge temp9 d;

model = 'safety ejected injury';

run;

data combo;

set temp1 temp2 temp3 temp4 temp5 temp6 temp7 temp8 temp9;

rename chisq=G2;

rename probchisq=P_value;

run;

procprintdata=combo noobs;

var model G2 df P_value d;

run;

Model G2 DF P_value d

safety|ejected|injury . 0 . 0.000000

safety|ejected|injury@2 2.85 1 0.0911 0.000048

safety|ejected ejected|injury 1144.64 2 <.0001 0.002330

safety|injury ejected|injury 7133.98 2 <.0001 0.010833

safety|injury ejected|safety 1680.41 2 <.0001 0.001605

safety|injury ejected 9557.06 3 <.0001 0.011660

safety injury|ejected 9021.29 3 <.0001 0.013661

safety|ejected injury 3567.72 3 <.0001 0.003288

safety ejected injury 11444.38 4 <.0001 0.014028

The above table suggests that based on P-value, the model of choice is safety|ejected|injury@2 which is a homogeneous association model. That is, given any variable, the other two variables are independent.

b. procgenmoddata=accident;

freq count;

class safety ejected injury;

model injury=safety ejected/dist=bin link=logit;

run;

c. The dissimilarity index for the homogeneous model is 0.000048, meaning that the sample data follow the model pattern very closely. Actually the mutual independent model has a dissimilarity index 0.014 which is also acceptable.