Ashley Viola
Propensity Score Analysis of Berkeley Birthweight Data
Sample size = 954 cases
We wish to look at relationship between maternal smoking and infant birthweight. Analyses of this data were conducted using the ‘PSAgraphics’ package.
First we wish to estimate a propensity score model:
Model1: The first model includes gestation, mothers age, mothers height, mothers weight, mothers education, mothers race and fathers race. These variables were chosen due to likely influence on infant birthweight.Significant variables include gestation, height, weight, education, and race.
> model1=glm(smoke~gestation + age + height + weight + ed + racer + dracer, data=birthwt, family=binomial(link="logit"))
> summary(model1)
Call:
glm(formula = smoke ~ gestation + age + height + weight + ed +
racer + dracer, family = binomial(link = "logit"), data = birthwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6569 -0.9942 -0.7863 1.2448 2.0796
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.161304 2.596777 1.217 0.223454
gestation -0.023096 0.006705 -3.445 0.000572 ***
age -0.013750 0.012423 -1.107 0.268365
height 0.081698 0.032510 2.513 0.011970 *
weight -0.009602 0.004090 -2.348 0.018886 *
ed -0.212412 0.049477 -4.293 1.76e-05 ***
racer -0.134964 0.051618 -2.615 0.008931 **
dracer 0.072149 0.050235 1.436 0.150933
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1273.2 on 953 degrees of freedom
Residual deviance: 1222.3 on 946 degrees of freedom
AIC: 1238.3
Number of Fisher Scoring iterations: 4
Model2: The second model will eliminate mothers age. Father’s race is still insignificant.
> model2=glm(smoke~gestation + height + weight + ed + racer + dracer, data=birthwt, family=binomial(link="logit"))
> summary(model2)
Call:
glm(formula = smoke ~ gestation + height + weight + ed + racer +
dracer, family = binomial(link = "logit"), data = birthwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6111 -0.9906 -0.7922 1.2541 2.0935
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.619434 2.551084 1.027 0.304518
gestation -0.022850 0.006697 -3.412 0.000645 ***
height 0.085315 0.032362 2.636 0.008383 **
weight -0.010460 0.004029 -2.596 0.009429 **
ed -0.220629 0.048886 -4.513 6.39e-06 ***
racer -0.136397 0.051556 -2.646 0.008154 **
dracer 0.073483 0.050158 1.465 0.142912
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1273.2 on 953 degrees of freedom
Residual deviance: 1223.5 on 947 degrees of freedom
AIC: 1237.5
Number of Fisher Scoring iterations: 4
Model 3: The third model will take out fathers race.
> model3=glm(smoke~gestation + height + weight + ed + racer, data=birthwt, family=binomial(link="logit"))
> summary(model3)
Call:
glm(formula = smoke ~ gestation + height + weight + ed + racer,
family = binomial(link = "logit"), data = birthwt)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4951 -0.9936 -0.7928 1.2507 1.9285
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.883201 2.544152 1.133 0.257102
gestation -0.023343 0.006691 -3.489 0.000485 ***
height 0.082528 0.032257 2.558 0.010514 *
weight -0.009725 0.003987 -2.440 0.014706 *
ed -0.224383 0.048812 -4.597 4.29e-06 ***
racer -0.072104 0.026079 -2.765 0.005695 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1273.2 on 953 degrees of freedom
Residual deviance: 1225.7 on 948 degrees of freedom
AIC: 1237.7
Number of Fisher Scoring iterations: 4
Model #3 seem like the best fitting model. Variables include gestation, height, weight, race, and education.
Next step is to create strata based on propensity score. Each stratum will consist of 20% of the sample.
> pscore=model3$fitted
> strata=cut(pscore, quantile(pscore,seq(0,1,1/5)), include.lowest=TRUE, labels=FALSE)
> table(strata)
strata
1 2 3 4 5
191 191 190 191 191
Now that we have strata defined based on propensity score we would like to assess the balance within each strata for each covariate between mothers who smoked and mothers who did not smoke.
For continuous variables gestation, height and weight, we can use box.psa which generates a graphic that shows pairs of boxplots of the smoker and non-smoker groups across all strata.
> box.psa(gestation, smoke, strata, ylab="gestation")
Each stratum is reasonably balanced as the distribution of gestation for smokers and non-smokers are similar for each stratum. The distribution also does not change dramatically across strata looking at the size of the smokers groups below the box plots.
Similar results are shows for height and weight below.
> box.psa(height, smoke, strata, ylab="height")
> box.psa(weight, smoke, strata, ylab="weight")
For categorical variables education and race, cat.psa can be used to graphically depict the balance across strata for smokers and non-smokers.
> cat.psa(ed,smoke,strata, xlab="strata", ylab="Proportion for 'education'", barnames=c("no smoke","smoke"))
$`treatment:stratum.proportions`
0:1 1:1 0:2 1:2 0:3 1:3 0:4 1:4 0:5 1:5
0 0.000 0.000 0.000 0.000 0.000 0.013 0.019 0.000 0.034 0.000
1 0.021 0.067 0.054 0.098 0.096 0.118 0.132 0.165 0.258 0.382
2 0.103 0.089 0.246 0.115 0.368 0.289 0.594 0.529 0.596 0.559
3 0.027 0.067 0.038 0.049 0.070 0.105 0.019 0.118 0.056 0.020
4 0.370 0.289 0.315 0.393 0.386 0.289 0.132 0.141 0.045 0.039
5 0.459 0.489 0.331 0.344 0.079 0.171 0.104 0.047 0.011 0.000
7 0.021 0.000 0.015 0.000 0.000 0.013 0.000 0.000 0.000 0.000
Every 2 columns represents a stratum, comparing non-smoker (0) to smoker (1).
It appears that education is well balanced between smokers and non-smokers across strata.
> cat.psa(racer,smoke,strata, xlab="strata", ylab="Proportion for 'race'", barnames=c("no smoke","smoke"))
$`treatment:stratum.proportions`
0:1 1:1 0:2 1:2 0:3 1:3 0:4 1:4 0:5 1:5
1 0.445 0.356 0.662 0.705 0.781 0.789 0.821 0.847 0.944 0.951
6 0.034 0.022 0.023 0.033 0.035 0.026 0.066 0.000 0.011 0.010
7 0.281 0.489 0.238 0.197 0.175 0.184 0.085 0.141 0.045 0.039
8 0.123 0.111 0.038 0.033 0.000 0.000 0.028 0.000 0.000 0.000
9 0.075 0.000 0.031 0.000 0.009 0.000 0.000 0.012 0.000 0.000
10 0.041 0.022 0.008 0.033 0.000 0.000 0.000 0.000 0.000 0.000
Similarly for race the strata are well balanced. Except for maybe level ‘9’(?).
Having examined the covariate difference from stratum to stratum, we can now compare outcome measures for the treatment groups across strata using the loess.psa function and the circ.psa function. First is the loess plot.
Separate loess regression lines are shown for smokers and non-smokers. It can be seen that for most of the left side of the propensity score distribution the distance between smoker and non-smoker is about the same. At the upper levels of the propensity score distribution estimated birthweights appear to widen and decrease for both smokers and non-smokers. Since the regression lines do not cross this indicates a similar effect across propensity score levels.
Lastly, is the circ.psa plot. The size of the circles represents the size of each stratum. The center to each stratum’s circle corresponds to outcome means for the smoker and non-smoker groups for the stratum. The graphic shows that since all circles are below the Y=X line this indicates that the outcome mean in the non-smoker groups is larger than in the smoker group. So, birthweight is higher for mothers who did not smoke across all stratum. The graphic below shows that all circles lie on the same side of the identity line indicating a similar direction of effects across all strata, similar to the loess plot above. The distribution of effects across strata is shown by the x’s.
> circ.psa(bwtt, smoke, strata, xlab="nonsmoker", ylab="smoker")
$summary.strata
n.0 n.1 means.0 means.1
1 146 45 123.9452 116.9111
2 130 61 123.1692 115.4098
3 114 76 125.1754 119.5526
4 106 85 126.0283 113.3294
5 89 102 119.8090 111.1961
$`0.wtd.Mn`
[1] 123.6238
$`1.wtd.Mn`
[1] 115.2753
$dae
[1] -8.348473
$se.wtd
[1] 1.148378
$approx.t
[1] -7.269795
$df
[1] 944
$CI.95
[1] -10.602143 -6.094804