Logistic Regression, Testing for Interaction
Example, Influenza Shots
A local health clinic sent fliers to its clients to encourage everyone, but especially older persons at high risk of complications, to get a flu shot for protection against an expected flu epidemic. In a pilot follow-up study, 159 clients were randomly selected and asked whether they actually received a flu shot. A client who received a flu shot was coded Y = 1; and a client who did not receive a flu shot was coded Y = 0. In addition, data were collected on their age (X1) and their health awareness. The latter data were combined into a health awareness index (X2), for which higher values indicate greater awareness. Also included in the data were client gender (X3), with males coded X3 = 1 and females coded X3 = 0.
It is suspected that there may be interactions between predictor variables; for example, the relationship between health awareness and the response variable may be moderated by gender. Hence, we want to test for interaction effects. To do this, we will estimate two logistic regression models, one with interaction terms included and one without.
Assuming that we have already looked at the univariate relationships between Y and each explanatory variable, we will proceed to look for possible interaction effects.
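The reduced model (with only the three explanatory variables) is
$$Y_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3})} + \varepsilon_i,$$
where the i subscript denotes the ith observation in the data set, and $\varepsilon_i$ is a random error term associated with the ith observation.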
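The full model (with interaction terms included) is
$$Y_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i1}X_{i2} + \beta_5 X_{i1}X_{i3} + \beta_6 X_{i2}X_{i3})}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i1}X_{i2} + \beta_5 X_{i1}X_{i3} + \beta_6 X_{i2}X_{i3})} + \varepsilon_i.$$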
We want to test whether there are interaction effects present.
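Step 1: H0: $\beta_4 = \beta_5 = \beta_6 = 0$ vs. HA: Not all of $\beta_4, \beta_5, \beta_6$ (the coefficients of the three interaction terms) equal 0.
Step 2: We have n = 159, $\alpha$ = 0.05.
Step 3: The test statistic is the likelihood ratio chi-square statistic $G^2 = -2[\ln L(R) - \ln L(F)]$, where $\ln L(F)$ is the maximum of the log-likelihood function for the model with the interaction terms as well as the predictors, and $\ln L(R)$ is the maximum of the log-likelihood function for the model with just the predictor variables. SAS reports each of these quantities as -2 Log L. Under the null hypothesis, the statistic has approximately a chi-square distribution with d.f. = 3.
Step 4: We will reject the null hypothesis if $G^2 > \chi^2(0.95; 3) = 7.815$.
Step 5: From the output, we find $G^2$ = 105.093 - 104.994 = 0.099.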
Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have sufficient evidence to conclude that the interaction terms need to be included in the model.
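The arithmetic in Steps 3 through 6 can be checked with a short Python sketch. It is an illustration only and not part of the SAS analysis: the helper name chi2_sf_df3 is hypothetical, and it uses the closed-form upper-tail probability that holds for a chi-square distribution with exactly 3 degrees of freedom, so only the standard library is needed. The two -2 Log L values are copied from the SAS output.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 d.f."""
    # Closed form that is valid only for 3 degrees of freedom.
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

# -2 Log L values ("Intercept and Covariates" column) from the SAS output.
neg2_log_l_reduced = 105.093   # x1, x2, x3
neg2_log_l_full = 104.994      # x1, x2, x3 plus the three interaction terms

g2 = neg2_log_l_reduced - neg2_log_l_full
p_value = chi2_sf_df3(g2)

print(f"G2 = {g2:.3f}, p-value = {p_value:.3f}")
```

The p-value is about 0.99, far above 0.05, which agrees with the decision in Step 6.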
If we had rejected the null hypothesis, then we could have used a follow-up procedure, such as stepwise selection, to find which explanatory variables and which interaction terms would need to be included in the model. If we find that a particular interaction term is significant, then we would also want to include both of its component explanatory variables in our final model.
(Note: In this case, if we were to perform stepwise regression, we would find that only two of the predictors, Age and Health Awareness, would need to be included in our final model.)
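The estimated model is therefore (from the output of the third PROC LOGISTIC, which includes only Age and Health Awareness):
$$\hat{\pi} = \frac{\exp(-1.4578 + 0.0779 X_1 - 0.0955 X_2)}{1 + \exp(-1.4578 + 0.0779 X_1 - 0.0955 X_2)},$$ or
(1) $\ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -1.4578 + 0.0779 X_1 - 0.0955 X_2$.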
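To see how the fitted model with Age and Health Awareness turns predictor values into an estimated probability, the following Python sketch evaluates it for one client. It is an illustration under stated assumptions: the coefficients are the estimates reported in the SAS output for the model with x1 and x2, the function name predicted_probability is hypothetical, and the example client is simply the first record in the appendix (Age 59, Health Awareness Index 52).

```python
import math

# Estimates from the SAS output for the model with x1 (Age) and x2 (Health Awareness).
B0, B1, B2 = -1.4578, 0.0779, -0.0955

def predicted_probability(age, awareness):
    """Estimated probability of a flu shot, obtained by inverting the logit."""
    logit = B0 + B1 * age + B2 * awareness
    return 1 / (1 + math.exp(-logit))

# First record in the appendix: Age 59, Health Awareness Index 52.
p_hat = predicted_probability(59, 52)
print(f"Estimated probability of a flu shot: {p_hat:.3f}")
```

For this client the estimated probability is roughly 0.14. That client is recorded in the appendix as not having received a flu shot.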
We also want to check the assumption that the logit is linear in each of the (continuous) predictor variables. There are several ways to do this. One way is a rather tedious, graphical approach, involving grouped data. A simpler approach is to test whether we need to include nonlinear terms in the model. To do this, we will use the final model above as our reduced model, and add quadratic terms in each of the two explanatory variables, so that the full model is
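$$Y_i = \frac{\exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}^2 + \beta_4 X_{i2}^2)}{1 + \exp(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1}^2 + \beta_4 X_{i2}^2)} + \varepsilon_i.$$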
We want to test whether these quadratic terms are needed. The last PROC LOGISTIC in the SAS program below estimates the model with the quadratic terms included.
Step 1: H0: $\beta_3 = \beta_4 = 0$ vs. HA: Not both of $\beta_3, \beta_4$ (the coefficients of the two quadratic terms) equal 0.
Step 2: We have n = 159, $\alpha$ = 0.05.
Step 3: The test statistic is the likelihood ratio chi-square statistic $G^2 = -2[\ln L(R) - \ln L(F)]$, where $\ln L(F)$ is the maximum of the log-likelihood function for the model with the quadratic terms as well as the predictors, and $\ln L(R)$ is the maximum of the log-likelihood function for the model with just the predictor variables. SAS reports each of these quantities as -2 Log L. Under the null hypothesis, the statistic has approximately a chi-square distribution with d.f. = 2.
Step 4: We will reject the null hypothesis if $G^2 > \chi^2(0.95; 2) = 5.991$.
Step 5: From the output, we find $G^2$ = 105.795 - 104.706 = 1.089.
Step 6: We fail to reject the null hypothesis at the 0.05 level of significance. We do not have sufficient evidence to conclude that the quadratic terms need to be included in the model.
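The same check can be sketched in Python for the quadratic terms. With 2 degrees of freedom the chi-square upper-tail probability is simply exp(-x/2), so both the critical value and the p-value have closed forms. The variable names are illustrative, and the -2 Log L values are copied from the SAS output.

```python
import math

ALPHA = 0.05

# For a chi-square variable with 2 d.f., P(X > x) = exp(-x / 2),
# so the critical value solves exp(-x / 2) = ALPHA.
critical_value = -2 * math.log(ALPHA)

# -2 Log L values ("Intercept and Covariates" column) from the SAS output.
neg2_log_l_reduced = 105.795   # x1, x2
neg2_log_l_full = 104.706      # x1, x2, x1sq, x2sq

g2 = neg2_log_l_reduced - neg2_log_l_full
p_value = math.exp(-g2 / 2)
reject_h0 = g2 > critical_value

print(f"G2 = {g2:.3f}, critical value = {critical_value:.3f}, "
      f"p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

The statistic falls well below the critical value of about 5.991, so the sketch reaches the same conclusion as Step 6.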
Our final model therefore is given by Equation (1) above.
The estimate of the regression slope for Age is $b_1 = 0.0779$, with a standard error of 0.0297. Thus, an approximate 95% confidence interval estimate for the slope is $0.0779 \pm 1.96(0.0297)$, or (0.0197, 0.1361). Since Age is a continuous variable, the odds ratio for a one-year increase in Age is not very interesting. Instead, we will calculate a 95% confidence interval estimate of the odds ratio for an increase in Age of 5 years. The point estimate of the odds ratio is $\exp(5 \times 0.0779) = 1.4762$, and a 95% confidence interval estimate, obtained by exponentiating 5 times the unrounded limits for the slope, is (1.1034, 1.9750). We are 95% confident that, for this population, the odds of having had a flu shot are multiplied by a factor between 1.1034 and 1.9750 for each 5-year increase in Age, with health awareness held fixed.
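The interval for the 5-year odds ratio can be reproduced with a few lines of Python. This is a minimal sketch, assuming the Wald interval with the normal quantile 1.96; the slope estimate and standard error for Age are taken from the SAS output for the model with x1 and x2, and the variable names are illustrative.

```python
import math

b1 = 0.0779        # estimated slope for Age (x1), from the SAS output
se_b1 = 0.0297     # its standard error, from the SAS output
z = 1.96           # approximate 0.975 quantile of the standard normal
c = 5              # size of the Age increase, in years

# Wald confidence limits for the slope.
slope_lower = b1 - z * se_b1
slope_upper = b1 + z * se_b1

# Odds ratio for a c-year increase: exponentiate c times the slope.
or_point = math.exp(c * b1)
or_lower = math.exp(c * slope_lower)
or_upper = math.exp(c * slope_upper)

print(f"Slope CI: ({slope_lower:.4f}, {slope_upper:.4f})")
print(f"Odds ratio for a {c}-year increase: {or_point:.4f} "
      f"({or_lower:.4f}, {or_upper:.4f})")
```

Exponentiating after multiplying by 5 is what distinguishes this interval from the one-year odds ratio limits that SAS prints by default.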
The SAS program for conducting data analysis is given below, followed by the output.
SAS Program
proc format;
   value difmt 0 = "No "
               1 = "Yes";
   value sexfmt 0 = "Female"
                1 = "Male ";
run;
data flushot;
input y x1 x2 x3;
x1x2 = x1*x2;
x1x3 = x1*x3;
x2x3 = x2*x3;
x1sq = x1**2;
x2sq = x2**2;
label y = "Flu Shot?"
x1 = "Age in Years"
x2 = "Health Awareness Index"
x3 = "Gender"
x1x2 = "Interaction of Age with Health Awareness"
x1x3 = "Interaction of Age with Gender"
x2x3 = "Interaction of Health Awareness with Gender"
x1sq = "Square of Age in Years"
x2sq = "Square of Health Awareness";
format y difmt. x3 sexfmt.;
cards;
The data set is listed in the appendix.
;
proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x3;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age, Health Awareness, and Gender";
run;
proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x3 x1x2 x1x3 x2x3;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3 "With Interaction Terms Included";
run;
proc logistic;
   model y (order=formatted event='Yes') = x1 x2;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3;
run;
proc logistic;
   model y (order=formatted event='Yes') = x1 x2 x1sq x2sq;
   title "Multiple Logistic Regression of Flu Shot";
   title2 "Against Age and Health Awareness";
   title3 "With Quadratic Terms Included";
run;
Output of SAS Program
Multiple Logistic Regression of Flu Shot
Against Age, Health Awareness, and Gender
The LOGISTIC Procedure
Model Information
Data Set WORK.FLUSHOT
Response Variable y Flu Shot?
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 159
Number of Observations Used 159
Response Profile
Ordered Value   y     Total Frequency
1               No    135
2               Yes   24
Probability modeled is y='Yes'.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         136.941          113.093
SC          140.010          125.369
-2 Log L    134.941          105.093
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 29.8476 3 <.0001
Score 27.0173 3 <.0001
Wald 19.9803 3 0.0002
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.1772    2.9824           0.1558            0.6930
x1          1    0.0728     0.0304           5.7401            0.0166
x2          1    -0.0990    0.0335           8.7419            0.0031
x3          1    0.4339     0.5218           0.6917            0.4056
Odds Ratio Estimates
Effect   Point Estimate   95% Wald Lower Limit   95% Wald Upper Limit
x1       1.076            1.013                  1.141
x2       0.906            0.848                  0.967
x3       1.543            0.555                  4.291
Association of Predicted Probabilities and Observed Responses
Percent Concordant 82.1 Somers' D 0.644
Percent Discordant 17.7 Gamma 0.645
Percent Tied 0.2 Tau-a 0.166
Pairs 3240 c 0.822
Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness
With Interaction Terms Included
The LOGISTIC Procedure
Model Information
Data Set WORK.FLUSHOT
Response Variable y Flu Shot?
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 159
Number of Observations Used 159
Response Profile
Ordered Value   y     Total Frequency
1               No    135
2               Yes   24
Probability modeled is y='Yes'.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         136.941          118.994
SC          140.010          140.476
-2 Log L    134.941          104.994
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 29.9472 6 <.0001
Score 32.4819 6 <.0001
Wald 19.4560 6 0.0035
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    2.6164     13.5486          0.0373            0.8469
x1          1    0.0138     0.2007           0.0047            0.9452
x2          1    -0.1650    0.2490           0.4394            0.5074
x3          1    0.3907     6.0739           0.0041            0.9487
x1x2        1    0.00103    0.00371          0.0773            0.7809
x1x3        1    0.00588    0.0615           0.0091            0.9238
x2x3        1    -0.00634   0.0679           0.0087            0.9255
Odds Ratio Estimates
Effect   Point Estimate   95% Wald Lower Limit   95% Wald Upper Limit
x1       1.014            0.684                  1.503
x2       0.848            0.520                  1.381
x3       1.478            <0.001                 >999.999
x1x2     1.001            0.994                  1.008
x1x3     1.006            0.892                  1.135
x2x3     0.994            0.870                  1.135
Association of Predicted Probabilities and Observed Responses
Percent Concordant 82.5 Somers' D 0.653
Percent Discordant 17.2 Gamma 0.656
Percent Tied 0.4 Tau-a 0.168
Pairs 3240 c 0.827
Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness
The LOGISTIC Procedure
Model Information
Data Set WORK.FLUSHOT
Response Variable y Flu Shot?
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 159
Number of Observations Used 159
Response Profile
Ordered Value   y     Total Frequency
1               No    135
2               Yes   24
Probability modeled is y='Yes'.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         136.941          111.795
SC          140.010          121.002
-2 Log L    134.941          105.795
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 29.1454 2 <.0001
Score 26.7071 2 <.0001
Wald 19.8291 2 <.0001
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -1.4578    2.9153           0.2500            0.6170
x1          1    0.0779     0.0297           6.8761            0.0087
x2          1    -0.0955    0.0324           8.6786            0.0032
Odds Ratio Estimates
Effect   Point Estimate   95% Wald Lower Limit   95% Wald Upper Limit
x1       1.081            1.020                  1.146
x2       0.909            0.853                  0.969
Association of Predicted Probabilities and Observed Responses
Percent Concordant 80.7 Somers' D 0.618
Percent Discordant 18.9 Gamma 0.620
Percent Tied 0.4 Tau-a 0.159
Pairs 3240 c 0.809
Multiple Logistic Regression of Flu Shot
Against Age and Health Awareness
With Quadratic Terms Included
The LOGISTIC Procedure
Model Information
Data Set WORK.FLUSHOT
Response Variable y Flu Shot?
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 159
Number of Observations Used 159
Response Profile
Ordered Value   y     Total Frequency
1               No    135
2               Yes   24
Probability modeled is y='Yes'.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         136.941          114.706
SC          140.010          130.050
-2 Log L    134.941          104.706
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 30.2348 4 <.0001
Score 34.2112 4 <.0001
Wald 19.4995 4 0.0006
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    0.2193     14.2594          0.0002            0.9877
x1          1    0.2296     0.4052           0.3210            0.5710
x2          1    -0.3518    0.2638           1.7780            0.1824
x1sq        1    -0.00112   0.00303          0.1363            0.7120
x2sq        1    0.00238    0.00236          1.0171            0.3132
Odds Ratio Estimates
Effect   Point Estimate   95% Wald Lower Limit   95% Wald Upper Limit
x1       1.258            0.569                  2.784
x2       0.703            0.419                  1.180
x1sq     0.999            0.993                  1.005
x2sq     1.002            0.998                  1.007
Association of Predicted Probabilities and Observed Responses
Percent Concordant 81.2 Somers' D 0.629
Percent Discordant 18.3 Gamma 0.632
Percent Tied 0.4 Tau-a 0.162
Pairs 3240 c 0.815
Appendix: Flu Shot Data
0 59 52 0
0 61 55 1
1 82 51 0
0 51 70 0
0 53 70 0
0 62 49 1
0 51 69 1
0 70 54 1
0 71 65 1
0 55 58 1
0 58 48 0
0 53 58 1
0 72 65 0
0 56 68 0
0 56 83 0
0 81 68 0
0 62 44 0
0 49 70 0
0 56 69 1
0 50 74 0
0 53 57 0
0 56 64 1
0 56 67 1
0 50 83 1
0 52 48 1
0 52 81 0
0 67 53 1
0 51 61 0
0 70 51 0
0 64 51 0
0 61 65 1
0 53 51 0
0 77 54 1
0 73 64 1
0 67 69 0
0 50 71 0
1 80 38 0
1 75 51 0
0 65 54 1
0 60 59 1
1 68 57 1
0 61 63 0
1 62 48 0
0 53 58 0
0 72 56 0
0 54 59 0
1 59 75 0
0 61 48 0
0 50 79 1
0 48 66 0
0 52 57 1
0 54 68 0
0 62 48 0
0 71 60 0
1 65 63 0
0 49 61 0
0 58 57 0
0 62 69 0
0 69 38 1
1 56 50 1
1 76 45 1
0 51 72 0
0 64 51 0
0 57 62 1
0 51 81 0
0 81 55 1
0 50 77 0
0 64 65 1
0 64 53 1
1 59 49 1
0 53 65 0
0 63 58 0
0 59 60 1
1 70 57 1
0 72 37 0
0 68 49 0
0 75 55 1
0 57 60 0
0 67 57 1
0 59 56 1
0 55 58 0
1 75 64 1
0 66 51 0
0 67 59 0
0 59 61 0
0 78 49 1
0 59 49 0
0 68 55 0
1 59 61 1
0 68 50 1
1 78 47 1
0 55 73 1
1 71 45 1
0 51 45 0
0 65 59 0
0 54 61 1
0 79 52 0
0 64 50 0
1 82 46 1
0 64 67 0
0 70 56 1
1 59 50 0
0 59 56 1
0 63 61 1
0 48 74 0
0 61 78 0
0 51 68 0
0 48 71 0
1 71 58 1
0 51 57 0
0 57 51 1
0 49 74 0
0 67 56 1
0 73 57 0
0 73 65 0
0 56 47 0
0 48 69 1
0 50 71 0
0 50 76 1
0 66 60 1
0 53 75 1
0 50 65 1
1 51 42 0
0 68 66 1
1 72 49 1
0 51 58 1
0 62 61 1
0 60 55 0
0 67 60 1
0 70 54 1
0 55 63 1
0 66 56 0
0 65 59 1
0 84 52 1
0 58 63 0
1 68 57 1
0 51 59 1
0 67 53 1
0 52 67 0
0 68 62 0
0 76 63 1
0 54 62 1
0 50 52 1
0 63 58 0
0 77 49 1
0 60 65 1
0 51 55 0
0 51 60 1
0 66 51 1
0 52 67 0
0 66 64 1
0 56 55 1
0 49 58 0
0 67 66 0
0 57 64 1
0 56 66 0
1 76 22 1
1 68 32 0
1 73 56 1