Chapter 5-21. Regression Post Tests
In this chapter, we will discuss statistical comparisons made after one or more regression models have been fitted. These are called post tests, and can be used to combine regression coefficients in the same model, compare coefficients across models, and compare model fit across models.
Simultaneous Comparison of Groups
We did this in Chapter 6, and will cover it quickly here. Opening the rmr dataset in Stata,
FileOpen
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on rmr.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
rmr.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use rmr, clear
This dataset comes from Nawata et al (2004)[on course CD]. The data were taken from the authors’ Figure 1, a scatterplot, and so only approximate the actual values used by the authors.
File rmr.dta Codebook
group urinary excretion of albumin group (U-Alb)
a = U-Alb < 30 mg/d
b = 30 mg/d ≤ U-Alb ≤ 300 mg/d
c = 300 mg/d < U-Alb
lbm lean body mass (kg)
rmr resting metabolic rate (kJ/h/m2)
The group variable is a string variable with values “a”, “b”, and “c”.
Creating indicator variables and checking our work,
______
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.
gen groupa = cond(group=="a",1,0)gen groupb = cond(group=="b",1,0)
gen groupc = cond(group=="c",1,0)
tab group groupa
tab group groupb
tab group groupc
| groupa
group | 0 1 | Total
------+------+------
a | 0 12 | 12
b | 10 0 | 10
c | 12 0 | 12
------+------+------
Total | 22 12 | 34
| groupb
group | 0 1 | Total
------+------+------
a | 12 0 | 12
b | 0 10 | 10
c | 12 0 | 12
------+------+------
Total | 24 10 | 34
| groupc
group | 0 1 | Total
------+------+------
a | 12 0 | 12
b | 10 0 | 10
c | 0 12 | 12
------+------+------
Total | 22 12 | 34
We see that the indicator variables were correctly set up.
We want to test the hypothesis that the three group means are the same:
Even better, we want to do this adjusting for lbm, which is a potential confounding variable:
Fitting the linear regression model, with group a as the referent,
StatisticsLinear models and related
Linear regression
Dependent variable: rmr
Independent variables: groupb groupc lbm
OK
regress rmr groupb groupc lbm
Source | SS df MS Number of obs = 34
------+------F( 3, 30) = 9.18
Model | 8798.55426 3 2932.85142 Prob > F = 0.0002
Residual | 9588.38691 30 319.612897 R-squared = 0.4785
------+------Adj R-squared = 0.4264
Total | 18386.9412 33 557.180036 Root MSE = 17.878
------
rmr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------+------
groupb | -3.410606 7.654802 -0.45 0.659 -19.0438 12.22258
groupc | 29.91667 7.305324 4.10 0.000 14.9972 44.83613
lbm | .5454562 .3431357 1.59 0.122 -.1553203 1.246233
_cons | 112.2196 15.87451 7.07 0.000 79.79954 144.6397
------
We see that there is a significant difference between group c and group a, while controlling for lbm, where the adjusted means shown here are:
Group a = _cons = 112.2196
Group b = _cons + groupb = 112.2196 – 3.410606 = 108.80899
Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627
We can get these adjusted group means using, setting lbm = 0, using
StatisticsPostestimation
Adjusted means and proportions
Main tab: Compute and display predictions for each level of variables: group
Variables to be set to there overall mean value: <leave blank this time>
Variables to be set to a specified value: Variable: lbm Value: 0
Options tab: Confidence interval or prediction intervals
OK
adjust lbm=0, by(group) xb ci
------
Dependent variable: rmr Command: regress
Variables left as is: groupa, groupb, groupc
Covariate set to value: lbm = 0
------
------
group | xb lb ub
------+------
a | 112.22 [79.7995 144.64]
b | 108.809 [76.0153 141.603]
c | 142.136 [109.108 175.165]
------
Key: xb = Linear Prediction
[lb , ub] = [95% Confidence Interval]
To get a single p value for the group comparison, if all we wanted to say was that there was a significant difference among the groups, we use what we know about combining the coefficients,
Group a = _cons = 112.2196
Group b = _cons + groupb = 112.2196 – 3.410606 = 108.80899
Group c = _cons + groupc = 112.2196 + 29.91667 = 142.13627
and then construct the following post-estimation Wald test,
test _cons = _cons+groupb = _cons+groupc( 1) - groupb = 0
( 2) - groupc = 0
F( 2, 30) = 12.10
Prob > F = 0.0001
We can now report the three groups are not equal after adjusting for lbm (p<.001).
What we just tested is the group main effect in an analysis of covariance,
encode group, generate(group2) // create numeric group variableanova rmr group2 lbm, continuous(lbm)
Number of obs = 34 R-squared = 0.4785
Root MSE = 17.8777 Adj R-squared = 0.4264
Source | Partial SS df MS F Prob > F
------+------
Model | 8798.55426 3 2932.85142 9.18 0.0002
|
group2 | 7734.4293 2 3867.21465 12.10 0.0001
lbm | 807.629752 1 807.629752 2.53 0.1224
|
Residual | 9588.38691 30 319.612897
------+------
Total | 18386.9412 33 557.180036
In the authors statistical methods section, they stated they used analysis of variance (ANOVA) and analysis of covariance (ANCOVA).
Although there are special routines in Stata to fit them more easily, analysis of variance (ANOVA) is just a linear regression with categorical predictor variables, and analysis of covariance (ANCOVA) is just a linear regression with both categorical variables and continuous variables. In ANOVA terminology, categorical variables are called factors and continuous variables are called covariates.
Likelihood Ratio Test For Model Fit
We did this in Chapter 24, and will cover it again quickly here. Using the 11.2.Isoproterenol.dta dataset provided with the Dupont (2002, p.338) textbook, described as,
“Lang et al. (1995) studied the effect of isoproterenol, a β-adrenergic agonist, on forearm blood flow in a group of 22 normotensive men. Nine of the study subjects were black and 13 were white. Each subject’s blood flow was measured at baseline and then at escalating doses of isoproterenol.”
Reading the data in,
use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\
11.2.Isoproterenol.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use 11.2.Isoproterenol, clear
To get a p value to test if a random slope is needed (in other words, the random slope improves the goodness of fit), we compare the log likelihoods between a model without it and a model with it included,
reshape long fbf , i(id) j(dose)*
xtmixed fbf dose || id: , mle cov(unstructured)
estimates store modelA // store model estimates
xtmixed fbf dose || id: dose , mle cov(unstructured)
estimates store modelB // store model with added term estimates
lrtest modelB modelA
display "LR Chi-square = " -2*(-476.95216 –(-437.38499))
. xtmixed fbf dose || id: , mle cov(unstructured)
Note: single-variable random-effects specification; covariance structure set to identity
Log likelihood = -476.95216 Prob > chi2 = 0.0000
------
fbf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+------
dose | .0352574 .0027581 12.78 0.000 .0298517 .0406631
_cons | 4.966772 1.245969 3.99 0.000 2.524718 7.408827
------
. xtmixed fbf dose || id: dose , mle cov(unstructured)
Log likelihood = -437.38499 Prob > chi2 = 0.0000
------
fbf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+------
dose | .0356267 .0049976 7.13 0.000 .0258315 .0454219
_cons | 4.929394 .6680961 7.38 0.000 3.61995 6.238838
------
. lrtest modelB modelA
Likelihood-ratio test LR chi2(2) = 79.13
(Assumption: modelA nested in modelB) Prob > chi2 = 0.0000
. display "LR Chi-square = " -2*(-476.95216 -(-437.38499))
LR Chi-square = 79.13434
The LR ratio test statistic = 79.13, p<0.0001 informs us that adding the random slopes improved the fit. The display command, was not needed, but was used just to show that this statistic is -2 × the difference of the two model likelihoods.
You could use this likelihood ratio test to get a p value for any variable in the model. It is only slightly more powerful than the Wald test p value shown in the regression table, frequently just changing the p value in the second or third decimal place.
Comparison of Regression Coefficients Within The Same Model
The regression coefficients within the same model can be compared using a post test. This might be useful, for example, if you wanted to compare the coefficients for different study sites.
First, read in some hypothetical data in the do-file editor,
clearinput y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 2
7 5 7 2
8 6 7 2
7 5 8 3
4 3 4 3
6 5 5 3
end
list , sepby(site)
+------+
| y x1 x2 site |
|------|
1. | 5 7 6 1 |
2. | 6 3 4 1 |
3. | 7 6 8 1 |
|------|
4. | 8 7 6 2 |
5. | 7 5 7 2 |
6. | 8 6 7 2 |
|------|
7. | 7 5 8 3 |
8. | 4 3 4 3 |
9. | 6 5 5 3 |
+------+
For this illustration, we will not concern ourselves with overfitting (too few subjects for the number of predictors in the model).
Although it is not necessary, if you add the “nocons” option, than you can see the mean value of each study site. Fitting such a linear regression model,
tab site , gen(site) // create site indicator variablesregress y site1 site2 site3, nocons
------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
------+------
site1 | 6 .6382847 9.40 0.000 4.438174 7.561826
site2 | 7.666667 .6382847 12.01 0.000 6.10484 9.228493
site3 | 5.666667 .6382847 8.88 0.000 4.10484 7.228493
------
Some possible site comparisons that might be of interest are:
*
test site1 = site2 // pairwise site comparisons
test site1 = site3
test site2 = site3
. test site1 = site2 = site3 // no difference among the sites
( 1) site1 - site2 = 0
( 2) site1 - site3 = 0
F( 2, 6) = 2.82
Prob > F = 0.1371
. *
. test site1 = site2 // pairwise site comparisons
( 1) site1 - site2 = 0
F( 1, 6) = 3.41
Prob > F = 0.1144
. test site1 = site3
( 1) site1 - site3 = 0
F( 1, 6) = 0.14
Prob > F = 0.7246
. test site2 = site3
( 1) site2 - site3 = 0
F( 1, 6) = 4.91
Prob > F = 0.0686
The two-tailed p values are shown in red.
The p values will be slightly different from the same pairwise t tests, but either approach is valid. An advantage of the approach shown here, however, is that you can test the pairwise differences while controlling for confounding variables, something which cannot be done with t tests.
Protocol Suggestion
The association between y and study site will be estimated using linear regression. To determine if the study sites differ, and which ones, the regression coefficients for specific study sites will be compared using a post-estimation Wald test (Judge et al, 1985, 20-28).
The Judge citation is given in the Stata-11 Base Reference, p.1919, which you can refer to using the Stata help menu (PDF Documentation).
Comparison of Regression Coefficients From Separate Models
The regression coefficients computed from different models can be compared using a post test. This might be useful, for example, if you wanted to compare the coefficients for different subgroups.
First, read in some hypothetical data in the do-file editor,
clearinput y x1 x2 site
5 7 6 1
6 3 4 1
7 6 8 1
8 7 6 1
7 5 7 1
8 6 7 2
7 5 8 2
4 3 4 2
6 5 5 2
8 4 3 2
end
list , sepby(site)
+------+
| y x1 x2 site |
|------|
1. | 5 7 6 1 |
2. | 6 3 4 1 |
3. | 7 6 8 1 |
4. | 8 7 6 1 |
5. | 7 5 7 1 |
|------|
6. | 8 6 7 2 |
7. | 7 5 8 2 |
8. | 4 3 4 2 |
9. | 6 5 5 2 |
10. | 8 4 3 2 |
+------+
We are interested in determining if the association between y and x1 is greater in Site 1 than in Site 2.
This can be done using,
regress y x1 if site==1 // fit 1st modelestimates store site1 // store estimates of 1st model
regress y x1 if site==2 // fit 2nd model
estimates store site2 // store estimates of 2nd model
suest site1 site2 // combine estimation results for testing
test [site1_mean]x1 = [site2_mean]x1 // test for equality of coef
. regress y x1 if site==1 // fit 1st model