Lab 4

Regression Models with Dummy Variables Versus ANOVA: An Experimental Study of the Effects of Alternative Drug Therapies on Systolic Blood Pressure

New Stata Commands and Techniques

· Missing values in Stata.

· Creating dummy variables with generate or tabulate commands.

The Data Set

The data set sysage.dta, provided by Stata, is not documented, but it appears to come from an experiment measuring the effect of four alternative drug treatments on systolic blood pressure, since there are four drug categories. It might be that one of the drug categories, say the first, is a control group representing either no drug treatment or a placebo. The study also allows one to control for confounding effects of age and type of disease on systolic blood pressure. There are three disease categories. We will assume that the diseases of the study participants have been assigned to these three categories in some meaningful way.

This is the data set we used when studying ANOVA in the very first lab session. Today we will show that ANOVA models are just linear regression models with dummy variables. We will also learn about interactions, fully interactive models, and the equivalence of a fully interactive OLS regression model to a set of separate regressions, one for each category of the factor with which the model is fully interactive.

Open the Stata data file sysage.dta, and inspect the data.
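
Assuming the file is in your current working directory, you can open it from the command line with:

use sysage.dta, clear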

describe
sum
tab drug
tab disease
kdensity systolic
kdensity age

Perform ANOVA to Test the Effect of Drug

To see if changes in blood pressure differ by drug treatment, you can perform an ANOVA test very easily in Stata.

anova systolic drug

The results are:

Number of obs =      58     R-squared     =  0.3355
Root MSE      = 10.7211     Adj R-squared =  0.2985

    Source |  Partial SS    df       MS          F     Prob > F
-----------+----------------------------------------------------
     Model |  3133.23851     3   1044.41284     9.09     0.0001
           |
      drug |  3133.23851     3   1044.41284     9.09     0.0001
           |
  Residual |  6206.91667    54   114.942901
-----------+----------------------------------------------------
     Total |  9340.15517    57   163.862371

Forming Dummy Variables for a Categorical Variable: The generate Command Method

You can form dummy variables for a categorical variable with the generate or tabulate commands. The latter method is easier in Stata, but may not be available in other statistical applications. The former method, however, has an analogue in virtually every other statistical application. First, we’ll use the generate command method.

gen drug1=drug==1
gen drug2=drug==2
gen drug3=drug==3
gen drug4=drug==4

Check to see that these commands worked as intended:

tab drug drug1
tab drug drug2
etc.
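
Another quick check: because every observation falls in exactly one drug category, the four dummies should sum to one for every observation, and assert will report an error if they do not.

assert drug1+drug2+drug3+drug4==1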

If the Categorical Variable has Missing Values

If any of the observations on the categorical variable are missing, then creating dummy variables with the generate command requires an “if condition” qualifier on each generate command, to ensure that the created dummy variables are also missing whenever the categorical variable is missing. Stata stores a missing value as a very large positive number, larger than any valid data value. Missing values are written as “.”, “.a”, “.b”, and so on up to “.z”; of these, “.” is the smallest. See “help missing” for an explanation. So, if your data have any missing values, the generate commands would be the following:

gen drug1=drug==1 if drug<.
gen drug2=drug==2 if drug<.
etc.

Another possibility is that certain codes are used to indicate that the value of a variable is missing. In that case the if condition above should be modified accordingly. Suppose, for example, that the value “9” indicates that the value is missing (“unknown”, “NA”, etc.). Then the generate commands would be the following:

gen drug1=drug==1 if drug~=9
gen drug2=drug==2 if drug~=9
etc.

In either of these two cases, the created dummy variables will have the Stata missing value “.” whenever the categorical variable is missing.

It’s probably a good idea to convert “user” missing values, like the example of “9” above, into Stata missing values before you conduct any analysis. This can be done through the following menu path:

Data | Create or change variables | Other variable transformation commands | Change numeric values to missing
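
The typed command corresponding to this menu path is mvdecode (check the command echoed in the Results window to confirm this on your version of Stata). For example, to recode the value 9 on drug to the Stata missing value “.”:

mvdecode drug, mv(9)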

You can also convert user-assigned missing values to Stata missing values with replace commands. For example,

replace drug=. if drug==9

The tabulate Command Method of Creating Dummy Variables

Now we’ll use the tabulate command method. First, delete the dummy variables we just created, using the drop command.

drop drug1 drug2 drug3 drug4

Data | Create or change variables | Other variable creation commands | Create indicator variables
Variable to tabulate: drug; New variables’ stub: drug

The variable names for the dummy variables consist of two parts: a prefix that you supply as the “new variables’ stub”, and a numeric suffix that numbers the categories in order (here the drug categories are coded 1 to 4, so the new variables are drug1 through drug4). This method handles Stata missing values correctly, but not user-assigned missing values (like the example of using “9” above). If you want to use the tabulate command method, first change user-assigned missing values into Stata missing values.
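
The typed equivalent of this menu path is the tabulate command with the generate() option:

tab drug, generate(drug)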

The Equivalence of ANOVA and Regression

ANOVA is a regression in which the independent variables are dummy variables. To verify this, we will use the dummy variables created above, which operationalize the drug treatment.

Do a regression to estimate the effect of the treatments on blood pressure:

regress systolic drug2 drug3 drug4

Note that the analysis-of-variance table matches the one produced by the anova command. Here is the one given by the regression:


    Source |       SS       df       MS              Number of obs =      58
-----------+------------------------------           F(  3,    54) =    9.09
     Model |  3133.23851     3  1044.41284           Prob > F      =  0.0001
  Residual |  6206.91667    54  114.942901           R-squared     =  0.3355
-----------+------------------------------           Adj R-squared =  0.2985
     Total |  9340.15517    57  163.862371           Root MSE      =  10.721

Hypothesis Tests

The null hypothesis that the drug treatments have no effect on blood pressure (i.e., they have the same effect as the reference category drug 1) is tested in the anova table of the regression output. This is the hypothesis that the coefficients on the variables drug2, drug3, and drug4 are jointly zero. This can also be accomplished by testing the appropriate linear hypotheses with a post-estimation test command:

Statistics | General post-estimation | Tests | Test linear hypotheses after model fitting
Main: highlight “Specification 1 (required)”; choose “Coefficients are 0”; Specification 1, test coefficients: drug2 drug3 drug4

Note how Stata states the null hypothesis in the output, and that the test is equivalent to the one performed in the anova table. Also notice that the null hypothesis involves three constraints on the coefficients of the model, i.e., that the coefficients on the three dummy variables are each equal to zero. The above joint test can also be done in another way:

Statistics | General post-estimation | Tests | Test linear hypotheses after model fitting
Main: highlight “Specification 1 (required)”; choose “Linear expressions are equal”; Specification 1, linear expression: drug2=drug3=drug4=0

The regression output conveniently tests the null hypotheses that each drug treatment is no different than drug treatment one in its effect on blood pressure. Suppose you wanted to test the null hypothesis that drug treatments three and four had the same effect on blood pressure. This could be done by:

Statistics | General post-estimation | Tests | Test linear hypotheses after model fitting
Main: highlight “Specification 1 (required)”; choose “Linear expressions are equal”; Specification 1, linear expression: drug3=drug4

One important thing to remember when using Stata’s test commands is that they must be given after the regression to which they apply but before the next estimation command, e.g., another regress command. The reason is that the test command uses the estimated coefficients, coefficient variance matrix, and other statistics from the regression, which Stata keeps in memory until they are overwritten by similar statistics from the next estimation command.

Look at the command syntax generated by each of these commands. You might find it more convenient to type these commands in the command window than to use the menu/windowing system.
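
For reference, the typed versions of the three tests above should look roughly like the following (the exact syntax Stata echoes from the menus may differ slightly):

test drug2 drug3 drug4
test (drug2=0) (drug3=0) (drug4=0)
test drug3=drug4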

ANOVA is Regression

The anova procedure is actually forming dummy variables, running a regression, and performing F‑tests just as we did above. To show that ANOVA is regression, use the regress option to see what regression Stata’s anova command performs.

anova systolic drug, regress

Stata chose the fourth drug category as the reference group, but the estimated model is identical to one estimated with drug category 1 as the reference category. You can verify this by evaluating the expected blood pressure for each drug category, or by running a regression with the fourth drug category as the reference group and comparing the regress and anova estimated coefficients.

regress systolic drug1 drug2 drug3
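
For example, after this regression (in which drug category 4 is the omitted, reference category), lincom recovers the expected blood pressure for any drug group:

lincom _cons + drug1
lincom _cons

The first command gives the predicted systolic pressure for drug group 1 and the second gives it for the reference group 4; the same values result whichever category serves as the reference.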

Controlling for Possibly Confounding Factors. The Main Effects Model

Regression

Operationalize the disease factor, and estimate a model that includes disease and age as additional explanatory variables.

gen disease2=disease==2
gen disease3=disease==3
regress systolic drug2 drug3 drug4 age disease2 disease3

Compare this model to the one with only the treatment factor. Do the treatment effects appear to be similar? What does this suggest? Answer: that the treatment dummies are not correlated with age or disease, which suggests that the drug therapies may have been randomly assigned to the patients. Also, the apparent lack of correlation between the treatments and the other factors (disease and age) suggests careful experimental design in this study.
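
One convenient way to make this comparison is to store each set of estimates and list them side by side; a minimal sketch (the names drugonly and maineff are arbitrary):

regress systolic drug2 drug3 drug4
estimates store drugonly
regress systolic drug2 drug3 drug4 age disease2 disease3
estimates store maineff
estimates table drugonly maineff, se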

Analyze the model estimates for the effects of age and disease on blood pressure.

This is called a "main effects" model because each factor enters the model independently of each other factor. The effect of any factor does not depend on the level or category of any other factor.

Test for the significance of each factor.

test drug2 drug3 drug4
test age
test disease2 disease3

Note that the second test command is redundant, since the test statistics are already reported on the regression output. (Remember that with one degree of freedom in the numerator, the F statistic is the square of the t-statistic.)
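
You can check the F = t-squared relationship directly from Stata’s saved results; a quick sketch, run immediately after the regression and the test age command:

test age
display r(F)
display (_b[age]/_se[age])^2

The two displayed values should agree up to rounding.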

Analysis of Covariance

This model can also be estimated and the significance of the factors tested using the anova command. You need to tell Stata that age is not a categorical factor, but a continuous factor (else what would Stata do?). This can be accomplished with the continuous option. When the model contains a continuous factor, the method is called “analysis of covariance”.

anova systolic drug age disease, continuous(age)

Notice that the hypothesis tests are the same as those from the regression model, because Stata is really just running the same regression as the one we just did (but perhaps with different reference categories).

Specify and Test a Model That Allows an Interaction between the Drug Therapy and Age

Regression

The effect of a drug on blood pressure may depend on the age of a patient. To specify a model that allows for such possibilities, form the interaction between the drug treatments and age, and estimate the model.

gen ageg2=age*drug2
gen ageg3=age*drug3
gen ageg4=age*drug4

For advanced users who understand dummy variables and interactions, Stata has a command, xi, which can create interaction terms and, alternatively, use them immediately within an estimation command such as regress. This command can be accessed by the following menu combination:

Data | Create or change variables | Other variable creation commands | Interaction expansion
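
If you want to try xi from the command line, the typed version would look something like the following sketch (the i.drug*age term asks xi to expand drug into dummies and also interact each dummy with the continuous variable age; xi names the generated variables automatically, e.g. _Idrug_2):

xi: regress systolic i.drug*age disease2 disease3

We will continue below with the interaction variables built by hand.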

Estimate the model that allows an interaction between the drug and age factors.

regress systolic drug2 drug3 drug4 age disease2 disease3 ageg2 ageg3 ageg4

To test whether there is an interaction between any particular drug therapy and age, use the t-statistic or p-value for the corresponding coefficient. To test the more general null hypothesis of no interaction between the drug factor and age, form the F-statistic and p-value for the following hypothesis:

test ageg2 ageg3 ageg4
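
Equivalently, because the interaction variables share the stub ageg, you can use testparm with a wildcard:

testparm ageg*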

Analysis of Covariance

See how easy it is to include an interaction of two factors with the anova command.

anova systolic drug age disease age*drug, continuous(age)

Again, it is no surprise that the test for the interaction yields the same result.

Specifying a Model That is Fully Interactive with Respect to a Factor

To make a model fully interactive with respect to a factor, interact that factor with each other factor in the model. Let’s start with the main effects model and make it fully interactive with respect to the disease factor. First, form the required interactions.

gen aged2=age*disease2
gen aged3=age*disease3

gen g2d2=drug2*disease2
gen g2d3=drug2*disease3
gen g3d2=drug3*disease2
gen g3d3=drug3*disease3
gen g4d2=drug4*disease2
gen g4d3=drug4*disease3

Then, estimate the model. (Be careful not to include the drug/age interaction in this model; the model should have 15 regressors, including the constant.)

regress systolic drug2 drug3 drug4 age disease2 disease3 aged2-g4d3

The term “aged2-g4d3” specifies a list of variables: all the variables that appear between aged2 and g4d3, inclusive, in Stata’s variable list. The order of the variables is the one displayed in the Variables window in Stata; by default, newly generated variables are added to the end of the list.
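
You can display the current variable order from the command line with ds (or simply look at the Variables window):

ds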

The Equivalence of a Model that is Fully Interactive with Respect to a Factor and a Set of Regressions Consisting of Separate Regressions on Each Category of the Factor

The fully interactive model is quite complex and requires the creation of dummy variables and interaction terms. It is sometimes more convenient to estimate the model by running separate regressions on each category of the factor. The same estimated models result, which can be verified by evaluating the fully interactive model for each category.

regress systolic drug2 drug3 drug4 age if disease==1
regress systolic drug2 drug3 drug4 age if disease==2
regress systolic drug2 drug3 drug4 age if disease==3

The “if” conditional expression is a convenient way to select subsets of observations on which to run the regressions. It is easy to verify that the regression on the patients who have the first disease gives the same estimated model as the fully interactive specification. Verifying the equivalence for the other two disease categories requires adding together the appropriate coefficients from the fully interactive model. For example, the estimated coefficient on age for those with the second disease can be directly read from the regression on the patients with disease 2. It is the same as the sum of the coefficients on the age and aged2 variables from the fully interactive model.
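
If you would rather have Stata do this arithmetic, lincom computes such sums of coefficients directly; for example, after re-estimating the fully interactive model:

lincom age + aged2

The point estimate should match the age coefficient from the disease==2 regression, although the reported standard errors will differ, because each separate regression estimates its own residual variance.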

This equivalence results from the fact that OLS is a linear estimator. It would not hold exactly for the other estimators we will study later this semester, such as GLS and probit/logit. Even for those estimators, however, the results would often be similar, though not identical.

When should you run the fully interactive model, and when should you run separate regressions instead? First, because they are equivalent, it doesn’t really matter, so it is really a matter of convenience. In this example, note that in either case you are estimating fifteen coefficients. You should be aware, however, that if you estimate separate regressions, the corresponding population model is one in which the effects of each factor (or association of each factor with the dependent variable) are generally different for each category – in this case, for example, the effects of drug and age are different for each disease. Is this really what you want to assume? If so, or if it really doesn’t matter for your analysis, then go ahead and estimate separate regressions if that is more convenient.