Stratified Regression: Re-thinking Regression Coefficients

Farrokh Alemi, PhD, Amr Elrafey

Monday, May 29, 2017
In the chapter on regression, we explained that the coefficients in a regression equation do not have a meaningful interpretation when multiple variables and interaction terms are present. In a regression of Y on a single independent variable, the coefficient of the independent variable can be interpreted as the change in the dependent variable per unit change in the independent variable. This interpretation breaks down when examining multiple variables, and many statisticians have observed that in multivariable regressions with interactions the coefficients become meaningless. Consider the regression of a dependent variable $Y$ on two independent variables $A$ and $R$: we regress $Y_i$ on $A_i$, $R_i$, and the interaction of $A_i$ and $R_i$, written as $A_iR_i$. In this situation, none of the coefficients has any particular meaning, which is troubling; none can be interpreted as the change in the dependent variable per unit change in one variable. These coefficients were derived by minimizing the sum of squared residuals, a procedure that is mathematically sound but results in coefficients with no real-world interpretation. In this regard, McElreath writes:

“In a simple linear regression, with no interactions, each coefficient says how much the average outcome changes when the predictor changes by one unit. And since all of the parameters have independent influences on the outcome, there’s no trouble in interpreting each parameter separately. Each slope parameter gives us a direct measure of each predictor variable’s influence. Interaction models ruin this paradise, however. Look at the interaction likelihood again:

$$Y_i \sim \text{Normal}(\mu_i, \sigma)$$

$$\mu_i = \alpha + \delta_i R_i + \beta_A A_i$$

$$\delta_i = \beta_R + \beta_{AR} A_i$$

Now the change in $\mu_i$ that results from a unit change in $R_i$ is given by $\delta_i$. And since $\delta_i$ is a function of three things ($\beta_R$, $\beta_{AR}$, and $A_i$) we have to know all three in order to know the influence of $R_i$ on the outcome. … The practical implication of this fact is that you can no longer read the influence of either predictor from the table of estimates.” [1]

In this section, we show how a regression equation can be re-written so that the coefficients remain meaningful, even when interaction terms are present.

Stratified covariate balancing provides a partial way out of this difficulty [2]. Recall from an earlier chapter that the average impact of $X_i$ on $Y$ is shown as $k_i$ and is calculated over the $S$ observed unique combinations of $X_j, \ldots, X_m$, each of which we call a stratum. Within each stratum the values of $X_j, \ldots, X_m$ are fixed, resulting in:

$$k_i = \frac{\sum_{s=1}^{S} \left[ \bar{Y}(x_i^+, X_j, \ldots, X_m) - \bar{Y}(x_i^-, X_j, \ldots, X_m) \right]}{S}$$

Where $s$ is an index to the $S$ unique combinations of $X_j, \ldots, X_m$, $\bar{Y}$ indicates the average of the dependent variable within stratum $s$, $x_i^+$ indicates the highest, and $x_i^-$ indicates the lowest value of $X_i$. For each independent variable, stratified covariate balancing can thus estimate the change in the dependent variable when all other independent variables are held constant. The main effects of the regression can then be estimated using stratified covariate balancing. Each main-effect coefficient provides the average impact of the independent variable holding all other things constant, the so-called ceteris paribus condition. If we assume that the influence of the combined variables is the sum of the influence of each variable, then the formula for predicting $Y$ becomes:

$$Y^* = \sum_{r=1}^{n} k_r X_r$$

Where the $k_r$ parameters are estimated using stratified covariate balancing.

Of course, the assumption that the combined impact is the sum of the impacts of each variable is often wrong. We need to introduce a series of corrections for various combinations of the variables. Since we have $n$ binary variables, we will have $2^n$ possible combinations, and for each of these combinations we need to introduce a separate correction factor. Since a factorial design indicates all the possible combinations, all we have to do is estimate the correction factor for each combination in the factorial design. The formula, which we call the stratified regression formula, then looks as below:

$$Y = \sum_{r=1}^{n} k_r X_r + c_0 \prod_{m=1}^{n} (1 - X_m) + \sum_{r=1}^{n} c_r X_r + \sum_{r=1}^{n} \sum_{j>r}^{n} c_{rj} X_r X_j + \sum_{r=1}^{n} \sum_{j>r}^{n} \sum_{l>j}^{n} c_{rjl} X_r X_j X_l + \ldots$$

Where $c_0, c_r, c_{rj}, c_{rjl}, \ldots$ are the correction factors and the $k_r$ parameters measure the stratified-covariate-balanced impact of independent variable $r$ on $Y$. The coefficient $c_0$ is the correction for when none of the independent variables are present; the coefficients $c_r$ are the corrections for when a variable is present by itself; the coefficients $c_{rj}$ are the corrections for pairwise interactions; $c_{rjl}$ is the correction factor for when three independent variables interact; and so on. The parameter $k_r$ in the summation $\sum_{r=1}^{n} k_r X_r$ shows the average impact of the $r$th independent variable on the outcome. This part of the equation makes intuitive sense and corrects the problem in standard regression, where the average impact of a variable is not reported. The remaining parameters have a different interpretation: they show the correction necessary for calculating the impact of a combined set of independent variables, since the impact of a combined set of variables is usually not the sum of the impacts of each. Each correction factor adjusts the impact of a specific combination of interacting independent variables: a negative correction shows how much the combined effect is adjusted downward, and a positive correction how much it is adjusted upward. The correction factors can be estimated sequentially, starting with $c_0$ and moving up to higher combinations of independent variables, using the following equations:

1.  $\hat{Y}$ is the predicted value from stratified regression, where parameters not yet estimated are assumed to be zero.

2.  $c_0 = \text{average}(Y - \hat{Y})$ where $X_m = 0$ for all $m$

3.  $c_r = \text{average}(Y - \hat{Y})$ where $X_r = 1$ and $X_m = 0$ for all $m \neq r$

4.  $c_{rj} = \text{average}(Y - \hat{Y})$ where $X_r = 1$, $X_j = 1$, and $X_m = 0$ for all $m \neq r, j$

5.  $c_{rjl} = \text{average}(Y - \hat{Y})$ where $X_r = 1$, $X_j = 1$, $X_l = 1$, and $X_m = 0$ for all $m \neq r, j, l$

6.  … and so on for higher interaction parameters; a two-variable illustration follows this list.
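To see how this sequence unfolds, consider a minimal illustration (ours, using the notation above) with two binary predictors $X_1$ and $X_2$, for which the stratified regression formula reduces to:

$$Y = k_1 X_1 + k_2 X_2 + c_0 (1 - X_1)(1 - X_2) + c_1 X_1 + c_2 X_2 + c_{12} X_1 X_2$$

Because parameters not yet estimated are treated as zero, each correction falls out of one stratum at a time:

$$c_0 = \text{average}(Y) \text{ where } X_1 = X_2 = 0$$

$$c_1 = \text{average}(Y - k_1) \text{ where } X_1 = 1, X_2 = 0$$

$$c_2 = \text{average}(Y - k_2) \text{ where } X_1 = 0, X_2 = 1$$

$$c_{12} = \text{average}(Y) - (k_1 + k_2 + c_1 + c_2) \text{ where } X_1 = X_2 = 1$$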

Code for Estimation of the Parameters

In previous chapters we described Structured Query Language (SQL) code for estimating the $k_i$ coefficients through stratified covariate balancing. The above procedure implies a simple method of using SQL to estimate the interaction corrections. In step 1, identify the naturally occurring combinations of independent variables. Note that not all combinations of variables occur in the data: when the number of independent variables is large, the probability that some combinations never occur increases, and when independent variables are highly correlated, some combinations never occur. In many databases many combinations are missing, which reduces the number of correction factors to estimate and makes the procedure easier to apply to high-dimensional data. In step 2, sort the combinations in order of increasing interaction order; i.e., start with no independent variables present, then only one independent variable present, then only two, and so on. In step 3, starting with the lowest-order correction, calculate each correction as the average of the difference between observed and predicted values; in calculating the predicted values, assume that correction parameters not yet estimated have a value of zero.
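As a sketch of steps 1 and 2 (our own illustration, not code from the earlier chapters), assume a generic table D with binary columns X1, X2, and X3, the same hypothetical names used in the queries later in this section:

-- Step 1: list the combinations of independent variables that
-- actually occur in the data (often far fewer than 2^n)
SELECT D.X1, D.X2, D.X3, Count(*) AS Frequency
INTO Combinations
FROM D
GROUP BY D.X1, D.X2, D.X3;

-- Step 2: sort the combinations by the number of variables present,
-- so that lower-order corrections are estimated first
SELECT C.X1, C.X2, C.X3, C.X1 + C.X2 + C.X3 AS InteractionOrder
FROM Combinations AS C
ORDER BY C.X1 + C.X2 + C.X3;

Step 3 then walks through the sorted combinations, computing each correction as described above.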

Example of Stratified Regression

An example can demonstrate the calculation of parameters for stratified regression. Suppose we want to evaluate the impact of age (above 65 vs. below 65), gender (male vs. female), and co-pay (high vs. low) on the cost of insurance. For simplicity, assume the data for 33 observations given in Table 3. These data were generated so that:

$$\begin{aligned}
\text{Cost} = {} & 1000 + 2000 \cdot \text{Age} + 3000 \cdot \text{Gender} + 4000 \cdot \text{Copay} + 5000 \cdot \text{Age} \cdot \text{Gender} \\
& + 6000 \cdot \text{Age} \cdot \text{Copay} + 7000 \cdot \text{Gender} \cdot \text{Copay} + 8000 \cdot \text{Age} \cdot \text{Gender} \cdot \text{Copay} \\
& + 1000 \cdot \text{Random Number}
\end{aligned}$$

Note that even if regression were to recover a set of parameters close to those that generated the data, the presence of the interaction terms means we cannot read the average impact of any of the independent variables on cost from the recovered parameters, even if these parameters exactly match the coefficients in the above equation. When age is above 65, both the main effect of age ($2,000) and the interaction of age and gender ($5,000) could have an impact on cost. The interaction of age and gender contributes a value only if both variables are present; in some situations this interaction is zero and in others it is $5,000. So to fully capture the impact of age on cost, we would need to estimate the parameter for age, check whether gender or co-pay is present, and then adjust for the impact of the interaction terms.
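As a check on how the simulated values arise, consider a case with all three variables present. If the random number is uniform on [0, 1] (an assumption on our part; the text does not state its distribution), the expected cost is:

$$\text{Cost} = 1000 + 2000 + 3000 + 4000 + 5000 + 6000 + 7000 + 8000 + 1000 \times 0.5 \approx \$36{,}500$$

which is close to the corresponding stratum mean of \$36,500.49 reported later in Table 5.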

Table 3: Simulated Data for Impact of Age, Gender, Co-pay on Cost

Case / Age / Gender / Co-pay
1 / 0 / 1 / 1
2 / 0 / 1 / 1
3 / 1 / 0 / 1
4 / 1 / 1 / 1
5 / 1 / 1 / 1
6 / 1 / 0 / 1
7 / 0 / 1 / 1
8 / 0 / 0 / 1
9 / 1 / 0 / 1
10 / 0 / 0 / 1
11 / 0 / 1 / 1
12 / 0 / 0 / 1
13 / 1 / 0 / 1
14 / 1 / 0 / 1
15 / 1 / 1 / 1
16 / 0 / 1 / 1
17 / 1 / 1 / 0
18 / 1 / 0 / 0
19 / 0 / 1 / 0
20 / 1 / 0 / 0
21 / 1 / 0 / 0
22 / 0 / 1 / 0
23 / 0 / 0 / 0
24 / 1 / 1 / 0
25 / 1 / 0 / 0
26 / 1 / 0 / 0
27 / 1 / 0 / 0
28 / 1 / 0 / 0
29 / 1 / 1 / 0
30 / 0 / 1 / 0
31 / 1 / 1 / 0
32 / 0 / 0 / 0
33 / 0 / 1 / 0

We can now see how the regression of cost on age, gender, co-pay, and their interactions estimates the parameters of each of these variables (Table 4). The estimated regression coefficient for age is $2,165.59. This coefficient cannot be interpreted as the increase in cost for a change from young to old: that increase also depends on the age-and-gender, age-and-co-pay, and age-gender-and-co-pay interactions. From any single regression coefficient, we do not know how much insurance cost goes up when patients age. We could say that the cost of aging goes up by different amounts for males and females, but to do so is to dodge the question and ask the person looking at the ordinary regression formula to recalculate the impact. To calculate the average effect of age, we would need to examine the frequency of occurrence of males and females as well as the frequency of low and high co-pays. The fact that we cannot read the impact of X on Y from the regression equation is somewhat bizarre: the very purpose of regression was to estimate the effect of the variable, yet multivariable regression does not allow this. The effort to fit an equation to the data seems futile, as the equation does not provide the answer we are looking for.

Table 4: Regression of Cost on Age, Gender, Co-pay, and Their Interactions

Term / Coefficient / Standard Error / t Stat / P-value / Lower 95% / Upper 95%
Intercept / 1374.07 / 169.59 / 8.10 / 0.00 / 1024.80 / 1723.35
Age / 2165.59 / 182.17 / 11.89 / 0.00 / 1790.41 / 2540.77
Gender / 3198.28 / 234.13 / 13.66 / 0.00 / 2716.08 / 3680.48
Co-pay / 3831.89 / 234.13 / 16.37 / 0.00 / 3349.69 / 4314.09
Age & Gender / 4970.88 / 250.86 / 19.82 / 0.00 / 4454.23 / 5487.53
Age & Co-pay / 6150.30 / 273.63 / 22.48 / 0.00 / 5586.74 / 6713.86
Gender & Co-pay / 6947.95 / 310.66 / 22.37 / 0.00 / 6308.14 / 7587.76
Age, Gender & Co-pay / 7861.53 / 382.50 / 20.55 / 0.00 / 7073.74 / 8649.31
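To make the difficulty concrete, the average effect of age implied by Table 4 depends on how often the other variables occur; writing $\beta$ for the fitted coefficients (this derivation is ours, not part of the original analysis):

$$\text{Average effect of age} = \beta_{\text{Age}} + \beta_{\text{Age,Gender}}\Pr(\text{Gender}=1) + \beta_{\text{Age,Copay}}\Pr(\text{Copay}=1) + \beta_{\text{Age,Gender,Copay}}\Pr(\text{Gender}=1, \text{Copay}=1)$$

None of these frequencies appear in the regression output, so the table of estimates alone cannot answer the question. Now, let us look at how stratified regression analyzes the same data.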

First, we estimate the average impact of each variable while holding all other variables constant. We do so through SQL code that organizes the data into cases (having the high value of the independent variable) and controls (having the low value of the independent variable) across strata defined by the low and high values of the remaining independent variables. For example, the following code shows how the variable X1 is used to define cases and controls while variables X2 and X3 are used to create strata:

-- Cases: average Y within each stratum when X1 is at its high value
SELECT Avg(D.Y) AS AvgOfY, D.X2, D.X3
INTO CasesX1
FROM D
WHERE D.X1=1
GROUP BY D.X2, D.X3;

-- Controls: average Y within each stratum when X1 is at its low value
SELECT Avg(D.Y) AS AvgOfY, D.X2, D.X3
INTO ControlsX1
FROM D
WHERE D.X1=0
GROUP BY D.X2, D.X3;

-- k1: average of the case-control difference across matched strata
SELECT Avg(CasesX1.AvgOfY - ControlsX1.AvgOfY) AS k1
INTO k1
FROM ControlsX1 INNER JOIN CasesX1
ON (ControlsX1.X3 = CasesX1.X3) AND (ControlsX1.X2 = CasesX1.X2);

From our data, the SQL code calculates the average impact of each independent variable across the strata created from the remaining variables:

$$k_{\text{Age}} = \$9{,}377.84, \qquad k_{\text{Gender}} = \$11{,}426.17, \qquad k_{\text{Copay}} = \$12{,}028.13$$

These calculations are also shown in Table 5. Notice that the estimated impact of each independent variable is radically different from the coefficient of the same variable in the equation that generated the data. These estimates reflect not only the main effect of the variable but also its effect while interacting with other variables.

Table 5: Average Impact while Holding Other Variables Constant

Gender / Co-pay / Cases (Old Age) / Controls (Young Age) / Impact of Age
Female / High / $3,527.52 / $1,416.58 / $2,110.93
Female / Low / $13,521.85 / $5,205.96 / $8,315.89
Male / High / $11,730.08 / $5,793.82 / $5,936.26
Male / Low / $36,500.49 / $15,352.20 / $21,148.29
Average / $9,377.84
Age / Co-pay / Cases (Males) / Controls (Females) / Impact of Gender
Young / High / $5,793.82 / $1,416.58 / $4,377.24
Young / Low / $15,352.20 / $5,205.96 / $10,146.24
Old / High / $11,730.08 / $3,527.52 / $8,202.57
Old / Low / $36,500.49 / $13,521.85 / $22,978.64
Average / $11,426.17
Age / Gender / Cases (Low Co-pay) / Controls (High Co-pay) / Impact of Co-pay
Young / Female / $5,205.96 / $1,416.58 / $3,789.38
Young / Male / $15,352.20 / $5,793.82 / $9,558.38
Old / Female / $13,521.85 / $3,527.52 / $9,994.33
Old / Male / $36,500.49 / $11,730.08 / $24,770.41
Average / $12,028.13
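For example, the impact of age in Table 5 is the unweighted average of the four within-stratum differences:

$$k_{\text{Age}} = \frac{\$2{,}110.93 + \$8{,}315.89 + \$5{,}936.26 + \$21{,}148.29}{4} = \$9{,}377.84$$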

The interaction corrections can be calculated sequentially, starting with the $c_0$ parameter. This parameter is calculated as the average of the observed values in the situation where all independent variables are at their lowest level. The following SQL code estimates the correction parameter $c_0$ as $1,416.58:
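A minimal sketch of such a query, reusing the generic table D and hypothetical column names X1, X2, X3, and Y from the earlier queries (the chapter's own code may differ):

-- c0: average observed value where all independent variables
-- are at their lowest level (the predicted value is zero there)
SELECT Avg(D.Y) AS c0
INTO c0
FROM D
WHERE D.X1=0 AND D.X2=0 AND D.X3=0;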