Chapter 5-22. Modeling Cost

In this chapter, we will cover a number of methods to model outcomes with severely right-skewed distributions. Such a distribution is observed in cost outcomes.

We will practice with the Doshi Cost Dataset (Doshi, 2005).

Doshi Cost Dataset

File: doshicost.dta

Source: downloaded as mdmcea.dta from Doshi’s website: www.uphs.upenn.edu/dgimhsr

No explanation was available of what the data represent, but it appears to be a randomized trial of a study drug.

Codebook

patid patient id

followup follow-up time (months)

died died 0=alive , 1=died

age baseline age (at time of randomization) (continuous variable)

ejfract baseline ejection fraction (continuous variable)

male male gender 0=female , 1=male

ischemic etiology 0=ischemic , 1=nonischemic

white White race 0=not white , 1=white

active treatment 0=placebo , 1=active

center study center 0=Center A , 1=Center B

nyha4 NYHA classification 0=class III , 1=class IV

cost cost during 2 years follow-up

qaly quality-adjusted life years at 2-year follow-up (continuous variable)

Reading the data into Stata,

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use doshicost.dta, clear

______

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2010.


To get a feel for the cost variable, we examine its histogram, shown with an overlaid normal curve to provide a sense of the skewness,

histogram cost , bin(50) color(red) normal


Shown separately for the two study groups,

histogram cost , bin(50) color(red) normal by(active)

Looking at the descriptive statistics,

tabstat cost, by(active) stats(mean median sd min max n) col(stat)

Summary for variables: cost
by categories of: active (treatment: active vs placebo)

    active |      mean       p50        sd       min       max         N
-----------+-------------------------------------------------------------
0. placebo |  6832.225  5705.506  5620.974  347.0169  22504.75       100
 1. active |  10832.22  8947.354  6514.671  3783.724  45849.69       100
-----------+-------------------------------------------------------------
     Total |  8832.223  7853.246  6391.574  347.0169  45849.69       200
--------------------------------------------------------------------------

We see that there are no zero costs in this dataset. We also notice that the mean is quite a bit larger than the median, reflecting right skewness.
Fitting an ordinary linear regression to this cost outcome, with treatment group as the predictor,

regress cost active

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  1,   198) =   21.61
       Model |  799998496     1   799998496            Prob > F      =  0.0000
    Residual |  7.3296e+09   198  37018145.4           R-squared     =  0.0984
-------------+------------------------------           Adj R-squared =  0.0939
       Total |  8.1296e+09   199  40852217.5           Root MSE      =  6084.3

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      active |   3999.996   860.4434     4.65   0.000     2303.187    5696.806
       _cons |   6832.225   608.4254    11.23   0.000     5632.399     8032.05
------------------------------------------------------------------------------

and then requesting the predicted values,

capture drop pred_cost
predict pred_cost , xb
tabstat pred_cost, by(active) stats(min max n) col(stat)

Summary for variables: pred_cost
by categories of: active (treatment: active vs placebo)

    active |       min       max         N
-----------+--------------------------------
0. placebo |  6832.225  6832.225       100
 1. active |  10832.22  10832.22       100
-----------+--------------------------------
     Total |  6832.225  10832.22       200
---------------------------------------------

We see that the min and max are the same in each group, reflecting that the predicted value is just the group mean for every subject. For the active = 0 , or placebo group, the predicted value is just the constant. For the active = 1, or active group, the predicted value is the constant plus the coefficient for active.

display 6832.225+3999.996*0 // active = 0 , or placebo group
display 6832.225+3999.996*1 // active = 1 , or active group

6832.225

10832.221

We see that linear regression fits a line through the means.
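This arithmetic can be checked outside Stata as well. The following Python snippet (illustrative only, using made-up numbers, not the Doshi data) shows that with a single 0/1 predictor, the ordinary least squares fit reduces to the two group means:

```python
from statistics import mean

# Hypothetical costs (not the Doshi data) for placebo (active=0) and active (active=1)
placebo = [5000.0, 7000.0, 6500.0, 8000.0]
active = [9000.0, 12000.0, 10500.0, 11500.0]

# With a single 0/1 predictor, OLS reduces to the two group means:
# intercept _cons = mean(placebo); slope on active = mean(active) - mean(placebo)
cons = mean(placebo)
coef_active = mean(active) - mean(placebo)

# The predicted value for each subject is just its group mean
pred_placebo = cons + coef_active * 0
pred_active = cons + coef_active * 1

print(cons, coef_active, pred_placebo, pred_active)
```

This is exactly the pattern seen in the Stata output above: the constant is the placebo mean, and constant plus coefficient is the active-group mean.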


Illustrating this by graphing a scatterplot with the predicted linear fit,

#delimit ;
twoway (scatter cost active, mcolor(blue) msize(medium) msymbol(circle_hollow))
(line pred_cost active, clcolor(blue))
, xtitle(Treatment Group)
xlabel(0 "0. Placebo" 1 "1. Active")
legend(off)
ytitle("Cost")
ylabel(0(5000)50000,angle(horizontal))
text(12432 0.87 "$10,832")
text(5232 0.13 "$6,832")
ysize(2) xsize(2) r1title(" ")
;
#delimit cr


Root Mean Squared Error

To compare the predictive performance of the various models in their ability to predict mean cost, we will use the root mean squared error (RMSE), which is (Lipscomb, 1998, equation 6):

    RMSE = sqrt[ (1/N) * sum over i of (predicted_i - observed_i)^2 ]

The smaller this number is, the better the prediction of individual observations.

Austin et al (2003, pp 2804-2805) provide some other measures of predictive accuracy that could also be used.

Computing this for the above linear regression on untransformed cost,

* - compute root mean squared error
capture drop rmsenumerator
capture drop rmsesum
capture drop rmse
gen double rmsenumerator = (pred_cost-cost)^2
egen double rmsesum = sum(rmsenumerator)
gen double rmse = rmsesum/_N
display sqrt(rmse[1])

6053.7562
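As a cross-check of the formula, the same RMSE computation can be sketched in Python (with hypothetical numbers, not the Doshi data):

```python
import math

# Hypothetical observed costs and model predictions (not the Doshi data)
cost = [5000.0, 7000.0, 9000.0, 12000.0]
pred_cost = [6000.0, 6000.0, 10500.0, 10500.0]  # e.g., fitted group means

# RMSE = sqrt( sum_i (pred_i - obs_i)^2 / N ), mirroring the Stata steps above
n = len(cost)
rmse = math.sqrt(sum((p - c) ** 2 for p, c in zip(pred_cost, cost)) / n)
print(round(rmse, 2))  # 1274.75
```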

Summarizing the performance of this model,

Doshi Cost Dataset

                                               Linear Regression,
                                               Untransformed Cost Variable:
Treatment      Actual Median    Actual Mean    Predicted Mean
---------------------------------------------------------------------------
Active                 8,947         10,832        10,832
Placebo                5,706          6,832         6,832
Difference             3,241          4,000         4,000
RMSE                                                6,054


Quantile Regression (Median Regression)

Linear regression does an excellent job of predicting the mean, and makes no attempt at all to predict the median.

Doshi Cost Dataset

                                               Linear Regression,
                                               Untransformed Cost Variable:
Treatment      Actual Median    Actual Mean    Predicted Mean
---------------------------------------------------------------------------
Active                 8,947         10,832        10,832
Placebo                5,706          6,832         6,832
Difference             3,241          4,000         4,000

For skewed distributions, the median does a better job of predicting an average cost, in the sense that it is closer to the middle of the distribution of costs.

However, if the purpose is an economic analysis, the mean is a better average because it can be transformed into total cost, whereas the median cannot (Doshi et al, 2005, slide 5).

In an economic analysis, the total cost is the measure of interest.
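A one-line check of why the mean, but not the median, recovers total cost (a Python illustration with made-up skewed numbers, not the Doshi data):

```python
from statistics import mean, median

# Hypothetical right-skewed costs (not the Doshi data)
costs = [1000.0, 2000.0, 3000.0, 4000.0, 40000.0]
n = len(costs)

total_from_mean = mean(costs) * n      # always equals sum(costs) exactly
total_from_median = median(costs) * n  # badly understates the total here

print(total_from_mean, total_from_median, sum(costs))
```

Mean times N reproduces the total exactly by definition, while median times N can miss badly under right skew, which is why economic analyses work with means.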

If the goal was to predict the median cost, then quantile regression could be used (Austin et al, 2005). When the 50th quantile, or 50th percentile (the median), is being estimated, this approach is also called median regression.

qreg cost active, quantile(50)

Median regression                                    Number of obs =       200
  Raw sum of deviations 895531.8 (about 7816.248)
  Min sum of deviations   857974                     Pseudo R2     =    0.0419

------------------------------------------------------------------------------
        cost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      active |   3008.653   1153.682     2.61   0.010     733.5708    5283.735
       _cons |   5947.016   815.7767     7.29   0.000      4338.29    7555.742
------------------------------------------------------------------------------


Requesting the predicted values,

capture drop pred_cost
predict pred_cost , xb
tabstat pred_cost, by(active) stats(min max n) col(stat)

Summary for variables: pred_cost
by categories of: active (treatment: active vs placebo)

    active |       min       max         N
-----------+--------------------------------
0. placebo |  5947.016  5947.016       100
 1. active |  8955.669  8955.669       100
-----------+--------------------------------
     Total |  5947.016  8955.669       200
---------------------------------------------

Adding these predicted medians to the table,

Doshi Cost Dataset

                           Quantile Regression,               Linear Regression,
                           Untransformed Cost                 Untransformed Cost
               Actual      Variable:              Actual      Variable:
Treatment      Median      Predicted Median      Mean         Predicted Mean
--------------------------------------------------------------------------------
Active          8,947           8,956            10,832           10,832
Placebo         5,706           5,947             6,832            6,832
Difference      3,241           3,009             4,000            4,000

We see that the predicted values are close to the actual medians, but the fit is not as exact as linear regression's fit to the means.
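With a single binary predictor, minimizing the sum of absolute deviations (what qreg does at the 50th quantile) amounts to fitting a median within each arm; with an even group size, any value between the two middle observations minimizes the absolute-deviation loss, which is one reason qreg's fitted values can differ slightly from tabstat's p50. A brute-force Python check on made-up data (odd group sizes, so the minimizer is unique; not the Doshi data):

```python
from statistics import median

# Hypothetical costs by group (not the Doshi data)
placebo = [3000.0, 5500.0, 6000.0, 8000.0, 20000.0]
active = [7000.0, 8500.0, 9000.0, 12000.0, 35000.0]

def lad_fit(y, lo=0.0, hi=50000.0, step=500.0):
    """Brute-force the constant c minimizing sum |y_i - c| (the L1 loss)."""
    grid = [lo + step * k for k in range(int((hi - lo) / step) + 1)]
    return min(grid, key=lambda c: sum(abs(v - c) for v in y))

# Within each group the L1-minimizing constant is the group median,
# so median regression on a 0/1 predictor fits the two group medians
assert lad_fit(placebo) == median(placebo)
assert lad_fit(active) == median(active)

# Coefficient on active = difference in medians; _cons = placebo median
print(median(placebo), median(active) - median(placebo))
```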

Quantile regression is particularly well-suited to modeling reference growth charts (Wei et al, 2006).

Austin et al (2003) included median regression in their comparison of the possible models of cost. Thus, you could use Austin as a citation for the method if you chose to use median regression in your analysis.

The Issue of Skewness

We have observed that skewed distributions do not lead to inaccurate effect estimates; the linear regression model fitted the means exactly. The issue is the effect of skewness on the significance tests for the effect. Whereas the linear regression coefficient, or slope, for the treatment group represents the mean difference between the groups, the significance test is a test for equality of group means.

An assumption of this test is that the residuals, observed minus predicted values, are normally distributed.

Fitting the linear regression again and graphing histograms of the residuals,

regress cost active
capture drop resid_cost
predict resid_cost , resid
histogram resid_cost , bin(50) color(red) normal percent by(active)

we see that the residuals are not normally distributed.
This lack of normality may not be a problem. It has long been known among statisticians that the regression model coefficient significance tests are robust to violations of the normality assumption. That is, for large samples, the p values remain accurate.

Doshi et al (2005, slide 6) state,

“While the normality assumption is routinely violated for cost data, in large samples these tests have been shown to be robust to violations of this assumption.”

The robustness of linear regression can be assessed with simulation when the null hypothesis is true. If the nominal significance level is maintained (significance is not obtained by chance more than alpha = 0.05, or 5%, of the time), then the p value can be considered accurate.

Two of the assumptions of linear regression are:

1) The residuals are normally distributed. In basic statistics courses it is stated that the
   outcome variable itself is normally distributed. That is not actually necessary, but it is
   an easy way to teach the assumption, since if the variable itself is normally distributed,
   then so are the residuals.

2) The variance is the same at every point on the X-axis, called the homogeneity of variance
   assumption.

Linear regression is known to be robust to violations of both the normality and homogeneity of variance assumptions, meaning that the assumptions can be violated to some degree and the p values are still sufficiently accurate. The way statisticians investigated this was to study the impact of the assumption violations on the nominal significance rate (alpha, or Type I error). If the Type I error was not increased, then the violation was not considered to be of importance.

In regard to the homogeneity of variance assumption, Lorenzen and Anderson (1993, p.35) state,

“Historically, the assumption thought to be the most critical was the homogeneity of variance. However, Box (1954) demonstrated that the F test in ANOVA was most robust for α while working with a fixed model having equal sample sizes. He showed that for relatively large (one variance up to nine times larger than another) departures from homogeneity, the α level may only change from .05 to about .06. This is not considered to be of any practical importance. (It should be pointed out that the only time an α level increased dramatically was when the sample size was negatively correlated with the size of the variance.)”


Just how far from normality can one go without creating a problem? It is not so simple to quantify a departure from normality, but a contrast to the homogeneity of variance assumption can be made. We saw above that for the homogeneity of variance assumption, the variance of one group could be three times the variance of another group without changing alpha (the Type I error) at all, and could be nine times the variance and only change alpha from 0.05 to 0.06, a change of no importance. What might seem like appalling violations, then, are of no consequence due to the robustness of the hypothesis tests. The hypothesis tests are even more robust to violations of the normality assumption than to violations of homogeneity of variance. van Belle (2002, p.8) explains,

“Many investigators have studied the issue of the relative importance of the assumptions underlying hypothesis tests (see, Cochran, 1947; Gastwirth and Rubin, 1971; Glass et al., 1972; Lissitz and Chardoes, 1975; Millard et al., 1985; Pettitt and Siskind, 1981; Praetz, 1981; Scheffé, 1959; and others). All studied the effects of correlated errors on classical parametric and nonparametric tests. In all of these studies, positive correlation resulted in an inflated Type I error level, and negative correlation resulted in a deflated Type I error level. The effects of correlation were more important than differences in variances between groups, and differences in variances were more important than the assumption of a normal distribution.”


Simulation 1. Monte Carlo simulation of normal distribution

We will take samples of size n=100, 200, 300, 400, and 500 per group from a normal population, comparing the two groups, mean±SD, 50±20 vs 50±20, or no difference, and count the proportion of times p < 0.05 for each sample size, using a two-sided comparison.

set seed 999
forval j=100(100)500 {
    quietly clear
    quietly set obs 1
    quietly gen signif=.
    quietly save tempsignif, replace   // file to hold # times significant
    forval i=1/10000 {
        quietly clear
        quietly local double=`j'*2     // twice the individual group N
        quietly set obs `double'
        quietly local range1="in 1/`j'"
        quietly local j2=`j'+1
        quietly local range2="in `j2'/`double'"
        *disp "`range1'" " " "`range2'"
        quietly gen group=1 `range1'
        quietly replace group=0 `range2'
        quietly gen cost=invnorm(uniform())*20+50   // normal mean=50, SD=20
        *sum
        capture regress cost group
        if _rc==0 {
            quietly gen signif=(1-(normprob(abs(_b[group]/_se[group]))))*2 ///
                < 0.05 in 1/1
            *display (1-(normprob(abs(_b[group]/_se[group]))))*2
            quietly keep in 1/1
            quietly keep signif
            quietly append using tempsignif
            quietly save tempsignif, replace
        }
    }
    sum signif
}

The results were

Sample Sizes            mean±SD           Proportion of times significant (p<.05)
n1 = 100 , n2 = 100     50±20 vs 50±20    0.052
n1 = 200 , n2 = 200     50±20 vs 50±20    0.052
n1 = 300 , n2 = 300     50±20 vs 50±20    0.055
n1 = 400 , n2 = 400     50±20 vs 50±20    0.052
n1 = 500 , n2 = 500     50±20 vs 50±20    0.053

The proportion of times significance (p<.05) was observed by chance was very close to the nominal alpha (0.05), indicating that the p values are accurate when the normality assumption is met.
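The same experiment can be sketched in Python (an illustrative stand-in, not the chapter's Stata code; fewer replications for speed, and a large-sample z test in place of the regression-based test, which is what the normprob() line in the Stata loop computes):

```python
import random
from math import sqrt

random.seed(999)

def reject_null(g0, g1, crit=1.96):
    # Large-sample z test for a difference in means; with a 0/1 predictor
    # this matches the regression coefficient test used in the Stata loop
    n0, n1 = len(g0), len(g1)
    m0, m1 = sum(g0) / n0, sum(g1) / n1
    v0 = sum((x - m0) ** 2 for x in g0) / (n0 - 1)
    v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
    z = (m1 - m0) / sqrt(v0 / n0 + v1 / n1)
    return abs(z) > crit

reps, n = 2000, 100   # fewer replications than the Stata run, for speed
hits = 0
for _ in range(reps):
    g0 = [random.gauss(50, 20) for _ in range(n)]   # placebo-like group
    g1 = [random.gauss(50, 20) for _ in range(n)]   # active-like group
    hits += reject_null(g0, g1)
print(hits / reps)   # should land close to the nominal 0.05
```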


Simulation 2. Bootstrap simulation of normal distribution

We will take samples of size n=100, 200, 300, 400, and 500 per group from a normal population, comparing the two groups, mean±SD, 50±20 vs 50±20, or no difference, and count the proportion of times p < 0.05 for each sample size, using a two-sided comparison.

We will do this by taking one sample and then making a duplicate of it. At that point each group contains an identical sample. To introduce variability, we will then take a bootstrap sample of this entire sample and run the regression model. This sample will be resampled 1,000 times.
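The procedure just described can be sketched in Python (again an illustrative stand-in for the Stata code, with a made-up base sample and the same z-test helper used above):

```python
import random
from math import sqrt

random.seed(42)

def reject_null(g0, g1, crit=1.96):
    # Same large-sample z test for a mean difference as in Simulation 1
    n0, n1 = len(g0), len(g1)
    m0, m1 = sum(g0) / n0, sum(g1) / n1
    v0 = sum((x - m0) ** 2 for x in g0) / (n0 - 1)
    v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
    z = (m1 - m0) / sqrt(v0 / n0 + v1 / n1)
    return abs(z) > crit

# One normal sample, then a duplicate, so the two groups are identical
# and the null hypothesis is true by construction
base = [random.gauss(50, 20) for _ in range(100)]
data = [(0, y) for y in base] + [(1, y) for y in base]

# Bootstrap: resample the pooled (group, cost) pairs with replacement,
# re-run the test each time, and count how often p < 0.05
reps = 1000
hits = 0
for _ in range(reps):
    boot = [random.choice(data) for _ in range(len(data))]
    g0 = [y for g, y in boot if g == 0]
    g1 = [y for g, y in boot if g == 1]
    hits += reject_null(g0, g1)
print(hits / reps)   # near the nominal 0.05 if the test holds its level
```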