BIOST/STAT 578B Data Analysis

Biost/Stat 579 – Autumn 2008, 11/14/08

BIOST/STAT 579 – Data Analysis

Correlated Responses

A basic assumption of most statistical methods is that observations are independent. As shown in the first section, correlation between responses can have dramatic impacts on inferences that assume independence. Although the large-sample unbiasedness of most estimators is not affected by correlation, their variances can be strongly affected. Ignoring the correlation therefore leads to misleading assessments of the uncertainty in the estimators, and hence to invalid inferences.

The third section illustrates the robust “sandwich” variance estimator, which accommodates correlation due to clustering of observations. The robust estimator typically performs very well as long as the number of clusters in the data set is large, but its performance can be poor with small samples.

Although it is often not necessary, or even worthwhile, to model the correlations between responses, such modeling can increase the precision of regression coefficient estimators. The most popular approach is based on Generalized Estimating Equations (GEE), which are illustrated in section 4. Importantly, the correlations need not be modeled correctly to obtain regression coefficient estimates that are at least large-sample unbiased; the gains in precision depend on how well the correlations and variances of the responses are modeled. Combining GEE with robust variance estimators gives the advantages of both: large-sample unbiased estimators of the regression coefficients that have good precision, together with valid assessments of their uncertainty. Fortunately, all of this holds without having to model the correlations correctly, since we usually cannot assume we have done so.

Less commonly, the correlations are of scientific interest in their own right. This situation warrants more emphasis on accurate modeling and may require more elaborate models (such models will not be described here but may be presented in a special topics session later). Other issues that arise with correlated data may also be presented later if there is interest.

Outline

1. The Impact of Correlation Between Responses

2. General Approaches to Analysis of Correlated Responses

3. Empirical Variance Estimation for Clustered Data

4. Generalized Estimating Equations

The Impact of Correlation on Inferences that Assume Independence

Example (Potthoff & Roy growth data): The distance between two facial landmarks was measured on 27 children (11 females, 16 males) at ages 8, 10, 12, and 14 years.

Spaghetti plots (below) show that there is positive correlation between measurements on the same child (do you see the positive correlation?). (Note that age has been centered at 11.)

Let’s ignore the correlation and fit a linear model to the distance measures to assess associations with age and sex and their interaction.

Model II
Variable / Estimate / SE
Intercept / 22.6 / 0.34
Male / 2.32 / 0.44
Age / 0.48 / 0.15
Male*Age / 0.30 / 0.20

Answer the following questions before looking ahead in the notes:

Are these estimates valid? If not, in what way are they wrong?

Are these standard errors valid? If not, in what way are they wrong?

Would your answers be different if the correlation had been negative?

How can we tell if the inference is valid?

Perform two subject-level analyses:

Analysis of subject means:

Variable / Estimate / SE
Intercept / 22.6 / 0.59
Male / 2.32 / 0.76

Analysis of regression slopes on age:

Variable / Estimate / SE
Intercept / 0.48 / 0.10
Male / 0.30 / 0.13
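The two subject-level analyses can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy responses are hypothetical (they are not the Potthoff-Roy measurements), and the helper names `subject_summaries` and `regress_on_group` are invented here. Each child is reduced to one number (a mean or a least-squares slope on centered age), and that number is then regressed on the Male indicator, so the units of analysis are independent children.

```python
import statistics

# Ages centered at 11, as in the notes (8, 10, 12, 14 years).
AGES = [-3.0, -1.0, 1.0, 3.0]

def subject_summaries(y, ages=AGES):
    """Return (mean, least-squares slope on age) for one subject's responses."""
    y_bar = statistics.mean(y)
    t_bar = statistics.mean(ages)
    sxy = sum((t - t_bar) * (v - y_bar) for t, v in zip(ages, y))
    sxx = sum((t - t_bar) ** 2 for t in ages)
    return y_bar, sxy / sxx

def regress_on_group(ref, other):
    """Regress a subject-level summary on a two-group indicator.

    Returns (intercept, difference, SE of intercept, SE of difference),
    using the pooled residual variance as ordinary least squares would.
    """
    n0, n1 = len(ref), len(other)
    m0, m1 = statistics.mean(ref), statistics.mean(other)
    ss = sum((v - m0) ** 2 for v in ref) + sum((v - m1) ** 2 for v in other)
    s2 = ss / (n0 + n1 - 2)
    return m0, m1 - m0, (s2 / n0) ** 0.5, (s2 * (1 / n0 + 1 / n1)) ** 0.5

# Hypothetical toy data, one list of four responses per child.
females = [[20.0, 21.0, 22.0, 23.0], [21.0, 21.5, 22.0, 22.5]]
males = [[22.0, 24.0, 26.0, 28.0], [23.0, 24.0, 25.0, 26.0]]

f_means, f_slopes = zip(*(subject_summaries(y) for y in females))
m_means, m_slopes = zip(*(subject_summaries(y) for y in males))

mean_fit = regress_on_group(f_means, m_means)     # analysis of subject means
slope_fit = regress_on_group(f_slopes, m_slopes)  # analysis of subject slopes
print("means: ", mean_fit)
print("slopes:", slope_fit)
```

In the mean analysis the intercept plays the role of the female average and the second coefficient the Male difference; in the slope analysis they correspond to the Age and Male*Age terms of the full model.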

In the table below, summarize the impacts of correlation on the naïve inferences that assume independence for each of the terms in the model, assuming the correlation between measures on the same child is positive or negative. Indicate whether the naïve inference is valid, conservative, or anti-conservative.

Term / ρ > 0 / ρ < 0
Male
Age
Male*Age

What can you say for a general study (not having the balanced structure of the Potthoff-Roy data)?

The Design Effect

The design effect (deff) summarizes the impact of correlation on variances of parameter estimators. It is the ratio of the actual variance (accounting for the correlation) to the naïve variance, which ignores the correlation:

$$\text{deff} = \frac{\text{variance accounting for the correlation}}{\text{variance ignoring the correlation}}$$

What is the formula for the deff for the Male term in the model for the Potthoff-Roy data?

Assuming only 2 measurements per child (e.g., age 8 and 10), what is the formula for the deff for the age term?

List the factors that determine the impact of correlation on inferences that assume independence:
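One way to check answers to the questions above is a small calculation under an assumed exchangeable correlation structure (every pair of measurements on the same child has the same correlation ρ) and a balanced design. These assumptions are not stated in the notes, and the function names are invented for illustration. Under them, a cluster-level covariate such as Male has deff = 1 + (m − 1)ρ with m measurements per child, while a covariate that varies only within children in the same way for every child, such as centered Age, has deff = 1 − ρ.

```python
def deff_cluster_level(m, rho):
    """Design effect for a cluster-level covariate, m units per cluster.

    Assumes exchangeable correlation rho and equal cluster sizes.
    """
    return 1 + (m - 1) * rho

def deff_within_cluster(rho):
    """Design effect for a balanced within-cluster covariate.

    Assumes exchangeable correlation rho and the same covariate values
    in every cluster (e.g. the same measurement ages for every child).
    """
    return 1 - rho

def deff_from_ses(valid_se, naive_se):
    """Empirical design effect: ratio of variances, i.e. squared SE ratio."""
    return (valid_se / naive_se) ** 2

# SEs taken from the tables in these notes (subject-level vs. naive).
deff_male = deff_from_ses(0.76, 0.44)   # roughly 3
deff_age = deff_from_ses(0.10, 0.15)    # roughly 0.44

# Correlation implied by each design effect under the assumptions above.
rho_from_male = (deff_male - 1) / (4 - 1)
rho_from_age = 1 - deff_age
print(round(rho_from_male, 2), round(rho_from_age, 2))
```

The two implied correlations need not agree exactly, since the tabulated SEs are rounded and estimated, and the exchangeable structure is only an approximation.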

General Approaches to Analysis of Correlated Responses

Pick one response per cluster to analyze (cluster-level analysis). This is particularly useful for longitudinal data when the last outcome is the most important. (Example: GCSF data)

Use a summary statistic (derived variable) to summarize multiple correlated responses. Examples are the mean, median, and regression slope. (Example: Potthoff-Roy data)

Analyze individual responses using a naïve estimation method that assumes independence, combined with a different method for assessing uncertainty that accommodates the correlation, such as the jackknife, bootstrap, permutation, or a robust variance estimate. (Example: Potthoff-Roy, HSPP)

Analyze individual responses using a method that accounts for the correlation in both estimation and inference.

Empirical Variance Estimation for Clustered Data

Linear model with naïve and robust standard errors and SEs from subject-level analyses:

Variable / Estimate / Naïve SE / Robust SE / Subject-Level
Intercept / 22.6 / 0.34 / 0.61 / 0.59
Male / 2.32 / 0.44 / 0.75 / 0.76
Age / 0.48 / 0.15 / 0.06 / 0.10
Male*Age / 0.30 / 0.20 / 0.12 / 0.13
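The robust column can be understood through a minimal sketch of the cluster "sandwich" calculation for a linear model with an intercept and a single covariate. The code below is an illustration, not the software used for the table: the data are simulated (hypothetical values, with a random intercept per cluster inducing positive correlation), the function names are invented, and no small-sample correction is applied.

```python
import random

def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def ols_naive_and_robust(x, y, cluster):
    """Fit y = b0 + b1*x by OLS.

    Returns ((b0, b1), naive SEs, cluster-robust SEs). The robust SEs use
    the basic sandwich form with no small-sample correction.
    """
    n = len(y)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    # "Bread": inverse of X'X for the design matrix with columns [1, x].
    bread = [[sxx / det, -sx / det], [-sx / det, n / det]]
    b0 = bread[0][0] * sy + bread[0][1] * sxy
    b1 = bread[1][0] * sy + bread[1][1] * sxy
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

    # Naive variance assumes independent errors: s^2 * (X'X)^{-1}.
    s2 = sum(r * r for r in resid) / (n - 2)
    naive = [(s2 * bread[j][j]) ** 0.5 for j in range(2)]

    # "Meat": sum over clusters of (X_i' r_i)(X_i' r_i)'.
    scores = {}
    for xi, ri, ci in zip(x, resid, cluster):
        s = scores.setdefault(ci, [0.0, 0.0])
        s[0] += ri
        s[1] += xi * ri
    meat = [[sum(s[a] * s[b] for s in scores.values()) for b in range(2)]
            for a in range(2)]

    sandwich = matmul2(matmul2(bread, meat), bread)
    robust = [sandwich[j][j] ** 0.5 for j in range(2)]
    return (b0, b1), naive, robust

# Simulated clustered data: 200 clusters of 4 units, binary cluster-level x.
rng = random.Random(0)
x, y, cluster = [], [], []
for i in range(200):
    xi = float(i % 2)
    u = rng.gauss(0, 2)              # shared cluster effect -> correlation
    for _ in range(4):
        x.append(xi)
        cluster.append(i)
        y.append(22.0 + 2.0 * xi + u + rng.gauss(0, 1))

est, naive, robust = ols_naive_and_robust(x, y, cluster)
print("estimates:", est)
print("naive SEs:", naive)
print("robust SEs:", robust)
```

With a cluster-level covariate and positive within-cluster correlation, the robust SE should come out larger than the naive SE, mirroring the Intercept and Male rows of the table.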

A Simulation Study of Small-Sample Properties of Empirical Variance Estimators

Data simulated from a logistic regression model with 2 binary predictors (one cluster level and one unit level), with a “dental” correlation structure with average correlation 0.1. Type I errors for .05-level tests:

# Clusters / Cluster sizes / Cluster-level predictor / Unit-level predictor
10 / Equal / .132 / .140
10 / Unequal / .179 / .158
20 / Equal / .071 / .089
20 / Unequal / .104 / .111
30 / Equal / .072 / .079
30 / Unequal / .083 / .088
40 / Equal / .066 / .063
40 / Unequal / .075 / .087

Conclusions:

Wald tests tend to be anti-conservative for both cluster-level and unit-level predictors.

The anti-conservatism is serious for 10-20 clusters but is still a problem up to 40 clusters or so.

The anti-conservatism is worse for unequal cluster sizes than for equal cluster sizes.

According to other results (not shown), the anti-conservatism is more extreme for composite hypotheses (testing more than 1 parameter) and worsens as the number of parameters tested increases.
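The small-sample behavior can be reproduced qualitatively with a much simpler Monte Carlo experiment than the one reported above. The sketch below is a simplified analogue, not the study's design: it uses a linear model with one binary cluster-level predictor, equal cluster sizes, and an exchangeable correlation of 0.1, rather than logistic regression with a "dental" structure. For this design the uncorrected sandwich variance of the group difference reduces to the sum of squared deviations of cluster means within each group, divided by the squared number of clusters in that group; the function name and settings are assumptions for illustration.

```python
import random

def robust_wald_rejects(n_clusters, m, rng, rho=0.1):
    """Simulate one data set under the null; return True if |z| > 1.96.

    The predictor is a binary cluster-level indicator with no true effect.
    The variance is the uncorrected cluster-robust (sandwich) estimate.
    """
    sd_u, sd_e = rho ** 0.5, (1 - rho) ** 0.5   # total variance 1, ICC rho
    groups = ([], [])
    for i in range(n_clusters):
        u = rng.gauss(0, sd_u)                   # shared cluster effect
        cluster_mean = sum(u + rng.gauss(0, sd_e) for _ in range(m)) / m
        groups[i % 2].append(cluster_mean)

    means = [sum(g) / len(g) for g in groups]
    estimate = means[1] - means[0]
    variance = sum(
        sum((z - g_bar) ** 2 for z in g) / len(g) ** 2
        for g, g_bar in zip(groups, means)
    )
    return abs(estimate) > 1.96 * variance ** 0.5

def type_one_error(n_clusters, m=4, reps=4000, seed=0):
    """Proportion of null data sets in which the nominal .05 test rejects."""
    rng = random.Random(seed)
    hits = sum(robust_wald_rejects(n_clusters, m, rng) for _ in range(reps))
    return hits / reps

rate_10 = type_one_error(10)
rate_40 = type_one_error(40)
print("10 clusters:", rate_10)
print("40 clusters:", rate_40)
```

The rejection rate with 10 clusters should be clearly above the nominal .05, and the rate with 40 clusters closer to it, which is the same pattern as in the table.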

Generalized Estimating Equations (GEE)

GEE incorporates weight matrices patterned after the suspected correlation structure in order to increase the precision of the regression coefficient estimates.
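For reference, the estimating equation can be written out as follows. The notation is an assumption borrowed from the usual presentation of Liang and Zeger (1986); these notes do not define the symbols themselves. Here $y_i$ is the response vector for cluster $i$, $\mu_i(\beta)$ its mean, $D_i = \partial \mu_i / \partial \beta$, $A_i$ the diagonal matrix of variance functions, $\phi$ a scale parameter, and $R_i(\alpha)$ the working correlation matrix.

```latex
\sum_{i=1}^{K} D_i^{\top} V_i^{-1} \bigl( y_i - \mu_i(\beta) \bigr) = 0,
\qquad
V_i = \phi \, A_i^{1/2} R_i(\alpha) A_i^{1/2}

% Two common working correlation choices for a cluster of size m_i:
R_i = I_{m_i} \quad \text{(independent)},
\qquad
R_i(\alpha) = (1 - \alpha) I_{m_i} + \alpha \, \mathbf{1}\mathbf{1}^{\top} \quad \text{(exchangeable)}
```

The inverse $V_i^{-1}$ is the weight matrix referred to above. With the independent choice the equation reduces to the usual estimating equation that ignores the correlation.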

Example (Potthoff-Roy data): Compare independent and exchangeable correlation structures.

Variable / Indep. Estimate / Indep. Naïve SE / Indep. Robust SE / Exch. Estimate / Exch. Naïve SE / Exch. Robust SE
Intercept / 22.6 / 0.34 / 0.61 / 22.6 / 0.57 / 0.61
Male / 2.32 / 0.44 / 0.75 / 2.32 / 0.74 / 0.75
Age / 0.48 / 0.15 / 0.06 / 0.48 / 0.10 / 0.06
Male*Age / 0.30 / 0.20 / 0.12 / 0.30 / 0.12 / 0.12

What are the differences between Independent and Exchangeable analyses? Why do some things agree between the 2 analyses, while others differ?

Let’s try creating some “missing data” to see what happens. Delete the last 2 observations from the first 5 subjects to create imbalance. New results:

Variable / Indep. Estimate / Indep. Naïve SE / Indep. Robust SE / Exch. Estimate / Exch. Naïve SE / Exch. Robust SE
Intercept / 22.5 / 0.41 / 0.76 / 22.7 / 0.62 / 0.63
Male / 2.46 / 0.51 / 0.88 / 2.28 / 0.80 / 0.77
Age / 0.53 / 0.18 / 0.11 / 0.46 / 0.12 / 0.06
Male*Age / 0.25 / 0.23 / 0.15 / 0.33 / 0.14 / 0.12

Now explain these results. What changed regarding the comparison between Independent and Exchangeable analyses? Why?
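After attempting the questions, the comparison can be explored numerically with a simplified sketch. It is not a full GEE fit: the working correlation parameter is held fixed rather than estimated, the model is linear with only an intercept and centered age, and the data are simulated (hypothetical values, not the Potthoff-Roy measurements). The function names are invented for illustration. Setting rho = 0 gives the independence analysis.

```python
import random

def exch_quad(u, v, rho):
    """Compute u' R^{-1} v for exchangeable R = (1 - rho) I + rho J."""
    m = len(u)
    c = rho / (1 + (m - 1) * rho)
    return (sum(p * q for p, q in zip(u, v)) - c * sum(u) * sum(v)) / (1 - rho)

def fit_fixed_exchangeable(clusters, rho):
    """Solve the linear estimating equation with a fixed working correlation.

    clusters: list of (ages, responses) pairs, one pair per subject.
    Returns (intercept, slope). rho = 0 reproduces ordinary least squares.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for t, y in clusters:
        ones = [1.0] * len(t)
        a11 += exch_quad(ones, ones, rho)
        a12 += exch_quad(ones, t, rho)
        a22 += exch_quad(t, t, rho)
        b1 += exch_quad(ones, y, rho)
        b2 += exch_quad(t, y, rho)
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Simulated balanced data: 27 subjects, ages centered at 11.
rng = random.Random(1)
ages = [-3.0, -1.0, 1.0, 3.0]
balanced = []
for _ in range(27):
    u = rng.gauss(0, 2)                      # subject-specific level
    y = [24.0 + 0.6 * t + u + rng.gauss(0, 1) for t in ages]
    balanced.append((ages, y))

# Create imbalance: drop the last 2 observations from the first 5 subjects.
unbalanced = [(t[:2], y[:2]) if i < 5 else (t, y)
              for i, (t, y) in enumerate(balanced)]

bal_ind = fit_fixed_exchangeable(balanced, 0.0)
bal_exch = fit_fixed_exchangeable(balanced, 0.6)
unb_ind = fit_fixed_exchangeable(unbalanced, 0.0)
unb_exch = fit_fixed_exchangeable(unbalanced, 0.6)
print("balanced:  ", bal_ind, bal_exch)
print("unbalanced:", unb_ind, unb_exch)
```

In the balanced case the two sets of estimates coincide up to rounding error, whereas with unequal cluster sizes the weighting changes the estimates, which is the pattern seen in the two tables above.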

References

Liang and Zeger (1986) Biometrika (robust variance estimates, GEE)

Mancl and Leroux (1996) Biometrics (comparisons of working correlation matrices)

Mancl and DeRouen (2001) Biometrics (small sample properties of GEE)

Potthoff and Roy (1964) Biometrika (random effects models for facial measurement data)