SUPPLEMENTARY MATERIAL

GWAS with longitudinal phenotypes –performance of the approximate procedures

Karolina Sikorska, NahidMostafaviMontazeri, AndréUitterlinden, Fernando Rivadeneira, Paul H.C Eilers, and Emmanuel Lesaffre

  1. Appendix

Linear mixed model

The ML estimator of β in model (1) for balanced data set can be expressed as,

With , the design matrix for the th subject consisting of the columns 1, , and equals . Using Woodbury equation, can be rewritten for model (2) as,

Where is the vector of time points, and By inserting equation (18) in equation (17) the estimate of interaction term in model (2), can be obtained as:

Likewise by inserting in the variance-covariance matrix of LMM, i.e. the variance of the interaction estimator can be written as:

Two-step approach

We can reformulate model (2) into

(19)

with

We call model (19) the reduced model and its fitting constitutes the first step in the two-step procedure. The second step regresses the estimated on SNPs with simple regression model

The MLE of can be expressed as:

where is the best linear unbiased predictor (BLUP) and can be computed by the empirical Bayesian approach as:

and where is the ML estimator of the vector of fixed effects of reduced model (19). Likewise by inserting and in equation (21) the BLUP for model (19), can be obtained as:

where and By inserting the two above equations into equation (20) the estimate of can be derived as:

Note that in the above derivations we have used the assumption that the covariance of the random intercept and slope is zero. We have argued in the text when this assumption is reasonable.

Conditional two-step approach

The transformed matrix (i.e. ) which was introduced in CLMM can be defined as, by using the Gram Schmidt process, where In the second step of the conditional two-step approach (i.e. model (11)), the ML estimator of can be expressed as:

Where is the best linear unbiased predictor (BLUP) and can be computed by empirical Bayesian approach as,

where is the MLE of the reduced model (10), and are the transformed design matrix of fixed and random effects, respectively. Likewise by using the Woodbury equation, can be obtained as,

where By inserting and into equation (24) the BLUP for model (10) can be obtained as:

with and By inserting the two above equations into the equation (23), the ML estimator of SNP effect in model (12) can be derived as:

In the second step of the conditional two-step approach, model (11), variance of the can be expressed as:

where with respect to the variance-covariance matrix of LMM, i.e. can be expressed as:

By inserting former equation into the equation (25), the variance of estimator of SNP effect can be derived as:

  1. Probit regression to estimate power

In the mixed model framework the distributions of the test statistics for the Wald, t-and F-tests are generally known only under the null hypothesis 1. Exhaustive simulations are considered the most accurate method to compute statistical power. However fitting many LMMs is time consuming and so is the simulation-based estimation of the power curve. The effect and parameters are computed repetitively for a grid of values. The proportion of times that a SNP is qualified as significant gives the empirical type I error rate (simulation under null hypothesis) and power (simulation under alternative hypothesis). For an exhaustive simulation study this approach demands a discouraging amount of computation time.

We propose a faster way for power calculations based on the probit model. Helms 2 demonstrated via simulations that the distribution of the general F test

versus

under the alternative can be approximated by a noncentral F-distribution with noncentrality parameter δ given by

.

From this it follows that the t-test for testing

versus ,

underhas a noncentral t-distribution with noncentrality parameter 3. In GWAS situations the number of degrees of freedom is large and the (non-central) t-distribution can be approximated by a normal distribution. Consequently, the power curve plotting the effect size versus the statistical power has approximately the shape of the cumulative normal distribution function. We observed the same shape for the approximate procedures. In our simulations, a grid of equally spaced -values is chosen on the interval , where is the smallest value for which the power is practically 100%. Thousand grid values were chosen and for each value one data set was simulated and the considered models were fitted. The obtained p-values can be dichotomized according to the condition p < 0.05,giving . Finally, a probit model was fittedto .

  1. Discussion on the power loss of the CTS approach

We observed that the behavior of the CTS is quite similar to the CLMM. That there is some loss in power of the CLMM approach might be surprising given the results in Section 4.2 of 4. In that section it is argued that the CLMM implies no loss of information from a Bayesian viewpoint. However, this result is based on the assumption that the random intercept has a flat prior, while we have taken the classical assumption of joint normality for the random intercept and slope. That in the balanced case there is (basically) no power loss and in the unbalanced case there must be in general a power loss can be seen from the following reasoning.

The results in Section 4.3 of the same paper applied to the current simplified situation results in:

withy the stacked vector of responses, () the stacked vector random intercepts (slope), the stacked vector of values and the stacked vector of profile means. Now for each of the profile means the following result holds:

with the average time for the th subject. In the balanced case, one can change the time origin such that without changing anything on the estimation of the longitudinal part of the model. This implies that there is no information on anymore in the second part of the likelihood (part of ). That a minimal loss of information for the CTS approach was seen in some of the simulations for the balanced case has to do with the estimation of the variance parameters. Indeed the variance parameters of the LMM are present in both parts of the likelihood and must therefore be estimated better with the LMM than with any of the two parts separately. In the unbalanced case, no change in origin can remove from that part, however. Hence, a loss of power is expected with the CLMM and hence also with the CTS approach, but the loss of power is often minimal as seen in the simulations.

  1. Supplementary Figures

Supplementary Figure 1: MAR case, scenario 5. Approximation of the CTS approach compared to the CLMM.

Supplemantary Figure 2: Flowchart describing practical use of the CTS approach

Supplementary Figure 3: Time needed to analyze 1 million of SNPs using the CTS approach combined with the semi-parallel regression (left panel). Computation time ratio between the function lmer and the CTS approach combined with the semi-parallel regression depending on the number of longitudinal observations (right panel).

Supplementary Figure 4: 100 SNPs from the BMD data. On the x-axis the p-values for the SNP x time interaction effect from the mixed model assuming uncorrelated errors; on the y-axis the corresponding p-values from the model assuming continuous autoregressive structure of measurement error.

Suppementary Figure 5: Balanced case. Performance of the approximate procedures when a time-varying covariate is included to the linear mixed model.

Suppementary Figure 6: Unbalanced case. Performance of the approximate procedures when a time-varying covariate is included to the linear mixed model.

  1. R codes – an example of applying the CTS approach

The data are arranged in a so-called “long format”, with one row per observation. The SNP data are stored in a matrix S with N rows and ns columns. The size of ns depends on the available RAM. The first few rows of the phenotype data (mydata) look as follows:

id / y / Time
1 / 1.12 / 1
1 / 1.14 / 2
1 / 1.16 / 3
1 / 1.2 / 4
1 / 1.26 / 5
2 / 0.95 / 1
2 / 0.83 / 2
2 / 0.65 / 3
2 / 0.49 / 4
2 / 0.34 / 5

The code below, with function cond, transforms data for conditional linear mixed model. It is based on the SAS macro provided in Verbeke et al. in “Conditional linear mixed models” (2001). Variable “vars” is a vector with names of response and all time-varying covariates that should be transformed.

cond = function(data, vars) {

data = data[order(data$id), ]

### delete missing observations

data1 = data[!is.na(data$y), ]

## do the transformations

ids = unique(data1$id)

transdata = NULL

for(i in ids) {

xi = data1[data1$id == i, vars]

xi = as.matrix(xi)

if(nrow(xi) > 1) {

A = cumsum(rep(1, nrow(xi)))

A1 = poly(A, degree = length(A)-1)

transxi = t(A1) %*% xi

transxi = cbind(i, transxi)

transdata = rbind(transdata, transxi)

}

}

transdata = as.data.frame(transdata)

names(transdata) = c("id", vars)

row.names(transdata) = 1:nrow(transdata)

return(transdata)

}

The code below applies the conditional two-step approach. First, the data are transformed using function cond. Next, the reduced conditional linear mixed model is fit and the random slopes are extracted. Finally, the semi-parallel regression is performed.

# transform data for the conditional linear mixed model

trdata = cond(mydata, vars = c("Time", "y"))

#fit the reduced model and extract random slopes

mod2 = lmer(y ˜ Time - 1 + (Time-1|id), data = trdata)

blups = ranef(mod2)$id

blups = as.numeric(blups[ , 1])

# perform the second step using semi-parallel regression

X = matrix(1, n, 1)

U1 = crossprod(X, blups)

U2 = solve(crossprod(X), U1)

ytr = blups - X %*% U2

ns = ncol(S)

U3 = crossprod(X, S)

U4 = solve(crossprod(X), U3)

Str = S - X %*% U4

Str2 = colSums(Str ˆ 2)

b = as.vector(crossprod(ytr, Str) / Str2)

sig = (sum(ytr ˆ 2) - b ˆ 2 * Str2) / (n - 2)

err = sqrt(sig * (1 / Str2))

p = 2 * pnorm(-abs(b / err))

References

1 .G. Verbeke, G. Molenberghs.Linear Mixed Models for Longitudinal Data. New York, Springer, 2009.

2. R.W. Helms. Intentionally incomplete longitudinal designs: I. methodology and comparison of some full span designs. Statistics in Medicine 1992, 11(14-15):1889–1913.

3. R.C. Littell. SAS for Mixed Models.North Carolina, SAS institute, 2006.

4. G. Verbeke, B. Spiessens, E. Lesaffre. Conditional linear mixed models.The American Statistician 2001, 55(1):25–34.

1