Bivariate linear mixed models using SAS proc MIXED
Rodolphe Thiébauta*, Hélène Jacqmin-Gaddaa, Geneviève Chênea, Catherine Leportb, Daniel Commengesa
a INSERM Unité 330, ISPED, Université Victor Segalen Bordeaux II, 146, rue Léo Saignat 33076, Bordeaux Cedex, France
b Hôpital Bichat Claude Bernard, Paris, France
Abstract
Bivariate linear mixed models are useful when analyzing longitudinal data of two associated markers. In this paper, we present a bivariate linear mixed model including random effects or first-order auto-regressive process and independent measurement error for both markers. Codes and tricks to fit these models using SAS Proc MIXED are provided. Limitations of this program are discussed and an example in the field of HIV infection is shown. Despite some limitations, SAS Proc MIXED is a useful tool which that may be easily extendable to multivariate response in longitudinal studies.
Keywords: Bivariate random effects model, Bivariate First Order Auto-regressive process, SAS proc MIXED, HIV infection
* Corresponding author. Tel: +33 5 57 57 45 21 Fax: +33 5 56 24 00 81
Email:
1. Introduction
Longitudinal data are often collected in epidemiological studies, especially to study the evolution of biomedical markers. Thus, linear mixed models [1], recently available in standard statistical packages [2, 3], are increasingly used to take into account all available information and deal with the intra-subject correlation.
When several markers are measured repeatedly, longitudinal multivariate models could be used, like in econometrics. However, this extension of univariate models is rarely used in biomedicine although it could be useful to study the joint evolution of biomarkers. Into example, in HIV infection, several markers are available to measure the quantity of virus (plasma viral load noted HIV RNA), the status of immune system (CD4+ T lymphocytes which are a specific target of the virus, CD8+ T lymphocytes) or the inflammation process (b2 microglobuline). These markers are associated as the infection measured by HIV RNA induces inflammation and the destruction of immune cells. Several authors have developed methods to fit evolution of CD4 and CD8 cells [4] or CD4 and b2 microglobuline [5]. Amrick Shah et al. [4] used an EM algorithm to fit a bivariate linear random effects model. Sy et al [5] used the Fisher scoring method to fit a bivariate linear random effects model including an integrated Orstein-Uhlenbeck process (IOU). IOU is a stochastic process which that includes Brownian motion as special limiting case.
Their programs were implemented using IML module of SAS Software [6]. However, their flexibility is not sufficient to allow a large use by researchers not familiar with IML. Also, the EM algorithm chosen is slow.
In this paper, we propose some tricks to use SAS MIXED procedure in order to fit multivariate linear mixed models to multivariate longitudinal gaussian data. SAS MIXED procedure uses Newton-Raphson algorithm known to be faster than the EM algorithm [7]. In section 2 and 3, we present bivariate linear mixed models and the code used in SAS to fit these models. In section 4, we apply these models to study the joint evolution of HIV RNA and CD4+ T lymphocytes in a cohort of HIV-1 infected patients (APROCO) treated with highly active antiretroviral treatment.
2. Model for bivariate longitudinal gaussian data
We define a general bivariate linear mixed model including a random component, a first order auto-regressive process and an independent error.
Let , the response vector for the subject i, with the -vector of measurements of the marker k (k=1, 2) with . If the two markers are independent, we can use the two following models:
where and
where is a design matrix, is a -vector of fixed effects, is a design matrix which is usually a subset of , is a -vector of individual random effects with . is a vector of realization of a first order auto-regressive process with covariance given by and is a identity matrix.
To take into account correlation between both markers, one could use the following bivariate linear mixed model:
with
where , , , and is a -vector of realization of a bivariate first order auto-regressive process and represents independent measurement errors.
The covariance matrix of measurement errors is defined by and (the symbol represents the Kronecker product). The covariance function of the bivariate auto-regressive process is given by with is the process covariance matrix at and B is a matrix such that (i) the eigenvalues of B have negative real parts, and (ii) C and are positive definite symmetric [5]. The covariance matrix of random effects is the matrix . With the assumption that and are mutually independent, it is obvious that.
3. Models using Proc MIXED of SAS software
3.1 Random effects
As described in the documentation [3], multivariate random effects models can be fitted using the statement random and an indicator variable for each marker to define , and .
To add an independent error for each response variable in a multivariate random effect model, one must use the repeated statement with the option GROUP(VAR) where VAR is a binary variable indicating the response variable concerned (VAR=0 for and VAR=1 for into example). This option allows estimation of heterogeneous covariance structure, i.e. the variances of the measurement errors are different for each response variable.
An example of SAS code for a bivariate random effect model with random intercept and random slopes is:
Proc mixed data=BIV;
class CEN_PAT VAR;
model Y=VAR VAR*T;
random VAR VAR*T /type=UN sub=CEN_PAT;
repeated /type=VC grp=VAR sub=CEN_PAT;
run ;
where CEN_PAT is a single identification number of each patient and T is time. In this example, we have and for the measurement j of the subject i. Note that the two markers are independent if .
3.2 First order auto-regressive process
In the repeated statement SAS provides the possibility to fit bivariate models using a Kronecker product notation [8]. For instance, in the bivariate case with 3 repeated measures, the option type=UN@AR(1) in the statement repeated assumes that the covariance matrix has the following structure: . Compared with the general bivariate auto-regressive process defined in the previous section, this structure has two important limitations. First, the covariance structure is a first order auto-regressive process for discrete data and assumes the measures are equally spaced for all subjects and for the two markers. In the univariate case, a continuous time AR(1) model, which allows non equally spaced measures, may be fitted using the structure SP(POW) but this structure is not available for multivariate models. The second limitation is that the SAS program allows to estimate only one correlation parameter for the ‘bivariate process’ rather than a matrix B. Thus, using this formulation, one assumes that the intra-marker correlation is the same for the two markers, i.e. . Moreover, one assumes that inter-marker correlation is proportional to the intra-marker correlation, i.e. . Both markers are independent if the covariance matrix has the form .
To add an independent measurement error for both markers, one must use the option LOCAL(EXP <effects>) which produces exponential local effects, <effects>=VAR being still the indicator variable of response variable. These local effects have the form where U is a full-rank design matrix. PROC MIXED constructs U in terms of 1s and –1s for a classification effect and estimates .
An example of SAS code to fit a bivariate first-order auto-regressive model is:
Proc mixed data=BIV;
class CEN_PAT VAR;
model Y=VAR VAR*T;
repeated VAR /type=UN@AR(1) local=exp(VAR) sub=CEN_PAT;
run ;
where T is the time as a continuous variable and VAR is the indicator variable.
The SAS output contains the following covariance parameters estimates: ‘VAR UN(x,y)’ which correspond to the matrix containing covariance parameter of the auto-regressive process , ‘EXP VAR’ which is the local effect parameter , ‘Residual’ that we noted and a parameter called ‘AR(1)’. From this output, the parameters of the model could be calculated as: , and .
3.3 Incomplete data
Likelihood based inference used by Proc MIXED is valid whenever the mechanism of missing data is ignorable, that is MAR (Missing at Random), i.e. the availability of the measurement do not depend on the true value of the marker at the same time, and the parameters describing the non-response mechanism are distinct from the model parameters [9]. However, using an auto-regressive process, one must be careful when missing data occur. By default, a dropout mechanism is assumed to be responsible of missing data by MIXED procedure: all missing data are considered to occur after the last observed measurement. To take into account for intermittent missing data (one observation missing between two observed), a class variable must be used in the repeated statement indicating the order of observations within a subject. In the following example, the class variable is a copy of the variable time, named ‘Tbis’ :
Proc mixed data=BIV;
class CEN_PAT VAR Tbis;
model Y=VAR VAR*T;
repeated VAR Tbis /type=UN@AR(1) local=exp(VAR) sub=CEN_PAT;
run ;
When the measurements of the two markers never occur at the same time because of a design consideration, auto-regressive process can not be used unlike random effects model.
4. Application
4.1 The APROCO Cohort
The APROCO (ANRS-EP11) cohort is a prospective observational cohort ongoing in 47 clinical centres in France. A total of 1,281 HIV-1-infected patients were enrolled from May 1997 to June 1999 at the initiation of their first highly active antiretroviral therapy containing a protease inhibitor. Standardised clinical and biological data including CD4+ cell counts measurements and plasma HIV RNA quantification were collected at baseline (M0), one month later (M1) and every 4 months (M4-M24) thereafter. In order to ensure sufficient available information, only a sub-sample of patients having both plasma HIV RNA and CD4+ cell counts measurements at M0 and at least two measurements thereafter were included in the analyses. The first measurement after baseline (at one month) was deleted to provide a data set with equally spaced measures. Follow-up data were included until the 24th month,; thus patients had a maximum of 7 measures. The study population and evolution of virological response were described elsewhere [10]. Available information at each study time and description of the evolution of both markers were presented in table 1 and figure 1.
4.2 Modeling
To assure normality and homoskedasticity of residuals distribution, variable response was the change in value of marker at time t since the initial visit, i.e. and .
Fixed effects included a change of slope intensity at time 4 months as suggested in figure 1. Note that we did not include intercept because .
We compared 4 models providing two forms of covariance structure (random effects or auto-regressive process) in two formulations (univariate or bivariate). Univariate and bivariate random effectseffect models were compared using likelihood ratio test as both models waswere nested. The bivariate model had only four covariance parameters in addition. Comparison of random effects versus auto-regressive process were performed using AIC criteria [11]. A general model including random slopes and a bivariate first order auto-regressive process did not converge as reported in univariate cases by others (see [12] for example).
The model including two random slopes and a measurement error for each marker was:
where is the first slope before the time , is the second slope after the time and represents the minimum between and .
Moreover,
The model including an auto-regressive process and a measurement error was:
where
where .
4.3 SAS programming
The initial data set had the following presentation :
CEN_PAT CD4 RNA T
1001 166 -3.02635 4
1001 147 -1.96563 8
1001 171 -1.42426 12
1001 355 -1.07208 16
1001 223 -3.38035 20
1001 52 -2.08382 24
1002 -14 -2.84515 4
1002 -123 -2.84515 8
With CEN_PAT being the patient number, CD4 the difference in CD4 cell count since baseline, RNA the difference in HIV RNA since baseline and T the date of measurement in months. The change in slope intensity at 4 months was computed using a data step:
Data file; set file;
if T<4 then do ; T1=T;T2=0; end ;
if T ge 4 then do; T1=CP;T2=T-4;
end ;
Then, the structure of input data was transformed to allow bivariate modeling. Mainly, it consists in the integration of CD4 and HIV RNA in the same vector (Y here) and an indicator variable (VAR here)
Data var0; set file;
VAR=0;Y=RNA;
keep CEN_PAT Y VAR T T1 T2;
Data var1;set file;
VAR=1;Y=CD4;
keep CEN_PAT Y VAR T T1 T2;
Data biv ;set var0 var1 ;
run ;
Thus, a bivariate random effect model was fitted using the code described below.
Proc mixed data=BIV CL;
class CEN_PAT VAR;
model Y=VAR*T1 VAR*T2/noint s;
random VAR*T1 VAR*T2/type=UN sub=CEN_PAT G GCORR;
repeated /type=VC grp=VAR sub=CEN_PAT;
run ;
The option "CL" requests confidence limits for the covariance parameter estimates. A Satterthwaite approximation is used to construct limits for all parameters that have a default lower boundary constraint of zero. In the statement model, the option "noint" was used to avoid the inclusion of intercepts and "s" to obtain solution for fixed effects.
In the same way, a bivariate model with an auto-regressive process and separate measurement errors was fitted using the following code:
Proc mixed data=BIV CL;
class CEN_PAT VAR T;
model Y=VAR*T1 VAR*T2 /noint s;
repeated VAR T/type=UN@AR(1) local=exp(VAR) sub=CEN_PAT;
run ;
4.4 Results
The bivariate random effects model was significantly better than two separate univariate random effects models (-25194 vs. -25307, likelihood ratio = 226 with 4 degrees of freedom, p<10-4, table 2) showing a strong association between the two markers. The bivariate random effectseffect model allows to estimate the correlation matrix between individual slopes for each marker. In this correlation matrix, every element was significantly (p < 0.05) different from 1 (table 3). Briefly, the highest correlations were between the slopes of the two markers at the same period: ( before 4 months and after 4 months). These results were expected because of biological relation between the two markers. Moreover, the second slope of CD4 cell count was highly correlated to the first slope of the same marker .
The bivariate model including a bivariate auto-regressive process was better than the bivariate random effects model despite the restrictive assumption that the two intra-marker correlations are equal (AIC 50386 vs. 50646).
Output obtained with the model including a first order auto-regressive process provide estimations of and , significantly different from 0 (Wald test, p<10-4). This last result underlines the relationship between the two markers. The parameter is the correlation between two consecutive measures of CD4 cell count or HIV RNA. Variances of measurement error are calculated as: and .
Thus, the relationship between the two makers was underlined by the correlation between the markers at each period and the improvement of likelihood of the bivariate model compared to two univariate models. Bivariate random effect model offers a direct interpretation of the relationship between the markers without assumption on the dependence of one marker in relation to the other.