Correlation in repeated measures designs

The story

Data by Ossama Dimassi (FG Valle-Zarate)

  • 14 goats
  • Each goats measured at 7 dates (months)
  • two traits (percentage of cheese yield, percentage of protein)

Objective: Assess correlation of percentage cheese yield and percentage protein yield

Dahlem Cashmere (DC) is a multipurpose goat breed developed at the end of the 1980s at the Technical University of Berlin, based on crosses between Angora and dairy goats, in particular the German White dairy goat with some influence of the German Fawn and Anglo Nubian. Along with the valuable cashmere wool, DC is used for meat and milk production. Empirical results indicate that milk of DC goats has superior processing properties compared to milk of other dairy goats conventionally kept in Germany. In order to assess the technological potential of milk of Dahlem Cashmere (DC) goats individual milk samples from two groups of 5 DC goats (at 2nd and 3rd lactation) and one group of 5 German Fawn dairy goats (GF) at 2nd lactation, were taken fortnightly over lactation periods of 32 and 28 weeks, respectively. Along with the main components (protein, casein, fat) cheese yield was measured. Significant differences have been detected between the different breeds thus the next step is to try and at least partly explain this variation by milk components level, one of which is protein, a quantitative variable. The simple correlation, however, will not account for the complex design structure and the repeated measures nature of the data.

Exploring the correlation structure

We have different options for looking at the correlation, e.g.,

  • use all data, ignoring the structure (goats, months) (Fig. 1)
  • compute means across months and correlate goat means (Fig. 2)
  • compute means across goats and correlate month means (Fig. 3)
  • look at each month separately; within-month = between-subject correlation (Fig. 4)
  • look at each goat separately; within-goat = within-subject correlation (Fig. 5)

The resulting correlation coefficients are not the same, and some are quite dramatically different, so the question arises which correlation is the correct one.

Obviously, there is structure in the data (months, goats), which needs to be accounted for. In order to partition the correlation according to the factors month and goat, a factorial model is needed.

Fig. 1: Plot of cheese yield [%] vs. protein [%] for repeated measurements ignoring goats and months (r = 0.85).

Fig. 2: Plot of cheese yield [%] vs. protein [%] for goat means across months (r = 0.96).

Fig. 3: Plot of cheese cheese yield [%] vs. protein [%] for month means across goats (r = 0.65).

Fig. 4: Plot of cheese yield [%] vs. protein [%] for goats at month=7 (r = 0.93); within-month analysis.

Fig. 5: Plot of cheese yield [%] vs. protein [%] across months for goat = 124 (r = -0.05, n.s.); within-goat analysis.

Modelling a single trait

To develop a model, it is useful to look at a single trait first and then extend to two (or more) traits.

yit =  + mt + gi + eit(1)

where

yit = trait value for i-th goat at t-th month

 = general mean

mt = effect of t-th month

gi = effect of i-th goat

eit = residual

If goats are a random sample, then gi is a random effect.

Residuals eit from the same goat are potentially correlated due to repeated measurements on the same animal. Here, we use an AR(1) model, which states that errors on the same goat are correlated by

Corr(eit, eit’) = |t-t’|(2)

Thus, the correlation decays with distance in time. Also, errors of different goats are uncorrelated.

The model may be modified to account for a possible linear time trend using

yit =  + gi + xt + ixt + eit(3)

intercept slope

of i-th goat

where xt is the time in months. If the trend is nonlinear, a simple extension is to add quadratic and cubic terms, if necessary, The lack-of-fit of the mean regression () can be tested by adding the month effect mt. It is important to fit mt after  and to use a Type I analysis. The lack-of-fit of the goat-specific regression (i) cannot be tested, even with replicate data per goat and month. The reason for this is that the random term, eit, which models random fluctuation around a smooth trend, would be counfounded with a goat-specific lack-of-fit effect.

For performing the regression, it is necessary to define the quantitative variable xt within a datastep. Here, this variable is simply coded as "t". In addition, a "month" effect ((i) is fitted to test the lack-of-fit. The following analysis is done taking goats in (3) as fixed and using an AR(1) correlation-model for the residual eit.

Protein yield (percentage):

Type 1 Tests of Fixed Effects

Num Den

Effect DF DF F Value Pr > F

Goat 15 15.3 41.18 <.0001

t 1 23.1 0.10 0.7558

month 5 54.7 1.48 0.2126

t*Goat 15 23.2 1.83 0.0924

Cheese yield (percentage):

Type 1 Tests of Fixed Effects

Num Den

Effect DF DF F Value Pr > F

Goat 15 4.96 83.92 <.0001

t 1 7.64 36.03 0.0004

t*t 1 13.2 21.26 0.0005

month 4 38.5 1.25 0.3077

t*Goat 15 7.09 1.55 0.2861

t*t*Goat 14 12.5 0.67 0.7684

For both traits there is no indication of heterogeneity among goats in the time trend. For protein there is no overall trend at all, while for cheese yield the trend is quadratic. Thus, we would select the following models:

Protein yield (percentage):

yit =  + gi + eit(4)

Cheese yield (percentage):

yit =  + gi + 1xt + 2x2t + eit(5)

SAS code:

procmixeddata=d method=reml;

where trait='cheesey';

class month goat;

model y=goat t t*t month goat*t goat*t*t/ddfm=kr solutionhtype=1;

repeated month/sub=goat type=ar(1);

run;

Joint analysis for both traits

In order to come up with a joint analysis for both traits, it will convenient to amalgamate models (4) and (5). To do so, it will be convenient to introduce a dummy variable identifying the trait with the more general model (5). Indexing the trait by j, the model is

yijt = j + gij + 1jwjxt + 2jwjx2t + eijt(6)

where

wj = 0 for protein (j = 1)

wj = 1 for cheese (j = 2)

The model has two random effects: the between-goat effect gij and the within-goat effect eijt. Both of these effects will be correlated among traits. For both effects, the correlation of effects pertaining to different animals is zero. The between-goat correlation is

corr(gi1, gi2) = g

For the pair of random effects (gi1, gi2), we impose an unstructured variance-covariance matrix:

var(gi1, gi2) = g

The within-goat correlation, defined for a fixed point in time and for different traits, can be defined as

corr(ei1t, ei2t) = e

For the pair of random effects (ei1t, ei2t), we impose an unstructured variance-covariance matrix:

var(ei1t, ei2t) = e

To complete the within-goats error model, we need to account for serial correlation among observations on the same trait at different points in time. Assuming an AR(1) model, the serial correlation for the same trait at different points in time takes the form

Corr(eijt, eijt’) = s|t-t’|

where s is the serial correlation. The variance-covariance matrix for a vector of errors on the same goat and the same trait can be denoted as

var(eij1, eij2, …, eij7) = s

Now what about two observations on different traits and different points in time? A parsimonious way of modelling this is given by the direct product of within-goat correlation and serial correlation:

Corr(eijt, eij’t’) = es|t-t’|

For the while error vector, sorted by trait and time, the variance-covariance matrix is

 = es

where  is the direct product operator (Kronecker product). This model can be fitted in MIXED using the TYPE=UN@AR(1) option in the REPEATED statement.

SAS code:

procmixeddata=d;

class trait month Goat;

model y=trait w*trait*t w*trait*t*t;

random trait/subject=Goat type=un;

repeated trait month/sub=goat type=un@ar(1);

run;

Three components of between-trait correlation

(1)Between-goat correlation (g)

(2)Within-goat correlation (e)

(3)Time-trend related correlation

The first two components have already been defined as per the joint model (xx). To illustrate the meaning of (3), assume that gij and eijt have zero variance (are absent) and that there is a time trend in both traits. For example, if both cheese yield and protein yield percentages increase in time, this will induce a positive correlation when plotting cheese versus protein yield percentages from different points in time.

In the present case, there is no sigificant trend in protein, so the third component of correlation can be ignored here. But it is stressed that (3) requires attention when there is a time trend in both traits.

It is useful to define a between-within-goats correlation as

bw = corr(gi1 + ei1t, gi2 + ei2t)

This correlation refers to the correlation of both traits on a randomly selected goat i at a common point in time t.

1

Results

The fit of model (xx) is as follows:

Covariance Parameter Estimates

Cov Parm Subject Estimate

UN(1,1) Goat 0.2935

UN(2,1) Goat 0.1837

UN(2,2) Goat 0.1191

trait UN(1,1) Goat 0.03376

UN(2,1) Goat 0.004966

UN(2,2) Goat 0.02596

month AR(1) Goat 0.05760

Fit Statistics

-2 Res Log Likelihood -73.5

AIC (smaller is better) -59.5

AICC (smaller is better) -58.9

BIC (smaller is better) -54.1

The between-goat correlation is 0.1837/(0.2935*0.1191) = 0.9825.

The within-goat-correlation is 0.004966/(0.03376*0.02596) = 0.1677.

[The serial correlation equals 0.0576]

Thus, the between-goat correlation is much more important than the within-goat correlation. The between-within correlation is

(0.1837+0.004966)/[(0.2935+0.03376)*(0.1191+0.02596)] = 0.8659

Obviously, the between-goat correlation dominates the between-within-correlation, which is a result of the larger between-goats variances.

To test the significance of both correlations, we fit models with g = 0 and with e = 0 are record the value of twice the residual log-likelihood. Under the null hypothesis, the difference in twice the log-likelihood between (i) a model with the correlation set to zero and (ii) a model with the correlation allowed to vary, has a chi-squared distribution with one d.f. Thus, the difference must exceed a value of 3.84 to be significant.

g = 0:

Fit Statistics

-2 Res Log Likelihood -35.6

AIC (smaller is better) -23.6

AICC (smaller is better) -23.2

BIC (smaller is better) -19.0

The difference to the full model is –35.6 + 73.5 = 37.9 > 3.84, which is highly significant. Thus, the between-goat correlation is highly significant.

e = 0:

Fit Statistics

-2 Res Log Likelihood -71.1

AIC (smaller is better) -59.1

AICC (smaller is better) -58.6

BIC (smaller is better) -54.4

The difference to the full model is –71.1 + 73.5 = 2.4 < 3.84, which is not significant. Thus, the within-goat correlation is not significant.

SAS-code for g = 0:

procmixeddata=d;

class trait month Goat;

model y=trait w*t*trait w*t*t*trait/ddfm=kr solutionnoint;

random trait/subject=Goat type=un;

repeated trait month/sub=goat type=un@ar(1);

parms (1)(0)(1)(1)(0)(1)(0.5)/hold=2;

run;

SAS-code for e = 0:

procmixeddata=d;

class trait month Goat;

model y=trait w*t*trait w*t*t*trait/ddfm=kr solutionnoint;

random trait/subject=Goat type=un;

repeated trait month/sub=goat type=un@ar(1);

parms (1)(0)(1)(1)(0)(1)(0.5)/hold=5;

run;

Alternatively, oue may fit the foll model and add the option COVTEST to the PROC MIXED line. This will invoke a Wald-test for the covariance parameters. The test is appropriate only for the two covariances [UN(2,1) in the output] and the serial correlation [AR(1)]. It is NOT appropriate for the variance components [UN(1,1) and UN(2,2)]. Generally, likelihood-ratio tests tend to be more reliable than Wald-tests, so I recommend the former. The results below show that the between-goats covariance (and thus correlation) is significant, while the within-goats covariance (correlation is not significant.

The Mixed Procedure

Covariance Parameter Estimates

Standard Z

Cov Parm Subject Estimate Error Value Pr Z

UN(1,1) Goat 0.2935 0.1094 2.68 0.0037

UN(2,1) Goat 0.1837 0.06895 2.66 0.0077

UN(2,2) Goat 0.1191 0.04528 2.63 0.0043

trait UN(1,1) Goat 0.03376 0.005392 6.26 <.0001

UN(2,1) Goat 0.004966 0.003244 1.53 0.1258

UN(2,2) Goat 0.02596 0.003928 6.61 <.0001

month AR(1) Goat 0.05760 0.09233 0.62 0.5327

SAS code for Wald-Tests of covariances:

procmixeddata=d covtest;

class trait month Goat;

model y=trait w*trait*t w*trait*t*t;

random trait/subject=Goat type=un;

repeated trait month/sub=goat type=un@ar(1);

run;

Finally, if an overall test of the between-within correlation is desired, we can compare the full model to the one with both correlations set to 0. The LR-statistic is then compared against chi-squared with 2 d.f., which has a critical value of 5,99 at the 5% level of significance.

SAS-code for e = 0 and g = 0:

procmixeddata=d;

class trait month Goat;

model y=trait w*t*trait w*t*t*trait/ddfm=kr solutionnoint;

random trait/subject=Goat type=un;

repeated trait month/sub=goat type=un@ar(1);

parms (1)(0)(1)(1)(0)(1)(0.5)/hold= 2,5;

run;

Output:

Fit Statistics

-2 Res Log Likelihood -11.5

AIC (smaller is better) -1.5

AICC (smaller is better) -1.2

BIC (smaller is better) 2.3

The difference to the full model is –11.5 + 73.5 = 62.0 > 5.99, which is highly significant.

A p-value for the LR-statistic can be computed as follows:

data;

chi=62;

df=2;

p_value=1-probchi(chi,df);

procprint; run;

Output:

Obs chi df p_value

1 62 2 3.4417E-14

Thus, the p-value is p = 3.4417*10-14 < 0.001 which is highly significant.

Convergence problems: Occasionaly, MIXED does not converge for these types of model. There are many ways to tackle convergence problems. Outlying observations (typos) may be one reason for lack of convergence. Also, it is helpful to scale the data so that both traits have about the same mean. This may be effected by multiplication of one trait by a constant factor. Note that rescaling will not affect the correlation estimate. Finally, if none of these tricks work, the AR(1) model may not be adequate. An alternative model is the compound symmetry model, which states that the serial correlation is constand for any time lag:

Corr(eit, eit’) =  for any tt’(7)

The model can be fitted using TYPE=UN@CS in place of TYPE=UN@AR(1) in the REPEATED statement.

SAS-code for reading the data:

Data d;

Input Goat month trait$y;

w=0; if trait='cheesey'then w=1;

t=month;

Cards;

1241cheesey4.30

1241protein3.57

1242cheesey4.25

1242protein3.64

1243cheesey4.35

1243protein3.28

1244cheesey4.10

1244protein3.60

1245cheesey3.86

1245protein3.77

1246cheesey4.33

1246protein3.82

1247cheesey4.42

1247protein3.96

1261cheesey4.48

1261protein3.95

1262cheesey4.36

1262protein3.72

1263cheesey4.53

1263protein3.79

1264cheesey4.23

1264protein3.70

1265cheesey4.08

1265protein4.05

1266cheesey4.63

1266protein3.81

1267cheesey4.34

1267protein3.97

1281cheesey4.95

1281protein3.92

1282cheesey4.46

1282protein3.75

1283cheesey4.67

1283protein3.18

1284cheesey4.30

1284protein3.69

1285cheesey4.20

1285protein3.69

1286cheesey4.27

1286protein3.71

1287cheesey4.14

1287protein3.83

1301cheesey4.92

1301protein3.88

1302cheesey4.32

1302protein3.95

1311cheesey4.44

1311protein3.62

1312cheesey5.14

1312protein3.81

1313cheesey4.62

1313protein3.30

1314cheesey4.67

1314protein3.71

1315cheesey4.59

1315protein3.83

1316cheesey4.51

1316protein3.87

1317cheesey4.72

1317protein3.77

1341cheesey3.89

1341protein3.24

1342cheesey3.52

1342protein3.30

1343cheesey3.62

1343protein3.08

1344cheesey3.62

1344protein3.14

1345cheesey3.45

1345protein3.09

1346cheesey3.63

1346protein3.24

1347cheesey3.45

1347protein3.06

1401cheesey3.87

1401protein3.36

1402cheesey4.16

1402protein3.52

1403cheesey3.81

1403protein3.23

1404cheesey3.94

1404protein3.35

1405cheesey3.63

1405protein3.23

1406cheesey3.72

1406protein3.34

1407cheesey3.89

1407protein3.37

1431cheesey4.40

1431protein3.59

1432cheesey4.45

1432protein3.76

1433cheesey4.25

1433protein3.72

1434cheesey3.64

1434protein3.71

1435cheesey4.20

1435protein3.61

1436cheesey4.37

1436protein3.92

1451cheesey4.64

1451protein3.88

1452cheesey4.47

1452protein3.65

1453cheesey4.45

1453protein3.87

1454cheesey3.88

1454protein3.75

1455cheesey4.20

1455protein3.76

1456cheesey4.27

1456protein3.86

1457cheesey4.34

1457protein3.76

1471cheesey4.11

1471protein3.38

1472cheesey4.45

1472protein3.32

1473cheesey4.02

1473protein3.46

1474cheesey4.01

1474protein3.54

1475cheesey3.63

1475protein3.27

1476cheesey3.90

1476protein3.41

1477cheesey3.94

1477protein3.28

1571cheesey4.24

1571protein3.36

1572cheesey4.04

1572protein3.34

1573cheesey3.96

1573protein2.82

1574cheesey3.72

1574protein3.27

1575cheesey3.64

1575protein3.35

1576cheesey4.00

1576protein3.57

1577cheesey3.78

1577protein3.50

1831cheesey3.19

1831protein2.84

1832cheesey3.02

1832protein2.71

1833cheesey3.01

1833protein3.15

1834cheesey2.97

1834protein2.69

1835cheesey2.98

1835protein2.79

1836cheesey2.98

1836protein2.72

1837cheesey3.00

1837protein2.79

1841cheesey3.60

1841protein3.19

1842cheesey3.18

1842protein2.87

1843cheesey3.10

1843protein3.25

1844cheesey2.99

1844protein2.76

1845cheesey2.89

1845protein2.73

1846cheesey2.84

1846protein2.76

1847cheesey2.75

1847protein2.70

1851cheesey3.63

1851protein3.11

1852cheesey3.51

1852protein3.35

1853cheesey3.37

1853protein3.17

1854cheesey3.12

1854protein3.11

1855cheesey3.00

1855protein2.86

1856cheesey3.16

1856protein2.94

1857cheesey3.54

1857protein3.10

1861cheesey3.33

1861protein2.99

1862cheesey3.10

1862protein2.90

1863cheesey3.11

1863protein3.13

1864cheesey2.97

1864protein2.89

1865cheesey2.92

1865protein2.96

1866cheesey2.96

1866protein2.97

1867cheesey2.94

1867protein3.03

1871cheesey4.02

1871protein3.42

1872cheesey3.68

1872protein3.11

1873cheesey3.49

1873protein3.30

1874cheesey4.08

1874protein3.29

1875cheesey3.77

1875protein3.25

1876cheesey3.56

1876protein3.33

1877cheesey3.97

1877protein3.36

;

1