Analysis of variance, general balance and large data sets

Roger Payne

Statistics Department, IACR-Rothamsted,

Harpenden, Herts AL5 2JQ

Email:

General balance..

• is a very useful concept for small to medium-sized data sets

• it caters for several sources of variation

• it leads to very efficient algorithms

• & to clear output, test statistics etc

• but what about large to very large data sets...?

General balance..

• fits a mixed model with several error (or block) terms

• total sum of squares partitioned into components known as strata, one for each block term: each stratum contains

• the sum of squares for the treatment terms estimated between the units of that stratum

• a residual representing the random variability of those units
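
As an illustration of this stratum decomposition, the following minimal Python/numpy sketch (the layout, factor names and numbers are all invented for illustration) partitions the total sum of squares of a small randomized-block design into a between-blocks and a within-blocks stratum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented randomized-block layout: 4 blocks x 3 treatments, one plot per combination
blocks = np.repeat(np.arange(4), 3)          # block of each unit
treats = np.tile(np.arange(3), 4)            # treatment of each unit
y = 10 + 0.5 * treats + rng.normal(0, 2, 4)[blocks] + rng.normal(0, 1, 12)

def means_projector(factor):
    """Matrix that replaces each value by the mean of its factor level."""
    Z = (factor[:, None] == np.unique(factor)[None, :]).astype(float)
    return Z @ np.linalg.pinv(Z.T @ Z) @ Z.T

n = y.size
S_mean  = np.full((n, n), 1.0 / n)           # grand-mean projector
S_block = means_projector(blocks)            # block-means projector

# Stratum projectors for the blocks/plots structure: each term minus its margins
Str_blocks = S_block - S_mean                # between-blocks stratum
Str_plots  = np.eye(n) - S_block             # within-blocks stratum

ss = lambda P: float(y @ P @ y)              # sum of squares in a stratum
print("between-blocks stratum SS:", round(ss(Str_blocks), 3))
print("within-blocks stratum SS: ", round(ss(Str_plots), 3))
print("total SS about the mean:  ", round(ss(np.eye(n) - S_mean), 3))
# within each stratum this SS further splits into treatment SS and a residual
```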

General balance: properties..

• block (error) terms mutually orthogonal

• treatment terms mutually orthogonal

• contrasts of each treatment term all have equal efficiency factors in each of the strata where they are estimated

General balance: theory..

• mixed model:

• y = Σ_a Z_a ε_a + X θ

• dispersion structure:

• Var(y) = V = Σ_a ξ_a Š_a

• Š_a known symmetric matrices with

• Š_a Š_b = δ_ab Š_a (i.e. orthogonal)

• Σ_a Š_a = I

General balance: theory..

• random effects model:

• y - E(y) = Σ_a Z_a ε_a

• where E(ε_a) = 0 and Var(ε_a) = σ_a² I

• then

• Var(y) = Σ_a σ_a² Z_a Z_a′

• and if terms orthogonal

• S_a = Z_a (Z_a′ Z_a)⁻ Z_a′ = (1/n_a) Z_a Z_a′ (if equal rep.)

• S_a is the projection operator (form and project means)

• Š_a = S_a ( I - Σ_{b: term b marginal to term a} Š_b )

• ξ_a = n_a σ_a² + Σ_{b: term a marginal to term b} n_b σ_b²
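
The sketch below (Python/numpy, an invented blocks/plots structure) builds the S_a and Š_a operators exactly as defined above and checks the orthogonality and summation properties stated on the previous slide.

```python
import numpy as np

# Invented nested block structure: 4 blocks, 3 plots per block (12 units)
n_units = 12
block = np.repeat(np.arange(4), 3)

def Z(factor):
    """Design (indicator) matrix of a factor."""
    return (factor[:, None] == np.unique(factor)[None, :]).astype(float)

def projector(Zm):
    """S_a = Z_a (Z_a' Z_a)^- Z_a'  (equals (1/n_a) Z_a Z_a' for equal replication)."""
    return Zm @ np.linalg.pinv(Zm.T @ Zm) @ Zm.T

# Block terms in marginality order: mean, blocks, blocks.plots (= units)
S = {"mean":   projector(np.ones((n_units, 1))),
     "blocks": projector(Z(block)),
     "units":  np.eye(n_units)}
margins = {"mean": [], "blocks": ["mean"], "units": ["mean", "blocks"]}

# Š_a = S_a (I - Σ_{b marginal to a} Š_b)
Str = {}
for a in ["mean", "blocks", "units"]:
    Str[a] = S[a] @ (np.eye(n_units) - sum(Str[b] for b in margins[a]))

# Check the claimed properties: Š_a Š_b = δ_ab Š_a and Σ_a Š_a = I
for a in Str:
    for b in Str:
        target = Str[a] if a == b else np.zeros((n_units, n_units))
        assert np.allclose(Str[a] @ Str[b], target)
assert np.allclose(sum(Str.values()), np.eye(n_units))
print("stratum projectors are mutually orthogonal and sum to the identity")
```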

General balance: theory..

• treatment structure

• E(y) = X θ = Σ_i X_i θ_i   ( X = [ X_0 | X_1 | X_2 | ... ] )

• E(y) = τ = T τ   ( T = X (X′X)⁻ X′ )

• treatment terms are orthogonal, so

• E(y) = Σ_i Ť_i τ_i

• T_i = X_i (X_i′ X_i)⁻ X_i′ = (1/n_i) X_i X_i′ (if equal rep.)

• Ť_i = T_i ( I - Σ_{j: term j marginal to term i} Ť_j )
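
A corresponding sketch for the treatment structure (an invented, equally replicated 2 x 3 factorial A*B) constructs T_i and Ť_i by the same marginality recursion and checks that the treatment terms are mutually orthogonal.

```python
import numpy as np

# Invented 2 x 3 factorial treatment structure (A*B), each combination replicated twice
A = np.repeat(np.arange(2), 6)
B = np.tile(np.repeat(np.arange(3), 2), 2)
AB = A * 3 + B                                 # the A.B combinations as a single factor
n = A.size

Z    = lambda f: (f[:, None] == np.unique(f)[None, :]).astype(float)
proj = lambda Zm: Zm @ np.linalg.pinv(Zm.T @ Zm) @ Zm.T   # T_i = X_i (X_i' X_i)^- X_i'

T = {"mean": proj(np.ones((n, 1))),
     "A": proj(Z(A)), "B": proj(Z(B)), "A.B": proj(Z(AB))}
margins = {"mean": [], "A": ["mean"], "B": ["mean"], "A.B": ["mean", "A", "B"]}

# Ť_i = T_i (I - Σ_{j marginal to i} Ť_j)
Tt = {}
for i in ["mean", "A", "B", "A.B"]:
    Tt[i] = T[i] @ (np.eye(n) - sum(Tt[j] for j in margins[i]))

# With equal replication the treatment terms are mutually orthogonal
for i in Tt:
    for j in Tt:
        if i != j:
            assert np.allclose(Tt[i] @ Tt[j], 0)
print("Ť_mean, Ť_A, Ť_B and Ť_A.B are mutually orthogonal")
```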

General balance: theory..

• orthogonal block structure implies

• independent least squares analysis within each stratum

• residual s.s.: (Š_a y - Š_a T τ)′ (Š_a y - Š_a T τ)

• normal equations: T Š_a T τ̂_a = T Š_a y

• final condition

• Ť_i Š_a Ť_j = δ_ij λ_ai Ť_i

• normal equations now: Σ_i λ_ai Ť_i τ̂_ai = Σ_i Ť_i Š_a y

• solved by τ̂_ai = (1/λ_ai) Ť_i Š_a y

• with variance-covariance matrix ξ_a Ť_i / λ_ai

• λ_ai is the efficiency factor of term i, and an eigenvalue of Ť_i Š_a Ť_i
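
To make these efficiency factors concrete, the sketch below (an invented balanced incomplete block design with 3 treatments in 3 blocks of size 2) recovers λ_ai as the common non-zero eigenvalue of Ť_i Š_a Ť_i in each stratum and forms the stratum estimates τ̂_ai = (1/λ_ai) Ť_i Š_a y.

```python
import numpy as np

# Invented balanced incomplete block design: 3 treatments in 3 blocks of size 2
block = np.array([0, 0, 1, 1, 2, 2])
treat = np.array([0, 1, 0, 2, 1, 2])
n = block.size
rng = np.random.default_rng(1)
y = 1.0 * treat + rng.normal(0, 1.0, 3)[block] + rng.normal(0, 0.3, n)

Z    = lambda f: (f[:, None] == np.unique(f)[None, :]).astype(float)
proj = lambda Zm: Zm @ np.linalg.pinv(Zm.T @ Zm) @ Zm.T

S_mean, S_block = proj(np.ones((n, 1))), proj(Z(block))
Str_blocks = S_block - S_mean                 # between-blocks stratum projector
Str_plots  = np.eye(n) - S_block              # within-blocks stratum projector
Tt_treat   = proj(Z(treat)) - S_mean          # Ť for treatments (mean swept out)

for name, Str in [("between blocks", Str_blocks), ("within blocks", Str_plots)]:
    # λ_ai: the common non-zero eigenvalue of Ť_i Š_a Ť_i in this stratum
    eig = np.linalg.eigvalsh(Tt_treat @ Str @ Tt_treat)
    lam = eig[eig > 1e-8]
    # τ̂_ai = (1/λ_ai) Ť_i Š_a y : treatment estimates from this stratum alone
    tau_hat = (1.0 / lam[0]) * Tt_treat @ Str @ y
    effects = [tau_hat[treat == t][0] for t in range(3)]
    print(name, "efficiency factors:", np.round(lam, 3),
          " estimates:", np.round(effects, 3))
```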

Analysis by “sweeps”..

• requires a first-order balance

• all effects of each model term have an equal efficiency factor, at each point where the term is estimated

• (see Wilkinson 1970, Biometrika; Payne & Wilkinson 1977, Applied Statistics)

• similar to general balance, but general balance additionally has

• block (i.e. error) terms mutually orthogonal (note: always true if nested)

• treatment terms mutually orthogonal

• (see Payne & Tobias 1992, Scandinavian J. Stats.)

Analysis by “sweeps”..

• requires a working vector v which

• initially contains the data values

• finally contains the residuals

• terms fitted sequentially: sweeps

• estimate and remove effects of a term i in stratum a by

v^(r+1) = { I - (1/λ_ai) T_i } v^(r)

• each such sweep is then followed by a repeat of the sweeps up to this point (a reanalysis sequence)

• can omit reanalysis sweeps of terms orthogonal to term i (so none are needed if λ_ai = 1, and much simpler with general balance)

• notice

• projection operator Ti simply calculates tables of means

• so no matrix inversion (unless there are covariates)

• see Payne & Wilkinson (1977, Appl.Statist.)
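
The sweep itself needs no matrices at all, only tables of means. The following sketch (an invented, fully orthogonal randomized-block example, so every λ = 1 and no reanalysis sweeps are needed) applies the sweep v ← (I - (1/λ) T_i) v term by term and leaves the residuals in the working vector.

```python
import numpy as np

def term_means(v, factor):
    """T_i v: the table of means of v for factor, spread back to the units
    (factor coded 0..k-1) -- no matrix inversion involved."""
    return (np.bincount(factor, weights=v) / np.bincount(factor))[factor]

def sweep(v, factor, lam=1.0):
    """One sweep: v <- (I - (1/λ) T_i) v."""
    return v - term_means(v, factor) / lam

# Invented orthogonal randomized-block example: every λ = 1, so no reanalysis sweeps
rng = np.random.default_rng(0)
blocks = np.repeat(np.arange(4), 3)
treats = np.tile(np.arange(3), 4)
y = 10 + 0.5 * treats + rng.normal(0, 2, 4)[blocks] + rng.normal(0, 1, 12)

v = y - y.mean()              # sweep out the grand mean
v = sweep(v, blocks)          # sweep out block effects
v = sweep(v, treats)          # sweep out treatment effects (λ = 1 here)
print("residual SS:", round(float(v @ v), 3))
```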

Analysis by “sweeps”..

• with general balance: initial working vector for stratum a calculated by

• S_a ∏_{b<a} (I - S_b) y

• S_a is a pivot (calculate means, and insert into the vector)

• and fitted values for treatment term i in stratum a calculated by

• (1/λ_ai) T_i ∏_{j: λ_aj > 0, j < i} { R_j ( I - (1/λ_aj) T_j ) } S_a ∏_{b<a} (I - S_b) y

• R_i = I if λ_ai = 1

• R_i = S_a ∏_{b<a} (I - S_b) if λ_ai < 1
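
A matrix-free illustration of these formulas, reusing the same invented BIBD as above: the initial working vector for the within-blocks stratum is obtained by pivoting out the mean and the blocks, and the treatment fitted values then follow from one sweep with the intra-block efficiency factor λ = 3/4 (hard-coded here purely for illustration, since it is known for this design).

```python
import numpy as np

def term_means(v, factor):
    """Table of means of v for factor, spread back to the units (factor coded 0..k-1)."""
    return (np.bincount(factor, weights=v) / np.bincount(factor))[factor]

# The same invented BIBD: 3 treatments in 3 blocks of size 2
block = np.array([0, 0, 1, 1, 2, 2])
treat = np.array([0, 1, 0, 2, 1, 2])
rng = np.random.default_rng(1)
y = 1.0 * treat + rng.normal(0, 1.0, 3)[block] + rng.normal(0, 0.3, 6)

# Initial working vector for the within-blocks stratum: S_a ∏_{b<a}(I - S_b) y,
# with S_a = I for the units, so just pivot out the mean and then the blocks
v = y - y.mean()
v = v - term_means(v, block)

# Fitted values for treatments in this stratum: (1/λ) T_treat applied to the working vector
lam = 0.75                      # intra-block efficiency factor of this BIBD
fitted = term_means(v, treat) / lam
print("within-blocks treatment estimates:",
      np.round([fitted[treat == t][0] for t in range(3)], 3))
```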

Other issues..

• analysis of covariance

• analyse the response (y) variate and the covariates

• calculate the covariate regression coefficients (regression of y residuals on covariate residuals)

• adjust treatment estimates and sums of squares

• combination of information

• form treatment (and covariate) estimates combining information from all the strata where each is estimated

• weighted combination of estimates with general balance

• estimate stratum variances ξ_a to calculate weights

• see Payne & Tobias (1992, Scand.J.Stats.)
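
Below is a minimal sketch of the covariance-adjustment step for a fully orthogonal, invented example; with non-orthogonality the same regression of y residuals on covariate residuals applies stratum by stratum.

```python
import numpy as np

def term_means(v, factor):
    """Table of means of v for factor, spread back to the units (factor coded 0..k-1)."""
    return (np.bincount(factor, weights=v) / np.bincount(factor))[factor]

rng = np.random.default_rng(3)
blocks = np.repeat(np.arange(4), 3)
treats = np.tile(np.arange(3), 4)
x = rng.normal(0, 1, 12)                                   # covariate
y = 10 + 0.5 * treats + 2.0 * x + rng.normal(0, 2, 4)[blocks] + rng.normal(0, 0.5, 12)

def residuals(v):
    """Sweep out the mean, blocks and treatments (all λ = 1 in this orthogonal design)."""
    u = v - v.mean()
    u = u - term_means(u, blocks)
    return u - term_means(u, treats)

# Covariate regression coefficient: regress the y residuals on the covariate residuals
ry, rx = residuals(y), residuals(x)
beta = float(rx @ ry / (rx @ rx))

# Adjusted treatment estimates: analyse y - beta * x in the usual way
adj = y - beta * x
t_eff = np.bincount(treats, weights=adj - adj.mean()) / np.bincount(treats)
print("covariate coefficient:     ", round(beta, 3))
print("adjusted treatment effects:", np.round(t_eff, 3))
```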

Workspace requirements..

• sweep algorithm

• working vector: N

• vectors for effects of each term: n_a or n_i

• analysis of covariance - symmetric matrices: n_cov×(n_cov+1)/2, n_i×(n_i+1)/2

• cf. multiple-regression-style algorithms (including REML), which typically require

• matrix: n_effects×(n_effects+1)/2, where n_effects is the total number of block and treatment effects excluding the residual

• vector(s): N

• so the sweep algorithm is much more efficient for large models

• (see Payne & Welham 1990, COMPSTAT)
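
Rough arithmetic for a hypothetical design (all numbers invented) shows why the sweep workspace stays close to N while a regression-style algorithm needs a symmetric matrix over all the effects.

```python
# Rough workspace comparison for a hypothetical design (all numbers invented):
# N = 10,000 units; block terms with 10 and 500 effects; treatment terms with 4, 25 and 100
N = 10_000
term_sizes = [10, 500, 4, 25, 100]

sweep_cells = N + sum(term_sizes)                    # working vector + one vector per term
n_eff = sum(term_sizes)                              # block + treatment effects (excl. residual)
regression_cells = n_eff * (n_eff + 1) // 2 + N      # symmetric matrix + a data-length vector

print("sweep algorithm:       ", sweep_cells, "cells")
print("regression-style/REML: ", regression_cells, "cells")
```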

Large data sets..

• data may be unbalanced

• take a balanced sample..?

• adapt the algorithm..?

• Wilkinson (1970, Biometrika)

• “general recursive algorithm”

• requires as many sweep sequences for each term i in stratum a as Ť_i Š_a Ť_i has distinct eigenvalues

• Iterative algorithms

• Hemmerle (1974, JASA)

• Worthington (1975, Biometrika)

Large data sets..

• Hemmerle (1974, JASA)

• also uses sweep-type operations

• does not require first-order balance

• instead performs a sequence of “balanced sweeps” for each term until estimation converges

• but

• only one error term

• fits whole model at once (so a sequence of increasingly large models is required to assess individual terms)

• data must be “connected” (i.e. no aliasing)

• does not provide s.e.d.s (standard errors of differences)

Large data sets..

• Worthington (1975, Biometrika)

• performs sequence of operations analogous to projections and sweeps

• assumes additional (unspecified) algorithm to determine the strata (and their projectors)

• assumes an orthogonal (equally replicated) block structure

• and assumes equally replicated treatment combinations

• again fits whole model at once

Generalizing the algorithms..

• both algorithms are based on the result

• (I - M)⁻¹ = I + M + M² + ... (when this converges)

• apply this to the general form

• T Š_a T τ̂_a = T Š_a y

• T S_a ∏_{b<a} (I - S_b) T τ̂_a = T S_a ∏_{b<a} (I - S_b) y

• τ̂_a = ( I + Σ_m ( I - T S_a ∏_{b<a} (I - S_b) T )^m ) T S_a ∏_{b<a} (I - S_b) y

• τ̂_a^(1) = T S_a ∏_{b<a} (I - S_b) y

• τ̂_a^(m+1) = τ̂_a^(m) + ( I - T S_a ∏_{b<a} (I - S_b) T )^m T S_a ∏_{b<a} (I - S_b) y

• relatively straightforward algorithm

• sequences of sweeps & pivots with efficiency 1
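
The series expansion can be checked numerically. The sketch below (the same invented BIBD) iterates τ̂_a^(m+1) = τ̂_a^(m) + (I - T S_a ∏(I - S_b) T)^m T S_a ∏(I - S_b) y for the within-blocks stratum and agrees with the direct general-balance solution (1/λ) Ť Š_a y.

```python
import numpy as np

# The same invented BIBD; estimate treatments in the within-blocks stratum by the series
block = np.array([0, 0, 1, 1, 2, 2])
treat = np.array([0, 1, 0, 2, 1, 2])
n = block.size
rng = np.random.default_rng(1)
y = 1.0 * treat + rng.normal(0, 1.0, 3)[block] + rng.normal(0, 0.3, n)

Z    = lambda f: (f[:, None] == np.unique(f)[None, :]).astype(float)
proj = lambda Zm: Zm @ np.linalg.pinv(Zm.T @ Zm) @ Zm.T
S_mean, S_block, T_treat = proj(np.ones((n, 1))), proj(Z(block)), proj(Z(treat))

P = (np.eye(n) - S_block) @ (np.eye(n) - S_mean)   # S_a ∏_{b<a}(I - S_b), with S_a = I
d = T_treat @ P @ y                                # τ̂_a^(1) = T S_a ∏(I - S_b) y
tau = d.copy()
M = np.eye(n) - T_treat @ P @ T_treat
for _ in range(50):                                # τ̂_a^(m+1) = τ̂_a^(m) + M^m d
    d = M @ d
    tau = tau + d

# Direct general-balance answer for comparison: (1/λ) Ť Š_a y with λ = 3/4 for this design
direct = (1 / 0.75) * (T_treat - S_mean) @ (np.eye(n) - S_block) @ y
print("iterative:", np.round(tau, 4))
print("direct:   ", np.round(direct, 4))
```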

Generalizing the algorithms..

• iterative scheme

• τ̂_a^(1) = T S_a ∏_{b<a} (I - S_b) y

• τ̂_a^(m+1) = τ̂_a^(m) + ( I - T S_a ∏_{b<a} (I - S_b) T )^m T S_a ∏_{b<a} (I - S_b) y

• implementation

• τ̂_a^(m+1) = τ̂_a^(m) + d^(m)

• d^(m) = ( I - T S_a ∏_{b<a} (I - S_b) T )^m T S_a ∏_{b<a} (I - S_b) y

= ( I - T S_a ∏_{b<a} (I - S_b) T ) d^(m-1)

• calculation of d(m) ..

• project d^(m-1) into the treatment space: T d^(m-1)

• project into stratum a: S_a ∏_{b<a} (I - S_b) T d^(m-1)

• project into the treatment space: T S_a ∏_{b<a} (I - S_b) T d^(m-1)

• subtract the result from d^(m-1): (I - T S_a ∏_{b<a} (I - S_b) T) d^(m-1)
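
Finally, a sketch of the same iteration written purely in terms of these projection steps (tables of means, no matrices), again for the invented BIBD used above.

```python
import numpy as np

def term_means(v, factor):
    """Table of means of v for factor, spread back to the units (factor coded 0..k-1)."""
    return (np.bincount(factor, weights=v) / np.bincount(factor))[factor]

# The same invented BIBD; within-blocks stratum, everything via projection steps only
block = np.array([0, 0, 1, 1, 2, 2])
treat = np.array([0, 1, 0, 2, 1, 2])
rng = np.random.default_rng(1)
y = 1.0 * treat + rng.normal(0, 1.0, 3)[block] + rng.normal(0, 0.3, 6)

def into_stratum(v):
    """S_a ∏_{b<a}(I - S_b) v for the within-blocks stratum (S_a = I): pivot out mean, blocks."""
    v = v - v.mean()
    return v - term_means(v, block)

def into_treatments(v):
    """T v: project into the treatment space (here a single treatment factor)."""
    return term_means(v, treat)

d = into_treatments(into_stratum(y))      # d^(0) = T S_a ∏(I - S_b) y, also τ̂_a^(1)
tau = d.copy()
for _ in range(50):
    # d^(m) = (I - T S_a ∏(I - S_b) T) d^(m-1), built from the four steps above
    d = d - into_treatments(into_stratum(into_treatments(d)))
    tau = tau + d                         # τ̂_a^(m+1) = τ̂_a^(m) + d^(m)

print("treatment estimates:", np.round([tau[treat == t][0] for t in range(3)], 4))
```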