Analysis of variance, general balance and large data sets
Roger Payne
Statistics Department, IACR-Rothamsted,
Harpenden, Herts AL5 2JQ
Email:
General balance..
• is a very useful concept for small to medium sized data sets
• it caters for several sources of variation
• it leads to very efficient algorithms
• & to clear output, test statistics etc
• but what about large to very large data sets...?
General balance..
• fits a mixed model with several error (or block) terms
• total sum of squares partitioned into components known as strata, one for each block term: each stratum contains
• the sum of squares for the treatment terms estimated between the units of that stratum
• a residual representing the random variability of those units
General balance: properties..
• block (error) terms mutually orthogonal
• treatment terms mutually orthogonal
• contrasts of each treatment term all have equal efficiency factors in each of the strata where they are estimated
General balance: theory..
• mixed model:
• y = Sa Za ea + X q
• dispersion structure:
• Var(y) = V = Sa xa Ša
• Ša known symmetric matrices with
• Ša Šb = dab Ša (i.e. orthogonal)
• Sa Ša = I
General balance: theory..
• random effects model:
• y - E(y) = Sa Za ea
• where E(ea) = 0 and Var(ea) = sa2 I
• then
• Var(y) = Sa sa2 Za Za¢
• and if terms orthogonal
• Sa = Za (Za¢ Za )- Za¢ = (1 / na) Za Za¢ (if equal rep.)
• Sa is the projection operator (form and project means)
• Ša = Sa ( I - S{b: term b marginal to term a} Šb )
• xa = na sa2 + S{b: term b marginal to term a} nb sb 2
General balance: theory..
• treatment structure
• E(y) = X q = Si Xi qi (X = [ X0 | X1 | X2 | ...] )
• E(y) = t = T t (T = X ( X¢ X )- X )
• treatment terms are orthogonal, so
• E(y) = Si Ťi ti
• Ti = Xi (Xi¢ Xi )- Xi¢ = (1 / ni) Xi Xi¢ (if equal rep.)
• Ťi = Ti ( I - S{j: term j marginal to term i} Ťj )
General balance: theory..
• orthogonal block structure implies
• independent least squares analysis within each stratum
• residual s.s. (Ša y - Ša T t )¢ (Ša y - Ša T t )
• normal equations T Ša T t^a = T Ša y
• final condition
• Ťi Ša Ťj = dij lai Ťi
• normal equations now Si lai Ťi t^ai = Si Ťi Ša y
• solved by t^ai = (1/lai) Ťi Ša y
• with var-cov matrix xa Ťi / lai
• lai is efficiency factor of term i and eigenvalue of ŤiŠaŤi
Analysis by “sweeps”..
• requires a first-order balance
• all effects of each model term have an equal efficiency factor, at each point where the term is estimated
• (see Wilkinson 1970, Biometrika; Payne & Wilkinson 1977, Applied Statistics)
• similar to general balance, but that has
• block (i.e. error) terms mutually orthogonal (note: always true if nested)
• treatment terms mutually orthogonal
• (see Payne & Tobias 1992, Scandinavian J. Stats.)
Analysis by “sweeps”..
• requires a working vector v which
• initially contains the data values
• finally contains the residuals
• terms fitted sequentially: sweeps
• estimate and remove effects of a term i in stratum a by
v(r+1) = { I - ( 1 / lai) Ti } v(r)
• and are then followed by a repeat of the sweeps up to this point (a reanalysis sequence)
• can omit reanalysis sweeps of terms orthogonal to term i (so none if lai ¹ 1, & much simpler if general balance)
• notice
• projection operator Ti simply calculates tables of means
• so no matrix inversion (unless there are covariates)
• see Payne & Wilkinson (1977, Appl.Statist.)
Analysis by “sweeps”..
• with general balance: initial working vector for stratum a calculated by
• Sa Õba (I - Sb ) y
• Sa is a pivot (calculate means, and insert into vector)
• and fitted values for treatment term i in stratum a calculated by
• (1/lai)Ti Õj: laj>0; ji {Rj (I-(1/laj)Tj)} Sa Õba(I - Sb )y
• Ri = I if lai = 1
• Ri = Sa Õba(I - Sb )y if lai < 1
Other issues..
• analysis of covariance
• analyse the response (y) variate and the covariates
• calculate the covariate regression coefficients (regression of y residuals on covariate residuals)
• adjust treatment estimates and sums of squares
• combination of information
• form treatment (and covariate) estimates combining information from all the strata where each is estimated
• weighted combination of estimates with general balance
• estimate stratum variances xa to calculate weights
• see Payne & Tobias (1992, Scand.J.Stats.)
Workspace requirements..
• sweep algorithm
• working vector: N
• vectors for effects of each term: na or ni
• analysis of covariance - symmetric matrices: ncov´(ncov+1)/2, ni ´(ni+1)/2
• c.f. multiple-regression style algorithms (including REML) which typically require
• matrix: neffects´(neffects+1)/2 where neffects is total no block & treatment effects excluding residual
• vector(s): N
• much more efficient for large models
• (see Payne & Welham 1990, COMPSTAT)
Large data sets..
• data may be unbalanced
• take a balanced sample..?
• adapt the algorithm..?
• Wilkinson (1970, Biometrika)
• “general recursive algorithm”
• requires as many sweep sequences for each term i in stratum a as ŤiŠaŤi has eigenvalues
• Iterative algorithms
• Hemmerle (1974, JASA)
• Worthington (1975, Biometrika)
Large data sets..
• Hemmerle (1974, JASA)
• also uses sweep-type operations
• does not require first-order balance
• instead performs a sequence of “balanced sweeps” for each term until estimation converges
• but
• only one error term
• fits whole model at once (so a sequence of increasingly large models is required to assess individual terms)
• data must be “connected” (i.e. no aliasing)
• does not provide sed’s
Large data sets..
• Worthington (1975, Biometrika)
• performs sequence of operations analogous to projections and sweeps
• assumes additional (unspecified) algorithm to determine the strata (and their projectors)
• assumes orthogonal (equal replicated) block structure
• and assumes equally replicated treatment combinations
• again fits whole model at once
Generalizing the algorithms..
• both algorithms are based on the result
• (I - M)-1 = I + M + M2 + ... (when this converges)
• apply this to the general form
• T Ša T t^a = T Ša y
• T Sa Õba (I - Sb ) T t^a = T Sa Õba (I - Sb) y
• t^a = (I + Sm(I - TSa Õba(I-Sb )T)m) TSa Õba(I-Sb) y
• t^a (1) = TSa Õba(I - Sb ) y
• t^a(m+1) = t^a(m) + (I-TSa Õba(I-Sb )T)m TSa Õba(I-Sb)y
• relatively straightforward algorithm
• sequences of sweeps & pivots with efficiency 1
Generalizing the algorithms..
• iterative scheme
• t^a (1) = TSa Õba(I-Sb) y
• t^a(m+1) = t^a(m) + (I-TSa Õba(I-Sb )T)mTSa Õba(I-Sb )y
• implementation
• t^a(m+1) = t^a(m) + d(m)
• d(m) = (I - T Sa Õba(I - Sb) T)m TSa Õba(I - Sb ) y
= (I - T Sa Õba(I - Sb) T) d(m-1)
• calculation of d(m) ..
• project d(m-1) into the treatment space T d(m-1)
• project into stratum a SaÕba(I-Sb)T d(m-1)
• project into treatment space TSaÕba(I-Sb)T d(m-1)
• subtract result from d(m-1) (I-TSaÕba(I-Sb)T) d(m-1)