Normal Probability Plot
Shibdas Bandyopadhyay
Indian Statistical Institute
Abstract
Normal probability plots are used to check graphically the normality assumption for data from a univariate population that are mutually independent and identically distributed. The normal probability plot is a common option in most statistical packages. In the context of design of experiments or regression, though the observations are assumed to be mutually independent and homoscedastic, they have different unknown expectations, so the raw data are inappropriate for a normality check. To overcome the problem of unequal expectations, it is common to use the residuals of a fitted regression model. The residuals have zero expectation, but they are heteroscedastic and mutually dependent; it is thus also inappropriate to use the residuals for a normality check. In this study, mutually independent homoscedastic components with zero mean are extracted from the residuals through principal component analysis; these are then used for the normal probability plot. The technique is illustrated with data.
Key words and phrases: Normal probability plot, principal component analysis.
AMS (1991) subject classification: 62P.
1. Introduction
Let Y_1, Y_2, …, Y_n be mutually independent with common mean μ and standard deviation σ. To check graphically whether the data are from a common normal distribution, one plots Y_(i), the ith order statistic of Y_1, Y_2, …, Y_n, against Φ⁻¹(p_i), i = 1, 2, …, n; if the line plot is nearly linear, one is satisfied with the normality assumption. In the plot, σ happens to be the slope of the straight line of Y_(i) on Φ⁻¹(p_i); the p_i's are chosen to estimate σ 'efficiently' (David and Nagaraja, 2003). The p_i's currently used in statistical packages such as Minitab (Blom, 1958) are:

p_i = (i − 3/8)/(n + 1/4), i = 1, 2, …, n.    (1.1)

The line plot of Y_(i) on Φ⁻¹(p_i) is called the normal probability plot.
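As a minimal sketch, the coordinates of a normal probability plot with Blom's plotting positions (1.1) can be computed as follows; the data values are illustrative, not from the paper, and the standard-library NormalDist supplies Φ⁻¹.

```python
# Coordinates of a normal probability plot with Blom's plotting positions.
from statistics import NormalDist

def blom_positions(n):
    """p_i = (i - 3/8) / (n + 1/4), i = 1, ..., n, as in (1.1)."""
    return [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]

y = [4.1, 5.0, 3.2, 6.3, 5.5, 4.8, 5.9, 4.4]   # illustrative data
y_ordered = sorted(y)                           # ordered statistics Y_(i)
quantiles = [NormalDist().inv_cdf(p) for p in blom_positions(len(y))]
# Plotting y_ordered against quantiles gives the normal probability plot;
# an approximately linear plot supports the normality assumption, and the
# slope of the fitted line estimates sigma.
```
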
When testing hypotheses about μ, it is natural to check the normality assumption using a normal probability plot. The use of normal probability plots to check the normality assumption has become common in other situations as well. In this study, we consider the use of the normal probability plot to check the normality assumption for the response in the context of regression and design of experiments.
Consider the standard linear regression model:
Y = Xβ + ε    (1.2)

where Y is an n × 1 response vector, X is an n × p design matrix of rank r ≤ p, β is a p × 1 vector of unknown parameters, and ε is an n × 1 unobservable vector of error components; the error components are assumed to be mutually independent and identically distributed with zero mean and standard deviation σ.
Though the n components of Y are independently distributed with common standard deviation σ, they do not have a common mean. The ith component Y_i of Y has mean μ_i = X_i β, where X_i is the ith row of X, i = 1, 2, …, n. So a line plot of Y_(i) on Φ⁻¹(p_i) is not meaningful for checking the normality of the Y_i's. It has become standard practice, as in Minitab, to work with e, the n × 1 vector of residuals:

e = Y − Xβ̂    (1.3)

and make a line plot of e_(i), the ith ordered component of e, on Φ⁻¹(p_i), with the p_i's given by (1.1).
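For a concrete (hypothetical) example of forming the residuals (1.3), consider a simple linear regression with an intercept and one regressor; the closed-form least-squares estimates are used, and the data are made up for illustration.

```python
# Residuals e = Y - X(beta-hat) for a simple linear regression.
x = [1.0, 2.0, 3.0, 4.0, 5.0]        # illustrative regressor
y = [2.1, 3.9, 6.2, 8.1, 9.8]        # illustrative response
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar                 # least-squares estimates
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals
# By construction the residuals are orthogonal to both columns of X
# (the ones vector and x), so they sum to zero; they are nevertheless
# correlated and heteroscedastic, which is the issue taken up in Section 2.
```
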
We use match factory data (Roy et al., 1959) for illustration. The data are the scores of n = 25 workers on three psychological tests U_1, U_2, U_3, together with their efficiency index Y.
The components of e after fitting the regression

Y = β_0 + β_1 U_1 + β_2 U_2 + β_3 U_3    (1.4)

(with X_1 = 1, X_2 = U_1, X_3 = U_2, X_4 = U_3), written as a 1 × 25 vector, are:

( 3.33  −0.18  −0.88  −3.62  −5.16  −2.24  0.92  3.42  −0.22  −0.52
 −1.61  −1.37  −1.27  1.31  0.12  1.16  2.17  0.66  0.88  −3.07
  0.055  −2.28  0.69  3.87  3.84).
Fig. 1 is a line plot of e_(i) on Φ⁻¹(p_i), with p_i = (i − 3/8)/25.25, i = 1, 2, …, 25.
Fig. 1: Normal probability plot with regression residuals
But this line plot of e_(i) on Φ⁻¹(p_i), with p_i = (i − 3/8)/(n + 1/4), i = 1, 2, …, n, is not appropriate for checking the normality of the Y_i's. It is true that, when the mutually independent Y_i's are normally distributed with mean μ_i = X_i β and common standard deviation σ, the e_i's are distributed as normal with zero mean; but their standard deviations are different multiples (depending on X) of σ. Moreover, the e_i's are not mutually independent. So one needs a modification (Hocking, 2003).
This study suggests a natural modification: extracting independent and identically distributed normal components from e = Y − Xβ̂ using principal component analysis. The suggested modification cannot be carried out with statistical tables and a calculator; it is computer intensive. One needs a principal component analysis module, which is available in most statistical packages (e.g., Eigen Analysis in Minitab).
2. Extraction of independent and identically distributed components using principal component analysis
Consider the regression model Y = Xβ + ε along with the assumptions stated after (1.2). For β̂ = (X'X)^- X'Y, one may write e as

e = Y − Xβ̂ = (I − X(X'X)^- X')Y ≡ HY,    (2.1)

where (X'X)^- is a g-inverse of X'X and H = I − X(X'X)^- X'. It follows that e has a singular normal distribution, the mass of the joint density of the n components of e lying in (n − r) dimensions, with zero mean and covariance matrix σ²H, where rank(H) = n − r.
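The properties of H claimed here can be checked numerically on a small, made-up design matrix (an intercept plus one regressor, so r = 2): H is symmetric, idempotent (HH = H), and trace(H) = rank(H) = n − r. A pure-Python sketch:

```python
# Numerical check of the hat-complement matrix H = I - X(X'X)^{-1}X'.
def matmul(A, B):
    return [[sum(a[k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for a in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]            # illustrative regressor
n = len(x)
X = [[1.0, xi] for xi in x]               # full-rank design, r = 2
XtX = matmul(transpose(X), X)
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
XtX_inv = [[XtX[1][1] / det, -XtX[0][1] / det],
           [-XtX[1][0] / det, XtX[0][0] / det]]   # inverse of 2x2 X'X
P_hat = matmul(matmul(X, XtX_inv), transpose(X))  # hat matrix X(X'X)^{-1}X'
H = [[(1.0 if i == j else 0.0) - P_hat[i][j] for j in range(n)]
     for i in range(n)]
HH = matmul(H, H)                         # idempotence: HH should equal H
trace = sum(H[i][i] for i in range(n))    # should equal n - r = 3
```
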
Since H is symmetric and idempotent of rank (n − r), the characteristic roots of H are 1 with multiplicity (n − r) and 0 with multiplicity r. Using the spectral decomposition of H, we may write

H = PΛP',  PP' = P'P = I,

where Λ = diag(I_(n−r), 0).
P is a non-stochastic orthogonal matrix and depends only on the design matrix X. P'e has a singular normal distribution, the mass of the joint density of the n components of P'e lying in (n − r) dimensions, with zero mean and covariance matrix σ²Λ. Thus, if we write P = (P_1  P_2), where P_1 consists of the first (n − r) columns of P (the characteristic vectors corresponding to the (n − r) non-zero characteristic roots of H) and P_2 consists of the remaining r columns of P, then the (n − r) components of P_1'e are independent and identically distributed normal with zero mean and standard deviation σ, while the remaining r components P_2'e are identically zero (zero mean and zero variance).
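As an illustrative sketch of the extraction, take the simplest special case, an intercept-only model (X a column of ones, r = 1): then H = I − (1/n)J, and the orthonormal Helmert contrasts are one explicit choice of the columns of P_1, so the (n − 1) components of P_1'e are i.i.d. N(0, σ²) under normality. The data below are made up.

```python
# Extraction of (n-1) i.i.d. components from intercept-only residuals
# using Helmert contrasts as the rows of P_1'.
import math

def helmert_rows(n):
    """Orthonormal contrasts orthogonal to the ones vector (rows of P_1')."""
    rows = []
    for k in range(1, n):
        c = 1.0 / math.sqrt(k * (k + 1))
        row = [c] * k + [-k * c] + [0.0] * (n - 1 - k)
        rows.append(row)
    return rows

y = [3.1, 2.7, 4.0, 3.6, 2.9, 3.3]        # illustrative data
n = len(y)
ybar = sum(y) / n
e = [yi - ybar for yi in y]                # residuals e = Hy
z = [sum(r[j] * e[j] for j in range(n)) for r in helmert_rows(n)]
# z holds the n - 1 = 5 components of P_1'e; these, not the raw
# residuals, go into the normal probability plot of Section 2.
```

Since the Helmert rows form an orthonormal basis of the space orthogonal to the ones vector, the sum of squares of z equals the residual sum of squares.
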
For the match factory data, (P'e)' is 1 × 25:

(P'e)' = ( 1.84  −0.36  1.30  1.40  0.59  −4.58
          −2.55  −1.93  0.26  −1.56  0.82  2.83
           0.68  2.77  3.97  −2.59  −3.63  2.11
           5.54  −0.86  −0.061  0  0  0  0).    (2.2)
Notice that each of the last four components of P'e, namely P_2'e, is 0, as it should be.
Fig. 2 is a line plot of the ith ordered statistic of the 21 components of P_1'e on Φ⁻¹(p_i), with p_i = (i − 3/8)/21.25 (since r = p = 4, n − r = 21), i = 1, 2, …, 21.
Fig. 2: Normal probability plot with principal components of the regression residuals
We do not wish to compare the two figures. We only want to point out that the suggested analysis with principal components is an appropriate method and is not difficult to implement in any package that has an eigen analysis module.
References
Blom, G. (1958). Statistical Estimates and Transformed Beta-Variables. Wiley, New York.
David, H. A. and Nagaraja, H. N. (2003). Order Statistics. Wiley-Interscience.
Hocking, R. R. (2003). Methods and Applications of Linear Models. Wiley-Interscience.
Roy, J., Chakravarty, I. M. and Laha, R. G. (1959). Handbook of Methods of Applied Statistics, Vol. 1. John Wiley & Sons, Inc.