Do data belong to the Normal Distribution?
As you have probably figured out by now, the normal distribution plays a major role in many types of probabilistic and statistical analyses. Some statistical procedures depend heavily on the assumption of normality, and when this assumption is questionable for a given data set, these procedures should be avoided. It is therefore useful to have techniques available that can check the validity of the normality assumption. This is the objective of this short note.
The Normal Probability Plot
The following procedure helps conclude qualitatively that a sample was drawn from a normal distribution. Here is a summary of the procedure:
- Place the values in the data set (X) into an ordered array. Call the smallest value in the ordered set X1 and the largest value Xn. Then the set becomes X1, X2… Xn.
- Calculate Fx/(n+1), the cumulative relative frequency for each value Xi, where Fx is the cumulative frequency of Xi. From the chart of the standard normal distribution or from Excel (using “=normsinv(Fx/(n+1))”), find the corresponding standard normal value of Z for each point in the ordered data set. In doing so we hypothesize the data set was drawn from a normal distribution with some mean and standard deviation.
- Plot the pairs of points (Z, X) using the observed data values (Xi) on the vertical axis, and the associated Zi values on the horizontal axis.
- Inspect the points plotted for evidence of linearity (i.e. a straight line).
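The steps above can be sketched in Python as follows. This is only an illustrative sketch, not part of the original Excel workflow: it assumes all sample values are distinct (so Fx = i for the i-th ordered value), the function name probability_plot_points is hypothetical, and NormalDist().inv_cdf from the standard library plays the role of Excel's normsinv.

```python
from statistics import NormalDist

def probability_plot_points(sample):
    """Return (Z, X) pairs for a normal probability plot.

    Assumes all values are distinct, so the cumulative frequency of the
    i-th ordered value is simply i.
    """
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(ordered, start=1):
        p = i / (n + 1)            # cumulative relative frequency Fx/(n+1)
        z = std_normal.inv_cdf(p)  # same role as Excel's normsinv(p)
        points.append((z, x))
    return points

# Hypothetical data: 19 distinct values, used only to show the Z values.
demo_points = probability_plot_points(range(1, 20))
print([round(z, 3) for z, _ in demo_points[:2]])
```

The returned pairs can be passed to any plotting tool, with Z on the horizontal axis and X on the vertical axis.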
Explanation:
The Z-score for any value of X is Z = (X – μ)/σ, so there is a linear relationship between X and Z, that is, X = σZ + μ. The empirical probability of observing a value no larger than X in the sample is Fx/(n+1) (where Fx is the cumulative frequency of X). If X is indeed normally distributed, the Z value obtained from the normal table for this probability should be close to the Z-score of that X, and therefore linearly related to X. So if the plot of X against the table-based Z values is approximately linear, the data are consistent with a normal distribution.
Example1:
Suppose we wish to obtain the first and the second standard normal ordered values (Z1 and Z2) for a sample of 19 observations (each observation is different in value).
Obtaining Z1: Since Fx = 1, P(Z<Z1) = 1/(19+1) = 1/20 = .05. Under the standard normal distribution Z1 = -1.645 (note P(Z<-1.645) = .05).
Obtaining Z2: Since Fx = 2, P(Z<Z2) = 2/(19+1) = 2/20 = .10. Under the standard normal distribution Z2 = -1.282 (note P(Z<-1.282) = .10).
In a similar manner we complete the rest of the Zi values. Now the pairs (Zi, Xi) are plotted, and if they are found to lie (approximately) along a straight line we can reasonably conclude that the data are consistent with a normal distribution.
To determine whether or not there is linear relationship between X and Z we can test the correlation between them as follows:
H0: The data come from a normal distribution
H1: The data do not come from a normal distribution.
Calculate the test statistic (R) as the sample correlation coefficient between X and Z. Compare R to a critical value Rcr from a table of critical values (provided in Appendix 2; the table was constructed from simulation results). Rcr depends on the sample size and the significance level selected for the test. If R < Rcr there is sufficient evidence to reject H0 and conclude that the data are not normal at the alpha level of significance.
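The correlation test can be sketched as follows. This is a minimal illustration under stated assumptions: all sample values are distinct, the function names are hypothetical, the two samples are synthetic (one exactly linear in Z, one strongly right-skewed), and the critical value .9479 is the n = 19, alpha = .05 entry of the critical-value table.

```python
from math import exp, sqrt
from statistics import NormalDist

def normal_scores(n):
    """Z values for cumulative relative frequencies i/(n+1), i = 1..n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_normality_test(sample, r_critical):
    """Return (R, reject_H0), rejecting normality when R < Rcr."""
    x = sorted(sample)
    z = normal_scores(len(x))
    r = pearson_r(x, z)
    return r, r < r_critical

R_CRIT_N19 = 0.9479  # table value for n = 19, alpha = .05

z19 = normal_scores(19)
linear_sample = [50 + 10 * z for z in z19]  # synthetic, exactly linear in Z
skewed_sample = [exp(3 * z) for z in z19]   # synthetic, strongly right-skewed

r_linear, reject_linear = correlation_normality_test(linear_sample, R_CRIT_N19)
r_skewed, reject_skewed = correlation_normality_test(skewed_sample, R_CRIT_N19)
print(round(r_linear, 4), reject_linear)
print(round(r_skewed, 4), reject_skewed)
```

Note that the decision rule is one-sided: only a correlation that is too low counts as evidence against normality.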
Important Comment: If the Xi and Zi appear to form a linear relationship, then the line intercept estimates the population mean (μ), and the line slope estimates the standard deviation (σ).
Example 2
Test scores of 19 students in each of two classes were drawn. Some of the sorted scores are shown below along with the calculated cumulative proportion from the sample (Fx/(n+1)) and with the resulting Z values. Details can be found in the file Assess Normal.
Partial set:
Order / Class I / Class II / Prob / Z value
1 / 48 / 47 / 0.05 / -1.64485
2 / 52 / 54 / 0.1 / -1.28155
3 / 55 / 58 / 0.15 / -1.03643
4 / 57 / 61 / 0.2 / -0.84162
After the Z values were derived, the following two graphs were plotted.
Conclusion: In class I scores were produced from a normal distribution.
From the graph it seems μ = 65 and σ = (83 – 47)/(1.645 – (-1.645)) = 10.94
Now observe the probability plot for class II
The result is unclear. Although it seems there is some curvature in the line the “non-normality” does not appear to be too severe. Since the sample size is only 19 one should not judge the distribution to be non-normal. Let us proceed by testing the correlation as explained above (we’ll run the correlation test for the two classes):
H0: The data come from a normal distribution
H1: The data do not come from a normal distribution.
The test statistic calculated with Excel for Class I: R = .999
The test statistic calculated with Excel for Class II: R = .959
The critical value for n = 19 and alpha = .05 is .9479
There is insufficient evidence to reject the normal distribution at the 5% significance level for both classes (since .999 > .9479 and .959 > .9479). To estimate μ and σ we run a linear regression to construct the best-fit line, which results in the equation
X = 11.026Z + 70.894. So μ ≅ 70.7 and σ ≅ 10.89 (see the Excel file).
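A least-squares fit of X on Z, whose slope and intercept serve as estimates of the standard deviation and the mean, might look like the sketch below. The function name fit_line is hypothetical, and the data are synthetic (built to lie exactly on the line X = 10Z + 70), not the class scores from the Excel file.

```python
from statistics import NormalDist

def fit_line(z, x):
    """Ordinary least-squares fit of x = slope * z + intercept."""
    n = len(z)
    z_bar = sum(z) / n
    x_bar = sum(x) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szx = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
    slope = szx / szz
    intercept = x_bar - slope * z_bar
    return slope, intercept

# Synthetic illustration: 19 normal scores and values exactly on X = 10Z + 70.
std_normal = NormalDist()
z_values = [std_normal.inv_cdf(i / 20) for i in range(1, 20)]
x_values = [10 * z + 70 for z in z_values]

sigma_hat, mu_hat = fit_line(z_values, x_values)
print(round(sigma_hat, 3), round(mu_hat, 3))
```

With real data the points scatter around the line, so the slope and intercept are only estimates of σ and μ.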
The following example demonstrates how to construct a probability plot when multiple same-values are present in the sample drawn (which did not occur in the previous example).
Example 3
To help make a decision about an expansion plan, the president of a music company needs to know how many CDs teenagers buy annually. Accordingly, he commissions a survey of 250 teens, in which they are asked to report how many CDs they purchased in the previous 12 months. Can we assume the number of CDs bought annually by a teenager is normally distributed?
Solution
The following table summarizes the data (see the file AssessNormal1 – the Probability Plot sheet):
X / f / Fx / Fx/(n+1) / Z
6 / 1 / 1 / 0.003984 / -2.65342
8 / 1 / 2 / 0.007968 / -2.41037
9 / 7 / 9 / 0.035857 / -1.80093
10 / 10 / 19 / 0.075697 / -1.43462
11 / 16 / 35 / 0.139442 / -1.08283
12 / 26 / 61 / 0.243028 / -0.6966
13 / 23 / 84 / 0.334661 / -0.42708
14 / 25 / 109 / 0.434263 / -0.16553
15 / 29 / 138 / 0.549801 / 0.125158
16 / 28 / 166 / 0.661355 / 0.416163
17 / 26 / 192 / 0.76494 / 0.722285
18 / 29 / 221 / 0.880478 / 1.17738
19 / 11 / 232 / 0.924303 / 1.434623
20 / 11 / 243 / 0.968127 / 1.853959
21 / 4 / 247 / 0.984064 / 2.146006
22 / 1 / 248 / 0.988048 / 2.258663
23 / 1 / 249 / 0.992032 / 2.410372
26 / 1 / 250 / 0.996016 / 2.653417
Explanations:
- The column ‘X’ represents the number of CDs purchased by a teenager annually.
- The column ‘f’ is the frequency of X (counts how many times each number appears in the sample). For example, the value 11 appears 16 times (16 teenagers purchased 11 CDs).
- The column ‘Fx’ calculates the cumulative frequency. For example, 10 or fewer CDs per person appear 19 times (1+1+7+10=19).
- The column ‘Fx/(n+1)’ calculates the empirical cumulative relative frequency. For example, F10/(250+1) = 19/251 = .075697.
- ‘Z’ is found by “normsinv” as before.
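The columns described above can be reproduced with a short Python sketch. The frequencies are copied from the table; the names cd_counts and probability_plot_table are hypothetical, and NormalDist().inv_cdf stands in for Excel's normsinv.

```python
from statistics import NormalDist

# Number of CDs purchased (X) mapped to its frequency (f), from the table.
cd_counts = {6: 1, 8: 1, 9: 7, 10: 10, 11: 16, 12: 26, 13: 23, 14: 25,
             15: 29, 16: 28, 17: 26, 18: 29, 19: 11, 20: 11, 21: 4,
             22: 1, 23: 1, 26: 1}

def probability_plot_table(freq):
    """Return rows (X, f, Fx, Fx/(n+1), Z) for a frequency table."""
    n = sum(freq.values())
    std_normal = NormalDist()
    rows = []
    cumulative = 0
    for x in sorted(freq):
        cumulative += freq[x]        # cumulative frequency Fx
        p = cumulative / (n + 1)     # empirical cumulative relative frequency
        rows.append((x, freq[x], cumulative, p, std_normal.inv_cdf(p)))
    return rows

table = probability_plot_table(cd_counts)
for x, f, fx, p, z in table[:4]:
    print(x, f, fx, round(p, 6), round(z, 5))
```

Each distinct X contributes one plotted point, positioned at the cumulative frequency of its last tied observation.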
Now we can draw the graph of Z against X.
Interpretation:
The graph raises some suspicion with regard to the normality of the CDs distribution, because the two ends are curved. Yet the amount of deviation from a straight line needs to be checked more formally. The correlation test used above yields the following results:
R = .990375; Rcr = .9943 (for n = 250, alpha = .05). Since R < Rcr, there is sufficient evidence to reject normality at the 5% level of significance.
In what follows we present a few hypothesis testing procedures designed to analytically test the normality of a data set.
The Goodness of Fit Chi Squared Test
Example 4
Re-solve example 3 using the goodness of fit Chi square test at 5% significance level.
Solution:
First, determine Z values that comply with the rule of 5 (the expected number of observations that fall in each interval should be at least 5). The following table demonstrates such a selection of Z values, and additional information:
Interval / Z range / Probability / Expected (Ei) / Actual (Fi)
1 / (z ≤ -2) / 0.02275 / 5.6875 / 2
2 / (-2 < z ≤ -1) / 0.135905 / 33.97625 / 33
3 / (-1 < z ≤ 0) / 0.341345 / 85.33625 / 74
4 / (0 < z ≤ 1) / 0.341345 / 85.33625 / 112
5 / (1 < z ≤ 2) / 0.135905 / 33.97625 / 26
6 / (z > 2) / 0.02275 / 5.6875 / 3
- Explanations:
Determine the probabilities for the ranges selected.
P(Z ≤ -2) = .02275;
P(-2 < Z ≤ -1) = .1359;
Comment: The Z values (-2, -1, 0, 1, 2) were selected such that when the interval probabilities are calculated the expected number of observations in each one (Ei) will be at least 5. See details below. A symmetrical selection of Z values is preferable.
- The expected values (Ei) are calculated as follows:
First interval: E1 = P(Z ≤ -2)(250) = 5.6875
Second interval: E2 = P(-2 < Z ≤ -1)(250) = 33.97625
…and so on…
- The actual frequency (Fi) counts the number of sample observations in each interval. Of course you first need to transform the observation values Xi to their corresponding Z-scores using the sample mean and sample standard deviation, Zi = (Xi – X̄)/s, and then count how many Z values belong to each interval. For example, in the interval Z ≤ -2 there are two Z-scores, so F1 = 2.
- Test the following hypothesis:
H0: The distribution is normal with μ = 14.98 and σ = 3.14
H1: The distribution is not the above
The test is performed using a Chi-square distribution. Use Ei and Fi to calculate the Chi-square statistic χ² = Σ (Fi – Ei)²/Ei, where the sum runs over the k intervals. For the table above, χ² = 15.39.
The decision rule is as follows: If χ² > χ²(α, k-1-L), reject H0 (where k is the number of intervals and L is the number of parameters estimated; since we estimate both μ and σ, L = 2).
Let the significance level be .05. This rule translates to a critical value of χ²(.05, 6-1-2) = 7.8147 (a value found in the chi-square table or by using the Excel function: =chiinv(.05,3)).
Conclusion: Since 15.39 > 7.8147, there is sufficient evidence to reject H0 at the 5% significance level. The distribution is not normal with μ = 14.98 and σ = 3.14.
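The whole calculation of Example 4 can be retraced with the sketch below. It is an illustration only: the frequencies come from the Example 3 table, the variable names are hypothetical, and the critical value 7.8147 is taken from the text rather than computed (the Python standard library has no chi-square inverse).

```python
from math import sqrt
from statistics import NormalDist

# Number of CDs purchased (X) mapped to its frequency (f), from Example 3.
cd_counts = {6: 1, 8: 1, 9: 7, 10: 10, 11: 16, 12: 26, 13: 23, 14: 25,
             15: 29, 16: 28, 17: 26, 18: 29, 19: 11, 20: 11, 21: 4,
             22: 1, 23: 1, 26: 1}

n = sum(cd_counts.values())
mean = sum(x * f for x, f in cd_counts.items()) / n
sd = sqrt(sum(f * (x - mean) ** 2 for x, f in cd_counts.items()) / (n - 1))

cuts = [-2, -1, 0, 1, 2]  # Z values separating the six intervals

def interval_index(z):
    """Index of the interval (a < z <= b) that contains z."""
    return sum(1 for c in cuts if z > c)

# Actual frequencies Fi: standardize each X and count per interval.
observed = [0] * (len(cuts) + 1)
for x, f in cd_counts.items():
    observed[interval_index((x - mean) / sd)] += f

# Expected frequencies Ei = n * P(interval) under the standard normal.
std_normal = NormalDist()
cdf = [0.0] + [std_normal.cdf(c) for c in cuts] + [1.0]
expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical = 7.8147  # chi-square critical value for alpha = .05, 3 d.f. (from the text)
reject_h0 = chi2 > critical
print(observed, round(chi2, 2), reject_h0)
```

The degrees of freedom are 6 - 1 - 2 = 3 because both the mean and the standard deviation are estimated from the sample.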
Anderson Darling Test
This is a very strong test that works well on small samples (even n ≤ 25). The test is performed on the ordered data set (X1 ≤ X2 … ≤ Xn). It applies to any distribution. Specifically for the normal case define:
A² = –n – (1/n) Σ [(2i–1)·ln Φ(zi) + (2(n–i)+1)·ln(1 – Φ(zi))], summed over i = 1, …, n.
Zi is calculated by Zi = (Xi – X̄)/s, where X̄ and s are the sample mean and standard deviation respectively. Also Φ(zi) = Pr(Z < zi) of the standard normal distribution.
Now calculate the statistic (A*)², the adjustment of A² to the sample size (especially important for small samples), by
(A*)² = A²(1 + .75/n + 2.25/n²).
If (A*)² > A²crit the hypothesis of normality is rejected. Below you can view a few critical values A²crit.
α / 0.1 / 0.05 / 0.025 / 0.01
A²crit / 0.631 / 0.752 / 0.873 / 1.035
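A minimal Python sketch of the Anderson Darling calculation for the normal case follows. It uses the same form of the sum as the worked calculation in Example 6 and the sample-size adjustment (1 + .75/n + 2.25/n²). The function name and the two synthetic samples (one near-normal, one strongly right-skewed) are assumptions made for illustration only.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(data):
    """Return (A2, adjusted A2) for a test of normality."""
    x = sorted(data)
    n = len(x)
    x_bar, s = mean(x), stdev(x)  # sample mean and standard deviation
    std_normal = NormalDist()
    total = 0.0
    for i, xi in enumerate(x, start=1):
        phi = std_normal.cdf((xi - x_bar) / s)  # Phi(z_i)
        total += (2 * i - 1) * log(phi) + (2 * (n - i) + 1) * log(1 - phi)
    a2 = -n - total / n
    a2_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    return a2, a2_star

A2_CRIT_05 = 0.752  # critical value for alpha = .05, from the table

# Synthetic samples of size 20, for illustration only.
base = [NormalDist().inv_cdf(i / 21) for i in range(1, 21)]  # near-normal
skewed = [exp(2 * z) for z in base]                          # right-skewed

a2_base, a2_star_base = anderson_darling_normal(base)
a2_skewed, a2_star_skewed = anderson_darling_normal(skewed)
print(round(a2_star_base, 3), a2_star_base > A2_CRIT_05)
print(round(a2_star_skewed, 3), a2_star_skewed > A2_CRIT_05)
```

Extreme observations can push Φ(zi) so close to 0 or 1 that a logarithm becomes undefined in floating point; a production version would need to guard against that.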
Example 6
For the data used in example 3 here is a summary of the calculations:
A² = –250 – (1/250)[(2(1)–1)·ln Φ(z1) + (2(250–1)+1)·ln(1 – Φ(z1)) +
(2(2)–1)·ln Φ(z2) + (2(250–2)+1)·ln(1 – Φ(z2)) + …] = 1.42
(A*)² = 1.42(1 + .75/250 + 2.25/250²) = 1.43
Find details in the file AssessNormal1- Anderson Darling CD example.
A2crit for 5% significance level = .752
Since 1.43 > .752 there is sufficient evidence at 5% significance level to reject the null hypothesis. The sample does not belong to a normal distribution.
The Lilliefors Test
This hypothesis test method is known to give very strong results for samples of size n2000. As in the normal plot approach, here too we calculate cumulative probabilities. Yet here we compare probabilities for a known normal distribution with their sample based empirical counterparts.
Here is a summary of the procedure:
- Determine the mean and standard deviation of the normal distribution under investigation. Set up the hypotheses:
H0: The distribution is normal with mean μ and standard deviation σ.
H1: The distribution is not normal.
- Place the values in the data set (X) into an ordered array.
- Find the corresponding standard normal Zi values for each point in the ordered data set using the hypothesized mean and standard deviation. That is
Zi = (Xi – μ)/σ
- Determine the cumulative normal probabilities F(Zi) = P(Z<Zi) for each Zi value found in the previous step.
- Determine the cumulative sample distribution S(Xi) = Fx/n for each point in the sample.
- Calculate the largest absolute difference (D) between F(*) and S(*).
D = max{|F(Z1)-S(X1)|, |F(Z2)-S(X2)|, …, |F(Zn)-S(Xn)|}
- Perform the test as follows: If D > Dcr, reject the null hypothesis. Otherwise, do not reject the null hypothesis. Dcr is a critical value determined by alpha and the sample size, and is provided by the Lilliefors table (see Appendix 1).
The Lilliefors method was applied to a data set of n = 2000, that can be found in AssessNormal1 – Lilliefors; all the calculations were performed in Excel.
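The D statistic described in the procedure can be sketched in Python as below. This follows the definition given in this note (S(Xi) = Fx/n, compared with F(Zi) at each sample point); the function name lilliefors_d and the three-point sample are hypothetical, and the critical value Dcr still has to be read from the Lilliefors table.

```python
from bisect import bisect_right
from statistics import NormalDist

def lilliefors_d(sample, mu, sigma):
    """Largest absolute gap between the hypothesized normal CDF and Fx/n."""
    x = sorted(sample)
    n = len(x)
    hypothesized = NormalDist(mu, sigma)
    d = 0.0
    for xi in x:
        f = hypothesized.cdf(xi)     # F(Zi), with Zi = (Xi - mu) / sigma
        s = bisect_right(x, xi) / n  # S(Xi) = Fx / n (handles tied values)
        d = max(d, abs(f - s))
    return d

# Tiny hypothetical sample, tested against mu = 0 and sigma = 1.
d_demo = lilliefors_d([-1, 0, 1], 0, 1)
print(round(d_demo, 4))
```

bisect_right counts how many sorted values are at most Xi, which is exactly the cumulative frequency Fx even when values repeat.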
Appendix 1: The Lilliefors Table
Appendix 2
The Critical value of correlation for the probability plot normality test
N 0.01 0.05
3 0.8687 0.8790
4 0.8234 0.8666
5 0.8240 0.8786
6 0.8351 0.8880
7 0.8474 0.8970
8 0.8590 0.9043
9 0.8689 0.9115
10 0.8765 0.9173
11 0.8838 0.9223
12 0.8918 0.9267
13 0.8974 0.9310
14 0.9029 0.9343
15 0.9080 0.9376
16 0.9121 0.9405
17 0.9160 0.9433
18 0.9196 0.9452
19 0.9230 0.9479
20 0.9256 0.9498
21 0.9285 0.9515
22 0.9308 0.9535
23 0.9334 0.9548
24 0.9356 0.9564
25 0.9370 0.9575
26 0.9393 0.9590
27 0.9413 0.9600
28 0.9428 0.9615
29 0.9441 0.9622
30 0.9462 0.9634
31 0.9476 0.9644
32 0.9490 0.9652
33 0.9505 0.9661
34 0.9521 0.9671
35 0.9530 0.9678
36 0.9540 0.9686
37 0.9551 0.9693
38 0.9555 0.9700
39 0.9568 0.9704
40 0.9576 0.9712
41 0.9589 0.9719
42 0.9593 0.9723
43 0.9609 0.9730
44 0.9611 0.9734
45 0.9620 0.9739
46 0.9629 0.9744
47 0.9637 0.9748
48 0.9640 0.9753
49 0.9643 0.9758
50 0.9654 0.9761
55 0.9683 0.9781
60 0.9706 0.9797
65 0.9723 0.9809
70 0.9742 0.9822
75 0.9758 0.9831
80 0.9771 0.9841
85 0.9784 0.9850
90 0.9797 0.9857
95 0.9804 0.9864
100 0.9814 0.9869
110 0.9830 0.9881
120 0.9841 0.9889
130 0.9854 0.9897
140 0.9865 0.9904
150 0.9871 0.9909
160 0.9879 0.9915
170 0.9887 0.9919
180 0.9891 0.9923
190 0.9897 0.9927
200 0.9903 0.9930
210 0.9907 0.9933
220 0.9910 0.9936
230 0.9914 0.9939
240 0.9917 0.9941
250 0.9921 0.9943
260 0.9924 0.9945
270 0.9926 0.9947
280 0.9929 0.9949
290 0.9931 0.9951
300 0.9933 0.9952
310 0.9936 0.9954
320 0.9937 0.9955
330 0.9939 0.9956
340 0.9941 0.9957
350 0.9942 0.9958
360 0.9944 0.9959
370 0.9945 0.9960
380 0.9947 0.9961
390 0.9948 0.9962
400 0.9949 0.9963
410 0.9950 0.9964
420 0.9951 0.9965
430 0.9953 0.9966
440 0.9954 0.9966
450 0.9954 0.9967
460 0.9955 0.9968
470 0.9956 0.9968
480 0.9957 0.9969
490 0.9958 0.9969
500 0.9959 0.9970
525 0.9961 0.9972
550 0.9963 0.9973
575 0.9964 0.9974
600 0.9965 0.9975
625 0.9967 0.9976
650 0.9968 0.9977