Assessing the Relationship of Data to the Normal Distribution

Do data belong to the Normal Distribution?

As you have probably figured out by now, the normal distribution plays a major role in many types of probabilistic and statistical analyses. Some statistical procedures depend heavily on the assumption of normality, and when that assumption is questionable these procedures should be avoided. It is therefore useful to have techniques available for checking the normality assumption. That is the objective of this short note.

The Normal Probability Plot

The following procedure helps judge qualitatively whether a sample was drawn from a normal distribution. Here is a summary of the procedure:

Place the values in the data set (X) into an ordered array. Call the smallest value in the ordered set X1 and the largest value Xn. Then the set becomes X1, X2… Xn.
Calculate Fx/(n+1), the cumulative relative frequency for each value Xi, where Fx is the cumulative frequency of Xi. From the table of the standard normal distribution or from Excel (using “=NORMSINV(Fx/(n+1))”), find the corresponding standard normal value Zi for each point in the ordered data set. In doing so we hypothesize that the data set was drawn from a normal distribution with some mean and standard deviation.
Plot the pairs of points (Z, X) using the observed data values (Xi) on the vertical axis, and the associated Zi values on the horizontal axis.
Inspect the points plotted for evidence of linearity (i.e. a straight line).
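The steps above can be sketched in a few lines of Python. This is only an illustration, not the Excel workbook used in this note: the function name and the sample values are hypothetical, and the sketch assumes all observations are distinct, so that the cumulative frequency of the i-th smallest value is simply i. The call NormalDist().inv_cdf from the standard library plays the role of NORMSINV.

```python
from statistics import NormalDist

def probability_plot_points(sample):
    """Return (Z, X) pairs for a normal probability plot.

    Assumes all observations are distinct, so the cumulative
    frequency Fx of the i-th smallest value is simply i.
    """
    ordered = sorted(sample)                  # step 1: ordered array
    n = len(ordered)
    std_normal = NormalDist()                 # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(ordered, start=1):
        p = i / (n + 1)                       # step 2: Fx/(n+1)
        z = std_normal.inv_cdf(p)             # same role as NORMSINV
        points.append((z, x))                 # step 3: Z horizontal, X vertical
    return points

# Hypothetical sample, for illustration only.
sample = [12, 15, 9, 21, 18, 14, 17]
for z, x in probability_plot_points(sample):
    print(f"{x:3d}  {z:8.5f}")
```

Plotting the returned pairs and inspecting them for linearity is the last step of the procedure.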

Explanation:

The Z-score for any value of X is Z = (X - μ)/σ, so there is a linear relationship between X and Z, namely X = σZ + μ. The empirical probability of observing a value no larger than X in the sample is Fx/(n+1), where Fx is the cumulative frequency of X. If X is indeed normally distributed, the Z value obtained from the normal table for the probability Fx/(n+1) should be close to the Z-score of that X, and so should be linearly related to X. So if the relationship between X and the table value Z is linear, the data are consistent with a normal distribution.

Example 1

Suppose we wish to obtain the first and the second standard normal ordered values (Z1 and Z2) for a sample of 19 observations (each observation is different in value).

Obtaining Z1: Since Fx = 1, P(Z < Z1) = 1/(19+1) = 1/20 = .05. Under the standard normal distribution Z1 = -1.645 (note that P(Z < -1.645) = .05).

Obtaining Z2: Since Fx = 2, P(Z < Z2) = 2/(19+1) = 2/20 = .10. Under the standard normal distribution Z2 = -1.282 (note that P(Z < -1.282) = .10).

In a similar manner we complete the rest of the Zi values. The pairs (Zi, Xi) are then plotted, and if they lie approximately along a straight line we can reasonably conclude that the data are consistent with a normal distribution.

To determine whether or not there is a linear relationship between X and Z we can test the correlation between them as follows:

H0: The data come from a normal distribution

H1: The data do not come from a normal distribution.

Calculate the test statistic (R) as the sample correlation coefficient between X and Z. Compare R to a critical value Rcr from a table of critical values (provided in Appendix 2; the table was constructed from simulation results). Rcr depends on the sample size and the significance level selected for the test. If R < Rcr there is sufficient evidence to reject H0 and conclude that the data are not normal at the alpha level of significance.
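As a rough illustration of the correlation test, the sketch below computes R between the ordered sample and its normal scores and compares it with a critical value supplied by the caller. The function names and the sample are hypothetical, the sample is assumed to contain distinct values, and the critical value .9479 is the n = 19, alpha = .05 entry of the table in Appendix 2.

```python
from math import sqrt
from statistics import NormalDist

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_normality_test(sample, r_critical):
    """Return (R, reject_H0) for a sample of distinct values."""
    ordered = sorted(sample)
    n = len(ordered)
    # Normal scores from the cumulative relative frequencies i/(n+1).
    z = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    r = pearson_r(ordered, z)
    return r, r < r_critical          # reject H0 when R < Rcr

# Hypothetical sample of 19 distinct values, for illustration only.
sample = [41, 44, 50, 53, 55, 58, 60, 61, 63, 64,
          66, 67, 69, 72, 74, 77, 79, 84, 90]
R_CRIT_19 = 0.9479                    # n = 19, alpha = .05 (Appendix 2)
r, reject = correlation_normality_test(sample, R_CRIT_19)
print(f"R = {r:.4f}, reject H0: {reject}")
```

Passing the critical value as an argument keeps the sketch independent of any particular table; it has to be looked up for the sample size and significance level at hand.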

Important Comment: If the Xi and Zi appear to form a linear relationship, then the line intercept represents the population mean (μ), and the line slope represents the standard deviation (σ).

Example 2

Test scores of 19 students in each of two classes were drawn. Some of the sorted scores are shown below along with the cumulative proportion calculated from the sample (Fx/(n+1)) and the resulting Z values. Details can be found in the file Assess Normal.

Partial set:

Order / Class I / Class II / Prob / Z value
1 / 48 / 47 / 0.05 / -1.64485
2 / 52 / 54 / 0.1 / -1.28155
3 / 55 / 58 / 0.15 / -1.03643
4 / 57 / 61 / 0.2 / -0.84162

After the Z values were derived, the following two graphs were plotted.

Conclusion: The class I scores appear to come from a normal distribution.

From the graph it seems that μ = 65 and σ = (83 - 47)/(1.645 - (-1.645)) = 10.94.
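The slope estimate read off the graph is a rise-over-run calculation between two points of the plot. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the two points are the ones used in the calculation above.

```python
def sigma_from_two_points(x_low, x_high, z_low, z_high):
    """Slope of the line through two points of the probability plot."""
    return (x_high - x_low) / (z_high - z_low)

# Points (Z, X) = (-1.645, 47) and (1.645, 83), as used in the text.
sigma_hat = sigma_from_two_points(47, 83, -1.645, 1.645)
print(round(sigma_hat, 2))   # 10.94
```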

Now observe the probability plot for class II

The result is unclear. Although there seems to be some curvature in the line, the “non-normality” does not appear to be too severe. Since the sample size is only 19, one should not judge the distribution to be non-normal. Let us proceed by testing the correlation as explained above (we’ll run the correlation test for both classes):

H0: The data come from a normal distribution

H1: The data do not come from a normal distribution.

The test statistic calculated with Excel for Class I: R = .999

The test statistic calculated with Excel for Class II: R =.959

The critical value for n = 19 and alpha = .05 is .9479.

There is insufficient evidence to reject the normal distribution at the 5% significance level for either class (since .999 > .9479 and .959 > .9479). To estimate μ and σ we run a linear regression to construct the best-fit line, which results in the equation
X = 11.026Z + 70.894. So μ ≅ 70.7 and σ ≅ 10.89 (see the Excel file).
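The best-fit line can also be obtained outside Excel. The sketch below is a plain least-squares fit of X on Z, with the slope read as an estimate of the standard deviation and the intercept as an estimate of the mean. The function name and the sample are hypothetical; the sample is not the class data from the Excel file.

```python
from statistics import NormalDist

def fit_line(z, x):
    """Least-squares fit of x = slope * z + intercept."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    sxz = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = sxz / szz                 # estimate of the standard deviation
    intercept = mx - slope * mz       # estimate of the mean
    return slope, intercept

# Hypothetical ordered sample of 19 distinct values, for illustration only.
x = [41, 44, 50, 53, 55, 58, 60, 61, 63, 64,
     66, 67, 69, 72, 74, 77, 79, 84, 90]
z = [NormalDist().inv_cdf(i / (len(x) + 1)) for i in range(1, len(x) + 1)]
sigma_hat, mu_hat = fit_line(z, x)
print(f"X = {sigma_hat:.3f} Z + {mu_hat:.3f}")
```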

The following example demonstrates how to construct a probability plot when repeated values are present in the sample (which did not occur in the previous example).

Example 3

To help make a decision about an expansion plan, the president of a music company needs to know how many CDs teenagers buy annually. Accordingly, he commissions a survey of 250 teens, in which they are asked to report how many CDs they purchased in the previous 12 months. Can we assume the number of CDs bought annually by a teenager is normally distributed?

Solution

The following table summarizes the data (see the file AssessNormal1 – the Probability Plot sheet):

X / f / Fx / Fx/(n+1) / Z
6 / 1 / 1 / 0.003984 / -2.65342
8 / 1 / 2 / 0.007968 / -2.41037
9 / 7 / 9 / 0.035857 / -1.80093
10 / 10 / 19 / 0.075697 / -1.43462
11 / 16 / 35 / 0.139442 / -1.08283
12 / 26 / 61 / 0.243028 / -0.6966
13 / 23 / 84 / 0.334661 / -0.42708
14 / 25 / 109 / 0.434263 / -0.16553
15 / 29 / 138 / 0.549801 / 0.125158
16 / 28 / 166 / 0.661355 / 0.416163
17 / 26 / 192 / 0.76494 / 0.722285
18 / 29 / 221 / 0.880478 / 1.17738
19 / 11 / 232 / 0.924303 / 1.434623
20 / 11 / 243 / 0.968127 / 1.853959
21 / 4 / 247 / 0.984064 / 2.146006
22 / 1 / 248 / 0.988048 / 2.258663
23 / 1 / 249 / 0.992032 / 2.410372
26 / 1 / 250 / 0.996016 / 2.653417

Explanations:

The column ‘X’ represents the number of CDs purchased by a teenager annually.
The column ‘f’ is the frequency of X (counts how many times each number appears in the sample). For example, the value 11 appears 16 times (16 teenagers purchased 11 CDs).
The column ‘Fx’ gives the cumulative frequency. For example, 10 or fewer CDs per person appear 19 times (1+1+7+10 = 19).
The column ‘Fx/(n+1)’ gives the cumulative relative frequency. For example, for X = 10, Fx/(n+1) = 19/(250+1) = 19/251 = .075697.
‘Z’ is found by “normsinv” as before.
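The columns of the table can be reproduced with a short script. The sketch below is an illustration rather than the workbook's own calculation: the (X, f) pairs are copied from the table above, the function name is hypothetical, and NormalDist().inv_cdf stands in for the Excel function.

```python
from statistics import NormalDist

# (X, f) pairs copied from the table in Example 3.
cd_counts = [(6, 1), (8, 1), (9, 7), (10, 10), (11, 16), (12, 26),
             (13, 23), (14, 25), (15, 29), (16, 28), (17, 26), (18, 29),
             (19, 11), (20, 11), (21, 4), (22, 1), (23, 1), (26, 1)]

def grouped_plot_table(counts):
    """Return rows (X, f, Fx, Fx/(n+1), Z) for grouped data."""
    n = sum(f for _, f in counts)
    rows, cumulative = [], 0
    for x, f in sorted(counts):
        cumulative += f                        # Fx: cumulative frequency
        p = cumulative / (n + 1)               # Fx/(n+1)
        z = NormalDist().inv_cdf(p)            # standard normal value
        rows.append((x, f, cumulative, p, z))
    return rows

table = grouped_plot_table(cd_counts)
for x, f, fx, p, z in table:
    print(f"{x:2d} / {f:2d} / {fx:3d} / {p:.6f} / {z:.5f}")
```

Each distinct value of X contributes one point (Z, X) to the plot, with Z taken from the last column.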

Now we can draw the graph of Z against X.

Interpretation:

The graph raises some suspicion about the normality of the CD distribution, because the two ends are curved. Yet the extent of the deviation from normality needs to be checked. The correlation test used above yields the following results:

R = .990375; Rcr = .9943 (for n = 250, alpha = .05). Since R < Rcr, there is sufficient evidence to reject normality at the 5% level of significance.

In what follows we present a few hypothesis testing procedures designed to test the normality of a data set analytically.

The Goodness of Fit Chi Squared Test

Example 4

Re-solve Example 3 using the goodness-of-fit chi-square test at the 5% significance level.
Solution:
First, determine Z values that comply with the rule of 5 (the expected number of observations that fall in each interval should be at least 5). The following table shows such a selection of Z values, together with additional information:

i / Interval / Probability / Expected (Ei) / Actual (Fi)
1 / (z ≤ -2) / 0.02275 / 5.6875 / 2
2 / (-2 < z ≤ -1) / 0.135905 / 33.97625 / 33
3 / (-1 < z ≤ 0) / 0.341345 / 85.33625 / 74
4 / (0 < z ≤ 1) / 0.341345 / 85.33625 / 112
5 / (1 < z ≤ 2) / 0.135905 / 33.97625 / 26
6 / (z > 2) / 0.02275 / 5.6875 / 3

Explanations:
Determine the probabilities for the ranges selected.
P(Z ≤ -2) = .02275;

P(-2 < Z ≤ -1) = .1359;

Comment: The Z values (-2, -1, 0, 1, 2) were selected such that when the interval probabilities are calculated the expected number of observations in each one (Ei) will be at least 5. See details below. A symmetrical selection of Z values is preferable.

The expected values (Ei) are calculated as follows:

First interval: E1 = P(Z ≤ -2)(250) = 5.6875
Second interval: E2 = P(-2 < Z ≤ -1)(250) = 33.97625

…and so on…

The actual frequency (Fi) counts the number of sample observations in each interval. First transform the observed values Xi to their corresponding Z-scores using the sample mean and sample standard deviation, Zi = (Xi - x̄)/s, and then count how many Z values belong to each interval. For example, two Z-scores fall in the interval Z ≤ -2, so F1 = 2.
Test the following hypotheses:
H0: The distribution is normal with μ = 14.98 and σ = 3.14
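The counting of actual frequencies can be sketched as follows. The (X, f) pairs are the CD data of Example 3, the interval boundaries are the Z values -2, -1, 0, 1, 2 selected above, and the helper name interval_index is hypothetical. The sample mean and standard deviation are computed from the expanded sample rather than typed in.

```python
from statistics import mean, stdev

# (X, f) pairs copied from the table in Example 3.
cd_counts = [(6, 1), (8, 1), (9, 7), (10, 10), (11, 16), (12, 26),
             (13, 23), (14, 25), (15, 29), (16, 28), (17, 26), (18, 29),
             (19, 11), (20, 11), (21, 4), (22, 1), (23, 1), (26, 1)]

# Expand the frequency table into the 250 individual observations.
sample = [x for x, f in cd_counts for _ in range(f)]
x_bar, s = mean(sample), stdev(sample)   # sample mean and standard deviation

def interval_index(z, cuts=(-2, -1, 0, 1, 2)):
    """Index of the interval (lower < z <= upper) that contains z."""
    return sum(z > c for c in cuts)

observed = [0] * 6
for x in sample:
    observed[interval_index((x - x_bar) / s)] += 1

print(round(x_bar, 2), round(s, 2))      # 14.98 3.14
print(observed)                          # [2, 33, 74, 112, 26, 3]
```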

H1: The distribution is not the above

The test is performed using a chi-square distribution. Use Ei and Fi to calculate the chi-square statistic, χ² = Σ (Fi - Ei)²/Ei, summed over the k intervals.

The test is performed as follows: If χ² > χ²(α, k-1-L), reject H0 (where k is the number of intervals and L is the number of parameters estimated; since we estimate both μ and σ, L = 2).
Let the significance level be .05. This rule translates to a critical value of χ²(.05, 6-1-2) = 7.8147 (a value found in the chi-square table or by using the Excel function =CHIINV(.05,3)).

Conclusion: Since χ² = 15.39 > 7.8147, there is sufficient evidence to reject H0 at the 5% significance level. The distribution is not normal with μ = 14.98 and σ = 3.14.
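The arithmetic of this example can be checked with a short sketch. The function name is hypothetical, the actual frequencies are those of the table above, and the critical value 7.8147 is taken from the text rather than computed, since the Python standard library has no chi-square inverse. Because the interval probabilities are not rounded here, the statistic may differ from the value in the text in the second decimal.

```python
from statistics import NormalDist

def chi_square_gof(observed, cuts, n):
    """Chi-square statistic against standard normal interval probabilities."""
    cdf = NormalDist().cdf
    bounds = [0.0] + [cdf(c) for c in cuts] + [1.0]
    # Expected count in each interval: interval probability times n.
    expected = [(hi - lo) * n for lo, hi in zip(bounds, bounds[1:])]
    statistic = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return statistic, expected

observed = [2, 33, 74, 112, 26, 3]       # actual frequencies Fi
statistic, expected = chi_square_gof(observed, cuts=(-2, -1, 0, 1, 2), n=250)
CRITICAL = 7.8147                        # alpha = .05, 6 - 1 - 2 = 3 degrees of freedom
print(f"chi-square = {statistic:.2f}, reject H0: {statistic > CRITICAL}")
```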

Anderson Darling Test

This is a powerful test that works well even on small samples (n ≤ 25). The test is performed on the ordered data set (X1 ≤ X2 ≤ … ≤ Xn). It applies to any distribution. Specifically for the normal case define:

A² = -n - (1/n) Σ [(2i - 1) ln Φ(Zi) + (2(n - i) + 1) ln(1 - Φ(Zi))], summed over i = 1, …, n.

Zi is calculated by Zi = (Xi - x̄)/s, where x̄ and s are the sample mean and standard deviation respectively. Also Φ(Zi) = Pr(Z < Zi) under the standard normal distribution.

Now calculate the statistic (A*)², the adjustment of A² to the sample size (especially important for small samples), by

(A*)² = A²(1 + .75/n + 2.25/n²).

If (A*)² > A²crit the hypothesis of normality is rejected. Below you can view a few critical values of A²crit.

 / 0.1 / 0.05 / 0.025 / 0.01
A2 crit / 0.631 / 0.752 / 0.873 / 1.035

Example 6

For the data used in example 3 here is a summary of the calculations:

A² = -250 - (1/250)[(2(1)-1) ln Φ(z1) + (2(250-1)+1) ln(1 - Φ(z1)) + (2(2)-1) ln Φ(z2) + (2(250-2)+1) ln(1 - Φ(z2)) + …] = 1.42

(A*)² = 1.42(1 + .75/250 + 2.25/250²) = 1.43

Find details in the file AssessNormal1 - Anderson Darling CD example.

A²crit for 5% significance level = .752

Since 1.43 > .752 there is sufficient evidence at 5% significance level to reject the null hypothesis. The sample does not belong to a normal distribution.
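For readers who want to repeat the calculation outside Excel, the following sketch follows the summation shown in this example. It is an illustration under stated assumptions, not the workbook's implementation: the function name is hypothetical, Zi is computed from the sample mean and sample standard deviation, and ln(1 - Φ(z)) is evaluated as ln Φ(-z), which is the same quantity by symmetry.

```python
from math import log
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(sample):
    """Return (A2, adjusted A2) for a normality check."""
    x = sorted(sample)                        # ordered data set
    n = len(x)
    x_bar, s = mean(x), stdev(x)
    std_normal = NormalDist()
    total = 0.0
    for i, xi in enumerate(x, start=1):
        z = (xi - x_bar) / s
        log_lower = log(std_normal.cdf(z))    # ln Phi(z_i)
        log_upper = log(std_normal.cdf(-z))   # ln(1 - Phi(z_i)), by symmetry
        total += (2 * i - 1) * log_lower + (2 * (n - i) + 1) * log_upper
    a2 = -n - total / n
    a2_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)   # small-sample adjustment
    return a2, a2_star

# CD data of Example 3, expanded from the (X, f) frequency table.
cd_counts = [(6, 1), (8, 1), (9, 7), (10, 10), (11, 16), (12, 26),
             (13, 23), (14, 25), (15, 29), (16, 28), (17, 26), (18, 29),
             (19, 11), (20, 11), (21, 4), (22, 1), (23, 1), (26, 1)]
sample = [x for x, f in cd_counts for _ in range(f)]
a2, a2_star = anderson_darling_normal(sample)
A2_CRIT = 0.752                               # 5% critical value from the table above
print(f"A2 = {a2:.2f}, adjusted = {a2_star:.2f}, reject H0: {a2_star > A2_CRIT}")
```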

The Lilliefors Test

This hypothesis test method is known to give very strong results for samples of size n ≥ 2000. As in the normal plot approach, here too we calculate cumulative probabilities. Yet here we compare the probabilities of a known normal distribution with their sample-based empirical counterparts.

Here is a summary of the procedure:

Determine the mean and standard deviation of the normal distribution under investigation. Set up the hypotheses:

H0: The distribution is normal with mean μ and standard deviation σ.

H1: The distribution is not normal.

Place the values in the data set (X) into an ordered array.
Find the corresponding standard normal Zi values for each point in the ordered data set using the hypothesized mean and standard deviation. That is,
Zi = (Xi - μ)/σ
Determine the cumulative normal probabilities F(Zi) = P(Z < Zi) for each Zi value found in the previous step.
Determine the cumulative sample distribution S(Xi) = Fx/n for each point in the sample.
Calculate the largest absolute difference (D) between F(Zi) and S(Xi):
D = max{|F(Z1) - S(X1)|, |F(Z2) - S(X2)|, …, |F(Zn) - S(Xn)|}
Perform the test as follows: If D>Dcr, reject the null hypothesis. Otherwise, do not reject the null hypothesis. Dcr is a critical value determined by alpha and the sample size, and is provided by the Lilliefors table (see below).
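The statistic D described in the procedure above can be sketched as follows. The function name is hypothetical, and the sketch follows the definition given here literally: it compares F(Zi) with S(Xi) = Fx/n at each distinct data value, where Fx counts the observations less than or equal to Xi. Standard Kolmogorov-Smirnov type implementations also compare F with the sample distribution just before each jump, which this sketch does not do. The critical value Dcr still has to be taken from the Lilliefors table.

```python
from statistics import NormalDist

def lilliefors_d(sample, mu, sigma):
    """Largest |F(Z_i) - S(X_i)| over the ordered sample."""
    ordered = sorted(sample)
    n = len(ordered)
    hypothesized = NormalDist(mu, sigma)
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        # For tied values, only the last copy carries the cumulative frequency Fx.
        if i < n and ordered[i] == x:
            continue
        f = hypothesized.cdf(x)       # F(Z_i) = P(Z < Z_i)
        s = i / n                     # S(X_i) = Fx / n
        d = max(d, abs(f - s))
    return d

# Hypothetical small sample with hypothesized mean 2 and standard deviation 1.
print(round(lilliefors_d([1, 2, 3], mu=2, sigma=1), 4))   # 0.1747
```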

The Lilliefors method was applied to a data set of n = 2000, which can be found in AssessNormal1 – Lilliefors; all the calculations were performed in Excel.

Appendix 1: The Lilliefors Table

Appendix 2

Critical values of the correlation coefficient for the probability plot normality test

n α=0.01 α=0.05

3 0.8687 0.8790

4 0.8234 0.8666

5 0.8240 0.8786

6 0.8351 0.8880

7 0.8474 0.8970

8 0.8590 0.9043

9 0.8689 0.9115

10 0.8765 0.9173

11 0.8838 0.9223

12 0.8918 0.9267

13 0.8974 0.9310

14 0.9029 0.9343

15 0.9080 0.9376

16 0.9121 0.9405

17 0.9160 0.9433

18 0.9196 0.9452

19 0.9230 0.9479

20 0.9256 0.9498

21 0.9285 0.9515

22 0.9308 0.9535

23 0.9334 0.9548

24 0.9356 0.9564

25 0.9370 0.9575

26 0.9393 0.9590

27 0.9413 0.9600

28 0.9428 0.9615

29 0.9441 0.9622

30 0.9462 0.9634

31 0.9476 0.9644

32 0.9490 0.9652

33 0.9505 0.9661

34 0.9521 0.9671

35 0.9530 0.9678

36 0.9540 0.9686

37 0.9551 0.9693

38 0.9555 0.9700

39 0.9568 0.9704

40 0.9576 0.9712

41 0.9589 0.9719

42 0.9593 0.9723

43 0.9609 0.9730

44 0.9611 0.9734

45 0.9620 0.9739

46 0.9629 0.9744

47 0.9637 0.9748

48 0.9640 0.9753

49 0.9643 0.9758

50 0.9654 0.9761

55 0.9683 0.9781

60 0.9706 0.9797

65 0.9723 0.9809

70 0.9742 0.9822

75 0.9758 0.9831

80 0.9771 0.9841

85 0.9784 0.9850

90 0.9797 0.9857

95 0.9804 0.9864

100 0.9814 0.9869

110 0.9830 0.9881

120 0.9841 0.9889

130 0.9854 0.9897

140 0.9865 0.9904

150 0.9871 0.9909

160 0.9879 0.9915

170 0.9887 0.9919

180 0.9891 0.9923

190 0.9897 0.9927

200 0.9903 0.9930

210 0.9907 0.9933

220 0.9910 0.9936

230 0.9914 0.9939

240 0.9917 0.9941

250 0.9921 0.9943

260 0.9924 0.9945

270 0.9926 0.9947

280 0.9929 0.9949

290 0.9931 0.9951

300 0.9933 0.9952

310 0.9936 0.9954

320 0.9937 0.9955

330 0.9939 0.9956

340 0.9941 0.9957

350 0.9942 0.9958

360 0.9944 0.9959

370 0.9945 0.9960

380 0.9947 0.9961

390 0.9948 0.9962

400 0.9949 0.9963

410 0.9950 0.9964

420 0.9951 0.9965

430 0.9953 0.9966

440 0.9954 0.9966

450 0.9954 0.9967

460 0.9955 0.9968

470 0.9956 0.9968

480 0.9957 0.9969

490 0.9958 0.9969

500 0.9959 0.9970

525 0.9961 0.9972

550 0.9963 0.9973

575 0.9964 0.9974

600 0.9965 0.9975

625 0.9967 0.9976

650 0.9968 0.9977