Normality and Outliers in ANOVA and MANOVA

Checking for Normality and Outliers in ANOVA and MANOVA

Lynne Cox

University of Calgary

Checking for non-normality and outliers in ANOVA and MANOVA

A parametric test is a statistical procedure that uses sample statistics to make inferences about the general population. For the components of the test to be compatible with each other, certain assumptions must be met within each multivariate analysis (Stevens, 2009). Once the data have been collected, and before moving forward with statistical analysis, the next step is to examine the quality of the data and take some necessary precautions. Data screening involves checking that the data have been entered correctly, checking for missing values and outliers, and checking for normality (Hindes, 2012).

This paper covers two of these checks: checking for non-normality and checking for outliers in the Analysis of Variance (ANOVA) and the Multivariate Analysis of Variance (MANOVA).

ANOVA is a statistical technique used to test whether the means of three or more groups differ on a single dependent variable.
MANOVA is an extension of ANOVA: it tests whether two or more groups differ on vectors of means, which allows two or more dependent variables to be examined at once.

One of the assumptions of parametric tests is that the variables are normally distributed. The Central Limit Theorem states that, for a large sample size, the sampling distribution of the sample mean (and of the sample sum) will tend to follow a normal distribution, commonly referred to as the bell curve (Stevens, 2009).

If a data set follows a normal distribution, then about 68% of the observations will fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. Although no method gives a definitive conclusion, normality can be evaluated in two ways: through graphical representation and through statistical methods.

Retrieved from http://www.stat.yale.edu/Courses/1997-98/101/normal.htm
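
The percentages quoted above for the normal distribution can be checked with a few lines of Python. This is only a small sketch: the function name prob_within is my own, and it relies on the standard result that the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)).

```python
import math

def prob_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Coverage for one, two and three standard deviations
coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} SD: {p:.1%}")
```

The printed values round to the familiar 68%, 95% and 99.7% figures.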

Outliers

Outliers are data points that are extreme, atypical and infrequent. Their values are far from the mean and fall outside the distribution pattern. Outliers are not always random or due to chance and need to be given special notice, as a single outlier can have an excessive influence on the strength and direction of the linear relationship between two variables (Sattler, 2008, p. 99). In large data samples, you can expect to find a small number of outliers. A data sample will always have a sample minimum and a sample maximum, but this does not mean that these values are outliers, as the sample minimum and sample maximum may be close to the other data points.

Outliers can be caused by a data recording or entry error, by instrument error, or by subjects simply being different from the rest of the sample (Sattler, 2008). Since outliers might cause your data to be non-normal, it is important to identify the cause of an outlier and then decide what to do about it (Stevens, 2009, p. 11).

Examples of outliers

Outliers are easy to spot in a small data set with two variables, as in the example below:

Case Number     x1     x2
1              111     68
2               92     46
3               90     50
4              107     59
5               98     50
6              150     66
7              118     54
8              110     51
9              117     59
10              94     97
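
One simple way to make "easy to spot" concrete is to convert each score in the two-variable table above to a z-score and flag the cases that lie far from the mean. The sketch below is only an illustration: the function name flag_outliers and the cutoff of 2 standard deviations are my own assumptions, not a rule taken from the sources cited in this paper.

```python
from statistics import mean, stdev

# x1 and x2 for cases 1-10, taken from the two-variable table above
x1 = [111, 92, 90, 107, 98, 150, 118, 110, 117, 94]
x2 = [68, 46, 50, 59, 50, 66, 54, 51, 59, 97]

def flag_outliers(values, cutoff=2.0):
    """Return (case_number, z_score) pairs whose |z| exceeds the cutoff.

    The cutoff of 2.0 is an illustrative assumption, not a fixed rule.
    """
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 denominator
    return [(case, round((v - m) / s, 2))
            for case, v in enumerate(values, start=1)
            if abs(v - m) / s > cutoff]

x1_flags = flag_outliers(x1)
x2_flags = flag_outliers(x2)
print("x1:", x1_flags)
print("x2:", x2_flags)
```

With this cutoff, case 6 is flagged on x1 and case 10 is flagged on x2, which matches what the eye picks out of the table.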

In the example below, it is harder to identify the outliers when using a larger data set with four variables:

Case Number     x1     x2     x3     x4
1              111     68     17     81
2               92     46     28     67
3               90     50     19     83
4              107     59     25     71
5               98     50     13     92
6              150     66     20     90
7              118     54     11    101
8              110     51     26     82
9              117     59     18     87
10              94     97     12     69
11             130     57     16     97
12             118     51     19     78
13             155     40      9     58
14             118     61     20    103
15             109     66     13     88

When a data set contains extreme values, the median is the better measure of central tendency, as it is largely unaffected by outliers and is therefore a robust measure (Meyers, 2006). If an outlier is discovered, it is important to identify its cause before deciding on further analysis. If the data were entered incorrectly, they can be re-entered; if the outlier was due to an instrumentation error, the case could be dropped. The analysis can also be run twice, once with the outlier and once without it.

Checking for non-normality and outliers in an ANOVA

As previously mentioned, the two main methods of assessing normality are:

Graphically: using a visual inspection
Numerically: relying on a statistical test

Beginning researchers are advised to carry out both methods rather than relying on just one. SPSS (Statistical Package for the Social Sciences) allows you to test for both normality and outliers. Using the Explore command in SPSS, we are able to first look for any outliers and then test for normality.

Using SPSS to look for univariate (ANOVA) outliers

Descriptives Table

The mean and the 5% trimmed mean help identify outliers. In this case the mean is 1.77, while the 5% trimmed mean is 1.74, only slightly lower. The trimmed mean is calculated after the highest 5% and the lowest 5% of the scores have been removed. Comparing the two values shows whether any extreme scores are having an influence on the variable; here the small difference suggests that they are having little influence.
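
The comparison between the mean and the trimmed mean can be reproduced outside SPSS. The sketch below is a simplified version under stated assumptions: the function name trimmed_mean is my own, the scores are hypothetical (they are not the data behind the 1.77 and 1.74 reported above), and the number of cases dropped from each tail is rounded down to a whole case, so the result may differ slightly from the value SPSS reports.

```python
from statistics import mean

def trimmed_mean(values, percent=5):
    """Mean after dropping the lowest and highest `percent` of the scores.

    Simplification: the number of cases dropped from each tail is rounded
    down to a whole case.
    """
    ordered = sorted(values)
    k = len(ordered) * percent // 100  # cases to drop from each tail
    return mean(ordered[k:len(ordered) - k])

# Hypothetical scores: 20 cases with one extreme high score
scores = [1] * 4 + [2] * 14 + [3, 9]
full = mean(scores)
trimmed = trimmed_mean(scores)
print(f"mean = {full:.2f}, 5% trimmed mean = {trimmed:.2f}")
```

In this made-up example the extreme score pulls the mean upward, so the trimmed mean is noticeably lower; a small gap between the two, as in the Descriptives table, points the other way.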

Extreme Values and the Boxplot

The Boxplot and the Extreme Values table show the mild and extreme outliers. The Extreme Values table identifies the case number of each outlier. This information will help guide the decision on what is to be done with the outliers. You may choose to re-enter the data, remove the outlier, or run two analyses: one with the outlier and one without.
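
The distinction between mild and extreme outliers in a boxplot can be sketched in code. The example below uses the x1 scores from the 15-case, four-variable example earlier in this paper and the common convention that a mild outlier lies more than 1.5 box lengths beyond the box and an extreme outlier more than 3 box lengths. I am assuming that convention here, and the quartiles come from Python's statistics.quantiles, which may compute them slightly differently from SPSS.

```python
from statistics import quantiles

# x1 scores for cases 1-15 from the four-variable example earlier in this paper
x1 = [111, 92, 90, 107, 98, 150, 118, 110, 117, 94, 130, 118, 155, 118, 109]

q1, _, q3 = quantiles(x1, n=4)  # quartiles; SPSS may compute the box edges differently
iqr = q3 - q1                   # the length of the box

mild, extreme = [], []
for case, value in enumerate(x1, start=1):
    distance = max(q1 - value, value - q3)  # how far the score lies beyond the box
    if distance > 3 * iqr:
        extreme.append((case, value))
    elif distance > 1.5 * iqr:
        mild.append((case, value))

print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

Under these assumptions, cases 6 and 13 are mild outliers on x1 and no case is extreme.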

Using SPSS to check for non-normality in an ANOVA

Most parametric tests rely on the assumption of normality. Referring to the Descriptives table, we can begin by looking at the measures of skewness and kurtosis. Skewness measures the symmetry of a distribution, while kurtosis measures its general peakedness. A normally distributed variable (showing mesokurtosis) will have skewness and kurtosis values around zero (Meyers, 2006).
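
To show where these two numbers come from, the sketch below computes skewness and kurtosis with the small-sample adjusted formulas that statistical packages commonly report, in which a normal distribution has a kurtosis of zero. The function names are my own, and I am assuming these are the same formulas behind the Descriptives table. The data are the x2 scores from the ten-case example earlier in this paper.

```python
from statistics import mean, stdev

def skewness(values):
    """Adjusted sample skewness (0 for a symmetric distribution)."""
    n, m, s = len(values), mean(values), stdev(values)
    z3 = sum(((v - m) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * z3

def excess_kurtosis(values):
    """Adjusted sample kurtosis, scaled so that a normal distribution gives 0."""
    n, m, s = len(values), mean(values), stdev(values)
    z4 = sum(((v - m) / s) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# x2 scores for cases 1-10 from the two-variable example earlier in this paper
x2 = [68, 46, 50, 59, 50, 66, 54, 51, 59, 97]
x2_skew = skewness(x2)
x2_kurt = excess_kurtosis(x2)
print(f"skewness = {x2_skew:.2f}, kurtosis = {x2_kurt:.2f}")
```

The skewness of x2 comes out positive because the single high score of 97 stretches the right tail of the distribution.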

Although the histogram is another approach to include when looking for non-normality in a univariate analysis, it does not provide a definitive indication of a violation of normality. The histogram should be used together with the normal probability plot. This plot ranks the data and plots them against a straight diagonal line; when the data points fall directly on the line, normality is assumed.

In this case, the data points fall off the line, so further analysis is needed.

Tests of Normality

When looking at the Tests of Normality table, you want the tests to come out not significant, as a non-significant result indicates normality. In this case the significance level is < .001 for both tests, so both indicate non-normality, which is what the other approaches have also indicated.

Checking for Non-Normality and Outliers in a MANOVA

As MANOVA is sensitive to outliers, the data should be screened and run through normality tests and plots to see that the assumptions are met. Mahalanobis distances help identify outliers in a MANOVA. “The Mahalanobis distance statistic D² measures the multivariate ‘distance’ between each case and the group multivariate mean (known as a centroid)” (Meyers, 2006, p. 67). If the Mahalanobis distance for a case exceeds the critical value found in the table, the case is considered an outlier. Critical value tables are located in the back of most textbooks.
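
The calculation behind the Mahalanobis distance can be sketched for the simplest case of two variables, where the inverse of the covariance matrix can be written out by hand. The example uses the x1 and x2 scores from the ten-case example earlier in this paper; the function name mahalanobis_d2 is my own. No critical value is hard-coded, as it should be read from the table for the number of variables and the significance level being used.

```python
from statistics import mean

# x1 and x2 for cases 1-10 from the two-variable example earlier in this paper
x1 = [111, 92, 90, 107, 98, 150, 118, 110, 117, 94]
x2 = [68, 46, 50, 59, 50, 66, 54, 51, 59, 97]

def mahalanobis_d2(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    # Sample variances and covariance (n - 1 denominator)
    sxx = sum(d * d for d in dx) / (n - 1)
    syy = sum(d * d for d in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    det = sxx * syy - sxy ** 2
    # d' S^-1 d written out for the 2 x 2 covariance matrix S
    return [(syy * a * a - 2 * sxy * a * b + sxx * b * b) / det
            for a, b in zip(dx, dy)]

d2 = mahalanobis_d2(x1, x2)
ranked = sorted(enumerate(d2, start=1), key=lambda pair: pair[1], reverse=True)
for case, value in ranked[:3]:
    print(f"case {case}: D2 = {value:.2f}")
```

Cases 10 and 6 have the largest distances from the centroid, which agrees with the visual inspection of the table. With more than two variables the same calculation is usually left to SPSS or to a matrix library.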

Because more variables need to be normally distributed, checking for non-normality in a MANOVA is a more rigorous task than in an ANOVA. Two additional properties to check for the normality assumption are “(a) any linear combinations of the variables are normally distributed, and (b) all subsets of the set of variables have multivariate normal distributions” (Stevens, 2009, p. 222). The second property implies that the scatterplot for each pair of variables will be elliptical (Stevens, 2009).

The scatterplot also shows outliers as the data points that fall outside the elliptical (oval) shape.

Like the ANOVA, the shape of the distribution for each variable in a MANOVA should follow the bell-shaped curve. Variable 2 shows positive skewness, while variable 1 appears normal, as it follows the bell shape.
When a variable violates the assumption of normality, a data transformation can be used to modify the variable (Hindes, 2012). The square root, logarithmic and inverse transformations are three of the more common transformations. Including a variable that is not normal will reduce the power of the test.
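
The effect of the three transformations can be illustrated with a small sketch. The scores below are hypothetical and positively skewed, the function name skew is my own, and the skewness used is the simple moment-based version without a small-sample adjustment. All scores are positive, which the logarithmic and inverse transformations require.

```python
import math
from statistics import mean, pstdev

def skew(values):
    """Simple moment-based skewness (no small-sample adjustment)."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])

# Hypothetical positively skewed scores (all positive)
raw = [2, 3, 3, 4, 4, 5, 6, 8, 12, 25]
transformed = {
    "raw": raw,
    "square root": [math.sqrt(v) for v in raw],
    "logarithmic": [math.log10(v) for v in raw],
    "inverse": [1 / v for v in raw],  # note: reverses the order of the scores
}
skews = {name: skew(values) for name, values in transformed.items()}
for name, value in skews.items():
    print(f"{name:12s} skewness = {value:.2f}")
```

For these scores the square root transformation reduces the positive skew and the logarithmic transformation reduces it further. Because the inverse transformation reverses the order of the scores, its skewness value needs to be read with that in mind.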

Conclusion

Research and design uses a systematic approach to collecting and analyzing data to help explain or predict a certain occurrence or trend. Using univariate and multivariate data analysis, we are able to obtain a more detailed description of the relationship between the variables being studied. Stronger results are reached if the data have been screened and the assumptions of the test have not been violated. When using the SPSS software program and following a systematic process, checking for non-normality and outliers in an ANOVA and a MANOVA is straightforward and thorough, with the data being analyzed through both graphical and statistical methods.

References

Gay, L. R., Mills, G. E., & Airasian, P. (2012). Educational research: Competencies for analysis and applications (10th ed.). NJ: Pearson Education, Inc.

Hindes, Y. (2012). EDPS 607 L20 Multivariate Design and Analysis, Spring 2012. PowerPoint.

Meyers, L. S., Gamst, G., & Guarino, A. J. (2006). Applied multivariate research: Design and interpretation. Thousand Oaks, California: Sage Publications.

Sattler, J. M. (2008). Assessment of children: Cognitive foundations (5th ed.). La Mesa, California: Jerome M. Sattler, Publisher, Inc.

Todorov, V., & Filzmoser, P. (2010). Robust statistic for the one-way MANOVA. Computational Statistics and Data Analysis, 54, 37-48. doi:10.1016/j.csda.2009.08.015

Retrieved from: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Retrieved from:

statistics.php

Retrieved from: http://pathwayscourses.samhsa.gov/eval201/eval201_supps_pg16.htm

Scatterplot.gif retrieved from: