Sample exploratory data analysis Info Sys 271

Why are we doing this?

Part of doing an exploratory data analysis is calculating lots of statistics and drawing lots of graphs that might not make it into your final write-up. Graphs and statistics that aren’t directly used in the report can be included in appendices to your report. Part of the reason we do exploratory data analysis is so we can justify the methods we use in further analysis. Some statistical methods you will encounter require the data to have certain characteristics, such as being approximately normally distributed (e.g. not too skewed or bi-modal). Exploratory data analysis is also an opportunity to look for interesting features of the data that enable us to form hypotheses (e.g. relationships which might allow us to predict one variable from another). Exploratory data analysis is not a tool for drawing conclusions (supporting or rejecting hypotheses) because it doesn’t tell us whether the trends we think we can see are actually statistically significant. An exploratory data analysis write-up should be short!

What are we doing?

Numerical methods

Calculate means and standard deviations for the data. You might also want to calculate medians and modes if you suspect the data are not normally distributed. The mode should be used for nominal level data and the median for ordinal level data. Calculate values for skewness and kurtosis. Where we have nominal level variables, we might be particularly interested in calculating means and standard deviations for each value of the nominal variable. This will help us work out whether there is a difference between the categories. For example, we should calculate these statistics for each value of who picked the stocks (the values are pros, darts, djia).
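As a rough illustration of the calculations described above, the sketch below computes descriptive statistics for each group using only the Python standard library. The function name `describe` and the small data set are hypothetical (they are not the course data). The skewness and kurtosis formulas are the bias-adjusted sample versions that many statistics packages report; treating them as the right definition here is an assumption.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers.

    Skewness and kurtosis use bias-adjusted sample formulas (an assumption;
    other definitions exist). Needs at least 4 values that are not all equal.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    z = [(v - mean) / sd for v in values]  # standardised scores
    skewness = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical percentage price changes, grouped by who picked the stocks.
groups = {
    "pros": [12.0, -3.5, 20.1, 7.4, 15.0, -8.2],
    "darts": [4.0, -10.5, 30.2, 1.1, -2.3, 6.6],
}

# One set of statistics per category of the nominal variable.
summary = {name: describe(values) for name, values in groups.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Computing the same statistics for every category makes it easy to compare the groups side by side before deciding which differences are worth commenting on.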

Descriptive Statistics
Variable / N / Minimum / Maximum / Mean / Std. Error of Mean / Std. Deviation / Skewness / Std. Error of Skewness / Kurtosis / Std. Error of Kurtosis
PROS / 100 / -37.80 / 75.00 / 10.9470 / 2.2247 / 22.2466 / .479 / .241 / .317 / .478
DARTS / 100 / -43.00 / 72.90 / 4.5210 / 1.9388 / 19.3883 / .770 / .241 / 2.237 / .478
DJIA / 100 / -13.10 / 22.50 / 6.7930 / .8031 / 8.0315 / -.280 / .241 / -.215 / .478
Valid N (listwise) / 100
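The standard error columns in the table above can be checked from N and the standard deviation alone. The sketch below is a minimal illustration: the function names are hypothetical, and the skewness and kurtosis expressions are the usual formulas for bias-adjusted estimates under normality. That the table was produced with exactly these formulas is an assumption, although they agree with the reported values to three decimal places.

```python
import math


def se_mean(sd, n):
    """Standard error of the mean: standard deviation over sqrt(N)."""
    return sd / math.sqrt(n)


def se_skewness(n):
    """Standard error of sample skewness (depends only on N)."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis (depends only on N)."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


# Standard deviations copied from the Descriptive Statistics table (N = 100).
standard_deviations = {"PROS": 22.2466, "DARTS": 19.3883, "DJIA": 8.0315}
n = 100

for name, sd in standard_deviations.items():
    print(f"{name}: standard error of the mean = {se_mean(sd, n):.4f}")
print(f"standard error of skewness = {se_skewness(n):.3f}")
print(f"standard error of kurtosis = {se_kurtosis(n):.3f}")
```

Note that the standard error of the mean is ten times smaller than the standard deviation here because N = 100, so the two columns are easy to confuse when reading the table.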

Commenting on the numerical data

Usually we won’t want to comment on all the numerical data, just the interesting differences. Differences in means tell us that we might be able to find a statistically significant difference later. Differences in standard deviations between categories can violate the assumptions behind some parametric statistical tests and invalidate any results we get.

The Pros had the highest mean increase in stock values (10.9) followed by the Dow Jones Index (6.8) and the Darts (4.5). The Pros and Darts groups had similar standard deviations (22.2 and 19.4 respectively) while the DJIA had a much lower standard deviation (8.0).

Graphical data analysis

Graphical data analysis should provide a quick visual summary. From looking at a graph for 10 seconds you should be able to infer general trends in the data. Graphical data analysis is essential to check for normal distributions. A histogram or scatterplot of interval level data will reveal any quirks in the data such as being bi-modal, skewed or having a truncated tail. Nominal or ordinal data should be represented as a bar graph. Box and whisker plots can be used to find differences between different nominal categories on an interval variable (e.g. between stock values for Darts, Pros and the DJIA). To facilitate quick visual inspection there are a few tips you should follow: always use the same axis ranges and labels where you want different graphs to be comparable; don’t put more than 7 axis labels on an axis; don’t put more than about five trend lines on a graph (fewer if the trend lines are hard to distinguish). Some graphs are designed so that it is easy to extract numeric information from them (often histograms or carefully labelled pie charts enable numeric analysis). Pie charts should only be used to show proportions of a whole (or to demonstrate the capabilities of a colour printer).
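The tip about using the same axis ranges can be illustrated with a small text histogram. This is only a sketch: the function names and the data are hypothetical (not the course data), and a plotting package would normally be used instead of printed characters. The point is that both groups are binned over one shared range, so the two pictures can be compared at a glance.

```python
def histogram_counts(values, low, high, bins):
    """Count values in equal-width bins spanning [low, high] (needs high > low)."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value on the top edge is placed in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


def text_histogram(label, values, low, high, bins):
    """Return a printable histogram with one row of '#' marks per bin."""
    width = (high - low) / bins
    lines = [label]
    for i, count in enumerate(histogram_counts(values, low, high, bins)):
        left = low + i * width
        lines.append(f"{left:7.1f} to {left + width:7.1f} | {'#' * count}")
    return "\n".join(lines)


# Hypothetical percentage price changes for two groups.
pros = [12.0, -3.5, 20.1, 7.4, 15.0, -8.2, 9.9, 11.3]
darts = [4.0, -10.5, 30.2, 1.1, -2.3, 6.6, -15.0, 3.3]

# One shared range so that the two histograms are directly comparable.
low = min(pros + darts)
high = max(pros + darts)
for label, values in (("Pros", pros), ("Darts", darts)):
    print(text_histogram(label, values, low, high, bins=5))
    print()
```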

Distribution of the percentage price change on investments for the Pros

The distribution of percentage price change on investments for the professionals is approximately normally distributed with some positive skew.

Correlational analysis

Correlational analysis can be performed numerically alone, but if you don’t draw a scatterplot it is easy to miss important trends such as non-linear relationships.
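To see why a number alone can mislead, the sketch below computes a Pearson correlation by hand for two made-up data sets. The function name `pearson_r` and the data are hypothetical. In the curved case y depends entirely on x, yet the correlation coefficient comes out at essentially zero because the relationship is not linear; a scatterplot would show the curve immediately.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
curved = [x * x for x in xs]        # perfect but non-linear relationship
straight = [2 * x + 1 for x in xs]  # perfect linear relationship

print("curved:  ", round(pearson_r(xs, curved), 3))
print("straight:", round(pearson_r(xs, straight), 3))
```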

If you look carefully at these scatterplots you will see that there are only three unique scatterplots but each is duplicated (for example, by correlating Pros with Darts as well as Darts with Pros). Confusing! But it does allow you to inspect a lot of data to find where correlations exist with a quick glance. This is what we want for exploratory data analysis especially if we have a lot of variables to explore. Plots like this shouldn’t be used where anyone might want to extract numeric data from the plot.

By looking at the numerical summaries of the correlations we can assess the magnitude of a correlation. The square of the correlation coefficient tells us the proportion of the variance in one variable that is explained by the variance in the other variable. CORRELATION IS NOT CAUSATION, so all we can say is that the scores vary together. Sometimes a third factor is causing both of our correlated variables to change. For example, an increase in the value of the Pros’ stocks does not cause the value of the DJIA to increase; a rise in the stock market causes both to change. The sign of a correlation (positive or negative) tells us its direction.

Correlations
Variable / Statistic / PROS / DARTS / DJIA
PROS / Pearson Correlation / 1.000 / .324(**) / .538(**)
DARTS / Pearson Correlation / .324(**) / 1.000 / .428(**)
DJIA / Pearson Correlation / .538(**) / .428(**) / 1.000
** Correlation is significant at the 0.01 level (2-tailed).

We can say:

The performance of the Pros’ stocks is moderately strongly positively correlated with the performance of the DJIA stocks (a Pearson correlation of 0.538). The performance of the Darts’ stocks is weakly positively correlated with the performance of the Pros’ stocks (a Pearson correlation of 0.324).
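One way to put these magnitudes in perspective is to square each coefficient, which gives the proportion of variance the two variables share. The sketch below does this for the three coefficients in the Correlations table; the variable names are illustrative only.

```python
# Pearson correlations copied from the Correlations table.
correlations = {
    ("PROS", "DJIA"): 0.538,
    ("PROS", "DARTS"): 0.324,
    ("DARTS", "DJIA"): 0.428,
}

# Squaring r gives the proportion of shared variance.
shared_variance = {pair: r ** 2 for pair, r in correlations.items()}

for (first, second), r in correlations.items():
    r_squared = shared_variance[(first, second)]
    print(f"{first} vs {second}: r = {r:.3f}, r squared = {r_squared:.3f}")
```

Even the strongest of the three correlations accounts for under a third of the variance, which is worth keeping in mind when describing a correlation as moderate or strong.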

See this scatterplot to see a single scatterplot of rainfall vs the number of rainy days.