Chapter 4: Variability
Variability provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.
The range (either maximum score – minimum score or maximum score – minimum score + 1) is one gross measure of variability (that relies on only two scores from the distribution). The semi-interquartile range is half of the interquartile range (Q3 – Q1), so (Q3 – Q1) / 2. We won’t really make much use of either of these measures throughout the semester.
Let’s think about the type of measure that we’d like to use to assess variability. First of all, we’d probably like every score to contribute to the measure. And one very reasonable approach would be to assess the difference (or distance) of every score from the mean. If every score were near the mean, then those distances would be small, so the measure of variability would be small as well. If scores were far from the mean, then those distances would also be large, resulting in a large measure of variability. Sound good so far?
OK, then we could simply sum all the differences between the scores in our distribution and the mean of the distribution. But we already know what will happen then, right?
X / X - $\mu$
1 /
2 /
3 /
Sum /
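If you'd like to check the table by machine, here is a minimal Python sketch (Python is not part of these notes, and the variable names are just illustrative). It computes the signed difference of each score from the mean for the data set 1, 2, 3 and then sums those differences.

```python
# Scores from the small example data set in the table above.
scores = [1, 2, 3]

# Population mean: the sum of the scores divided by the number of scores.
mu = sum(scores) / len(scores)

# Signed difference of each score from the mean.
deviations = [x - mu for x in scores]

print(deviations)       # [-1.0, 0.0, 1.0]
print(sum(deviations))  # 0.0
```

The negative and positive differences balance exactly, which is what happens for any data set when the differences are taken from its own mean.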
The problem is that the signs of the differences cancel out one another, resulting in zero. So, how to get rid of the signs? One approach would be to take absolute values of the differences, $|X - \mu|$.
X / $|X - \mu|$
1 /
2 /
3 /
Sum /
That’s actually a pretty good measure of variability, except that it will typically be larger with larger data sets. A more reasonable measure of variability would correct for the number of scores (as we do for the mean). So, then a very reasonable measure of variability (the mean absolute deviation) would be:
$$\text{mean absolute deviation} = \frac{\sum |X - \mu|}{N}$$
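As a rough illustration of the mean absolute deviation, the sketch below (in Python, which these notes don't otherwise use; all names are made up for the example) sums the absolute differences from the mean for the scores 1, 2, 3 and then divides by the number of scores.

```python
# Scores from the small example data set.
scores = [1, 2, 3]
N = len(scores)

# Population mean.
mu = sum(scores) / N

# Absolute difference of each score from the mean (signs removed).
abs_deviations = [abs(x - mu) for x in scores]

# Sum of the absolute differences, then corrected for the number of scores.
sum_abs = sum(abs_deviations)
mean_abs_dev = sum_abs / N

print(sum_abs)       # 2.0
print(mean_abs_dev)  # about 0.667
```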
Although reasonable, we won’t actually use this measure. Instead, we’ll take a different approach to removing those pesky signs…we’ll square all the differences. That measure is called the sum of squared differences from the mean (SS), and would be computed as follows:
X / $(X - \mu)^2$
1 /
2 /
3 /
Sum /
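The squared-difference approach can be sketched in a few lines of Python (an assumption of this example, as is the helper name `sum_of_squares`). The function applies the definitional approach directly: find the mean, square each difference from it, and add the squares up. The second call repeats the same scores twice to show how SS grows simply because there are more scores.

```python
def sum_of_squares(scores):
    """Definitional SS: sum of squared differences from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

small = [1, 2, 3]
doubled = [1, 2, 3, 1, 2, 3]  # same scores, twice as many of them

print(sum_of_squares(small))    # 2.0
print(sum_of_squares(doubled))  # 4.0
```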
Once again, however, the SS isn’t a good measure of variability because it will typically be larger for larger data sets. So, once again, we’d be inclined to divide the SS by the number of scores in the data set. That measure of variability is one that we will use, and it’s called the variance (here for the population).
The variance is a nifty measure, but it has the problem that the units involved in the measure (seconds, inches, etc.) are now squared. In order to rectify that problem, we need a new measure of variability that returns to the original units of measurement, which is the standard deviation ($\sigma$). To compute the standard deviation, you simply take the square root of the variance:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{SS}{N}}$$
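To see the whole chain from SS to variance to standard deviation in one place, here is a small Python sketch for the population 1, 2, 3 (Python and the variable names are assumptions of this example, not part of the notes).

```python
import math

# Treat these three scores as the entire population.
scores = [1, 2, 3]
N = len(scores)

mu = sum(scores) / N                      # population mean
ss = sum((x - mu) ** 2 for x in scores)   # sum of squared differences
variance = ss / N                         # population variance (squared units)
std_dev = math.sqrt(variance)             # back to the original units

print(ss)        # 2.0
print(variance)  # about 0.667
print(std_dev)   # about 0.816
```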
So, there you have it. That’s the logic that underlies the measures of variability that we’ll use most often. Now, we’ll have to expand on these measures just a bit as we develop an easier way to compute the SS on a calculator and develop the measures of variability that we’d use to estimate population variability from a sample.
The computational formula for SS
Computer programs for statistical analyses actually use the definitional formula for computing SS. However, when using a calculator (and a small data set), it’s better to use the computational formula for SS:
$$SS = \sum X^2 - \frac{(\sum X)^2}{N}$$
So, for this simple data set, compute SS.
X / $X^2$
1 /
2 /
3 /
Sum /
SS =
Note that this is the only computational formula for SS. It does not differ when computing the SS for a population or a sample.
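A quick way to convince yourself that the computational and definitional approaches agree is to code both and compare. The Python sketch below does that; the function names are hypothetical, and the computational version uses the standard form, the sum of the squared scores minus the squared sum of the scores divided by the number of scores.

```python
def ss_definitional(scores):
    """SS as the sum of squared differences from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def ss_computational(scores):
    """SS as (sum of X squared) minus (sum of X) squared over N."""
    sum_x = sum(scores)
    sum_x2 = sum(x ** 2 for x in scores)
    return sum_x2 - sum_x ** 2 / len(scores)

scores = [1, 2, 3]
print(ss_definitional(scores))   # 2.0
print(ss_computational(scores))  # 2.0  (14 - 36/3)
```

Only two running totals are needed for the computational version, which is why it suits a calculator.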
Describing population variance and estimating population variance from a sample
The formulas for describing the variance and standard deviation of a population are straightforward (as illustrated above). However, when we are dealing with sample data (as is typically the case), and we’re interested in estimating the population variability, then the formulas are slightly different.
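/ Describing Population Variability / Using a Sample to Estimate Population Variability
SS / $\sum (X - \mu)^2$ / $\sum (X - \bar{X})^2$
Variance / $\sigma^2 = \frac{SS}{N}$ / $s^2 = \frac{SS}{n - 1}$
Standard Deviation / $\sigma = \sqrt{\frac{SS}{N}}$ / $s = \sqrt{\frac{SS}{n - 1}}$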
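Python's standard `statistics` module happens to provide both versions of each measure, so the distinction can be sketched as follows. The data set 1 through 5 is used purely for illustration: `pvariance` and `pstdev` divide SS by the number of scores (describing a population), while `variance` and `stdev` divide by one less than the number of scores (estimating from a sample).

```python
import statistics

scores = [1, 2, 3, 4, 5]  # SS for these scores is 10

# Treating the scores as a whole population: divide SS by N.
pop_var = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)

# Treating the scores as a sample: divide SS by n - 1.
samp_var = statistics.variance(scores)
samp_sd = statistics.stdev(scores)

print(pop_var, samp_var)  # 10/5 = 2 versus 10/4 = 2.5
print(pop_sd, samp_sd)    # square roots of the two variances
```

The sample versions are always a bit larger for the same scores, because the same SS is divided by a smaller number.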
Why are the degrees of freedom (n-1) for sample variance?
When one wants to estimate the population variance ($\sigma^2$) from a sample, one divides the sample sum of squares (SS) by the appropriate degrees of freedom (df). I want to demonstrate here why the df needs to be (n - 1).
Suppose that one has a simple population, with only three members (1, 2, 3). From the computations above, you should know the population mean ($\mu$) and variance ($\sigma^2$). [Remember, this is the population, not a sample, so one would use descriptive statistics. For the variance, just divide the SS by N, not (N - 1).] Write those values below:
$\mu$ = / $\sigma^2$ =
Next we want to get a sense of all possible samples of two scores we might take from this population. We will sample with replacement, which means that it is possible to get the same score twice (like picking it out of the hat, putting it back in, and then picking it out again). What are the nine possible combinations of scores we could get from this population?
Now, for each sample compute the mean of the sample ($\bar{X}$) and the SS for each sample. Then divide the SS by n (n = 2 in this instance) for one estimate of the population variance. Then divide SS by n - 1 (or 1 in this instance) as a second estimate of the population variance. Finally, compute the mean for each column of statistics.
Sample / Mean / SS / SS/n / SS/(n-1)
Mean -> / / / /
When you compute a statistic for every possible sample of a given size that can be taken from a population, and then create a frequency distribution of those values, you have created a sampling distribution. So the means of all 9 samples form a sampling distribution of the mean. [This is a very important distribution, to which we will return shortly.] For a statistic to be a good (unbiased) estimator, the mean of its sampling distribution should equal the population parameter that one wants to estimate. So, if you wanted to estimate the parameter $\mu$, the population mean, the sample mean would be an accurate statistic, because the average (mean) sample mean is exactly equal to the population mean. Check the mean of all 9 sample means. Does it equal $\mu$? Can you see why the sample mean is, on average, a good estimate of the population mean? Can you also see that the sample mean does not always equal the population mean? (That’s sampling error.)
Here are a couple of other questions related to the sampling distribution of the mean. What proportion of the population scores fell at the mean ($\mu$)? What proportion of means in the sampling distribution fell at the mean ($\mu$)? Can you explain why this happens?
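Once you have worked the table by hand, you could check the sampling distribution of the mean with a short Python sketch like the one below. It is only an illustration: Python is not used elsewhere in these notes and the variable names are invented. It lists every sample of two scores drawn with replacement from the population 1, 2, 3, computes each sample mean as an exact fraction, and tallies the means into a frequency distribution.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
n = 2  # size of each sample

# Every ordered sample of n scores, drawn with replacement.
samples = list(product(population, repeat=n))

# Exact mean of each sample.
sample_means = [Fraction(sum(s), n) for s in samples]

# Frequency distribution of the sample means (the sampling distribution).
freq = Counter(sample_means)

# Mean of all the sample means, compared with the population mean.
mean_of_means = sum(sample_means) / len(sample_means)
mu = Fraction(sum(population), len(population))

print(len(samples))  # 9
for value in sorted(freq):
    print(value, freq[value])
print(mean_of_means == mu)  # True
```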
So, now the question is, “Using which df in the sample variance yields the best estimate of the population variance?” Looking at the means of the two columns should convince you that using df = n produces an underestimation of the population variance, while using df = n - 1 yields an accurate estimation of the population variance. So you should be convinced that it is important to use n - 1 as the df when computing the sample variance to estimate the population variance.
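The same demonstration can be run as a Python sketch, if you want to confirm your hand computations (the language and the names `biased` and `unbiased` are choices made for this example only). For each of the nine samples it computes SS around the sample mean, divides once by n and once by n - 1, and then averages each column. Exact fractions are used so that no rounding gets in the way of the comparison.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
N = len(population)

# Population mean and variance (descriptive: divide SS by N).
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

n = 2           # sample size
biased = []     # SS / n for each sample
unbiased = []   # SS / (n - 1) for each sample

for sample in product(population, repeat=n):
    m = Fraction(sum(sample), n)             # sample mean
    ss = sum((x - m) ** 2 for x in sample)   # SS around the sample mean
    biased.append(ss / n)
    unbiased.append(ss / (n - 1))

mean_biased = sum(biased) / len(biased)
mean_unbiased = sum(unbiased) / len(unbiased)

print(sigma2)         # 2/3
print(mean_biased)    # 1/3
print(mean_unbiased)  # 2/3
```

Dividing by n gives estimates that average only half of the population variance here, while dividing by n - 1 gives estimates that average exactly the population variance.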
How is variability affected by constant changes in scores?
What would be the impact on variance of adding a constant to each member of the data set? What would be the impact on variance of multiplying each member of the data set by a constant?
You could answer these questions logically, but let’s simply compute to determine the answers.
X / X + 3 / X • 3
1 / 4 / 3
2 / 5 / 6
3 / 6 / 9
4 / 7 / 12
5 / 8 / 15
SS = / SS = / SS =
$s^2$ = / $s^2$ = / $s^2$ =
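After filling in the table by hand, you could check it with a Python sketch along these lines (again, Python and the helper names are assumptions of this example). It computes SS and the sample variance for the original scores, for the scores plus 3, and for the scores times 3.

```python
def ss(scores):
    """Sum of squared differences from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def sample_variance(scores):
    """SS divided by n - 1."""
    return ss(scores) / (len(scores) - 1)

scores = [1, 2, 3, 4, 5]
shifted = [x + 3 for x in scores]  # add a constant
scaled = [x * 3 for x in scores]   # multiply by a constant

for label, data in [("X", scores), ("X + 3", shifted), ("X * 3", scaled)]:
    print(label, ss(data), sample_variance(data))
# X      10.0  2.5
# X + 3  10.0  2.5
# X * 3  90.0  22.5
```

Adding a constant slides every score and the mean by the same amount, so the differences from the mean don't change. Multiplying by a constant stretches every difference by that constant, so the squared differences (and therefore SS and the variance) grow by the constant squared, here 9.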