P. Box Plots (Modified and Unmodified), Quartiles, and IQR

Demonstrations II

Dr. Scott Stevens

N. Arithmetic mean (Average)

O. Median (and comparison with the mean)

P. Box plots (modified and unmodified), quartiles, and IQR

Q. The mean, from grouped data

R. Comparison of measure of central tendency: mean, median, mode, midrange, midhinge

S. Standard deviation and variance

N. Arithmetic Mean (Average)

[For the mean from grouped data, see Q.]

Problem: During the 1980 presidential campaign, Ronald Reagan repeatedly asked voters if they were better off in 1980 than they were 4 years before. Here are some data (in yellow) on the unemployment and inflation rates during the Carter Administration (’77 to ’80) and the Reagan Administration (’81 to ’88). Is there any difference between the average unemployment rate in the two administrations? Average inflation rate?

This is quite easy to compute by hand. Assuming that by "average" the problem means "mean", we just take the numbers to be averaged, add them, then divide by the number of numbers added. We can also do this with the Excel =AVERAGE(range) function.

A / B / C
1 / Year / Unemployment / Inflation
2 / 1977 / 7.1 / 6.7
3 / 1978 / 6.1 / 9
4 / 1979 / 5.8 / 13.3
5 / 1980 / 7.1 / 12.5
6 / 1981 / 7.6 / 8.9
7 / 1982 / 9.7 / 3.8
8 / 1983 / 9.6 / 9.6
9 / 1984 / 7.5 / 7.5
10 / 1985 / 7.2 / 7.2
11 / 1986 / 7 / 7
12 / 1987 / 6.2 / 6.2
13 / 1988 / 5.5 / 5.5
14
15 / Carter Average / 6.525 / 10.375
16 / Reagan Average / 7.5375 / 6.9625

So the Carter unemployment average, for example, was computed as = AVERAGE(B2:B5), since the unemployment figures for the Carter years were in spreadsheet cells B2 through B5.

The result show that unemployment was lower during the Carter years, but that inflation was considerably lower during the Reagan years. The question that remains is whether the differences in these figures are too large to be credited to random fluctuation alone. That question is one we take up later in the semester.

It’s important to note that the average, a single number, cannot tell the whole story of a data set. Be careful what conclusions you draw from a “one number summary”!

O. Median (and comparison with the mean)

Problem: Below is a table of the rainfall recorded in the Los Angeles area in the last 10 years. (Unlike most of these demonstrations, I’ve made up this data for this one.) Compute the median rainfall based on these data. Without computing the mean, state whether the mean rainfall for this ten year period would be above, below, or equal to this median value.

rainfall in a year / 3 / 4 / 5 / 5 / 5 / 6 / 6 / 14 / 17 / 20
median rainfall / 5.5

The first step is to sort the data, smallest to largest
–I’ve already done this with the data. Then, in a set of 10 observations, the median should be observation number (10+1)/2 = 5.5…that is, the number halfway between observation #5 and observation #6. Since #5 is 5” and #6 is 6” in this case, the median is 5.5”. You can also get this from Excel, by using the =MEDIAN(range) command.

Now, how does this compare to the mean?

If you imagine these bars sitting on a see-saw along the number line of the x-axis, the mean would be the balance point of the see-saw. (Think of where you’d have to put the fulcrum (or balance point) of the see-saw to make the “kids” balance. You should be able to see that it’s somewhere near where I put it below, at 8.5. The three little kids on the right side at positions 14, 17, and 20 will just balance the two small and two big kids on the left side (at positions 3, 4, 5, and 6).)

On the other hand, the median point for a distribution is the point at which half of the "stuff" is to the left and half of the "stuff" is to the right. If we stick with our see-saw imagery, "stuff" means "weight", and I hope that you can see that 5 units of weight lie to the left of 5.5 (the median value), and 5 units of weight lie to the right of 5.5

This way of thinking about mean and median is quite useful, so take some time to lock it down. What you can learn from this example holds in general. Note, for example:

If the last kid slides from position 20 to position 25 on the see-saw, the balance point would shift to the right (a higher mean), but the median ("half-weight") point would remain unchanged. In general, the mean is more affected by extreme values than then median.

If the distribution has a long "tail" in one direction (like this one does to the right—it is "skewed right"), then the mean tends to be more than the median. In general, if a distribution is unimodal (one highest point) and skewed, then as you run up the long slope of the "hill", you'll encounter the mean, median, and mode, in that order. The mode, of course, will be at the top of the hill.

If the distribution is symmetric ("mirror imaged, left to right"), then the mean and median are equal. If the symmetric distribution is also unimodal (one highest point), then the mode is also equal to the mean and the median.

Don't memorize these results—understand them. And think about what the measures given in a given problem really tell you. For example, suppose you’re told that the average number of people in a US household is 2.4 persons. This is the mean (since it clearly isn't the median or mode!). It is the "balance point" of the population distribution. Saying it another way—if all of the people in all of the households were gathered together, and then distributed evenly among all of those households…well, we'd have a bloody mess, since each household would get 2.4 people.

But that doesn't mean that 2.4 is "typical", or even close to "typical". For example, if 80% of all households have 1 person, and the remaining 20% all had 8 persons, the average would be 2.4.

The mean is a measure of "middle", but we almost always need a measure of "spread", as well. The most common is standard deviation.

P. Box Plots (Modified and Unmodified), Quartiles, and IQR

Problem: In the yellow box on the next page, find the annual inches of precipitation at the Los AngelesCivicCenter for the years 1961 to 1990. Summarize this data with a boxplot and modified box plot.

It's not hard to do this work by hand, but the goal of this course is to give you tools that you can use effectively and responsibly. To help you with this, I've written a number of spreadsheet templates that will perform common statistical tasks, to supplement those functions already a part of Excel. Sometimes, my templates duplicate functions already available in Excel. When I do this, it is for one of two reasons. Either 1) Excel's built-in function is restrictive about how the input data must be supplied, or 2) Excel's built-in function is not helpful in understanding how the answer is obtained. Since you must understand a tool clearly in order to use it effectively, my template often provide a "step-by-step" approach.

I'll also be providing templates to perform some of the tasks that Excel can't do at all, or can only do incompletely. That's what I've done for this example. The template (available at my website) is Frequency Distribution, Histogram, and Box and Whisker Plot. Before you start using my templates, though, there are some general things you should know about them. Please check the website post entitled Using Stevens’ Statistical Templates: Useful Information. It uses this problem as an example.

Data / Outlier? / Statistics / Values / Formulas
4.56 / mean, x-bar / 14.70533333 / =AVERAGE(range)
5.83 / smallest entry / 4.56 / =MIN(range)
6.49 / largest entry / 34.04 / =MAX(range)
6.54 / number of observations / 30 / =COUNT(range)
7.58
7.98 / median / 12.61 / =MEDIAN(range)
8.9 / 1st quartile / 8.9675 / =QUARTILE(range, 1)
8.92 / 3rd quartile / 17.3375 / =QUARTILE(range, 3)
9.11 / IQR / 8.37 / = 3rd quartile - 1st quartile
9.26 / lower whisker / 4.56 / = 1st quartile - 1.5 * IQR or smallest val
9.98 / upper whisker / 29.8925 / = 3rd quartile + 1.5 * IQR or largest val
10.7
10.92
11.01
12.31 / Frequency Table for Histogram
12.91 / suggested # of cats / 5
14.41 / category size / 5
14.97 / At least… / …but less than… / frequency
15.37 / 4 / 9 / 8
16.54 / 9 / 14 / 8
16.69 / 14 / 19 / 8
17 / 19 / 24 / 1
17.45 / 24 / 29 / 3
18 / 29 / 34 / 1
23.66 / 34 / 39 / 1
26.32
26.33
26.81
30.57 / outlier
34.04 / outlier

We compute the numbers needed for a modified box plot and unmodified box plot. Let's start with the modified box plot. Note the commands that Excel uses to find median, 1st quartile, and 3rd quartile. The interquartile range (IQR) is just the difference between the first and third quartile.

Formulas for first and third quartile. Different sources compute the quartiles slightly differently. Excel computes the first quartile as observations number (n +3)/4 in the sorted list of n observations, while your text uses observation number (n+1)/4. The median is observation (n+1)/2, as in your book. The third quartile is computed in Excel as observation number (3n+1)/4, while your book uses observation number (3n+3)/4. We’ll be happy with either of these calculation rules. (Here, for example, the first quartile turns out to be observation (30 + 3)/4 = 33/4 = 8.25. What is observation number 8.25? It's ¼ of the way from observation #8 to observation #9 on the sorted list. #8 is 8.92 and #9 is 9.11. You can find the number that is ¼ of the way from A to B by computing (0.75  A) + (0.25  B). So, for our data, this is 0.75(8.92) + 0.25(9.11) = 8.9675, as reported.) The Excel command for the first quartile is, as you can see, = QUARTILE(range, 1). The third quartile replaces the "1" with a "3": =QUARTILE(range, 3).

In the modified box plot, the lower whisker extends down to 1.5 * IQR below the 1st quartile, and the upper whisker extends up to 1.5 * IQR above the 3rd quartile. Any data points beyond the whisker's ends are marked with dots, and identified as outliers. With the unmodified box plot, the whiskers extend all the way to the most extreme data values—the maximum and minimum observations. My spreadsheet here computes the "whisker's end" values for both cases, and I’ve provided two different spreadsheet templates on the website to create the two kinds of box plots. For this course, you’ll be responsible only for creating the unmodified box plots.

Let's take a look at the modified box plot, since we can use it to talk about both types of box plots.

We can see that there are no outliers on the lower end; no observed rainfall is more than 1.5 IQRs from the 1st quartile, so the lower tail stops at the lowest observation. The central box shows the 1st quartile, median, and 3rd quartile. Remember what this means. 25% of all observations represent rainfall below the "left wall" of this box (about 8.97"). 25% of all observations lie between the left wall and the median line. 25% more lie between this median line and the "right wall" of the box (about 17.34"). Finally, the highest 25% of the rainfalls fall to the right of the "right wall" of the box. To get a rough idea of what the histogram for this data would look like, you can imagine dumping the same amount of water in each of these four “compartments”. The “water” between the 1st quartile and the median would be higher than any other “compartment”, indicating that the numbers are crunched together there more than anywhere else. Conversely, the numbers from the third quartile to the maximum value are “spread out”—they’re not packed in to their interval as densely.

What else? We see that there are two observations that fall above the end of the upper whisker. We identify these as outliers—both correspond to more than 30 inches of rain.

And the unmodified box plot? How does it differ? Only in that the upper whisker extends to include both of the pink outliers. This is the only kind of boxplot that your book uses.

While I expect you to be able to create an unmodified box plot without needing my spreadsheet, it's unlikely you'll be able to do it (without my help) in Excel. Excel doesn't support box plots, so I did a fair amount of work to make it draw them, anyway. When you use my spreadsheet on other data, be sure to change the axis name so that it fits your problem.

Q. The mean, from grouped data

[For the mean of ungrouped data, see N.]

Problem: In a random sample of 50 college students, 5 said that they sit “in the very front” of the class and 21 said that they sit “toward the front”. The GPA of the students in the very front was 10.94 (on a 12 point scale) while for the students who sat “toward the front”, the average GPA was 9.38. What is the average GPA of all 26 of these students?

The answer has to come out exactly as if 5 students had GPAs of 10.94 (the average for the “front” group) and 21 students have GPAs of 9.38 (the “near front” group). Computing this is easy: Average 26 numbers, 5 of which are 10.94 and 21 of which are 9.38. If you think for a moment, you’ll realize that the math is going to look like [(5  10.94) + (21  9.38)]/(5 + 21). We can generalize this work to find the mean from any set of grouped data.

Finding the Mean from Grouped Data in Excel

Excel (as well as a number of software packages) expects that, when you want to do statistics, you'll type in every single data point. Sometimes, though, like in this problem, you don't really want to do that. You want to enter the different values observed, and how many times each value was observed. Happily, Excel can still easily compute the average of data presented in this way. Here's how you do it.

1. Enter your data in two columns, "value" and "frequency".

2. Compute the average of the data with the command

=SUMPRODUCT(valuerange, frequencyrange)/SUM(frequencyrange)

Here, valuerange refers to the cells containing the observed values (the numbers in the "value" column). frequencyrange refers to the cells containing the number of times each value is observed (the numbers in the "frequency" column).

You can also find the mean of grouped data from a relative frequency distribution. The formula is even simpler:

=SUMPRODUCT(valuerange, relfreqrange)

where relfreqrangeis the range of cells containing the relative frequencies of the observed values.

We'll use this here.

Location / # of students / Average GPA
front / 5 / 10.94
toward front / 21 / 9.38
mean / 9.68 / =SUMPRODUCT(C2:C3,B2:B3)/SUM(B2:B3)

You could, of course, have typed the 26 numbers in separately, then used the =AVERAGE command.

The standard deviation and variance is computed from grouped data using the same idea used here: treat every observation in a class as though it fell at the midpoint of that class.

R. Comparison of measure of central tendency: mean, median, mode, midrange, midhinge

See N for mean. See O for comparison of median and mean.

Mode The mode is simply the most frequently occurring value in a data set—what single observed value occurs most often? Some data sets have no mode, since each observation occurs only once. Other data sets are “bimodal”—they have two different modes. The term “bimodal” is often used to describe frequency distributions of interval or ratio data that have two prominent, nonadjacent “peaks” of comparable size in their histograms. The mode is of limited usefulness with ungrouped data. If more people got 67 points on the exam than any other single number, how much does that really tell you? On the other hand, suppose we view soda consumption, and consider classes of 0 to 5 ounces, 5 to 10 ounces, and so on. It might be quite useful for the soda company to know thatwhen test subjects were given soda to drink in one sitting, their modal consumption was between 10 and 15 ounces. They may decide to market 12 ounce cans rather than 8 ounce ones.

In Excel, you can find the mode of a data set with the command = MODE(range), where range is the range of cells containing the data.

Midrange (Not used in your book.) The midrange is the value halfway between the smallest and largest observation in a data set. It can be computed in Excel as =(MAX(range) + MIN(range))/2. It’s of limited usefulness, especially for skewed data. How could it be useful?

Suppose a policewoman is responsible for answering calls for help along a section of route 81 that includes exits at mile markers 30.4, 31.4, 39.4, 39.8, 40.2, and 50.4. She wants to minimize the impact of a “worst case scenario”—that she receives a call at one exit when she is far away from it. If this is her concern, where should she station herself along the highway? Well, since her segment of road runs from marker 30.4 to marker 50.4, she should station herself at the midrange, 40.4. In this way, she is at most 10.0 miles from the call, regardless of where it originates.

Midhinge (Not used in your book.) The distinction between the midrange and midhinge is the same as the distinction between the range and IQR. (See L for information on IQR.) That is, the midhinge is the point halfway between the first and third quartile values. In Excel, it can be computed as

= (QUARTILE(range,1) + QUARTILE(range,3))/2. It, too, is of limited usefulness. The feel of it is that it gives you the geometric middle of the “middle half” of the data.

To concoct an example, suppose that our policewoman of the midrange example, above, decides that it is foolish to treat the exit at 50.4 miles on an equal footing with the others. Her midrange position of 40.1 is far from the “center” of things, due to the “outlier” at 50.4. She may instead decide to worry about the “middle 50% of the exits”. With 6 exits, Excel’s formula for 1st quartile gives observation (6+3)/4 = “observation 2.25”, which is 33.4. The Excel formula for the 3rd quartile gives observation (18+1)/4 = “observation 4.75”, or 40.1. The midhinge, the point halfway between the values, is (33.4+40.1)/2 = 36.75. By stationing herself at mile marker 36.75, she’ll be as close as she can be to the “middle 50%” of the exits. (Note: your book uses slightly different formulae.)

You can relate median, midhinge, and midrange rather nicely with the unmodified box plot. Take a look.