Unit 9: Descriptive Statistics

9.1 Representing Univariate Data

9.2 Analyzing Univariate Data

9.3 Representing Bivariate Data

9.4 Analyzing Bivariate Data

9.5 Functions of Best Fit

9.6 Two-Way Tables

9.1 Representing Univariate Data

Data, data, data! What can we do with data? Most mathematics in real life comes down to gathering and analyzing data. NASA, the Hadron Collider, and most other scientific endeavors decide what is most likely true based on the data they gather. In this unit, we’ll be analyzing both univariate data, which means data gathered about one variable, like how many cups you can stack on top of your head, and bivariate data, which means data gathered about two variables, like how many ping pong balls you can accurately throw into a bucket from different distances. Our starting point is gathering and representing univariate data.

Frequency Graphs

Let’s say a class of students tried to stack as many blocks on their head as they could (without them falling off) in a minute. The table below shows the result of that experiment.

Anne / Bob / Carl / Dan / Ed / Fred / George / Hidalgo / Ingrid / Jake
3 / 6 / 7 / 3 / 4 / 3 / 5 / 4 / 3 / 5
Kate / Leo / Mac / Nancy / Oscar / Pam / Quinn / Rose / Sam / Tom
5 / 1 / 2 / 4 / 3 / 4 / 3 / 6 / 2 / 3

When we look for frequency, we are really looking at how often a number occurred in the data set. For example, the most frequent number in the above data set is 3, which occurred seven times. The frequency of the value 4 is only four because it occurred four times. Taking those frequency values, we can make several different types of frequency graphs including histograms, dot plots, line plots, and more.
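If you would rather let a computer do the tallying, here is a minimal Python sketch that counts how often each value occurs in the class data. The variable names are our own, and the list simply copies the table above in order.

```python
from collections import Counter

# Block counts from the class table, in the order the students are listed.
blocks = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
          5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

# Counter maps each value to the number of times it occurs (its frequency).
frequency = Counter(blocks)

# Print a simple frequency table, one row per value.
for value in sorted(frequency):
    print(value, "occurred", frequency[value], "time(s)")

# The most frequent value and how many times it occurred.
most_common_value, most_common_count = frequency.most_common(1)[0]
```

These counts are exactly the bar heights you would use for a histogram of this data.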

Histogram

A histogram is a frequency graph that uses bars over equal intervals on the $x$-axis. As always, we want to make sure we label our axes and the graph. Here are a couple of different histograms we could make with the above data. Which histogram do you think best represents this data and why?

Dot Plots/Line Plots

Dot plots use a number line and place dots above each value on the number line as it occurs in the data set. A line plot does the same thing but uses X’s instead of dots. They essentially are the same exact graph and only differ in aesthetics. If you like circles, use the dot plot. If you like X’s, use the line plot. Here they are for our previous data set.

Dot Plot / Line Plot
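For a rough idea of how a line plot is built, the sketch below stacks X’s above a number line using plain text. It is only an illustration with names of our own choosing, not a replacement for a properly drawn and labeled graph.

```python
from collections import Counter

# Block counts from the class table.
blocks = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
          5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

counts = Counter(blocks)
values = range(min(blocks), max(blocks) + 1)   # the number line: 1 through 7
tallest = max(counts.values())                 # height of the tallest stack

rows = []
# Build the picture from the top row down to the row just above the number line.
for level in range(tallest, 0, -1):
    row = " ".join("X" if counts[v] >= level else " " for v in values)
    rows.append(row.rstrip())

# The last row is the number line itself.
rows.append(" ".join(str(v) for v in values))

line_plot = "\n".join(rows)
print(line_plot)
```

Swapping the `"X"` for a `"o"` would turn the same picture into a dot plot.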

Box Plots

Box plots, sometimes called box and whisker plots, are another type of frequency graph but they don’t just show the frequency. They also give us a picture of the distribution of the data, or how spread out the data is. To make a box plot we need to do the following: 1) find the median of the data called the 2nd Quartile, 2) find the median of the lower half of the data called the 1st Quartile, and 3) find the median of the upper half of the data called the 3rd Quartile. To do this, we need to line up our data in order from least to greatest as follows:

1 2 2 3 3 3 3 3 3 3 4 4 4 4 5 5 5 6 6 7

Remember that if there are two numbers in the middle for the median, you add those two numbers together and divide by 2. Also, we will exclude the median when calculating the 1st and 3rd Quartile. The same rule of adding and dividing by 2 is true for the 1st and 3rd Quartile.
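The quartile steps can be sketched in Python as follows. This assumes the rule described here (leave the median out of both halves when the number of data points is odd); calculators and software sometimes use slightly different quartile rules, so their answers may not always match.

```python
from statistics import median

# Block counts from the class table.
blocks = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
          5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

data = sorted(blocks)          # line the data up from least to greatest
n = len(data)
half = n // 2

# Split into a lower and an upper half, skipping the middle value if n is odd.
lower = data[:half]
upper = data[half:] if n % 2 == 0 else data[half + 1:]

q1 = median(lower)   # 1st Quartile: median of the lower half
q2 = median(data)    # 2nd Quartile: median of all the data
q3 = median(upper)   # 3rd Quartile: median of the upper half

print(min(data), q1, q2, q3, max(data))
```

For our data the two middle numbers are 3 and 4, so the 2nd Quartile is 3.5, and the 1st and 3rd Quartiles work out to 3 and 5.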

Now that we have our quartiles, we’re ready to actually begin the Box Plot. First, we create two boxes over the number line, the first extending from the 1st Quartile to the 2nd Quartile and the second extending from the 2nd Quartile to the 3rd Quartile. You can see this below.

Finally we add the whiskers by drawing a line out of either end of the boxes. On the left, we draw a line from the 1st Quartile to the lowest value in the data set. On the right, we draw a line from the 3rd Quartile to the highest value in the data set. You can see the final product below.

The advantage of the box and whisker plot is that we see frequency in the big picture. Each box and each whisker shows where 25% of the data in our data set is.

9.2 Analyzing Univariate Data

Once we have our data graphed or gathered, we need some tools to be able to describe what is happening in the data set. Some of these tools we are familiar with, such as mean, median, and mode, but we’ll also use some more sophisticated tools this year. Two ways that we can look at a data set are the center and the spread of the data.

Center

The center of a data set is a way to describe the central tendency of the data set. In other words, if you had to boil the data set down to a single value, what would that value be? You would say the data is about what number? There are two main measures of central tendency we use: mean and median.

The mean of a data set is what we would typically call the average. Sum the values of the data set and divide by the number of data points. Just to look fancy, here’s some new notation for you:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Crazy, huh? We’re only introducing this because we’ll use the notation a little later. Let’s break it down. The symbol $\bar{x}$ is called “x bar.” We usually use that to represent the mean of a data set. We could also use the symbol $\mu$, which is the letter mu of the Greek alphabet, to represent the mean of a data set. The big sideways M symbol, $\Sigma$, is the capital letter sigma in the Greek alphabet. We use this symbol when we want to add a bunch of numbers, hence it is called summation notation. So sigma means we will add what comes after it. The $i = 1$ below and the $n$ on top of the sigma mean that we are adding $n$ different numbers starting with the 1st number. The $x_i$ is just subscript notation meaning we’ll add $x_1$, or the first number, plus $x_2$, or the second number, plus so on and so on all the way to $x_n$, which is the $n$th and last number. Then we divide that sum by $n$, which is how many numbers we added. See? It just means add them up and divide! Aren’t you glad we keep it simple?
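To see that the notation really is just “add them up and divide,” here is a small Python sketch using the block-stacking data from 9.1. The variable names are ours, picked to mirror the symbols.

```python
# Block counts from the class table in 9.1.
x = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
     5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

n = len(x)           # how many numbers we are adding
total = sum(x)       # x_1 + x_2 + ... + x_n
x_bar = total / n    # divide the sum by n to get the mean

print(total, n, x_bar)
```

The sum is 76 and there are 20 students, so the mean is 76 divided by 20, or 3.8 blocks.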

The median is the middle value of the data set. Line the numbers up in order and find the exact middle number. If there are an even number of data points, you won’t have an exact middle number. In that case you average the two middle numbers to find the median. The notation for this is... just kidding! We’ll leave it at that!
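Here is one way the median rule might look in Python. The function name `find_median` is our own; it handles both the odd case (one exact middle number) and the even case (average the two middle numbers).

```python
def find_median(values):
    """Return the middle value, averaging the two middle values if needed."""
    ordered = sorted(values)      # line the numbers up in order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of data points: there is one exact middle number.
        return ordered[middle]
    # Even number of data points: average the two middle numbers.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Block counts from the class table in 9.1 (an even number of data points).
blocks = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
          5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

print(find_median(blocks))
print(find_median([1, 2, 2]))
```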

How does the center describe the data?

That’s the real question. Let’s think of two data sets. The first data set has a mean of and a median of . The second data set has a mean of and a median of . What is the difference in the two data sets? We know they have the same mean, but the medians are different. With a lower median of , the first data set must have much higher values at the top end of the data set to bring the average up. With a median of close to the average, the second data set is probably fairly evenly distributed since the middle is near the average.

Let’s look at another example from the US Census Bureau. From the years 2008-2012, the mean yearly income in the United States was only while the median was . This probably means in the lower half of our population we have a lot of people making very low (likely ) amounts per year pulling the average down. The median is higher because once you start making money, it jumps up significantly from zero.

Spread

The spread of a data set is a way to describe the distribution of the data set. In other words, how spread out are the data values? Are they all clumped up in the middle, evenly distributed, or widely spread out? There are three main measures of spread we will use: range, interquartile range, and standard deviation.

The range of a data set is the simplest description we can give which is the maximum data value minus the minimum data value. Let’s say we have a math quiz where the highest score was and the lowest score was . The range would be . Notice that if the high score on the quiz was and the low score was , it would also have a range of . This demonstrates why we need to look at both the spread and the center of a data set.

The interquartile range of a data set is the third quartile minus the first quartile. This tells us the range where half of the data falls. For example, think of a data set with a first quartile of , second quartile of , and third quartile of . The interquartile range would be because half of the data is within a range of , namely from to . Think now of a second data set that has an interquartile range of only , and let’s say both data sets have the same average. What is the difference between the two data sets? The first data set would be spread out farther but keeping roughly the same proportion of distribution as the second data set.
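As a quick sketch, here are the range and the interquartile range for the block-stacking data from 9.1. The helper name `quartiles` is ours, and it uses the same exclude-the-median rule from the box plot section.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) using the median-of-each-half rule from 9.1."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Skip the middle value when there is an odd number of data points.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(upper)


blocks = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
          5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

data_range = max(blocks) - min(blocks)   # maximum minus minimum
q1, q3 = quartiles(blocks)
iqr = q3 - q1                            # third quartile minus first quartile

print(data_range, iqr)
```

The range here is 7 minus 1, or 6, while the interquartile range is 5 minus 3, or 2, so the middle half of the class is packed into a much smaller interval than the class as a whole.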

The standard deviation of a data set tells us on average how far away from the mean the data points are, but there are two ways to calculate the standard deviation. The first is called the Population Standard Deviation which we use when we have data from the entire population of whatever it is that we are studying. To find the population standard deviation, we use the following formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

(Just imagining the look on your face when you read this is priceless.)

Take a deep breath. That’s right. In through the nose, out through the mouth. Let’s explain each piece to make sure we know what’s happening here. The symbol $\sigma$ is the lowercase letter sigma. We use that symbol to represent the standard deviation. The summation notation we are already familiar with, as well as $\bar{x}$. So with the expression $(x_i - \bar{x})$ what we’re doing is taking the difference between each data value and the mean, or measuring how far away from the average that data point is. However, why do we square it? The answer is because if we don’t, we would end up with some positive and some negative numbers that would end up canceling each other out as we added them all up. To get rid of the negatives, we square everything and then square root at the end to get back down to the number range we were at originally.
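To make the pieces concrete, here is a hedged Python sketch that walks through the population standard deviation for the block-stacking data from 9.1, treating that class as the whole population. The variable names are ours. It also shows the cancellation problem: without squaring, the differences add up to (essentially) zero.

```python
import math

# Block counts from the class table in 9.1, treated as the entire population.
x = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
     5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

n = len(x)
x_bar = sum(x) / n                               # the mean

# Without squaring, positives and negatives cancel each other out.
sum_of_differences = sum(value - x_bar for value in x)

# Square each difference, average the squares, then take the square root.
squared_differences = [(value - x_bar) ** 2 for value in x]
sigma = math.sqrt(sum(squared_differences) / n)

print(sum_of_differences, sigma)
```

The standard deviation comes out to roughly 1.47 blocks, so a typical student is about a block and a half away from the class mean of 3.8.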

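The second form of standard deviation is the Sample Standard Deviation which we use when our data set only contains a sample of the whole population. The calculation for this allows a little bit more buffer since we don’t have information from the whole population. It is calculated nearly exactly the same way as follows:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The only change is that we divide by $n - 1$ instead of $n$, which makes the result slightly larger.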
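Here is a short Python sketch comparing the two versions on the block-stacking data from 9.1, this time pretending the class is only a sample of a larger group. The hand calculation uses names of our own; `statistics.stdev` and `statistics.pstdev` are Python’s built-in sample and population standard deviation functions, included as a cross-check.

```python
import math
import statistics

# Block counts from the class table in 9.1, treated as a sample.
x = [3, 6, 7, 3, 4, 3, 5, 4, 3, 5,
     5, 1, 2, 4, 3, 4, 3, 6, 2, 3]

n = len(x)
x_bar = sum(x) / n
sum_of_squares = sum((value - x_bar) ** 2 for value in x)

sigma = math.sqrt(sum_of_squares / n)        # population: divide by n
s = math.sqrt(sum_of_squares / (n - 1))      # sample: divide by n - 1

# Cross-check against the standard library.
library_sigma = statistics.pstdev(x)
library_s = statistics.stdev(x)

print(sigma, s)
```

The sample version is always a little larger than the population version for the same data, which is the “buffer” mentioned above.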

Now, calculating either of these by hand would be tedious and time-consuming. Most calculators have a function that will give you either or both standard deviations if you input the data set. There are also many online calculators to find the standard deviation. Basically, I’m asking you to not sweat over doing the computation. The important question is whether you understand what standard deviation tells us.

Let’s say we have two data sets, the first with a mean of and a standard deviation of and the second with a mean of and a standard deviation of . What’s the difference between the two data sets? The second data set is much more spread out because the average distance from the mean is much larger than the first set.

It turns out that in a data set with a normal curve distribution, about 68% of the data is within one standard deviation of the mean, and about 95% of the data is within two standard deviations of the mean. For example, the average height of adult males is inches with a standard deviation inches. That means that about 68% of adult males have a height between and inches while about 95% of adult males have a height between and .
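If you are curious where those percentages come from, Python’s `statistics.NormalDist` can compute the share of a normal curve that falls within a given distance of the mean. The mean of 0 and standard deviation of 1 below are placeholder values we picked for illustration; the percentages come out the same for any normal curve.

```python
from statistics import NormalDist

# Placeholder mean and standard deviation (assumed values for illustration).
mu, sigma = 0.0, 1.0
curve = NormalDist(mu, sigma)

# Share of the data between (mean - k*sigma) and (mean + k*sigma).
within_one = curve.cdf(mu + sigma) - curve.cdf(mu - sigma)
within_two = curve.cdf(mu + 2 * sigma) - curve.cdf(mu - 2 * sigma)

print(round(within_one * 100, 1), round(within_two * 100, 1))
```

The exact values are a bit over 68% and a bit over 95%, which is why these are usually quoted as rounded rules of thumb.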

9.3 Representing Bivariate Data

A scatter plot is a plot on the coordinate plane used to compare two sets of data (bivariate data) and look for a correlation between those data sets. An association is a relationship or dependence between data. For example, the price of oil and the price of gasoline have a strong association. The daily price of oil and the number of penguins swimming in the ocean on that day most likely have no association at all. However, to find this association we need to make a scatter plot.

Start with the Data

Before we can make a scatter plot, we need two sets of data that we want to compare. For example, we might compare the number of letters in a student’s first name and their math grade. Do people with shorter names tend to score higher in math? Do people with the lowest grades have longer names? These are questions of relationship, or correlation, that we can explore with a scatter plot once we have some data. That data set might look like this:

Name / Nichole / Josiah / Kame / Gungar / Roberto / Frank / John / Herman / Sami / Daimon
Letters / 7 / 6 / 4 / 6 / 7 / 5 / 4 / 6 / 4 / 6
Grade / 58 / 83 / 61 / 70 / 31 / 76 / 81 / 70 / 72 / 57
Name / Yolina / Johanne / Karolinea / Kurt / Addison / Ian / Dennis / Ophelia / Kristina / Bradford
Letters / 6 / 7 / 9 / 4 / 7 / 3 / 6 / 7 / 8 / 8
Grade / 77 / 90 / 87 / 83 / 76 / 78 / 87 / 87 / 80 / 41

Prepare the Coordinate Plane

Now that we have our data, we need to decide how to put this data on the coordinate plane. We can let the $x$-axis be the number of letters in a student’s name and the $y$-axis be the student’s overall math grade. Once we have decided this we should label our axes.

Next we’ll need to decide on a scale and interval. The scale is the low to high number on the axis and the interval is what we count by. Notice first of all that we’re only looking at Quadrant I because we won’t have negative amounts of letters or negative grades. Since the grades can be from zero to one hundred, we might choose to count by tens on the $y$-axis giving us a scale of 0-100 and an interval of 10. Since the letters range from three to nine, we might count by ones on the $x$-axis. This gives us a scale of 0-10 with an interval of 1.

When to use a broken axis

A broken axis is useful whenever more than half of the area of the scatter plot will be blank. Nobody likes to see a blank graph with all the data in one tiny area. So instead, we zoom in by using a broken axis. If the range of your data is less than the lowest data point, a broken axis may be useful. For example, in our math grade situation above, if everyone scored above 60%, then we might break the $y$-axis and begin counting at 60. We could then count by 4’s to make it up to 100%.
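The rule of thumb above (range less than the lowest data point) is easy to check in code. In this sketch the function name `broken_axis_may_help` is our own, and the “everyone scored above 60” class is a made-up subset of the grade table used only for illustration.

```python
def broken_axis_may_help(values):
    """Rule of thumb: the range of the data is less than the lowest value."""
    lowest = min(values)
    highest = max(values)
    return (highest - lowest) < lowest


# Math grades from the table in this section.
grades = [58, 83, 61, 70, 31, 76, 81, 70, 72, 57,
          77, 90, 87, 83, 76, 78, 87, 87, 80, 41]

# Hypothetical class where everyone scored above 60 (illustration only).
grades_above_60 = [g for g in grades if g > 60]

print(broken_axis_may_help(grades))           # full class
print(broken_axis_may_help(grades_above_60))  # hypothetical class

# Tick marks for a broken axis that starts at 60 and counts by 4's up to 100.
ticks = list(range(60, 101, 4))
print(ticks)
```

For the full class the grades run from 31 to 90, so the range (59) is larger than the lowest grade and a regular axis is fine. For the hypothetical class the grades run from 61 to 90, and the rule suggests a broken axis.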

Plot the Points

Finally we would then plot each person on the graph. So Nichole will be the point $(7, 58)$, Josiah the point $(6, 83)$, and so forth. Using Excel to make our scatter plot, the final scatter plot might look like the following. Notice that each dot on the graph represents a person. While the labeling is not necessary, it may be useful in some circumstances.
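If you would like to build the list of points without Excel, here is a minimal Python sketch that pairs each student’s letter count with their grade. The variable names are ours; the data is copied from the table above.

```python
names = ["Nichole", "Josiah", "Kame", "Gungar", "Roberto",
         "Frank", "John", "Herman", "Sami", "Daimon",
         "Yolina", "Johanne", "Karolinea", "Kurt", "Addison",
         "Ian", "Dennis", "Ophelia", "Kristina", "Bradford"]
letters = [7, 6, 4, 6, 7, 5, 4, 6, 4, 6,
           6, 7, 9, 4, 7, 3, 6, 7, 8, 8]
grades = [58, 83, 61, 70, 31, 76, 81, 70, 72, 57,
          77, 90, 87, 83, 76, 78, 87, 87, 80, 41]

# Each point is (x, y) = (number of letters, math grade).
points = list(zip(letters, grades))

# Print each student with the point that represents them.
for name, point in zip(names, points):
    print(name, point)
```

Each tuple is one dot on the scatter plot, and the list could be handed to whatever graphing tool you prefer.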

Many times on a scatter plot you may have the same data point multiple times. One way to represent this fact is to put another circle around the data point. Let’s add a few new students to our data set: Johnathan (9 letters and 87 math score), Jacob (5 letters and 76 math score), and Helga (5 letters and 76 math score). The new graph could look like this:
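A quick way to find which points need an extra circle is to count how many students share each point. This sketch uses our own variable names and adds the three new students to the points from the table.

```python
from collections import Counter

# (letters, grade) for the original twenty students, in table order.
points = [(7, 58), (6, 83), (4, 61), (6, 70), (7, 31),
          (5, 76), (4, 81), (6, 70), (4, 72), (6, 57),
          (6, 77), (7, 90), (9, 87), (4, 83), (7, 76),
          (3, 78), (6, 87), (7, 87), (8, 80), (8, 41)]

# The new students: Johnathan, Jacob, and Helga.
points += [(9, 87), (5, 76), (5, 76)]

# Count how many students land on each point.
point_counts = Counter(points)

# Points shared by more than one student are the ones that get extra circles.
repeated = {point: count for point, count in point_counts.items() if count > 1}
print(repeated)
```

Notice that the point for 6 letters and a grade of 70 was already shared by two students (Gungar and Herman) before the new students were added.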

While this practice is not necessarily standard, it can be useful as a visual representation of what is happening with the data. We can more easily see the multiple data points this way. In Excel, you wouldn’t get the red circles. Those would have to be put in by hand.

9.4 Analyzing Bivariate Data

Now that we know how to draw scatter plots, we need to know how to interpret them. A scatter plot graph can give us lots of important information about how data sets are related if we understand what each part of the graph means.

Reading Data Points

Each individual point on a scatter plot represents a single idea. For example, in the picture below each point represents a country. The axes tell us information about that country. The $x$-axis tells us about how many minutes per day that country spends eating and drinking. The $y$-axis tells us about how many minutes per day that country spends sleeping. Can you find the United States on this scatter plot? About how many minutes do we sleep per day? About how many minutes do we spend eating and drinking per day? Are these numbers reasonable to you?