Lesson 1-2: Describing Distributions with Numbers

How to describe a distribution

Make sure you consider all of the following points:

  1. What is the approximate center of the distribution?
  2. Do any unusual features stick out? Does it have outliers?
  3. Is the distribution symmetric or is it skewed? Does the distribution have a single, central mode (unimodal) or does it have several modes (bimodal or multimodal)? Is it uniform?
  4. How spread out is the distribution? Remember that the range is a single value: maximum – minimum.

It might be helpful to use CUSS to remember all the important points: Center, Unusual points, Shape, Spread. If you are asked to compare two graphs, discuss both similarities and differences. Be sure to include comparative words such as greater, less, more, half, etc.

Measures of Center

The mean is the most common measure of central tendency. If n observations are denoted by x1, x2, x3, …, xn, their mean is:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

A more common notation is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Another measure of center is the median, the middle number in a dataset. To find the median of a distribution:

  1. Arrange all observations in order of size, from the smallest to the largest.
  2. If the number of observations, n, is odd, the median M is the center observation in the ordered list.
  3. If the number of observations, n, is even, the median M is the mean of the two center observations in the ordered list.
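The two measures of center can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names and the sample data are made up for the example, and the median function follows the three steps listed above.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(values) / len(values)


def median(values):
    """Median, following the three steps in the list above."""
    ordered = sorted(values)      # step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # step 2: n is odd, take the center observation
        return ordered[mid]
    # step 3: n is even, take the mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


scores = [7, 3, 9, 5, 100]        # made-up data for illustration
print(mean(scores))               # 24.8
print(median(scores))             # 7
```

Note that the single large value (100) pulls the mean well above most of the data, while the median stays at 7.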

Unusual Points and Spread

The distance between the first and third quartiles is called the interquartile range (IQR): IQR = Q3 − Q1. It measures the spread of the middle half of the data. The interquartile range is a single value, not a place on a graph.

For most purposes, an outlier is any value that falls more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.

The five-number summary consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation.

Minimum    Q1    M    Q3    Maximum

The term percentile refers to the percent of data points which fall at or below an individual data point.
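As a small sketch of this definition, the percentile of a value can be computed by counting the observations at or below it. The function name and the scores are hypothetical, chosen only for illustration.

```python
def percentile_of(value, values):
    """Percent of observations that are less than or equal to `value`."""
    at_or_below = sum(1 for x in values if x <= value)
    return 100 * at_or_below / len(values)


scores = [55, 62, 70, 70, 78, 81, 85, 90, 94, 99]   # made-up test scores
print(percentile_of(85, scores))                    # 70.0
```

Seven of the ten scores are at or below 85, so a score of 85 sits at the 70th percentile of this dataset.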

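The variance (s²) and the standard deviation (s) measure how far the “typical” observations are from the mean. The variance is the average squared deviation from the mean, where the sum of squared deviations is divided by n − 1 rather than n. In order to retain the original units of measure, statisticians use the square root of the variance, the standard deviation.

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Or, in compact notation:

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The standard deviation is the square root of the variance:

$$s = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$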
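The calculation can be sketched in Python as follows. This sketch assumes the sample versions of the formulas, which divide by n − 1 (the convention that the symbols s² and s usually denote). The function names and data are made up; Python's standard `statistics.variance` and `statistics.stdev` use the same n − 1 convention and are printed alongside as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance, in the original units."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up data; the mean is 5
print(sample_variance(data))         # 32 / 7, about 4.57
print(sample_std(data))              # about 2.14

# Cross-check against the standard library (also divides by n - 1).
print(statistics.variance(data), statistics.stdev(data))
```

For this data the squared deviations are 9, 1, 1, 1, 0, 0, 4, and 16, which sum to 32; dividing by n − 1 = 7 gives the variance.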

Shape

A distribution is symmetric if the right and left sides of the histogram are roughly mirror images of each other.

A distribution is skewed if one side stretches out much farther than the other. The distribution is skewed to the left if the left side extends out farther and it is skewed to the right if the right side extends out much farther.

Mean vs. Median

The mean is nonresistant because it is sensitive to the influence of extreme observations, which may or may not be outliers. Even in a skewed distribution with no outliers, the long tail still pulls the mean in its direction.

The median is a resistant measure of center because it is not sensitive to a few extreme observations.
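A quick way to see the difference is to add one extreme observation to a small dataset and recompute both measures. The numbers below are invented purely for illustration; the sketch uses `mean` and `median` from Python's standard `statistics` module.

```python
from statistics import mean, median

salaries = [30, 32, 35, 38, 40]          # made-up salaries, in thousands
with_extreme = salaries + [400]          # add a single extreme observation

print(mean(salaries), median(salaries))           # 35 35
print(mean(with_extreme), median(with_extreme))   # about 95.8, and 36.5
```

One extreme value moves the mean from 35 to roughly 95.8, while the median shifts only from 35 to 36.5.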

When to Use Standard Deviation

Variance and standard deviation are used to describe the spread when the mean is used to describe the center. Since both the mean and the standard deviation are influenced by outliers and skewness, they should only be used for reasonably symmetric distributions.

What Can Go Wrong?

  • Don’t forget to sort the values before you find the median, quartiles, or percentiles.
  • Don’t compute numerical summaries for categorical data.
  • Beware of outliers.
  • Make a picture as your first step. Don’t use the mean and standard deviation until you know it is reasonable to do so.
  • Don’t forget to check the reasonableness of your answers.

Homework: (Distributions, Graphs, and Numerical Summaries, oh my!)

Page 42: 37, 39, 40, 44, 45, 46, 49, 51, 52, 64, 65, 69-74

Page 70: 79-81, 83-92, 94, 96, 97, 98, 103, 104, 107-110