Numerical Summaries

Two of the most important features of a data set that the histogram shows are:

●Central tendency: the data tend to stack up around some 'typical' or 'middle' value.

●Dispersion: the data points are not all the same; some are larger than others.

SECTION 4.1 (Measures of Central Location)

The (Arithmetic) Mean

In statistics, measures of location are often called mean values or simply means.

There are several (in fact, infinitely many!) means, each using a slightly different method of describing the 'location' or 'center' of a set of data.

To distinguish between them, we use names such as the arithmetic mean, the

weighted mean, the geometric mean, and so forth.

●For instance, the familiar average of a set of numbers is also called the arithmetic mean.

The arithmetic mean of the measurements x1, x2, x3, ..., xn is found by dividing their sum by n. We use the notation x̄ (which is read "x bar") to denote the result:

x̄ = (x1 + x2 + x3 + ... + xn)/n

In statistical applications, data usually arises by drawing samples from an on-going process or population, so x̄ is usually called the sample mean.

The symbol Σxi is used to denote the sum of the data, x1 + x2 + ... + xn.

The upper case Greek letter Σ, which corresponds to the English upper case S, stands for the word "sum", and the expression Σxi is read "the sum of the xi values as the subscript i runs from i = 1 up through i = n". The notation is a great time-saver in statistics because we are forever summing collections of numbers, and it is convenient to have a simple notation like Σ so we can avoid writing long phrases (like the one in quotes in the previous sentence). With this notation, the sample mean can be compactly written as x̄ = (Σxi)/n:

THE SAMPLE MEAN

x̄ = (x1 + x2 + ... + xn)/n = (Σxi)/n

The sample mean x̄ always coincides with the center of gravity of the data. That is, if you imagine the data as equal-sized weights spread out on a balancing beam (see following figure), then x̄ marks the position at which a fulcrum would be placed to exactly balance these weights. This center-of-gravity property is handy for quickly estimating x̄ (with no calculation) directly from a histogram of the data.

Example: The table below shows a company's monthly ending cash balance for several months. Using the ‘center-of-gravity property’, it appears that the mean lies somewhere between $120,000 and $160,000, although it is difficult to pinpoint it exactly by eye.

To calculate the exact value of the sample mean, we would first find the sum 188,588 + 132,691 + ... + 138,110 = 7,897,724 and then divide by 50 (the number of data points used in the sum) to find x̄ = $157,954.48.

Monthly Ending Cash Balance (in $'s)

188588 / 164149 / 136332 / 189911 / 84703
132691 / 250420 / 211840 / 197800 / 127134
125246 / 218811 / 218484 / 181532 / 116383
27040 / 168177 / 181973 / 205211 / 191439
123139 / 97844 / 132985 / 199288 / 213784
200466 / 158587 / 194562 / 62536 / 165292
118707 / 172195 / 141901 / 208126 / 151870
74883 / 180930 / 71733 / 135596 / 129598
233390 / 187027 / 147146 / 139364 / 136336
128652 / 173568 / 139400 / 222846 / 138110
Histogram of Monthly Ending Cash Balances
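The sample mean calculation above can be reproduced in a few lines of Python, a sketch in which the values are transcribed from the table; the built-in sum and len play the roles of Σxi and n:

```python
# Sample mean of the monthly ending cash balances
# (values transcribed from the table above)
balances = [
    188588, 164149, 136332, 189911,  84703,
    132691, 250420, 211840, 197800, 127134,
    125246, 218811, 218484, 181532, 116383,
     27040, 168177, 181973, 205211, 191439,
    123139,  97844, 132985, 199288, 213784,
    200466, 158587, 194562,  62536, 165292,
    118707, 172195, 141901, 208126, 151870,
     74883, 180930,  71733, 135596, 129598,
    233390, 187027, 147146, 139364, 136336,
    128652, 173568, 139400, 222846, 138110,
]

n = len(balances)            # n = 50
xbar = sum(balances) / n     # x-bar = (sum of the data) / n
print(f"n = {n}, mean = ${xbar:,.2f}")
```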

THE MEDIAN

●The sample mean is very sensitive to extremely large or small values in the data.

●Example: A sample of 10 people, one of whom was a highly-paid sports professional (see figure below).

The mean and median of a sample of 10 peoples' incomes

●One solution to this problem is to use the median.

●The median of a set of numbers is defined to be their 'middle value', which is found by first sorting the data from smallest to largest and then counting half-way into this sorted list. If the sample size n happens to be an odd number, then there will be exactly one middle value. However, if n is even, then there will be two middle values and, in this case, we define the median to be the average of these two values.

THE MEDIAN

After sorting n measurements according to increasing size, the median is

the middle number (if n is odd) or the average of the two middle numbers

(if n is even)
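The sort-and-count rule in the box can be sketched directly in Python; the standard library's statistics.median implements the same odd/even rule:

```python
import statistics

def median(data):
    """Middle value of the sorted data; average of the two
    middle values when the sample size n is even."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                     # odd n: one middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # even n: average the two

# Odd sample size: the middle of 5 sorted values
print(median([0.49, 0.50, 0.47, 0.53, 0.51]))   # 0.5

# Even sample size: average of the two middle values
print(median([3, 1, 4, 2]))                     # 2.5

# Agrees with the standard library implementation
assert median([3, 1, 4, 2]) == statistics.median([3, 1, 4, 2])
```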

SECTION 4.2 (Measures of Variability/Dispersion/Risk)

The Range

The most intuitive measure of dispersion in a set of data is the range, which is simply the distance between the largest and smallest measurements in the data.

To find the range of a set of measurements:

(1)Sort the data from smallest to largest. With very small sets of data, a visual scan of the data may be all that is necessary.

(2)Letting m and M denote the minimum and maximum data values, the range R is simply the difference between them, R = M - m. Because the minimum m can never exceed the maximum M, the range is never negative (as is true of all measures of dispersion).

THE RANGE

The range R of a set of data is the positive distance between the

largest and smallest items in the data. If M and m denote the maximum

and minimum values in the data, then

R = M-m.

Example: In a manufacturing operation, parts are made that are supposed to be 0.5 inches in length. A sample of 5 recently made parts has lengths .49, .50, .47, .53, and .51. For this data, the range is R = M - m = .53 - .47 = 0.06 (inches).
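The range calculation for the part lengths takes one line in Python:

```python
# Range of the five part lengths: R = M - m
lengths = [0.49, 0.50, 0.47, 0.53, 0.51]
M, m = max(lengths), min(lengths)
R = M - m
print(f"R = {M} - {m} = {R:.2f} inches")
```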

The range is most useful when working with small samples (where n is less than

10 or so).

With large sample sizes, the range encounters two problems that make it unsuitable as a measure of dispersion:

1)As you add additional readings to your data set, the range can only increase; it can never decrease.

Example: Consider what would happen if you wanted to describe the variability in the cash balance data (above) by regularly calculating the range after each new month's data becomes available. The following table shows what would happen for the first 10 months (reading down the first column of data). The range increases from $55,898 up to $206,351 (about a 269% increase). It would be difficult to point to any of these ranges as being 'typical' of the cash balance data.

Successive ranges for the first 10 cash balance values

 k    Data value    Range of items 1 through k
 1    188,588       *
 2    132,691        55,898
 3    125,246        63,342
 4     27,040       161,548
 5    123,139       161,548
 6    200,466       173,426
 7    118,707       173,426
 8     74,883       173,426
 9    233,390       206,351
10    128,652       206,351

(* the range is undefined for a single data point)

2)The range does not describe the dispersion among the remaining items

in the data (i.e., the data between the maximum and minimum)

Example: The following graph shows histograms of three distinctly different sets of data having the same range (because the maximum and minimum values are the same in each data set). However, the dispersion of the rest of the data varies quite a bit among the data sets.

A Drawback of the Range: Different sets of data having the same Range

THE VARIANCE AND STANDARD DEVIATION

●To overcome some of the shortcomings of the range, we need a measure of dispersion that is better able to distinguish between the variation exhibited by different sets of data. One approach is to measure the distance of each individual data point from some central point, such as the sample mean, and then combine all these distances into one overall measure of dispersion.

●With the sample mean x̄ as the location of the 'center' of the data, the distances x1 - x̄, x2 - x̄, ..., xn - x̄ of each data point from the center can be averaged.

●However, some of these distances will be negative (since some of the xi's must fall to the left of x̄).

●A simple way of making them all positive is to square them, i.e., to use (xi - x̄)² instead of xi - x̄.

●The last step is to combine all these new distances into one overall measure, called the sample variance and denoted by s². We 'average' these squared distances by dividing by n-1, not by n (to be explained later):

THE SAMPLE VARIANCE

The sample variance of a set of data is denoted by s² and

is calculated by:

s² = Σ(xi - x̄)² / (n-1)   or,

s² = [Σxi² - (Σxi)²/n] / (n-1)

To illustrate the calculation of s², consider the data of the numbers
{.49, .50, .47, .53, and .51}. The first step is to find the sample mean,

x̄ = (.49 + .50 + .47 + .53 + .51)/5 = 2.50/5 = .50.

Next, the distances xi - x̄ are calculated and their squares are summed:

 i    xi     xi - x̄             (xi - x̄)²
 1    .49    (.49-.50) = -.01    (-.01)² = .0001
 2    .50    (.50-.50) =  .00     (.00)² = .0000
 3    .47    (.47-.50) = -.03    (-.03)² = .0009
 4    .53    (.53-.50) =  .03     (.03)² = .0009
 5    .51    (.51-.50) =  .01     (.01)² = .0001
                                  Sum    = .0020

The final step is to divide .0020 by n-1 = 5-1 = 4, which gives a sample variance of

s² = .0020/4 = .0005 (square inches).

●There are two reasons for the division by n-1 in the formula for s².

(1)Using n-1 in the formula for s² makes it a better estimate of the variability in the process from which we have obtained the data.

(2)If you look carefully, there are effectively only n-1 independent terms in the sum Σ(xi - x̄)², because the deviations xi - x̄ always sum to zero, so n-1 really is the proper divisor.

Note: The concept that the number of terms in a sum of squared quantities can be reduced somewhat will recur throughout this text. Statisticians call this reduced number of terms the degrees of freedom associated with the sum of squares.
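The worked variance calculation above can be checked in Python; note that the standard library's statistics.variance uses the same n-1 divisor:

```python
import statistics

# Sample variance of the part lengths, following the table above
lengths = [0.49, 0.50, 0.47, 0.53, 0.51]
n = len(lengths)

xbar = sum(lengths) / n                          # x-bar = .50
sum_sq = sum((x - xbar) ** 2 for x in lengths)   # sum of squared deviations
s2 = sum_sq / (n - 1)                            # divide by n-1, not n

print(f"sum of squares = {sum_sq:.4f}, s^2 = {s2:.4f}")

# statistics.variance is the sample variance (n-1 divisor)
assert abs(s2 - statistics.variance(lengths)) < 1e-9
```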

THE SAMPLE STANDARD DEVIATION

●There is a slight problem with s²: it is measured in square units.

(1)Example: if the sample variance of 5 people's heights (expressed in inches) is 9, then you would have to report that s² was 9 square inches!

(2)The cash balance data (above) has a sample variance of 2,268,902,689 square dollars!

●The solution is to take the square root of s². The resulting quantity is denoted by

the letter s, which is an abbreviation for the sample standard deviation.

●s is just as good a measure of dispersion as s² is, and s is measured in the same units as the data.

(1)Example: The standard deviation of the people's heights above would now be reported as s = √9 = 3 inches.

(2)The standard deviation of the cash balance data is

s = √2,268,902,689 = $47,633.00.

THE SAMPLE STANDARD DEVIATION

The sample standard deviation s of a set of measurements is

the square root of the sample variance:

s = √s² = √[Σ(xi - x̄)² / (n-1)]   or,

s = √[(Σxi² - (Σxi)²/n) / (n-1)]
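Taking the square root restores the original units; the standard library's statistics.stdev does this directly, shown here on the part-length data:

```python
import math
import statistics

lengths = [0.49, 0.50, 0.47, 0.53, 0.51]

s2 = statistics.variance(lengths)   # sample variance (n-1 divisor)
s = math.sqrt(s2)                   # standard deviation, same units as data

print(f"s^2 = {s2:.4f} sq. in.,  s = {s:.4f} in.")

# statistics.stdev is exactly sqrt(statistics.variance)
assert abs(s - statistics.stdev(lengths)) < 1e-9
```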

●What does s tell you about the data?

(1)The more dispersed a set of data is around its mean, the larger s will be.

(2)s cannot discern the direction of the variability in the data (see figure)

Return (in %) on two types of investments.

Mean and standard deviation are the same in each case, but the risks are different.

COMPUTING THE STANDARD DEVIATION

●Be able to use your calculator and EXCEL to calculate s:

This web site shows how to use your calculator for entering data and

then calculating x̄ and s:

●There are also 'short-cut' formulas for s² and s (page 109 of the text).

These formulas are based on the algebraic identity,

Σ(xi - x̄)² = Σxi² - (Σxi)²/n

SHORT-CUT FORMULAS FOR s² AND s

s² = [Σxi² - (Σxi)²/n] / (n-1)   and,

s = √[(Σxi² - (Σxi)²/n) / (n-1)]

Interpreting the sample standard deviation, s

[1] The sample mean x̄ and range R are easy to compute and easy to interpret.

Example: If you examine a sample of invoices and calculate that the mean time it takes to ship orders to your customers is 2.5 days, then you can immediately infer that the shipment times are generally near 2.5 days, some of them being shorter than 2.5 days and some a little longer. If, in addition, you compute the range and find R = 6, you can quickly state that there is no more than 6 days difference between the shortest shipment time and the longest (for this data).

[2] Interpreting the sample standard deviation, s, is not as immediate.

The key thing to remember about interpreting s is that it is not a 'stand-alone' measure like the sample mean and range are. There is no absolute scale that says an s of 100 is large while an s of 10 is small. The standard deviation can only be interpreted in relation to the mean. In a sense, it should come as no surprise that x̄ must be involved in any interpretation of s, since the very definition of s involves the sample mean.

[3] s estimates the average or "typical" deviation of a data value from the center, x̄.

Z –scores

●Converting the raw distances from x̄ into distances measured in terms of s is

called standardization.

●Each ratio (x - x̄)/s that we calculate when comparing the distance x - x̄ to the

standard deviation is called a z-score.

●z-scores can be used to describe the distance that any point is from the mean,

not just numbers belonging to the data set.

Z-SCORES

For any number x (not necessarily contained in the data set),

the z-score of x with respect to the data is

given by:

z = (x - x̄)/s

where x̄ and s are the sample mean and standard deviation

of the data.

Example: The scores on a 100-point vocational placement exam have a mean of 70 and a standard deviation of 10. Two job applicants who took this exam received scores of 60 and 85. What were their z-scores?

The applicant with a score of 85 has a z-score of z = (85 - 70)/10 = 1.5. That is, this

applicant's score was 1.5 standard deviations above the average of all people taking the exam. Similarly, the other applicant has a z-score of z = (60 - 70)/10 = -1.0, which is 1.0 standard deviations below the average score on the exam.
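The exam example amounts to a one-line calculation:

```python
def z_score(x, xbar, s):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - xbar) / s

# Placement exam: mean 70, standard deviation 10
print(z_score(85, 70, 10))   # 1.5  (1.5 sd above the mean)
print(z_score(60, 70, 10))   # -1.0 (1.0 sd below the mean)
```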

●What is the reason for converting numbers into z-scores?

(1)The answer goes back to the beginning of this section, where it is noted that there is no absolute scale for comparing various values of s.

(2)z-scores, on the other hand, can be compared against fixed scales, one of

which is described below (The Empirical Rule). Using such scales, we can make more precise statements about how we expect the process generating the data to behave.

●In essence, the scales against which we compare z-scores tell us when a particular

value of z is "too large" or "about right" and so on.

Large z-scores are associated with unlikely events, while

Small z-scores are indicative of more likely events.

●The scales that we use to determine what is "likely' and what is "unlikely" are

called probability distributions (in upcoming chapters).

THE EMPIRICAL RULE (page 111)

●There is one scale for interpreting z-scores that is exceedingly simple to use and that applies in a wide variety of situations. It provides an easily-remembered rule for interpreting the sample standard deviation and, once you become proficient in its use, you will have taken a large step towards 'thinking statistically' and understanding the topics in the rest of this course.

●The Empirical Rule, as it is sometimes called, is a brief table that tells you

approximately what proportion of your data falls within certain distances (measured as z-scores) from the mean x̄. The rule is based on the fact that a great many processes and populations generate data whose histograms look remarkably similar to a particular bell-shaped curve (the so-called Normal curve discussed in Chapter 8 of the text).

●The Empirical Rule is often stated in terms of x̄ and s as follows:

1)About 68% of the data falls within 1 standard deviation of the mean

2)About 95% of the data fall within 2 standard deviations of the mean

3)Almost all (over 99%) of the data fall within 3 standard deviations

of the mean
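A quick simulation illustrates why these percentages hold for bell-shaped data. This is a sketch, not part of the text's examples: it draws values from a Normal distribution with a hypothetical mean of 100 and standard deviation of 15 and counts how many land within 1, 2, and 3 standard deviations of the sample mean:

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible
data = [random.gauss(100, 15) for _ in range(100_000)]

xbar = statistics.mean(data)
s = statistics.stdev(data)

for k in (1, 2, 3):
    inside = sum(1 for x in data if abs(x - xbar) <= k * s)
    print(f"within {k} sd: {inside / len(data):.1%}")
# Proportions land near the Empirical Rule's 68%, 95%, and 99%+
```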

●In practice, the Empirical Rule acts as a rough table of probabilities. You can use it to give rough bounds within which the process data is likely to fall, or you can evaluate a particular value to see how likely it is.

Example (Cash balance data):

(1)For this data, x̄ = $157,954.48 and s = $47,633.00.

(2)The Empirical Rule gives the following (approximate) bounds:

About 68% of the time, the ending cash balance should be within 1 standard deviation from the mean. Since 1 standard deviation is equivalent to $47,633.00,

this translates into values of $157,954.48-$47,633.00 = $110,321.48 (i.e., one standard deviation below the mean) and $157,954.48+$47,633.00 = $205,587.48 (or, 1 standard deviation above the mean).

About 95% of the time, the ending cash balance will be within 2 standard deviations of the mean, i.e., between $157,954.48-2($47,633.00) = $62,688.48 and $157,954.48+2($47,633.00) = $253,220.48.

Almost all (over 99%) of the time the ending cash balance will fall within 3 standard deviations of the mean, i.e., between $157,954.48-3($47,633.00) = $15,055.48 and $157,954.48+3($47,633.00) = $300,853.48.
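The three pairs of bounds follow the same pattern, x̄ ± k·s for k = 1, 2, 3, so they can be generated in one loop:

```python
# Empirical Rule bounds for the cash balance data
xbar = 157954.48
s = 47633.00

for k, pct in [(1, "about 68%"), (2, "about 95%"), (3, "over 99%")]:
    lo, hi = xbar - k * s, xbar + k * s
    print(f"{pct} of months: ${lo:,.2f} to ${hi:,.2f}")
```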

These results can be used in numerous ways. Suppose, for instance, that the company is expecting to purchase an expensive piece of equipment (costing $50,000) next month. Is it likely that this purchase will cause the ending cash balance to be negative that month, requiring the company to get a short-term bank loan to make up the difference? The answer is that most likely no loan will be needed, because about 95% of the time the cash balance should fall in the $62,688.48 to $253,220.48 range. There is only a relatively small chance (about 5% or less) that the cash balance will drop below $62,688.48, so the equipment purchase should not create any additional interest costs stemming from a bank loan.