Chapter 18. The Normal Approximation for Probability Histograms

Statistics 103

Key Terms

Empirical histogram – A histogram for observed data.

Probability histogram – A histogram for chance.

Central Limit Theorem – When drawing at random with replacement from a box, the probability histogram for the sum will follow the normal curve, even if the contents of the box do not, provided the number of draws is reasonably large.

THE PROBABILITY HISTOGRAM

The expected value and the standard error are good tools for describing where a chance quantity, such as the sum of the draws from a box, is likely to fall, just as an empirical histogram is a good tool for graphing data. However, it is the probability histogram that gives the complete picture. A probability histogram is a graph that represents chance by area, whereas an empirical histogram represents data. The probability histogram is made up of rectangles. The base of each rectangle is centered at a possible value for the sum of the draws (in terms of the box model), and the area of the rectangle equals the chance of getting that value. The total area of the histogram is 100%.

The expected value pins the center of the probability histogram to the horizontal axis, and the standard error fixes the spread. According to the square root law, the expected value and standard error for a sum can be computed from:

The number of draws
The average of the box
The SD of the box

These three quantities just about determine the behavior of the sum. This is why the SD of the box is such an important measure of the spread.
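As a rough illustration of how these three quantities combine, the sketch below computes the expected value (number of draws times the average of the box) and the standard error (square root of the number of draws times the SD of the box). The function name `sum_ev_se` and the choice of 100 draws from the box 1, 2, 3, 4, 5, 6 are assumptions made for this example only.

```python
import math
import statistics

def sum_ev_se(box, draws):
    """EV and SE for the sum of `draws` draws made at random with replacement from `box`."""
    average = statistics.mean(box)      # average of the box
    sd = statistics.pstdev(box)         # SD of the box (population SD)
    expected_value = draws * average
    standard_error = math.sqrt(draws) * sd   # square root law
    return expected_value, standard_error

# Hypothetical example: 100 draws from the box 1, 2, 3, 4, 5, 6
die_box = [1, 2, 3, 4, 5, 6]
ev, se = sum_ev_se(die_box, 100)
print(f"EV = {ev:.1f}, SE = {se:.1f}")   # EV = 350.0, SE = 17.1
```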

If the chance process for getting a sum is repeated many times, the empirical histogram for the observed values converges to the probability histogram. Suppose a pair of dice is rolled 100 times, 1,000 times and 10,000 times, and each time the total number of spots is recorded. The histograms below show, for each possible total, the percentage of rolls on which that total came up. The first three histograms are empirical, and they get progressively closer to the last one, the probability histogram. The more repetitions, the more closely the empirical histogram tends to resemble the probability histogram.

Importantly, as we will see, the number of draws from the box 1, 2, 3, 4, 5, 6 was fixed in the above example. On each repetition we drew twice and took the sum: the basic chance process was drawing twice from the box and adding. This process was repeated a larger and larger number of times (100, 1,000, 10,000), and the empirical histogram converged to the probability histogram.
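The convergence described above can be tried out with a small simulation. The sketch below is only an illustration: the helper names, the seed, and the use of the "largest gap" between the two histograms as a measure of closeness are assumptions of this example, not part of the text.

```python
import random
from collections import Counter
from fractions import Fraction

BOX = [1, 2, 3, 4, 5, 6]

def probability_histogram():
    """Exact chance of each total for two draws from BOX (36 equally likely pairs)."""
    counts = Counter(a + b for a in BOX for b in BOX)
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}

def empirical_histogram(repetitions, rng):
    """Observed fraction of each total when a pair of dice is rolled `repetitions` times."""
    counts = Counter(rng.choice(BOX) + rng.choice(BOX) for _ in range(repetitions))
    return {total: counts[total] / repetitions for total in range(2, 13)}

def largest_gap(empirical, exact):
    """Largest difference between observed fraction and exact chance over all totals."""
    return max(abs(empirical[t] - float(exact[t])) for t in exact)

rng = random.Random(18)   # fixed seed so the run is repeatable
exact = probability_histogram()
for repetitions in (100, 1_000, 10_000):
    gap = largest_gap(empirical_histogram(repetitions, rng), exact)
    print(f"{repetitions:>6} repetitions: largest gap = {gap:.3f}")
```

Note that the number of draws per repetition stays at two throughout; only the number of repetitions changes. The gap usually shrinks as the repetitions increase, although in any single run it need not shrink at every step.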

THE CENTRAL LIMIT THEOREM

The examples above involve two different kinds of convergence for histograms. The first is the convergence of an empirical histogram to a probability histogram; the second is the convergence of a probability histogram to the normal curve. In other words, when the number of repetitions is large, the empirical histogram will be close to the probability histogram, and when the number of draws is large, the probability histogram for the sum will be close to the normal curve: as the number of draws increases, the probability histogram for the sum gets smoother and smoother, and in the limit it becomes the normal curve. The critical distinction is between the number of draws going into each sum and the number of sums (repetitions). The second kind of convergence is the Central Limit Theorem. It states that when drawing at random with replacement from a box, the probability histogram for the sum will follow the normal curve, even if the contents of the box do not, provided the number of draws is reasonably large.
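The second kind of convergence can be checked numerically without any simulation. The sketch below builds the exact probability histogram for the sum and compares each rectangle with the area under the normal curve over the same base. The lopsided box 1, 2, 9, the helper names, and the "largest gap" measure are assumptions chosen for illustration; the code also assumes the tickets are whole numbers.

```python
import math
import statistics

def sum_distribution(box, draws):
    """Exact probability histogram for the sum of `draws` draws with replacement."""
    dist = {0: 1.0}
    for _ in range(draws):
        nxt = {}
        for total, p in dist.items():
            for ticket in box:
                nxt[total + ticket] = nxt.get(total + ticket, 0.0) + p / len(box)
        dist = nxt
    return dist

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def largest_gap(box, draws):
    """Largest difference between a rectangle's area and the normal-curve area over its base."""
    ev = draws * statistics.mean(box)
    se = math.sqrt(draws) * statistics.pstdev(box)
    dist = sum_distribution(box, draws)
    gaps = []
    for total in range(min(dist), max(dist) + 1):   # include values with zero chance
        lo = (total - 0.5 - ev) / se                # base of the rectangle in standard units
        hi = (total + 0.5 - ev) / se
        gaps.append(abs(dist.get(total, 0.0) - (normal_cdf(hi) - normal_cdf(lo))))
    return max(gaps)

lopsided_box = [1, 2, 9]   # hypothetical box whose contents look nothing like the normal curve
for draws in (2, 10, 50):
    print(f"{draws:>3} draws: largest gap = {largest_gap(lopsided_box, draws):.4f}")
```

Here nothing is repeated at all: each line of output describes one probability histogram, and the only thing that changes is the number of draws going into the sum.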

The Central Limit Theorem applies to sums but not to other operations such as products. The probability histogram for a product will usually look quite different from the one for a sum, even when the number of rolls is increased. Take for example the product of the spots that show on a pair of dice. Some numbers, such as 7 or 11 (primes larger than 6), cannot be obtained as a product of two faces, so the probability histogram has gaps. A histogram with gaps like these does not look like the normal curve, and the Central Limit Theorem gives no reason to expect that it should.
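The gaps in the histogram for the product can be listed by enumerating all 36 equally likely outcomes for a pair of dice. This is a small sketch; the variable names are chosen for this example only.

```python
from collections import Counter
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Count how many of the 36 equally likely pairs give each product.
counts = Counter(a * b for a in faces for b in faces)
product_chances = {value: Fraction(n, 36) for value, n in sorted(counts.items())}

# Whole numbers from 1 to 36 that can never come up as a product.
impossible = [v for v in range(1, 37) if v not in product_chances]

print("possible products:", list(product_chances))
print("values with zero chance:", impossible)
```

Small primes such as 2, 3 and 5 do appear (for example 1 × 2), while 7 and 11 are among the values with zero chance.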

Example 1, Page 328, # 7

A pair of dice is thrown. The total number of spots is like

(i) one draw from the box

2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

(ii) the sum of two draws from the box

1, 2, 3, 4, 5, 6

Explain.

Solution

Refer to the three key questions on page 281 necessary for creating a box model:

What numbers go into the box?
How many of each kind?
How many draws?

Our experiment consists of rolling a pair of dice and looking at the total number of spots. You might think the box model for this is option (i), but that is wrong. With box (i), every total from 2 through 12 would be equally likely, so the chance of getting a 7 would be the same as the chance of getting a 2. In fact, the chance of rolling a 7 is 6/36, not 1/11 as box (i) would have it, and the chance of rolling a 2 is only 1/36. Box (i) does not take into account the different chances of rolling each of the totals 2 through 12. Box (ii) does: each die is like one draw from the box 1, 2, 3, 4, 5, 6, so the total is like the sum of two draws made at random with replacement. Therefore (ii) is the correct model.
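The two box models can be compared directly by computing the chances each one assigns to a few totals. This is a minimal sketch; the function names are hypothetical and exact fractions are used so that the results match the 6/36 and 1/11 quoted in the solution.

```python
from fractions import Fraction
from itertools import product

box_i = list(range(2, 13))      # box (i): tickets 2, 3, ..., 12
box_ii = [1, 2, 3, 4, 5, 6]     # box (ii): tickets 1, 2, ..., 6

def chance_one_draw(box, value):
    """Chance that a single draw from `box` equals `value`."""
    return Fraction(box.count(value), len(box))

def chance_sum_two_draws(box, value):
    """Chance that the sum of two draws with replacement from `box` equals `value`."""
    pairs = list(product(box, repeat=2))
    hits = sum(1 for a, b in pairs if a + b == value)
    return Fraction(hits, len(pairs))

for total in (2, 7, 12):
    print(total, chance_one_draw(box_i, total), chance_sum_two_draws(box_ii, total))
# 2 1/11 1/36
# 7 1/11 1/6
# 12 1/11 1/36
```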

Example 2, Page 329, # 12

Solution

To answer this question, again refer to the three key questions on how to create a box model on page 281:

What numbers go in the box?

The numbers that go into the box would be +1 for the positive numbers and –1 for the negative numbers.

How many of each kind?

There are 4 positive numbers and 6 negative numbers.

How many draws?

There are 1,000 draws.

(a) More information is needed, since the numbers in the box are not given, so there is no way to calculate the expected value and the standard error. The box could have 4 tickets marked “1” and 6 tickets marked “10”, or it could have 4 tickets marked “3” and 6 tickets marked “-3”. With these two boxes alone, the chances would be very different.

(b) Yes, because from the average of the box one can calculate the expected value, and from the SD of the box one can calculate the SE.
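To see how different the two boxes mentioned in part (a) are, the sketch below works out the expected value and standard error of the sum for each, using the 1,000 draws stated in the solution. The function and variable names are assumptions of this example.

```python
import math
import statistics

def ev_and_se_of_sum(box, draws):
    """EV = draws x average of box; SE = sqrt(draws) x SD of box."""
    ev = draws * statistics.mean(box)
    se = math.sqrt(draws) * statistics.pstdev(box)
    return ev, se

box_a = [1] * 4 + [10] * 6    # 4 tickets marked 1 and 6 tickets marked 10
box_b = [3] * 4 + [-3] * 6    # 4 tickets marked 3 and 6 tickets marked -3

for name, box in (("first box", box_a), ("second box", box_b)):
    ev, se = ev_and_se_of_sum(box, 1000)
    print(f"{name}: EV = {ev:.0f}, SE = {se:.1f}")
```

The two expected values differ by thousands while the standard errors are on the order of a hundred, so the chances for the sum cannot be the same for the two boxes.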

THE NORMAL CURVE AND NORMAL APPROXIMATION

In Chapter 17 we learned how to use the normal curve to figure chances. To recall briefly: chances are found as areas under the curve over given intervals. The endpoints on the horizontal axis of the histogram must first be converted to standard units, and a normal table (z-table) must be handy.

At this point, something must also be said about the endpoints. If you want to find the chance of getting exactly 50 heads when a coin is tossed 100 times, you need the area of the rectangle over 50. Under the normal curve, however, the area from 50 to 50 is zero. The base of the rectangle runs from 49.5 to 50.5, so we approximate its area by the area under the curve from 49.5 to 50.5 instead.

Because of the way the histogram is scaled, you need the area under the curve between 49.5 and 50.5, and it is these two numbers that must be converted to standard units. In general, if the endpoints are included, the area begins 0.5 below the left endpoint and ends 0.5 above the right endpoint. If the endpoints are excluded, the area begins 0.5 above the left endpoint and ends 0.5 below the right endpoint. For example, suppose again that you tossed a coin 100 times and wanted to estimate the chance of getting between 45 and 55 heads inclusive. To find the area under the curve you would convert 44.5 and 55.5 to standard units. If you wanted to estimate the chance of getting between 45 and 55 heads exclusive, you would go from 45.5 to 54.5.
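The endpoint rule can be checked against exact chances for the coin example. In the sketch below the number of heads is treated as the sum of draws from the box 0, 1 (average 0.5, SD 0.5), and the normal curve area is computed with `math.erf` rather than read from a table. The function names and the `inclusive` flag are assumptions of this illustration.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_approx(lo, hi, tosses=100, inclusive=True):
    """Normal approximation to the chance that the number of heads is between lo and hi."""
    ev = tosses * 0.5                    # average of the box 0, 1 is 0.5
    se = math.sqrt(tosses) * 0.5         # SD of the box 0, 1 is 0.5
    shift = 0.5 if inclusive else -0.5   # widen by 0.5 if inclusive, narrow by 0.5 if exclusive
    left, right = lo - shift, hi + shift
    return normal_cdf((right - ev) / se) - normal_cdf((left - ev) / se)

def exact_chance(lo, hi, tosses=100, inclusive=True):
    """Exact chance, found by adding up the rectangles of the probability histogram."""
    first, last = (lo, hi) if inclusive else (lo + 1, hi - 1)
    return sum(math.comb(tosses, k) for k in range(first, last + 1)) / 2 ** tosses

for lo, hi, inclusive in ((50, 50, True), (45, 55, True), (45, 55, False)):
    approx = normal_approx(lo, hi, inclusive=inclusive)
    exact = exact_chance(lo, hi, inclusive=inclusive)
    kind = "inclusive" if inclusive else "exclusive"
    print(f"{lo} to {hi} {kind}: normal {approx:.2%}, exact {exact:.2%}")
```

For 100 tosses the expected value is 50 and the standard error is 5, so 44.5 and 55.5 become -1.1 and +1.1 in standard units, and 45.5 and 54.5 become -0.9 and +0.9.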