Statistics 2: Normal Distribution

ELE 300 Spring 2007

Experimental Error

When an operation or experiment is repeated under what are, as nearly as possible, the same conditions, the observed results are never quite identical. The fluctuation that occurs from one repetition to another is called experimental error (without emotion or blame). It refers to variation that is often unavoidable.

Sources of experimental error can include:

Ambient temperature

Variability in materials

Condition of equipment

Distinguish experimental error from careless mistakes (such as misplacing a decimal point or using the wrong reagent), errors of measurement, errors of analysis, and errors of sampling. Experimental error is the term used for the random variation in results due to uncontrolled variables.

Example from Box, Hunter, and Hunter, Statistics for Experimenters (John Wiley & Sons, 1978)

An industrial process produces a certain yield of acceptable product, measured in percent. Perhaps it is a chemical process. Perhaps it is an integrated circuit process. Note that the yield is a continuous variable; that is, the yield can take on any value between 0 and 100. Over a period of time, 210 batches of product are manufactured, with the results shown in the figure. The yield is measured to the nearest whole percent.

It is seen that 1) there is some variability in the yield from run to run and 2) the yield clusters around 83-85%. Today we will consider such questions as: What is the average yield? What is the range of yields? What is the probability that a particular (future) batch will have a certain yield? What is the probability that the yield will be in a certain range?

2.2 Theory: Probability Distributions, Parameters, and Statistics

Let’s look at the experimental data for the manufacturing process.

Yield        Number of Data     Relative
(Percent)    Points in Range    Frequency

76             1                0.005
77             2                0.010
78             2                0.010
79            10                0.048
80            16                0.076
81            16                0.076
82            24                0.114
83            30                0.143
84            29                0.138
85            25                0.119
86            23                0.110
87            11                0.052
88             9                0.043
89             6                0.029
90             5                0.024
91             1                0.005

Sum          210                1.000

This distribution is shown in the figure.

We can convert this plot to a relative frequency distribution by dividing the number in each percent range by 210, the total number of data points. Since the yield was measured to the nearest whole percent, and since the actual yield can take on any real value between 0 and 100, we also express the relative frequency distribution in terms of intervals, each interval including all possible values between the limits of the interval.
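The normalization step can be sketched in a few lines of Python. This is only an illustration: the counts are copied from the table above, and the variable names (`counts`, `rel_freq`) are chosen here, not taken from the course material.

```python
# Number of batches observed at each yield (percent), from the table above.
counts = {
    76: 1, 77: 2, 78: 2, 79: 10, 80: 16, 81: 16, 82: 24, 83: 30,
    84: 29, 85: 25, 86: 23, 87: 11, 88: 9, 89: 6, 90: 5, 91: 1,
}

total = sum(counts.values())  # total number of data points

# Relative frequency = count in each one-percent interval / total count.
rel_freq = {y: n / total for y, n in counts.items()}

for y, f in rel_freq.items():
    print(f"{y:3d}  {counts[y]:3d}  {f:.3f}")
print("sum of relative frequencies:", round(sum(rel_freq.values()), 3))
```

Because each interval is one percent wide, the relative frequency of an interval is also the area of its histogram bar.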

The probability of a result being in a certain range is the area under the probability distribution curve in that range. Note that the sum of the areas equals one. There is 100% probability, for the population of measurements taken, that every result of the measurements will fall in the range of the measured results.

What is the probability that the yield is between 85 and 86?

0.12 × 1 = 0.12, the area under the curve between 85 and 86 (height 0.12 times width 1)

What is the probability that the yield is between 85 and 87?

What is the probability that the yield is greater than 90?

A population is that group or set of things about which a decision is to be made. It is the entire universe of possible results, given the experimental conditions.

A sample is a subset of the population, examination of which provides information about the whole population. The experimental data portrayed above is a sample of the entire population.

An experimental run has been performed when a well-defined process has been followed and data has been collected.

An experimental result or datum is usually a numerical measurement that describes the outcome of the experimental run. Each datum in the sample above is the result of one experimental run.

A distribution defines the shape of data when data are plotted in the form of a histogram.

A frequency distribution shows the number of times, or frequency, that any outcome occurs.

If the frequency distribution is normalized, by dividing each frequency by the total number of outcomes, the resulting histogram shows the relative frequency or probability distribution of the outcomes.

The population mean is a measure of location of the population distribution.

The population mean is

$$\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$$

where N is the number of data points in the entire population (a large number), and the y_i are the values of all those data points.

The Normal Distribution

Repeated observations that differ because of experimental error often vary about some central value in a roughly symmetric distribution in which small deviations occur much more frequently than large ones. This is characteristic of the yield distribution we are considering here. A continuous distribution of this nature is the Gaussian or normal distribution.

Gauss determined the mathematical description of such a distribution:

$$p(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y-\mu)^2/(2\sigma^2)}$$

where e is the base of the natural logarithm, y is a particular value of the distribution, μ is the mean of the population, and σ is the standard deviation of the population. The mathematical expression is for an infinitely large population, but provides a good representation of much smaller populations having the characteristics described above.
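As a rough numerical illustration, the curve can be evaluated in Python. The sketch assumes the standard form of the Gaussian density, 1/(σ√(2π)) · exp(−(y − μ)²/(2σ²)); the function name `normal_pdf` and the example values are chosen here for illustration.

```python
import math

def normal_pdf(y, mu, sigma):
    """Height of the normal curve at y for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2))

# Illustrative values only: a curve with mean 0 and standard deviation 1.
mu, sigma = 0.0, 1.0
peak = normal_pdf(mu, mu, sigma)             # maximum of the curve, at y = mu
one_sd = normal_pdf(mu + sigma, mu, sigma)   # height one s.d. above the mean
print(peak, one_sd)
```

The curve is symmetric about the mean, so `normal_pdf(mu - d, mu, sigma)` equals `normal_pdf(mu + d, mu, sigma)` for any distance `d`.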

By taking the second derivative of the mathematical expression with respect to y, it can be shown that the inflection points are at y = μ ± σ. The distance from the mean to an inflection point is chosen as a convenient, reproducible measure of deviation from the mean: the standard deviation σ. It is a measure of the degree of dispersion of the data.
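The inflection-point claim can be checked with a short derivation. It assumes the standard Gaussian density with mean \mu and standard deviation \sigma, and uses the fact that p(y) is never zero.

```latex
\begin{align*}
p(y) &= \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y-\mu)^2/(2\sigma^2)} \\
\frac{dp}{dy} &= -\frac{y-\mu}{\sigma^2}\, p(y) \\
\frac{d^2p}{dy^2} &= \left[\frac{(y-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}\right] p(y) \\
\frac{d^2p}{dy^2} &= 0
  \quad\Longrightarrow\quad (y-\mu)^2 = \sigma^2
  \quad\Longrightarrow\quad y = \mu \pm \sigma
\end{align*}
```

The curvature changes sign one standard deviation on either side of the mean, which is why σ can be read directly off the shape of the curve.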

Figure 2.12 shows some features of the normal distribution.

Figure 2.11 shows normal distributions with different means and variances, where the variance is the square of the standard deviation.

We can normalize any normal distribution by the following transformation:

z = (y − μ)/σ

Subtracting μ shifts the distribution to be symmetrical around zero; z = 0 when y = μ.

Dividing by σ normalizes the spread around the mean; z = 1 when y − μ = σ.

The distribution of z has a mean of 0 and a variance of 1. The variance of y is σ².

See Table A on Page 8

  1. p(y > μ + σ) = p{(y − μ) > σ} = p{(y − μ)/σ > 1} = p(z > 1) = 0.1587
  2. p(z < −1) = 0.1587
  3. p(|z| > 1) =
  4. p(z > 2.14) =

Thus we can use a single table of z for a normal distribution with any mean and variance.
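For readers without the printed table at hand, tail areas of the standard normal curve can also be computed. The sketch below is one possible approach, not the method used in the course: it relies on Python's `math.erfc`, and the helper names `upper_tail` and `standardize` are made up for this example.

```python
import math

def standardize(y, mu, sigma):
    """Convert a value y to the standard normal variable z = (y - mu) / sigma."""
    return (y - mu) / sigma

def upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p_above = upper_tail(1.0)           # p(z > 1)
p_below = 1.0 - upper_tail(-1.0)    # p(z < -1), equal to p(z > 1) by symmetry
print(round(p_above, 4), round(p_below, 4))
```

Any other tail area in the z table can be obtained the same way by changing the argument of `upper_tail`.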

We can consider the 210 data points from our manufacturing process to be a large number, so we can use the z table to answer some questions about the process. What this means is that we can accept the mean and standard deviation of the 210 experimental results as being very good estimates of the mean and standard deviation of the universe of possible experimental results for this process. The mean of the 210 data points is 83.7 and the standard deviation is 2.87.
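The quoted mean and standard deviation can be reproduced from the frequency table. The sketch below assumes the standard deviation is computed with N in the denominator (the population form); dividing by N − 1 instead changes the result only slightly. Variable names are illustrative.

```python
import math

# Number of batches observed at each yield (percent), from the table of 210 runs.
counts = {
    76: 1, 77: 2, 78: 2, 79: 10, 80: 16, 81: 16, 82: 24, 83: 30,
    84: 29, 85: 25, 86: 23, 87: 11, 88: 9, 89: 6, 90: 5, 91: 1,
}

n = sum(counts.values())

# Weighted mean: each yield value counted as many times as it occurred.
mean = sum(y * k for y, k in counts.items()) / n

# Variance with N in the denominator (an assumption; see the note above).
variance = sum(k * (y - mean) ** 2 for y, k in counts.items()) / n
sd = math.sqrt(variance)

print(round(mean, 1), round(sd, 2))
```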

We can construct the normal curve having a mean of 83.7 and a s.d. of 2.87 using Gauss’s formula:

where 2.45 is a normalizing constant based on the maximum of the experimental distribution. The normal or Gaussian mathematical distribution and the experimental distribution (dots) are shown in the figure below.

When we use the z table, we are taking the mean and the s.d. of the experimental data as good approximations of the mean μ and standard deviation σ of the underlying Gaussian distribution.

  1. What is the probability that the yield for a given experimental run will be greater than 88?

Normalizing in order to use the z table: z = (y − μ)/σ = (88 − 83.7)/2.87 = 4.3/2.87 = 1.50

From the z table, the probability of z>1.5 is 0.0668. There is a probability of 0.0668 or about a 7% chance that the yield for a given experimental run will be greater than 88. Notice that the probability of getting a yield greater than 83.7 is 0.5, which one would expect, since the area to the right of the mean of the Gaussian curve is one half the total area.

  2. What is the probability that the yield on some future run will be between 83.7 and 87?

z = (87-83.7)/2.87 = 3.3/2.87 = 1.15

From the z table, the probability of z>1.15 (corresponding to y=87) is 0.1251. That is, 12.51% of the area of the Gaussian curve is to the right of z=1.15. The area to the right of the mean (z=0) is 0.500, so the area between the mean and z=1.15 is 0.5000-0.1251=0.3749. There is about a 37.5% chance that the yield on some future run will be between 83.7 and 87.

Notice that the area between the mean and one standard deviation is 0.5000 − 0.1587 = 0.3413, so the probability of a value within ± one standard deviation of the mean is 0.3413 + 0.3413 = 0.6826. That is, we expect that about 68% of the experimental results will be between plus and minus one standard deviation of the mean.
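The worked examples above can be checked numerically. This is a sketch under stated assumptions: z is rounded to two decimals to mimic a table lookup, tail areas come from `math.erfc` rather than the printed table, and the helper name `upper_tail` is made up here. Small differences in the fourth decimal place relative to table arithmetic are expected.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mu, sigma = 83.7, 2.87   # mean and s.d. of the 210 yields, as quoted in the text

# Probability that the yield exceeds 88.
z_88 = round((88 - mu) / sigma, 2)
p_gt_88 = upper_tail(z_88)

# Probability that the yield lies between the mean (83.7) and 87.
z_87 = round((87 - mu) / sigma, 2)
p_mean_to_87 = 0.5 - upper_tail(z_87)

# Probability of a result within one standard deviation of the mean.
p_within_1sd = 1.0 - 2.0 * upper_tail(1.0)

print(z_88, round(p_gt_88, 4))
print(z_87, round(p_mean_to_87, 4))
print(round(p_within_1sd, 4))
```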

Since z is a continuous variable, it can take an infinite number of values. Therefore, there is a zero probability of getting exactly a certain value. Consequently, we always pose questions in terms of the probability of getting less than, greater than, or between two values.

EGN 300 Homework Problem

Due March 15, 2007

Measure and record the major diameter of at least 100 candies. Do not return any candy to the supply until you have measured all 100 candies. Why?

Consider the measured candies to be an excellent representation of the entire population. Using a calculator or computer with statistical software, find the mean diameter and standard deviation of the diameters for the population.

Using the z statistic:

  1. What is the diameter of a candy that is exactly one standard deviation smaller than the mean diameter of the population of candies?
  2. What is the probability that a candy chosen at random from the population will have a diameter smaller than one standard deviation smaller than the population mean?
  3. What is the diameter of a candy that is exactly two standard deviations larger than the mean diameter of the population of candies?
  4. What is the probability that a candy chosen at random from the population will have a diameter larger than two standard deviations from the mean?
  5. What is the probability that a candy chosen at random from the population will have a diameter between minus three standard deviations and plus three standard deviations from the mean? What are the diameters of that smallest candy and that largest candy?
