Samples & Summary Measures

  • Sample

A set of observations from a population:

x1, x2, ..., xn

Example: Measure the diameters of 20 pistons produced on a production line, xi = diameter of piston # i.

  • Summary Measures

Sample Mean

Sample Variance

Sample Mean

Just the average of the sample:

x̄ = (x1 + x2 + ... + xn) / n

Example: x1 = 10, x2 = 12, x3 = 16, x4 = 18, so x̄ = (10 + 12 + 16 + 18)/4 = 14.

NOTE: The sample mean is an unbiased estimator of the population mean, that is:

E(x̄) = μ,

where μ is the population mean.

Sample Variance

The sample variance, S², is a measure of how widely dispersed the sample is:

S² = Σ(xi − x̄)² / (n − 1)

The sample variance is an estimator of the population variance, σ².

Example: x1 = 10, x2 = 12, x3 = 16, x4 = 18, so S² = [(10 − 14)² + (12 − 14)² + (16 − 14)² + (18 − 14)²] / 3 = 40/3 ≈ 13.33.
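
For readers who want to check these formulas in code, here is a minimal Python sketch (NumPy assumed available) computing the sample mean and sample variance for the example above; ddof=1 gives the n − 1 denominator.

```python
import numpy as np

# Sample from the example above
x = np.array([10, 12, 16, 18])

x_bar = x.mean()        # sample mean: (10 + 12 + 16 + 18) / 4 = 14
s2 = x.var(ddof=1)      # sample variance with an n - 1 denominator: 40/3 ≈ 13.33

print(x_bar, s2)
```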

Question: Why n-1 instead of n?

  • provides an unbiased estimate

E(S2) = 2

  • only n-1 degrees of freedom

n-1 degrees of freedom

For a set of n points, if we know the sample mean & also know the values of n-1 of the points, we also know the value of the remaining point.

Example: Suppose n = 4, x̄ = 14, and three of the points are x1 = 10, x2 = 12, x3 = 16.

Then, since x1 + x2 + x3 + x4 = n × x̄ = 56, the remaining point must be x4 = 56 − (10 + 12 + 16) = 18.

Central Limit Theorem Revisited

When sampling from a population with mean μ and standard deviation σ, the sampling distribution of the sample mean, x̄, will tend to a normal distribution with mean μ and standard deviation σ/√n as the sample size n becomes large.

For "large enough" n, x̄ is approximately normal with mean μ and standard deviation σ/√n.

Note: As n gets larger, the variance (and standard deviation) of the sample mean gets smaller.
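
As an illustration (not from the original notes), a short Python simulation sketch shows the spread of the sample mean shrinking like σ/√n even when the underlying population is not normal; NumPy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples of size n from a (non-normal) exponential population
# with standard deviation 2, and look at the spread of the sample means.
for n in (5, 50, 500):
    means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)
    print(n, means.std(), 2.0 / np.sqrt(n))   # empirical SD vs. sigma / sqrt(n)
```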

Confidence Intervals

Heart Valve Manufacturer

Dimension / Mean / Std. Deviation
Piston Diameter / 0.060 / 0.0002
Sleeve Diameter / 0.065 / 0.0002
Clearance (unsorted) / 0.005 / 0.000283

Decision: Implement sorting with batches of 5

A random sample (after sorting has been implemented) of 100 piston/valve assemblies yields 79 valid (meet tolerances) assemblies out of the 100 trials.

How do we know whether or not the process change has really improved the resulting yield?

The yield (# good assemblies out of 100) is a binomial random variable.

Our estimate of the mean (based on this sample) is 79% (or 79 out of 100).

One way of determining whether the process has been improved is to construct a confidence interval about our estimate.

To see how to do this, let X denote the number of good assemblies in 100 trials. Note that we can use the BINOMDIST function in Excel to compute, for example, the probability that X is within ±10 of our estimate, 79:

P{69 ≤ X ≤ 89} = P{X ≤ 89} − P{X ≤ 69}

= BINOMDIST(89,100,0.79,TRUE) − BINOMDIST(69,100,0.79,TRUE)

≈ 0.9971 − 0.0123 ≈ 0.985

Another way of stating this in words is that 79 ± 10 is a 98.5% confidence interval for the number of valid assemblies out of 100.

The following table was constructed using the BINOMDIST function as described above. It gives confidence intervals for various confidence levels.

Confidence Interval / % Confidence
79 ± 2 / 37.6%
79 ± 4 / 67.4%
79 ± 8 / 94.9%
79 ± 10 / 98.5%
79 ± 14 / 99.9%

Note: the larger the interval the more certain we become that it covers the true mean.

Note that the yield of the original process was 52%. Since the lower limit of a 99.9% confidence interval about our sample mean is 65% (substantially larger than 52%) we can be pretty certain the process has improved.
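
For reference, a minimal Python sketch (SciPy assumed available) reproduces the coverage probabilities in the table above directly from the binomial CDF, mirroring the BINOMDIST calculation.

```python
from scipy.stats import binom

n, p, center = 100, 0.79, 79

# Probability that X falls within +/- r of 79, mirroring
# BINOMDIST(center + r, 100, 0.79, TRUE) - BINOMDIST(center - r, 100, 0.79, TRUE)
for r in (2, 4, 8, 10, 14):
    coverage = binom.cdf(center + r, n, p) - binom.cdf(center - r, n, p)
    print(f"79 ± {r}: {coverage:.1%}")
```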

Confidence Intervals - Using Central Limit Theorem

Recall that: X = # successful assemblies in 100 trials

An estimate, p̂, of the probability of obtaining a successful assembly, p, is given by:

p̂ = X / 100

If we define:

Xi = 1 if assembly i is good, and Xi = 0 otherwise,

then X = X1 + X2 + X3 + … + X100

Furthermore,

p̂ = X/100 = (X1 + X2 + … + X100)/100,

so p̂ is simply the average of the Xi.

Note that Xi is the number of successes in 1 trial, i.e., a binomial random variable with n = 1 (a Bernoulli random variable), where p is the probability of success.

It follows that the mean and standard deviation of each Xi are given by:

E(Xi) = p and SD(Xi) = √(p(1 − p))

Applying the Central Limit Theorem tells us that p̂ (since it is the average of a large number of independent random variables) must be approximately normally distributed with mean and standard deviation given by:

E(p̂) = p and SD(p̂) = √(p(1 − p)/n)

Central Limit Theorem for Population Proportions

As the sample size, n, increases, the sampling distribution of p̂ approaches a normal distribution with mean p and standard deviation √(p(1 − p)/n).

When the parameter p approaches 1/2, the binomial distribution is symmetric and shaped much like the normal distribution. When p moves either above or below 1/2, the binomial distribution becomes more heavily skewed away from the normal & hence the sample size, n, necessary for the CLT to apply becomes larger.

A commonly used rule of thumb is that np and n(1-p) must both be larger than 5.

From our sample of 100 assemblies, 79 were good.

This implies that our estimates for the mean and standard deviation of p̂ are given by:

estimated mean = 0.79 and estimated standard deviation = √(0.79 × 0.21 / 100) ≈ 0.040731

To reiterate, we are using the Central Limit Theorem to approximate the distribution of p̂ as a Normal distribution with mean 0.79 and standard deviation 0.040731.

What we mean by, for example, a 95% confidence interval is to find a number, r, satisfying:

P{0.79 − r ≤ p̂ ≤ 0.79 + r} = 0.95

Since the Normal distribution is symmetric about its mean (in this case 0.79), exactly half of the "leftover" probability (5% for a 95% confidence interval) must lie in each tail, that is, 2.5% per tail. In other words,

P{p̂ ≤ 0.79 + r} = 0.975

To perform this calculation, use the NORMINV function from Excel:

0.79 + r = NORMINV(0.975, 0.79, 0.040731) ≈ 0.8698.

Solving for r, we get r ≈ 0.0798, so

a 95% confidence interval for p is given by

p̂ ± r = 0.79 ± 0.0798

Note that the resulting confidence interval is almost identical to that obtained by using the BINOMDIST function directly, clearly an indication that our use of the Central Limit Theorem to approximate the distribution of p̂ with a Normal distribution is valid.

The following table summarizes calculations for various confidence levels using the Normal approximation.

Confidence Level / Confidence Interval
99.9% / 0.79 ± 0.135
98.5% / 0.79 ± 0.099
95% / 0.79 ± 0.0798
67.4% / 0.79 ± 0.04
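
The table above can be reproduced with a short Python sketch (SciPy assumed available), using norm.ppf as the analogue of NORMINV.

```python
from math import sqrt
from scipy.stats import norm

p_hat, n = 0.79, 100
se = sqrt(p_hat * (1 - p_hat) / n)           # ≈ 0.040731

# Half-width r of the interval at each confidence level, mirroring
# NORMINV(1 - alpha/2, 0.79, 0.040731) - 0.79
for level in (0.999, 0.985, 0.95, 0.674):
    r = norm.ppf(1 - (1 - level) / 2) * se
    print(f"{level:.1%}: 0.79 ± {r:.4f}")
```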

In general, when sampling for a population proportion in this manner, our estimates for the mean and standard deviation will be given by:

mean = p̂ and standard deviation = √(p̂(1 − p̂)/n)

Election Polling Example:

1500 prospective voters are surveyed. 825 say they will vote for candidate A and 675 say they will vote for B. What is your estimate of the percentage of voters who will vote for A? Construct a 95% confidence interval.

To construct the confidence interval, use the normal approximation: p̂ = 825/1500 = 0.55, with standard deviation √(0.55 × 0.45 / 1500) ≈ 0.0128.

NORMINV(0.975,0.55,0.0128) = 0.575

The associated confidence interval is, therefore,

55% ± 2.5%. The news media would report this result by stating: "The poll has a margin of error of plus or minus 2.5 percentage points".

What would happen if the 55% estimate were based on a sample of size 750? The standard deviation would increase to √(0.55 × 0.45 / 750) ≈ 0.0182, and

NORMINV(0.975, 0.55, 0.0182) ≈ 0.586

The associated confidence interval is, therefore,

55% ± 3.6%.
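
A small Python sketch (SciPy assumed available; the helper name is illustrative) packages the normal-approximation interval p̂ ± zα/2 √(p̂(1 − p̂)/n) and reproduces both polling results.

```python
from math import sqrt
from scipy.stats import norm

def proportion_ci_halfwidth(p_hat, n, level=0.95):
    """Half-width of a normal-approximation CI for a population proportion."""
    se = sqrt(p_hat * (1 - p_hat) / n)
    return norm.ppf(1 - (1 - level) / 2) * se

print(proportion_ci_halfwidth(825 / 1500, 1500))  # ≈ 0.025 -> 55% ± 2.5%
print(proportion_ci_halfwidth(0.55, 750))         # ≈ 0.036 -> 55% ± 3.6%
```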

Notation

The Standard Normal distribution has mean 0 and standard deviation 1.

Let z be a random variable with a standard normal distribution. We define zα/2 to be the number satisfying:

P{z ≤ zα/2} = 1 − α/2

Example: If α = 0.05, then 1 − α/2 = 0.975, and the value of zα/2 can be found using the Excel function NORMSINV:

NORMSINV(1 − α/2) = zα/2, or

NORMSINV(0.975) = 1.9599.

Confidence Intervals for a Population Proportion

Using the Standard Normal

From a sample of 500, the number who say they prefer Coke to Pepsi is 275. Your estimate of the population proportion who prefer Coke is 275/500 = 55%.

Since n is large, we can apply the CLT and construct a confidence interval for the population proportion, p:

p̂ ± zα/2 √(p̂(1 − p̂)/n)

provides a (1 − α)100% confidence interval for the population proportion, p. For a 95% interval in this case, we would first determine that NORMSINV(0.975) = 1.96. We would then compute:

0.55 ± 1.96 × √(0.55 × 0.45 / 500) ≈ 0.55 ± 0.0436.

Confidence Intervals on the Sample Mean

  • Approximately 95% of the observations from a normal distribution fall within ±2 standard deviations of the mean.
  • The average of a sample, x̄, is (from the CLT) normally distributed with mean μ and standard deviation σ/√n.

It follows that the true mean will lie within ±2 standard deviations of the sample average 95% of the time. The associated confidence interval is:

x̄ ± 2σ/√n

Note that the standard deviation used here is σ/√n rather than σ.

Confidence Intervals (Again)

In general, if you have estimated the true mean by using the sample mean, x̄, as an estimate, the range

x̄ ± zα/2 σ/√n

is expected to contain, or cover, the true mean 100(1 − α)% of the time.

Example: The diameter of pistons is normally distributed with unknown mean and standard deviation of 0.01. You take a small sample of 5, measure their diameters, and compute a sample mean of 1.55. A 90% confidence interval would be given by:

1.55 ± Z0.95 × 0.01/√5.

From Excel, we can compute

Z0.95= NORMSINV(0.95)=1.645,

& hence the interval is:

(1.543, 1.557).
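
A minimal Python sketch (SciPy assumed available) reproduces this interval; norm.ppf(0.95) plays the role of NORMSINV(0.95).

```python
from math import sqrt
from scipy.stats import norm

sigma, n, x_bar, level = 0.01, 5, 1.55, 0.90

z = norm.ppf(1 - (1 - level) / 2)              # = NORMSINV(0.95) ≈ 1.645
half_width = z * sigma / sqrt(n)
print(x_bar - half_width, x_bar + half_width)  # ≈ (1.543, 1.557)
```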

Example : 1-tailed Test on a Single Mean

The yield of a new process is known to be normally distributed. The current process has an average yield of 0.85 (85%) with a standard deviation of 0.05. The new process is believed to have the same deviation as the old one. To determine whether the yield of the new process is higher than that of the old, you collect a random sample of size 10 & compute a sample mean of 90%.

Let α = 0.01.

Solution:

1. Formulate Hypothesis:

H0: μ ≤ 0.85, H1: μ > 0.85

(Important Note: alternative hypothesis is associated with action.)

2. Compute Test Statistic:

z = (x̄ − μ0)/(σ/√n) = (0.90 − 0.85)/(0.05/√10) ≈ 3.162

3. Determine Acceptance Region: Reject if

NORMSINV(1 − α) < z

NORMSINV(0.99) < 3.162

2.326 < 3.162

Reject Null Hypothesis!

Our test statistic lies 3.162 standard deviations to the right of the mean of the sampling distribution, clearly an indication that the observed outcome is very unlikely if the null hypothesis held.
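
A short Python sketch (SciPy assumed available) runs this one-tailed z-test end to end.

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, x_bar, alpha = 0.85, 0.05, 10, 0.90, 0.01

z = (x_bar - mu0) / (sigma / sqrt(n))   # ≈ 3.162
z_crit = norm.ppf(1 - alpha)            # = NORMSINV(0.99) ≈ 2.326

# One-tailed test of H0: mu <= 0.85; reject if z exceeds the critical value
print(z, z_crit, "reject H0" if z > z_crit else "fail to reject H0")
```
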
Example: 2-Tailed Test on a Single Mean

An auto manufacturer has an old engine that produces an average of 31.5 mpg. A new engine is believed to have the same standard deviation in mpg, 6.6, as the old engine, but it is unknown whether or not the new engine has the same average mpg. The sample mean of a random sample of 100 turns out to be 29.8. Let  = 0.05.

Solution:

1. Formulate Hypothesis:

H0: μ = 31.5, H1: μ ≠ 31.5

2. Compute Test Statistic:

z = (x̄ − μ0)/(σ/√n) = (29.8 − 31.5)/(6.6/√100) ≈ −2.576

3. Determine Acceptance Region: Fail to reject if

NORMSINV(α/2) ≤ z ≤ NORMSINV(1 − α/2)

NORMSINV(0.025) ≤ −2.576 ≤ NORMSINV(0.975)

−1.959 ≤ −2.576 ≤ 1.959

Since the inequality fails to hold, we

Reject Null Hypothesis!

Our test statistic lies 2.576 standard deviations to the left of the mean of the sampling distribution. If the null hypothesis were true, we would expect to observe a test statistic this low (or lower) only about 0.4998% of the time, less than half of one percent.
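
The corresponding two-tailed test in Python (SciPy assumed available):

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, x_bar, alpha = 31.5, 6.6, 100, 29.8, 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))                  # ≈ -2.576
lo, hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)  # ≈ (-1.96, 1.96)

# Fail to reject H0: mu = 31.5 only if z falls inside the acceptance region
print(z, (lo, hi), "fail to reject H0" if lo <= z <= hi else "reject H0")

# Two-sided p-value
print(2 * norm.sf(abs(z)))                             # ≈ 0.01
```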

T-test

  • Used when σ, the standard deviation of the underlying population, is unknown.
  • Instead of forming the test statistic with σ, we substitute an estimate for σ, namely the sample standard deviation:

s = √( Σ(xi − x̄)² / (n − 1) ).

  • The resulting test statistic is:

t = (x̄ − μ0) / (s/√n).

  • The test statistic, t, has a t distribution with n-1 degrees of freedom. In comparison, the test statistic, z, formed using the population standard deviation, σ, has a standard normal distribution.
  • The t distribution is shaped like the standard normal distribution (e.g., bell-shaped, but more spread out). Its mean is 0, and its variance (whenever degrees of freedom > 2) is df/(df-2). As n (and hence df) increases, the variance approaches 1 & the t distribution approaches the standard normal distribution.
  • When n is large, the z test is sometimes substituted (as a close approximation) for the t test.

Assumptions: Either n is large enough for the central limit theorem to hold or the underlying distribution is normal.
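
As a quick illustration (the data here are hypothetical, not from the notes), SciPy's ttest_1samp carries out a one-sample t-test directly from raw observations.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical small sample, for illustration only
x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4])

# One-sample t-test of H0: mu = 10 against a two-sided alternative
t_stat, p_value = ttest_1samp(x, popmean=10.0)
print(t_stat, p_value)
```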

T-test & EXCEL

Suppose t has a t distribution with n-1 degrees of freedom.

We define tα/2 to be the number satisfying:

P{t ≤ tα/2} = 1 − α/2

That is, the area under the t distribution (probability) to the left of tα/2 is 1 − α/2.

Note: This is the exact analogue of zα/2 for the standard normal distribution.

To calculate the value of tα/2 in EXCEL, you use the TINV function: TINV(α, n − 1) = tα/2. For example, TINV(0.05, 15) ≈ 2.131.
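
The same critical value can be obtained in Python (SciPy assumed available); t.ppf(1 − α/2, df) corresponds to Excel's TINV(α, df).

```python
from scipy.stats import t

alpha, df = 0.05, 15

t_crit = t.ppf(1 - alpha / 2, df)   # upper alpha/2 point, = TINV(0.05, 15)
print(t_crit)                       # ≈ 2.131
```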

EXAMPLE: 1-Tailed t-Test

Average weekly earnings of full-time employees are reported to be $344. You believe this value is too low. A random sample of 1200 employees yields a sample mean of $361 and a sample standard deviation of $110. Formulate the appropriate null hypothesis and analyze the data:

1. Formulate Null Hypothesis:

H0: μ ≤ 344, H1: μ > 344

2. Compute Test Statistic:

t = (x̄ − μ0)/(s/√n)

3. Analysis: This is an extreme value for the test statistic, more than 3 standard deviations away from the mean (if the null hypothesis were true). For α = 0.001, we can calculate the one-tailed critical value:

TINV(0.002, 1199) ≈ 3.1.

Since our test statistic is even larger, we would reject the null hypothesis at the α = 0.001 significance level.

The p-value associated with our test statistic is given by:

p-value = TDIST(3.353612, 1199, 1) = 0.000411.

Thus, if the null hypothesis were true, the probability of obtaining a test statistic as large as (or larger than) 3.353612 is only 0.0411%.

Average weekly earnings are almost certainly larger than $344.

p-Values

Definition: The p-value is the smallest level of significance, , for which a null hypothesis may be rejected using the obtained value of the test statistic.

The p-value is the probability of obtaining a value of the test statistic as extreme, or more extreme than, the actual test statistic, when the null hypothesis is true.

Example: Your z-statistic in a z-test is 3.162. To calculate the p-value, use the EXCEL function NORMSDIST. NORMSDIST(z) is the probability of obtaining a test statistic value less than or equal to z. To be more extreme, the test statistic would have to be larger than 3.162. Thus,

p-value = 1-NORMSDIST(3.162) = 1-0.999216 = 0.000784

Example: Your z-statistic in a z-test is -0.4. To be more extreme, the test statistic would have to be less than -0.4. Thus,

p-value = NORMSDIST(-0.4) = 0.3446

Example: Your t-statistic in a t test with 15 degrees of freedom is 4.56. To calculate the p-value use the EXCEL function TDIST. TDIST(t,n-1,1) = P{test statistic value > t | null hypothesis true}. (Note the different direction of the inequality!) Hence

p-value = TDIST(4.56,15,1) = 0.000188
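
The same p-values can be computed in Python (SciPy assumed available); norm.sf and t.sf give upper-tail probabilities.

```python
from scipy.stats import norm, t

# Right-tailed z-test, z = 3.162   (cf. 1 - NORMSDIST(3.162))
print(norm.sf(3.162))      # ≈ 0.000784

# Left-tailed z-test, z = -0.4     (cf. NORMSDIST(-0.4))
print(norm.cdf(-0.4))      # ≈ 0.3446

# Right-tailed t-test, t = 4.56 with 15 df   (cf. TDIST(4.56, 15, 1))
print(t.sf(4.56, 15))      # ≈ 0.000188
```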

Rules of Thumb for p-values

p-value / interpretation
< 0.01 / very significant
between 0.01 & 0.05 / significant
between 0.05 & 0.1 / marginally significant
> 0.1 / not significant

Chi-Square Tests

Like the t distribution, the chi-square distribution is defined by its number of degrees of freedom. A chi-square random variable with k degrees of freedom is usually denoted by the symbol χ²(k), and is defined by the equation:

χ²(k) = Z1² + Z2² + … + Zk²,

where Z1, Z2, …, Zk are independent standard normal random variables. That is, it is the sum of the squares of k standard normal random variables. Since squares are always non-negative, so is their sum, and hence a chi-square random variable can only take on non-negative values. Illustrations of the PDF for 1 and 5 degrees of freedom are shown below.

Values for the chi-square distribution can be referenced using the EXCEL functions CHIDIST and CHIINV.

  • CHIDIST(x, k) gives the probability that a chi-square random variable with k degrees of freedom attains a value greater than or equal to x. In other words, the area under the PDF to the right of x. In the pictures above, this is reported as the p-value.
  • CHIINV(p, k) gives the inverse, or critical value. That is, if p = CHIDIST(x, k), then CHIINV(p, k) = x. In the pictures above, this is reported as the chi-sq critical value.

Examples:

CHIDIST(5,5) = 0.41588

CHIDIST(25,5) = 0.000139

CHIINV(0.41588, 5) = 5

CHIINV(0.000139, 5) = 25

Values can also be referenced using a table of chi-square values.

  • For example, to find the critical value for a chi-square with 10 degrees of freedom at the 5% significance level, use row 10 and the α = 0.05 column of the attached table (giving a value of 18.31).
  • Alternatively, using EXCEL, one could compute CHIINV(0.05, 10) = 18.307.
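
In Python (SciPy assumed available), chi2.sf and chi2.isf play the roles of CHIDIST and CHIINV.

```python
from scipy.stats import chi2

# CHIDIST(x, k)  <->  chi2.sf(x, k)   (upper-tail probability)
print(chi2.sf(5, 5))         # ≈ 0.4159
print(chi2.sf(25, 5))        # ≈ 0.000139

# CHIINV(p, k)   <->  chi2.isf(p, k)  (upper-tail critical value)
print(chi2.isf(0.05, 10))    # ≈ 18.307
```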

Test for Population Variance

Sometimes it is of interest to draw inferences about the population variance. The distribution used is the chi-square distribution with n-1 degrees of freedom (where n = sample size), and

the test statistic is given by:

χ² = (n − 1)s² / σ0²,

where s² is the sample variance and the denominator, σ0², is the value of the variance stated in the null hypothesis.

Example: Heart Valves.

Without any sorting, the clearance was normally distributed with mean of 0.005 and standard deviation of 0.000283 (which implies a variance of 8 × 10⁻⁸).

One key indicator of process improvement is whether or not process variability has been reduced. In this case, we would look to see if the variance of the clearance dimension has been reduced by sorting.

The null hypothesis in this case is that the variance has not been reduced:

H0: σ² ≥ 8 × 10⁻⁸

A random sample of size 50 (after sorting by batches of 5 has been implemented) yields a sample variance of

2.308 × 10⁻⁹

Computing the test statistic yields:

χ² = (n − 1)s² / σ0² = 49 × 2.308 × 10⁻⁹ / (8 × 10⁻⁸) ≈ 1.41

The critical value is found (for an alpha of 0.001) by:

CHIINV(0.999, 49) = 23.98.

We would reject the null hypothesis for any value of the test statistic less than the critical value, 23.98. Since 1.41 < 23.98, we reject the null hypothesis and conclude that the variance has indeed been reduced.
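
A Python sketch of this lower-tailed variance test (SciPy assumed available):

```python
from scipy.stats import chi2

n, s2, sigma0_sq, alpha = 50, 2.308e-9, 8e-8, 0.001

test_stat = (n - 1) * s2 / sigma0_sq   # ≈ 1.41
crit = chi2.ppf(alpha, n - 1)          # lower-tail critical value, = CHIINV(0.999, 49) ≈ 23.98

# Lower-tailed test of H0: variance >= 8e-8; reject if the statistic falls below the critical value
print(test_stat, crit, "reject H0" if test_stat < crit else "fail to reject H0")
```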

Example: A machine makes small metal plates used in batteries. The plate diameter is a random variable with a mean of 5 mm. As long as the variance is at most 1.0, the production process is under control & the plates are acceptable. Otherwise, the machine must be repaired. The QC engineer wants, therefore, to test the following hypothesis:

H0: σ² ≤ 1.0

With a random sample of 31 plates, the sample variance is 1.62.

Solution: Computing the test statistic, we see that:

χ² = (n − 1)s² / σ0² = 30 × 1.62 / 1.0 = 48.6

  • For α = 0.05, the critical value is found by:

CHIINV(0.05, 30) = 43.77.

Since our test statistic lies to the right of the critical value, we would reject the null hypothesis.

  • The p-value is given by:

CHIDIST(48.6, 30) = 0.017257.

Thus, we would reject the null hypothesis for any value of α > 0.017257, and fail to reject the null hypothesis for smaller values.

Important Note: The use of the chi-square test on variance requires that the underlying population be normally distributed.

Chi-Square Test for Independence

It is often useful to have a statistical test that helps us determine whether or not two classification criteria, such as age and job performance, are independent of each other. The technique uses contingency tables, which are tables with cells corresponding to cross-classifications of attributes.

In marketing research, one place where the chi-square test for independence is frequently used, such tables are called cross-tabs. You will recall that we have previously used the pivot-table facility within EXCEL to produce contingency or cross-tabs tables from more unwieldy tabulations of raw data.

Example: A random sample of 100 firms is taken. For each firm, we record whether the company made or lost money in its most recent fiscal year, and whether the firm is a service or non-service company. A 2 X 2 contingency table summarizes the data.

Industry Type
Service / Non-service / Total
Profit / 42 / 18 / 60
Loss / 6 / 34 / 40
Total / 48 / 52 / 100

Using the information in the table, we want to investigate whether the two events:

  • the company made a profit in its most recent fiscal year, and
  • the company is in the service sector

are independent of each other.

Before stating the test, we need to develop a little bit of notation:

r = number of rows in the table

c = number of columns in the table

Oij = observed count of elements in cell (i, j)

Eij = expected count of elements in cell (i, j) assuming that the two variables are independent

Ri = total count for row i

Cj = total count for column j

The expected number of items in a cell is equal to the sample size, n, times the probability of the event signified by the particular cell. In the context of a contingency table, the probability associated with cell (i, j) is the joint probability of occurrence of both events. That is,

Eij = n × P(row event i and column event j).

From the definition of independence, it follows that this joint probability is the product of the two marginal probabilities, which we estimate by (Ri/n) and (Cj/n). Hence the expected count under independence is

Eij = n × (Ri/n) × (Cj/n) = Ri Cj / n.
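
As a sketch of where this computation leads (not necessarily the exact procedure the notes go on to use), SciPy's chi2_contingency (assumed available) computes the expected counts Eij = Ri Cj / n and the chi-square statistic for the 2 × 2 table above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the profit/loss vs. service/non-service table above
obs = np.array([[42, 18],
                [6, 34]])

chi2_stat, p_value, dof, expected = chi2_contingency(obs, correction=False)

print(expected)                  # expected counts E_ij = R_i * C_j / n
print(chi2_stat, dof, p_value)   # test statistic, degrees of freedom, p-value
```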