Script

TWO POPULATIONS – Matched Pairs

Slide 1

  • Welcome back. In this module we look at the situation of sampling from two populations where the sample values match up based on some characteristic. Examples include the amount of television watched by students before an exam and the amount watched by the same students after the exam; the salary of the husband in a household and the salary of the wife in the same household; or a store's sales in one year versus the same store's sales the next year. In each case there is something linking the i-th observation from one population to the i-th observation from the other population.

Slide 2

  • So in a matched pairs experiment
  • We take n observations from each of two populations in such a way that the i-th observation from the first population has something in common with the i-th observation from the second population.
  • It is this common element, such as the same date, same size, same person, same household, same store, etc., that is chosen at random, and this then dictates the observations selected from each population.
  • It is the differences, not the actual values of the observations, that are analyzed.
  • If it can be assumed that the differences form a normal population or if we take a large sample size, then t-tests can be performed and t-intervals can be constructed for the average value of these differences.
  • The important point about pairing is that it typically gives a fairer comparison of values from two populations and usually reduces the overall variance. When there is less variability, we can be more confident in our results.

Slide 3

  • We now discuss how to perform hypothesis tests and construct confidence intervals for experiments consisting of paired differences.
  • Suppose a random sample of size n sub d (d for differences) is taken from one population and the corresponding paired values are observed from the second population. The difference between observation i in the first population and observation i from the second population is recorded as d sub i. So there is a set of n sub d values for d sub i.
  • We can then calculate the following statistics.
  • d-bar which is the average value of the sample differences, s sub d-squared which is the sample variance of the differences and s sub d, which is the sample standard deviation of the differences.
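For anyone following along outside of Excel, here is a minimal sketch of these three statistics in Python. The paired values below are hypothetical and are not the data from this module's example; the variable names are also just illustrative.

```python
import math
import statistics

# Hypothetical paired observations (NOT the lecture's data):
# first[i] and second[i] share a common element, such as the same date.
first = [10, 12, 9, 14]
second = [8, 11, 10, 11]

# The differences d_i are what get analyzed, not the raw values.
d = [x - y for x, y in zip(first, second)]

n_d = len(d)                      # number of differences
d_bar = statistics.mean(d)        # sample mean of the differences
s_d_sq = statistics.variance(d)   # sample variance (divides by n_d - 1)
s_d = math.sqrt(s_d_sq)           # sample standard deviation

print(d, n_d, d_bar, s_d_sq, s_d)
```

Note that statistics.variance uses the n minus 1 divisor, which is the sample variance described above.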

Slide 4

  • The hypothesis test that one might do is
  • That the average difference is greater than, less than or different from a hypothesized value d.
  • The test statistic t then is
  • Defined in the usual way: point estimate minus the hypothesized value, d, divided by the length of a standard error.
  • The point estimate is d-bar, and the standard error is s sub d divided by the square root of n sub d, so the t-statistic is d-bar minus d, all divided by the quantity s sub d over the square root of n sub d.
  • And the confidence interval is also defined in the usual way
  • Point estimate plus or minus t alpha over 2 with the right number of degrees of freedom times the length of a standard error
  • Which is d-bar plus or minus t alpha over 2 times s sub d divided by the square root of n sub d.
  • The number of degrees of freedom for the hypothesis test and the confidence interval is the number of differences, n sub d, minus 1.
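Written out in symbols, the spoken formulas on this slide would look roughly as follows. This is a transcription of the definitions above, using d for the hypothesized value as in the script.

```latex
\[
H_0:\ \mu_d = d
\quad\text{versus}\quad
H_A:\ \mu_d > d,\ \ \mu_d < d,\ \text{ or }\ \mu_d \neq d
\]
\[
t = \frac{\bar{d} - d}{s_d / \sqrt{n_d}},
\qquad
\bar{d} \pm t_{\alpha/2,\; n_d - 1}\,\frac{s_d}{\sqrt{n_d}},
\qquad
\text{degrees of freedom} = n_d - 1
\]
```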

Slide 5

  • Let’s look at an example.
  • Suppose a company wishes to compare sales of a particular product at two of its branch stores: one in Anaheim and one in Irvine.
  • We want to know if we can conclude that the average daily sales of the product in the Anaheim store exceed those of the Irvine store by more than $200.
  • And we want to construct a 95% confidence interval for the difference in average daily sales of the product between the two stores.

Slide 6

  • One approach might be to
  • Record sales on seven randomly selected dates in Anaheim and seven randomly selected dates in Irvine
  • as shown here.
  • In this case there is nothing in common between, say, the first observation in Anaheim and the first observation in Irvine. The first observation in Anaheim occurred on December 15, while the first observation in Irvine occurred on November 30.
  • Since there is nothing in common, the approach we would take is the difference in two population means approach discussed previously, where we would begin by doing an F-test to see if we could assume that the variances of the two populations are equal, and then do the appropriate t-test, either assuming the variances are equal or assuming that they aren’t.
  • This is probably not a very good approach for this problem.

Slide 7

  • We could have been unlucky: the seven observations taken for Irvine could have all occurred near Christmas, Mother's Day and Father's Day, while the seven observations for Anaheim could have all occurred near tax day on April 15, or right after the new year, when spending is notoriously low, and so forth. This would give very biased results in favor of Irvine. A better approach would be not to select seven receipts at random from Anaheim and seven receipts at random from Irvine, but to select seven dates at random and see what the receipts for the product were on those seven dates in Anaheim and in Irvine.
  • We still have seven observations from each population, but the first observation in Anaheim has something in common with the first observation in Irvine –
  • they both occurred on November 25, and so forth.
  • For each of these dates we can calculate the difference between the Anaheim sales and the Irvine sales. These are the d sub i’s. It is not the values of sales in Anaheim and Irvine that are relevant, but these differences between the two.
  • We can then calculate the mean value of this sample of differences, d-bar, which is $400, and the standard deviation of this sample of differences, s sub d, which is $258.20.

Slide 8

  • The relevant hypothesis test is then
  • H0: mu-d equals 200 versus HA: mu-d is greater than 200.
  • We select our value of alpha to be .05
  • And the test becomes the t-test for a single population, only this one is for the population of differences. We reject H0 and accept HA if we get a t-statistic that exceeds the critical t-value of t sub .05 with 7 minus 1 or 6 degrees of freedom, which is 1.943.
  • The calculation for the t-statistic is d-bar, or 400, minus 200, divided by the standard error, which is s sub d, 258.2, divided by the square root of n sub d, 7. This gives a value of 2.049.
  • Since 2.049 is greater than the critical value of 1.943, there is enough evidence to conclude that average sales of the product in Anaheim exceed those in Irvine by more than $200 per day.
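The arithmetic on this slide can be checked with a short Python sketch. It uses only the summary values quoted in the example (d-bar of 400, s sub d of 258.2, n sub d of 7) and the critical value 1.943 stated above; the function name is just illustrative.

```python
import math

def paired_t_statistic(d_bar, s_d, n_d, hypothesized):
    """Return (d_bar - hypothesized) divided by the standard error s_d / sqrt(n_d)."""
    standard_error = s_d / math.sqrt(n_d)
    return (d_bar - hypothesized) / standard_error

# Summary values quoted in the example.
t_stat = paired_t_statistic(d_bar=400, s_d=258.2, n_d=7, hypothesized=200)

# Critical value t sub .05 with 6 degrees of freedom, as quoted in the example.
t_critical = 1.943

print(round(t_stat, 3))
print("reject H0" if t_stat > t_critical else "do not reject H0")
```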

Slide 9

  • The 95% confidence interval is
  • d-bar plus or minus t sub .025 with 7 minus 1 or 6 degrees of freedom times s sub d divided by the square root of n sub d.
  • Substituting the calculated values for d-bar and s sub d, we get
  • 400 plus or minus 238.8 or an interval from $161.20 to $638.80.
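As a rough check of this interval, here is a small Python sketch using the same summary values. The critical value 2.447 for t sub .025 with 6 degrees of freedom is not stated in the script; it is taken from a standard t table and should be treated as an assumption here, as should the function name.

```python
import math

def paired_t_interval(d_bar, s_d, n_d, t_critical):
    """Return (lower, upper) for d_bar plus or minus t_critical * s_d / sqrt(n_d)."""
    margin = t_critical * s_d / math.sqrt(n_d)
    return d_bar - margin, d_bar + margin

# Assumed table value for t sub .025 with 6 degrees of freedom.
t_025_6df = 2.447

lower, upper = paired_t_interval(d_bar=400, s_d=258.2, n_d=7, t_critical=t_025_6df)
margin = (upper - lower) / 2

print(round(margin, 1), round(lower, 2), round(upper, 2))
```

The margin here corresponds to the plus or minus 238.8 in the hand calculation.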

Slide 10

  • Now let’s see how the hypothesis test and confidence interval can be done in Excel.
  • For the hypothesis test
  • We select DATA ANALYSIS from the Tools menu and then choose t-Test Paired Two Sample for Means
  • We then look at the p-value for this test.
  • For confidence intervals
  • We first create a column of differences
  • Then select DATA ANALYSIS from the Tools menu and then choose DESCRIPTIVE STATISTICS. From the output the required confidence interval will be the mean plus or minus the Confidence Level.

Slide 11

  • Here is the spreadsheet with the seven entries for Anaheim in column B and the corresponding seven entries for Irvine in column C.
  • We go to the Tools menu and select DATA ANALYSIS
  • And from the DATA ANALYSIS menu we select t-Test Paired Two Sample for Means

Slide 12

  • In the dialogue box,
  • We enter column B from B1 to B8 for the Variable 1 Range, column C from C1 to C8 for the Variable 2 Range, and 200 for the Hypothesized Mean Difference
  • We check Labels since there are labels in cells B1 and C1
  • And we designate a cell in which we would like the output to begin.

Slide 13

  • From the output
  • We see from cell F11, that the p-value for the one-tail test is .043157.
  • Since .043157 is less than our alpha value of .05, we do conclude that average sales of the product in Anaheim exceed those in Irvine by more than $200 per day.
  • If we were doing a “not equal to” 2-tailed test (we weren’t, but if we were), the p-value for this test is also printed in the output in cell F13.
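To see where a p-value like this comes from without Excel, here is a sketch that approximates the upper-tail area of the t distribution by numerical integration, using only Python's standard library. It starts from the rounded summary values in the example rather than the raw data, so it should agree with Excel's output only to about three or four decimal places; the function names and the number of integration steps are illustrative choices.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_p(t_stat, df, steps=2000):
    """Approximate P(T > t_stat) for t_stat >= 0.

    Uses Simpson's rule (steps must be even) for the area between 0 and
    t_stat, then subtracts that area from 0.5 by symmetry.
    """
    h = t_stat / steps
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# t-statistic from the summary values in the example.
t_stat = (400 - 200) / (258.2 / math.sqrt(7))

p_one_tail = upper_tail_p(t_stat, df=6)
p_two_tail = 2 * p_one_tail   # for a "not equal to" test

print(round(p_one_tail, 4), round(p_two_tail, 4))
```

Comparing p_one_tail with alpha of .05 gives the same decision as comparing the t-statistic with the critical value.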

Slide 14

  • We now show how to use Excel to construct a 95% confidence interval for the difference in average daily sales of the product between Anaheim and Irvine.
  • We begin by going to the spreadsheet and
  • In column D we calculate the daily differences between the Anaheim and Irvine stores. We enter the formula “Equal to B2 minus C2” in cell D2 and drag this cell down to cell D8 to give the column of differences
  • Then we select DESCRIPTIVE STATISTICS from DATA ANALYSIS in the Tools menu
  • To produce the output shown in columns H and I.
  • The lower confidence limit is then the mean in cell I3 minus the Confidence Level in cell I16.
  • And the upper confidence limit is the mean in cell I3 plus the Confidence Level in cell I16.

Slide 15

  • Let’s review what we’ve discussed in this module.
  • We defined what is meant by a matched pairs experiment and
  • Pointed out that a matched pairs experiment typically results in a “more fair” test with less variability.
  • We emphasized that it is not the magnitudes of the data values that are important; it is the set of differences that is relevant in constructing confidence intervals and performing hypothesis tests.
  • And we showed how to perform hypothesis tests and construct confidence intervals for matched pairs experiments both
  • By hand and
  • By Excel

That’s it for this module. Do any assigned homework and I’ll be back to talk to you again next time.