Script

TWO POPULATIONS – Differences in Proportions

Slide 1

·  Welcome back. In this module we discuss how to perform hypothesis tests and construct confidence intervals for the difference in the proportion of the members of two different populations that possess some attribute.

Slide 2

·  The situation we are looking at here has two populations

·  Population 1

·  And Population 2

o  Population 1 might be a population of men and population 2 a population of women and we may wish to test if a greater proportion of men than women consider themselves Republican.

o  Or Population 1 might be a population of television shows produced in Cuba and Population 2 might be a population of television shows produced in China and we may wish to test if overall the proportion of good shows produced, as rated by a panel of television experts, is different between Cuba and China.

o  Or population 1 might be a population of workers with a high school education and population 2 might be a population of workers with a college education, and we might want to construct a confidence interval for the difference in the proportion of college graduates and high school graduates that earn more than $100,000 per year.

o  Or population 1 might be a population of students that take the management science course in a traditional format and population 2 might be a population of students that take the same course by distance learning, and we might be interested in detecting whether there is a difference in the proportion of the students that get an “A” in the course between the two modes of teaching.

·  Each person or item surveyed will answer either yes or no to the question. We record each yes as a 1 and each no as a 0, and then count the number of 1’s.


Slide 3

·  Let’s briefly look at the notation we will use in this module.

x1 will be the number of 1’s in a sample of size n1 from population 1 and

x2 will be the number of 1’s in a sample of size n2 from population 2

p1bar is the proportion of 1’s observed from the sample from population 1 so it is found by x1 over n1.

And q1bar is the proportion of 0’s observed from the sample from population 1 so it is found by 1 minus p1bar

p2bar is the proportion of 1’s observed from the sample from population 2 so it is found by x2 over n2.

And q2bar is the proportion of 0’s observed from the sample from population 2 so it is found by 1 minus p2bar

pbar, without any subscript, is the total proportion of 1’s from both populations and so it is the total number of 1’s, x1 plus x2, divided by the total number surveyed from both populations, n1 plus n2

qbar is the total proportion of 0’s from both populations and it is found by 1 minus pbar.
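
·  As a quick illustration of this notation, here is a minimal Python sketch. The function name sample_proportions and the counts passed to it are hypothetical and chosen only for illustration; they are not data from this module.

```python
def sample_proportions(x1, n1, x2, n2):
    """Compute the sample proportions defined in the notation above."""
    p1bar = x1 / n1                   # proportion of 1's in sample 1
    p2bar = x2 / n2                   # proportion of 1's in sample 2
    pbar = (x1 + x2) / (n1 + n2)      # overall proportion of 1's
    return {
        "p1bar": p1bar, "q1bar": 1 - p1bar,
        "p2bar": p2bar, "q2bar": 1 - p2bar,
        "pbar": pbar, "qbar": 1 - pbar,
    }


# Hypothetical counts: 30 ones out of 50, and 45 ones out of 100.
props = sample_proportions(x1=30, n1=50, x2=45, n2=100)
for name, value in props.items():
    print(name, round(value, 4))
```

Note that pbar is not the average of p1bar and p2bar unless the two sample sizes are equal; it weights each sample by its size.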

Slide 4

·  We now differentiate between the random variable indicating an individual response and a random variable indicating a proportion of 1’s.

·  Each individual response is either a

o  1 or a 0; a “yes” or a “no”

o  The random variable, X, which is the outcome of a single response of either a 1 which occurs with probability p or 0 which occurs with probability q, is what is called a Bernoulli random variable.

§  The mean of a Bernoulli distribution is p times 1 plus q times 0, or p. The variance of a Bernoulli random variable is p times 1 squared plus q times 0 squared, minus the square of the mean, p. So the variance is p minus p-squared. Factoring out the p, this is p times 1 minus p, or p times q.

·  From the central limit theorem, the proportion of n responses, in other words the average of n observations of a Bernoulli random variable,

o  Is approximately normally distributed with mean p

o  And variance equal to the variance of the Bernoulli distribution for a single observation, which is p times q, divided by the sample size n.
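
·  The arithmetic on this slide can be checked with a short Python sketch. The function names and the values p equal to .3 and n equal to 100 are assumptions made only for illustration.

```python
def bernoulli_moments(p):
    """Mean and variance of a single 0/1 response with P(1) = p."""
    q = 1 - p
    mean = p * 1 + q * 0                        # expected value
    variance = p * 1**2 + q * 0**2 - mean**2    # E[X^2] minus mean squared
    return mean, variance


def proportion_variance(p, n):
    """Variance of the sample proportion of n independent responses."""
    _, variance = bernoulli_moments(p)
    return variance / n


p, n = 0.3, 100   # hypothetical values
mean, variance = bernoulli_moments(p)
print(mean, variance, p * (1 - p))    # the variance matches p times q
print(proportion_variance(p, n))      # p times q divided by n
```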


Slide 5

·  In considering tests and intervals dealing with the differences in proportions, we observe that

·  The proportion of 1’s from a sample of size n1 from population 1 is

o  Distributed approximately normal

o  With mean equal to p1, which we don’t know

o  And variance p1 times q1 divided by n1 which we can’t calculate because we do not know the true values of p1 and q1.

·  Similarly, the proportion of 1’s from a sample of size n2 from population 2 is

o  Distributed approximately normal

o  With mean equal to p2, which also we don’t know

o  And variance p2 times q2 divided by n2 which we also can’t calculate because we do not know the true values of p2 and q2.

Slide 6

·  So what is the distribution of the random variable that represents the difference in the proportions from samples of size n1 from population 1 and of size n2 from population 2?

·  Its true mean is p1 minus p2 which we don’t know

·  And its true variance (remember you add the two variances) is p1q1 over n1 plus p2q2 over n2.

·  And thus its true standard deviation is the square root of p1q1 over n1 plus p2q2 over n2, which we cannot calculate because we don’t know the values for p1, q1, p2 and q2.

o  So what should we use for the standard deviation when performing hypothesis tests and constructing confidence intervals?

o  We use the observed values of p1bar, q1bar, p2bar, and q2bar in the sample for all cases except for the one hypothesis test of p1 minus p2 equals 0.

o  So in all these cases the appropriate standard error is the square root of p1bar times q1bar divided by n1 plus p2bar times q2bar divided by n2.

o  For the one hypothesis test that has the null hypothesis of p1 minus p2 equal to 0, we use

o  The square root of pbar times qbar times the quantity of 1 over n1 plus 1 over n2.
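
·  Written out in symbols, the quantities described on this slide are as follows. The labels on the two standard errors are ours, added for readability. The second form pools the two samples because, when the null hypothesis p1 minus p2 equals 0 is true, both populations share a single proportion, and pbar is the natural estimate of it.

```latex
\[
\operatorname{Var}(\bar{p}_1 - \bar{p}_2)
  = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}
\]
\[
\mathrm{SE}_{\text{general}}
  = \sqrt{\frac{\bar{p}_1 \bar{q}_1}{n_1} + \frac{\bar{p}_2 \bar{q}_2}{n_2}},
\qquad
\mathrm{SE}_{H_0:\, p_1 - p_2 = 0}
  = \sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
\]
```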


Slide 7

·  So here are the three cases. All hypothesis tests except p1 minus p2 equals 0 have a z-statistic calculated by the point estimate minus v, the hypothesized value of the difference, divided by the standard error, whereas for the hypothesis test of p1 minus p2 equals 0, the z-statistic is given by the point estimate minus 0 over the standard error. The confidence interval has the usual form of point estimate plus or minus z alpha over 2 times the standard error.

o  For all hypothesis tests, except those of p1 minus p2 equals 0, the point estimate is p1bar minus p2bar

o  And the standard error is the square root of p1bar times q1bar over n1 plus p2bar times q2bar over n2

o  For the hypothesis test of p1 minus p2 equal to 0, again the point estimate is p1bar minus p2bar

o  But this time the standard error is the square root of pbar qbar times the quantity 1 over n1 plus 1 over n2.

o  In confidence intervals, again the point estimate is p1bar minus p2bar

o  And the standard error is the square root of p1bar times q1bar over n1 plus p2bar times q2bar over n2.
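
·  The three cases can be sketched as a few small Python functions. This is one possible arrangement, not a prescribed implementation: the function names are hypothetical, and the counts in the usage lines are made up for illustration. It relies on NormalDist from the Python standard library for the z value.

```python
from math import sqrt
from statistics import NormalDist


def unpooled_se(p1bar, n1, p2bar, n2):
    """Standard error for tests with a nonzero hypothesized value and for intervals."""
    return sqrt(p1bar * (1 - p1bar) / n1 + p2bar * (1 - p2bar) / n2)


def pooled_se(x1, n1, x2, n2):
    """Standard error used only when testing p1 - p2 = 0."""
    pbar = (x1 + x2) / (n1 + n2)
    return sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))


def z_statistic(x1, n1, x2, n2, v=0.0):
    """(point estimate - v) / standard error, choosing the SE by case."""
    p1bar, p2bar = x1 / n1, x2 / n2
    if v == 0:
        se = pooled_se(x1, n1, x2, n2)
    else:
        se = unpooled_se(p1bar, n1, p2bar, n2)
    return (p1bar - p2bar - v) / se


def confidence_interval(x1, n1, x2, n2, confidence=0.95):
    """Point estimate plus or minus z(alpha/2) times the unpooled SE."""
    p1bar, p2bar = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * unpooled_se(p1bar, n1, p2bar, n2)
    diff = p1bar - p2bar
    return diff - margin, diff + margin


# Hypothetical counts: 30 of 50 in sample 1, 45 of 100 in sample 2.
print(z_statistic(30, 50, 45, 100))           # test of p1 - p2 = 0
print(z_statistic(30, 50, 45, 100, v=0.05))   # test of p1 - p2 = .05
print(confidence_interval(30, 50, 45, 100))   # 95% interval
```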

Slide 8

·  Let’s look at an example.

·  Midas wants to compare customer satisfaction percentages in New York and Los Angeles. In particular, it wants to answer three questions:

o  Can it conclude that a greater proportion of its Los Angeles customers than its New York customers are satisfied with their service?

o  Can it conclude that the difference in customer satisfaction rates between Los Angeles and New York is greater than .02?

o  What is a 95% confidence interval for the difference in the proportion of satisfied customers in Los Angeles as compared to those in New York?

·  It took surveys of customer satisfaction in Los Angeles and New York and found

o  350 out of 400 Los Angeles customers surveyed were satisfied with Midas’s service

o  And 160 out of 200 New York customers surveyed were satisfied with Midas’s service.


Slide 9

·  Let’s look at the first question, can we conclude that customer satisfaction in Los Angeles exceeds that of New York?

·  This is a hypothesis test of the form H0, p1 minus p2 equals 0, and HA, p1 minus p2 is greater than 0, where population 1 is Los Angeles and population 2 is New York.

·  This is a hypothesis test where the hypothesized value is 0 and thus we use the square root of pbar qbar times the quantity 1 over n1 plus 1 over n2 for the standard error.

·  Selecting alpha equal to .05,

·  The test becomes reject H0 and accept HA if we get a z-statistic that is greater than the critical z-value of z sub .05 or 1.645.

·  Here p1bar is 350 over 400 or .875 and p2bar is 160 over 200 or .8. pbar is 350 plus 160, or 510, over 400 plus 200, or 600, which is .85, so qbar is .15. Substituting these values into the formula we calculate a value for the z-statistic of 2.425.

·  Since 2.425 is greater than 1.645, we do have enough evidence to show that customer satisfaction in Los Angeles is greater than customer satisfaction in New York, based on the proportion of satisfied customers.
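
·  The calculation for this first test can be reproduced with a short Python sketch. The variable names are ours; the counts are the Midas survey results given above.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 350, 400   # satisfied / surveyed, Los Angeles (population 1)
x2, n2 = 160, 200   # satisfied / surveyed, New York (population 2)

p1bar, p2bar = x1 / n1, x2 / n2
pbar = (x1 + x2) / (n1 + n2)      # pooled proportion, 510 / 600
qbar = 1 - pbar

# H0: p1 - p2 = 0, so the pooled standard error applies.
se = sqrt(pbar * qbar * (1 / n1 + 1 / n2))
z = (p1bar - p2bar - 0) / se

z_critical = NormalDist().inv_cdf(0.95)   # z sub .05
reject_h0 = z > z_critical

print(round(z, 3), round(z_critical, 3), reject_h0)
```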

Slide 10

·  In the second case we wish to show that the difference in the proportion between Los Angeles and New York satisfied customers is greater than .02.

·  This is a test of H0, p1 minus p2 equals .02, versus HA, p1 minus p2 is greater than .02.

·  Since the hypothesized value is not 0, this time we use the square root of p1bar times q1bar divided by n1 plus p2bar times q2bar divided by n2 for the standard error.

·  After selecting a value of alpha of .05, the test again is

·  Reject H0 and accept HA if we get a z-statistic that is greater than the critical z-value of z sub .05 or 1.645.

·  This time using p1bar equal to .875, q1bar equal to .125, p2bar equal to .8 and q2bar equal to .2, the required z-statistic, calculated by p1bar minus p2bar minus .02 in the numerator divided by the square root of p1bar times q1bar divided by n1 plus p2bar times q2bar divided by n2 in the denominator, is 1.679.

·  Since 1.679 is also greater than 1.645, we can conclude that the proportion of satisfied customers in Los Angeles exceeds the proportion in New York by more than .02.
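
·  Here is the same kind of Python sketch for the second test. Again the variable names are ours and the counts are the Midas survey results; the only changes from the first test are the hypothesized value of .02 and the standard error built from the separate sample proportions.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 350, 400   # Los Angeles
x2, n2 = 160, 200   # New York
v = 0.02            # hypothesized difference under H0

p1bar, p2bar = x1 / n1, x2 / n2
q1bar, q2bar = 1 - p1bar, 1 - p2bar

# The hypothesized value is not 0, so no pooling is used.
se = sqrt(p1bar * q1bar / n1 + p2bar * q2bar / n2)
z = (p1bar - p2bar - v) / se

z_critical = NormalDist().inv_cdf(0.95)   # z sub .05
reject_h0 = z > z_critical

print(round(z, 3), reject_h0)
```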

Slide 11

·  Now let’s calculate a 95% confidence interval for the difference in the proportion of satisfied customers between Los Angeles and New York.

·  The confidence interval is p1bar minus p2bar plus or minus z sub .025 times the square root of p1bar times q1bar divided by n1 plus p2bar times q2bar divided by n2

·  Substituting the appropriate quantities, this is an interval of .075 plus or minus .064, or an interval from .011 to .139.
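
·  The interval can be checked the same way. This is a minimal Python sketch with our own variable names, using the Midas survey counts.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 350, 400   # Los Angeles
x2, n2 = 160, 200   # New York

p1bar, p2bar = x1 / n1, x2 / n2
se = sqrt(p1bar * (1 - p1bar) / n1 + p2bar * (1 - p2bar) / n2)

z_025 = NormalDist().inv_cdf(0.975)   # z sub .025 for 95% confidence
margin = z_025 * se

point_estimate = p1bar - p2bar
lower, upper = point_estimate - margin, point_estimate + margin

print(round(point_estimate, 3), round(margin, 3))
print(round(lower, 3), round(upper, 3))
```

Because the whole interval lies above 0, it agrees with the conclusion of the first test that the Los Angeles proportion is the larger one.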


Slide 12

·  Let’s show how to perform these calculations using Excel.

·  To get the total number of observations in a column we use the COUNTA function. COUNTA of A2 to A401 says that n1 is 400 and COUNTA of B2 to B201 says that n2 is 200. Summing these two values in cell F3 gives the total customers surveyed of 600.

·  To get the number of “Yes” responses in columns A and B we use the COUNTIF function. This function has two arguments. The first is the range of cells we are searching (A2 to A401 in the first case, and B2 to B201 in the second). The second argument, which appears in quotes, is the value we are trying to count; in this case we are counting the “YES” entries. In cell F7 we add the “YES” entries and find there are 510 total “Yes” responses from the combined surveys in Los Angeles and New York.

·  The observed proportions of “Yes’s” from each population and the overall proportion of “yes’s” in the surveys are found by dividing the number of yes’s by the total number in each population. F5 over F1 gives p1bar, F6 over F2 gives p2bar, and F7 over F3 gives pbar.

·  The z-statistic for the first case is found by p1bar (cell F9) minus p2bar (cell F10) minus 0, divided by the square root of pbar (cell F11) times qbar, which is 1 minus pbar (or 1 minus cell F11), times the quantity 1 over n1 (cell F1) plus 1 over n2 (cell F2). Since this is a “greater than” test, the p-value is the area to the right of this value, which is given by 1 minus NORMSDIST of the z-statistic in cell E14. Since this value of .007647 is much lower than .05, then YES, we can conclude that p1 minus p2 is greater than 0.

·  The z-statistic for the second case is found by p1bar (cell F9) minus p2bar (cell F10) minus .02, divided by the square root of p1bar (cell F9) times q1bar (1 minus cell F9) over n1 (cell F1) plus p2bar (cell F10) times q2bar (1 minus cell F10) over n2 (cell F2). Since this is a “greater than” test, the p-value is the area to the right of this value, which is given by 1 minus NORMSDIST of the z-statistic in cell E19. Since this value of .046605 is lower than .05, then YES, we can conclude that p1 minus p2 is greater than .02.

·  Finally, the lower confidence limit in cell E24 is p1bar (cell F9) minus p2bar (cell F10) minus z sub .025, which is obtained by NORMSINV of .975, times the square root of p1bar (cell F9) times q1bar (1 minus cell F9) over n1 (cell F1) plus p2bar (cell F10) times q2bar (1 minus cell F10) over n2 (cell F2). The upper confidence limit can be obtained by making the cell references in cell E24 absolute by using the F4 function key, dragging this formula down to cell E25, and changing the sign in front of NORMSINV from minus to plus.
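
·  For those working outside Excel, the same worksheet steps can be sketched in Python. The two lists below are stand-ins for columns A and B: the individual survey rows are not part of this script, so they are rebuilt here from the reported counts, which is an assumption made only for illustration. len plays the role of COUNTA, the list count method plays the role of COUNTIF, and the cdf and inv_cdf methods of NormalDist play the roles of NORMSDIST and NORMSINV.

```python
from math import sqrt
from statistics import NormalDist

# Stand-ins for columns A and B, rebuilt from the reported counts.
los_angeles = ["YES"] * 350 + ["NO"] * 50
new_york = ["YES"] * 160 + ["NO"] * 40

n1, n2 = len(los_angeles), len(new_york)    # like COUNTA
x1 = los_angeles.count("YES")               # like COUNTIF(range, "YES")
x2 = new_york.count("YES")

p1bar, p2bar = x1 / n1, x2 / n2
pbar = (x1 + x2) / (n1 + n2)
std_normal = NormalDist()

# Case 1: H0 p1 - p2 = 0, pooled standard error.
z1 = (p1bar - p2bar - 0) / sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))
p_value_1 = 1 - std_normal.cdf(z1)          # like 1 - NORMSDIST(z)

# Case 2: H0 p1 - p2 = .02, unpooled standard error.
se = sqrt(p1bar * (1 - p1bar) / n1 + p2bar * (1 - p2bar) / n2)
z2 = (p1bar - p2bar - 0.02) / se
p_value_2 = 1 - std_normal.cdf(z2)

# 95% confidence limits.
z_025 = std_normal.inv_cdf(0.975)           # like NORMSINV(.975)
lower = (p1bar - p2bar) - z_025 * se
upper = (p1bar - p2bar) + z_025 * se

print(round(p_value_1, 6), round(p_value_2, 6))
print(round(lower, 3), round(upper, 3))
```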


Slide 13

·  We conclude this module with a discussion of how to determine the appropriate sample sizes for a difference in proportions model.

·  The typical assumptions are that

o  The same number will be surveyed from each population, although this need not be the case,

o  And we will use the worst case scenario, in other words use the values for the p’s and q’s, that make the variance as large as possible. We do this unless we have some better idea about what the p’s and q’s should be.
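
·  To see what the worst case scenario means in practice, here is a hedged Python sketch. The first function simply searches for the value of p that makes p times q largest. The second is an assumption on our part: it solves the margin of error from the confidence interval formula, z times the square root of p1q1 over n plus p2q2 over n, for a common sample size n. The function names and the margin of .05 are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def worst_case_p():
    """Search a grid of p values for the one that maximizes p * q."""
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda p: p * (1 - p))


def sample_size_per_group(margin, confidence=0.95, p1=0.5, p2=0.5):
    """Common n per population so that z * SE is at most `margin`.

    Assumes equal sample sizes; the defaults p1 = p2 = .5 are the worst case.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / margin**2
    return ceil(n)   # round up so the margin is not exceeded


print(worst_case_p())                                   # p that maximizes p * q
print(sample_size_per_group(0.05))                      # worst case
print(sample_size_per_group(0.05, p1=0.875, p2=0.8))    # with prior estimates
```

With a better idea of the p’s and q’s, such as the proportions observed in the Midas surveys, the required sample size is smaller than in the worst case.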