STP 420 SUMMER 2005

STP 420

INTRODUCTION TO APPLIED STATISTICS

NOTES

PART 2 – PROBABILITY AND INFERENCE

CHAPTER 7

INFERENCE FOR DISTRIBUTIONS

7.1 Inference for the Mean of a Population

The t distributions – density curve

- symmetric about the mean 0

- area under the curve is 1

- shape similar to the standard normal curve

- has mean 0; its standard deviation is larger than 1 and decreases toward 1 as the sample size (and so the degrees of freedom) increases

- as n becomes large the t curve approaches the N(0, 1) curve

- more appropriate since the σ of a population is rarely known.
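
The way the spread of the t curve shrinks toward that of N(0, 1) can be checked numerically. The sketch below relies on the standard result that a t distribution with df > 2 degrees of freedom has standard deviation √(df / (df – 2)); the function name `t_std_dev` is only illustrative.

```python
import math

def t_std_dev(df):
    """Standard deviation of the t distribution with df degrees of freedom."""
    if df <= 2:
        # The variance of the t distribution is not finite for df <= 2.
        raise ValueError("the standard deviation is finite only for df > 2")
    return math.sqrt(df / (df - 2))

# The standard deviation decreases toward 1, the value for N(0, 1).
for df in (3, 5, 10, 30, 100, 1000):
    print(f"df = {df:5d}   sd = {t_std_dev(df):.4f}")
```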

If we sample from a population where the standard deviation σ is not known, then we have to estimate it using the sample standard deviation s. The z statistic is then no longer valid, and we use a more appropriate statistic (the t statistic).

Standard error (SE) of a statistic – the standard deviation of the statistic when it is estimated from the data. For the sample mean, SE = s/√n.

The t distributions

Suppose that an SRS of size n is drawn from an N(μ, σ) population. Then the one-sample t statistic

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

has the t distribution with n – 1 degrees of freedom.

degrees of freedom – always n – 1 for the one-sample t statistic: the deviations from the sample mean sum to 0, so only n – 1 of them are free to vary
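
As a minimal sketch, the one-sample t statistic (sample mean minus μ, divided by the standard error s/√n) can be computed with the Python standard library. The data values and the function name `one_sample_t` are made up for illustration.

```python
import math
import statistics

def one_sample_t(data, mu):
    """Return (t, df) for the one-sample t statistic against the mean mu."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)      # sample standard deviation (divides by n - 1)
    se = s / math.sqrt(n)           # standard error of the sample mean
    return (xbar - mu) / se, n - 1

# Hypothetical sample of 8 measurements.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.2]
t, df = one_sample_t(sample, mu=5.0)
print(f"t = {t:.3f} with {df} degrees of freedom")
```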

The One-Sample t Confidence Interval

Suppose that an SRS of size n is drawn from a population having unknown mean μ. A level C confidence interval for μ is

$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$

where t* is the value for the t(n – 1) density curve with area C between –t* and t*.

This interval is exact when the population distribution is normal and is approximately correct for large n in other cases (non normal populations).

Margin of error – t* s/√n. It has the same structure as when we used the N(0, 1) distribution, except that we replace z* with t* and σ with s.

We can report the confidence interval itself, or we can report the sample mean (the center of the interval) together with the margin of error (half the width of the interval).
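
The interval, sample mean ± t* s/√n, can be sketched as follows. The critical value t* is passed in rather than computed, so it has to be looked up in a t table for n – 1 degrees of freedom; the value 2.365 below is the usual table entry for 95% confidence with 7 degrees of freedom. The data and the function name are hypothetical.

```python
import math
import statistics

def t_confidence_interval(data, t_star):
    """Return (lower, upper, margin) for the one-sample t interval.

    t_star must be the t(n - 1) critical value for the chosen level C,
    taken from a table or from software.
    """
    n = len(data)
    xbar = statistics.mean(data)
    margin = t_star * statistics.stdev(data) / math.sqrt(n)
    return xbar - margin, xbar + margin, margin

# Hypothetical sample of 8 measurements, so df = 7.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.2]
T_STAR_95_DF7 = 2.365   # table value: 95% confidence, 7 degrees of freedom
low, high, margin = t_confidence_interval(sample, T_STAR_95_DF7)
print(f"95% CI: ({low:.3f}, {high:.3f}); margin of error {margin:.3f}")
```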

The One-Sample t Test

Suppose that an SRS of size n is drawn from a population having unknown mean μ. To test the hypothesis H0 : μ = μ0, compute the one-sample t statistic

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

In terms of a random variable T having the t(n – 1) distribution, the P-value for a test of H0 against

Ha : μ > μ0 is P(T ≥ t)

Ha : μ < μ0 is P(T ≤ t)

Ha : μ ≠ μ0 is 2P(T ≥ |t|)

These P-values are exact if the population distribution is normal and are approximately correct for large n in other cases.
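
The three P-values can be sketched with the standard library alone. This is an illustration under stated assumptions, not how statistical software does it: the t(df) cumulative probability is obtained by integrating the t density numerically with Simpson's rule, and the function names and the `alternative` labels are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    # math.gamma overflows for very large arguments, so keep df below about 300.
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry about 0."""
    a = abs(t)
    h = a / steps                    # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3             # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_test_p_value(t, df, alternative):
    """P-value for the observed t statistic under the stated alternative."""
    if alternative == "greater":     # Ha: mean greater than the null value
        return 1 - t_cdf(t, df)
    if alternative == "less":        # Ha: mean less than the null value
        return t_cdf(t, df)
    if alternative == "two-sided":   # Ha: mean differs from the null value
        return 2 * (1 - t_cdf(abs(t), df))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

print(t_test_p_value(2.1, 9, "two-sided"))
```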

It is wrong to look at the data first and then decide to do a one-sided test instead of a two-sided test. If you have no prior knowledge suggesting the direction of the difference, use a two-sided test.

Matched pairs t procedures

Subjects are matched in pairs.

E.g., the differences between pretest and posttest scores for the same set of individuals form a single data set that can be analyzed with the one-sample t procedures described above.

1. A matched pairs analysis is needed when there are two measurements or observations on each individual and we want to examine the change from the first to the second. Before-and-after measurements are a common example.

2. For each individual, compute after minus before.

3. Analyze the differences using the one-sample confidence interval and significance-testing procedures.
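
The three steps above can be sketched as follows: form the after-minus-before differences, then treat them as a single sample and compute the one-sample t statistic for a null mean difference of 0. The scores and the function name `matched_pairs_t` are made up for illustration.

```python
import math
import statistics

def matched_pairs_t(before, after):
    """Return (differences, t, df) for testing that the mean difference is 0."""
    if len(before) != len(after):
        raise ValueError("each individual needs both a before and an after value")
    diffs = [a - b for b, a in zip(before, after)]   # after minus before
    n = len(diffs)
    dbar = statistics.mean(diffs)
    s = statistics.stdev(diffs)
    t = dbar / (s / math.sqrt(n))    # one-sample t statistic on the differences
    return diffs, t, n - 1

# Hypothetical pretest and posttest scores for five individuals.
pretest = [10, 12, 9, 11, 8]
posttest = [11, 14, 12, 15, 13]
diffs, t, df = matched_pairs_t(pretest, posttest)
print(diffs, f"t = {t:.3f}", f"df = {df}")
```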

Robustness of the t procedures

robust – procedures that are not strongly affected by non-normality

Robust Procedures

- A statistical inference procedure is robust if the probability calculations required are insensitive to violations of the assumptions made.

Some practical guidelines

1. Sample size < 15: Use t procedures if the data are close to normal. Do not use them if the data are clearly nonnormal or if outliers are present.

2. Sample size ≥ 15: Use t procedures except when outliers or strong skewness are present.

3. Large samples (n ≥ 40): t procedures can be used even for clearly skewed distributions.

The Sign Test

When populations are nonnormal, a straightforward option is to use distribution-free procedures (tests), which do not assume a particular form for the population distribution.

They have two drawbacks:

1. They are less powerful than tests designed for a specific distribution, such as the t test.

2. The hypotheses often need to be modified to use distribution-free tests.

Distribution-free tests – stated in terms of median rather than the mean.

- good when distribution is skewed

Sign test – simplest distribution-free test

- based on counts and the binomial distribution

H0 : p = ½ is equivalent to H0 : population median = 0

Ha : p > ½ is equivalent to Ha : population median > 0

p is the probability of improvement from, say, a pretest to a posttest

p = ½ implies no improvement

p > ½ implies improvement

The Sign Test for Matched Pairs

Ignore pairs with difference 0; the number of trials n is the count of the remaining pairs. The test statistic is the count X of pairs with a positive difference. P-values for X are based on the binomial B(n, ½) distribution.

Considering the pretest/posttest experiment, the sign test tests the hypothesis that the median of the differences between the pretest and posttest scores is 0.

The sign test does not use the actual scores; it uses only the count of improvements (pairs in which the posttest score minus the pretest score is greater than 0).
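
A minimal sketch of the sign test for matched pairs with the one-sided alternative Ha : p > ½: drop the zero differences, count the positive ones, and add up the B(n, ½) probabilities of a count at least that large. The differences and the function name are hypothetical.

```python
import math

def sign_test_p_value(differences):
    """Return (x, n, p_value) for the sign test against Ha: p > 1/2."""
    nonzero = [d for d in differences if d != 0]   # ignore pairs with difference 0
    n = len(nonzero)                               # number of trials
    x = sum(1 for d in nonzero if d > 0)           # count of positive differences
    # P(X >= x) when X has the binomial B(n, 1/2) distribution.
    p_value = sum(math.comb(n, k) for k in range(x, n + 1)) / 2 ** n
    return x, n, p_value

# Hypothetical posttest-minus-pretest differences for 12 individuals.
diffs = [1, 2, 0, 3, -1, 4, 5, 0, 6, -2, 7, 8]
x, n, p = sign_test_p_value(diffs)
print(f"X = {x} positive differences out of n = {n}; P-value = {p:.4f}")
```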

7.2 Comparing Two Means

Two-Sample Problems

1. The goal of inference is to compare the responses in two groups.

2. Each group is a sample from a distinct population.

3. The responses in each group are independent of those in the other group.

The two-sample z statistic (σ1 and σ2 are known)

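Suppose that x̄1 is the mean of an SRS of size n1 drawn from an N(μ1, σ1) population and that x̄2 is the mean of an independent SRS of size n2 drawn from an N(μ2, σ2) population. Then the two-sample z statistic

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

has the standard normal N(0, 1) sampling distribution.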

The two-sample t significance test (σ1 and σ2 are unknown)

Suppose that an SRS of size n1 is drawn from a normal population with unknown mean μ1 and that an independent SRS of size n2 is drawn from another normal population with unknown mean μ2. To test the hypothesis H0 : μ1 = μ2, compute the two-sample t statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

and use P-values or critical values for the t(k) distribution, where the degrees of freedom k are either approximated by software or are the smaller of n1 – 1 and n2 – 1.
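
The sketch below computes the two-sample t statistic (difference of sample means divided by the square root of s1²/n1 + s2²/n2) together with both choices of degrees of freedom. For the software option it assumes the Satterthwaite (Welch) approximation, which is what statistical packages commonly use; the notes do not say which approximation is meant. The data and function name are hypothetical.

```python
import math
import statistics

def two_sample_t(sample1, sample2):
    """Return (t, conservative df, Satterthwaite df) for H0: mu1 = mu2."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / math.sqrt(v1 + v2)
    df_conservative = min(n1, n2) - 1        # smaller of n1 - 1 and n2 - 1
    # Assumed software approximation (Satterthwaite / Welch).
    df_software = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df_conservative, df_software

# Hypothetical responses from two independent groups.
group1 = [23.1, 25.4, 22.8, 26.0, 24.3, 25.1]
group2 = [20.2, 21.9, 23.0, 19.8, 22.4]
t, k_small, k_soft = two_sample_t(group1, group2)
print(f"t = {t:.3f}; conservative df = {k_small}; approximate df = {k_soft:.2f}")
```

The conservative choice never exceeds the approximate value, so it gives P-values that are, if anything, too large.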

The Two-Sample t Confidence Interval

Suppose that an SRS of size n1 is drawn from a normal population with unknown mean μ1 and that an independent SRS of size n2 is drawn from another normal population with unknown mean μ2. The confidence interval for μ1 – μ2 given by

$$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

has confidence level at least C no matter what the population standard deviations may be. Here t* is the value for the t(k) density curve with area C between –t* and t*. The value of the degrees of freedom k is either approximated by software or taken to be the smaller of n1 – 1 and n2 – 1.
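
A minimal sketch of the two-sample interval, (difference of sample means) ± t* times the square root of s1²/n1 + s2²/n2. As with the one-sample case, t* is supplied by the caller from a t table for k degrees of freedom; the value 2.776 below is the usual table entry for 95% confidence with k = 4, the smaller of n1 – 1 and n2 – 1 for the made-up data.

```python
import math
import statistics

def two_sample_t_interval(sample1, sample2, t_star):
    """Return (lower, upper) for the two-sample t interval for mu1 - mu2.

    t_star must be the t(k) critical value for the chosen level C.
    """
    n1, n2 = len(sample1), len(sample2)
    se = math.sqrt(statistics.variance(sample1) / n1
                   + statistics.variance(sample2) / n2)
    center = statistics.mean(sample1) - statistics.mean(sample2)
    margin = t_star * se
    return center - margin, center + margin

# Hypothetical responses; the smaller sample has 5 observations, so k = 4.
group1 = [23.1, 25.4, 22.8, 26.0, 24.3, 25.1]
group2 = [20.2, 21.9, 23.0, 19.8, 22.4]
T_STAR_95_DF4 = 2.776   # table value: 95% confidence, 4 degrees of freedom
low, high = two_sample_t_interval(group1, group2, T_STAR_95_DF4)
print(f"95% CI for mu1 - mu2: ({low:.2f}, {high:.2f})")
```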

Robustness of the two-sample procedures

Two-sample procedures are more robust than the one-sample procedures. If the two populations have the same shape, samples as small as about 5 in each group are acceptable; if the populations have different shapes, larger samples are needed. Equal sample sizes are preferable.

Inference for small samples

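We need to be very careful with small samples.

There are not enough observations for boxplots or normal quantile plots to show the shape of the distribution, so the normality assumption is hard to check.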