Stat 112 D. Small

Review of Ideas from Lectures 13-19

I. Comparisons Among Several Groups – One Way Layout

Data: Samples of size $n_1, \ldots, n_I$ from several populations 1, 2, ..., I with means $\mu_1, \ldots, \mu_I$.

1. Ideal model used for making inferences

  1. Random, independent samples from each population.
  2. All populations have the same standard deviation ($\sigma$).
  3. Each population is normal.

2. Planned comparison of two means.

A. Use the usual t-tools but estimate $\sigma$ by the pooled standard deviation $s_p$, a weighted average of the sample standard deviations in all groups: $s_p = \sqrt{\dfrac{(n_1-1)s_1^2 + \cdots + (n_I-1)s_I^2}{n-I}}$. The estimate $s_p$ has $n-I$ degrees of freedom, where $n = n_1 + \cdots + n_I$. In JMP, $s_p$ is the Root Mean Square Error.

B. 95% confidence interval for $\mu_i - \mu_j$: $\bar{Y}_i - \bar{Y}_j \pm t_{n-I}(0.975)\, s_p \sqrt{\tfrac{1}{n_i} + \tfrac{1}{n_j}}$.
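
For concreteness, here is a short Python sketch of a planned comparison using the pooled SD (the course software is JMP; the three groups and their values below are hypothetical):

    # Planned comparison of two group means using the pooled SD (hypothetical data).
    import numpy as np
    from scipy import stats

    groups = [np.array([23.0, 25.1, 27.3, 24.8]),
              np.array([30.2, 28.7, 31.5, 29.9, 30.4]),
              np.array([26.1, 24.9, 25.5, 27.0])]

    n = sum(len(g) for g in groups)          # total sample size
    I = len(groups)                          # number of groups
    # Pooled variance: weighted average of the group sample variances, df = n - I
    sp2 = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - I)
    sp = np.sqrt(sp2)

    i, j = 0, 1                              # compare group 1 and group 2
    diff = groups[i].mean() - groups[j].mean()
    se = sp * np.sqrt(1 / len(groups[i]) + 1 / len(groups[j]))
    tcrit = stats.t.ppf(0.975, df=n - I)

    print(f"estimate = {diff:.2f}, 95% CI = ({diff - tcrit*se:.2f}, {diff + tcrit*se:.2f})")
    print(f"t = {diff/se:.2f}, two-sided p = {2*stats.t.sf(abs(diff/se), df=n - I):.4f}")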

3. One-way ANOVA F-test

  1. Is there any difference among the means? Test $H_0: \mu_1 = \mu_2 = \cdots = \mu_I$ vs. $H_a$: at least two of the means are not equal.

  2. The F test statistic measures how much the sample means vary relative to the within-group variability (the pooled estimate of $\sigma$). Large values of F are implausible under $H_0$.
  3. The F test in JMP is reported in the Analysis of Variance table.
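
A rough Python illustration of the one-way ANOVA F-test, using scipy's f_oneway on the same hypothetical groups as above:

    # One-way ANOVA F-test (hypothetical data).
    import numpy as np
    from scipy import stats

    groups = [np.array([23.0, 25.1, 27.3, 24.8]),
              np.array([30.2, 28.7, 31.5, 29.9, 30.4]),
              np.array([26.1, 24.9, 25.5, 27.0])]

    F, p = stats.f_oneway(*groups)
    print(f"F = {F:.2f}, p-value = {p:.4f}")  # small p-value: evidence against H0 of equal means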

4. Robustness of inferences to assumptions

  1. Normality is not critical. Nonnormality is only a problem if the distributions are extremely long-tailed or skewed and the sample size in a group is <30.
  2. Assumptions of random sampling and independence of observations are critical.
  3. The assumption of equal standard deviations in the populations is critical. Rule of thumb: check whether the largest sample standard deviation divided by the smallest sample standard deviation is <2.
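
A quick sketch of this rule-of-thumb check, again with hypothetical data:

    # Rule-of-thumb check: largest sample SD / smallest sample SD < 2?
    import numpy as np

    groups = [np.array([23.0, 25.1, 27.3, 24.8]),
              np.array([30.2, 28.7, 31.5, 29.9, 30.4]),
              np.array([26.1, 24.9, 25.5, 27.0])]

    sds = [g.std(ddof=1) for g in groups]     # sample SD in each group
    print(max(sds) / min(sds))                # < 2 -> equal-SD assumption looks reasonable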

5. Multiple comparisons

  1. Family of tests: When several tests (or confidence intervals) are considered simultaneously, they constitute a family of tests.
  2. Unplanned comparisons: In a one-way layout, we are often interested in comparing all pairs of groups. The family of tests consists of all possible pairwise comparisons, I*(I-1)/2 tests for a one-way layout with I groups.
  3. Individual Type I error rate: Probability for a single test that the null hypothesis will be rejected assuming that the null hypothesis is true.
  4. Familywise Type I error rate: Probability for a family of tests that at least one null hypothesis will be rejected assuming that all of the null hypotheses are true. The familywise Type I error rate will be at least as large as, and usually considerably larger than, the individual Type I error rate when the family consists of more than one test.
  5. Multiple comparisons procedure: Seeks to make sure that the familywise Type I error rate is kept at a tolerable level.
  6. Tukey-Kramer procedure: Multiple comparisons procedure designed specifically for the family of unplanned pairwise comparisons in a one-way layout. Instead of rejecting $H_0: \mu_i = \mu_j$ vs. $H_a: \mu_i \neq \mu_j$ at the 0.05 level if $|t| > 2$ (approximately), as for a planned comparison, reject if $|t| > q^*$ where $q^*$ is greater than 2. In JMP, reject if the entry in “Comparison of All Pairs Using Tukey’s HSD” is positive.
  7. Bonferroni procedure (see the sketch after this list): A general method for doing multiple comparisons for any family of k tests.
     a. Denote the familywise error rate we want by $p^*$.
     b. Compute the p-value for each test -- $p_1, \ldots, p_k$.
     c. Reject the null hypothesis for the ith test if $p_i \leq p^*/k$.
     d. This guarantees that the familywise Type I error rate of the k tests is at most $p^*$.
  8. Multiplicity: A general problem in data analysis is that we may be implicitly looking at many things and only notice the interesting patterns. To deal with multiplicity properly, we should determine what our family of tests is (the family of all comparisons we have considered) and use a multiple comparisons procedure such as Bonferroni. Multiplicity makes it harder to detect when a null hypothesis is false because it raises the bar for rejecting any particular null hypothesis. One way around the multiplicity problem is to design a new study to search specifically for a pattern that was suggested by an exploratory data analysis. This converts an unplanned comparison into a planned comparison.
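
A short Python sketch of the Bonferroni procedure applied to the family of all pairwise comparisons in a one-way layout (the group data and the choice $p^* = 0.05$ are made up for illustration):

    # Bonferroni procedure for all I*(I-1)/2 pairwise comparisons (hypothetical data).
    import itertools
    import numpy as np
    from scipy import stats

    groups = [np.array([23.0, 25.1, 27.3, 24.8]),
              np.array([30.2, 28.7, 31.5, 29.9, 30.4]),
              np.array([26.1, 24.9, 25.5, 27.0])]

    n, I = sum(len(g) for g in groups), len(groups)
    sp2 = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - I)

    pairs = list(itertools.combinations(range(I), 2))
    k = len(pairs)                       # number of tests in the family
    p_star = 0.05                        # desired familywise Type I error rate

    for i, j in pairs:
        diff = groups[i].mean() - groups[j].mean()
        se = np.sqrt(sp2 * (1 / len(groups[i]) + 1 / len(groups[j])))
        t = diff / se
        p = 2 * stats.t.sf(abs(t), df=n - I)
        print(f"groups {i+1} vs {j+1}: p = {p:.4f}, "
              f"reject H0 at familywise 0.05? {p <= p_star / k}")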

6. Linear combinations of group means:

  1. Parameter of interest: $\gamma = C_1\mu_1 + C_2\mu_2 + \cdots + C_I\mu_I$.
  2. Point estimate: $g = C_1\bar{Y}_1 + C_2\bar{Y}_2 + \cdots + C_I\bar{Y}_I$.
  3. Standard error: $SE(g) = s_p\sqrt{\dfrac{C_1^2}{n_1} + \cdots + \dfrac{C_I^2}{n_I}}$.
  4. 95% confidence interval for $\gamma$: $g \pm t_{n-I}(0.975)\,SE(g)$.
  5. Test of $H_0: \gamma = 0$ vs. $H_a: \gamma \neq 0$. For a level .05 test, reject $H_0$ if and only if 0 does not belong to the 95% confidence interval.
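
A minimal Python sketch of inference for a linear combination of group means, assuming hypothetical data and hypothetical coefficients C = (1, -1/2, -1/2), i.e., comparing group 1 to the average of groups 2 and 3:

    # Inference for a linear combination of group means (hypothetical data).
    import numpy as np
    from scipy import stats

    groups = [np.array([23.0, 25.1, 27.3, 24.8]),
              np.array([30.2, 28.7, 31.5, 29.9, 30.4]),
              np.array([26.1, 24.9, 25.5, 27.0])]
    C = np.array([1.0, -0.5, -0.5])          # coefficients defining gamma

    n, I = sum(len(g) for g in groups), len(groups)
    sp = np.sqrt(sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - I))

    g_hat = sum(c * grp.mean() for c, grp in zip(C, groups))              # point estimate g
    se = sp * np.sqrt(sum(c**2 / len(grp) for c, grp in zip(C, groups)))  # SE(g)
    tcrit = stats.t.ppf(0.975, df=n - I)

    print(f"g = {g_hat:.2f}, 95% CI = ({g_hat - tcrit*se:.2f}, {g_hat + tcrit*se:.2f})")
    # Level .05 test of H0: gamma = 0 -- reject iff 0 is outside the interval.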

II. Chi-Squared Goodness of Fit Test for Nominal Data

  1. Nominal data: Data that place an individual into one of several categories, e.g., color of M&M, candidate person voted for.
  2. Population of nominal data: A population with k categories can be described by the proportion in each category: $p_1$ in category 1, $p_2$ in category 2, ..., $p_k$ in category k ($p_1 + \cdots + p_k = 1$).
  3. One sample test for nominal data: Analogue of the one sample problem for an interval population. Take a random sample of size n from a population of nominal data. We want to test $H_0: p_1 = p_1^0, \ldots, p_k = p_k^0$ vs. $H_a$: at least one of $p_i \neq p_i^0$ (i = 1, ..., k).
  4. Chi-squared test: Method for doing a one sample test for nominal data. Based on comparing the expected frequencies under $H_0$ to the observed frequencies via $X^2 = \sum \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$. A large test statistic is evidence against $H_0$. The test is only valid if the expected frequencies in each category are 5 or more. When necessary, categories should be combined in order to satisfy this condition.
  5. Chi-square test in JMP: We use the Pearson chi-square test in JMP.
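
A short Python sketch of the chi-squared goodness-of-fit test (the counts and hypothesized proportions below are made up; in the course this is done with JMP's Pearson chi-square test):

    # Chi-squared goodness-of-fit test for nominal data (hypothetical counts).
    import numpy as np
    from scipy import stats

    observed = np.array([29, 21, 18, 12, 10, 10])       # counts in k = 6 categories
    p0 = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])       # hypothesized proportions under H0
    expected = observed.sum() * p0                       # expected frequencies under H0

    assert (expected >= 5).all()   # test is valid only if all expected counts are >= 5
    chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
    print(f"X^2 = {chi2:.2f}, p-value = {p:.4f}")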

III. Simple Linear Regression

1. Regression analysis

  1. Setup: We have a response variable Y and an explanatory variable X. We observe pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$.
  2. Goal: Estimate the mean of Y for the subpopulation with explanatory variable X=x, $\mu(Y|X=x)$.
  3. Application to prediction: The mean of Y given $X = X_0$, $\mu(Y|X=X_0)$, provides a good prediction of a future observation $Y_0$ if we know that the future observation has $X = X_0$.

2. Ideal Simple Linear Regression model

A. Assumptions of the model:

  1. The mean of Y given X=x is a straight-line function of x: $\mu(Y|X=x) = \beta_0 + \beta_1 x$.
  2. The standard deviation of Y given X=x is the same for all x and is denoted by $\sigma$.
  3. The distribution of Y given X=x is normal.
  4. The sampled observations are independent.

B. Interpretation of parameters:

  1. Intercept $\beta_0$: The mean of Y given that X=0.
  2. Slope $\beta_1$: The change in the mean of Y given X=x that is associated with each one-unit increase in X.
  3. Standard deviation $\sigma$: Measures how accurate predictions of Y based on X from the regression line will be. If the ideal simple linear regression model holds, then approximately 68% of the observations will fall within $\sigma$ of the regression line and 95% of the observations will fall within $2\sigma$ of the regression line.

C. Estimation of parameters:

  1. $\beta_0$ and $\beta_1$: Estimated by the least-squares coefficients $b_0$ and $b_1$ that minimize the sum of squared prediction errors $\sum_{i=1}^n (Y_i - b_0 - b_1 X_i)^2$ when the regression line is used to predict $Y_i$ given $X_i$.
  2. $\sigma$: The residual for observation i is the prediction error of using $b_0 + b_1 X_i$ to predict $Y_i$: $\mathrm{res}_i = Y_i - (b_0 + b_1 X_i)$. We estimate $\sigma$ by the standard deviation of the residuals, corrected for degrees of freedom: $\hat{\sigma} = \sqrt{\sum_{i=1}^n \mathrm{res}_i^2 / (n-2)}$. $\hat{\sigma}$ is the Root Mean Square Error in JMP.
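
A minimal Python sketch of the least-squares estimates and the RMSE, using made-up (x, y) data:

    # Least-squares estimates b0, b1 and sigma-hat for simple linear regression
    # (hypothetical data).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.8])
    n = len(x)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()

    res = y - (b0 + b1 * x)                       # residuals (prediction errors)
    sigma_hat = np.sqrt(np.sum(res**2) / (n - 2)) # root mean square error (df = n - 2)

    print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, sigma_hat = {sigma_hat:.3f}")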

3. Cautions about regression analysis

A. Interpolation vs. Extrapolation:

  1. Interpolation: Drawing inferences about $\mu(Y|X=x)$ for x within the range of the observed X's. A strong advantage of the regression model over a one-way layout analysis is the ability to interpolate.
  2. Extrapolation: Drawing inferences about $\mu(Y|X=x)$ for x outside the range of the observed X's. Dangerous: the simple linear regression model may hold approximately over the region of observed X but not for all X.

B. Association is not causation: Regression analysis shows how the mean of Y for different subpopulations X=x is associated with x. No cause-and-effect relationship between X and Y can be inferred unless X is randomly assigned to units as in a randomized experiment. Alternative explanations for a strong relationship between the mean of Y given X=x and x, other than that X causes Y to change: (i) the reverse is true, Y causes X; (ii) there may be a lurking (confounding) variable related to both X and Y which is the common cause of X and Y.

4. Inference for ideal simple linear regression model

A. Sampling distribution of $b_1$: $b_1$ is normally distributed with mean $\beta_1$ and standard deviation $SD(b_1) = \sigma\big/\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}$, estimated by $SE(b_1) = \hat{\sigma}\big/\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}$.

B. Hypothesis tests for $\beta_0$ and $\beta_1$: Based on the t-test statistic, e.g., $t = (b_1 - \beta_1^0)/SE(b_1)$ for testing $H_0: \beta_1 = \beta_1^0$, compared to the t distribution with n-2 degrees of freedom.

C. Confidence intervals for $\beta_0$ and $\beta_1$: 95% confidence intervals: $b_0 \pm t_{n-2}(0.975)\,SE(b_0)$ and $b_1 \pm t_{n-2}(0.975)\,SE(b_1)$.

D. Confidence intervals for the mean of Y at $X = X_0$ (confidence intervals for the mean response): $\hat{\mu}(Y|X=X_0) \pm t_{n-2}(0.975)\,SE[\hat{\mu}(Y|X=X_0)]$ (95% interval), where $\hat{\mu}(Y|X=X_0) = b_0 + b_1 X_0$ and $SE[\hat{\mu}(Y|X=X_0)] = \hat{\sigma}\sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{\sum_{i=1}^n (X_i - \bar{X})^2}}$. Note that precision in estimating $\mu(Y|X=X_0)$ decreases as $X_0$ gets farther away from the sample mean of the X's.
E. Prediction interval: Interval of likely values for a future observation $Y_0$ randomly sampled from the subpopulation $X = X_0$.

  1. Property of a 95% prediction interval at $X_0$: If repeated samples are obtained from the subpopulations and a prediction interval is formed, the prediction interval will contain the value $Y_0$ of a future observation from the subpopulation $X_0$ 95% of the time.
  2. A prediction interval must account for two sources of uncertainty:
     a. Uncertainty about the location of the subpopulation mean $\mu(Y|X=X_0)$, because we estimate it from the data using least squares.
     b. Uncertainty about where the future value $Y_0$ will be in relation to its mean.

Confidence intervals for the mean response only need to account for the first source of uncertainty and are consequently narrower than prediction intervals.

  3. 95% prediction interval at $X_0$: $\hat{\mu}(Y|X=X_0) \pm t_{n-2}(0.975)\,\sqrt{\hat{\sigma}^2 + SE[\hat{\mu}(Y|X=X_0)]^2}$.
  4. Comparison with the confidence interval for the mean response at $X_0$: The prediction interval is always wider. As the sample size n becomes large, the margin of error of the confidence interval for the mean response goes to zero, but the margin of error of the prediction interval does not go to zero.
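
A short Python sketch computing both intervals at a hypothetical X0, using the formulas above and the made-up regression data from earlier:

    # 95% CI for the mean response and 95% prediction interval at X = X0
    # (hypothetical data).
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.8])
    n = len(x)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x)**2) / (n - 2))

    x0 = 4.5
    fit = b0 + b1 * x0
    se_mean = sigma_hat * np.sqrt(1/n + (x0 - x.mean())**2 / np.sum((x - x.mean())**2))
    se_pred = np.sqrt(sigma_hat**2 + se_mean**2)    # adds the extra source of uncertainty
    tcrit = stats.t.ppf(0.975, df=n - 2)

    print(f"CI for mean response: ({fit - tcrit*se_mean:.2f}, {fit + tcrit*se_mean:.2f})")
    print(f"Prediction interval:  ({fit - tcrit*se_pred:.2f}, {fit + tcrit*se_pred:.2f})")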

F. R-squared statistic

a. Definition: $R^2$ is the percentage of the variation in the response variable that is explained by the linear regression of the response variable on the explanatory variable:

$R^2 = \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$

where Total sum of squares $= \sum_{i=1}^n (Y_i - \bar{Y})^2$ and Residual sum of squares $= \sum_{i=1}^n \mathrm{res}_i^2$.

b. $R^2$ takes on values between 0 and 1, with values nearer to 1 indicating a stronger linear relationship between X and Y. $R^2$ provides a unitless measure of the strength of the relationship between X and Y.

c. Caveats about $R^2$: Not useful for assessing whether the ideal simple linear regression model is correct (use residual plots); not useful for deciding whether or not Y is associated with X (use a hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$).
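
A minimal Python sketch computing R-squared from the total and residual sums of squares (same made-up data as above):

    # R-squared from total and residual sums of squares (hypothetical data).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.8])

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()

    total_ss = np.sum((y - y.mean())**2)          # total sum of squares
    resid_ss = np.sum((y - b0 - b1 * x)**2)       # residual sum of squares
    r_squared = (total_ss - resid_ss) / total_ss

    print(f"R^2 = {r_squared:.3f}")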

5. Regression diagnostics:

A. The conditions required for inferences from the ideal simple linear regression model to be accurate must be checked:

  1. Linearity (mean of Y given X=x is a straight line function of x). Diagnostic: Residual plot
  2. Constant variance. Diagnostic: Residual plot
  3. Normality. Diagnostic: Histogram of residuals
  4. Independence. Diagnostic: Residual plot

B. Residual plot: Scatterplot of residuals versus X or any other variable (such as time order of observations). If the ideal simple linear regression model holds, the residual plot should look like random scatter – there should be no pattern in the residual plot.

a. Pattern in the mean of the residuals, i.e., the residuals have a mean less than zero for some range of x and a mean greater than zero for another range of x. Indicates nonlinearity.

b. Pattern in the variance of the residuals, i.e., the residuals have greater variance for some range of x and less variance for another range of x. Indicates nonconstant variance.

c. Pattern in the residuals over time indicates violation of independence. Pattern in the mean of residuals over time potentially indicates lurking variable that is associated with time.
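
A rough Python sketch of a residual plot using matplotlib (hypothetical data; in the course, residual plots are made in JMP):

    # Residual plot (residuals vs. X) for checking linearity and constant variance
    # (hypothetical data).
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.8])

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    res = y - (b0 + b1 * x)

    plt.scatter(x, res)
    plt.axhline(0, linestyle="--")
    plt.xlabel("X")
    plt.ylabel("Residual")
    plt.title("Residual plot: look for random scatter with no pattern")
    plt.show()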

6. Transformations

  1. Basic idea: If we detect nonlinearity, we may still be able to use the ideal simple linear regression model by transforming X to g(X) and/or Y to f(Y), fitting the ideal simple linear regression model using f(Y) as the response variable and g(X) as the explanatory variable, and then “back-transforming” to recover predictions of Y on the original scale.
  2. Choice of transformation: Use Tukey’s Bulging rule to decide which transformations to try. Make residual plots of the residuals for the transformed model vs. the original X. Choose transformations which make the residual plot look most like random scatter with no pattern in the mean of the residuals vs. X.
  3. Prediction after transformation: To predict Y given X=x (or estimate the center of the distribution of Y given X=x) when Y has been transformed to f(Y) and X to g(X), predict the transformed response and back-transform:

$\hat{Y} = f^{-1}\big(b_0 + b_1\, g(x)\big),$

where $b_0$ and $b_1$ are the least-squares coefficients from the regression of f(Y) on g(X) (it’s easier to think through examples than to use the formula directly).
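
A minimal Python sketch of prediction after a log transformation of Y, with hypothetical data chosen so that log(Y) is roughly linear in X (f = log, g = identity):

    # Fit a simple linear regression of log(Y) on X, then back-transform
    # predictions with exp (hypothetical data with exponential-looking growth).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])

    fy = np.log(y)                                          # transformed response f(Y)
    b1 = np.sum((x - x.mean()) * (fy - fy.mean())) / np.sum((x - x.mean())**2)
    b0 = fy.mean() - b1 * x.mean()

    x_new = 3.5
    pred_log = b0 + b1 * x_new          # prediction on the log scale
    pred_y = np.exp(pred_log)           # back-transform: f^{-1} = exp
    print(f"predicted Y at X = {x_new}: {pred_y:.1f}")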