STAT 460 Lab 6 Turn in Sheet 10/18/2004
To receive credit for this lab, turn this sheet in before leaving the lab.
Name: ______
Lab Section: ____
1. Do the assumptions of regression appear reasonable? Which ones can be eye-ball checked here, and what do you conclude?
2. Does the intercept (constant) of 8.39 have meaning for this study? What is the null hypothesis regarding the intercept (β0)? Why is the hypothesis test for β0 uninteresting for this study?
3. What is the null hypothesis for the slope coefficient? Why is this an interesting hypothesis? Interpret the estimate. Interpret the confidence interval for the estimate.
4. List one or two things that are still unclear to you.
STAT 460 Lab 6 Instructions 10/18/2004
Goals: In this lab you will get practice and additional insight on 1) simple regression and 2) residual analysis.
Review of ANOVA vs. regression:
a)Both: Context is an observational study or an experiment with a quantitative outcome. All subjects with the same level of the explanatory variable (“in the same group”) are assumed to have the same mean and vary around that mean according to a Gaussian distribution (bell shaped curve) with a common variance, σ². Errors (deviations from the group mean) are assumed to be independent across subjects. Group assignments are assumed to be clear cut (fixed x assumption).
b)One-way ANOVA: Categorical explanatory variable. Mean parameters are μ1 through μk. Best predictions are the group sample means, Ȳ1 through Ȳk.
c) Regression: Quantitative explanatory variable. Coefficient parameters are β0 and β1. Mean outcome at X is μ(Y|X) = β0 + β1X (linearity assumption). Best prediction is μ̂(Y|X) = β̂0 + β̂1X (a.k.a. b0 + b1X).
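As a rough illustration of the contrast between (b) and (c), the Python sketch below computes both kinds of best prediction on a tiny made-up data set. The numbers are hypothetical and do not come from any lab file; the names group_means and predict are chosen only for this example. The ANOVA view predicts each group by its sample mean, while the regression view predicts with the least-squares line b0 + b1X.

```python
from statistics import mean

# Hypothetical toy data (NOT from the lab files): x is the explanatory
# variable, y is the quantitative outcome.
x = [0, 0, 5, 5, 10, 10]
y = [8.0, 9.0, 13.0, 14.0, 19.0, 21.0]

# One-way ANOVA view: treat each distinct x value as a group label.
# The best prediction for a group is that group's sample mean.
groups = {}
for xi, yi in zip(x, y):
    groups.setdefault(xi, []).append(yi)
group_means = {g: mean(vals) for g, vals in groups.items()}

# Regression view: least-squares estimates of the slope and intercept.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx              # estimated slope
b0 = y_bar - b1 * x_bar     # estimated intercept


def predict(x_new):
    """Predicted mean outcome at x_new from the fitted line."""
    return b0 + b1 * x_new


print("group means:", group_means)
print("fitted line: b0 =", b0, "b1 =", b1)
```

Note that the ANOVA predictions use one parameter per group, while the regression line uses only two parameters no matter how many distinct x values there are; that economy is paid for with the linearity assumption.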
Task 1: Guided performance of simple linear regression
1)The data in string.txt is from a study by Elbert et al. called “Increased cortical representation of the fingers of the left hand in string [instrument] players.” The outcome is a “neuron activity index” from magnetic source imaging. The explanatory variable is years as a string player. There are nine string player subjects and six controls (who have 0 years as a string player). Is this an experiment or an observational study? What is your null hypothesis? What effect might subject selection have on this study?
2)Load the data into SAS and perform descriptive statistics on the two variables. Go to Edit/Mode and choose Edit. Create a new nominal variable called “player” using Data/Transform/Recode values. In Column to Record choose Years and name the new variable ‘player’, and click OK. Recode 0 years to 0 in player and ≥1 years to 1 in player, and click OK. Now you have a nominal explanatory variable PLAYER, and we will first, for educational purposes only, try ANOVA. Perform EDA of the relationship between activity and player. Do you think the assumptions for ANOVA are well met? Be specific.
3)Perform the ANOVA, ignoring the assumption violation. Unequal variance of the groups will tend to shift the null sampling distribution of F to the right. Can you deduce what this means in terms of false rejection of the null hypothesis? In terms of type 1 error?
4)One rough adjustment for unequal variances is to halve your alpha, say from 0.05 to 0.025. Is the adjustment consistent with your thinking for part 3? Using the adjustment, do you reject H0?
5)Now we will return to simple linear regression. Make, as EDA, a scatter plot of activity vs. years of playing. Which variable should go on the x-axis? Do the x positions of the data match your expectations?
6)Do the assumptions of regression appear reasonable? Which ones can be eye-ball checked here, and what do you conclude? (♠1)
7)Perform the regression:
a)Choose Statistics/Regression/Simple from the menu.
b)Enter the Dependent variable (i.e., the outcome variable)
c)Enter “years” as the independent variable.
d)Go to Statistics and add Std. regression coefficient, Confidence limits for estimates, and Correlation matrix for estimates.
e)Go to Predict and add Predict the original sample and List the predictions.
f)Go to Plots and add Plot observed vs predicted. Choose Residual and add Plot residual vs variable (ordinary and predicted). Also add Normal Probability Plot.
g)Click OK to perform the analyses.
8)First check those assumptions that can be checked using the data.
a)To check normality, look at the Normal P-P (Quantile Normal) Plot of the residuals. If the points are very clearly far from falling on the line, we may need to worry about the effects of breaking the normality assumption on the null sampling distribution of the t and F statistics.
b)To check the equal variance assumption, look at the Residual vs. Predicted scatterplot. Here is how to check for unequal variance. Visually break the graph from left to right into 5 to 10 vertical stripes. Roughly estimate the range of the middle 95% of the data (vertically) in each stripe. Keep in mind that a stripe with little data provides an unreliable estimate of spread. If you see marked differences across the stripes, e.g. some reliable stripes show twice the spread of others, then you should worry about the equal variance assumption.
c)To check the linearity assumption, eyeball a left-to-right curve in the Residual vs. Predicted scatterplot that represents the vertical center of the points in any horizontal region. If the curve clearly follows some pattern different from a horizontal line at Y=0, consider linearity violation.
d)The above assumption checking is called residual analysis, because the plots are of residuals. Note that the assumptions of fixed x (practically, x variation is small compared to y variation) and independence of deviations of observed values from the mean for that X cannot be checked from the data (except for serial correlation, which we will discuss another time).
e)What did you find out about the validity of the regression assumptions for these data?
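The stripe-based checks in (b) and (c) can also be done numerically. The sketch below is one possible way to do it in Python, not the procedure SAS uses: it fits the line, splits the predicted values into equal-width stripes, and reports the count, mean residual, and residual range in each stripe. The function names and the example data are hypothetical. With so few points it uses the full range in each stripe rather than the middle 95%.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def stripe_summary(x, y, n_stripes=3):
    """Split the predicted values into equal-width vertical stripes and
    return (count, mean residual, residual range) for each stripe."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    lo, hi = min(fitted), max(fitted)
    width = (hi - lo) / n_stripes
    stripes = [[] for _ in range(n_stripes)]
    for fi, ri in zip(fitted, resid):
        # A zero width means all predictions are equal: use one stripe.
        k = 0 if width == 0 else min(int((fi - lo) / width), n_stripes - 1)
        stripes[k].append(ri)
    return [
        (len(s), mean(s), max(s) - min(s)) if s else (0, None, None)
        for s in stripes
    ]


# Hypothetical data (NOT from the lab files).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.0, 4.1, 5.9, 8.3, 9.6, 12.5, 13.4, 16.8, 17.1]
summary = stripe_summary(x, y, n_stripes=3)
for count, center, spread in summary:
    print(count, center, spread)
```

Mean residuals that drift away from zero across the stripes suggest a linearity problem; residual ranges that differ markedly between well-populated stripes suggest unequal variance. Stripes with only a few points give unreliable summaries, just as in the visual check.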
9)Continue your interpretation with the regression ANOVA and coefficient tables. (Yes, an ANOVA table, similar to but different from the one you are used to, is usually part of the regression output!) Note that the p-values are the same for the ANOVA and for the YEARS coefficient. Guess, then verify, the relationship between the t and F values in these two tables.
10)Does the intercept (constant) of 8.39 have meaning for this study? What is the null hypothesis regarding the intercept (β0)? Why is the hypothesis test for β0 uninteresting for this study? (♠2) Interpret the confidence interval for β0.
11)Now let's think about the slope coefficient for YEARS, β1, and its estimate, b1. What is the null hypothesis? Why is this an interesting hypothesis for this study? Interpret the estimate. Interpret the confidence interval for the estimate. (♠3)
12)Interpret R².
13)Find the residual standard error and the Mean Square of the residual (error), often called MSE. Note that one is the square of the other. Interpret the residual standard error.
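The quantities in steps 9, 12, and 13 are tied together by simple arithmetic, which the following Python sketch works through by hand. It is a minimal illustration on made-up numbers (not the string-player data), and the function name simple_regression and its dictionary keys are chosen only for this example. It computes the slope t statistic, the regression F statistic, R², MSE, and the residual standard error so the relationships can be checked directly.

```python
from math import sqrt
from statistics import mean


def simple_regression(x, y):
    """Hand-computed summary of a simple linear regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Sums of squares for the regression ANOVA table.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sst - sse
    mse = sse / (n - 2)          # residual mean square, df = n - 2
    se_b1 = sqrt(mse / sxx)      # standard error of the slope
    return {
        "b0": b0,
        "b1": b1,
        "mse": mse,
        "resid_se": sqrt(mse),   # residual standard error
        "t_slope": b1 / se_b1,
        "f": ssr / mse,          # regression df = 1, so MSR = SSR
        "r2": ssr / sst,
    }


# Hypothetical data (NOT from the lab files).
x = [0, 0, 5, 5, 10, 10]
y = [8.0, 9.0, 13.0, 14.0, 19.0, 21.0]
fit = simple_regression(x, y)
print("t squared:", fit["t_slope"] ** 2, "F:", fit["f"])
print("R squared:", fit["r2"])
print("MSE:", fit["mse"], "residual SE:", fit["resid_se"])
```

In simple regression the squared t statistic for the slope equals the F statistic, which is why the two p-values agree. The residual standard error is in the units of the outcome and estimates σ, the typical size of a deviation from the fitted line.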
Task 2: Unguided example
1)The data in marigold.txt (space delimited) is from a well-designed experiment testing the effects of gamma rays on marigold plant growth. Briefly, marigold seeds were grown in a vermiculite/nutrient mixture in a constant-temperature greenhouse with daily light and water exposure designed for maximum growth. At day 12, twenty-four plants that all appeared healthy and of about the same size were numbered, labeled with a bar code, and placed on a continuous serpentine “track” that moved slowly to ensure that every plant spent time at every position around the experimental apparatus. The track included a long “tongue” that could be placed in a gamma ray chamber in such a way that only a single plant at a time was exposed to the gamma rays. Every day at noon, starting at a random position, the track was rapidly cycled such that each plant spent 1 minute in the chamber, where its bar code was read and the appropriate radiation dose (in rem units) was applied. Only the computer knew which dose was randomly assigned to each plant. On the morning of day 21, a trained technician carefully removed each plant from the vermiculite, rinsed off the roots, patted them dry, and recorded the weight of the plant (in gm).
2)Load the data into SAS.
3)State your null hypothesis or hypotheses of interest for this particular experiment.
4)Run the regression analysis. For the moment, skip the residual analysis, and write out your prediction equation, μ̂(Y|X) = b0 + b1X, by substituting in the coefficient estimates for b0 and b1. Write interpretations for the coefficient estimates.
5)Now perform the residual analysis to check the assumptions. What is clearly wrong?
6)Without worrying too much about why right now, create a new explanatory variable using Data/Transform/Compute (make sure you are in Edit mode!). Call the new variable “rem2”. For the Numeric Expression, enter “rem**2” which means rem squared. (This is a “transformation” of the explanatory variable, putting it on a new scale, which may be more useful.)
7)Now repeat your analyses, including residual analysis, substituting rem2 for rem as the explanatory variable. In what ways is your new analysis better than the old? Make interpretations of the coefficient estimates (a bit trickier here).
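To see why squaring an explanatory variable can help, the Python sketch below fits the same outcome twice, once on a dose variable and once on its square, and compares the residuals. The dose and weight values are invented so that the true relationship is curved; they are not the marigold data, and nothing here should be read as the expected result for this lab. The helper name fit_and_residuals is likewise hypothetical.

```python
from statistics import mean


def fit_and_residuals(x, y):
    """Least-squares fit of y on x; returns (b0, b1, residuals)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return b0, b1, resid


# Hypothetical data (NOT the marigold data): the outcome falls off
# roughly with the square of the dose, plus a little noise.
dose = [0, 1, 2, 3, 4, 5, 6, 7]
weight = [10.1, 9.7, 9.4, 8.3, 7.0, 5.6, 3.4, 1.2]

# The transformed explanatory variable, analogous to rem2 = rem**2.
dose2 = [d ** 2 for d in dose]

_, _, resid_linear = fit_and_residuals(dose, weight)
_, _, resid_squared = fit_and_residuals(dose2, weight)

sse_linear = sum(r * r for r in resid_linear)
sse_squared = sum(r * r for r in resid_squared)

print("residuals, original scale:", [round(r, 2) for r in resid_linear])
print("residuals, squared scale: ", [round(r, 2) for r in resid_squared])
print("SSE original:", sse_linear, "SSE squared:", sse_squared)
```

On these made-up numbers the residuals from the untransformed fit are negative at both ends and positive in the middle, the kind of curved pattern a residual plot reveals, while the fit on the squared scale leaves small residuals with no obvious pattern. After such a transformation the slope describes the change in mean outcome per unit of the squared variable, not per unit of the original variable.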
Task 3: Residual Analysis
Interpret the plot: