Stat 462 April 5 LAB SOLUTION FOR ACTIVITY 1

1. Use the dataset prostatecancer.txt at www.stat.psu.edu/~rho/462data/. The dataset consists of n = 97 prostate cancer patients, with these variables:

y = PSA_level, prostate specific antigen, a blood chemistry measurement affected by presence of cancer

x1 = CancerVol, cancer volume

x2 = Capsular, a measure of the invasiveness of the cancer

A. Fit the model E(y) = b0 + b1 x1 + b2 x2

Use the Storage button to store the deleted t residuals, Hi (leverages), Cook’s distance, DFITS, and fits.

What is the estimated regression equation? PSA_level = 1.33 + 2.41 CancerVol + 2.45 Capsular

What is the value of MSE? 991

B. Do a dotplot of the Cook’s D values. Use Editor>Brush to help you identify any extreme values. What observation(s) have extreme Cook’s D values?

One observation clearly has an extreme D value.

C. Explain what (in general) is measured by a Cook’s D value.

For the ith observation, the overall change in the fitted values for all n observations (equivalently, in the estimated coefficients) that occurs when the ith observation is deleted from the data.
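The verbal description can be written as a formula. The notation below is the usual textbook form and is an assumption, since the handout does not define symbols: the fitted value for observation j when observation i is deleted is written with subscript j(i), p is the number of coefficients including the intercept, e_i is the ordinary residual, and h_i is the leverage.

```latex
D_i \;=\; \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \cdot MSE}
    \;=\; \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}
```

The second form shows that D grows when either the residual or the leverage is large.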

D. Do a dotplot of the DFITS. Identify any “extreme” points. (The book’s criterion for extreme is |DFITS| > 1.)

Two observations qualify as large, a third is quite close.

E. Explain what (in general) is measured by a DFITS value.

For the ith observation, the change in its own predicted value that occurs when the ith observation is deleted from the data, standardized by an estimate of the standard error of that predicted value.
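In formula form, using standard textbook notation that the handout itself does not spell out (an assumption): the predicted value for observation i from the fit without observation i carries subscript i(i), MSE_(i) is the mean squared error from that same fit, h_i is the leverage, and t_i is the deleted t residual.

```latex
DFITS_i \;=\; \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MSE_{(i)} \, h_i}}
        \;=\; t_i \sqrt{\frac{h_i}{1 - h_i}}
```

Unlike Cook’s D, which sums over all n fitted values, this looks only at the fitted value for the deleted observation.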

F. Do a dotplot of the hi values. Identify any “extreme” points. The Minitab criterion for a large leverage is 3p/n. Use this as the definition of “extreme.”

Six data points qualify. Here p = 3 for a two-variable model (two slopes plus the intercept).
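The 3p/n rule is easy to check by hand or in code. The following is a minimal Python sketch, not Minitab output; the function names are hypothetical, and the short leverage list in the test is made up for illustration.

```python
def leverage_cutoff(p, n):
    """Cutoff 3p/n, where p counts all coefficients (intercept included)."""
    return 3 * p / n


def flag_high_leverage(leverages, p):
    """Return 1-based observation numbers whose leverage exceeds 3p/n."""
    cutoff = leverage_cutoff(p, len(leverages))
    return [i + 1 for i, h in enumerate(leverages) if h > cutoff]


# For this lab: p = 3 coefficients, n = 97 patients.
cutoff = leverage_cutoff(p=3, n=97)
print(round(cutoff, 4))  # 0.0928
```

Any stored Hi value above roughly 0.093 would therefore count as “extreme” here.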

A large hi indicates potential influence on the results. It doesn’t necessarily mean the observation is causing trouble.

G. Do a dotplot of the deleted residuals. Identify any “extreme” data points.

There are three obvious outliers.

H. Graph PSA_level versus CancerVol. Identify any unusual points.

There are three, perhaps four, unusual points.

I. Graph PSA_level versus Capsular. Identify any unusual points.

Again, there are three, perhaps four, unusual points.

J. What is the predicted value for observation 97? 123.49

K. Delete observation 97 by replacing its y-value with an asterisk. Recompute the regression equation. PSA_level = 4.56 + 2.29 CancerVol + 0.59 Capsular

What is the predicted value for observation 97? 88.88

What is the difference between this predicted value and the value found in the previous part? Note: This is the “unstandardized” version of a DFIT for observation 97.

123.49-88.88 = 34.61
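The same delete-and-refit idea can be sketched in Python. This is only an illustration under stated assumptions: it uses one predictor instead of the lab’s two, the data are made up (not the prostate cancer data), and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def unstandardized_dfit(xs, ys, i):
    """Prediction for point i from the full fit minus the prediction
    for point i from the fit with point i left out (0-based index)."""
    b0, b1 = fit_line(xs, ys)
    c0, c1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    return (b0 + b1 * xs[i]) - (c0 + c1 * xs[i])


# Made-up data: the first four points lie on y = 2x, the last one does not.
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [2.0, 4.0, 6.0, 8.0, 40.0]
diff = unstandardized_dfit(xs, ys, 4)
print(round(diff, 2))  # 18.4
```

In the made-up example the full fit predicts 38.4 for the last point and the deleted fit predicts 20, so the difference is 18.4, the analogue of the 34.61 found for observation 97.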

L. In addition to observation 97, delete observations 95 and 96. Re-run the regression, and then repeat part A based on this new regression. Describe differences between the two sets of results.

PSA_level = 5.96 + 1.90 CancerVol - 0.517 Capsular.

Coefficients are quite different, especially for Capsular, which is now not significant.

MSE = 148, much smaller than before. The three extreme points had inflated the error variance estimate.


Stat 462 April 12 PARTIAL LAB SOLUTIONS

Go to www.stat.psu.edu/~rho/462data/. Link to the dataset mortality.txt. Copy and paste the dataset to Minitab. The variables are weekly mortality rates due to heart disease, weekly air temperature, and weekly pollution particulate levels in Los Angeles County.

A. Do a regression in which y = mortality, x1 = temp and x2 = partic (the pollution variable). Store the residuals. Are both variables needed in the equation? What is the evidence of your statement?

Yes. P-values for both variables are 0.000, evidence of statistical significance.

Predictor      Coef   SE Coef       T      P
Constant    110.545     3.119   35.45  0.000
temp       -0.47824   0.03879  -12.33  0.000
partic      0.28827   0.02309   12.48  0.000

B. Create a column that is the “lag 1” of the residuals, that is, the residual at the previous time. To do this, use Stat>Time Series>Lag. Lag the residual column into a new column. Then, plot residuals versus lagged residuals. Describe the plot. In what way is the result evidence of a violation of the assumptions made about errors in a regression model?

The plot shows an increasing linear pattern, so each residual is related to the residual before it. This violates the assumption that the errors are independent of each other.

C. Use Stat>Basic Stats>Correlation to find the “1st order” autocorrelation for the residuals. That is, find the correlation between the column of residuals and the column of lagged residuals. What is the correlation? Explain whether the correlation is statistically significant.

There’s a significant first order autocorrelation.

Pearson correlation of RESI1 and lagres = 0.539

P-Value = 0.000
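The correlation above pairs each residual with the one before it. A minimal Python sketch of that calculation follows; the function names are hypothetical, the example series are made up, and this is not a reproduction of Minitab’s own routine.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / sqrt(saa * sbb)


def lag1_autocorrelation(resid):
    """Correlate residuals at times 2..n with residuals at times 1..n-1.
    The first residual has no lagged partner, so one pair is lost."""
    return pearson(resid[1:], resid[:-1])


# Made-up series: a steady trend gives +1, strict alternation gives -1.
trend = lag1_autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0])
flip = lag1_autocorrelation([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```

Applied to the stored residual column, the same calculation should give a value close to the 0.539 reported by Minitab.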

D. Lag each of the variables y, x1, and x2 into new columns. Refer to the previous part. Let r = value of the correlation found there. Create three new columns:

ynew = y - r*lag(y)

x1new = x1 - r*lag(x1)

x2new = x2 - r*lag(x2)

Do a regression with ynew as the response variable and x1new and x2new as the predictors. Store the residuals. Then repeat parts B and C for this new set of residuals.

Note: The value of r = 0.539 (from previous part).

Predictor      Coef   SE Coef       T      P
Constant     41.236     1.360   30.33  0.000
x1new      -0.15348   0.04330   -3.54  0.000
x2new       0.22391   0.02487    9.00  0.000

Residuals from this regression do not appear to have an autocorrelation pattern.

Pearson correlation of RESI2 and lagres2 = -0.061

P-Value = 0.169
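The column arithmetic used in this part (ynew = y - r*lag(y), and likewise for x1 and x2) is often called the Cochrane-Orcutt transformation. A minimal Python sketch follows; the function name and the short example series are hypothetical, and only r = 0.539 comes from the lab.

```python
def quasi_difference(series, r):
    """Return series[t] - r * series[t-1] for t = 2..n.
    The first value has no lag, so the result is one element shorter."""
    return [series[t] - r * series[t - 1] for t in range(1, len(series))]


# Made-up numbers with r = 0.5 so the arithmetic is easy to follow.
example = quasi_difference([10.0, 20.0, 30.0], 0.5)
print(example)  # [15.0, 20.0]

# With the lab's estimate, each column would be transformed the same way:
r = 0.539
ynew_demo = quasi_difference([100.0, 90.0], r)  # 90 - 0.539 * 100
```

The same function would be applied to y, x1, and x2 before refitting the regression on the transformed columns.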

E. Refer to the regression output for the previous part. It can be proved that the “slope estimates” and their standard errors are the “correct” estimates of the “slopes” in the untransformed regression (part A). How do the slopes and standard errors compare to the values found for part A?

Standard errors are larger. In the presence of positively autocorrelated residuals, ordinary least squares underestimates the standard errors. The slopes differ somewhat, especially for temperature (-0.478 in part A versus -0.153 here).

F. It can be shown that the “correct” estimate of the intercept for the untransformed model is (Intercept for transformed model)/(1-r). Find this intercept, and then write the “correct” estimated equation for the untransformed variables. Note: see what the previous part says about the slopes.

Intercept = 41.236/(1-0.539) = 89.45

Revised equation is Predicted Mortality = 89.45 - 0.15348 Temp + 0.22391 Partic.
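The intercept calculation can be checked with a few lines of Python, using r = 0.539 from part C and the transformed-model coefficients from part D. The variable and function names are hypothetical, and the temperature and particulate values in the last line are made up for illustration.

```python
r = 0.539                      # first-order autocorrelation from part C
intercept_transformed = 41.236 # constant from the transformed regression
slope_temp = -0.15348          # slopes carry over unchanged
slope_partic = 0.22391

# Back-transform the intercept to the scale of the original variables.
intercept = intercept_transformed / (1 - r)
print(round(intercept, 2))  # 89.45


def predicted_mortality(temp, partic):
    """Revised equation for the untransformed variables."""
    return intercept + slope_temp * temp + slope_partic * partic


# Hypothetical week: temp = 70, partic = 50 (values made up for illustration).
example_prediction = predicted_mortality(70.0, 50.0)
```

Only the intercept needs rescaling; the slopes from the transformed regression are used as they are.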