Ch 3. – Regression Analysis – Computing in R

Example 1: Patient Satisfaction Survey (pgs. 79 – 83)

First read these data in R using the same approach introduced in the Chapter 2 R handout.
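The read step itself is not shown here; a minimal sketch, assuming the data are stored in a comma-separated file with a header row like the other datasets in this handout:

> HospSatis = read.table(file.choose(),header=T,sep=",")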
> head(HospSatis)
  Age Severity Satisfaction
1  55       50           68
2  46       24           77
3  30       46           96
4  35       48           80
5  59       58           43
6  61       60           44
We can visualize the pairwise relationships via a scatterplot matrix using the pairs() command.

> pairs(HospSatis)   Note: this only works if all columns in the data frame are numeric!

We can see both age and severity of illness are negatively correlated with the response satisfaction, and are positively correlated with each other as would be expected.
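To put numbers to these visual impressions, the correlation matrix can be computed with the base R cor() function (again, this assumes all columns are numeric):

> cor(HospSatis)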

We first examine the simple linear regression model for satisfaction using patient age as the sole predictor, i.e. we will assume the following holds for the mean satisfaction as a function of patient age: $E(\text{Satisfaction} \mid \text{Age}) = \beta_0 + \beta_1\,\text{Age}$.

Thus we will fit the following model to the observed data from the patient survey:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

where $y_i$ = satisfaction of the $i$-th patient, $x_i$ = age of the $i$-th patient, and the errors $\epsilon_i$ are assumed to be independent $N(0, \sigma^2)$.

To fit this model in R we use the lm() command. The basic form to use when calling this function is lmfit = lm(y ~ x1 + x2 + … + xk) where y is the name of the response variable in the data frame and x1 + x2 + … + xk are the terms you want to include in your model. Applying this to the simple linear regression of patient satisfaction on patient age is as follows.

> attach(HospSatis)
> satis.lm = lm(Satisfaction ~ Age)

> summary(satis.lm)

Residuals:
     Min       1Q   Median       3Q      Max
 -21.270   -6.734    2.164    5.918   18.193

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 131.0957     6.8322  19.188 1.19e-15 ***
Age          -1.2898     0.1292  -9.981 7.92e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.375 on 23 degrees of freedom
Multiple R-squared: 0.8124,  Adjusted R-squared: 0.8043
F-statistic: 99.63 on 1 and 23 DF, p-value: 7.919e-10

R-square = .8124, i.e. 81.24% of the variation in the patient satisfaction responses can be explained by the simple linear regression on patient age.

RMSE = 9.375 = $\sqrt{MSE} = \hat{\sigma}$, the estimate of the error standard deviation $\sigma$ (the "Residual standard error" in the output above).

Testing Parameters and CI’s for Parameters:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 131.0957     6.8322  19.188 1.19e-15 ***
Age          -1.2898     0.1292  -9.981 7.92e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate 95% CI's for $\beta_j$ would be given by $\hat{\beta}_j \pm 2 \cdot SE(\hat{\beta}_j)$, which you could easily do by hand if needed.
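Alternatively, exact t-based confidence intervals can be obtained directly with the base R confint() function:

> confint(satis.lm, level=0.95)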

Examining Residuals ($e_i$), Leverage ($h_{ii}$), and Influence

To examine the residuals and case diagnostics it is easiest to simply type plot(modelname). Here our simple linear regression of patient satisfaction on patient age is called satis.lm. This will produce a sequence of four plots which you view in turn, hitting Enter to view the next plot in the sequence.

> plot(satis.lm)

[Plots 1–4 produced by plot(satis.lm): residuals vs. fitted values, normal Q-Q plot, scale-location plot, and residuals vs. leverage.]

The first plot is a plot of the residuals vs. the fitted values, which can be used to look for lack of fit (an inappropriate mean function) and nonconstant error variance. The second plot is a normal quantile plot of the studentized residuals, which allows us to assess the normality of the errors and check for outliers. The third plot is the square root of the absolute value of the studentized residuals ($\sqrt{|r_i|}$) vs. the fitted values with a smooth curve fit added. If this plot and/or the smooth shows an increasing trend, then the assumption of constant error variance may be violated. The final plot is the studentized residuals vs. the leverage values $h_{ii}$. In addition to displaying the leverage on the horizontal axis, contours are drawn on the plot using Cook's Distance $D_i$. Here we can see that patient 9 has a relatively large Cook's Distance, but it is not above 1.0, so we will not worry too much about the influential effects of this case.
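If you want the underlying diagnostic quantities rather than the plots, base R provides extractor functions; a minimal sketch:

> r = rstudent(satis.lm)          studentized residuals
> h = hatvalues(satis.lm)         leverage values
> D = cooks.distance(satis.lm)    Cook's Distances
> which(D > 1)                    flag any cases with D above 1.0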

To add the fitted line to a scatterplot of the data:

> plot(Age,Satisfaction,xlab="Patient Age (yrs.)",ylab="Patient Satisfaction",main="Simple Linear Regression of Satisfaction on Age")
> abline(satis.lm)


Confidence Intervals for the Mean of Y and Prediction of a New Y (visually only!)

> SLRcipi(satis.lm)
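SLRcipi() is a course-supplied convenience function. If it is not available, essentially the same picture can be built from base R's predict() with interval="confidence" and interval="prediction"; a sketch (the object names xg, ci, and pi are just for illustration):

> xg = data.frame(Age=seq(min(Age),max(Age),length=50))
> ci = predict(satis.lm,newdata=xg,interval="confidence")
> pi = predict(satis.lm,newdata=xg,interval="prediction")
> plot(Age,Satisfaction)
> abline(satis.lm)
> lines(xg$Age,ci[,"lwr"],lty=2); lines(xg$Age,ci[,"upr"],lty=2)
> lines(xg$Age,pi[,"lwr"],lty=3); lines(xg$Age,pi[,"upr"],lty=3)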

Durbin-Watson Statistic in R

The dwtest() function conducts the Durbin-Watson test for autocorrelation in the residuals and is found in the R package lmtest, which needs to be installed from one of the CRAN mirror sites. To do this in R, select Install package(s)… from the Packages pull-down menu as shown below.

You will then be asked to select a CRAN mirror site to download the package lmtest from. The choice is largely irrelevant, but I generally use Iowa State's US (IA). Once you install a package from CRAN you still need to load the new package using that option in the Packages pull-down menu, i.e. Load package…
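Equivalently, both steps can be done from the R console:

> install.packages("lmtest")    one-time download from a CRAN mirror
> library(lmtest)               load the package (needed in each new R session)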

To perform the Durbin-Watson test on the residuals from a regression model you simply type:

> dwtest(satis.lm,alternative="two.sided")

	Durbin-Watson test

data:  satis.lm
DW = 1.3564, p-value = 0.07496
alternative hypothesis: true autocorrelation is not 0

The default is to conduct an upper-tail test which tests for significant positive lag 1 autocorrelation only. Here I have specified the two-sided alternative which tests for significant lag 1 autocorrelation of either type, positive or negative.

As time is not an issue for the patient satisfaction study, there really is no reason to test for autocorrelation.

Example 2: Soda Concentrate Sales (Durbin-Watson statistic/test example)

After reading these data into R and then fitting the simple linear regression model of Sales on Expenditures, we can conduct the Durbin-Watson test on the residuals from this fit. Given that the observations are collected over time, a significant test result tells us that the autocorrelation needs to be addressed in our modeling.

> SodaConc = read.table(file.choose(),header=T,sep=",")
> names(SodaConc)
[1] "Year"   "Expend" "Sales"     notice I changed the column names in JMP 1st!
> head(SodaConc)
  Year Expend Sales
1    1     75  3083
2    2     78  3149
3    3     80  3218
4    4     82  3239
5    5     84  3295
6    6     88  3374

> detach(HospSatis)
> attach(SodaConc)
> plot(Expend,Sales,xlab="Advertising Expenditure",ylab="Soda Concentrate Sales")
> sales.lm = lm(Sales~Expend)
> abline(sales.lm)
> summary(sales.lm)

Call:
lm(formula = Sales ~ Expend)

Residuals:
    Min      1Q  Median      3Q     Max
-32.330 -10.696  -1.558   8.053  40.032

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1608.5078    17.0223   94.49   <2e-16 ***
Expend        20.0910     0.1428  140.71   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.53 on 18 degrees of freedom
Multiple R-squared: 0.9991,  Adjusted R-squared: 0.999
F-statistic: 1.98e+04 on 1 and 18 DF, p-value: < 2.2e-16

> plot(sales.lm)

The plot of the residuals vs. the fitted values for this regression suggests nonlinearity, i.e. we have an indication that the true mean function $E(y \mid x)$ is not linear in $x$. Non-normality of the residuals is also evident.

The sequence of commands below will save the residuals from the simple linear regression of sales on advertising expenditure, plot them against time, and then connect the points over time using a dashed (lty=2) blue line (col="blue").

> e = resid(sales.lm)
> plot(Year,e,main="Plot of Residuals vs. Time (Soda Concentrate Sales)")
> lines(Year,e,lty=2,col="blue")
> dwtest(sales.lm)

	Durbin-Watson test

data:  sales.lm
DW = 1.08, p-value = 0.006108
alternative hypothesis: true autocorrelation is greater than 0

Fitting a Multiple Regression Model
As an example we consider the model for hospital satisfaction using both age of the patient and severity of illness as terms in our model, i.e. we fit the following model to our survey results:

$y_i = \beta_0 + \beta_1\,\text{Age}_i + \beta_2\,\text{Severity}_i + \epsilon_i$

where the errors $\epsilon_i$ are assumed to be independent $N(0, \sigma^2)$.

We fit this model below using R.

> attach(HospSatis)
> names(HospSatis)
[1] "Age"          "Severity"     "Satisfaction"
> satis.lm2 = lm(Satisfaction~Age+Severity)
> summary(satis.lm2)

Call:
lm(formula = Satisfaction ~ Age + Severity)

Residuals:
     Min       1Q   Median       3Q      Max
-17.2800  -5.0316   0.9276   4.2911  10.4993

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 143.4720     5.9548  24.093  < 2e-16 ***
Age          -1.0311     0.1156  -8.918 9.28e-09 ***
Severity     -0.5560     0.1314  -4.231 0.000343 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.118 on 22 degrees of freedom
Multiple R-squared: 0.8966,  Adjusted R-squared: 0.8872
F-statistic: 95.38 on 2 and 22 DF, p-value: 1.446e-11

We can see that both predictors, Age and Severity, are significant (p < .001). The residual plots and plots of leverage & Cook's distance are shown on the following page and are obtained by plotting the model.

> plot(satis.lm2)

No model inadequacies are evident from these plots.

The authors consider adding more terms based on the predictors Age and Severity in Example 3.3 on page 91 of the text. The terms they add are $\text{Age} \times \text{Severity}$, $\text{Age}^2$, and $\text{Severity}^2$, giving us the model:

$y_i = \beta_0 + \beta_1\,\text{Age}_i + \beta_2\,\text{Severity}_i + \beta_3\,\text{Age}_i \times \text{Severity}_i + \beta_4\,\text{Age}_i^2 + \beta_5\,\text{Severity}_i^2 + \epsilon_i$

To fit this model in R we first need to form the squared terms for Age and Severity.

> Age2 = Age^2
> Severity2 = Severity^2
> satis.lm3 = lm(Satisfaction~Age+Severity+Age*Severity+Age2+Severity2)
> summary(satis.lm3)

Call:
lm(formula = Satisfaction ~ Age + Severity + Age * Severity +
    Age2 + Severity2)

Residuals:
    Min      1Q  Median      3Q     Max
-16.915  -3.642   2.015   4.000   9.677

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  127.527542  27.912923   4.569  0.00021 ***
Age           -0.995233   0.702072  -1.418  0.17251
Severity       0.144126   0.922666   0.156  0.87752
Age2          -0.002830   0.008588  -0.330  0.74534
Severity2     -0.011368   0.013533  -0.840  0.41134
Age:Severity   0.006457   0.016546   0.390  0.70071
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.503 on 19 degrees of freedom
Multiple R-squared: 0.9008,  Adjusted R-squared: 0.8747
F-statistic: 34.5 on 5 and 19 DF, p-value: 6.76e-09
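As an aside, the same model can be fit without creating the Age2 and Severity2 columns by wrapping the squared terms in I() inside the formula (the ^ operator has a special meaning in R formulas, so the squares must be protected). The name satis.lm3b here is just for illustration:

> satis.lm3b = lm(Satisfaction~Age*Severity+I(Age^2)+I(Severity^2))

Recall that Age*Severity expands to Age + Severity + Age:Severity, so this is the same six-parameter model.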

None of the additional terms are significant. We could begin dropping these terms one at a time sequentially, starting with Age2 as it has the largest p-value (p = .7453) among the added terms. Another approach is to consider dropping all of these terms simultaneously using the "General F-test" as shown below.

The General F-test for comparing nested models:

$F_0 = \dfrac{(SSE_{reduced} - SSE_{full})\,/\,(df_{reduced} - df_{full})}{SSE_{full}\,/\,df_{full}}$

where $SSE_{reduced}$ and $SSE_{full}$ are the residual sums of squares for the reduced (null hypothesis) and full models, and $df_{reduced}$ and $df_{full}$ are the corresponding residual degrees of freedom. If the null hypothesis (the reduced model is adequate) is true, then $F_0 \sim F(df_{reduced} - df_{full},\ df_{full})$.

The relevant output for both models is shown above. To compare these rival models in R we use the anova() command, which carries out the F-test described above.

> anova(satis.lm2,satis.lm3)
Analysis of Variance Table

Model 1: Satisfaction ~ Age + Severity
Model 2: Satisfaction ~ Age + Severity + Age * Severity + Age2 + Severity2
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     22 1114.5
2     19 1069.5  3    45.044 0.2667 0.8485

With a p-value of .8485 we clearly fail to reject the null hypothesis (the reduced model), and we conclude the more complex model is unnecessary.
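The same F statistic and p-value can be verified by hand in R from the residual sums of squares of the two fits:

> rss.red = sum(resid(satis.lm2)^2)     1114.5 on 22 df
> rss.full = sum(resid(satis.lm3)^2)    1069.5 on 19 df
> F0 = ((rss.red-rss.full)/(22-19))/(rss.full/19)
> F0                                    0.2667
> 1 - pf(F0,3,19)                       p-value = .8485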

Variable Selection Methods - Stepwise Regression (see section 3.6 of text)

Backward Elimination – include all potential terms of interest initially and then remove variables one at a time until no more terms can be removed.

Forward Selection – add the best single predictor first, then add predictors successively until none of the remaining potential predictors are significant.

In R, the easiest method to employ is backward elimination, and thus that is the only one we will consider. The function step() in R will perform backward elimination on a "full" model.

> detach(HospSatis)
> HospFull = read.table(file.choose(),header=T,sep=",")
> names(HospFull)
[1] "Age"              "Severity"         "Surgical.Medical" "Anxiety"
[5] "Satisfaction"
> attach(HospFull)
> satis.full = lm(Satisfaction~Age+Severity+Surgical.Medical+Anxiety)
> summary(satis.full)

Call:
lm(formula = Satisfaction ~ Age + Severity + Surgical.Medical +
    Anxiety)

Residuals:
     Min       1Q   Median       3Q      Max
-18.1322  -3.5662   0.5655   4.7215  12.1448

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      143.8672     6.0437  23.804 3.80e-16 ***
Age               -1.1172     0.1383  -8.075 1.01e-07 ***
Severity          -0.5862     0.1356  -4.324 0.000329 ***
Surgical.Medical   0.4149     3.0078   0.138 0.891672
Anxiety            1.3064     1.0841   1.205 0.242246

Residual standard error: 7.207 on 20 degrees of freedom
Multiple R-squared: 0.9036,  Adjusted R-squared: 0.8843
F-statistic: 46.87 on 4 and 20 DF, p-value: 6.951e-10

Perform backward elimination on the full model and save the final model chosen as satis.step.

> satis.step = step(satis.full)
Start:  AIC=103.18
Satisfaction ~ Age + Severity + Surgical.Medical + Anxiety

                   Df Sum of Sq    RSS    AIC
- Surgical.Medical  1       1.0 1039.9 101.20
- Anxiety           1      75.4 1114.4 102.93
<none>                          1038.9 103.18
- Severity          1     971.5 2010.4 117.68
- Age               1    3387.7 4426.6 137.41

Step:  AIC=101.2
Satisfaction ~ Age + Severity + Anxiety

           Df Sum of Sq    RSS    AIC
- Anxiety   1      74.6 1114.5 100.93
<none>                  1039.9 101.20
- Severity  1     971.8 2011.8 115.70
- Age       1    3492.7 4532.6 136.00

Step:  AIC=100.93
Satisfaction ~ Age + Severity

           Df Sum of Sq    RSS    AIC
<none>                  1114.5 100.93
- Severity  1     907.0 2021.6 113.82
- Age       1    4029.4 5143.9 137.17

> summary(satis.step)

Call:
lm(formula = Satisfaction ~ Age + Severity)

Residuals:
     Min       1Q   Median       3Q      Max
-17.2800  -5.0316   0.9276   4.2911  10.4993

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 143.4720     5.9548  24.093  < 2e-16 ***
Age          -1.0311     0.1156  -8.918 9.28e-09 ***
Severity     -0.5560     0.1314  -4.231 0.000343 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.118 on 22 degrees of freedom
Multiple R-squared: 0.8966,  Adjusted R-squared: 0.8872
F-statistic: 95.38 on 2 and 22 DF, p-value: 1.446e-11

Here we can see that the model with Age and Severity as the only predictors is chosen via backward elimination using the AIC criterion.

Weighted Least Squares (WLS) – Example 3.9 (pg. 117)
In weighted least squares we give more weight to some observations in the data than others. The general concept is that we wish to give more weight to observations where the variation in the response is small and less weight to observations where the variability in the response is large. In the scatterplot below we can see that the variation in strength increases with weeks and strength. That is, larger values of strength have more variation than smaller values.

> detach(HospFull)
> StrWeeks = read.table(file.choose(),header=T,sep=",")
> names(StrWeeks)
[1] "Weeks"    "Strength"
> attach(StrWeeks)
> head(StrWeeks)
  Weeks Strength
1    20       34
2    21       35
3    23       33
4    24       36
5    25       35
6    28       34
> plot(Weeks,Strength,main="Plot of Concrete Strength vs. Weeks")
> str.lm = lm(Strength~Weeks)
> abline(str.lm)

In WLS the parameter estimates ($\hat{\beta}_0$, $\hat{\beta}_1$) are found by minimizing the weighted least squares criterion:

$SS_w = \sum_{i=1}^{n} w_i (y_i - \beta_0 - \beta_1 x_i)^2$

The problem with weighted least squares is how to determine appropriate weights. A procedure for doing this in the case of nonconstant variance, as we have in this situation, is outlined on pg. 115 of the text. We will implement that procedure in R.

> summary(str.lm)

Call:
lm(formula = Strength ~ Weeks)

Residuals:
     Min       1Q   Median       3Q      Max
-14.3639  -4.4449  -0.0861   5.8671  11.8842

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.9360     5.1116   5.074 2.76e-05 ***
Weeks         0.3759     0.1221   3.078  0.00486 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.814 on 26 degrees of freedom
Multiple R-squared: 0.2671,  Adjusted R-squared: 0.2389
F-statistic: 9.476 on 1 and 26 DF, p-value: 0.004863

First we need to save the residuals from an OLS (equal weight for all observations) regression model fit to these data. Then we either square these residuals or take their absolute value, and fit a regression model using either residual transformation as the response. In general, I believe the absolute value of the residuals from the OLS fit works better.

> e = resid(str.lm)           save residuals from OLS fit
> abse = abs(e)               take the absolute value of the residuals
> abse.lm = lm(abse~Weeks)    perform the regression of |e| on Weeks (X)
> plot(Weeks,abse,ylab="|Residuals|",main="Absolute OLS Residuals vs. Weeks")
> abline(abse.lm)
Below is a scatterplot of the absolute residuals vs. week (X) with an OLS simple linear regression fit added.

We then save the fitted values from this regression and form weights equal to the reciprocal of the square of the fitted values, i.e. $w_i = 1/\hat{e}_i^{\,2}$, where $\hat{e}_i$ denotes the fitted value from the regression of $|e|$ on Weeks.

> ehat = fitted(abse.lm)
> wi = 1/(ehat^2)

Finally we fit the weighted least squares (WLS) line using these weights.

> str.wls = lm(Strength~Weeks,weights=wi)
> summary(str.wls)

Call:
lm(formula = Strength ~ Weeks, weights = wi)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9695 -1.0450 -0.1231  1.1507  1.5785

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.54467    1.38051  19.953  < 2e-16 ***
Weeks        0.32383    0.06769   4.784 5.94e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.119 on 26 degrees of freedom
Multiple R-squared: 0.4682,  Adjusted R-squared: 0.4477
F-statistic: 22.89 on 1 and 26 DF, p-value: 5.942e-05

The resulting WLS fit is shown below.

> plot(Weeks,Strength,main="WLS Fit to Concrete Strength Data")

> abline(str.wls)

Contrast these parameter estimates with those obtained from the OLS fit.
OLS:
> str.lm$coefficients
(Intercept)       Weeks
 25.9359842   0.3759291

WLS:
> str.wls$coefficients
(Intercept)       Weeks
 27.5446681   0.3238344

If the parameter estimates change markedly after the weighting process described above, we typically repeat the process, using the residuals from the first WLS fit to construct new weights, and continue until the parameter estimates do not change much from one weighted fit to the next. In this case the parameter estimates do not differ much, so we do not need to repeat the process. The process of repeating WLS fits is called iteratively reweighted least squares (IRLS).
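A minimal sketch of what one further IRLS iteration might look like, assuming the objects from the fits above are still in the workspace (the names e2, abse2.lm, wi2, and str.wls2 are just for illustration):

> e2 = resid(str.wls)                   residuals from the first WLS fit
> abse2.lm = lm(abs(e2)~Weeks)          re-model |residuals| vs. Weeks
> wi2 = 1/(fitted(abse2.lm)^2)          form the new weights
> str.wls2 = lm(Strength~Weeks,weights=wi2)
> str.wls2$coefficients                 compare with str.wls$coefficients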

While we will not be regularly fitting WLS models, we will see later in the course that many methods for modeling time series involve giving more weight to some observations and less weight to others. In time series modeling we typically give more weight to observations close in time to the value we are trying to predict or forecast. This is precisely the idea of Discounted Least Squares covered in section 3.7.3 (pgs. 119-133). We will not cover Discounted Least Squares because neither JMP nor R has built-in methods for performing this type of analysis, and the mathematics required to properly discuss it are beyond the prerequisites for this course. However, in Chapter 4 we will examine exponential smoothing, which does use weights in the modeling process. In exponential smoothing all observations up to time t are used to make a prediction or forecast, and we give successively smaller weights to observations as we look further back in time.

Detecting and Modeling with Autocorrelation

  • Durbin-Watson Test (see above as well)
  • Cochrane-Orcutt Method
  • Use of lagged variables in the model

Example 3.14 – Toothpaste Market Share and Price (pgs. 143-145)

In this example the toothpaste market share is modeled as a function of price. The data are collected over time, so there is the distinct possibility of autocorrelation. We first use R to fit the simple linear regression of market share (Y) on price (X).

> detach(StrWeeks)
> Toothpaste = read.table(file.choose(),header=T,sep=",")
> names(Toothpaste)
[1] "MarketShare" "Price"
> head(Toothpaste)
  MarketShare Price
1        3.63  0.97
2        4.20  0.95
3        3.33  0.99
4        4.54  0.91
5        2.89  0.98
6        4.87  0.90
> attach(Toothpaste)
> tooth.lm = lm(MarketShare~Price)
> summary(tooth.lm)

Call:
lm(formula = MarketShare ~ Price)

Residuals:
     Min       1Q   Median       3Q      Max
-0.73068 -0.29764 -0.00966  0.30224  0.81193

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   26.910      1.110   24.25 3.39e-15 ***
Price        -24.290      1.298  -18.72 3.02e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4287 on 18 degrees of freedom
Multiple R-squared: 0.9511,  Adjusted R-squared: 0.9484
F-statistic: 350.3 on 1 and 18 DF, p-value: 3.019e-13

> plot(Price,MarketShare,xlab="Price of Toothpaste",ylab="Market Share")
> abline(tooth.lm)
> dwtest(tooth.lm)

	Durbin-Watson test

data:  tooth.lm
DW = 1.1358, p-value = 0.009813
alternative hypothesis: true autocorrelation is greater than 0

The Durbin-Watson statistic is d = 1.1358, which is statistically significant (p = .0098). Thus we conclude that significant (positive) autocorrelation exists in the residuals, and hence that the errors are not independent. This is a violation of the assumptions required for OLS regression. The ACF plot below shows the autocorrelation present in the residuals from the OLS fit.

> resid.acf = acf(resid(tooth.lm))    compute ACF for the residuals from the OLS fit
> resid.acf

Autocorrelations of series ‘resid(tooth.lm)’, by lag

     0      1      2      3      4      5      6      7      8      9     10
 1.000  0.409  0.135  0.006  0.073 -0.208 -0.132 -0.040 -0.223 -0.354 -0.326
    11     12     13
-0.144 -0.153  0.072

> plot(resid.acf)

Cochrane-Orcutt Method
The Cochrane-Orcutt method for dealing with this autocorrelation transforms both the response series for market share and the predictor series for price using the lag 1 autocorrelation of the residuals, $\hat{\rho}$ (here $\hat{\rho} = .409$ from the ACF output above). We form the transformed series $y_t'$ and $x_t'$ using the following formulae:

$y_t' = y_t - \hat{\rho}\, y_{t-1}, \qquad x_t' = x_t - \hat{\rho}\, x_{t-1}, \qquad t = 2, \ldots, n$
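A sketch of how this transformation might be carried out in R (the object names rho.hat, yprime, xprime, and tooth.co are just for illustration):

> rho.hat = resid.acf$acf[2]            lag 1 autocorrelation of the residuals = .409
> n = length(MarketShare)
> yprime = MarketShare[2:n] - rho.hat*MarketShare[1:(n-1)]
> xprime = Price[2:n] - rho.hat*Price[1:(n-1)]
> tooth.co = lm(yprime~xprime)          fit OLS to the transformed series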
The Cochrane-Olcutt method for dealing with this autocorrelation transforms both the response series for market share and the predictor series for price using the lag 1 autocorrelation in the residuals. We form the series and using the following formulae: