Chapter 6: Exercise 2

a (Lasso)

iii. The lasso is less flexible than least squares and gives improved prediction accuracy when its increase in bias is smaller than its decrease in variance.

b (Ridge regression)

iii. Same reasoning as the lasso: ridge regression is less flexible and improves prediction accuracy when the decrease in variance outweighs the increase in bias.

c (Non-linear methods)

ii. Non-linear methods are more flexible and give improved prediction accuracy when their decrease in bias outweighs their increase in variance.
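To see this tradeoff in numbers, here is a minimal simulation sketch (not part of the original answer; the data-generating model, seed, and sample sizes are arbitrary assumptions) comparing the test MSE of OLS and the lasso when only a few of many predictors matter:

library(glmnet)
set.seed(1)
n = 100; p = 50
x = matrix(rnorm(n * p), n, p)
beta = c(rep(1, 5), rep(0, p - 5))  # only the first 5 predictors matter
y = drop(x %*% beta) + rnorm(n)
x.test = matrix(rnorm(n * p), n, p)
y.test = drop(x.test %*% beta) + rnorm(n)
ols.fit = lm(y ~ x)
ols.pred = drop(cbind(1, x.test) %*% coef(ols.fit))
lasso.fit = cv.glmnet(x, y, alpha = 1)
lasso.pred = predict(lasso.fit, x.test, s = "lambda.min")
c(OLS = mean((y.test - ols.pred)^2), Lasso = mean((y.test - lasso.pred)^2))

In a sparse setting like this the lasso's reduced variance typically outweighs its added bias, so its test MSE is usually lower than that of OLS; with many informative predictors the comparison can flip.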

Chapter 6: Exercise 3

a

(iv) Steadily decrease: as s increases from 0, the βs grow from 0 toward their least squares estimates. Training RSS is largest when all βs are 0 and steadily decreases to the ordinary least squares RSS.

b

(ii) Decrease initially, and then eventually start increasing in a U shape: when s = 0, all βs are 0, the model is extremely simple, and test RSS is high. As s increases, the βs take on non-zero values and the model starts fitting the test data better, so test RSS decreases. Eventually, as the βs approach their full OLS values, the model starts overfitting the training data and test RSS increases.

c

(iii) Steadily increase: when s = 0, the model effectively predicts a constant and has almost no variance. As s increases, more βs enter the model and their values grow; the βs become increasingly dependent on the training data, so variance increases.

d

(iv) Steadily decrease: when s = 0, the model effectively predicts a constant, so its predictions are far from the actual values and bias is high. As s increases, more βs become non-zero and the model fits the underlying relationship better, so bias decreases.

e

(v) Remain constant: by definition, the irreducible error does not depend on the model, so it stays constant regardless of the choice of s.
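The behaviour described in (a)-(d) can be visualized with a short lasso-path sketch (a hypothetical simulation, not part of the exercise; increasing the budget s corresponds to decreasing the penalty λ):

library(glmnet)
set.seed(2)
n = 100; p = 20
x = matrix(rnorm(n * p), n, p)
beta = c(rep(2, 5), rep(0, p - 5))
y = drop(x %*% beta) + rnorm(n, sd = 3)
x.test = matrix(rnorm(n * p), n, p)
y.test = drop(x.test %*% beta) + rnorm(n, sd = 3)
fit = glmnet(x, y, alpha = 1)  # lasso fits over a decreasing grid of lambda
train.rss = colSums((y - predict(fit, x))^2)       # one column of predictions per lambda
test.rss = colSums((y.test - predict(fit, x.test))^2)
par(mfrow = c(1, 2))
plot(fit$lambda, train.rss, type = "l", log = "x", xlab = "lambda (small lambda = large s)", ylab = "Training RSS")
plot(fit$lambda, test.rss, type = "l", log = "x", xlab = "lambda (small lambda = large s)", ylab = "Test RSS")

Training RSS falls steadily as λ shrinks (s grows), while in this kind of simulation the test RSS typically traces the U shape described in (b); the exact curve depends on the noise level and seed.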

Chapter 6: Exercise 9

a

Load and split the College data

library(ISLR)

set.seed(11)

sum(is.na(College))

## [1] 0

train.size = dim(College)[1] / 2
train = sample(1:dim(College)[1], train.size)
test = -train
College.train = College[train, ]
College.test = College[test, ]

b

The number of applications is the Apps variable.

lm.fit = lm(Apps ~ ., data = College.train)
lm.pred = predict(lm.fit, College.test)
mean((College.test[, "Apps"] - lm.pred)^2)

## [1] 1538442

The test MSE is 1538442.

c

Pick λ using College.train and report the test error on College.test.

library(glmnet)

## Warning: package 'glmnet' was built under R version 2.15.2

## Loading required package: Matrix

## Warning: package 'Matrix' was built under R version 2.15.3

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 2.15.3

## Loaded glmnet 1.9-3

train.mat = model.matrix(Apps ~ ., data = College.train)
test.mat = model.matrix(Apps ~ ., data = College.test)
grid = 10^seq(4, -2, length = 100)
mod.ridge = cv.glmnet(train.mat, College.train[, "Apps"], alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best = mod.ridge$lambda.min
lambda.best

## [1] 18.74

ridge.pred = predict(mod.ridge, newx = test.mat, s = lambda.best)
mean((College.test[, "Apps"] - ridge.pred)^2)

## [1] 1608859

The test MSE, 1608859, is slightly higher than that of OLS.

d

Pick λ using College.train and report the test error on College.test.

mod.lasso = cv.glmnet(train.mat, College.train[, "Apps"], alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = mod.lasso$lambda.min
lambda.best

## [1] 21.54

lasso.pred = predict(mod.lasso, newx = test.mat, s = lambda.best)
mean((College.test[, "Apps"] - lasso.pred)^2)

## [1] 1635280

Again, the test MSE, 1635280, is slightly higher than that of OLS.

The lasso coefficients at the chosen λ look like this:

mod.lasso = glmnet(model.matrix(Apps ~ ., data = College), College[, "Apps"], alpha = 1)
predict(mod.lasso, s = lambda.best, type = "coefficients")

## 19 x 1 sparse Matrix of class "dgCMatrix"

## 1

## (Intercept) -6.038e+02

## (Intercept) .

## PrivateYes -4.235e+02

## Accept 1.455e+00

## Enroll -2.004e-01

## Top10perc 3.368e+01

## Top25perc -2.403e+00

## F.Undergrad .

## P.Undergrad 2.086e-02

## Outstate -5.782e-02

## Room.Board 1.246e-01

## Books .

## Personal 1.833e-05

## PhD -5.601e+00

## Terminal -3.314e+00

## S.F.Ratio 4.479e+00

## perc.alumni -9.797e-01

## Expend 6.968e-02

## Grad.Rate 5.160e+00

e

Use cross-validation to fit a PCR model.

library(pls)

##

## Attaching package: 'pls'

##

## The following object(s) are masked from 'package:stats':

##

## loadings

pcr.fit = pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, College.test, ncomp = 10)
mean((College.test[, "Apps"] - data.frame(pcr.pred))^2)

## [1] 3014496

The test MSE for PCR is about 3014496.

f

Use cross-validation to fit a PLS model.

pls.fit = plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, College.test, ncomp = 10)
mean((College.test[, "Apps"] - data.frame(pls.pred))^2)

## [1] 1508987

The test MSE for PLS is about 1508987.

g

Results for OLS, lasso, and ridge are comparable. The lasso shrinks the F.Undergrad and Books coefficients to exactly zero and shrinks the coefficients of the other variables. Below are the test R² values for all models.

test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - lm.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pcr.pred))^2) / mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((College.test[, "Apps"] - data.frame(pls.pred))^2) / mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col = "red", names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main = "Test R-squared")

The plot shows that the test R² for every model except PCR is around 0.9, with PLS slightly higher than the rest. PCR has a smaller test R² of less than 0.8. All models except PCR predict the number of college applications with reasonably high accuracy.
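For a quick numeric comparison alongside the bar plot (a small optional snippet reusing the quantities computed above):

round(c(OLS = lm.test.r2, Ridge = ridge.test.r2, Lasso = lasso.test.r2, PCR = pcr.test.r2, PLS = pls.test.r2), 3)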

Chapter 7: Exercise 5
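For reference, the two estimates in this exercise are the penalized fits (paraphrasing the exercise statement, with λ → ∞ in parts a and b):

g1 = argmin over g of Σ (yi − g(xi))² + λ ∫ [g'''(x)]² dx
g2 = argmin over g of Σ (yi − g(xi))² + λ ∫ [g''''(x)]² dx

so g1 penalizes the third derivative and g2 the fourth.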

a

We'd expect g2 to have the smaller training RSS: its penalty involves a higher-order derivative, so as λ grows it remains the more flexible fit (the fourth-derivative penalty is zero for any cubic, while the third-derivative penalty is zero only for quadratics), and the extra flexibility lets it track the training data more closely.

b

We'd expect g1 to have the smaller test RSS because g2 could overfit with the extra degree of freedom.

c

Trick question: when λ = 0 the penalty term vanishes, so g1 = g2 and the two fits have identical training and test RSS.

Chapter 7: Exercise 9

Load the Boston dataset

set.seed(1)

library(MASS)

attach(Boston)

a

lm.fit = lm(nox ~ poly(dis, 3), data = Boston)

summary(lm.fit)

##

## Call:

## lm(formula = nox ~ poly(dis, 3), data = Boston)

##

## Residuals:

## Min 1Q Median 3Q Max

## -0.12113 -0.04062 -0.00974 0.02338 0.19490

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.55470 0.00276 201.02 < 2e-16 ***

## poly(dis, 3)1 -2.00310 0.06207 -32.27 < 2e-16 ***

## poly(dis, 3)2 0.85633 0.06207 13.80 < 2e-16 ***

## poly(dis, 3)3 -0.31805 0.06207 -5.12 4.3e-07 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 0.0621 on 502 degrees of freedom

## Multiple R-squared: 0.715, Adjusted R-squared: 0.713

## F-statistic: 419 on 3 and 502 DF, p-value: <2e-16

dislim = range(dis)
dis.grid = seq(from = dislim[1], to = dislim[2], by = 0.1)
lm.pred = predict(lm.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, lm.pred, col = "red", lwd = 2)

The summary shows that all polynomial terms are significant predictors of nox. The plot shows a smooth curve that fits the data fairly well.

b

We fit polynomials of degree 1 to 10 and record the training RSS.

all.rss = rep(NA, 10)
for (i in 1:10) {
    lm.fit = lm(nox ~ poly(dis, i), data = Boston)
    all.rss[i] = sum(lm.fit$residuals^2)
}
all.rss

## [1] 2.769 2.035 1.934 1.933 1.915 1.878 1.849 1.836 1.833 1.832

As expected, train RSS monotonically decreases with degree of polynomial.

c

We use a 10-fold cross validation to pick the best polynomial degree.

library(boot)
all.deltas = rep(NA, 10)
for (i in 1:10) {
    glm.fit = glm(nox ~ poly(dis, i), data = Boston)
    all.deltas[i] = cv.glm(Boston, glm.fit, K = 10)$delta[2]
}
plot(1:10, all.deltas, xlab = "Degree", ylab = "CV error", type = "l", pch = 20, lwd = 2)

The 10-fold CV shows that the CV error decreases as the degree increases from 1 to 3, stays almost constant up to degree 5, and then starts increasing for higher degrees. We pick 4 as the best polynomial degree.
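The minimizing degree can also be read off programmatically (the exact winner can shift a little from run to run because the CV folds are random):

which.min(all.deltas)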

d

We see that dis ranges from about 1 to 13. We split this range into four roughly equal intervals and place knots at [4, 7, 11]. Note: the bs() function expects either a df or a knots argument; if knots are supplied, they are used directly and df is effectively ignored.

library(splines)

sp.fit = lm(nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)

summary(sp.fit)

##

## Call:

## lm(formula = nox ~ bs(dis, df = 4, knots = c(4, 7, 11)), data = Boston)

##

## Residuals:

## Min 1Q Median 3Q Max

## -0.1246 -0.0403 -0.0087 0.0247 0.1929

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

## (Intercept) 0.7393 0.0133 55.54 < 2e-16

## bs(dis, df = 4, knots = c(4, 7, 11))1 -0.0886 0.0250 -3.54 0.00044

## bs(dis, df = 4, knots = c(4, 7, 11))2 -0.3134 0.0168 -18.66 < 2e-16

## bs(dis, df = 4, knots = c(4, 7, 11))3 -0.2662 0.0315 -8.46 3.0e-16

## bs(dis, df = 4, knots = c(4, 7, 11))4 -0.3980 0.0465 -8.56 < 2e-16

## bs(dis, df = 4, knots = c(4, 7, 11))5 -0.2568 0.0900 -2.85 0.00451

## bs(dis, df = 4, knots = c(4, 7, 11))6 -0.3293 0.0633 -5.20 2.9e-07

##

## (Intercept) ***

## bs(dis, df = 4, knots = c(4, 7, 11))1 ***

## bs(dis, df = 4, knots = c(4, 7, 11))2 ***

## bs(dis, df = 4, knots = c(4, 7, 11))3 ***

## bs(dis, df = 4, knots = c(4, 7, 11))4 ***

## bs(dis, df = 4, knots = c(4, 7, 11))5 **

## bs(dis, df = 4, knots = c(4, 7, 11))6 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Residual standard error: 0.0619 on 499 degrees of freedom

## Multiple R-squared: 0.718, Adjusted R-squared: 0.715

## F-statistic: 212 on 6 and 499 DF, p-value: <2e-16

sp.pred = predict(sp.fit, list(dis = dis.grid))
plot(nox ~ dis, data = Boston, col = "darkgrey")
lines(dis.grid, sp.pred, col = "red", lwd = 2)

The summary shows that all terms in the spline fit are significant. The plot shows that the spline fits the data well except at the extreme values of dis (especially dis > 10).
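As a side check (an optional sketch, not part of the original solution), the basis dimensions confirm that the supplied knots, rather than df = 4, determine the fit: three interior knots with a cubic spline give six basis functions, matching the six coefficients above.

basis = bs(dis, knots = c(4, 7, 11))  # cubic by default; the df argument is unnecessary once knots are given
dim(basis)            # 506 rows, 6 columns
attr(basis, "knots")  # 4 7 11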

e

We fit regression splines with df between 3 and 16 and record the training RSS.

all.cv = rep(NA, 16)
for (i in 3:16) {
    lm.fit = lm(nox ~ bs(dis, df = i), data = Boston)
    all.cv[i] = sum(lm.fit$residuals^2)
}
all.cv[-c(1, 2)]

## [1] 1.934 1.923 1.840 1.834 1.830 1.817 1.826 1.793 1.797 1.789 1.782

## [12] 1.782 1.783 1.784

Training RSS generally decreases as df increases, with a few small bumps along the way; it reaches its minimum around df = 13-14 and then increases slightly at df = 15 and 16.

f

Finally, we use 10-fold cross-validation to find the best df, trying all integer values of df between 3 and 16.

all.cv = rep(NA, 16)
for (i in 3:16) {
    lm.fit = glm(nox ~ bs(dis, df = i), data = Boston)
    all.cv[i] = cv.glm(Boston, lm.fit, K = 10)$delta[2]
}

## Warning: some 'x' values beyond boundary knots may cause ill-conditioned bases

plot(3:16, all.cv[-c(1, 2)], lwd = 2, type = "l", xlab = "df", ylab = "CV error")

The CV error is more jumpy in this case, but it attains its minimum at df = 10. We pick 10 as the optimal degrees of freedom.
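To extract the chosen df directly from the CV vector (again, the minimizer can shift slightly with different random folds):

(3:16)[which.min(all.cv[3:16])]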

Chapter 7: Exercise 10

a

set.seed(1)

library(ISLR)

library(leaps)

attach(College)

train = sample(length(Outstate), length(Outstate)/2)
test = -train
College.train = College[train, ]
College.test = College[test, ]
reg.fit = regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
reg.summary = summary(reg.fit)
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min.cp = min(reg.summary$cp)
std.cp = sd(reg.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min.bic = min(reg.summary$bic)
std.bic = sd(reg.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l", ylim = c(0.4, 0.84))
max.adjr2 = max(reg.summary$adjr2)
std.adjr2 = sd(reg.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)

The Cp, BIC, and adjusted R² scores all show that a subset of size 6 is the smallest size whose score is within 0.2 standard deviations of the optimum. We pick 6 as the best subset size and find the best 6 variables using the entire data set.

reg.fit = regsubsets(Outstate ~ ., data = College, method = "forward")
coefi = coef(reg.fit, id = 6)
names(coefi)

## [1] "(Intercept)" "PrivateYes" "Room.Board" "PhD" "perc.alumni"

## [6] "Expend" "Grad.Rate"

b

library(gam)

## Loading required package: splines

## Loaded gam 1.09

gam.fit = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
    s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data = College.train)
par(mfrow = c(2, 3))
plot(gam.fit, se = TRUE, col = "blue")

c

gam.pred = predict(gam.fit, College.test)
gam.err = mean((College.test$Outstate - gam.pred)^2)
gam.err

## [1] 3745460

gam.tss = mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.rss = 1 - gam.err / gam.tss  # despite the name, this is the test R-squared
test.rss

## [1] 0.7697

We obtain a test R² of 0.77 using a GAM with 6 predictors. This is a slight improvement over the test R² of 0.74 obtained using OLS.
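For comparison, the OLS baseline can be reproduced along these lines (a sketch under the assumption that the quoted 0.74 came from an OLS fit with the same six predictors on the same train/test split; the original text does not show that fit):

ols.fit = lm(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate, data = College.train)
ols.pred = predict(ols.fit, College.test)
1 - mean((College.test$Outstate - ols.pred)^2) / gam.tss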

d

summary(gam.fit)

##

## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,

## df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,

## df = 2), data = College.train)

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -4977.7 -1184.5 58.3 1220.0 7688.3

##

## (Dispersion Parameter for gaussian family taken to be 3300711)

##

## Null Deviance: 6.222e+09 on 387 degrees of freedom

## Residual Deviance: 1.231e+09 on 373 degrees of freedom

## AIC: 6942

##

## Number of Local Scoring Iterations: 2

##

## Anova for Parametric Effects

## Df Sum Sq Mean Sq F value Pr(>F)

## Private 1 1.78e+09 1.78e+09 539.1 < 2e-16 ***

## s(Room.Board, df = 2) 1 1.22e+09 1.22e+09 370.2 < 2e-16 ***

## s(PhD, df = 2) 1 3.82e+08 3.82e+08 115.9 < 2e-16 ***

## s(perc.alumni, df = 2) 1 3.28e+08 3.28e+08 99.5 < 2e-16 ***

## s(Expend, df = 5) 1 4.17e+08 4.17e+08 126.2 < 2e-16 ***

## s(Grad.Rate, df = 2) 1 5.53e+07 5.53e+07 16.8 5.2e-05 ***

## Residuals 373 1.23e+09 3.30e+06

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## Anova for Nonparametric Effects

## Npar Df Npar F Pr(F)

## (Intercept)

## Private

## s(Room.Board, df = 2) 1 3.56 0.060 .

## s(PhD, df = 2) 1 4.34 0.038 *

## s(perc.alumni, df = 2) 1 1.92 0.167

## s(Expend, df = 5) 4 16.86 1e-12 ***

## s(Grad.Rate, df = 2) 1 3.72 0.055 .

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The non-parametric ANOVA test shows strong evidence of a non-linear relationship between the response and Expend, and moderately strong evidence (p-values near 0.05) of non-linear relationships between the response and PhD or Grad.Rate.