STAT 425 – Modern Methods of Data Analysis (46 pts.)
Assignment 7 – Regression Trees using CART/RPART
Problem 1 – BOSTON HOUSING DATA
The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld, which discussed approaches for using housing market data to estimate the willingness to pay for clean air. The authors employed a hedonic price model, based on the premise that the price of a property is determined by structural attributes (such as size, age, and condition) as well as neighborhood attributes (such as crime rate, accessibility, and environmental factors). This type of approach is often used to quantify the effect of environmental factors on property prices.
Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970 from a number of sources, including the 1970 US Census and the Boston Metropolitan Area Planning Committee. The variables used to develop the Harrison-Rubinfeld housing value equation are listed in the table below. (These data are in the data frame Boston.working.)
Variables Used in the Harrison-Rubinfeld Housing Value Equation
Variable / Type / Definition / Source
CMEDV / Dependent Variable / Median value of homes in thousands of dollars / 1970 U.S. Census
RM / Structural / Average number of rooms / 1970 U.S. Census
AGE / Structural / % of units built prior to 1940 / 1970 U.S. Census
B / Neighborhood / Black % of population / 1970 U.S. Census
LSTAT / Neighborhood / % of population that is lower socioeconomic status / 1970 U.S. Census
CRIM / Neighborhood / Crime rate / FBI (1970)
ZN / Neighborhood / % of residential land zoned for lots > 25,000 sq. ft. / Metro Area Planning Commission (1972)
INDUS / Neighborhood / % of non-retail business acres (proxy for industry) / Mass. Dept. of Commerce & Development (1965)
TAX / Neighborhood / Property tax rate / Mass. Taxpayers Foundation (1970)
PTRATIO / Neighborhood / Pupil-teacher ratio / Mass. Dept. of Education (1971-72)
CHAS / Neighborhood / Dummy variable indicating proximity to the Charles River (1 = on river) / 1970 U.S. Census tract maps
DIS / Accessibility / Weighted distances to major employment centers in the area / Schnare dissertation (unpublished, 1973)
RAD / Accessibility / Index of accessibility to radial highways / MIT Boston Project
NOX / Air Pollution / Nitrogen oxide concentration (pphm) / TASSIM
Reference
Harrison, D., and Rubinfeld, D. L., “Hedonic Housing Prices and the Demand for Clean Air,” Journal of Environmental Economics and Management, 5 (1978), 81-102.
a) Use RPART to model the response, log(CMEDV). Show how you used cross-validation to find the final number of terms in your model. (5 pts.)
> # Response is log(CMEDV); the original CMEDV column (column 1) is dropped
> Boston3 = data.frame(logCMEDV = log(Boston.working$CMEDV), Boston.working[, -1])
> names(Boston3)
[1] "logCMEDV" "CRIM" "ZN" "INDUS" "CHAS" "RM" "AGE" "DIS" "RAD"
[10] "TAX" "PTRATIO" "B" "LSTAT" "NOX"
b) Summarize and plot (using post()) the results from your final model in part (a). Find the R2 for your final model and estimate the RMSEP using the residuals from the fit. Plot ŷ vs. y and e vs. ŷ. Discuss all. (8 pts.)
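A sketch of the summaries and plots requested in part (b), assuming the pruned tree from the sketch above is stored in bos.final (again, names are illustrative):
> summary(bos.final)
> post(bos.final, filename = "")   # draw the fitted tree on the current device
> yhat = predict(bos.final)        # fitted values
> e = Boston3$logCMEDV - yhat      # residuals from the fit
> R2 = 1 - sum(e^2) / sum((Boston3$logCMEDV - mean(Boston3$logCMEDV))^2)
> RMSEP = sqrt(mean(e^2))
> plot(Boston3$logCMEDV, yhat, xlab = "log(CMEDV)", ylab = "Fitted values")
> abline(0, 1)
> plot(yhat, e, xlab = "Fitted values", ylab = "Residuals")
> abline(h = 0)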
c) Use MCCV to estimate the RMSEP. The code for the rpart.cv function is in your notes. How does it compare to your RMSEP from part (b)? How does it compare to the MARS and OLS RMSEPs from the last assignment? (3 pts.)
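rpart.cv here refers to the function in the course notes. Purely as a placeholder, a Monte Carlo cross-validation loop for an rpart model could be structured along these lines (the function name, split fraction, and number of replicates are illustrative, not the notes' version):
> rpart.cv = function(form, data, y, cp, p = 0.667, B = 100) {
      # form: model formula, e.g. logCMEDV ~ .
      # cp:   complexity parameter chosen via cross-validation in part (a)
      # p:    fraction of observations used for training in each random split
      # B:    number of Monte Carlo training/test splits
      n = nrow(data)
      rmsep = numeric(B)
      for (b in 1:B) {
          train = sample(1:n, floor(p * n))
          fit.b = rpart(form, data = data[train, ], cp = cp, xval = 0)
          pred = predict(fit.b, newdata = data[-train, ])
          rmsep[b] = sqrt(mean((y[-train] - pred)^2))
      }
      mean(rmsep)
  }
> # cp = 0.01 below is only a placeholder for your chosen value
> rpart.cv(logCMEDV ~ ., Boston3, Boston3$logCMEDV, cp = 0.01)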
d) Use the group.tree command to store the terminal node information for each observation in these data. Do these terminal nodes cluster geographically in the Boston area? Gee whiz, coach, how am I supposed to answer that? Discuss this plot! (3 pts.)
(1) Copy in the function below…
> plot.grps = function(x, y, grps, cex = 1, gv = deparse(substitute(grps)), ...) {
      # Plot (x, y) with a different color and plotting character for each group,
      # then place a legend wherever the user clicks.
      groups <- as.factor(grps)
      grps <- as.numeric(groups)   # integer group codes
      if(min(grps) == 0) {
          grps <- grps + 1         # make sure codes start at 1
      }
      plotchs <- c("+", "o", "*", "x", "O", "X", "a", "b", "c", "d", "e")
      plot(x, y, type = "n", ...)  # empty plot; points added below
      for(i in 1:length(x)) {
          points(x[i], y[i], cex = cex, col = grps[i], pch = plotchs[grps[i]])
      }
      # One legend entry (color + symbol) per group
      colors <- unique(grps)
      pchs <- paste(plotchs[unique(grps)], collapse = "")
      print("Click plot to add upper-left corner of legend")
      legend(locator(1), legend = as.character(unique(groups)), col = colors,
             pch = pchs, title = gv)
      invisible()
  }
(2) Then run the commands below to construct the plot (here mymodel denotes your final RPART model from part (a)).
> bos.grps = group.tree(mymodel)
> plot.grps(Boston2$lon, Boston2$lat, bos.grps)
Note: Boston2 is another version of the Boston Housing data, available in my .Rdata directory, that contains the longitude and latitude of the 506 census tracts in the data.
e) Obtain a bagged estimate using the same cp (complexity parameter) you used for your final CART/RPART model. Compare the R2 and RMSEP of the bagged fit to those from CART. (3 pts.)
There is no residual extraction option for a bagged estimate. To get residuals use the following:
> resid = Boston3$logCMEDV - predict(mybagmodel)
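If, for example, the bagging() function from the ipred package is used (the package choice, object names, nbagg, and the cp value below are assumptions shown only for illustration), the bagged fit and its in-sample summaries might be obtained as follows:
> library(ipred)
> # 100 bootstrap trees, each grown with the same cp as the final CART model
> mybagmodel = bagging(logCMEDV ~ ., data = Boston3, nbagg = 100,
+     control = rpart.control(cp = 0.01, xval = 0))
> resid = Boston3$logCMEDV - predict(mybagmodel)
> R2.bag = 1 - sum(resid^2) / sum((Boston3$logCMEDV - mean(Boston3$logCMEDV))^2)
> RMSEP.bag = sqrt(mean(resid^2))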
f) Use the bag.cv function from your notes to perform MCCV. How does the bagged model compare to the unbagged model in terms of RMSEP? (3 pts.)
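As with rpart.cv, bag.cv is the function from your notes; a placeholder version with the same structure as the rpart.cv sketch above might look like this:
> bag.cv = function(form, data, y, cp, nbagg = 100, p = 0.667, B = 50) {
      n = nrow(data)
      rmsep = numeric(B)
      for (b in 1:B) {
          train = sample(1:n, floor(p * n))   # random training split
          fit.b = bagging(form, data = data[train, ], nbagg = nbagg,
                          control = rpart.control(cp = cp, xval = 0))
          pred = predict(fit.b, newdata = data[-train, ])
          rmsep[b] = sqrt(mean((y[-train] - pred)^2))   # test-set RMSEP
      }
      mean(rmsep)
  }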
Problem 2 – PREDICTING KELLEY BLUE BOOK VALUE FOR CARS
The goal of this problem is to develop a model to predict the Kelley Blue Book value for a car using information about it. All of the cars in this data set are less than one year old and considered to be in excellent condition. The response is Price ($) and the potential predictors are described below:
· Mileage: number of miles the car has been driven
· Make: Buick, Cadillac, Chevy, Pontiac, Saab, Saturn (0 or 1 for each)
· Cylinder: number of cylinders in the engine
· Doors: number of doors
· Cruise: indicator variable representing whether the car has cruise control (1 = cruise)
· Sound: indicator variable representing whether the car has upgraded speakers (1 = upgraded)
· Leather: indicator variable representing whether the car has leather seats (1 = leather)
· convertible, coupe, hatchback, sedan, wagon: indicator variables for these types of cars (1 = yes, 0 = no)
These data are contained in the package caret from CRAN.
> library(caret)
> data(cars)
> Cars = cars   # make a local copy of these data
> names(Cars)
[1] "Price" "Mileage" "Cylinder" "Doors" "Cruise" "Sound" "Leather"
[8] "Buick" "Cadillac" "Chevy" "Pontiac" "Saab" "Saturn" "convertible"
[15] "coupe" "hatchback" "sedan" "wagon"
a) Use CART to fit a model to predict the Kelley Blue Book value of a car. Show how you used the results of cross-validation to arrive at your final model. (5 pts.)
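The same strategy as in Problem 1(a) applies here; a minimal sketch (object names and the starting cp are again placeholders):
> library(rpart)
> car.rpart = rpart(Price ~ ., data = Cars, cp = 0.001, xval = 10)
> printcp(car.rpart)   # cross-validated error for each candidate tree size
> plotcp(car.rpart)
> car.final = prune(car.rpart,
+     cp = car.rpart$cptable[which.min(car.rpart$cptable[, "xerror"]), "CP"])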
b) Summarize and plot (using post()) the results from your final model in part (a). Find the R2 for your final model and estimate the RMSEP using the residuals from the fit. Plot ŷ vs. y and e vs. ŷ. Discuss all. (8 pts.)
c) Use MCCV to estimate the RMSEP. The code for the rpart.cv function is in your notes. How does it compare to your RMSEP from part (b)? (2 pts.)
d) Obtain a bagged estimate using the same cp (complexity parameter) you used for your final CART/RPART model. Compare the R2 and RMSEP from the bagged fit to that from CART. (3 pts.)
e) Use the bag.cv function from your notes to perform MCCV. How does the bagged model compare to the unbagged model in terms of RMSEP? (3 pts.)