DSCI 425 – Supervised Learning
Assignment 1 – Multiple Linear Regression (105 points)

Problem 1 – The Boston Housing Data

The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld, which discussed approaches for using housing market data to estimate the willingness to pay for clean air. The authors employed a hedonic price model, based on the premise that the price of a property is determined by structural attributes (such as size, age, condition) as well as neighborhood attributes (such as crime rate, accessibility, environmental factors). This type of approach is often used to quantify the effects of environmental factors that affect the price of a property. Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970, collected from a number of sources including the 1970 US Census and the Boston Metropolitan Area Planning Committee. The variables used to develop the Harrison-Rubinfeld housing value equation are listed in the table below.

Variables Used in the Harrison-Rubinfeld Housing Value Equation

Variable / Type / Definition / Source
CMEDV / Dependent Variable (Y) / Median value of homes in thousands of dollars / 1970 U.S. Census
RM / Structural / Average number of rooms / 1970 U.S. Census
AGE / Structural / % of units built prior to 1940 / 1970 U.S. Census
B / Neighborhood / % of population that is black / 1970 U.S. Census
LSTAT / Neighborhood / % of population that is lower socioeconomic status / 1970 U.S. Census
CRIM / Neighborhood / Crime rate measure / FBI (1970)
ZN / Neighborhood / % of residential land zoned for lots greater than 25,000 sq. ft. / Metro Area Planning Commission (1972)
INDUS / Neighborhood / % of non-retail business acres (proxy for industry) / Mass. Dept. of Commerce & Development (1965)
TAX / Neighborhood / Property tax rate / Mass. Taxpayers Foundation (1970)
PTRATIO / Neighborhood / Pupil-Teacher ratio / Mass. Dept. of Ed (1971-72)
CHAS / Neighborhood / Dummy variable indicating proximity to Charles River (1 = on river) / 1970 U.S. Census Tract maps
DIS / Accessibility / Weighted distances to major employment centers in area / Schnare dissertation (Unpublished, 1973)
RAD / Accessibility / Index of accessibility to radial highways / MIT Boston Project
NOX / Air Pollution / Nitrogen oxide concentrations (pphm) / TASSIM

Reference

Harrison, D., and Rubinfeld, D. L., “Hedonic Housing Prices and the Demand for Clean Air,” Journal of Environmental Economics and Management, 5 (1978), 81-102.

Develop a regression model for predicting CMEDV using the available predictors in the table above. Note that all variables are numeric with the exception of CHAS, which is an indicator/dummy variable indicating whether or not the census tract is located along the Charles River in Boston. The file Boston.csv on my website can be read into R as shown in the handouts.

Boston = read.table(file.choose(),header=T,sep=",")

Boston$CHAS = as.factor(Boston$CHAS)   # optional: CHAS is already 0/1 coded, so you do not have to do this

bos.lm = lm(CMEDV~.,data=Boston)
Your analysis should be thorough! Document the model development process by copying and pasting relevant R commands, output, and graphics into your write-up.
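
As a starting point for the kind of commands worth documenting, here is a minimal sketch of checking a full model and then reducing it by backward elimination. It runs on simulated data so that it is self-contained: the data frame sim, the NOISE predictor, and all numeric values are made up for illustration, and only the column names CMEDV, RM, LSTAT, and CHAS come from the variable table. The choice of step() with its default AIC criterion is an assumption, not a required method.

```r
# Simulated stand-in for Boston.csv (hypothetical values, illustration only)
set.seed(425)
n <- 200
sim <- data.frame(RM    = rnorm(n, mean = 6.3, sd = 0.7),
                  LSTAT = runif(n, min = 2, max = 35),
                  CHAS  = rbinom(n, size = 1, prob = 0.1),
                  NOISE = rnorm(n))   # pure-noise predictor, not in the real data
sim$CMEDV <- 5 + 4 * sim$RM - 0.5 * sim$LSTAT + 3 * sim$CHAS + rnorm(n, sd = 3)

# Base model using every available predictor
base.lm <- lm(CMEDV ~ ., data = sim)
print(summary(base.lm))

# Standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location,
# residuals vs. leverage); only drawn in an interactive session
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(base.lm)
  par(mfrow = c(1, 1))
}

# Backward stepwise reduction; step() uses AIC by default
step.lm <- step(base.lm, direction = "backward", trace = 0)
print(formula(step.lm))
```

With the real data, Boston and bos.lm would take the place of sim and base.lm.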

Grading rubric (50 points)

1) In this part of your analysis you will fit a simple MLR model to these data without trying to address any model deficiencies.

a) Fit a base model and discuss any deficiencies (but don’t try to fix them). (5 pts.)

b) Stepwise reduction of base model and discussion of final model. (5 pts.)

c) Use cross-validation methods to estimate the prediction error of this model using split-sample, k-fold, and the .632 bootstrap approaches. (10 pts.)

2) In this part of your analysis you will develop an MLR model that addresses any deficiencies you identified in part (1). Things to consider would be adding higher-order terms (polynomial terms) and power transformations. In the end I would like you to compare the predictive performance of this model to the one you developed in part (1).

a) Model development, documentation, and discussion. (15 pts.)

b) Fitting final model, critiquing it, and discussing any deficiencies. (5 pts.)

c) Use cross-validation methods to estimate the prediction error of this model using split-sample, k-fold, and the .632 bootstrap approaches. All prediction measures should be for the response in the ORIGINAL scale, thus you will need to back-transform your predictions in the CV process. (10 pts.)
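
To illustrate what back-transforming inside the CV process can look like, here is a minimal sketch of the three approaches for a model fit to a log-transformed response. Everything in it is an assumption made for illustration: the simulated data frame sim, the predictors x1 and x2, the log transformation, the helper rmse.orig, the 2/3 training fraction, k = 10, and B = 100. The bootstrap part averages the out-of-bag error over resamples, which is a common simplification of the leave-one-out bootstrap error used in the .632 estimator.

```r
set.seed(425)
n <- 300
sim <- data.frame(x1 = runif(n, 0, 3), x2 = rnorm(n))
sim$y <- exp(1 + 0.5 * sim$x1 + 0.3 * sim$x2 + rnorm(n, sd = 0.3))

form <- log(y) ~ x1 + x2   # model is fit on the log scale

# RMSE on the ORIGINAL scale: exp() undoes the log transformation
rmse.orig <- function(fit, newdata) {
  pred <- exp(predict(fit, newdata = newdata))
  sqrt(mean((newdata$y - pred)^2))
}

# Split-sample: train on a random 2/3, validate on the remaining 1/3
train.id <- sample(n, size = floor(2 * n / 3))
split.rmse <- rmse.orig(lm(form, data = sim[train.id, ]), sim[-train.id, ])

# k-fold: each case is predicted once by a model that did not see it
k <- 10
fold <- sample(rep(1:k, length.out = n))
sq.err <- numeric(n)
for (j in 1:k) {
  fit <- lm(form, data = sim[fold != j, ])
  pred <- exp(predict(fit, newdata = sim[fold == j, ]))
  sq.err[fold == j] <- (sim$y[fold == j] - pred)^2
}
kfold.rmse <- sqrt(mean(sq.err))

# .632 bootstrap: 0.368 * apparent error + 0.632 * out-of-bag error
B <- 100
full.fit <- lm(form, data = sim)
app.mse <- mean((sim$y - exp(predict(full.fit, newdata = sim)))^2)
oob.mse <- numeric(B)
for (b in 1:B) {
  boot.id <- sample(n, replace = TRUE)       # resample cases with replacement
  oob.id <- setdiff(1:n, boot.id)            # cases left out of this resample
  fit <- lm(form, data = sim[boot.id, ])
  pred <- exp(predict(fit, newdata = sim[oob.id, ]))
  oob.mse[b] <- mean((sim$y[oob.id] - pred)^2)
}
boot632.rmse <- sqrt(0.368 * app.mse + 0.632 * mean(oob.mse))

print(c(split = split.rmse, kfold = kfold.rmse, boot632 = boot632.rmse))
```

For a different response transformation, exp() would be replaced by the matching inverse. A model fit on the original scale needs no back-transformation, so the same loops apply with exp() removed.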

Problem 2 – Listing Price of Homes in the Twin Cities Metro Area

These data are contained in the TC Homes (train).csv file on the website. The variable descriptions are below. TC Homes (test).csv contains homes for which I would like you to use your final model to predict the list price in the ORIGINAL scale. Whatever data torturing you do to the training data will also need to be done to the test cases.

Variable / Info / Description
ListPrice / Response (Y) / Current List Price ($)
BEDS / / # of Bedrooms
BATHS / / # of Bathrooms (can be fractional)
SQFT / / Square footage of home (ft²)
LotSize / / Square footage of lot (ft²) – missing for several of the homes in these data.
YearBuilt / / Year the home was built, could be used to create a new variable called Age = 2014 - YearBuilt
ParkingSpots / / # of Parking Spots (I assume off-street parking)
HasGarage / / Garage or No (Nominal)
DOM / / Days on the market, number of days the home has been listed for sale.
BeenReduced / / Has the price been reduced from the original listing price – Y or N. (Nominal)
SoldPrev / / Has the home been sold previously? Y or N (Nominal)
Latitude / / Latitude (degrees)
Longitude / / Longitude (degrees)
ShortSale / / Is more money owed on the home than what the asking price is? Y or N (Nominal)
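
One way to make sure the training and test cases get identical treatment is to put every derived variable and recoding into a single function and apply it to both data frames. The sketch below is only an illustration: the helper name prep.homes, the logSQFT transformation, and the tiny made-up data frames are assumptions, and the Y/N coding is taken from the table descriptions rather than from the actual files.

```r
# Hypothetical helper: all data preparation is defined once here so that
# training and test homes are handled the same way.
prep.homes <- function(d) {
  d$Age <- 2014 - d$YearBuilt                  # Age as suggested in the table
  # Set factor levels explicitly so both files code Y/N identically,
  # even if one level never appears in the test file
  d$BeenReduced <- factor(d$BeenReduced, levels = c("N", "Y"))
  d$ShortSale   <- factor(d$ShortSale,   levels = c("N", "Y"))
  d$logSQFT <- log(d$SQFT)                     # example transformation only
  d
}

# Tiny made-up data frames standing in for the two CSV files
train <- data.frame(YearBuilt   = c(1950, 2001, 1987),
                    SQFT        = c(1200, 2400, 1800),
                    BeenReduced = c("Y", "N", "N"),
                    ShortSale   = c("N", "N", "Y"))
test  <- data.frame(YearBuilt   = c(1999, 2010),
                    SQFT        = c(1500, 3000),
                    BeenReduced = c("N", "N"),
                    ShortSale   = c("N", "N"))

train2 <- prep.homes(train)
test2  <- prep.homes(test)
str(test2)
```

Fixing the factor levels inside the function matters because predict() needs the test factors to carry the same levels the model saw during fitting.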

Grading Rubric (55 points)

a) Fitting base model, critiquing it, and discussing any deficiencies. (5 pts.)

b) Model development, documentation, and discussion. (20 pts.)

Consideration of assumptions

Possible predictor transformations

Stepwise procedures

c) Fitting final model, critiquing it, interpreting it, and discussing any deficiencies. (5 pts.)

d) Cross-validation results and discussion for predicting the response in the original scale. (10 pts.)

e) Give me your predicted list price for the test cases contained in the file TC Homes (test).csv using your model. I will discuss how to do this in class. (10 pts.)
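
Ahead of the in-class discussion, here is a rough sketch of the general idea for part (e), assuming the final model was fit to log(ListPrice). The simulated train and test data frames, the two-predictor model, and the column name PredListPrice are all hypothetical; only ListPrice, SQFT, and BEDS come from the variable table.

```r
set.seed(425)
# Made-up training homes (illustration only)
train <- data.frame(SQFT = runif(100, 800, 4000),
                    BEDS = sample(1:5, 100, replace = TRUE))
train$ListPrice <- exp(11 + 0.0004 * train$SQFT + 0.05 * train$BEDS +
                       rnorm(100, sd = 0.2))

# Made-up test homes; like the real test file, they have no ListPrice to use
test <- data.frame(SQFT = c(1500, 2500, 3200), BEDS = c(2, 3, 4))

# Final model fit on the log scale (an assumed transformation)
final.lm <- lm(log(ListPrice) ~ SQFT + BEDS, data = train)

# predict() returns log-scale values; exp() returns them to dollars
test$PredListPrice <- exp(predict(final.lm, newdata = test))
print(test)
```

The test data frame must contain every predictor used in the model, prepared exactly as the training data were.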
