BA 368: Business Analytics, Case 3.2

Use the Durango Big Data Set available on the course webpage to answer the following questions. Thanks to the class for putting this together.

1) Graph and print just the square feet (as the x-variable) versus the zestimate (as the y-variable). Add the trend line and note the r2 value. Looking at the graph, there should be several pretty obvious outliers – points that are very different from the rest of the data.

2) Now let’s eliminate the outliers to make a more consistent data set. So, eliminate the data points with either square feet > 5000 or zestimate > $1.2M. I think there are 7 of them.
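The outlier rule above can be sketched in a few lines of Python. The rows below are made up for illustration and are not the Durango data; only the two cutoffs (5000 square feet and $1.2M) come from the question, and the function name is hypothetical.

```python
# Hypothetical rows: (square_feet, zestimate). These are NOT the Durango data.
homes = [
    (1441, 430_000),
    (2200, 610_000),
    (5600, 900_000),    # square feet above the cutoff
    (3100, 1_450_000),  # zestimate above the cutoff
    (5000, 1_200_000),  # exactly on both cutoffs, so it is kept
]

MAX_SQFT = 5000            # cutoff from the question
MAX_ZESTIMATE = 1_200_000  # cutoff from the question ($1.2M)


def is_outlier(sqft, zestimate):
    """True if either value is strictly above its cutoff."""
    return sqft > MAX_SQFT or zestimate > MAX_ZESTIMATE


kept = [row for row in homes if not is_outlier(*row)]
removed = [row for row in homes if is_outlier(*row)]

print(len(kept), "kept;", len(removed), "removed")
```

Note that the rule uses "or": a home is dropped if it breaks either cutoff, not only when it breaks both.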

  1. Re-graph and print just square feet versus the zestimate. List the equation of the line and interpret the slope and y-intercept of the line. According to this line, what does a square foot of housing cost in Durango? What is the new r2 value? Interpret it.

y = 181x + 195998. The base price of a house is about $196k, plus $181 per square foot. r2 is 82%: square footage explains 82% of the variation in house price.
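For anyone who wants to see what the spreadsheet trend line is doing, here is a minimal sketch of the least-squares slope, intercept, and r2 computed by hand. The data points are made up (they are not the Durango data), and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y that the fitted line accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up (square feet, price) pairs, not the Durango data.
sqft = [1000, 1500, 2000, 2500, 3000]
price = [380_000, 460_000, 570_000, 640_000, 750_000]

slope, intercept = fit_line(sqft, price)
r2 = r_squared(sqft, price, slope, intercept)
print(f"y = {slope:.1f}x + {intercept:.0f}, r2 = {r2:.3f}")
```

The slope is read as dollars per additional square foot and the intercept as the fitted price at zero square feet, which is the same interpretation used in the answer above.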

  1. About how much is my house worth, at 1441 square feet? 181*1441 + 195998 = 456819 – this is too high since my house is not right in downtown.
  2. Try forcing the y-intercept to 0. (Don’t need to print this one). What does the slope say now about the cost of a square foot? About $260.
  3. What if you knew that a typical lot in Durango was about $100,000? Force the y-intercept to 100,000 and re-interpret the slope. About $219 per square foot.
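Forcing the intercept changes the slope because the line must now pass through a chosen point on the y-axis. A minimal sketch, assuming the usual least-squares criterion with the intercept held fixed: the slope becomes sum(x * (y - intercept)) / sum(x * x). The data below are made up, not the Durango data, and the function name is hypothetical.

```python
def slope_with_fixed_intercept(xs, ys, intercept):
    """Least-squares slope when the intercept is held at a chosen value."""
    numerator = sum(x * (y - intercept) for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs)
    return numerator / denominator


# Made-up (square feet, price) pairs, not the Durango data.
sqft = [1000, 2000, 3000]
price = [300_000, 500_000, 700_000]

for fixed in (0, 100_000):
    slope = slope_with_fixed_intercept(sqft, price, fixed)
    print(f"intercept {fixed}: about ${slope:.0f} per square foot")
```

Lowering the forced intercept pushes more of each home's price into the per-square-foot slope, which matches the pattern in the answers: the slope rises as the intercept goes from 195998 to 100,000 to 0.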

3) Using just the line you found in 2a),

  1. Forecast the cost of each of the 92 homes.
  2. Calculate the absolute percentage error for each data point, then compute the mean absolute percentage error (MAPE) and median absolute percentage error (Median APE, which sounds like a terrible movie) for this method. MAPE is about 10% and Median APE is about 7%.
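The two error measures can be sketched as follows. The forecast uses the line from 2a) (y = 181x + 195998); the three homes are made up and are not the Durango data, so the printed numbers will not match the 10% and 7% above.

```python
from statistics import mean, median


def absolute_percentage_errors(actuals, forecasts):
    """|actual - forecast| / |actual|, as a percentage, for each home."""
    return [abs(a - f) * 100 / abs(a) for a, f in zip(actuals, forecasts)]


def forecast_2a(sqft):
    """Line from part 2a): y = 181x + 195998."""
    return 181 * sqft + 195998


# Made-up (square feet, zestimate) pairs, not the Durango data.
homes = [(1200, 400_000), (1800, 540_000), (2400, 600_000)]

actuals = [z for _, z in homes]
forecasts = [forecast_2a(s) for s, _ in homes]
apes = absolute_percentage_errors(actuals, forecasts)

print(f"MAPE = {mean(apes):.1f}%, Median APE = {median(apes):.1f}%")
```

The median is less sensitive than the mean to a few badly forecast homes, which is one reason to report both.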

4) Now, run multiple linear regression with the zestimate as the y-variable and all four other columns as the x-variables.

  1. Interpret both the r value and the r2 value.

r = .93, great fit. r2 = 87%: the four variables combined explain 87% of the variation in house price.

  1. The “Significance F” is really the overall p-value. What does this tell us?

The p-value is essentially 0, so the model as a whole is statistically significant.

  1. List the multiple linear regression equation.

y = 187908 + 149*(square feet) – 14770*(bedrooms) + 50527*(bathrooms) + 1.5*(how old the house is)
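The spreadsheet's regression tool produces these coefficients directly. As a rough illustration of what happens underneath, here is a sketch that solves the least-squares normal equations in plain Python. It is not how the answer above was produced, the data are made up, the function names are hypothetical, and it does not compute p-values.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up rows of (square feet, bathrooms) and prices, not the Durango data.
rows = [(1000, 1.0), (1500, 2.0), (2000, 2.0), (2500, 3.0), (3000, 2.5)]
prices = [350_000, 470_000, 530_000, 650_000, 700_000]

coefficients = fit_multiple(rows, prices)
print([round(c, 1) for c in coefficients])
```

Each coefficient is read the same way as in the equation above: the change in price for one more unit of that variable, holding the others fixed.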

  1. What are the individual p-values for the y-intercept and x-variables?

Y-intercept: 0%
Square feet: 0%
Bedrooms: 4%
Bathrooms: 0%
Age of house: 97%

But the 97% is due almost entirely to a typo: a year entered as 204 instead of 2004. If that is fixed, all p-values become less than 5% and every factor is statistically significant (though bedrooms still has a weird negative slope).
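A quick range check on the year column would catch an entry like 204 before it distorts the regression. This is only a sketch: the plausible range (1850 to 2025) and the function name are assumptions, and the list below is made up apart from the 204 typo described above.

```python
def implausible_years(years, earliest=1850, latest=2025):
    """Return (row_index, value) for years outside the assumed plausible range."""
    return [(i, y) for i, y in enumerate(years) if not earliest <= y <= latest]


# Made-up column that includes the kind of typo described above.
years_built = [1979, 2004, 204, 1995]

print(implausible_years(years_built))
```

Any flagged rows still need a person to decide what the intended value was.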

Y-intercept: 0%
Square feet: 0%
Bedrooms: 3%
Bathrooms: 0%
Age of house: 2%

5) Eliminate the x-variable with a p-value > 20%, keep the three x-variables with p-values < 20%, and rerun the multiple linear regression.

  1. What are the individual p-values now?

Y-intercept: 0%
Square feet: 0%
Bedrooms: 4%
Bathrooms: 0%
  1. One of the factors should have a negative slope that doesn’t make sense. Eliminate that as well and rerun the regression with just the remaining x-variables.
  2. What are the p-values now?

All 0%!

  1. Interpret both the r value and the r2 value.

r = 0.927 – good fit. r2 = 86%: these two remaining variables explain 86% of the variation in price.

  1. The “Significance F” is really the overall p-value. What does this tell us?

It is essentially 0, so this model is statistically significant.

  1. List the multiple linear regression equation and interpret all parts of it.

y = 161650 (base price) + 142*(square feet) + 46,000*(bathrooms). That is, about $142 per square foot and about $46K per bathroom on top of the base price.

  1. About how much is my house worth (sqft = 1441, bed = 3, bath = 1.5, year = 1979)?

$436k
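As a check on the arithmetic, the sketch below plugs the house into the two-variable equation and, for comparison, into the one-variable line from part 2. It uses the rounded coefficients as listed in this key, so the two-variable result comes out near $435k; the roughly $1k gap from the $436k above is likely because the spreadsheet uses unrounded coefficients. The function names are hypothetical.

```python
def value_two_variable(sqft, bathrooms):
    """Part 5 equation with the rounded coefficients listed in this key."""
    return 161650 + 142 * sqft + 46000 * bathrooms


def value_one_variable(sqft):
    """Part 2a) line: y = 181x + 195998."""
    return 181 * sqft + 195998


# Bedrooms and year are not used: both were dropped from the final model.
print(value_two_variable(1441, 1.5))
print(value_one_variable(1441))
```

The two-variable model gives a lower estimate for this house than the one-variable line, since 1.5 bathrooms adds less than the larger base price and per-square-foot rate in the simpler model.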

  1. Use this equation to forecast all 92 home values and then calculate the mean absolute percentage error (MAPE) and median absolute percentage error (Median APE) for this method.

9% and 7%

6) Compare your results from part 2) and part 5). Part 2) is the simplest model using only one x-variable whereas part 5) has two x-variables. There is typically a trade-off between simplicity and accuracy. In this case, is it worth adding an extra variable – does it increase the accuracy of our forecasts enough to justify the increased complexity?

Could argue either way, but I would lean towards simplicity.