Statistic 302: Project 1

PREDICTING IS AS SIMPLE AS SIMPLE REGRESSION

Jimmy Huang

1 Introduction

Eastwestside Movers, an intracity moving company, has typically relied on a trained estimator to determine the number of labor hours needed for a move. This has proved useful in the past, but the company would like a more reliable and accurate way to predict labor hours. As a preliminary step, the company has collected data for 36 moves (refer to the provided data) in which the origin and destination were within the borough of Manhattan in New York City and the travel time was an insignificant portion of the hours worked. With these data in hand, the company is asking a statistician to analyze them and develop a model that predicts labor hours from the number of cubic feet to be moved from the apartment of origin.

2 Preliminary Analysis

For the analysis, the statistical package JMPIN is used to explore the data and examine the relationship between the number of cubic feet to be moved and the labor hours required.

Across the 36 moves, the total labor required was 1,042.5 hours and the total volume moved was 22,520 cubic feet. On average, therefore, it took approximately 0.046292 hours to move one cubic foot.

Copyright © 2002 Jimmy Huang, June 13, 2002


Figure 1 shows a scatter plot of the 36 paired observations of cubic feet and labor hours. Labor hours appear to increase roughly in proportion to the cubic feet moved, so a linear model is a reasonable choice to fit the points.

Figure 1 Scatter plot of the collected 36 pairs of data



3 Fitting Model

The scatter diagram suggests a linear association between the cubic feet moved and the labor hours required. Using JMPIN to conduct a simple regression fit of Hours by Feet (refer to Appendix A for details), we obtain the following linear model for the data:

Hours = -2.36966 + 0.0500803 Feet

with correlation coefficient r = 0.942998 and r² = 0.889246. The value of r² indicates that the fitted model explains a large proportion of the total variation: approximately 88.9% of the variation in labor hours is explained by the model.
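The fitted coefficients can be cross-checked from the summary statistics alone, since for simple regression the slope equals r times the ratio of the standard deviations and the line passes through the point of means. The following is a minimal Python sketch of that check, not part of the JMPIN analysis; the function name `fit_from_summary` is hypothetical, and the inputs are copied from Table 5 in Appendix A.

```python
# Summary statistics copied from the JMPIN output (Table 5, Appendix A).
mean_feet, mean_hours = 625.555556, 28.9583333
sd_feet, sd_hours = 280.582704, 14.9010426
r = 0.942998  # correlation coefficient reported with the fit


def fit_from_summary(r, sd_x, sd_y, mean_x, mean_y):
    """Return (intercept, slope) of the least-squares line from summary statistics."""
    slope = r * sd_y / sd_x                # b1 = r * S_Y / S_X
    intercept = mean_y - slope * mean_x    # the line passes through (mean_x, mean_y)
    return intercept, slope


b0, b1 = fit_from_summary(r, sd_feet, sd_hours, mean_feet, mean_hours)
print(f"Hours = {b0:.5f} + {b1:.7f} * Feet")
print(f"r^2 = {r ** 2:.6f}")
```

Because the inputs are rounded, the recomputed values agree with the JMPIN estimates only to about five significant digits.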

Next, we conduct a hypothesis test for zero slope (β1 = 0) to verify that a straight-line model in cubic feet is better than a model that does not include cubic feet at all. For the full hypothesis testing procedure, refer to Appendix C. Since we reject the null hypothesis of zero slope, the choice of the linear model is reasonable. Moreover, the Analysis of Variance table in the JMPIN output (refer to Table 3 in Appendix A) also shows that the null hypothesis of zero slope should be rejected, as the p-value < 0.0001 is below the significance level of 0.05.

The residual plot for Hours and the histogram of the residuals (refer to Appendix B) show no evident violation of the model assumptions.

4 Application of the Fitted Model

With the relationship between the labor hours required and the cubic feet to be moved described by the linear model:
Hours = -2.36966 + 0.0500803 Feet

the company can now predict labor hours easily and more reliably. For example, to estimate the labor hours needed to move X0 = 800 cubic feet, the company only needs to substitute 800 into the model to obtain the point estimate:

Predicted Hours = -2.36966 + 0.0500803 * 800 ≈ 37.69
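The substitution can be wrapped in a small helper so that an estimator can price any move size. This is only an illustrative Python sketch; the function name `predict_hours` and the constant names are hypothetical, and the coefficients are the JMPIN estimates from Table 4.

```python
B0 = -2.36966    # intercept, in hours (Table 4)
B1 = 0.0500803   # slope, in hours per cubic foot (Table 4)


def predict_hours(feet: float) -> float:
    """Point prediction of labor hours for a move of `feet` cubic feet."""
    return B0 + B1 * feet


for feet in (500, 800):
    print(f"{feet} cubic feet -> {predict_hours(feet):.2f} hours")
```

The model should only be applied to move sizes comparable to those in the 36 collected moves.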

The company can then compute a 95% prediction interval (PI) for an individual move of X0 cubic feet:

$$\hat{Y}_{X_0} \pm t_{n-2,\,1-\alpha/2}\; S_{Y|X} \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)\,S_X^2}}$$

where the center of the interval is the predicted hours, $\hat{Y}_{X_0} = \bar{Y} + b_1 (X_0 - \bar{X}) = 28.958333 + 0.0500803 \times (800 - 625.555556) \approx 37.69$, the same value obtained by substituting 800 into the model. Substituting the remaining values gives:

$$37.69 \pm 2.030 \times 5.031427 \times \sqrt{1 + \frac{1}{36} + \frac{(800 - 625.555556)^2}{(36 - 1) \times 78726.654}} \approx 37.69 \pm 10.41$$

≈ (27.28, 48.10)

Therefore, if a move of 800 cubic feet takes between about 27 and 48 labor hours, it is within the normal range. If it takes fewer than 27 hours, the move was more efficient than expected. If it takes more than 48 hours, the company should investigate what went wrong with the move, since other factors may have contributed to such an unusual result.
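The interval calculation can also be sketched in Python. This is a minimal illustration under stated assumptions, not part of the JMPIN output: the function name `prediction_interval` and its parameter names are hypothetical, the inputs are copied from Appendix A, and the critical value 2.030 is the tabled t value used in this report rather than one computed in code. The sketch centers the interval on the point prediction b0 + b1·X0, which is algebraically the same as Mean Hours + b1·(X0 − Mean Feet).

```python
import math


def prediction_interval(x0, n, b0, b1, mean_x, var_x, rmse, t_crit):
    """Prediction interval for one new response at x0 in simple linear regression."""
    y_hat = b0 + b1 * x0  # point prediction at x0
    # Standard error of prediction: covers both the uncertainty in the fitted
    # line and the scatter of an individual move around that line.
    half_width = t_crit * rmse * math.sqrt(
        1 + 1 / n + (x0 - mean_x) ** 2 / ((n - 1) * var_x)
    )
    return y_hat - half_width, y_hat + half_width


low, high = prediction_interval(
    x0=800, n=36,
    b0=-2.36966, b1=0.0500803,       # Table 4
    mean_x=625.555556,               # Mean(Feet), Table 5
    var_x=78726.654,                 # S_X^2 = Std Dev(Feet)^2
    rmse=5.031427,                   # S_Y|X, Table 2
    t_crit=2.030,                    # t(34, 0.975) as used in this report
)
print(f"95% PI at 800 cubic feet: ({low:.2f}, {high:.2f})")
```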

Appendix A: Bivariate Fit of Hours By Feet

Figure 2 Mean and Regression Fit of Hours By Feet

Table 1: Fit Mean

Mean / 28.95833
Std Dev [RMSE] / 14.90104
Std Error / 2.483507
SSE / 7771.438

Table 2: Summary of Fit

RSquare / 0.889246
RSquare Adj / 0.885988
Root Mean Square Error / 5.031427
Mean of Response / 28.95833
Observations (or Sum Wgts) / 36

Table 3: Analysis of Variance (ANOVA Table)

Source / DF / Sum of Squares / Mean Square / F Ratio / Prob > F
Model / 1 / 6910.7189 / 6910.72 / 272.9864 / <.0001
Error / 34 / 860.7186 / 25.32 / /
C. Total / 35 / 7771.4375 / / /
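The entries in Tables 2 and 3 are linked by a few identities, which the following Python sketch recomputes from the two sums of squares. It is offered only as a consistency check; the variable names are hypothetical, and the inputs are the Model and Error rows of Table 3.

```python
import math

# Sums of squares and degrees of freedom from Table 3.
ss_model, df_model = 6910.7189, 1
ss_error, df_error = 860.7186, 34

ss_total = ss_model + ss_error        # C. Total sum of squares
ms_model = ss_model / df_model        # mean square for the model
ms_error = ss_error / df_error        # mean square error, S^2_{Y|X}
f_ratio = ms_model / ms_error         # F statistic for H0: slope = 0
r_square = ss_model / ss_total        # proportion of variation explained
r_square_adj = 1 - (1 - r_square) * (df_model + df_error) / df_error
rmse = math.sqrt(ms_error)            # Root Mean Square Error, S_{Y|X}

print(f"F = {f_ratio:.4f}, R^2 = {r_square:.6f}, "
      f"adj R^2 = {r_square_adj:.6f}, RMSE = {rmse:.6f}")
```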

Table 4: Parameter Estimates

Term / Estimate / Std Error / t Ratio / Prob>|t|
Intercept / -2.36966 / 2.073261 / -1.14 / 0.2610
Feet / 0.0500803 / 0.003031 / 16.52 / <.0001

Table 5: Data Summary

N Rows / Sum(Hours) / Sum(Feet) / Mean(Hours) / Mean(Feet) / Std Dev(Hours) / Std Dev(Feet)
36 / 1042.5 / 22520 / 28.9583333 / 625.555556 / 14.9010426 / 280.582704

Appendix B: Residual Plot of Hours

Figure 3 Residual Plot

Figure 4 Distributions Residuals Hours

Appendix C: Hypothesis Testing for Zero Slope: β1 = 0

Testing Procedure:

  1. Assumptions: For each value of Feet, Hours is normally distributed about the regression line with constant variance, and the 36 moves are a random sample.
  2. Hypotheses: H0: β1 = 0 versus HA: β1 ≠ 0
  3. Significance level: α = 0.05 (95% confidence level)
  4. Test statistic: $T = (b_1 - \beta_1) / S_{b_1}$, where

$S_{b_1} = S_{Y|X} / (S_X \sqrt{n - 1})$

$S_{Y|X}^2 = \frac{1}{n - 2} \sum (Y_i - \hat{Y}_i)^2$

$S_X^2 = \frac{1}{n - 1} \sum (X_i - \bar{X})^2$

and the sample size is n = 36.

  5. Rejection region: reject H0 if $|T| \ge t_{n-2,\,1-\alpha/2} = t_{34,\,0.975} \approx 2.030$; do not reject H0 otherwise.
  6. Calculation of T:

From the JMPIN output in Appendix A, we get

$b_1 = 0.0500803$

$S_{Y|X}^2 = 25.32 \Rightarrow S_{Y|X} \approx 5.032$

$S_X^2 = 78726.654 \Rightarrow S_X \approx 280.583$

$S_{b_1} = 5.032 / (280.583 \times \sqrt{36 - 1}) \approx 0.00303$

$T = (0.0500803 - 0) / 0.00303 \approx 16.528$

  7. Conclusion: Since $T \approx 16.528 > t_{34,\,0.975} \approx 2.030$, we reject H0 at significance level 0.05 and conclude that there is evidence that the cubic feet to be moved provides significant information for predicting the labor hours needed; that is, a straight-line model in cubic feet is better than a model that does not include cubic feet at all.
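The calculation of T can be reproduced with a short Python sketch. It is a minimal illustration assuming the summary values from Appendix A; the function name `slope_t_statistic` is hypothetical, and the critical value 2.030 is taken from the t table as in the procedure rather than computed in code.

```python
import math


def slope_t_statistic(b1, s_yx, s_x, n, beta1_null=0.0):
    """Return (T, S_b1) for testing H0: beta1 = beta1_null in simple regression."""
    se_b1 = s_yx / (s_x * math.sqrt(n - 1))   # standard error of the slope
    return (b1 - beta1_null) / se_b1, se_b1


t_stat, se_b1 = slope_t_statistic(
    b1=0.0500803,      # slope estimate (Table 4)
    s_yx=5.031427,     # Root Mean Square Error (Table 2)
    s_x=280.582704,    # Std Dev(Feet) (Table 5)
    n=36,
)
t_crit = 2.030         # t(34, 0.975), from the t table as used above
reject_h0 = abs(t_stat) >= t_crit

print(f"S_b1 = {se_b1:.6f}, T = {t_stat:.2f}, reject H0: {reject_h0}")
```

With unrounded inputs the statistic comes out near 16.52, matching the t Ratio in Table 4; the hand calculation gives 16.528 only because the standard error was rounded to 0.00303 first. The conclusion is unaffected.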
