Data Analysis Project
Predicting Home Depot’s Performance
Is Home Depot’s Revenue Predictable?
Data Analysis Project
Statistics & Data Analysis B01.1305.03
I. Introduction
The goal of this project is to explore the relationship and potential predictive power of a number of variables on the performance of the Home Depot Corporation (HD). Our assumption prior to starting this project is that Home Depot’s financial performance, specifically its revenues, should be correlated to and perhaps predicted by various external indicators. For example, we suspect that HD was subject to fluctuations in housing starts and/or housing sales. The notion was that if more houses were started, or if more houses were sold in a given year, than HD would have superior financial performance than it would in years where these indicators were lower.
II. Description of Data
We have collected data from several sources including:
1. United States Census bureau - www.census.gov
2. Bloomberg Terminal – for financial market data
3. Home Depot (and other companies) web site(s) – www.homedepot.com
We began our assessment by considering a range of indicators and a plan to explore the relationships between each indicator and HD revenues. Through several iterations of this analysis we are hoping to eliminate unnecessary and irrelevant indicators and end up with a simplified regression equation.
We began with the following variables:
1. Housing Starts - # of new houses that began construction during the year
2. Housing Sales - # of new houses for sale during the year
3. Housing Completions – # of new houses that completed construction during the year.
4. Mortgage Rates – 30 year mortgage rates
5. Home Improvement Industry Quarterly Returns
6. S&P 500 Quarterly Returns
7. Home Depot Quarterly Returns
For all of these variables we used quarterly data from 1990 through 3Q 1999. We were limited to this time period because we were unable to obtain quarterly revenue data for Home Depot prior to 1990. We are comfortable with this data set however, and believe that this 10-year period and 39 data points do provide a reliable representation of the relationships we are exploring.
We chose these particular variables because we expected that there should be some relationship to Home depot revenues. As a home improvement retail establishment, Home Depot should feel the effects of fluctuations in the housing market. An increase in housing sales, housing starts, and housing completions should lead to an increase in Home Depots financial performance.
Mortgage rates are also expected to have an impact on Home Depot. However, we are uncertain as to whether the effect of increased mortgage rates should be positive or negative for HD. On one hand, an increase in rates should mean lower starts and sales, and therefore lower revenues for HD. On the other hand, higher rates may mean that since people cannot afford to buy or build homes, they may choose to remodel and/or make improvements to their current homes. This would lead to increased revenues for Home Depot. The answer to this question should be found through our regression analysis.
Additionally, we decide to include quarterly returns for Home Depot, the home improvement industry, and the S&P 500 market index. We chose this data to explore whether there was a relationship between HD financial performance and the performance of the stock market. We expect that there is no relationship, although as we show in our regression there does appear to be a relationship.
III. Descriptive Statistics
We begin by “looking” at the data to see what the data looks like and if there is any information that jumps out as us that may be useful in our analysis. The first noticeable feature of the data is that the median HD Revenue is significantly lower than the mean HD Revenue. This indicates that the data is long right tailed – this will be further explored. Also of interest is the fact that the S&P and Industry Returns have positive means and medians, indicating positive returns on average. The large StDev for these two variables can be explained by the fat-tailed data that is expected for stock returns.
Variable N Mean Median TrMean StDev SE Mean
Starts 39 342.3 340.6 343.5 67.7 10.8
Sales 39 174.74 169.00 174.14 37.11 5.94
Completions 39 330.27 332.30 330.45 48.83 7.82
S&P Q Ret 39 3.57 3.89 3.61 6.71 1.07
IND Q Ret 39 4.86 3.70 4.80 11.36 1.82
HD Revenue 39 3146085 2453760 3014801 2163747 346477
Mortgage Rate 39 8.126 7.926 8.081 1.012 0.162
Variable Minimum Maximum Q1 Q3
Starts 185.4 454.5 294.6 399.3
Sales 100.00 257.00 152.00 204.00
Completions 233.90 429.70 293.40 361.90
S&P Q Ret -14.52 20.87 -0.25 7.28
IND Q Ret -29.74 41.27 -1.75 10.83
HD Revenue 641520 8139000 1298280 4921830
Mortgage Rate 6.763 10.337 7.243 8.710
As mentioned, Home Depot revenues exhibit a long right tail. Revenue is our target variable and therefore we have to consider the expected relationship between the target and the predictors.
In relation to the housing data, there is an additive/multiplicative relationship to HD revenue. A one-unit increase in sales, for example, should cause a percentage increase in revenue. Similarly, mortgage rates exhibit this relationship as well. Although mortgage rates also have a long right tail, we will not log this data since we are concerned with how a one unit increase in mortgage rates effect revenue, not how a percentage increase in mortgage rates effect revenue.
In contrast the stock return data for the S&P and the home improvement industry should have a proportional relationship to HD revenues, suggesting that this data should be logged in the regression analysis. We do not however log this data because return data is in a sense already logged. The return data is by definition proportional data – it relates one years return to the previous year return. Since our goal is to have a proportional relationship to revenue, the return data already provides that relationship.
We now take a look at scatter plots for each of the variables in relation to Logged HD Revenues in order to see if there appears to be any visible patterns or relationships between the variables and revenues. While each individual relationship does not provide information regarding the predictive power of all the variables together, these graphs do suggest which variables may have predictive power.
There are three types of relationships that are depicted in these scatter plots: positive, negative, and non-existent. ‘Starts’ and ‘Completions’ appear to have a positive relationship to the logged HD Revenue however these relationships are weak. ‘Sales’ has a relatively strong positive relationship to logged HD Revenue.
Mortgage rates are negatively related to revenues. This relationship is expected since mortgage rates should have an opposite relationship to the housing variables - higher rates should mean lower starts/sales/completions, and therefore lower revenues.
Finally, the quarterly return data does not appear to have any relationship to logged HD revenue. The graphs have no pattern and each one shows outliers. These outliers represent the fat tails of the return data as discussed above. As we will see through the regression analysis, the return data does provide predictive power for HD revenue, despite the fact that here there appears not to be any relationship. This result demonstrates the power of the regression – when all the predictors are considered together we see relationships that are otherwise not apparent.
Before we get to the regression we take a quick look at Home Depot revenues. The median and mean revenue are given on the boxplot (the median is near the line). The position of the median once again indicates that long right tail evident in this data. The revenues range from $641,520,000, through $8,139,000,000 – quite a large range to work with. One of the goals of our regression will be to reduce this range to a more manageable one. This will be further explored in the “Meaning of the Regression Section” below.
IV. Regression Analysis
We believe that housing starts, sales, and completion, mortgage rates, and Industry, S & P 500, and Home Depot returns will impact the revenues of Home Depot. In order to get a preliminary idea of how or model will look and to what extent the predictors will effect our target, we simply need to run a regression analysis.
Regression Equation #1
Log HD Revenues (MM) = 5.60 -0.000827 Starts + 0.00600 Sales
+ 0.00222 Completions - 0.0911 Mortgage Rates + 0.00114 HD Q Ret
+ 0.0146 S&P Q Ret - 0.00748 IND Q Ret
Predictor Coef StDev T P VIF
Constant 5.5983 0.3040 18.42 0.000
Starts -0.0008268 0.0005329 -1.55 0.131 5.1
Sales 0.0060032 0.0009002 6.67 0.000 4.4
Completi 0.0022216 0.0005280 4.21 0.000 2.6
Mortgage -0.09107 0.02402 -3.79 0.001 2.3
HD Q Ret 0.001139 0.002103 0.54 0.592 4.3
S&P Q Re 0.014559 0.003830 3.80 0.001 2.6
IND Q Re -0.007479 0.003515 -2.13 0.041 6.3
S = 0.09836 R-Sq = 92.9% R-Sq(adj) = 91.3%
Analysis of Variance
Source DF SS MS F P
Regression 7 3.90205 0.55744 57.62 0.000
Residual Error 31 0.29989 0.00967
Total 38 4.20194
Wow! What a wonderful model with which to start. Our preliminary model yields fairly good output. The existing model accounts for over 90% of the variability in our data as evidenced by the R-sq of 92.9%. VIF values all under 10 demonstrate that the effects of any collinearity are negligible. Furthermore, the F-test of 57.62 coupled with a p-value that goes to zero to at least three digits, leads us to believe that the entire model itself is statistically sound.
However a closer analysis of our models shows that that it can be simplified and hence “upgraded”. The variables “HD returns”, and “starts” have a very low t-statistic and a very high p-value, indicating that they are not statistically significant to our model. We found this to be a little puzzling at first, but in reality it makes lots of sense.
S&P 500 returns serve as market indicators, and therefore also act as indicators of economic strength and buying power. However, HD returns, and to a lesser extent industry returns, do not necessarily effect the economy or its buying power (unless everyone owned shares of Home Depots stock), so it delivers no predictive value to HD’s revenues.
Housing starts refer only to the beginning construction of new homes by builders who may not use a retailer like HD until its time to put the “finishing touches” on a started home. The majority of HD users (85%) are homeowners, and “do-it-yourselfers” who have no use for HD until they have actually closed on the sale of their dream home.
We ran a second regression with out the statistically insignificant predictors “HD Returns” and “starts” in an attempt to get a better, simpler model.
Regression Analysis #2
Log HD Revenues (MM) = 5.70 + 0.00495 Sales + 0.00173 Completions
- 0.0953 Mortgage Rates + 0.0147 S&P Q Ret - 0.00591 IND Q Ret
Predictor Coef StDev T P VIF
Constant 5.6989 0.2876 19.82 0.000
Sales 0.0049516 0.0006409 7.73 0.000 2.2
Completi 0.0017301 0.0003751 4.61 0.000 1.3
Mortgage -0.09533 0.02271 -4.20 0.000 2.0
S&P Q Re 0.014677 0.003808 3.85 0.001 2.5
IND Q Re -0.005907 0.002206 -2.68 0.011 2.4
S = 0.09963 R-Sq = 92.2% R-Sq(adj) = 91.0%
Analysis of Variance
Source DF SS MS F P
Regression 5 3.87440 0.77488 78.07 0.000
Residual Error 33 0.32753 0.00993
Total 38 4.20194
Upon initial review, our new cleaner regression looks great. The new model accounts for 92.2% (R-sq.) of the variability in our target value. The F-test is higher, now at 78.07, and the p-value is still zero to at least three digits, indicating statistical significance of our entire model. The VIF values all fall below 3! Hence there is no collinearity in our model that needs to be corrected. Finally, the high t-statistics and resulting low p-values for each predictor indicate that each value is statistically significant.
With an acceptable regression model in hand, we intensified our analysis to see if it in fact held up to the assumptions necessary for it to be valid. Analysis of the Residuals vs. Fitted Values for Regression #2 plot supports our assumption of homoscedasticty. The residuals plot appears as a cloud, with no apparent pattern of narrowing or widening, or any sub-groups. Furthermore the residuals are all within 2 standard deviations from the mean and are dispersed fairly evenly above and below zero in the standardized residual plot, indicating that the residuals average to zero.
It should be noted that there is one point of concern on the residual plot. It represents the data for the 2nd quarter of 1999, which may serve as an outlier and possible leverage point. We will return to that point later in our discussion.
Both the probability plot and the histogram with normal curve support our assumption of normal residual values.
Finally, the assumption that the residuals were not related to each other were supported in the graph of Residuals vs. Order. Although our data was arranged with respect to time, it displayed no cyclical or seasonal effects. Knowing the revenues generated in the 1st quarter of 1991 tells you nothing about the revenues in any quarter of any year.
Out of curiosity, we decided to test to see if the outlier and leverage point had a significant effect on or model. We ran a third regression without the data for the 2nd quarter of 1999 to see if it would produce statistically different results.
Regression Analysis #3 (outlier removed)
Log HD Revenues (MM) = 5.53 + 0.00539 Sales + 0.00181 Completions
- 0.0866 Mortgage Rates + 0.0162 S&P Q Ret - 0.00674 IND Q Ret
Predictor Coef StDev T P VIF
Constant 5.5284 0.2903 19.05 0.000
Sales 0.0053942 0.0006574 8.21 0.000 2.1
Completi 0.0018098 0.0003630 4.99 0.000 1.2
Mortgage -0.08663 0.02229 -3.89 0.000 2.1
S&P Q Re 0.016235 0.003748 4.33 0.000 2.6
IND Q Re -0.006738 0.002164 -3.11 0.004 2.5
S = 0.09576 R-Sq = 92.5% R-Sq(adj) = 91.3%
Analysis of Variance
Source DF SS MS F P
Regression 5 3.62463 0.72493 79.05 0.000
Residual Error 32 0.29346 0.00917
Total 37 3.91809
Our model with the outlier removed looks somewhat better. The R-sq has increased to 92.5% from 92.2%. The F-test has increased to 79.05, supporting the validity of our model. All predictors have high t-statistics and p-values of zero out to three decimal places with the exception of the predictor, “Industry Quarterly Returns”. And in addition, the VIF values are all still less than 3, indicating no need to correct for collinearity. Most importantly, this is more useful because our standard error has decreased, giving us better predictability.