
Draft Eurostat Progress Report, June 2003

(prepared June 1, 2003)

Database Issues in Estimating Hedonic Computer Price Indexes:

Comparisons of Hedonic Model Specifications

Jack E. Triplett

Brookings Institution

An essential, but often neglected, aspect of database quality consists of the variables included in the database. Whatever the hedonic method chosen (the four major ones are described in Triplett, forthcoming), one needs an adequate and full set of the characteristics of the product in order to estimate an adequate hedonic price index. Otherwise, missing variables may produce errors in the hedonic coefficients (from the standard econometric considerations) and errors in the hedonic index even if the coefficients are estimated without bias.

Existing research on hedonic functions for computers suggests that missing variables are endemic, in the sense that many studies omit variables that have been found significant in other work. This is demonstrated in Table 1, which compares regression specifications for a large number of hedonic studies on computers.

The standard for completeness is the U.S. Bureau of Labor Statistics hedonic function, in the sense that it includes more variables than any hedonic function estimated by other researchers. Other computer hedonic functions omit variables that have been found to be important and statistically significant in BLS research. This report is a first step toward evaluating the effect on the price index of hedonic specifications that contain fewer variables than the state of the art commands.

BLS Database and Alternatives. The list of variables in the BLS hedonic function changes to an extent from quarter to quarter, partly because of changes in the computer market and partly because the BLS periodically sharpens its specification. Nevertheless, a typical BLS hedonic function will contain some 20 to 25 independent variables. Table 2 displays the BLS hedonic function for October, 2000: The function has four continuous variables (clock speed, memory size, hard drive capacity and video card capacity) and 20 dummy variables. Even this most complete specification may omit variables that are important to the performance of the computer. For example, Chwelos (2003), who has the next most complete hedonic specification, includes the size of the cache memory in his list of performance variables, and it is well established that the cache is an important determinant of computer performance.

When variables are omitted from a regression and they are correlated with the included ones, it is well known that the coefficients of the included variables will pick up part of the effect of the omitted variables on the price of the computer. The size of the impact depends on the correlation between the omitted variables and the included ones.
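The mechanism can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the BLS data: the variable names (speed, dvd), the coefficient values, and the strength of the correlation are all assumptions chosen for illustration. It shows the standard result that the coefficient in the reduced regression equals the coefficient in the full regression plus the omitted variable's coefficient times the slope from regressing the omitted variable on the included one.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = X @ b (X must already contain a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 500
speed = rng.normal(size=n)                  # included characteristic (hypothetical)
dvd = 0.3 * speed + rng.normal(size=n)      # omitted characteristic, correlated with speed
price = 1000 + 100 * speed + 50 * dvd + rng.normal(scale=20, size=n)

const = np.ones(n)
full = ols(np.column_stack([const, speed, dvd]), price)   # full specification
reduced = ols(np.column_stack([const, speed]), price)     # dvd omitted
aux = ols(np.column_stack([const, speed]), dvd)           # omitted regressed on included

# In-sample identity: reduced coefficient = full coefficient
# + (full coefficient on the omitted variable) * (auxiliary slope)
implied = full[1] + full[2] * aux[1]
print(f"full: {full[1]:.2f}  reduced: {reduced[1]:.2f}  implied: {implied:.2f}")
```

If the correlation between the two synthetic characteristics is set to zero, the auxiliary slope is close to zero and the reduced coefficient stays close to the full one, which is the low-correlation case discussed in the text.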

Omitted variable bias in hedonic coefficients is potentially a very important problem if estimated coefficients are to be used to make quality adjustments to prices when forced replacements occur in price index samples. If the coefficients are biased by omitted variables, one cannot use them with confidence in the hedonic quality adjustment method. Even if the coefficients are unbiased (which will be true when correlations between included and omitted variables are low), the hedonic price index may still be biased by omission of variables.

Alternative data sets that are available for computing hedonic functions in Europe clearly differ in the number of variables that they contain, and they differ from U.S. datasets. For example, the specification used by Moch and Triplett (2002) has no information on the monitor that is included in the transaction. The BLS specification finds, reasonably enough, that the size and resolution of the monitor matter a great deal to the price of the computer (see the four monitor coefficients in Table 2; generally, the “premium” monitor has better resolution). Similarly, the widely-cited work of Berndt and Rappaport (2001) contains nowhere near the number of variables in the BLS hedonic function. At the extreme, perhaps, the new German hedonic index for computers depends on a hedonic function that has essentially one variable (speed) plus a small number of dummy variables for brand. It is extraordinarily important to understand how the specification of variables in the hedonic function, and thus their availability in different databases, affects the accuracy of hedonic price indexes.

Testing A Reduced Specification on the BLS Dataset. To illustrate the effects of omitted variables on the coefficients of included variables, I used the BLS dataset (made available by BLS with special arrangements to assure confidentiality). I deleted variables from the BLS PC hedonic function to approximate the hedonic function specification used by Berndt and Rappaport (2001). The published information on variables in Berndt and Rappaport (2001) is not the latest version of their research, so I supplemented their published information with the additional unpublished information in Berndt and Rappaport (2002), and included the Celeron variable from the latter. This yields five characteristics variables. Additionally, Berndt and Rappaport report that company dummies were significant; BLS also reports significant company variables, but the two sets of company dummies are not necessarily the same (I do not have the names of these companies; they are confidential). I retained the BLS company dummies for purposes of this comparison.

The first column of Table 2 presents the full PC hedonic function used by BLS in October, 2000 (the BLS approach to hedonic indexes is described in Holdway, 2000). As noted above, it contains four continuous variables and 20 dummy variables. The second column presents the hedonic function variable specification used by Berndt and Rappaport (2001, 2002), except that I estimate the Berndt and Rappaport specification using the BLS data and the BLS linear functional form. The hedonic function in the second column has variables that are a subset of the hedonic function in the first column, but the data for the estimates in both columns are the same. Except where indicated, all the coefficients in this and the following tables are significant at standard levels, so standard errors and t-values are not reported, to simplify the tables.

This exercise does not assess the hedonic function of Berndt and Rappaport (2001, 2002), for their full specification includes a different functional form (semi-log) from the linear one used by BLS, and of course their data are different from the BLS data. It assesses, rather, what the BLS hedonic function—given the BLS linear specification and the BLS data—would have lost in accuracy, had the BLS employed in its hedonic function only the variables of Berndt and Rappaport (hereafter, B&R).

[In consequence, the right-hand column in Table 2 does not reproduce the regression results of B&R. We retained the BLS linear functional form for purposes of this comparison because, after carrying out Box-Cox tests, we could not reject superiority of the linear functional form over the semi-log and double-log forms, with BLS data. Indeed, the three were nearly equally good as representations of the relation in the data. Note that in some months BLS uses dummy variables for different microprocessor types and speeds, e.g., the regression for May, 2001, has one dummy variable for each of five different speeds for the P3-P4 chips. We treated speed, in MHz, as a continuous variable in these calculations, largely for simplicity and for comparability with B&R. None of the regression diagnostics or values of other coefficients were notably affected by this decision.]

As the columns of Table 2 indicate, the reduced set of variables results in substantial changes in coefficients of the included variables. Two examples indicate the analyses that can be performed to show the impacts of omitted variables on the accuracy of coefficients.

First, consider the DVD player variable, which is included in the full BLS specification, but is a missing variable in the reduced B&R specification. BLS estimates that the DVD player costs about $52.50. On the usual econometric analysis, omitting the DVD variable will affect the coefficients of included variables, in proportion to the correlation of the (excluded) DVD variable with the included variables. Considering, then, the table of simple correlation coefficients calculated for the BLS data for the same month (Table 3), the largest correlations in the table suggest the following impacts of omitting the DVD variable: Omission should raise the coefficient for MHz (R = .30), and for “company B” (R value suppressed), while lowering the coefficient for Celeron (R = -.29).

All the coefficients in the second column of Table 2 differ from those in the first column in the expected direction. For example, the estimated Celeron discount rises from $74 in the full (BLS) specification to $195 in the reduced (B&R) specification (compared with the base in this regression, which is the Pentium III). This change in the Celeron coefficient is caused by all the omitted variables, of course, not by the DVD variable alone. Omission of the DVD variable will also affect other coefficients, but by smaller amounts. Ignoring the complications and putting it most simply, omitted variables (including the DVD variable, but not only DVD) cause a substantial overstatement of more than $100 in the negative premium for the Celeron.

As a second example, consider the coefficient for the CDR/W variable. This variable is contained in both BLS and B&R specifications, but its coefficient is considerably larger in the reduced specification ($395), compared with the full specification ($213). If the cause of this close to $200 overstatement in the reduced specification is omitted variable bias, what are the omitted variables that contribute the most to this change?

Again, we can inspect Table 3 for clues. The CDR/W variable is most strongly correlated with the omitted variables video capacity (R = .25), with “premium” video (R = .28), with 19-inch premium monitor (R = .33) and with premium speakers (R = .48). The latter three dummy variables have a combined estimated value of $472 in the complete (BLS) specification (see the coefficients in Table 2). Thus, the $182 increase in the estimated price for a CDR/W unit in the reduced (B&R) specification has in it some part of the $472 estimated value for the latter three variables (with the rest of the value distributed among other included variables). There is also an impact from the continuous variable video capacity, which could be evaluated (at mean values of all the variables, for example), but was not for this exercise. Additionally, the CDR/W variable picks up value from other excluded variables with smaller correlations in the full specification. As in the first example, a full accounting must consider all the variables and interactions; the examples here are only illustrative.
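The kind of inspection described here, scanning a table of simple correlations for the omitted variables most strongly related to an included one, can be sketched as follows. The data are synthetic and the column names are hypothetical stand-ins; the function is an illustration of the screening step, not the procedure used to build Table 3.

```python
import numpy as np

def rank_omitted_by_correlation(data, names, included, omitted_names):
    """Return (omitted name, R) pairs sorted by |R| with the `included` column."""
    corr = np.corrcoef(data, rowvar=False)   # columns are variables
    i = names.index(included)
    pairs = [(name, float(corr[i, names.index(name)])) for name in omitted_names]
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)

rng = np.random.default_rng(1)
n = 400
cdrw = rng.integers(0, 2, size=n).astype(float)                  # included dummy
speakers = (1.5 * cdrw + rng.normal(size=n) > 0.5).astype(float)  # omitted, tends to be bundled with cdrw
warranty = rng.integers(0, 2, size=n).astype(float)              # omitted, unrelated to cdrw

names = ["cdrw", "premium_speakers", "warranty"]
data = np.column_stack([cdrw, speakers, warranty])
ranking = rank_omitted_by_correlation(
    data, names, "cdrw", ["premium_speakers", "warranty"]
)
for name, r in ranking:
    print(f"{name}: R = {r:.2f}")
```

Omitted variables at the top of such a ranking are the first candidates for explaining a shift in the included variable's coefficient, although, as noted above, a full accounting also depends on the omitted variables' own coefficients and on the correlations among all the variables.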

Sometimes investigators assess the validity of their estimated hedonic functions by the regression R²: if the R² is high, their contention is, omitted variables cannot be a very great problem. For any use of the hedonic function where the coefficient values matter, relying on the R² alone is problematic. The difference in R² is not large between the full and reduced specification in Table 2 (.97 in the full, .89 in the reduced). Yet, the impact of omitted variables on the regression is substantial, as shown in the analysis above. Moreover, the values of the coefficients in the full (BLS) specification appear more reasonable than those from the reduced (B&R) specification.

One expects, therefore, that hedonic quality adjustments that might be made with the reduced specification will differ from those made with the full BLS specification. First, for the variables included in both specifications, the adjustments will differ because the coefficients estimate different prices for them. Second, adjustments will be made for different variables—some variables that are statistically significant in the full specification will receive no quality adjustments if the reduced specification is employed.
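A worked example may make the first point concrete. The sketch below applies a linear hedonic quality adjustment to a forced replacement, using the two CDR/W coefficients quoted above ($213 in the full specification, $395 in the reduced one). The two computer prices and the function name are hypothetical, and the calculation is a simplified illustration rather than the actual BLS procedure.

```python
def quality_adjusted_change(old_price, new_price, coefs, old_chars, new_chars):
    """Price change net of the hedonic value of characteristic changes (linear form)."""
    quality_value = sum(
        coefs[name] * (new_chars.get(name, 0) - old_chars.get(name, 0))
        for name in coefs
    )
    return (new_price - old_price) - quality_value

# Hypothetical replacement: the old model ($1,500) had no CDR/W drive,
# the replacement ($1,800) has one; nothing else changes.
old_chars = {"cdrw": 0}
new_chars = {"cdrw": 1}

full = quality_adjusted_change(1500.0, 1800.0, {"cdrw": 213.0}, old_chars, new_chars)
reduced = quality_adjusted_change(1500.0, 1800.0, {"cdrw": 395.0}, old_chars, new_chars)
print(full, reduced)  # 87.0 -95.0
```

In this hypothetical case the full specification records a quality-adjusted price increase of $87, while the reduced specification records a decrease of $95, so the two specifications disagree even on the direction of the price change.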

Note as well that the standard error of regression reported in Table 2 is twice as high in the reduced specification as in the full specification. It is consistently at least twice as high for all the months we examined (see below). If one were to estimate a hedonic price index with the hedonic imputation method (described in Chapter III of Triplett, forthcoming), imputations would have twice the error when performed on the reduced specification. I do not consider these imputations further in this report, but mention this point because use of hedonic functions for imputing missing prices, rather than for making a hedonic quality adjustment, has been discussed recently, sometimes under the impression that imputations would avoid imprecision in hedonic quality adjustments that arises from the regression specification. The reduced specification clearly implies imprecision in hedonic imputations.
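The combination of a modest drop in R² and a doubling of the standard error of regression is easy to reproduce on synthetic data. The following sketch uses made-up variables and coefficient values (none are taken from Table 2) and compares the two summary statistics for a full and a reduced fit on the same observations.

```python
import numpy as np

def fit_stats(X, y):
    """Return (R-squared, standard error of regression) for OLS of y on X (X includes a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ssr = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    dof = X.shape[0] - X.shape[1]            # observations minus estimated coefficients
    return 1.0 - ssr / sst, (ssr / dof) ** 0.5

rng = np.random.default_rng(2)
n = 600
speed = rng.normal(size=n)      # included in both fits (hypothetical)
monitor = rng.normal(size=n)    # dropped from the reduced fit (hypothetical)
price = 1500 + 400 * speed + 100 * monitor + rng.normal(scale=50, size=n)

const = np.ones(n)
r2_full, ser_full = fit_stats(np.column_stack([const, speed, monitor]), price)
r2_red, ser_red = fit_stats(np.column_stack([const, speed]), price)
print(f"full:    R2 = {r2_full:.3f}, SER = {ser_full:.1f}")
print(f"reduced: R2 = {r2_red:.3f}, SER = {ser_red:.1f}")
```

Because one characteristic dominates the price variation in this construction, the reduced fit keeps a high R² even though its residual error, and hence the error of any price imputed from it, is roughly twice as large.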

A Test for A Different Month. In the BLS data, correlations among the variables within the sample are not always the same for every month. In October 2000, correlations among omitted variables (in the reduced specification) and included ones ranged up to about R = 0.50 for the highest value. December, 2001 provides a case where the between-variable correlations are low. Table 4 presents a comparison that is parallel to the one in Table 2.

In December, 2001 a number of companies were offering rebates. The BLS data distinguish when a price includes a rebate. I include the rebate variable in the reduced specification in Table 4 to make the numbers more comparable, but show results without it as well.

In Table 4, coefficients in the reduced specification are closer to the full specification than they were in the results for Table 2. In particular, the three coefficients for the CDR/W variable are quite close in Table 4, ranging from $116 (BLS specification) to $123 (B&R specification, with rebate). Examining the R’s between the CDR/W variable and omitted variables from the reduced specification suggests why (Table 5): None of the relevant R values exceeds 0.15. The largest are for the omitted variables regular 19-inch monitor (R = 0.14) and onsite 3-year warranty (R = -0.13). Thus, omitting variables produces little impact on the estimated price for CDR/W in Table 4 because, consistent with econometric models, correlations between the included CDR/W variable and the omitted variables are low. Larger impacts were observed in Table 2 because the correlations with omitted variables were higher.

Similarly, omitting the DVD variable in December, 2001 has a relatively small impact on the coefficients of included variables in the reduced specification. No correlation of DVD with an included variable is as high as 0.15; the largest correlation is with HD (R = 0.12).

However, the estimated price of speed is considerably higher in the reduced specification, by more than 40% (.950, compared with .666). This increase likely comes from the contribution of omitted variables. Examining the table of correlations (Table 5) suggests the following omitted variable sources: the video card (R = 0.32), monitor size (19-inch monitor has R = 0.20), the omission of software information (for Office XP, R = 0.26), and whether the machine is a business computer (R = 0.27).

It is also interesting that the estimated rebate is substantially smaller ($130) in the reduced specification than in the full specification ($319).

We also did these comparisons for other months for which we have the BLS data, including May, 2000 and April, 2001. These results are included, as Tables 6 and 7, without additional analyses, because the comparisons give results that are parallel to those already discussed. In particular, the standard error of the regression is in every case at least twice as large in the reduced specification.

The Effects on the Price Index. In the end, what is important is the price index, not the coefficients. BLS uses the hedonic quality adjustment method to construct computer price indexes. We cannot replicate the PPI because we do not have access to confidential data used for the PPI (we have only the BLS data that are used to estimate hedonic functions, and therefore for estimating the hedonic quality adjustments, not the PPI price data that are adjusted). Thus, we cannot evaluate price indexes estimated with the hedonic quality adjustment method that use the reduced specification.

For this reason, we evaluated the reduced specification with respect to the dummy variable method for estimating hedonic price indexes. For the dummy variable method, we need to estimate a single, pooled hedonic function on data from two adjacent periods. Generally, this is also difficult because variables in the BLS hedonic function change frequently, and the pooled method requires that the same list of variables be present in both periods that are pooled. By combining a few variables, documented in the notes to Tables 8 and 9, we could estimate time dummy coefficients covering the interval May to October, 2000.
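The mechanics of the dummy variable method can be sketched as follows: observations from two adjacent periods are pooled, the same characteristics enter for both periods, and a dummy marks the second period. The sketch assumes a semi-log functional form, so that the exponentiated time dummy coefficient is the price index for the second period relative to the first; that functional form, the function name, and the synthetic data are assumptions made for illustration. With a linear form, the time dummy coefficient would instead be a change in currency units.

```python
import numpy as np

def time_dummy_index(chars_0, logp_0, chars_1, logp_1):
    """Pooled adjacent-period regression; returns the period-1 index (period 0 = 1.0)."""
    X = np.vstack([chars_0, chars_1])                  # same characteristics list in both periods
    n0, n1 = len(logp_0), len(logp_1)
    dummy = np.concatenate([np.zeros(n0), np.ones(n1)])  # 1 for period-1 observations
    design = np.column_stack([np.ones(n0 + n1), X, dummy])
    y = np.concatenate([logp_0, logp_1])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.exp(coef[-1]))                     # exp(time dummy coefficient)

# Synthetic check: characteristics improve between periods while the
# quality-constant price falls by 10 percent (noise-free for clarity).
rng = np.random.default_rng(3)
chars_0 = rng.uniform(0.5, 1.0, size=(50, 2))
chars_1 = rng.uniform(0.7, 1.5, size=(60, 2))
beta = np.array([0.8, 0.3])
logp_0 = 6.0 + chars_0 @ beta
logp_1 = 6.0 + chars_1 @ beta + np.log(0.9)

index = time_dummy_index(chars_0, logp_0, chars_1, logp_1)
print(f"index = {index:.4f}")
```

The requirement noted in the text shows up directly in the code: both periods must supply the same columns of characteristics, which is why variables that appear in only one period have to be combined or dropped before pooling.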