Sample Selection for Hedonic Regression Models

When Unmeasured Quality Differences are Present[1]

Li Feng

Florida State University

Stefan C. Norrbin*

Florida State University

David W. Rasmussen

Florida State University

and

Jeffrey S. Ueland

Ohio University

Abstract:

Hedonic regressions on housing data reveal instabilities that are likely to come from sub-groups of houses in the full sample. Past research has dealt with such sub-groups by imposing ad hoc constraints intended to generate a homogeneous data sample. In this paper we propose a statistical methodology, namely the Mahalanobis distance, to select houses that statistically belong to the same sub-group of houses. The results show that the method effectively generates coefficient estimates that are more consistent with a priori expectations.

Keywords: Mahalanobis; sample selection; Hedonic regression; unobserved quality
I. Introduction

Hedonic models of housing markets are used to predict prices and to assess the impact of public policies and neighborhood characteristics on dwelling value. These models have been used to evaluate the capitalization into house prices of spatial variations in public services (such as education, transit availability, and public safety), environmental hazards (such as noise and brownfields) and neighborhood attributes (such as race and socio-economic status). This paper provides evidence that standard hedonic models may not yield accurate estimates of the impact of the spatial attributes that affect house value. It argues that better estimates of these effects might be generated by estimating hedonic regressions on observations selected by a generalized distance function that yields more homogeneous samples of sold dwellings.

When hedonic models are used to estimate the impact of spatial characteristics on house prices, the issue of sub-markets has received considerable attention. Dwellings in a sub-market of a metropolitan area are defined by Bourassa, et al. (1999) to be reasonably close substitutes for one another and imperfect substitutes for units in other sub-markets. Goodman and Thibodeau (2003) point out that market segmentation is typically an ad hoc enterprise; markets can be stratified by race, income, structure type, and neighborhood characteristics. But in practice, sub-markets often are spatially defined in terms of census tracts, zip codes, neighborhood quality (nested models), and aggregations of block group data constructed using geographic information systems. Goodman and Thibodeau conclude that “smaller is better” (p. 20), which is no doubt true if smaller implies a more homogeneous neighborhood.

Straszheim’s (1975, p. 28) observation that “variation in housing characteristics and prices by location is a fundamental characteristic of the urban housing market” suggests that sub-markets need not be contiguous. Holding socio-economic status and neighborhood characteristics constant in two non-adjacent census tract “sub-markets,” a high quality home in one census tract may be a close substitute for high quality homes in the other while lower priced homes in the same tract are highly imperfect substitutes. In fact, smaller may be better for defining market segmentation because it may group dwellings of similar vintage and construction, thereby controlling for some unmeasured quality differences. With our apologies to Gertrude Stein who asked if a rose is a rose, it is obvious that all square feet are not identical. Relatively high priced homes will have many unmeasured aspects of quality such as finer decorations and mouldings, bathroom and kitchen fixtures, floor coverings and landscaping so that attribute coefficients in a hedonic regression in this market might be quite different from a large sub-market of homes that has more homogeneous characteristics. Dwellings at the low end of the market may also be characterized by unmeasured quality differentials.

This paper examines different methods of selecting a sample of houses that is less likely to be affected by unmeasured quality differentials. We discuss various ad hoc methods used in prior research to select a relatively homogeneous sample of dwellings. The ad hoc nature of the restrictions usually used implies that each investigator may come to a different conclusion about the value of public policy and neighborhood attributes depending on how the sample is selected. Therefore it may be advantageous to use a statistical technique that can select a homogeneous group of “normal” houses. We find that using a generalized distance function to select a sample that minimizes unmeasured quality variations leads to a hedonic regression with plausible and stable parameters. This process has two advantages. First, the technique is statistically based, so the researchers’ priors are not a factor in filtering the data. Second, the technique allows for covariation among variables. The usual ad hoc methods focus on setting constraints for a few specific variables. In contrast, the Mahalanobis distance function selects the data according to both the variation and the covariation of the explanatory variables in the system. Data from Duval County, FL (the Jacksonville MSA) are used in our statistical analysis as an example of a market where this statistical selection process leads to results that are generally more consistent with a priori expectations in terms of sign and magnitude.
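To make the idea of selecting on both variation and covariation concrete, the following is a minimal sketch of a Mahalanobis-distance filter. It is an illustration under stated assumptions, not the procedure used in this paper: the function names, the simulated data, and the `keep_share` cutoff are all hypothetical.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)          # sample covariance (variation and covariation)
    cov_inv = np.linalg.pinv(cov)          # pseudo-inverse guards against a singular covariance
    # Row-wise quadratic form: centered_i' * cov_inv * centered_i
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

def select_homogeneous(X, keep_share=0.8):
    """Boolean mask keeping the share of rows closest to the mean.

    keep_share is a hypothetical tuning choice, not a value from the paper.
    """
    d2 = mahalanobis_sq(X)
    cutoff = np.quantile(d2, keep_share)
    return d2 <= cutoff

rng = np.random.default_rng(0)
# Hypothetical data: two positively correlated housing characteristics.
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=500)
mask = select_homogeneous(X, keep_share=0.8)
X_selected = X[mask]
```

Because the distance uses the inverse covariance matrix, a dwelling whose characteristics are individually unremarkable but jointly unusual (for example, large on one characteristic and small on a positively correlated one) can still be flagged, which a variable-by-variable constraint would miss.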

The following section presents a hedonic regression for the Duval County housing market. The results indicate that several parameters have suspect coefficients and some have signs that are inconsistent with theoretical priors. The third section presents a brief survey of the ad hoc methods that have been used to determine which observations are used in the analysis. We examine hedonic regressions using ad hoc methods that might adequately purge observations with unmeasured quality attributes. These results suggest that the high priced sub-market is composed of less homogeneous attributes and that the valuation of non-housing attributes is not constant across these sub-markets. In Section 4 we explore the efficacy of using a generalized distance function developed by Mahalanobis (1936) to identify samples of relatively homogeneous homes in Duval County that provide hedonic estimates that differ in important ways from models using the full sample. Conclusions are presented in Section 5.

II. Data and an Empirical Assessment of the Value of Housing Characteristics

Three types of data on individual units have been employed in housing market analysis. The American Housing Survey was an obvious first choice based on availability, but its utility is undermined by the fact that house value is reported by the occupant, that the number of square feet in the dwelling is unavailable, and by the lack of objective neighborhood information.[2] Multiple Listing Service (MLS) data are an obvious second choice because they should only contain arm’s length transactions; it is irrational to pay a broker seven percent of the price to “search” for a predetermined buyer. This advantage is offset by the fact that these proprietary data are not publicly available and that they may account for a relatively small percentage of housing market transactions.[3] By virtue of their availability and comprehensiveness, property tax records have become the data of choice for housing market studies.

Determining whether a sale recorded in tax assessor data is arm’s length is addressed in many of the studies we reviewed.[4] Only one (Clapp, et al. 1991) chose to accept the official tax assessor’s definition of an arm’s length transaction. Without disparaging the care with which Connecticut’s fine public servants make this determination, we note that careful inspection of sales data in Florida makes it clear that the official determination that a sale is “qualified” for property appraisal purposes is not very reliable.[5] Official determination of whether a sale is arm’s length is apparently viewed with skepticism by the housing research community, since most studies that explicitly deal with this issue choose additional ad hoc methods of determining whether sales represent market transactions or are improbable extreme values. In addition, many studies eliminate observations at the extremes of the price distribution, perhaps because these might be construed to embody unmeasured quality characteristics.

Here we conduct empirical experiments using tax roll data for detached single-family dwelling sales during 1995 in Jacksonville, FL (Duval County). We identified 7,645 single-family detached dwellings that were “qualified” as arm’s length market transactions according to the County Assessor, and were likely to be market transactions.[6] A semi-log regression methodology is selected following the prior literature.[7] The empirical specification is given by:

$$\mathrm{LPrice}_h = \alpha + \sum_{i=1}^{I} \beta_i X_{ih} + \sum_{j=1}^{J} \gamma_j Y_{jh} + \sum_{k=1}^{K} \delta_k Z_{kh} + \varepsilon_h,$$

where the dependent variable, $\mathrm{LPrice}_h$, is the log of the sales price for the $h$th house, $\beta_i$ is the regression coefficient for the $i$th dwelling characteristic $X_{ih}$, $\gamma_j$ is the regression coefficient for the $j$th neighborhood characteristic $Y_{jh}$, $\delta_k$ is the regression coefficient for the $k$th location characteristic $Z_{kh}$, and $\varepsilon_h$ is an error term that may be spatially correlated. A complete list of these variables and summary statistics is presented in Appendix A.
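As an illustration of how a semi-log specification of this form can be estimated by ordinary least squares, the sketch below fits the log of price on one stand-in regressor from each group (dwelling, neighborhood, location). The data are simulated and every number, including the "true" coefficients, is a hypothetical value chosen for the example rather than an estimate from the Duval County sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical regressors: one dwelling, one neighborhood, one location characteristic.
sqft = rng.uniform(800, 3500, n)       # square feet
income = rng.uniform(20, 120, n)       # neighborhood average income, thousands of dollars
dist_cbd = rng.uniform(0, 25, n)       # miles to the central business district

# Hypothetical coefficients used only to simulate log prices.
true_coef = np.array([10.0, 0.0004, 0.003, -0.01])
X = np.column_stack([np.ones(n), sqft, income, dist_cbd])
log_price = X @ true_coef + rng.normal(0.0, 0.1, n)

# OLS estimate of the semi-log model (log price on levels of the regressors).
coef, *_ = np.linalg.lstsq(X, log_price, rcond=None)
```

In a semi-log model each slope is read as the approximate proportional change in price for a one-unit change in the regressor, so a coefficient of 0.0004 on square feet corresponds to roughly a 0.04 percent higher price per additional square foot.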

The dwelling characteristics are variables associated with the house construction or the lot that the house is built upon. The number of square feet (SQFT) measures the size of the house. A larger square foot size would increase the cost of building the house and would thus be expected to increase the selling price. In addition some factors pertaining to the construction are expected to increase the price, namely: number of bathrooms (BATHROOMS), number of bedrooms (BEDROOMS), central air (CENAIR) and heat (CENHEAT), and the existence of a fireplace (FIREPLACE). The log of the age of the house (LAGE) measures the depreciation of the dwelling and would thus be expected to lower the price. The number of acres, LOTSIZE, is expected to increase the price of a house with a larger lot. Additional features such as existence of a pool (POOL) and the presence of a garage (PARKING) should increase the price of a house.

Unique neighborhood characteristics for each house are defined using a geographic information system. Using block group data from the 2000 decennial census and employing the ARCGIS system, information on the economic and demographic characteristics of the area immediately surrounding each observation was collected. To gather neighborhood data that are unique to each observation in the sample, the latitude and longitude of each house in the sample are found using a geographic coding program. A radial distance of one-half mile is swept around each observation to generate the neighborhood characteristics that are unique to each dwelling.[8] Because the information is only available at the block group level, these data are retrieved via "proportional grabs." Under this approach the neighborhood includes all census block groups that are entirely within the circle as well as those that are partially included. In effect, it is assumed that household characteristics are distributed evenly throughout the block groups, so characteristics of block groups that are partially in the circle are also included in the estimation of neighborhood characteristics.[9]
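One way to read the proportional grab is as an area-weighted average: each block group contributes in proportion to the share of its area that falls inside the half-mile circle. The sketch below assumes that reading; the function name, the tuple layout, and the block group figures are hypothetical, and the GIS step that computes the overlap shares is taken as given.

```python
def proportional_grab(block_groups):
    """Area-weighted neighborhood share for one dwelling's half-mile circle.

    Each item is (share_of_area_in_circle, count_with_attribute, total_count).
    Assumes households are spread evenly within each block group.
    """
    numerator = sum(share * count for share, count, _ in block_groups)
    denominator = sum(share * total for share, _, total in block_groups)
    return numerator / denominator if denominator else float("nan")

# Hypothetical block groups: owner-occupied households and all households.
groups = [
    (1.00, 300, 400),   # entirely inside the circle
    (0.25, 100, 400),   # one quarter of its area inside the circle
    (0.50, 50, 200),    # half of its area inside the circle
]
owner_share = proportional_grab(groups)   # 350 / 600, about 0.583
```

Weighting both the numerator and the denominator by the overlap share keeps the result a proper proportion even when most of the contributing block groups are only partly inside the circle.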

Seven variables are created in this way: population density (POP DENSITY); percentage of households that are homeowners (OWNER%); average household income (AVERAGE INCOME); percent black (BLACK%); percent Hispanic (HISPANIC%); percent of heads of household over age 50 (OVER50%) and percent white collar workers (WHITE COLLAR%). Population density is assumed to be an inferior good, while owner-occupied single-family dwellings are presumably better maintained and multi-family units are presumed to generate negative externalities. Lower mobility among older households suggests a lower supply of houses for sale in neighborhoods with a relatively large population over age 50. Minority population, social class, and income are pertinent determinants of house value for three reasons. First, because all dwellings are located within a single Duval County school district, differences in school quality are likely to be the product of the socio-economic characteristics of the school’s catchment area. Assuming that house values, in this context, are determined by dwelling and lot characteristics, neighborhood attributes, and school quality, and that school quality is determined by socio-economic characteristics of the population it serves, ours is essentially a reduced form model.[10] Second, perceptions of public safety may be higher in affluent communities and many other aspects of community life are probably perceived to be superior in such neighborhoods. Among these advantages are superior parks and recreation facilities, better shopping opportunities, more aesthetic appeal, and other intangibles of neighborhood quality. Finally, encroachment of low-income households may trigger fears of diminished neighborhood quality and a corresponding decline in property values.[11]

Four location variables are included to augment these neighborhood characteristics. Three are distances measured in miles: distance to the center of the central city (DIST_CBD), distance to the St. Johns River (DIST_STJOHNS), and distance to the Atlantic Ocean (DIST_ATLANTIC). Sales prices should be negatively correlated with distance to such employment centers and amenities. The fourth variable, WATERFRONT, should capture the positive value of having a house located immediately on a body of water, such as an ocean, lake or river.[12]
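The text does not say how the distances in miles were computed. One common approach, shown here purely as an assumption, is the great-circle (haversine) distance between the geocoded dwelling and a reference point; the function name and the Earth-radius constant below are illustrative choices, not details from the paper.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius in miles

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Hypothetical check: one degree of latitude along a meridian.
one_degree = great_circle_miles(30.0, -81.0, 31.0, -81.0)
```

Distance to a linear feature such as a river or coastline would instead require the minimum distance to the feature's geometry, which a GIS package normally supplies.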

Table 1 reports the results for a hedonic regression using the full sample of 7,645 dwellings. The dependent variable and age were logged, whereas the other variables are in levels. Column 1 in Table 1 reports the OLS estimates, whereas column 2 shows the estimates corrected for spatial autocorrelation. The results indicate that such a correction did not lead to a major change in the results.[13] The results are generally satisfactory, with a coefficient of determination of almost 0.80. The table shows that most of the dwelling characteristics are significant and have the expected sign, except for central heat, which has an insignificant coefficient. This is probably due to the small number of houses that do not have central heat (see Appendix A). The lot size variable has the expected sign, but its coefficient is very small. An acre of land is valued at only $7,572 for the average house. The average house has a lot size of 0.31 acres, so these estimates suggest the cost of the lot accounts for only a small fraction of the sales price.
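The per-acre dollar figure can be related to the semi-log coefficient as follows. The symbols b (the lot-size coefficient) and \bar{P} (the average sales price) are introduced here only for illustration, and the final line is a linear approximation using the two figures quoted in the text.

```latex
\ln P = \dots + b\,\mathrm{LOTSIZE} + \dots
\quad\Longrightarrow\quad
\frac{\partial P}{\partial\,\mathrm{LOTSIZE}} = b\,P,
\qquad
\text{value of one acre at the average house} \approx b\,\bar{P} = \$7{,}572 .

\text{Implied lot value of the average house} \approx 0.31 \times \$7{,}572 \approx \$2{,}347 .
```

Under this approximation the lot contributes only about $2,347 to the price of the average house, which is what makes the estimated lot-size effect look implausibly small.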

Neighborhood characteristics generally have the expected sign. Population density has the expected negative sign and prices are higher in neighborhoods that are more affluent, have fewer blacks, a higher percentage of older residents, and more white-collar workers. Sales prices are higher when there are more owner occupants in the neighborhood, but this effect diminishes as this proportion rises since the squared term is negative and significant. The effect on the house price of the Hispanic population variable is positive and significant at the 90% level. This finding is contrary to expectations.
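The diminishing effect of owner occupancy follows directly from the quadratic term. Writing b_1 and b_2 for the coefficients on OWNER% and its square (symbols introduced here for illustration only; the estimates themselves are in Table 1), the marginal effect on log price is:

```latex
\frac{\partial\,\mathrm{LPrice}}{\partial\,\mathrm{OWNER\%}} = b_1 + 2\,b_2\,\mathrm{OWNER\%},
\qquad b_1 > 0,\; b_2 < 0,

\text{which falls to zero at}\quad \mathrm{OWNER\%}^{*} = -\frac{b_1}{2\,b_2}.
```

Below this turning point additional owner occupancy raises prices at a decreasing rate; whether the turning point lies inside the observed range of OWNER% depends on the estimated coefficients.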

Each of the location variables is statistically significant. However, contrary to standard urban theory, dwellings that are further from the central business district, ceteris paribus, have higher prices, and dwellings further away from the St. Johns River have a higher price. Thus Table 1 indicates that most of the variables have reasonable estimates, but that puzzling magnitudes and signs exist on a few of the coefficients. Such odd estimates might be an indication of a heterogeneous sample that is characterized by unmeasured quality differences at the extremes of the price distribution. If such sub-groups exist then some filtering mechanism is needed to extract a homogeneous group. We turn to this task in the next section.

III. Selecting a Homogeneous Housing Group

If housing markets are competitive and dwellings differ only in the number of constant quality housing units they embody, there would be no reason to seek a homogeneous population of sold houses. Our housing data probably do an acceptable job of measuring the number of housing units in each dwelling; our problem is that unmeasured quality attributes are probably not randomly distributed over the population of sold houses. In this section we illustrate how this problem can bias hedonic regression coefficients, discuss previous attempts at constructing such a homogeneous group, and examine an alternative proposal for selecting a sample that minimizes unmeasured quality attributes.

To illustrate the problem of unmeasured quality differentials we generate Monte Carlo data using the following two different processes. Label the first group as the “normal” quality group because it is the more common type of house, and the second group as the “high quality” dwelling. The difference between the groups is the fact that the second group has an unmeasured quality variable that the researcher is unaware of. We generate the first group, the “normal” group according to the following equation:

$$P_h = \beta X_h + \delta Z_h + \varepsilon_h,$$

where $X_h$, $Z_h$ and $\varepsilon_h$ are random variables with zero mean and unit variance. $P_h$ is then constructed by adding the weighted values of the characteristics $X_h$ and $Z_h$ and the idiosyncratic shock $\varepsilon_h$. The second group, the “high quality” group, includes an unmeasured quality variable:

$$P_h = \beta X_h + \delta Z_h + \gamma Y_h + \varepsilon_h,$$

where the $Y_h$ variable is a measure of the unobserved quality and $\gamma$ is the effect of the variable on the sale price of the house. Assume that the $Y_h$ and the $Z_h$ variables are perfectly correlated.[14] Further assume that $X_h$ and $Z_h$ have the same coefficients for both groups, namely $\beta = 1.00$ and $\delta = -3.00$, and that the coefficient on $Y_h$ differs for the two experiments.
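The two data-generating processes can be simulated along the following lines. This is a sketch under stated assumptions rather than the experiment reported in the paper: the group sizes, the random seed, and the coefficient on the unmeasured quality variable (`b_y = 2.0`) are hypothetical, and perfect correlation is implemented in the simplest way by setting the quality variable equal to Z.

```python
import numpy as np

rng = np.random.default_rng(42)
n_normal, n_high = 4000, 1000      # hypothetical group sizes
b_x, b_z = 1.00, -3.00             # coefficients stated in the text
b_y = 2.00                         # hypothetical coefficient on unmeasured quality

def simulate(n, quality_coef):
    """Draw one group; the researcher observes only X and Z."""
    x = rng.standard_normal(n)
    z = rng.standard_normal(n)
    y = z                          # assumption: perfectly correlated with Z (identical here)
    e = rng.standard_normal(n)
    p = b_x * x + b_z * z + quality_coef * y + e
    return np.column_stack([x, z]), p

def ols(X, p):
    """OLS without an intercept, since every variable has zero mean."""
    coef, *_ = np.linalg.lstsq(X, p, rcond=None)
    return coef

X_n, p_n = simulate(n_normal, 0.0)   # "normal" group: no unmeasured quality
X_h, p_h = simulate(n_high, b_y)     # "high quality" group

coef_normal = ols(X_n, p_n)
coef_high = ols(X_h, p_h)
coef_pooled = ols(np.vstack([X_n, X_h]), np.concatenate([p_n, p_h]))
```

Because the omitted quality variable moves one-for-one with Z in this setup, the estimated coefficient on Z in the high-quality group absorbs the quality effect, and the pooled estimate falls between the two group-specific values. This is the sense in which mixing sub-groups can distort the coefficient on an observed attribute.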