Using Cost-Distance Surface to Model Local Spatial Correlation in Housing Price Model



JIELAI MA

Introduction

The hedonic price model is one of the most frequently used models for describing the relationship between housing prices and other covariates. The term “hedonics” is derived from ancient Greek and means “pleasure doctrine”. The hedonic hypothesis assumes that products are prized for their utility-bearing attributes. If the price P of a product depends on Z, where Z is a vector of attributes, then P is a function of Z. For housing prices, Z can include the housing characteristics and the location factors. The housing characteristics can be the year in which the house was built, the number of bathrooms, or the size of the living room. The location factors include environmental variables such as air quality (Beron, Murdoch and Thayer 2001), neighborhood information, and policy shocks. With a first-order Taylor expansion, the hedonic price model can be expressed in a linear functional form in the above factors.
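The linearization step can be written out as follows. This is only a sketch under the assumption of a first-order expansion around a reference attribute bundle Z_0; the symbols Z_0, K, \beta_0 and \beta_k are introduced here for illustration and are not the author's notation.

```latex
\begin{align}
P &= f(Z), \qquad Z = (z_1, \dots, z_K) \\
P &\approx f(Z_0) + \sum_{k=1}^{K} \left.\frac{\partial f}{\partial z_k}\right|_{Z_0} (z_k - z_{0k}) \\
  &= \beta_0 + \sum_{k=1}^{K} \beta_k z_k ,
\qquad \beta_k = \left.\frac{\partial f}{\partial z_k}\right|_{Z_0}, \quad
\beta_0 = f(Z_0) - \sum_{k=1}^{K} \beta_k z_{0k}
\end{align}
```

Adding an error term to the last line gives the regression form that is usually estimated.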

In spatial analysis, a spatial pattern can be decomposed into a first-order effect, which is caused by external factors and can be modeled as a spatial trend surface, and a second-order effect, which is caused by spatial interactions among the units themselves. If spatial clusters exist, they may be caused by three factors: local external factors, spatial interactions of the units, or the data aggregation process. For a univariate model, the trend surface can be detected with a kernel density model or the K function in spatial statistics, while kriging is a major tool for analyzing local clusters. In the econometrics literature, there are a couple of methods for modeling these two effects.

Beron et al. (2001) use the house’s longitude and latitude to create a quadratic trend surface in addition to the local attributes. Using distance to the central business district (CBD) (Hua Sun, Yong Tu and Shi-Ming Yu, 2005) is an alternative. Both approaches are suitable under the monocentric city hypothesis. Using county dummy variables (Graves, Murdoch, Thayer and Waldman 1988) is another easy way to model the trend surface. Once the major spatial trend surface has been captured by one of these techniques, local cluster information becomes important in hedonic price modeling.

Additional environmental externalities are often included in the hedonic price model, such as air quality by Ridker and Henning (1967), water quality by David (1968), landfills by McClelland, Schulze and Hurd (1990), and transportation networks by Ready, Berger and Blomquist (1997). However, these variables cannot fully explain the local spatial clustering of housing prices. From the data collection point of view, our models will always have missing variables. Models that incorporate spatial correlation have been widely adopted to address this problem. On the other hand, the high variance of micro-level datasets makes spatial correlation modeling ineffective in some circumstances. This article uses the cost-distance idea to modify the distance concept in spatial correlation and explores this idea in a housing price model.

With the recent development of GIS technology, sophisticated spatial data models have become easily accessible through user-friendly software such as Arc-GIS and Mapinfo. The object-oriented programming environment also adds flexibility to empirical analysis. In particular, the availability of Arc-Object, geoprocessing scripts, Model Builder, and the bridge between Esri and SAS makes it possible to integrate spatial analysis into comprehensive policy or business analysis. This article explores the use of Visual Basic to connect Arc-GIS and the R statistics software and carry out our model analysis. The cost-distance model in Arc-GIS is well developed and easily accessed, while the econometrics models in R are on the research frontier and are well maintained by experienced scholars. Connecting the two programs opens another door to analysis possibilities.

This paper has the following sections: 1. Cost distance vs. Euclidean distance. 2. Connecting Arc-GIS and R statistics software. 3. Using cost distance to model spatial correlation in a housing price model.

1 Cost Distance vs. Euclidean Distance

"Everything is related to everything else, but near things are more related than distant things." (Waldo Tobler, 1970). This is the first law of geography, and the concept has been widely used throughout spatial analysis. For example, in a kriging model we describe the variogram based on the distance relationship between existing points and use it to estimate the kriging surface. In spatial econometrics, we use inverse distance to describe the spatial correlation between located data objects. Social science scholars also use this concept to build their models and depict reality. Falk and Abler (1980) illustrate effort distance based on communication barriers. Wasserman and Faust (1994) use social network distance to explain the relationship between different social classes. Conley (2002) constructs a socio-economic distance to describe demographic differences. None of them uses the true spatial distance between the objects; they describe a conceptualized relationship through a perception of distance. In spatial analysis, we also often explore similar conceptualized ideas to model distance. For example, Meyerson et al. (2001) analyze a traffic network using cost distance. However, in the spatial econometrics literature, when we go to micro-level spatial data, geographic distance or connectivity becomes the only choice for describing the data relationship, and the conceptualized distance is still missing. Cost distance, which modifies the distance concept by applying flexible weights between points, might be one way to put this piece back into the literature.

Figure A: Cost Distance of Traffic Network

The cost-distance method uses a suitable cost raster dataset to add weights to the distance between points. For example, in Figure A we have a traffic network cost raster. Cells that belong to a highway have a value of 1 and cells that belong to local roads have a value of 3. Passing through two highway cells costs only 2 (time units), while passing through two local road cells costs 6 (time units). This makes it possible to find the least-cost path between two points on the graph. If we can find the right cost raster, so that cost distance captures the conceptualized distance between points, cost distance can also explain the spatial relationship between those points. We assume that cost distance can describe the spatial correlation more accurately in some circumstances. Figure A also shows how cost distance works for modeling correlation. Suppose a merchandiser is located at the star in the figure, and two groups of people live at the locations under the two white arrows. We assume that people only consider driving time. The group that lives farther away may then contribute more to the merchandiser’s sales if the highway makes its driving time shorter.
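The accumulation rule in this example can be sketched in a few lines of Python. This is a minimal illustration and not the Arc-GIS algorithm: it assumes that entering a cell costs that cell's value and that moves are limited to the four edge neighbours. The raster, the function name cost_distance and the cell positions are all hypothetical.

```python
import heapq

def cost_distance(cost, start):
    """Accumulated least cost from `start` (row, col) to every raster cell.

    Simplified convention (an assumption): moving into a cell costs that
    cell's value, and only the four edge neighbours can be reached in one move.
    """
    rows, cols = len(cost), len(cost[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    r0, c0 = start
    dist[r0][c0] = 0
    heap = [(0, r0, c0)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry, a cheaper route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist

# Hypothetical raster: 1 = highway cell, 3 = local road cell.
raster = [
    [3, 3, 3, 3, 3],
    [3, 3, 3, 3, 3],
    [3, 3, 3, 3, 3],
    [1, 1, 1, 1, 1],
]
surface = cost_distance(raster, start=(3, 0))  # merchandiser on the highway
far_on_highway = surface[3][4]   # four cells away along the highway
near_on_local = surface[1][0]    # two cells away over local roads
```

In this hypothetical raster the household four cells away along the highway is reached at a cost of 4, while the household only two cells away over local roads is reached at a cost of 6, which mirrors the merchandiser example.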

2 Connecting Arc-GIS and R Statistics Software

To embed the cost-distance concept in a spatial econometrics model, there are two technical challenges: 1. creating the cost-distance matrix of the points; 2. incorporating the cost-distance matrix into the spatial econometrics estimation model. There is no existing function in Arc-GIS 9.2 to calculate the cost distance between points. The cost distances from multiple points to point A can be calculated by creating one cost-distance surface starting from point A and extracting the values at those points. However, the distances from multiple points to point B must be calculated from a separate cost-distance surface for point B. The calculation has been implemented using Arc-Object. The procedure is computationally intensive. However, because existing built-in Arc-GIS models can be reused when customized models are needed, the overall programming requirement is reduced. We choose Visual Basic and Arc-Object to implement the above spatial model. The model can also be easily integrated into a graphical user interface (GUI) under the Visual Basic programming language.
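The one-surface-per-point procedure described above can be sketched as follows. This is a hedged Python illustration, not the Visual Basic and Arc-Object implementation used in the paper. It assumes that a step between two edge-adjacent cells costs the average of their two values, a convention chosen here so that the resulting matrix is symmetric; the function names, the raster and the house positions are hypothetical.

```python
import heapq

def cost_surface(cost, source):
    """Least accumulated cost from `source` (row, col) to every cell.

    Assumed convention: a step between two edge-adjacent cells costs the
    average of their values, so the cost from A to B equals the cost from B to A.
    """
    rows, cols = len(cost), len(cost[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    dist[source[0]][source[1]] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + (cost[r][c] + cost[nr][nc]) / 2.0
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return dist

def cost_distance_matrix(cost, points):
    """Build one cost surface per point and extract its values at all points."""
    matrix = []
    for p in points:
        surface = cost_surface(cost, p)
        matrix.append([surface[r][c] for r, c in points])
    return matrix

# Hypothetical raster (1 = highway, 3 = local road) and house cells.
raster = [
    [3, 3, 3],
    [3, 3, 3],
    [1, 1, 1],
]
houses = [(2, 0), (2, 2), (0, 0)]
D = cost_distance_matrix(raster, houses)
```

The loop makes the computational burden visible: n points require n separate surfaces, each covering the whole raster, which is why the procedure becomes expensive for large samples.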

For the spatial econometrics estimation, we use the spdep 0.4-4 package in R. R’s spdep package, maintained by Roger Bivand, is one of the major spatial econometrics software packages available. Others include GeoDa, maintained by Luc Anselin, Matlab’s spatial econometrics toolbox, maintained by James P. LeSage, and the commercial spatial analysis package in Splus. R is open source software and is free to download. Users can easily modify the source code to suit their needs. There are more than 1,000 packages available, covering data analysis from the natural sciences to the social sciences. Several useful spatial analysis and econometrics packages are available, which makes R more powerful than most other statistical software in the spatial research field. To connect Arc-Object and the R statistics software, we use the Rcom server developed by Thomas Baier and Erich Neuwirth (2004). Figure B shows the data processing diagram. The software also incorporates ordinary least squares and most spatial econometrics models with inverse Euclidean distance as the weight matrix. It will be available for download at the end of 2007.

Figure B: Connecting R and Arc-GIS

3 Using Cost-Distance to Model Spatial Correlation in Housing Price Model

To capture local clusters, there are several approaches based on spatial econometrics and spatial statistics models. Using spatial autocorrelation is one of the most frequently used. It can be implemented in the spatial lag and spatial error models in spatial econometrics or in kriging in spatial statistics.

The presence of local correlation in the residuals is a strong indication that missing local factors or local interactions need to be modeled. The spatial autocorrelation idea can also be implemented in a housing price model through a spatial lag or spatial error model (Beron, Hanson, Murdoch and Thayer, 2004). However, they reported that introducing localized spatial dependence did little to reduce the variability. The reason might be that the inflexible spatial lag operator cannot fully capture the micro-level variation of the spatial autocorrelation (Bell and Bockstael 2000). Bell and Bockstael use a higher-order spatial lag model to attack this problem. When they analyzed the housing market in northern Anne Arundel County, Maryland, the model with higher-order lags performed better than the one-lag model. However, because of the complexity of the model, traditional maximum likelihood estimation becomes infeasible once the number of observations exceeds the largest matrix size for which eigenvalues can be computed. They use the generalized-moments estimation technique (Kelejian and Prucha 1999) to solve the problem. However, the spatial correlation they describe is still based on Euclidean distance. As we know, boundaries between different communities create strong segregation in housing prices, and the above modeling technique cannot handle this. That is also why most empirical research on spatial autocorrelation models focuses on macroeconomic issues. The spatial correlation needs to be adjusted according to the micro-level diversity. It is well known that estimation results are sensitive to the specification of the weight matrix. We might still use a higher-order spatial lag model to reflect some of the variation. However, the modeling of local autocorrelation need not be based only on contiguity or Euclidean distance.

As in the social network and macroeconomics literature, the link matrix in spatial econometrics can be transformed into a meaningful distance-weighted matrix. The distance can be the difference between two social classes or the trade volume between two countries. In this paper, we try to create a reasonable cost distance to substitute for the traditional Euclidean distance in spatial econometrics error component models. In the spatial error model we are modeling the spatial relationship of the unobserved effects, and we take the most significant unobserved effect to be the community distance. We define the community distance as follows: each house belongs to one local community or to mixed communities; houses inside the same community have a lower cost distance between each other; houses that belong to different communities have a higher cost distance because of unobserved boundaries. Using community distance to describe the spatial relationship of the unobserved variables rests on the assumption that a regression model that uses only housing characteristics leaves an unobserved community effect. How to model the community distance, or the cost distance between different houses, becomes the major challenge in our analysis.
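The substitution of cost distance for Euclidean distance in the weight matrix can be illustrated with a small NumPy sketch. It is not the spdep routine used in the paper: it assumes plain inverse-distance weights with row standardization, and the three houses and both distance matrices are hypothetical. Houses 0 and 1 are taken to share a community, while house 2 lies across an unobserved boundary.

```python
import numpy as np

def inverse_distance_weights(dist):
    """Row-standardised inverse-distance weight matrix with a zero diagonal."""
    dist = np.asarray(dist, dtype=float)
    w = np.zeros_like(dist)
    off_diag = ~np.eye(len(dist), dtype=bool)
    positive = off_diag & (dist > 0)
    w[positive] = 1.0 / dist[positive]
    row_sums = w.sum(axis=1, keepdims=True)
    # Rows without neighbours stay all zero instead of dividing by zero.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

# Hypothetical distances between three houses.
euclidean = [[0, 2, 2],
             [2, 0, 2],
             [2, 2, 0]]
# Same houses, but crossing the community boundary to house 2 is costly.
cost = [[0, 2, 8],
        [2, 0, 8],
        [8, 8, 0]]

W_euclid = inverse_distance_weights(euclidean)
W_cost = inverse_distance_weights(cost)
```

With Euclidean distance, house 0 gives equal weight to houses 1 and 2. With the hypothetical cost distance, most of the weight shifts to house 1 in the same community, which is the behaviour the community distance is meant to produce in the error structure.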

Creating Cost Distance Spatial Correlation of Housing Price Model

For most house price analyses, the houses’ characteristics are available; the unobserved effects often result from the local community definition. The idea originally comes from the literature on homogeneous housing sub-markets, which shows that most housing markets cannot be estimated as a single market. They need to be divided into sub-markets that reflect the communities’ definition. For example, Thibodeau (2003) divided Dallas County into 11 estimation districts using its highway system. Other attempts include using a Wald test to detect different parameter structures in different regions (Chung and Vijverberg 2003). Case (2004) uses the estimated parameters and residuals of different census tracts to group regions into sub-regions and conduct the estimation. In our initial analysis, the residuals of a simple regression show a strong pattern that is connected with local communities and their boundaries. To model the cost distance between the effects that a housing price regression on housing characteristics fails to capture in different communities, we could use the demographic composition to define the community clusters and create the cost distance for calculating spatial correlation. However, census demographic information has its limitations: its boundaries are fixed and its data points are too coarse to estimate the cost distance between individual houses. Although the local demographic composition is a strong indicator of community classification, the housing price itself is another strong sign of community differences; it reflects most of the micro-diversity of local factors and is a causal result of community classification. We therefore use previous housing prices to define the cost-distance spatial correlation as the spatial relationship between unobserved effects in the spatial econometrics model. That is, we use the previous housing price pattern to obtain the spatial weight matrix instead of using inverse Euclidean distance.

Compared with the spatial pattern of housing prices in Dallas, the time trend of price changes is relatively small. For example, the housing sales data contain 17,069 observations for 1998 and 1999. In 1998, 8,644 sales occurred in Dallas County, with a mean sales price of $130,182 and a standard deviation of $144,317. In 1999, 8,425 transactions occurred, with a mean sales price of $134,560 and a standard deviation of $161,339. The difference between 1998 and 1999 is not significant, whereas the standard deviation of the price within each year is very large. For kriging interpolation, a larger number of observations gives a better result. (We also notice some noise in the final slope surface, but the calculated cost distances are not affected by such noise, as shown in the example.) We therefore use all the observations we have from 1980 to 1999 to create the kriging surface. In Arc-GIS, there are three kriging methods available. We use the ordinary kriging model in Arc-GIS to create a locally smooth estimated surface of the housing price before 2000. Before creating the price surface we apply a logarithmic transformation to reduce the high price variation, which also corresponds to the traditional hedonic housing price method. The housing price (or the log of the housing price) exhibits a strong global trend in Dallas County (Thibodeau, 2003). If the trend can be modeled explicitly and the remaining process is stochastic, universal kriging might be a better choice (Bailey and Gatrell, 1995). However, the housing pattern cannot be modeled well by the linear-with-linear-drift or linear-with-quadratic-drift models available in Arc-GIS universal kriging. Instead of universal kriging, we therefore use the ordinary kriging method in Arc-GIS to model the surface.

The kriging uses a spherical model with the 12 nearest neighbors as a variable search radius, which means that the interpolation is done locally while the variogram is estimated globally. To create a smooth surface, our output cell size is 66 feet. We compared the two kriging methods, and ordinary kriging performs much better than universal kriging. The kriging surface can be seen in Figure D2.
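The interpolation step can be sketched as ordinary kriging with a spherical semivariogram and a nearest-neighbour search. This is a minimal NumPy illustration, not the Arc-GIS implementation: the variogram parameters (nugget, sill, rng) are taken as given instead of being fitted globally, and the coordinates and prices are hypothetical values used only to show the mechanics on log prices.

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    """Spherical semivariogram: 0 at h = 0, rising to `sill` at h >= rng."""
    h = np.asarray(h, dtype=float)
    s = np.minimum(h / rng, 1.0)
    gamma = nugget + (sill - nugget) * (1.5 * s - 0.5 * s ** 3)
    return np.where(h > 0, gamma, 0.0)

def ordinary_kriging(xy, z, target, k=12, nugget=0.0, sill=1.0, rng=1.0):
    """Predict z at `target` from its k nearest observations.

    Returns the prediction and the kriging weights of the neighbours used.
    """
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=float)
    target = np.asarray(target, dtype=float)
    d0 = np.linalg.norm(xy - target, axis=1)
    idx = np.argsort(d0)[:k]              # local neighbourhood
    pts, vals = xy[idx], z[idx]
    n = len(idx)
    pair = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    # Ordinary kriging system with a Lagrange multiplier so weights sum to 1.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = spherical(pair, nugget, sill, rng)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = spherical(d0[idx], nugget, sill, rng)
    weights = np.linalg.solve(A, b)[:n]
    return float(weights @ vals), weights

# Hypothetical sale locations and prices; kriging is done on log prices.
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
log_price = np.log([120000.0, 135000.0, 110000.0, 150000.0, 200000.0])
pred, weights = ordinary_kriging(xy, log_price, (0.5, 0.5), k=4, rng=3.0)
```

The weights always sum to one, and with a zero nugget the predictor reproduces the observed value at a sampled location. Only the neighbourhood used for each prediction is local, which matches the description of a local interpolation under a single variogram model.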