Bayart, Bonnel, Morency: Survey Mode Integration and Data Fusion: Methods and challenges

8th International Conference on Survey Methods in Transport

Annecy, France, May 25-31, 2008

Resource paper for Workshop on Best Practices in Data Fusion

SURVEY MODE INTEGRATION AND DATA FUSION: METHODS AND CHALLENGES

Caroline Bayart, Patrick Bonnel

LET – ENTPE – CNRS – Université Lyon 2 – France

Catherine Morency

PhD, Assistant Professor, Department of Civil, Geological and Mining Engineering

MADITUC Group / CIRRELT, Ecole Polytechnique de Montréal, Canada

The development of new technologies (smart cards, GPS, cell phones…) is increasing the production of data, sometimes on a continuous basis. For various reasons, however, including privacy concerns, these data generally do not contain all the information needed for transportation purposes, especially for modelling. The same is true of most official data sources, such as travel surveys, time use surveys and housing surveys: each contains valuable information, but none contains all the information required. Data fusion methodologies make it possible to enrich the data in order to better meet these needs. At the same time, data collection faces increasing difficulties (cost, response rates…) in obtaining data that are representative of the whole population through classical methods. Combining survey modes or methods is in some cases an opportunity to overcome, or at least reduce, this problem. Mixing such data requires methods for identifying the effects of mode or method on the collected data and for correcting them. This paper presents some of the methods available for data fusion and survey mode integration. Applications are presented for the Lyon and Montreal areas.

1 INTRODUCTION

Data fusion and the use of multiple databases have appeared in the literature on transport surveys for some time (Goulias, 2000). Their use, however, considerably predates this, but without direct reference to data fusion methodologies. One familiar example is when a weight is applied to a sampling unit on the basis of a comparison with population data or when a synthetic population is generated from aggregate tables that define the reference world. Moreover, population synthesis, which has been discussed for some time in the scientific literature (specifically in the framework of microsimulation), raises similar issues to data fusion (Miller, 1996; Stopher and Greaves, 2003; Voas and Williamson, 2000; Beckman et al., 1996; Ton and Hensher, 2003…).

Integrating a number of data sources in the transport analysis and decision-making process is becoming increasingly complex, primarily because of the rapid change in the state, form and quantity of the available data. In the current context where information systems and technologies can be deployed rapidly and are practically becoming standard tools, the transport community is moving towards a state that Olsen (1999) has described as data chaos: a growing quantity of data, and increasing reliance “on the quality and comprehensiveness of disaggregate travel data to support modern analytical procedures” (Greaves, 2006), the development of very demanding transport models and increasing difficulties (cost, response rates) in collecting conventional transport data. Although data fusion or integration does not yet seem to be the standard approach for managing data chaos of this type, it represents a very useful methodological approach which should be examined in depth and, above all, discussed and trialled. A number of professionals are experiencing difficulties in effectively managing and exploiting the new sources of data at their disposal, in particular the data which are output on a continuous basis by automated systems. They are increasingly aware of the need to organize their data better, to conserve them and to use them in an integrated manner in order to make them more useful for their everyday operational management and planning tasks. Data fusion methods and methods for integrating data acquisition techniques must adapt to these new types of data.

Moreover, an increasing number of surveys are confronted by the problem of mixing different methodologies in the same survey. This occurs when, in order to deal with a reduction in the response rate, an alternative mode is proposed to nonrespondents, or when a choice of mode is made available in order to reduce respondent burden (Yun and Trumbo, 2000; Morris and Adler, 2003). Making contact with specific groups in a survey which targets a wide audience can require ad-hoc protocols to avoid an inadequate response rate or biased responses (for example illiterate individuals in a self-administered survey). The issues which then arise are the comparability of the data and the methodologies which can be used to separate differences in responses that are due to the survey methodology from real differences in the behaviour of each group.

The paper begins by setting out the reasons for increasing concerns about combining different survey modes and data fusion (section 2). Analysis will then focus on the combination of data obtained using different methodologies in the same survey (section 3). We shall then turn our attention to the combined use of data from different sources. The next section (section 4) begins with a presentation of the concepts and principles involved in data fusion, and continues with a description of the informational issues and the methods used. The authors then propose some points for discussion and suggest some possibilities for further research (section 5).

2 THE RATIONALE FOR DATA FUSION AND THE INTEGRATION OF SURVEY METHODS

The increasing difficulty of obtaining survey data that are representative of the target population and the increasing complexity of the data that are required to feed ever more sophisticated models mean that it is generally no longer possible to collect all the data during a single survey or with a single methodology. At the same time, new technology is potentially able to provide a large quantity of data that partially answers some of the issues facing us today. Processing and combining these different sources of data is becoming extremely important as a means of increasing our knowledge of behaviours and how they are changing, as well as improving models.

2.1 The representativeness of transport surveys

Response rates for conventional surveys are tending to fall (Atrostic and Burt, 1999). The increasing number of nonresponses observed in travel surveys is explained by a number of factors which are unlikely to disappear in the future. The increasing number of surveys conducted in recent years, in particular for commercial ends, has reduced the level of acceptance. Households are increasingly acquiring equipment that restricts the intrusion of “strangers” into their private lives (telephones, answering machines, entry phones…), which makes it more difficult to make contact with them and increases the cost of recruitment (Zmud, 2003). Weariness with surveys, combined with anxiety about revealing personal information, is tending to increase refusal rates. This growing propensity for nonresponse is reducing confidence that survey results are representative of the studied population (Cobanoglu et al., 2001). A large number of techniques are available to attempt to limit nonresponses, for example giving prior warning and reducing respondent burden. In spite of the undeniable value of these techniques, nonresponse biases remain. It is primarily the desire to reduce these biases that explains the increased use of data weighting methods. These methods nevertheless always involve the assumption that nonrespondents with certain socioeconomic characteristics behave like respondents with the same characteristics. However, a considerable amount of research casts doubt on this hypothesis (Ampt, 1997; Richardson and Ampt, 1993; Richardson, 2000; Murakami, 2004). Combining survey modes therefore seems to be a possible solution insofar as the individuals who respond to one medium are not necessarily the same as those who respond to another (Bonnel and Le Nir, 1998; Bonnel, 2003; Bayart and Bonnel, 2008).
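The weighting assumption discussed above can be made concrete with a minimal post-stratification sketch in Python. The strata, population counts and trip values are hypothetical and serve only as an illustration; they are not taken from any of the surveys cited. Each respondent receives the weight N_h / n_h of his or her stratum, which amounts to assuming that the nonrespondents of a stratum travel like its respondents.

```python
from collections import Counter

def poststratification_weights(sample_strata, population_counts):
    """Return one weight per stratum: population size / number of respondents."""
    sample_counts = Counter(sample_strata)
    return {s: population_counts[s] / n for s, n in sample_counts.items()}

def weighted_mean(values, strata, weights):
    """Mean of `values`, each observation weighted by its stratum weight."""
    total_weight = sum(weights[s] for s in strata)
    return sum(weights[s] * v for v, s in zip(values, strata)) / total_weight

# Hypothetical sample: the "young" stratum is under-represented among respondents.
strata = ["young", "young", "old", "old", "old", "old"]
trips = [4, 5, 2, 3, 2, 3]                   # daily trips reported (illustrative)
population = {"young": 600, "old": 400}      # assumed reference totals

weights = poststratification_weights(strata, population)
unweighted = sum(trips) / len(trips)
weighted = weighted_mean(trips, strata, weights)
print(weights, round(unweighted, 2), round(weighted, 2))
```

In this toy sample the weighted mean (3.7 trips) is higher than the unweighted mean (about 3.17) because the under-represented stratum travels more. The correction is only as good as the assumption that respondents and nonrespondents within a stratum behave alike, which is precisely the hypothesis questioned in the literature cited above.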

The reduction in the completeness of most official lists of residents is a major problem in many countries. This is obviously the case for telephone surveys because of the increasing number of individuals who only own cell phones and who are generally not listed in the telephone directories, or the growing number of individuals who do not wish their telephone number to appear in these directories. Once again, combining survey modes can provide a way of getting round the problem, as in the case of the German and Belgian national transport surveys in which the surveys are conducted by CATI for that part of the sample for which it is possible to obtain a telephone number and by post for the rest (Bonnel and Armoogum, 2005; Hubert and Toint, 2003).

2.2 Increased awareness of the complexity of urban phenomena

Growing concerns about critical issues such as sustainable development, social equity and public health problems also encourage the integration of databases from a variety of fields. There is also an increasing amount of multidisciplinary research that examines the relationships between travel behaviours and urban planning, as well as their impacts on health (pedestrian safety, exposure to road traffic, pollutant emissions, air quality, active transport and obesity, etc. (Sallis et al., 2004)). Likewise, the appraisal of transport policies from the standpoint of sustainable development requires a consideration of economic, environmental and social dimensions. It is very rare for the existing databases to permit a combined analysis of all these dimensions. The production of ad-hoc data remains very costly and encounters major technical difficulties because of the cumbersome nature of the questionnaires generated by the investigation of multiple issues of this type. The fusion of existing data from various sources thus provides interesting possibilities for enriching the collected data and conducting cross analysis.

2.3 The increasing availability of observation systems

Technological developments, particularly in the areas of computer-aided operating systems and decision-making aid systems, are changing the way transport networks and the components of the system are managed and operated. A number of transport authorities have installed systems for the temporal and spatial monitoring of their network: advance vehicle location and passenger counting systems for transit vehicles, smart cards to access transit services, GPS on taxis or adapted vehicles, on-line reservation and follow-up systems for members of carsharing groups (Cassias and Kun, 2007). In addition to network-related data, collective data such as cell phone traces (tracking individuals, need for coupling with network data in order to deduce the transport mode), Bluetooth markers (tracking portable devices or vehicles) or GPS traces of private vehicles may become available, if confidentiality concerns do not prevent their use. This data has the potential to provide information about the use of transport networks by the fusion of spatio-temporal traces and geographic layers (matching position on the ground to the network). One of many possible examples is provided by the Belgian firm BE MOBILE (in collaboration with the Belgian Transport Ministry and a radio station) which provides information on the state of networks based on journey time data from a fleet of trucks fitted with GPS tracking devices. Another possibility is to analyze the behaviour of users when additional information is collected. This may be done, for example, during a survey conducted with PDAs fitted with GPS tracking systems where the traveller uses the PDA to validate and complete the information that is automatically collected by the GPS tracking system (trip purpose, mode used, etc.). See, for example, Kochan et al. (2006), Pendyala (1999), Stopher et al. (2007) or Stopher et al. (2005).
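As an illustration of what matching a position on the ground to the network can involve, the following Python sketch snaps a single GPS point to the nearest link of a small network. It is a deliberately simplified assumption: coordinates are taken to be planar (for example projected metres), the link identifiers and geometry are hypothetical, and each point is treated independently, whereas operational map-matching procedures generally also use the sequence of points and the network topology.

```python
import math

def project_on_segment(p, a, b):
    """Return the point of segment [a, b] closest to p and its distance to p."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:                      # degenerate link: a and b coincide
        t = 0.0
    else:                                  # clamp the projection to the segment
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    q = (ax + t * dx, ay + t * dy)
    return q, math.hypot(px - q[0], py - q[1])

def snap_to_network(p, links):
    """Return (link_id, snapped_point, distance) for the link nearest to p."""
    best = None
    for link_id, (a, b) in links.items():
        q, d = project_on_segment(p, a, b)
        if best is None or d < best[2]:
            best = (link_id, q, d)
    return best

# Hypothetical two-link network and one GPS fix.
links = {"A": ((0.0, 0.0), (10.0, 0.0)), "B": ((0.0, 5.0), (10.0, 5.0))}
link_id, snapped, dist = snap_to_network((3.0, 1.0), links)
print(link_id, snapped, dist)
```

Here the point is assigned to link "A", one unit away. With real traces, GPS noise and parallel links make this nearest-link rule unreliable on its own, which is one reason why coupling with additional network data is needed.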

Other network operators also produce a large amount of data which can have transport applications (tracking the location of cell phones for example). This data is collected on a continuous basis, often unbeknown to travellers. The data can become very interesting for the transport community if effective data processing and analysis methods are developed. First and foremost, however, these tools frequently serve a technical or administrative goal in their own right (limiting fraud by monitoring the rights of smartcard holders, validating the distances travelled in the case of GPS tracking systems installed in carsharing vehicles). As a result, the data produced are not generally very suitable for behavioural analysis or feeding models. Furthermore, it is quite unusual for the data to contain contextual information, as a result of data confidentiality constraints. This limits their potential use without input from other sources, which provides further confirmation of the value of data fusion methods.

3 THE MIXING OF SURVEY MODES OR METHODS

An increasing number of surveys are based on complex protocols which combine several modes or methodologies to increase the overall response rate, improve the rate of coverage of the target population or enhance the quality of the data that are produced (Couper, 2000; Gunn, 2002; Dillman et al., 2001). But proposing several data collection modes or methods carries a risk. The collection of information from different sources may provide results that lack comparability (Dillman and Christian, 2003). The danger when databases are merged is that a sample selection bias will be created that will compromise the accuracy of explanatory models of travel behaviours. This type of selection bias has received considerable coverage in the literature, from both theoretical and empirical standpoints (Winship and Mare, 1992), but as yet little attention has been paid to it with regard to transport surveys.

3.1 Sample selection bias

It has been known since the 1950s that estimating an equation on the basis of a subsample selected from the population may result in biases (Roy, 1951). However, the first econometric exploration of the consequences of such sample selection was Heckman’s work in 1974. The standard example is the estimation of salary from an analysis of working women alone: the estimate is biased because the decision to work involves a trade-off in which the salary an individual can potentially earn plays a role. Since then, many papers have highlighted the importance of selection bias in human and social science surveys (Maddala, 1986). Noteworthy examples are the model for migration in the USA analyzed by Nakosteen and Zimmer (1980), and that for female employment rates analyzed by Mroz (1987). The most frequent use of self-selection models is for the evaluation of treatment or training programmes.

In practice, selection bias has two sources (Heckman, 1990). It results either from respondent self-selection or from a selection decision by the study managers. When mixed survey modes are used, individuals choose to belong to one group or another or only respond if the proposed medium suits them. The responses are therefore not comparable, because the sample is no longer random and the presence of respondents is determined by external factors which may also affect the variable of interest in the studied model. It is highly likely that the socioeconomic characteristics and the travel behaviours of the individuals who respond using the Internet are different from those of the individuals who respond to a face-to-face interview (Resource Systems Group, 2002; Lozar Manfreda and Vehovar, 2002).

The Laboratoire d’Economie des Transports has conducted an Internet survey of nonrespondents to the 2006 Lyon face-to-face household travel survey (Bayart and Bonnel, 2007), that is to say individuals who refused to allow an interviewer into their home or who could not be contacted during the first wave of interviews. The data from this survey highlight the problem of self-selection, with the nonrespondents to the standard face-to-face survey choosing whether or not to fill in the Internet questionnaire. We shall illustrate this by proposing an explanatory model for travel. More precisely, we shall analyze the average number of trips per person. A comparison between the results of the two surveys shows that the Internet respondents travelled less than the individuals who responded to the face-to-face survey.

Let us consider an equation that permits an analysis of the effect of survey mode on an individual’s average number of daily trips:

$$Y_i = \beta_i X_i + \alpha_i I_i + u_i \qquad (1)$$

where Yi is the average number of trips made by individuals (dependent variable), Xi is a vector of explanatory variables and Ii is a dummy variable that states whether the individual responded by Internet. A question which arises is whether the coefficient αi measures the real impact on daily travel of the choice of responding on the Internet. The answer to this question is affirmative if individuals who decide to respond on the Internet would have reported the same number of trips if they had responded in the face-to-face situation. However, the variable I cannot be considered as exogenous in this model, as the contacted individuals chose whether or not to respond by Internet. Respondent self-selection must be corrected during least squares regression in order to obtain unbiased estimates of the coefficients. By using the two-stage estimation method developed by Heckman in 1979, our explanatory model for travel can be formalized as follows for each individual i:

$$Y_{1i} = \beta_{1i} X_{1i} + u_{1i}, \quad \text{for Internet respondents} \qquad (2)$$

$$Y_{2i} = \beta_{2i} X_{2i} + u_{2i}, \quad \text{for face-to-face respondents} \qquad (3)$$

where Y1i and Y2i are the average numbers of trips made by the individuals in each group, X1i and X2i are two vectors of independent or explanatory variables for travel, and u1i and u2i are two error terms that are assumed to be normal and which take account of the unobserved forces possibly influencing results. We therefore estimate two models, one using the subsample of Internet respondents and one using the subsample of face-to-face respondents.
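To see why the coefficient on the Internet dummy in equation (1) cannot be read directly as a mode effect, the following Python sketch simulates self-selection on purely synthetic data. It is an illustration under stated assumptions, not the model estimated on the Lyon data: the base level of 3.5 trips, the correlation rho between the unobserved travel factor and the unobserved propensity to answer by Internet, and the function name are all hypothetical. The true mode effect is set to zero, so any gap between the two groups comes from self-selection alone.

```python
import random
import statistics

def naive_mode_gap(n=20000, rho=-0.6, true_mode_effect=0.0, seed=1):
    """Mean trips of Internet respondents minus mean trips of face-to-face ones.

    u: unobserved factor affecting the number of trips (standard normal).
    v: unobserved propensity to answer by Internet, with corr(u, v) = rho.
    Trip values are stylized and may occasionally be negative.
    """
    rng = random.Random(seed)
    internet, face_to_face = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        v = rho * u + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        chooses_internet = v > 0           # self-selection, not random assignment
        trips = 3.5 + true_mode_effect * chooses_internet + u
        (internet if chooses_internet else face_to_face).append(trips)
    return statistics.mean(internet) - statistics.mean(face_to_face)

gap_selected = naive_mode_gap(rho=-0.6)    # selection correlated with travel
gap_random = naive_mode_gap(rho=0.0)       # selection unrelated to travel
print(round(gap_selected, 2), round(gap_random, 2))
```

Under these assumptions the expected gap with rho = -0.6 is about -0.96 trips although the mode itself has no effect, while with rho = 0 the gap is close to zero. A naive comparison of the two groups would therefore attribute to the survey mode a difference that is entirely due to who chooses to respond by Internet, which is the bias that the two-stage procedure is intended to correct.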