THEME [SST.2010.1.3-1.]
[Transport modelling for policy impact assessments]
Grant agreement for: Coordination and support action
Acronym: Transtools 3
Full title: “Research and development of the European Transport Network Model – Transtools Version 3”
Proposal/Contract no.: MOVE/FP7/266182/TRANSTOOLS 3
Start date: 1st March 2011
Duration: 36 months
Milestone 79 - “Sampling alternatives for choice model estimation in practice”
Document number: TT3_WP8_COM_MS79_Sampling alternatives for choice model estimation_0b
Workpackage: WP8
Deliverable nature: Report
Dissemination level: N/A
Lead beneficiary: ITS Leeds, beneficiary 2, Andrew Daly.
Due date of deliverable: March, 2012
Date of preparation of deliverable: 30.12.2011
Date of last change: 17.03.2012
Date of approval by Commission: N/A
Abstract:
The note discusses practical approaches for sampling alternatives in the estimation of GEV models, which include the tree-nested models that will be used for the passenger and freight demand modelling. It covers both procedures for actual sampling of subsets of alternatives and the process of estimation based on those subsets, pointing out the deficiencies in the academic literature, which is still being developed. It concludes that, given the moderate numbers of alternatives to be used in this study, the efficiency of modern software and hardware, and the difficulties and limitations of the methods, sampling of alternatives should not be used for model estimation in this work.
Keywords:
logit models, sampling alternatives, model estimation
Author(s):
Daly, Andrew
Disclaimer:
The contents of this report reflect the views of the author(s) and do not necessarily reflect the official views or policy of the European Union. The European Union is not liable for any use that may be made of the information contained in the report.
The report is not an official deliverable under the TT3 project and has not been reviewed or approved by the Commission. The report is a working document of the Consortium.
Sampling alternatives for choice model estimation in practice.
Report version 0c
2012
By
Andrew Daly
Copyright: / Reproduction of this publication in whole or in part must include the customary bibliographic citation, including author attribution, report title, etc.
Published by: / Department of Transport, Bygningstorvet 116 Vest, DK-2800 Kgs. Lyngby, Denmark
Request report from: /
Content
1. Introduction
2. McFadden (1978): Sampling alternatives for MNL
3. Sampling in GEV models
4. Conclusion and Recommendation
Sampling alternatives for choice model estimation in practice
Summary
The note begins by pointing out that sampling alternatives introduces a number of new issues involving the specification of sampling procedures and the calculation of errors. With moderate numbers of alternatives it is possible to make estimations efficiently without sampling.
For multinomial logit, sampling procedures due to McFadden (1978) are known to be unbiased when used in estimation, though the increased error introduced by these procedures is unknown. Simple sampling procedures are consistent with the McFadden requirements, though the sampling criteria would have to be specified for each study.
For more general models, as will be needed for this study, the best sampling procedures are unknown, though work by Guevara and Ben-Akiva (2010) is promising.
Because of the theoretical and practical complications, and the possibility of efficient work without sampling, it is recommended that no sampling be undertaken for model estimation in this study. For model application, the methods are even less well developed and we similarly recommend that no sampling be undertaken.
1. Introduction
1.1 Context of the milestone
For the TransTools3 (TT3) project it is necessary to estimate choice models with substantial numbers of spatial alternatives. Such models can make heavy demands on computer resources, particularly run time, but also potentially the storage requirement. An option to reduce these demands is to look for some way of reducing the number of alternatives actually used in model estimation by sampling. The intention of the present note is to set out the practical procedures available for sampling and recommend a procedure for this project.
For the purposes of the current note, we shall assume that the observations can be treated as independent of each other. Since TT3 is working with RP observations, this assumption seems reasonable.
1.2 Preliminary remark
Sampling of alternatives should only be undertaken when it is clearly necessary. It always increases the error in the parameter estimates, and this additional error is not straightforward to calculate. Below, we begin to analyse the nature of the error specifically introduced by sampling of alternatives in a limited range of cases. This additional error is usually ignored in practice, but this is incorrect.
Moreover, in many cases the use of sampling imposes a duty on the analyst to make calculations of correction terms, sometimes simple, sometimes complicated, that need checking. This additional burden is an important practical reason to avoid sampling unless it is necessary.
Finally, in complex models the guarantee of consistency of estimation is not always present. The main aim of this note is to discuss the practical application of methods that achieve consistency in a wider range of cases.
Nerella and Bhat (2004) give some indications of the magnitude of the error, both for MNL and for more complicated models. For MNL, they give some guidelines on minimum sample size to achieve stability, but that is with a simple sampling strategy in simulated data and so not likely to be transferable to real data and more sophisticated sampling. In particular, the efficiency of sampling can vary substantially between contexts and sampling procedures. For mixed logit models, the results are less accurate, but in this case the estimates are not necessarily consistent and would not usually be used in practice. The current situation appears to be that sampling alternatives in MNL causes noise, but we would not be able to state in advance what that noise would be in a specific situation; sampling alternatives in mixed logit causes more noise and bias may also be present, but we are not able to say how much of either.
With efficient software it is possible to estimate MNL and tree-nested models with quite large numbers of alternatives. RAND Europe routinely uses ALOGIT to estimate models with tens of thousands of alternatives, dozens of parameters and tens of thousands of observations[1]. Of course runs of this magnitude are time-consuming but they remain entirely feasible for practical analysis work. ALOGIT is currently not able to estimate cross-nested models or models based on repeated observations for individuals (panel data); but these features are not necessary for the current TT3 work.
In each study, therefore, a decision needs to be taken about whether or not sampling is to be undertaken. When sampling is undertaken, we should investigate how much error this is likely to introduce. Procedures that may introduce bias should not be used without careful testing.
1.3 Reading guidance
In this note, we discuss first the sampling procedures appropriate for MNL models and some of the basic properties of sampling relating to destination choice. We then turn to sampling for GEV models.
2. McFadden (1978): Sampling alternatives for MNL
In his 1978 paper, McFadden set out the Positive Conditioning (PC) property under which consistent estimates of an MNL model can be obtained with alternative sampling. Specifically, he showed that asymptotically consistent estimates of model parameters can be obtained if we maximise a modified likelihood function, with a contribution for each individual of
$$\frac{\exp\left(V_c + \ln \pi(D \mid c)\right)}{\sum_{j \in D} \exp\left(V_j + \ln \pi(D \mid j)\right)} \qquad (1)$$
where $V_j$ is the systematic part of utility for alternative $j$;
$c$ is the chosen alternative;
$D$ is the sampled set of alternatives for this individual, which is a subset of the set of all available alternatives $C$; and
$\pi(D \mid j)$ is the probability of sampling $D$, if $j$ is chosen;
we must have $\pi(D \mid j) > 0$ for all $j \in D$ (which is the PC property).
Clearly, if $\pi(D \mid j)$ is the same for all $j \in D$, then it will cancel out in the equation above. This is the Uniform Conditioning property. The simplicity of Uniform Conditioning is attractive, but in many practical cases some alternatives are much more important than others, in the sense of being much more likely to be chosen, so that the general PC approach with unequal values of $\pi(D \mid j)$ is more efficient. Some intuition on how ‘important’ alternatives should be identified is given below.
Note that it is essential that the chosen alternative $c$ is included in $D$.
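The modified likelihood contribution can be illustrated with a minimal sketch. It assumes the correction enters as the logarithm of the conditional sampling probability added to each sampled alternative's utility. The function name and the dictionary-based inputs are hypothetical choices made for illustration, not part of the note.

```python
import math


def pc_likelihood_contribution(V, log_pi, chosen):
    """Modified likelihood contribution for one individual (illustrative sketch).

    V       : dict, alternative -> systematic utility, for the sampled set only
    log_pi  : dict, alternative j -> ln of the probability of drawing this
              sampled set if j had been the chosen alternative
    chosen  : the chosen alternative, which must be in the sampled set
    """
    if chosen not in V:
        raise ValueError("the chosen alternative must be in the sampled set")
    corrected = {j: V[j] + log_pi[j] for j in V}
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(corrected.values())
    denom = sum(math.exp(u - m) for u in corrected.values())
    return math.exp(corrected[chosen] - m) / denom
```

If `log_pi` takes the same value for every sampled alternative, it cancels and the function returns the plain logit probability over the sampled set, which is the Uniform Conditioning case.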
An important point in practice is that McFadden’s PC theorem requires the assumption that “the choice model is multinomial logit”. In practice, this will often not be the case and the consequence is then that estimation using the amended likelihood function may not give the same result as when using the full model.[2] However, in this case, neither the base MNL nor the sampled version is correct and sampling does not necessarily make matters worse.
2.1 Practical sampling strategies for PC
A simple practical approach for PC is to use independent sampling, where each unchosen alternative $j$ is sampled with probability $q_j$. Alternatively, we can sample a fixed number of times from the full choice set $C$, with replacement, giving each alternative a probability $q_j$ of being sampled at each draw, and then delete the duplicate alternatives. In each case (Ben-Akiva and Lerman, 1985, equations 9.22 and 9.23) these strategies yield
$$\pi(D \mid j) = \frac{K(D)}{q_j} \qquad (2)$$
with $K(D)$ independent of $j$. Examining the likelihood equation (1), we see that $K(D)$ cancels out and we are left with
$$\frac{\exp\left(V_c - \ln q_c\right)}{\sum_{j \in D} \exp\left(V_j - \ln q_j\right)} \qquad (3)$$
With independent sampling the expected set size is approximately $\sum_{j \in C} q_j$ but there is quite likely to be considerable variation around this number. When sampling with replacement the probabilities $q_j$ must sum to 1, of course, but the advantage claimed for this method is that the size of the set varies less than with independent sampling; the expected set size is more complicated to determine. In each case we can adjust the sample rate to obtain a suitable balance between sampling error and computational cost.
A further approach, which has been used in some practical studies, is to sample without replacement. This gives no variation in the sampled set size, but the calculation of $\pi(D \mid j)$ is complicated and for this reason the approach is not generally recommended.
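The two recommended strategies can be sketched as follows. This is a sketch under stated assumptions, not the procedure used in any particular study: the function names are hypothetical, `q` maps each alternative to its sampling probability, and the correction is applied as the utility minus the log of the sampling probability, as the text states for both strategies.

```python
import math
import random


def independent_sample(q, chosen, rng):
    """Keep each unchosen alternative j with probability q[j]; always keep the chosen one."""
    return sorted({chosen} | {j for j in q if j != chosen and rng.random() < q[j]})


def replacement_sample(q, chosen, draws, rng):
    """Draw `draws` times with replacement using weights q, drop duplicates, add the chosen one."""
    alts = list(q)
    picks = rng.choices(alts, weights=[q[j] for j in alts], k=draws)
    return sorted(set(picks) | {chosen})


def corrected_contribution(V, q, D, chosen):
    """Logit probability over the sampled set D with utilities V[j] - ln q[j]."""
    u = {j: V[j] - math.log(q[j]) for j in D}
    m = max(u.values())  # stabilise the exponentials
    return math.exp(u[chosen] - m) / sum(math.exp(x - m) for x in u.values())
```

A typical call would draw a set with `independent_sample(q, chosen, random.Random(seed))` for each observation and pass the result to `corrected_contribution`. Passing an explicit random generator keeps the sampled sets reproducible between estimation runs.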
Ben-Akiva and Lerman imply that replacement sampling is likely to be preferable to independent sampling, because of the anticipated reduced variation of the set size. Variation of the set size is, however, not obviously directly connected with the accuracy of estimation; some aspects of these issues can be tested quite readily by simulation, as illustrated in Figure 1. Details of the simulation are given in the Appendix.
Figure 1: Results of simulated sampling of choice sets
The upper lines in the graph show the relative cost of the three approaches, specifically giving the ratio of selected destinations to the weight of those destinations, for a selected reasonable sampling rate (about 16% of the destinations). We see that the differences between the procedures are very small and that all the procedures become more efficient as the parameter becomes more strongly negative (equivalently, the study area size increases relative to expected trip lengths). The lower lines show the coefficient of variation in the sample sizes. We see that for small study areas replacement sampling gives less variation, as expected, than independent sampling, but that the difference decreases substantially as the study area size increases and for the largest area size independent sampling gives a less variable set size.
It appears that independent sampling is a simple and efficient approach. It is also quicker to apply than replacement sampling, though neither of these approaches is time-consuming.
It is also relevant to note that PC sampling is more important in the context of destination choice, as needed for these projects, than for other contexts such as residential choice, where the link to travel accessibility from a specific origin is less strong. The strength of the link also affects the balance of choice between independent and replacement sampling, i.e. results from residential choice studies cannot necessarily be transferred straightforwardly to destination choice studies.[3]
In practical studies, a stratified sampling approach has also been used. This approach would sample a fixed number of alternatives from each of several strata, adding the chosen alternative if it is not otherwise sampled. The efficiency of such an approach has not been tested in the current work.
2.2 Sampling error in PC sampling
A simple approach to assessing sampling error is to calculate the ‘fraction of choice’ $F$ that is captured by the sampled set $D$:
$$F = \frac{\sum_{j \in D} \exp(V_j)}{\sum_{j \in C} \exp(V_j)} \qquad (4)$$
This approach is taken by Miller et al. (2007) in the context of sampling alternatives for application and is implicitly the measure used to indicate efficiency in the graph above. It is intuitively clear that as the fraction increases then the approximation will generally improve. After all, when the fraction reaches 1 there is no approximation. But it is a rough measure.
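As a minimal sketch, the fraction of choice can be computed as below, assuming MNL choice probabilities over the full set of alternatives. The function name and inputs are hypothetical.

```python
import math


def fraction_of_choice(V, D):
    """Share of total MNL choice probability carried by the sampled alternatives D.

    V : dict, alternative -> systematic utility, for the FULL choice set
    D : iterable of sampled alternatives (a subset of the keys of V)
    """
    m = max(V.values())  # stabilise the exponentials
    total = sum(math.exp(v - m) for v in V.values())
    return sum(math.exp(V[j] - m) for j in D) / total
```

The measure needs the utilities of all alternatives, so it is a diagnostic to be run on a subsample or after estimation rather than a way of saving computation.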
Suppose we want to estimate $S = \sum_{j \in C} \exp(V_j)$ by $\hat{S} = \sum_{j \in D} \exp(V_j)/q_j$, where $D$ is the set of $j$ that have been sampled. If $q_j$ is the sampling probability, the variance of $\hat{S}$ is given by
$$\operatorname{var}(\hat{S}) = \sum_{j \in C} \frac{1 - q_j}{q_j} \exp(2 V_j) \qquad (5)$$
It is quite easy to see that this is minimised for a given $\sum_j q_j$, which is related to calculation cost by the efficiency of the sampling procedure, when $q_j \propto \exp(V_j)$ (see also Hammersley and Handscomb, 1964). This result gives a strong indication that the intuitive attribution of sampling probability as approximately proportional to ‘importance’, measured by $\exp(V_j)$, is reasonable[4]. But the formula for variance is not a simple function of the fraction of choice (4).
The calculation above applies for independent sampling. Examining the variation in $\hat{S}$ in the simulation runs of the previous section, we see that the results for replacement sampling are similar to those for independent sampling, with lower variation for small area sizes and larger variation for larger areas. We might expect that a function like (5) could be developed for replacement sampling, though it would not necessarily be simple.
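The claim that proportional sampling minimises the variance can be checked numerically. The sketch assumes independent sampling and an estimator that weights each sampled term by the inverse of its sampling probability. Here `w[j]` stands for the exponentiated utility of alternative `j`, and the function names are hypothetical.

```python
def ht_variance(w, q):
    """Variance of sum_{j sampled} w[j]/q[j] when j is kept independently with probability q[j]."""
    return sum(w[j] ** 2 * (1.0 - q[j]) / q[j] for j in w)


def proportional_rates(w, expected_size):
    """Sampling rates proportional to w, scaled so that they sum to expected_size."""
    total = sum(w.values())
    q = {j: expected_size * w[j] / total for j in w}
    if max(q.values()) > 1.0:
        # A full treatment would cap such rates at 1 and rescale the rest.
        raise ValueError("a rate exceeds 1; reduce expected_size")
    return q


def uniform_rates(w, expected_size):
    """Equal sampling rates with the same expected set size."""
    return {j: expected_size / len(w) for j in w}
```

For weights 1, 2, 3 and 4 and an expected set size of 2, the proportional rates give a variance of 20 against 30 for uniform rates, in line with the argument that sampling probability should follow ‘importance’.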
The error in the likelihood-contribution calculation (3), to which contributes the sampling error, can be estimated as
(6)
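In estimation of these models, we maximise the likelihood over all the observations, i.e. summing the logarithm of each individual’s contribution (3) to obtain the overall log-likelihood. Then the first-order conditions for optimality are, for each parameter $\beta_k$ and assuming that the utility functions are linear in parameters $\beta$ and observed data $x$:
$$\sum_n \left[ x_{n c_n k} - \frac{\sum_{j \in D_n} x_{njk} \exp\left(V_{nj} - \ln q_{nj}\right)}{\sum_{j \in D_n} \exp\left(V_{nj} - \ln q_{nj}\right)} \right] = 0 \qquad (7)$$
where $n$ indexes the individuals, $c_n$ is the alternative chosen by individual $n$ and $D_n$ is that individual’s sampled set.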
Examining this result, we see that the sampling error occurs in the denominator, which has variance given by (5). The resulting estimation error is difficult to calculate, as it depends on the observed data and the way in which varies over the observations.
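The first-order condition can be made concrete with a sketch of one individual's contribution to the score, assuming linear-in-parameters utilities and the correction by the log of the sampling probability. All names are hypothetical: `x[j]` is the list of observed variables for alternative `j` and `q[j]` its sampling probability.

```python
import math


def sampled_probs(beta, x, q, D):
    """Corrected logit probabilities over the sampled set D."""
    u = {j: sum(b * xk for b, xk in zip(beta, x[j])) - math.log(q[j]) for j in D}
    m = max(u.values())  # stabilise the exponentials
    e = {j: math.exp(u[j] - m) for j in D}
    s = sum(e.values())
    return {j: e[j] / s for j in D}


def log_contribution(beta, x, q, D, chosen):
    """Log of the corrected likelihood contribution for one individual."""
    return math.log(sampled_probs(beta, x, q, D)[chosen])


def score(beta, x, q, D, chosen):
    """Gradient of log_contribution: chosen x minus the probability-weighted mean of x."""
    p = sampled_probs(beta, x, q, D)
    return [x[chosen][k] - sum(p[j] * x[j][k] for j in D) for k in range(len(beta))]
```

Summing `score` over individuals and setting the result to zero reproduces the first-order conditions. The sampled denominator inside `sampled_probs` is where the sampling error enters.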
Since the model is not properly specified and (as noted by Guevara and Ben-Akiva, 2010) we are not working with the true likelihood function, the usual error calculations do not apply. Guevara and Ben-Akiva show that the estimation error is given by the sandwich matrix and this should therefore be applied for models estimated using sampling.
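For reference, the general form of the sandwich (robust) covariance estimator is given below. The notation is introduced here for illustration and is not taken from the note: $\ell_n$ is the log of individual $n$'s likelihood contribution, $g_n$ its gradient and $\hat{\beta}$ the parameter estimate.

```latex
\hat{\Omega} = \hat{A}^{-1} \hat{B} \hat{A}^{-1},
\qquad
\hat{A} = -\sum_n \frac{\partial^2 \ell_n(\hat{\beta})}{\partial \beta \, \partial \beta'},
\qquad
\hat{B} = \sum_n g_n(\hat{\beta}) \, g_n(\hat{\beta})',
\qquad
g_n(\beta) = \frac{\partial \ell_n(\beta)}{\partial \beta}
```

When the likelihood is correctly specified, $\hat{A}$ and $\hat{B}$ estimate the same matrix and the expression reduces to the usual $\hat{A}^{-1}$. With sampling of alternatives the two generally differ, which is why the simpler calculation is not appropriate.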
3. Sampling in GEV models
The key references for procedures in GEV models (introduced by McFadden, 1978) are Lee and Waddell (2010), Guevara and Ben-Akiva (2010, also Guevara, 2010) and Bierlaire et al. (2008). Note that Frejinger et al. (2009) do not deal with models beyond MNL except through the ‘path size’ correction and that the PC correction is therefore sufficient for their work. Among these, we give priority to the work of Guevara and Ben-Akiva, but first we note briefly the other work that has been done.
3.1 Other literature
Koppelman and Garrow (2005) discuss the (important) issue of the choice-based sampling of observations and do not discuss sampling alternatives at all. One is at a loss to know why this paper is cited so frequently in the literature discussing sampling of alternatives.
Bierlaire et al. (2008) is also aimed chiefly at the issue of sampling observations, which is not directly related to the present problem. However, “for the sake of completeness”, they give some attention to sampling alternatives, deriving results that foreshadow somewhat the work of Guevara. However, Guevara’s work is more directly focussed on our topic of interest and we shall base our discussion on those publications.
Lee and Waddell (2010) claim to provide the first consistent estimator for tree logit with sampling of alternatives. The formula (their equation 5) is simple: the logsum used in the higher (unsampled) level is[5]
(8)
where is the sampling rate “which only applies to the sampled non-chosen alternatives”, so they apply a rate of 1 to the chosen alternative. The estimate of the logsum is therefore a function of the chosen alternative. When , i.e. the model is MNL, this is different from McFadden’s PC, so that it appears that the Lee and Waddell procedure is incorrect. Simple simulations confirm that a bias is introduced.
3.2 Guevara and Ben-Akiva work
Guevara and Ben-Akiva (2010, abbreviated as GBA)[6], which relates closely to a chapter of Guevara’s thesis (2010), gives the theorem that consistent estimation can be achieved by a correction of the logit utility function
$$\frac{\exp\left(V_c + \ln G_c(D) + \ln \pi(D \mid c)\right)}{\sum_{j \in D} \exp\left(V_j + \ln G_j(D) + \ln \pi(D \mid j)\right)} \qquad (9)$$
where $\pi(D \mid j)$ is the probability of selecting the reduced choice set $D$, given that $j$ is the chosen alternative; we note that this is reassuringly the standard McFadden PC correction;
$G_j(D)$ is the derivative with respect to its $j$th argument of the GEV[7] generating function $G$; here we note that it is calculated for the restricted choice set $D$.
In an MNL model, $\ln G_j = 0$ for all the alternatives, so that this term disappears from the function and we return to the standard McFadden MNL formulation. However, in more general GEV, such as nested logit, this term does not disappear. Ben-Akiva and Lerman (1985) show that equation (9) can be used (without sampling, i.e. without the $\ln \pi(D \mid j)$ term) to represent any GEV model, so that the GBA theorem using (9) represents an intuitive extension of both McFadden sampling and the Ben-Akiva/Lerman finding.
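The Ben-Akiva/Lerman representation can be illustrated without sampling for a two-level nested logit. The sketch below assumes a particular form of the generating function, stated in the docstring and normalised at the root. The function names, the `nests` mapping and the nest parameters `mu` are hypothetical. It does not implement the GBA estimator, which also has to approximate the derivative of the generating function from the sampled set.

```python
import math


def ln_G_j(V, nests, mu):
    """ln of dG/dy_j at y = exp(V), for G(y) = sum_m (sum_{k in m} y_k**mu[m])**(1/mu[m])."""
    out = {}
    for m, members in nests.items():
        log_nest_sum = math.log(sum(math.exp(mu[m] * V[k]) for k in members))
        for j in members:
            out[j] = (mu[m] - 1.0) * V[j] + (1.0 / mu[m] - 1.0) * log_nest_sum
    return out


def logit_with_correction(V, corr):
    """Plain logit formula applied to the utilities V[j] + corr[j]."""
    u = {j: V[j] + corr[j] for j in V}
    mx = max(u.values())  # stabilise the exponentials
    e = {j: math.exp(u[j] - mx) for j in u}
    s = sum(e.values())
    return {j: e[j] / s for j in e}


def nested_logit_direct(V, nests, mu):
    """Textbook nested logit: within-nest share times nest share."""
    nest_sum = {m: sum(math.exp(mu[m] * V[k]) for k in members)
                for m, members in nests.items()}
    denom = sum(s ** (1.0 / mu[m]) for m, s in nest_sum.items())
    return {j: (math.exp(mu[m] * V[j]) / nest_sum[m])
               * (nest_sum[m] ** (1.0 / mu[m]) / denom)
            for m, members in nests.items() for j in members}
```

Calling `logit_with_correction(V, ln_G_j(V, nests, mu))` gives the same probabilities as `nested_logit_direct(V, nests, mu)`. With every `mu[m]` equal to 1 the correction is zero for all alternatives, which is the MNL case described above.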