A well-known logistic model for ranking data – implementation as a generalized linear model
Key words: Discrete choice model, Thurstonian model, generlized linear model.
H.P.Piepho , 1 Sep 2008
Many types of experiments and surveys in agriculture yield ranked data, where a set of choices (treatments etc.) is ranked by the same subject. For example, a household may be asked to rank a set of four alternative objectives for a specific income-generating activity. Or a set of six varieties tested in a complete block design with four blocks is ranked separately in each block regarding disease severity.
A simple model for the observed ranks postulates an underlying utility model. Let be the set of Ai alternatives available for unit i and a linear predictor for unit I and alternative a. The so-called utilities take the form
,
where is a measurement error associated with utilities . If has density
,
Let be the alternative given rank k and the ranking of unit i. Under the Gumbel model for measurement error, the probability distribution of observed rankings has a closed form (Luce, 1959):
(1)
The model is also denoted as “exploded logit” since the ranking probability is written as a product of first choice probabilities for successive remaining alternatives (Skrondal and Rabe-Hesketh, 2003), i.e., rankings can be assumed to be obtained successively such that the best choice is selected first, then the second best among the remaining choices, etc. At the k-th successive selection step the contribution to (1) takes the form of a multinomial probability with sample size one and number of categories determined by the remaining choices.
To illustrate, consider the case of Ai = 3, and assume that the observed ranking is . At the first selection step for the best choice, the multinomial likelihood for the first choice is
,(2)
where
,
, and
.
This is a multinomial likelihood with three categories and sample size equal to one. Now conditioning on the first choice , there are two remaining second choices. The conditional binomial likelihood for the second choice is
,(3)
where
and
.
Again, the sample size for this binomial is one. The resulting likelihood for the i-th sample is
(4).
in accordance with eq. (1). It emerges that the likelihood can be written as a product of successive conditional multinomial likelihoods.
The fact that for each selection step we have a (conditional) multinomial likelihood with sample size equal to onecan be exploited in fitting the model as a generalized linear model (McCullagh and Nelder, 1989). For this purpose, the data need to be exploded such that for each selection step a conditional multinomial likelihood can be fitted as described before. To illustrate, we continue the example . The exploded pseudo-data yis as follows:
sample rankchoicey
1110
1121
1130
1210
1231
Now for each rank, except the last, a (conditional) multinomial model needs to be fitted. Specifically, for the first rank a trinomial model is fitted and for the second a conditional binomial model is fitted. For each rank this is achieved by defining a record for each category for the multinomial in question.
Finally, we may exploit that, up to a constant not depending on parameters, the kernel of a multinomial likelihood is equivalent that of a Poisson model (McCullagh und Nelder, 1989: 209). Thus, we may analyse by a log-linear model that takes pseudo-data y to be Poisson random variables. In order to make this a multinomial for each rank and sample, a common intercept must be fitted to each sample-by-rank combination. Thus, the linear predictor is
sample.rank + choice(5)
Any block or treatment structure is reflected by giving a structured model for sample and choice, respectively, giving great flexibility to account for complex designs and research questions.It is important to realize, however, that only within sample effects can be estimated based on the model.
The method does not allow for tied ranks. Though a modification is possible (Skrondal and Rabe-Hesketh, 2003), details are not discussed here.
Example (kindly provided by Augustine Takyi, Institute 480, 1 September 2008)
Respondents were 120 household in six communities in a project regionin Southern Ethiopiareceiving support by two NGO.
The respondents were asked to rank their reasons for keeping goats according to the following predetermined reasons;
GKSLA – goat keeping, sale of live animals
GKHCMt – goat keeping, home consumption of meat
GKHCMk – goat keeping, home consumption of milk
GKCER – goat keeping, ceremonies
GKSLA / GKHCMt / GKHCMk / GKCER2 / 4 / 1 / 3
2 / 4 / 1 / 3
2 / 3 / 1 / 4
2 / 3 / 1 / 4
… 116 more records
%let A=4; /*number of choices*/
data a;
array _choice _choice1-_choice&a;
input GKSLAGKHCMtGKHCMkGKCER;
sample = _N_;
_choice1=GKSLA; label _choice1='GKSLA ';
_choice2=GKHCMt; label _choice2='GKHCMt';
_choice3=GKHCMk; label _choice3='GKHCMk';
_choice4=GKCER; label _choice4='GKCER ';
do i=1to &a;
rank=i;
do j=1to &a;
choice=label(_choice[j]);
if _choice[j]=rank thendo; y=1; output; end;
if _choice[j]>rank thendo; y=0; output; end;
end;
end;
datalines;
2413
2413
2314
2314
2314
2314
2314
2413
2314
2413
2413
1324
1423
1423
2314
1324
1324
1324
2314
2413
2314
1423
1423
3214
2314
2413
3412
2314
2314
1324
2314
2413
3214
3214
4312
3214
2413
4312
2413
2314
3214
3214
1423
3214
3214
3214
2413
2314
2413
4321
2314
3214
2413
2413
2413
1324
2314
2413
2314
2413
2314
2413
1423
2314
2314
1423
2413
2314
2314
2413
1324
2314
2413
2413
1324
1423
2413
1423
2413
1423
2314
2413
2413
2413
2413
1324
2413
2314
1423
2413
2314
1423
1423
2314
1324
2413
2413
2314
1423
1324
1423
1324
1324
2413
2314
2314
2413
1423
2413
2413
1234
2413
1324
2413
2314
2314
2413
1324
2314
2413
;
procgenmoddata=a;
odsoutput lsmeans=lsmeans lsmeandiffs=diffs;
class choice sample rank;
model y=sample*rank choice/link=log dist=poisson type1;
lsmeans choice/pdiff;
%mult(trt=choice, p=probchisq);
run;
The %mult macro can be downloaded at
Output:
LR Statistics For Type 1 Analysis
Chi-
Source Deviance DF Square Pr > ChiSq
Intercept 879.6391
sample*rank 762.7329 479 116.91 1.0000
choice 421.3834 3 341.35 <.0001
Least Squares Means
Standard Chi-
Effect choice Estimate Error DF Square Pr > ChiSq
choice GKCER -2.0480 0.1618 1 160.29 <.0001
choice GKHCMk 1.7123 0.1819 1 88.63 <.0001
choice GKHCMt -1.9923 0.1615 1 152.09 <.0001
choice GKSLA 0.5722 0.1542 1 13.77 0.0002
Differences of Least Squares Means
Standard Chi-
Effect choice _choice Estimate Error DF Square Pr > ChiSq
choice GKCER GKHCMk -3.7604 0.3027 1 154.33 <.0001
choice GKCER GKHCMt -0.0557 0.1804 1 0.10 0.7575
choice GKCER GKSLA -2.6202 0.2701 1 94.11 <.0001
choice GKHCMk GKHCMt 3.7047 0.3023 1 150.13 <.0001
choice GKHCMk GKSLA 1.1401 0.2044 1 31.13 <.0001
choice GKHCMt GKSLA -2.5645 0.2700 1 90.24 <.0001
TRT BY LEVEL BY2 LEVEL2 BY3 LEVEL3 LABEL LSMEAN G
choice ...... GKCER -2.048012 . a .
GKHCMk 1.7123459 . . c
GKHCMt -1.992305 . a .
GKSLA 0.5722071 b . .
The best utility was found for GKHCMk, which differed significantly from GKSLA, which in turn differed significantly from the remaining choices GKCER and GKHCMt.
For comparison, the Friedman-test was computed for the same data. The chi-squared value was 890.89, which is rather larger than that of the multinomial logit analysis (341.35).
Choice Mean rank
GKCER 3.4583
GKHCMk 1.2833
GKHCMt 3.3833
GKSLA 1.8750
Mean ranks agree with ranking by estimated utilities.
Extended example
The 120 households were stratified according to 6 community (kebele), which were receiving support from two different NGO.To illustrate the flexibility of the method, consider the effect of NGO and its interaction with CHOICE. The linear model is
sample.rank + choice + ngo + choice.ngo(6)
This is fitted as follows:
procgenmoddata=a;
odsoutput lsmeans=lsmeans lsmeandiffs=diffs;
class choice sample rank ngo;
model y=sample*rank choice|ngo/link=log dist=poisson type1;
run;
Output:
LR Statistics For Type 1 Analysis
Chi-
Source Deviance DF Square Pr > ChiSq
Intercept 879.6391
sample*rank 762.7329 479 116.91 1.0000
choice 421.3834 3 341.35 <.0001
NGO 421.3834 0 0.00 .
choice*NGO 394.5309 3 26.85 <.0001
The significant interaction indicates that there are significant differences in which households rank alternative choices depending on the supporting NGO. The main effect for NGO provides no test here, because ranking is within households, and thus within NGOs, so the effect is not estimable: it is a between-sample effect. In fact, we could as well drop the NGO main effect without altering the tests for other effects.
To further test whether there was residual variation between communities (kebele) within NGO, the following model was used (see Piepho et al., 2003 for syntax):
sample.rank + choice + (ngo/kebele) + choice.(ngo/kebele)(7a)
Dropping effects for NGO and NGO.kebele, which are between-sample effects and so are not estimable, the model resolves as
sample.rank + choice + choice.ngo + choice.ngo.kebele(7b)
Implementation in MIXED:
procgenmoddata=a;
class choice sample rank ngo kebele;
model y=sample*rank choice choice*ngo choice*kebele*ngo/link=log dist=poisson type1;
run;
Output:
LR Statistics For Type 1 Analysis
Chi-
Source Deviance DF Square Pr > ChiSq
Intercept 879.6391
sample*rank 762.7329 479 116.91 1.0000
choice 421.3834 3 341.35 <.0001
choice*NGO 394.5309 3 26.85 <.0001
choice*NGO*kebele 367.5717 12 26.96 0.0078
There is significant residual variation among communities within NGOs.
References
Luce RD1959 Individual choice behavior. Wiley, New York.
McCullagh P, Nelder JA 1989 Generalized linear models. Chapman & Hall, London.
Piepho HP, Büchse A, Emrich K 2003 A hitchhiker's guide to the mixed model analysis of randomized experiments. Journal of Agronomy and Crop Science189, 310-322.
Skrondal A, Rabe-Hesketh S 2003 Multi-level logistic regression for polytomous data and rankings. Psychometrika 68, 267-287
1
A simpler implementation using the PHREG procedure
13 April 2013
As pointed out by Allison and Christakis (1994), the exploded logit model can also be implemented using procedures for proportional hazards (PH) regression, exploiting identities in the likelihoods between the PH model and the exploded logit model. The authors used the PHREG procedure of SAS to illustrate their procedure. The approach is also considered in Allison (1999).
When developing the approach using GENMOD in 2008, our intention was to find an easy way to compute all pairwise comparisons between levels of a factor or effect. At the time, the PHREG procedure did not have an LSMEANS or a CLASS statement, so PHREG was not a convenient option at the time. These statements have now been added to the procedure. Also, the means can be output using the ODS facility, so letter displays for all pairwise comparisons can computed using the %MULT macro (Piepho, 2012).
Two technicalities must be observed when using the PHREG procedure:
(1) For PHREG to compute lsmeans, the param=glm option on the CLASS statement must be used. This invokes “GLM parameterization”.
(2) The means option must be added in the LSMEANS statement.
Example (kindly provided by Augustine Takyi, Institute 480, 1 September 2008)
Using the example provided by Augustine Takyi, we get exactly the same results with PHREG as with the GENMOD implementation. The only difference is that the means are shifted, because the mean for the last factor level is set to zero. The means themselves have no direct interpretation because these refer to a latent scale; it is only the differences between means that have a direct interpretation as log odds under the exploded logit model. And these differences are identical under both implementations.
The PHREG Procedure
choice Least Squares Means
Standard
choice Estimate Error z Value Pr > |z|
GKCER -2.6201 0.2701 -9.70 <.0001
GKHCMk 1.1401 0.2044 5.58 <.0001
GKHCMt -2.5644 0.2699 -9.50 <.0001
GKSLA 0 0 . .
Differences of choice Least Squares Means
Standard
choice _choice Estimate Error z Value Pr > |z|
GKCER GKHCMk -3.7602 0.3027 -12.42 <.0001
GKCER GKHCMt -0.05571 0.1804 -0.31 0.7575
GKCER GKSLA -2.6201 0.2701 -9.70 <.0001
GKHCMk GKHCMt 3.7045 0.3023 12.25 <.0001
GKHCMk GKSLA 1.1401 0.2044 5.58 <.0001
GKHCMt GKSLA -2.5644 0.2699 -9.50 <.0001
trt label lsmean letters
choice GKCER -2.620099 a
GKHCMk 1.1401345 c
GKHCMt -2.564392 a
GKSLA 0 b
References
Allison, P.D.(1999): Logistic regression using the SAS System. Theory and application. SAS Institute Inc., Cary.
Allison P.D., Christakis, N.A.(1994): Logit models for sets of ranked items. Sociological Methodology24, 199-228.
Piepho, H.P. (2012): A SAS macro for generating letter displays of pairwise mean comparisons. Communications in Biometry and Crop Science7, 4-13.
SAS code
data b;
input GKSLAGKHCMtGKHCMkGKCER;
sample = _N_;
y=GKSLA; choice='GKSLA ';output;
y=GKHCMt; choice='GKHCMt';output;
y=GKHCMk; choice='GKHCMk';output;
y=GKCER; choice='GKCER ';output;
datalines;
2413
2413
2314
2314
2314
2314
2314
2413
2314
2413
2413
1324
1423
1423
2314
1324
1324
1324
2314
2413
2314
1423
1423
3214
2314
2413
3412
2314
2314
1324
2314
2413
3214
3214
4312
3214
2413
4312
2413
2314
3214
3214
1423
3214
3214
3214
2413
2314
2413
4321
2314
3214
2413
2413
2413
1324
2314
2413
2314
2413
2314
2413
1423
2314
2314
1423
2413
2314
2314
2413
1324
2314
2413
2413
1324
1423
2413
1423
2413
1423
2314
2413
2413
2413
2413
1324
2413
2314
1423
2413
2314
1423
1423
2314
1324
2413
2413
2314
1423
1324
1423
1324
1324
2413
2314
2314
2413
1423
2413
2413
1234
2413
1324
2413
2314
2314
2413
1324
2314
2413
;
/*Added on 13 APR 2013 - PHREG now has a CLASS and LSMEANS statement.
This code, which follows Allison’s SAS for logistic regression book
and my GENMOD code produce exactly the same results.*/
procphregdata=b;
odsoutput lsmeans=lsmeans diffs=diffs;
class sample choice/param=glm;
/*The param=glm option is needed for glm parameterization.
Without this, lsmeans are not computed*/
model y=choice;
strata sample;
lsmeans choice/pdiff means;
/*The means option must be added, otherwise the lsmeans
will not be computed, but only the pairwise differences*/
%mult(trt=choice, p=probz, descending=0);
run;
1