A well-known logistic model for ranking data – implementation as a generalized linear model

Key words: Discrete choice model, Thurstonian model, generlized linear model.

H.P.Piepho , 1 Sep 2008

Many types of experiments and surveys in agriculture yield ranked data, where a set of choices (treatments etc.) is ranked by the same subject. For example, a household may be asked to rank a set of four alternative objectives for a specific income-generating activity. Or a set of six varieties tested in a complete block design with four blocks is ranked separately in each block regarding disease severity.

A simple model for the observed ranks postulates an underlying utility model. Let be the set of Ai alternatives available for unit i and a linear predictor for unit I and alternative a. The so-called utilities take the form

,

where is a measurement error associated with utilities . If has density

,

Let be the alternative given rank k and the ranking of unit i. Under the Gumbel model for measurement error, the probability distribution of observed rankings has a closed form (Luce, 1959):

(1)

The model is also denoted as “exploded logit” since the ranking probability is written as a product of first choice probabilities for successive remaining alternatives (Skrondal and Rabe-Hesketh, 2003), i.e., rankings can be assumed to be obtained successively such that the best choice is selected first, then the second best among the remaining choices, etc. At the k-th successive selection step the contribution to (1) takes the form of a multinomial probability with sample size one and number of categories determined by the remaining choices.

To illustrate, consider the case of Ai = 3, and assume that the observed ranking is . At the first selection step for the best choice, the multinomial likelihood for the first choice is

,(2)

where

,

, and

.

This is a multinomial likelihood with three categories and sample size equal to one. Now conditioning on the first choice , there are two remaining second choices. The conditional binomial likelihood for the second choice is

,(3)

where

and

.

Again, the sample size for this binomial is one. The resulting likelihood for the i-th sample is

(4).

in accordance with eq. (1). It emerges that the likelihood can be written as a product of successive conditional multinomial likelihoods.

The fact that for each selection step we have a (conditional) multinomial likelihood with sample size equal to onecan be exploited in fitting the model as a generalized linear model (McCullagh and Nelder, 1989). For this purpose, the data need to be exploded such that for each selection step a conditional multinomial likelihood can be fitted as described before. To illustrate, we continue the example . The exploded pseudo-data yis as follows:

sample rankchoicey

1110

1121

1130

1210

1231

Now for each rank, except the last, a (conditional) multinomial model needs to be fitted. Specifically, for the first rank a trinomial model is fitted and for the second a conditional binomial model is fitted. For each rank this is achieved by defining a record for each category for the multinomial in question.

Finally, we may exploit that, up to a constant not depending on parameters, the kernel of a multinomial likelihood is equivalent that of a Poisson model (McCullagh und Nelder, 1989: 209). Thus, we may analyse by a log-linear model that takes pseudo-data y to be Poisson random variables. In order to make this a multinomial for each rank and sample, a common intercept must be fitted to each sample-by-rank combination. Thus, the linear predictor is

sample.rank + choice(5)

Any block or treatment structure is reflected by giving a structured model for sample and choice, respectively, giving great flexibility to account for complex designs and research questions.It is important to realize, however, that only within sample effects can be estimated based on the model.

The method does not allow for tied ranks. Though a modification is possible (Skrondal and Rabe-Hesketh, 2003), details are not discussed here.

Example (kindly provided by Augustine Takyi, Institute 480, 1 September 2008)

Respondents were 120 household in six communities in a project regionin Southern Ethiopiareceiving support by two NGO.

The respondents were asked to rank their reasons for keeping goats according to the following predetermined reasons;

GKSLA – goat keeping, sale of live animals

GKHCMt – goat keeping, home consumption of meat

GKHCMk – goat keeping, home consumption of milk

GKCER – goat keeping, ceremonies

GKSLA / GKHCMt / GKHCMk / GKCER
2 / 4 / 1 / 3
2 / 4 / 1 / 3
2 / 3 / 1 / 4
2 / 3 / 1 / 4

… 116 more records

%let A=4; /*number of choices*/

data a;

array _choice _choice1-_choice&a;

input GKSLAGKHCMtGKHCMkGKCER;

sample = _N_;

_choice1=GKSLA; label _choice1='GKSLA ';

_choice2=GKHCMt; label _choice2='GKHCMt';

_choice3=GKHCMk; label _choice3='GKHCMk';

_choice4=GKCER; label _choice4='GKCER ';

do i=1to &a;

rank=i;

do j=1to &a;

choice=label(_choice[j]);

if _choice[j]=rank thendo; y=1; output; end;

if _choice[j]>rank thendo; y=0; output; end;

end;

end;

datalines;

2413

2413

2314

2314

2314

2314

2314

2413

2314

2413

2413

1324

1423

1423

2314

1324

1324

1324

2314

2413

2314

1423

1423

3214

2314

2413

3412

2314

2314

1324

2314

2413

3214

3214

4312

3214

2413

4312

2413

2314

3214

3214

1423

3214

3214

3214

2413

2314

2413

4321

2314

3214

2413

2413

2413

1324

2314

2413

2314

2413

2314

2413

1423

2314

2314

1423

2413

2314

2314

2413

1324

2314

2413

2413

1324

1423

2413

1423

2413

1423

2314

2413

2413

2413

2413

1324

2413

2314

1423

2413

2314

1423

1423

2314

1324

2413

2413

2314

1423

1324

1423

1324

1324

2413

2314

2314

2413

1423

2413

2413

1234

2413

1324

2413

2314

2314

2413

1324

2314

2413

;

procgenmoddata=a;

odsoutput lsmeans=lsmeans lsmeandiffs=diffs;

class choice sample rank;

model y=sample*rank choice/link=log dist=poisson type1;

lsmeans choice/pdiff;

%mult(trt=choice, p=probchisq);

run;

The %mult macro can be downloaded at

Output:

LR Statistics For Type 1 Analysis

Chi-

Source Deviance DF Square Pr > ChiSq

Intercept 879.6391

sample*rank 762.7329 479 116.91 1.0000

choice 421.3834 3 341.35 <.0001

Least Squares Means

Standard Chi-

Effect choice Estimate Error DF Square Pr > ChiSq

choice GKCER -2.0480 0.1618 1 160.29 <.0001

choice GKHCMk 1.7123 0.1819 1 88.63 <.0001

choice GKHCMt -1.9923 0.1615 1 152.09 <.0001

choice GKSLA 0.5722 0.1542 1 13.77 0.0002

Differences of Least Squares Means

Standard Chi-

Effect choice _choice Estimate Error DF Square Pr > ChiSq

choice GKCER GKHCMk -3.7604 0.3027 1 154.33 <.0001

choice GKCER GKHCMt -0.0557 0.1804 1 0.10 0.7575

choice GKCER GKSLA -2.6202 0.2701 1 94.11 <.0001

choice GKHCMk GKHCMt 3.7047 0.3023 1 150.13 <.0001

choice GKHCMk GKSLA 1.1401 0.2044 1 31.13 <.0001

choice GKHCMt GKSLA -2.5645 0.2700 1 90.24 <.0001

TRT BY LEVEL BY2 LEVEL2 BY3 LEVEL3 LABEL LSMEAN G

choice ...... GKCER -2.048012 . a .

GKHCMk 1.7123459 . . c

GKHCMt -1.992305 . a .

GKSLA 0.5722071 b . .

The best utility was found for GKHCMk, which differed significantly from GKSLA, which in turn differed significantly from the remaining choices GKCER and GKHCMt.

For comparison, the Friedman-test was computed for the same data. The chi-squared value was 890.89, which is rather larger than that of the multinomial logit analysis (341.35).

Choice Mean rank

GKCER 3.4583

GKHCMk 1.2833

GKHCMt 3.3833

GKSLA 1.8750

Mean ranks agree with ranking by estimated utilities.

Extended example

The 120 households were stratified according to 6 community (kebele), which were receiving support from two different NGO.To illustrate the flexibility of the method, consider the effect of NGO and its interaction with CHOICE. The linear model is

sample.rank + choice + ngo + choice.ngo(6)

This is fitted as follows:

procgenmoddata=a;

odsoutput lsmeans=lsmeans lsmeandiffs=diffs;

class choice sample rank ngo;

model y=sample*rank choice|ngo/link=log dist=poisson type1;

run;

Output:

LR Statistics For Type 1 Analysis

Chi-

Source Deviance DF Square Pr > ChiSq

Intercept 879.6391

sample*rank 762.7329 479 116.91 1.0000

choice 421.3834 3 341.35 <.0001

NGO 421.3834 0 0.00 .

choice*NGO 394.5309 3 26.85 <.0001

The significant interaction indicates that there are significant differences in which households rank alternative choices depending on the supporting NGO. The main effect for NGO provides no test here, because ranking is within households, and thus within NGOs, so the effect is not estimable: it is a between-sample effect. In fact, we could as well drop the NGO main effect without altering the tests for other effects.

To further test whether there was residual variation between communities (kebele) within NGO, the following model was used (see Piepho et al., 2003 for syntax):

sample.rank + choice + (ngo/kebele) + choice.(ngo/kebele)(7a)

Dropping effects for NGO and NGO.kebele, which are between-sample effects and so are not estimable, the model resolves as

sample.rank + choice + choice.ngo + choice.ngo.kebele(7b)

Implementation in MIXED:

procgenmoddata=a;

class choice sample rank ngo kebele;

model y=sample*rank choice choice*ngo choice*kebele*ngo/link=log dist=poisson type1;

run;

Output:

LR Statistics For Type 1 Analysis

Chi-

Source Deviance DF Square Pr > ChiSq

Intercept 879.6391

sample*rank 762.7329 479 116.91 1.0000

choice 421.3834 3 341.35 <.0001

choice*NGO 394.5309 3 26.85 <.0001

choice*NGO*kebele 367.5717 12 26.96 0.0078

There is significant residual variation among communities within NGOs.

References

Luce RD1959 Individual choice behavior. Wiley, New York.

McCullagh P, Nelder JA 1989 Generalized linear models. Chapman & Hall, London.

Piepho HP, Büchse A, Emrich K 2003 A hitchhiker's guide to the mixed model analysis of randomized experiments. Journal of Agronomy and Crop Science189, 310-322.

Skrondal A, Rabe-Hesketh S 2003 Multi-level logistic regression for polytomous data and rankings. Psychometrika 68, 267-287

1

A simpler implementation using the PHREG procedure

13 April 2013

As pointed out by Allison and Christakis (1994), the exploded logit model can also be implemented using procedures for proportional hazards (PH) regression, exploiting identities in the likelihoods between the PH model and the exploded logit model. The authors used the PHREG procedure of SAS to illustrate their procedure. The approach is also considered in Allison (1999).

When developing the approach using GENMOD in 2008, our intention was to find an easy way to compute all pairwise comparisons between levels of a factor or effect. At the time, the PHREG procedure did not have an LSMEANS or a CLASS statement, so PHREG was not a convenient option at the time. These statements have now been added to the procedure. Also, the means can be output using the ODS facility, so letter displays for all pairwise comparisons can computed using the %MULT macro (Piepho, 2012).

Two technicalities must be observed when using the PHREG procedure:

(1) For PHREG to compute lsmeans, the param=glm option on the CLASS statement must be used. This invokes “GLM parameterization”.

(2) The means option must be added in the LSMEANS statement.

Example (kindly provided by Augustine Takyi, Institute 480, 1 September 2008)

Using the example provided by Augustine Takyi, we get exactly the same results with PHREG as with the GENMOD implementation. The only difference is that the means are shifted, because the mean for the last factor level is set to zero. The means themselves have no direct interpretation because these refer to a latent scale; it is only the differences between means that have a direct interpretation as log odds under the exploded logit model. And these differences are identical under both implementations.

The PHREG Procedure

choice Least Squares Means

Standard

choice Estimate Error z Value Pr > |z|

GKCER -2.6201 0.2701 -9.70 <.0001

GKHCMk 1.1401 0.2044 5.58 <.0001

GKHCMt -2.5644 0.2699 -9.50 <.0001

GKSLA 0 0 . .

Differences of choice Least Squares Means

Standard

choice _choice Estimate Error z Value Pr > |z|

GKCER GKHCMk -3.7602 0.3027 -12.42 <.0001

GKCER GKHCMt -0.05571 0.1804 -0.31 0.7575

GKCER GKSLA -2.6201 0.2701 -9.70 <.0001

GKHCMk GKHCMt 3.7045 0.3023 12.25 <.0001

GKHCMk GKSLA 1.1401 0.2044 5.58 <.0001

GKHCMt GKSLA -2.5644 0.2699 -9.50 <.0001

trt label lsmean letters

choice GKCER -2.620099 a

GKHCMk 1.1401345 c

GKHCMt -2.564392 a

GKSLA 0 b

References

Allison, P.D.(1999): Logistic regression using the SAS System. Theory and application. SAS Institute Inc., Cary.

Allison P.D., Christakis, N.A.(1994): Logit models for sets of ranked items. Sociological Methodology24, 199-228.

Piepho, H.P. (2012): A SAS macro for generating letter displays of pairwise mean comparisons. Communications in Biometry and Crop Science7, 4-13.

SAS code

data b;

input GKSLAGKHCMtGKHCMkGKCER;

sample = _N_;

y=GKSLA; choice='GKSLA ';output;

y=GKHCMt; choice='GKHCMt';output;

y=GKHCMk; choice='GKHCMk';output;

y=GKCER; choice='GKCER ';output;

datalines;

2413

2413

2314

2314

2314

2314

2314

2413

2314

2413

2413

1324

1423

1423

2314

1324

1324

1324

2314

2413

2314

1423

1423

3214

2314

2413

3412

2314

2314

1324

2314

2413

3214

3214

4312

3214

2413

4312

2413

2314

3214

3214

1423

3214

3214

3214

2413

2314

2413

4321

2314

3214

2413

2413

2413

1324

2314

2413

2314

2413

2314

2413

1423

2314

2314

1423

2413

2314

2314

2413

1324

2314

2413

2413

1324

1423

2413

1423

2413

1423

2314

2413

2413

2413

2413

1324

2413

2314

1423

2413

2314

1423

1423

2314

1324

2413

2413

2314

1423

1324

1423

1324

1324

2413

2314

2314

2413

1423

2413

2413

1234

2413

1324

2413

2314

2314

2413

1324

2314

2413

;

/*Added on 13 APR 2013 - PHREG now has a CLASS and LSMEANS statement.

This code, which follows Allison’s SAS for logistic regression book

and my GENMOD code produce exactly the same results.*/

procphregdata=b;

odsoutput lsmeans=lsmeans diffs=diffs;

class sample choice/param=glm;

/*The param=glm option is needed for glm parameterization.

Without this, lsmeans are not computed*/

model y=choice;

strata sample;

lsmeans choice/pdiff means;

/*The means option must be added, otherwise the lsmeans

will not be computed, but only the pairwise differences*/

%mult(trt=choice, p=probz, descending=0);

run;

1