Comparing distributions when observations are weighted

Let Xi and Yi for i = 1, 2, ..., k be independent Poisson-distributed variables with means αiλi and βiωi, respectively. The quantities λi and ωi are known sampling intensities, while the αi and βi are unknown. Conditional on Σ Xi = nx and Σ Yi = ny, the vectors (X1, ..., Xk) and (Y1, ..., Yk) have multinomial distributions, with the probability of class i being pi = αiλi / Σ αjλj for the Xi and qi = βiωi / Σ βjωj for the Yi.
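As a quick numerical check of this conditioning argument (a sketch with made-up means, assuming NumPy), one can draw independent Poisson vectors, keep only those with a given total, and compare the resulting class frequencies to the multinomial probabilities pi:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, 3.0, 5.0])   # made-up values of alpha_i * lambda_i
n_x = 10                          # total count we condition on

# Draw many independent Poisson vectors and keep those with sum X_i = n_x.
draws = rng.poisson(mu, size=(200_000, 3))
cond = draws[draws.sum(axis=1) == n_x]

emp = cond.mean(axis=0) / n_x     # empirical class probabilities
p = mu / mu.sum()                 # predicted multinomial probabilities pi
print(np.round(emp, 3), np.round(p, 3))
```

The empirical frequencies agree closely with pi = αiλi / Σ αjλj, as the conditioning argument predicts.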

We now want to test the hypothesis that the Xi and Yi would have had the same distribution if the sampling intensities had been equal. Formally, this leads to the null hypothesis

H0 : αi = cβi for i =1, 2 ..., k for some value of c.

Writing βi = cαi (since c is arbitrary, this is equivalent to the null hypothesis above), the log-likelihood under H0 becomes

ln L = Σ[Xi ln(αiλi) − αiλi + Yi ln(cαiωi) − cαiωi]

so the maximum likelihood estimators αi* and c* must satisfy the equations

(Xi + Yi)/αi* = λi + ωic* for i =1, 2 ..., k

(Σ Yi)/c* = Σ αi*ωi.

Solving these equations numerically gives the corresponding estimates pi* and qi* under the null hypothesis. Any appropriate test statistic Z can then be simulated under the null hypothesis by drawing the multinomial variates Xi and Yi with the estimated parameters pi* and qi*, keeping Σ Xi and Σ Yi equal to the observed totals.
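A minimal sketch of this estimation step (hypothetical function name, assuming NumPy): the first likelihood equation gives each αi in terms of c, which can be substituted into the second equation to iterate on c until convergence.

```python
import numpy as np

def null_mle(x, y, lam, omega, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for the likelihood equations under H0:
        (X_i + Y_i)/alpha_i = lambda_i + c*omega_i
        (sum Y_i)/c        = sum alpha_i*omega_i
    Returns the estimates alpha*, c* and class probabilities p*, q*."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lam, omega = np.asarray(lam, float), np.asarray(omega, float)
    c = 1.0
    for _ in range(max_iter):
        alpha = (x + y) / (lam + c * omega)      # first equation, solved for alpha_i
        c_new = y.sum() / (alpha * omega).sum()  # second equation, solved for c
        if abs(c_new - c) < tol * max(c, 1.0):
            c = c_new
            break
        c = c_new
    alpha = (x + y) / (lam + c * omega)
    p = alpha * lam / (alpha * lam).sum()
    q = alpha * omega / (alpha * omega).sum()    # the factor c cancels in q
    return alpha, c, p, q
```

As a sanity check, when all the λi and ωi are equal, both pi* and qi* reduce to (Xi + Yi) / Σ(Xj + Yj), the pooled class frequencies.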

If Xi + Yi = 0 is observed for some classes, these classes should be ignored. The test statistic may, for example, be the standard chi-square statistic. However, if the counts are small, one should simulate conditional on Xi + Yi > 0 for i = 1, 2, ..., k in order to have a well-defined test statistic in each simulation. The p-value of the test is then the fraction of simulated test statistics that are larger than the observed one.
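The Monte Carlo test itself might be sketched as follows (hypothetical function names, assuming NumPy). The chi-square statistic is computed over the classes with Xi + Yi > 0, and the conditioning on Xi + Yi > 0 for every class is implemented by rejection sampling:

```python
import numpy as np

def chisq_stat(x, y, p, q):
    """Chi-square statistic over the classes with x_i + y_i > 0,
    with expected counts n_x*p_i and n_y*q_i."""
    keep = (x + y) > 0
    ex, ey = x.sum() * p[keep], y.sum() * q[keep]
    return ((x[keep] - ex) ** 2 / ex).sum() + ((y[keep] - ey) ** 2 / ey).sum()

def simulate_pvalue(x, y, p, q, n_sim=10000, condition=True, seed=None):
    """Fraction of simulated statistics exceeding the observed one.
    X and Y are drawn as multinomials with the observed totals and the
    null estimates p*, q*; if `condition` is set, draws with an empty
    class are rejected, as suggested for small counts."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    z_obs = chisq_stat(x, y, p, q)
    exceed = 0
    for _ in range(n_sim):
        while True:
            xs = rng.multinomial(x.sum(), p)
            ys = rng.multinomial(y.sum(), q)
            if not condition or np.all(xs + ys > 0):
                break
        if chisq_stat(xs, ys, p, q) > z_obs:
            exceed += 1
    return exceed / n_sim
```

Rejection sampling is adequate here because, with the null estimates, an all-classes-occupied draw is likely unless some pi* or qi* is very small; for extremely sparse tables the rejection loop can become slow.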

If the simulations were done using the true values of pi and qi, this test would be exact. When these parameters are instead estimated from the data, the distribution obtained from the simulations should still be a good approximation, provided that the total number of counts is not too small.