Statistics 512 Notes 20:

Multinomial Distribution:

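Consider a random trial which can result in one, and only one, of k outcomes or categories $C_1, \ldots, C_k$. Let $p_1, \ldots, p_k$ denote the probabilities of outcomes $C_1, \ldots, C_k$; since the probabilities sum to one, the probability of $C_k$ is $p_k = 1 - p_1 - \cdots - p_{k-1}$. Let $X_i$ denote the outcome of the ith trial, $i = 1, \ldots, n$. Let $X_{i1}, \ldots, X_{ik}$ be indicator variables for whether the ith trial resulted in the 1,...,kth outcome respectively, e.g., $X_{i1} = 1$ if $X_i = C_1$ and $X_{i1} = 0$ otherwise. Let $N_j = \sum_{i=1}^{n} X_{ij}$, $j = 1, \ldots, k$, denote the number of trials whose outcome is $C_j$.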

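We have

$$P(X_{i1} = x_{i1}, \ldots, X_{ik} = x_{ik}) = p_1^{x_{i1}} p_2^{x_{i2}} \cdots p_k^{x_{ik}}.$$

Note that

$$P(N_1 = n_1, \ldots, N_k = n_k) = \frac{n!}{n_1! \cdots n_k!} \, p_1^{n_1} \cdots p_k^{n_k},$$

where $n_k = n - n_1 - \cdots - n_{k-1}$ and $p_k = 1 - p_1 - \cdots - p_{k-1}$.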

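The likelihood is

$$L(p_1, \ldots, p_{k-1}) = \prod_{j=1}^{k} p_j^{n_j}, \qquad l = \log L = \sum_{j=1}^{k-1} n_j \log p_j + n_k \log(1 - p_1 - \cdots - p_{k-1}),$$

where $n_j$ is the observed number of trials in category j.

The partial derivatives are:

$$\frac{\partial l}{\partial p_1} = \frac{n_1}{p_1} - \frac{n_k}{1 - p_1 - \cdots - p_{k-1}}, \; \ldots, \; \frac{\partial l}{\partial p_{k-1}} = \frac{n_{k-1}}{p_{k-1}} - \frac{n_k}{1 - p_1 - \cdots - p_{k-1}}.$$

Setting the partial derivatives equal to zero gives the likelihood equations. It is easily seen that $\hat{p}_j = n_j / n$, $j = 1, \ldots, k$, satisfies these equations.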
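As a quick numerical check of the claim above, the following Python sketch computes $\hat{p}_j = n_j / n$ and evaluates the partial derivatives of the log-likelihood, $n_j / p_j - n_k / p_k$ for $j = 1, \ldots, k-1$, at that point. The counts are hypothetical and the function names are illustrative; neither comes from the notes.

```python
def multinomial_mle(counts):
    """Return p_hat_j = n_j / n for each category."""
    n = sum(counts)
    return [c / n for c in counts]

def score(counts, p):
    """Partial derivatives of the log-likelihood with respect to p_1, ..., p_{k-1},
    treating p_k = 1 - p_1 - ... - p_{k-1} as a function of the others."""
    n_k, p_k = counts[-1], p[-1]
    return [n_j / p_j - n_k / p_k for n_j, p_j in zip(counts[:-1], p[:-1])]

counts = [12, 30, 8]            # hypothetical counts, for illustration only
p_hat = multinomial_mle(counts)
print(p_hat)
print(score(counts, p_hat))     # every entry should be zero up to rounding error
```

At $\hat{p}$ each ratio $n_j / \hat{p}_j$ equals $n$, which is why all the differences vanish.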

See (6.4.19) and (6.4.20) in the book for the information matrix.

Goodness of fit tests for multinomial experiments:

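For a multinomial experiment, we often want to consider a model with fewer free parameters than $k - 1$, e.g.,

(*) $p_j = p_j(\theta)$, $j = 1, \ldots, k$,

where $\theta = (\theta_1, \ldots, \theta_q)$ and $q < k - 1$.

To test if the model is appropriate, we can do a “goodness of fit” test which tests

$H_0$: (*) holds for some $\theta$ vs. $H_1$: (*) does not hold.

We can do this test using a likelihood ratio test, where the number of extra parameters in the full parameter space is $k - 1 - q$, so that $-2 \log \Lambda \to \chi^2_{k-1-q}$ in distribution under $H_0$.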

Example 2: Linkage in genetics

Corn can be starchy (S) or sugary (s) and can have a green base leaf (G) or a white base leaf (g). The traits starchy and green base leaf are dominant traits. Suppose the alleles for these two factors occur on separate chromosomes and are hence independent. Then each parent with alleles SsGg produces gametes of the form (S,G), (S,g), (s,G) and (s,g) with equal probability. If two such hybrid parents are crossed, the phenotypes of the offspring will occur in the proportions suggested by the table below. That is, the probability of an offspring of type (S,G) is 9/16; type (s,G) is 3/16; type (S,g) is 3/16; type (s,g) is 1/16.

Phenotype of offspring (columns: alleles of first parent; rows: alleles of second parent)
Alleles of second parent / SG / Sg / sG / sg
SG / (S,G) / (S,G) / (S,G) / (S,G)
Sg / (S,G) / (S,g) / (S,G) / (S,g)
sG / (S,G) / (S,G) / (s,G) / (s,G)
sg / (S,G) / (S,g) / (s,G) / (s,g)
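
The 9:3:3:1 proportions can be checked by enumerating the 16 equally likely pairs of gametes. The sketch below assumes that a dominant (uppercase) allele contributed by either parent determines the phenotype; the helper name `phenotype` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

gametes = ["SG", "Sg", "sG", "sg"]   # equally likely gametes of an SsGg parent

def phenotype(g1, g2):
    """Phenotype of the offspring: a dominant allele from either parent is expressed."""
    starch = "S" if "S" in (g1[0], g2[0]) else "s"
    leaf = "G" if "G" in (g1[1], g2[1]) else "g"
    return starch + leaf

# Count phenotypes over all 4 x 4 gamete pairs, each with probability 1/16.
counts = Counter(phenotype(a, b) for a, b in product(gametes, repeat=2))
probs = {ph: Fraction(c, 16) for ph, c in counts.items()}

for ph in ["SG", "Sg", "sG", "sg"]:
    print(ph, probs[ph])
```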

The table below shows the results of a set of 3839 SsGg x SsGg crossings (Carver, 1927, Genetics, “A Genetic Study of Certain Chlorophyll Deficiencies in Maize.”)

Phenotype / Number in sample
Starchy green / 1997
Starchy white / 906
Sugary green / 904
Sugary white / 32

Does the genetic model with 9:3:3:1 ratios fit the data?

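Let $X_i$ denote the phenotype of the ith crossing, $i = 1, \ldots, n$, with $n = 3839$.

Model: $X_1, \ldots, X_n$ are iid multinomial with cell probabilities $(p_1, p_2, p_3, p_4)$ for the phenotypes (S,G), (S,g), (s,G), (s,g).

Likelihood ratio test of $H_0: (p_1, p_2, p_3, p_4) = (9/16, 3/16, 3/16, 1/16)$ vs. $H_1$: $H_0$ is false:

$$-2 \log \Lambda = 2 \sum_{j=1}^{4} n_j \log \frac{n_j}{n p_{j0}},$$

where $n_j$ is the observed count in cell j and $p_{j0}$ is its probability under $H_0$.

Under $H_0$, $-2 \log \Lambda \to \chi^2_3$ in distribution [there are three extra free parameters in $\Omega$].

Reject $H_0$ at level $\alpha$ if $-2 \log \Lambda$ exceeds the upper $\alpha$ point of $\chi^2_3$ (7.815 for $\alpha = 0.05$).

For the corn data, $-2 \log \Lambda \approx 387.5$, far above this critical value. Thus we reject $H_0$.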
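
The test can be carried out numerically as follows. This is a minimal sketch that assumes the usual form of the statistic for a fully specified null, $-2 \log \Lambda = 2 \sum_j n_j \log(n_j / (n p_{j0}))$, and a 0.05 significance level. The level is an assumption; 7.815 is the tabulated upper 5% point of $\chi^2_3$.

```python
import math

observed = [1997, 906, 904, 32]           # (S,G), (S,g), (s,G), (s,g) counts from the table
p0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]     # cell probabilities under the 9:3:3:1 model

def lrt_statistic(counts, probs):
    """-2 log(Lambda) = 2 * sum_j n_j * log(n_j / (n * p_j)); cells with n_j = 0 contribute 0."""
    n = sum(counts)
    return 2 * sum(c * math.log(c / (n * p)) for c, p in zip(counts, probs) if c > 0)

stat = lrt_statistic(observed, p0)
critical_value = 7.815                    # upper 5% point of chi-square with 3 df (assumed level)
print(round(stat, 1), stat > critical_value)
```

Most of the statistic comes from the sugary white cell, where 32 offspring were observed against roughly 240 expected under the 9:3:3:1 model.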

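Model for linkage: $(p_1, p_2, p_3, p_4) = \left( \frac{2 + \theta}{4}, \frac{1 - \theta}{4}, \frac{1 - \theta}{4}, \frac{\theta}{4} \right)$, $0 < \theta < 1$. The 9:3:3:1 model is the special case $\theta = 1/4$.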

Maximum likelihood estimate of $\theta$ for the corn data: $\hat{\theta} = 0.0357$; see handout.
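
The estimate 0.0357 can be reproduced under the assumption that the linkage model assigns cell probabilities $((2 + \theta)/4, (1 - \theta)/4, (1 - \theta)/4, \theta/4)$ to (S,G), (S,g), (s,G), (s,g). That parametrization is an assumption of this sketch; it is used because it returns the 9:3:3:1 ratios at $\theta = 1/4$ and agrees with the stated estimate. Setting the derivative of the log-likelihood to zero and clearing denominators leaves a quadratic in $\theta$.

```python
import math

n1, n2, n3, n4 = 1997, 906, 904, 32   # counts for (S,G), (S,g), (s,G), (s,g)
n = n1 + n2 + n3 + n4
m = n2 + n3                           # the two middle cells share the probability (1 - t) / 4

def score(t):
    """Derivative of n1*log(2 + t) + m*log(1 - t) + n4*log(t) with respect to t."""
    return n1 / (2 + t) - m / (1 - t) + n4 / t

# Multiplying score(t) = 0 by t * (1 - t) * (2 + t) gives a*t^2 + b*t + c = 0 with:
a = n
b = -(n1 - 2 * m - n4)
c = -2 * n4

# Take the positive root, which is the one lying in (0, 1).
theta_hat = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
print(round(theta_hat, 4))
```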

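Test $H_0$: the linkage model holds for some $\theta$ vs. $H_1$: the linkage model does not hold.

Under $H_0$, $-2 \log \Lambda \to \chi^2_2$ in distribution [there are two extra free parameters in $\Omega$].

For the corn data, $-2 \log \Lambda \approx 2.02$, which is below 5.991, the upper 5% point of $\chi^2_2$. Linkage model is not rejected.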
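
A sketch of this second test, again assuming the linkage cell probabilities $((2 + \theta)/4, (1 - \theta)/4, (1 - \theta)/4, \theta/4)$ and a 0.05 level, both of which are assumptions here. It plugs in the quoted estimate 0.0357 and compares the fitted cell probabilities with the observed proportions.

```python
import math

observed = [1997, 906, 904, 32]       # (S,G), (S,g), (s,G), (s,g) counts
n = sum(observed)

def linkage_probs(theta):
    """Assumed linkage-model cell probabilities for (S,G), (S,g), (s,G), (s,g)."""
    return [(2 + theta) / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4]

theta_hat = 0.0357                    # maximum likelihood estimate quoted in the notes
fitted = linkage_probs(theta_hat)

# -2 log(Lambda) = 2 * sum_j n_j * log(n_j / (n * p_j(theta_hat)))
stat = 2 * sum(c * math.log(c / (n * p)) for c, p in zip(observed, fitted))
critical_value = 5.991                # upper 5% point of chi-square with 2 df (assumed level)
print(round(stat, 2), stat > critical_value)
```

The degrees of freedom are 3 - 1 = 2 because the full model has three free cell probabilities and the linkage model has one free parameter.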

Sufficiency

Let $X_1, \ldots, X_n$ denote a random sample of size n from a distribution that has pdf or pmf $f(x; \theta)$. The concept of sufficiency arises as an attempt to answer the following question: Is there a statistic, a function $T(X_1, \ldots, X_n)$, which contains all the information in the sample about $\theta$? If so, a reduction of the original data to this statistic without loss of information is possible. For example, consider a sequence of independent Bernoulli trials with unknown probability of success $\theta$. We may have the intuitive feeling that the total number of successes contains all the information about $\theta$ that there is in the sample, and that the order in which the successes occurred, for example, does not give any additional information. The following definition formalizes this idea:

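Definition: Let $X_1, \ldots, X_n$ denote a random sample of size n from a distribution that has pdf or pmf $f(x; \theta)$, $\theta \in \Omega$. A statistic $Y = u(X_1, \ldots, X_n)$ is said to be sufficient for $\theta$ if the conditional distribution of $X_1, \ldots, X_n$ given $Y = y$ does not depend on $\theta$ for any value of $y$.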

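Example 1: Let $X_1, \ldots, X_n$ be a sequence of independent Bernoulli random variables with $P(X_i = 1) = \theta$. We will verify that $Y = \sum_{i=1}^{n} X_i$ is sufficient for $\theta$.

Consider

$$P_\theta(X_1 = x_1, \ldots, X_n = x_n \mid Y = y).$$

For $y \ne \sum_{i=1}^{n} x_i$, the conditional probability is 0 and does not depend on $\theta$.

For $y = \sum_{i=1}^{n} x_i$,

$$P_\theta(X_1 = x_1, \ldots, X_n = x_n \mid Y = y) = \frac{P_\theta(X_1 = x_1, \ldots, X_n = x_n)}{P_\theta(Y = y)} = \frac{\theta^y (1 - \theta)^{n-y}}{\binom{n}{y} \theta^y (1 - \theta)^{n-y}} = \frac{1}{\binom{n}{y}}.$$

The conditional distribution thus does not involve $\theta$ at all, and thus $Y = \sum_{i=1}^{n} X_i$ is sufficient for $\theta$.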
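
The calculation in Example 1 can be checked numerically: for any success probability, the conditional probability of a particular 0/1 sequence given the total number of successes should be one over the number of sequences with that total. The sketch below uses hypothetical values n = 4 and y = 2, and the function name is illustrative.

```python
from itertools import product
from math import comb

def conditional_pmf(x, y, theta):
    """P(X_1 = x_1, ..., X_n = x_n | Y = y) for independent Bernoulli(theta) trials,
    where Y is the total number of successes."""
    n = len(x)
    if sum(x) != y:
        return 0.0                                                # sequence inconsistent with Y = y
    joint = theta ** y * (1 - theta) ** (n - y)                   # probability of this exact sequence
    marginal = comb(n, y) * theta ** y * (1 - theta) ** (n - y)   # Binomial(n, theta) pmf at y
    return joint / marginal

n, y = 4, 2
for theta in (0.2, 0.5, 0.9):
    values = {x: conditional_pmf(x, y, theta) for x in product((0, 1), repeat=n)}
    # The same conditional probability, 1 / comb(4, 2), appears for every theta.
    print(theta, values[(1, 1, 0, 0)], values[(1, 0, 0, 0)])
```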

Example 2:

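Let $X_1, \ldots, X_n$ be iid Uniform$(0, \theta)$. Consider the statistic $Y = \max_{1 \le i \le n} X_i$.

We have shown before (see Notes 1) that

$$f_Y(y; \theta) = \frac{n y^{n-1}}{\theta^n}, \quad 0 < y < \theta.$$

For $0 < y < \theta$ and sample values with $\max_i x_i = y$, we have

$$f(x_1, \ldots, x_n \mid Y = y; \theta) = \frac{f(x_1, \ldots, x_n; \theta)}{f_Y(y; \theta)} = \frac{1/\theta^n}{n y^{n-1}/\theta^n} = \frac{1}{n y^{n-1}},$$

which does not depend on $\theta$.

For $\max_i x_i \ne y$, the conditional density is 0.

Thus, the conditional distribution does not depend on $\theta$ and $Y = \max_i X_i$ is a sufficient statistic for $\theta$.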

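Factorization Theorem: Let $X_1, \ldots, X_n$ denote a random sample of size n from a distribution that has pdf or pmf $f(x; \theta)$, $\theta \in \Omega$. A statistic $Y = u(X_1, \ldots, X_n)$ is sufficient for $\theta$ if and only if we can find two nonnegative functions, $k_1$ and $k_2$, such that

$$f(x_1; \theta) \cdots f(x_n; \theta) = k_1[u(x_1, \ldots, x_n); \theta] \, k_2(x_1, \ldots, x_n),$$

where $k_2(x_1, \ldots, x_n)$ does not depend upon $\theta$.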
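
As an illustration of the theorem, the sketch below checks one possible factorization for an iid Uniform(0, $\theta$) sample with $u(x_1, \ldots, x_n) = \max_i x_i$: the joint density equals $k_1 = \theta^{-n} \, 1\{\max_i x_i \le \theta\}$ times $k_2 = 1\{\min_i x_i \ge 0\}$. The Uniform(0, $\theta$) setting, the sample values, and the function names are assumptions made for this example.

```python
import math

def joint_density(xs, theta):
    """Joint density of an iid Uniform(0, theta) sample: product of the marginal densities."""
    return math.prod(1 / theta if 0 <= x <= theta else 0.0 for x in xs)

def k1(y, theta, n):
    """Factor that depends on the data only through y = max(xs); n is the known sample size."""
    return theta ** (-n) if y <= theta else 0.0

def k2(xs):
    """Factor that does not involve theta."""
    return 1.0 if min(xs) >= 0 else 0.0

samples = [(0.3, 1.2, 0.7), (2.5, 0.1, 0.4)]    # hypothetical data
for xs in samples:
    for theta in (1.0, 1.5, 3.0):
        lhs = joint_density(xs, theta)
        rhs = k1(max(xs), theta, len(xs)) * k2(xs)
        print(xs, theta, lhs, rhs, math.isclose(lhs, rhs, abs_tol=1e-12))
```

Because the factorization holds with $k_2$ free of $\theta$, the theorem gives sufficiency of the sample maximum without computing any conditional distribution.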