
Chapter 16. Discriminant Analysis.

16:1  What is Discriminant Analysis?

Discriminant analysis (covering both discrimination and classification) is a statistical technique for organizing and optimizing:

•  the description of differences among objects that belong to different groups or classes, and

•  the assignment of objects of unknown class to existing classes.

For example, we may want to determine what characteristics of the inflorescence best discriminate between two very similar species of grasses, and we may want to create a rule that can be used by others to classify individual plants in the future.

Thus, there are two related activities or concepts in discrimination and classification:

1.  Descriptive discrimination focuses on finding a few dimensions that combine the originally measured variables and that separate the classes or collections as much as possible.

2.  Classification focuses on the optimal assignment of new objects, whose real group membership is not known, to one of the existing groups or classes.

Discriminant analysis is a method for classifying observations (objects or subjects) into one of two or more mutually exclusive groups; for determining the degree of dissimilarity of observations and groups; and for determining the specific contribution of each independent variable to this dissimilarity.

16:1.1  Elements of DA:

•  One categorical dependent variable (groups or classes); for example, Bromus hordeaceus vs. Bromus madritensis. When the groups represent factorial combinations of variables, these have to be “flattened” and considered as a single set of groups. For example, if we are trying to identify the species and origin of seeds from 2 species (brma and brho) that may have come from two environments (valley or mountain), we have to create a nominal variable that takes 4 values, one for each possible combination of species and environment.

•  A set of continuous independent variables that are measured on each individual; for example, length, width, area and perimeter of the seed outline.

•  A set with as many probability density functions (pdf) as there are groups. Each pdf describes the probability of obtaining an object, subject or element from a group that has a particular set of values for the independent variables. For example, the pdf for B. hordeaceus (brho) would tell you the probability of finding a brho seed with any given combination of length, width, area and perimeter. The pdf for B. madritensis (brma) would tell you the probability of finding a brma seed with those same characteristics. Typically, it is assumed that all the pdf’s are multivariate normal distributions.

The equation for the multivariate normal distribution is:

f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\right]

where x is the vector of random variables, p is the number of variables or rows in x, Σ is the variance-covariance matrix of x, and μ is the centroid of the distribution. If we were considering only two characteristics, say width and length, the two pdf’s for the two grasses might look like this (after standardizing width and length, simulated data):
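As a quick numerical illustration, the two pdf’s can be evaluated directly. Below is a minimal Python sketch; the centroids and covariance matrix are invented for illustration, not estimated from real brho or brma seeds:

```python
# Sketch: evaluate two (hypothetical) multivariate normal pdfs, one per species.
import numpy as np
from scipy.stats import multivariate_normal

mu_brho = np.array([-0.5, -0.5])     # invented standardized (width, length) centroid
mu_brma = np.array([ 0.5,  0.5])
cov = np.array([[1.0, 0.4],
                [0.4, 1.0]])         # invented shared variance-covariance matrix

pdf_brho = multivariate_normal(mean=mu_brho, cov=cov)
pdf_brma = multivariate_normal(mean=mu_brma, cov=cov)

x = np.array([0.2, -0.1])            # one seed's standardized width and length
print(pdf_brho.pdf(x), pdf_brma.pdf(x))  # density of this seed under each group
```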

Note that for any combination of length and width there is a positive probability density under both the brma and the brho pdf. In some areas the two densities are clearly different, but in others they are similar. Cutting away the front and left sides of the picture allows us to see better how the two pdf’s interact.

16:1.2  How does DA compare with other methods?

16:1.2.1  With PCA:

1.  DA has X and Y variables, whereas in PCA there is only one set of variables.

2.  DA has predetermined groups.

3.  Both use the concept of creating new variables that are linear combinations of the original ones.

16:1.2.2  With Cluster Analysis:

1.  DA has predetermined groups, and it is used to optimally assign objects of unknown membership to the groups.

2.  Cluster analysis is used to generate classifications or taxonomies.

3.  In DA, groups are mutually exclusive and exhaustive. All possible groups must be considered, and each object or subject belongs to a single group. This is not the case for all versions of cluster analysis.

16:1.2.3  With MANOVA:

1.  DA and MANOVA are very similar, and are based on several common theoretical aspects. In fact, DA is accessible through the MANOVA Fit Model personality.

2.  Both have categorical X's and continuous Y's (particularly in the discrimination phase of DA).

3.  Both use exactly the same canonical variates, the same partition of the sums of squares and cross-products (SS&CP) into between- and within-group matrices, etc.

4.  The boundary between MANOVA and descriptive DA is not clear-cut in terms of the statistical calculations, which are almost the same.

5.  The difference between MANOVA and classification, however, is clear in terms of both objectives and calculations. Whereas in MANOVA the main question is whether there are significant differences among groups, in DA the main goal is to develop and use discriminant functions to optimally classify objects into the groups.

16:2  Why and When to use Discriminant Analysis?

DA is useful in the following types of situations:

Incomplete knowledge of future situations. For example, a population can be classified as being at risk of extinction on the basis of characteristics that were typical of populations that went extinct in the past. A student applying to go to college may have to be classified as likely to succeed or likely to fail based on the characteristics of students who did succeed or fail in the past.

The group can be identified, but identification requires destroying the subject or plot. For example, the strength of a rope or a Camalot can be measured by stressing it until it breaks. Of course, after it breaks we know its strength, but we cannot use the information on that particular piece, because it no longer exists. The exact species of a seed can be determined by DNA analysis, but after the analysis is done, there is no seed left to do anything with the information!

Unavailable or expensive information. For example, the remains of a human are found and the sex has to be determined. The type of land cover has to be determined for each square km of a large region. Although it would be possible to go to each spot and look at the land cover directly, it would be too expensive. Satellite images can be used and land cover inferred from the spectral characteristics of the reflected radiation.

When the goal is classification of objects whose classes are unknown, the analysis proceeds as follows:

1.  Obtain a random sample of objects from each class (these are objects whose membership is known). This is known as the "training" or "learning" sample.

2.  Measure a series of continuous characteristics on all objects of the training sample and identify any characteristics that are redundant or that really do not help in the discrimination among groups (this can be done by using MANOVA with stepdown analysis; see the textbook by Tabachnick and Fidell). This step is not crucial, but it can save time and money and increase the power of discrimination.

3.  Submit the training sample to a DA and obtain a set of discriminant functions. These functions are used implicitly by SAS and JMP, so you do not need to see or know them. The information on these functions is stored in a SAS dataset that is created with an OUTSTAT=file1 option in the PROC DISCRIM statement. In JMP, the discrimination functions can be saved to table columns.

4.  In JMP, add a row containing the values of all predictors for an object to the data table. In SAS, create a new SAS dataset (file2) with the characteristics of the objects of unknown membership to be classified, and submit it to another PROC DISCRIM where DATA=file1 and TESTDATA=file2.

The same procedure allows a true validation of the classification functions by using a file2 that contains objects of known membership to be classified using only the information on the Y variables and the classification functions developed with an independent dataset.
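As a rough sketch of steps 1-4 outside SAS and JMP, the same workflow can be run with scikit-learn’s LinearDiscriminantAnalysis in Python. The data below are simulated, and the four columns merely stand in for length, width, area and perimeter:

```python
# Sketch of the train-then-classify workflow with simulated seed data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Step 1: training sample of objects with known membership (simulated).
X_train = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),   # "brho" seeds
                     rng.normal(1.0, 1.0, size=(50, 4))])  # "brma" seeds
y_train = np.array(["brho"] * 50 + ["brma"] * 50)

# Step 3: fit the discriminant functions on the training sample.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Step 4: classify new objects of unknown membership.
X_new = rng.normal(0.5, 1.0, size=(3, 4))
print(lda.predict(X_new))        # assigned group for each new object
print(lda.predict_proba(X_new))  # posterior P(g|Xu) for each group
```

Held-out objects of known membership can be passed to the same predict step to validate the classification functions, as described above.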

Because the pdf’s of different groups overlap, some classification errors will usually be made, even if the true parameters that describe the pdf's for each group are known.

Figure 16-1. A linear classification rule to determine if people own riding mowers based on their income and lot size. Regardless of the position of the boundary line used for classifying individuals, some individuals will be classified incorrectly.

16:3  Concepts involved in discrimination and classification.

A good classification system should have the following characteristics:

1.  Use all information available.

2.  Make few classification errors.

3.  Minimize the negative consequences of making classification errors.

Aside from the statistical details, a classification problem has the following elements:

1.  Groups or populations.

2.  PDF's for each group or population in the X space.

3.  Classification rules.

4.  Relative sizes of each group.

5.  Costs of misclassification.

16:3.1  Basic idea

Assign the unit with unknown membership to the group that has the maximum likelihood of being the source of the observed vector Xu.

Example: 2 urns in random positions. One contains 9 white marbles and 1 black marble (urn A); the other contains 1 white and 9 black (urn B). Blindfolded, you extract one marble from one urn. Where did it come from? The wisest decision rule would be:

black → B

white → A

However, even knowing all population parameters, we will make mistakes.

Outcome        Prob   Classified as   Error?
A and white    9/20   A               No
A and black    1/20   B               Yes
B and white    1/20   A               Yes
B and black    9/20   B               No

Overall error rate: 1/20 + 1/20 = 1/10.

The basic classification idea is to minimize the error rate or the cost of errors. The only difference between this example and discriminant analysis is the complexity. The essential theoretical basis is the same.
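The arithmetic behind the table can be checked directly; a minimal Python sketch:

```python
# The urn example as arithmetic: priors, conditional probabilities, and
# the error rate of the rule "black -> B, white -> A".
p_urn = {"A": 0.5, "B": 0.5}            # each urn is equally likely
p_white = {"A": 9 / 10, "B": 1 / 10}    # P(white | urn)

p_A_black = p_urn["A"] * (1 - p_white["A"])  # 1/20, classified as B: error
p_B_white = p_urn["B"] * p_white["B"]        # 1/20, classified as A: error

error_rate = p_A_black + p_B_white
print(error_rate)  # 0.1, i.e. the 1/10 shown in the table above
```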

Rule: Assign an individual u to group g if:

P(g|Xu) > P(g’|Xu) for all g’ ≠ g (for all groups other than g)

If we are considering a single continuous variable for the classification, and we have two groups, the decision rule can be depicted with the following Figure. Note that nothing is assumed or said about the specific distribution of the observed variable in each group.

Figure 16-2. Classification rule and error rates for two groups when there is a single dimension or variable used for the classification. X is the characteristic measured to classify objects. Population on the left is 1 and the one on the right is 2. P(j|k) is the probability of classifying an object as j given that it is k.
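To make Figure 16-2 concrete, suppose, only for illustration, that X is normally distributed in each group with equal variances and equal priors (the rule itself assumes no particular distribution). The cutoff then falls halfway between the two means, and P(2|1) and P(1|2) are the tail areas past the cutoff:

```python
# Sketch of Figure 16-2's setting with invented normal populations.
from scipy.stats import norm

g1 = norm(loc=0.0, scale=1.0)      # population 1 (left)
g2 = norm(loc=3.0, scale=1.0)      # population 2 (right)
cutoff = (0.0 + 3.0) / 2           # classify as 2 if x > cutoff

p_2_given_1 = 1 - g1.cdf(cutoff)   # P(2|1): area of population 1 past the cutoff
p_1_given_2 = g2.cdf(cutoff)       # P(1|2): area of population 2 below the cutoff
print(p_2_given_1, p_1_given_2)    # both ~0.067 in this symmetric case
```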

16:3.2  Prior Probabilities

Suppose that in the previous example we take 1 urn of type A and 2000 urns of type B. Marbles can come only from the 2 groups as before: A or B. Further, suppose that you randomly select a marble from a random urn and it is white. Do you say it came from an urn of type A or type B? In the previous situation it was (almost) certain that it came from A. As the number of B urns increases, the probability that the white marble came from B also increases.

Consider the probability of the event “white marble from B”, call it P(white∩B).

P(white∩B) = P(white) P(B|white) = P(B) P(white|B)

In general, assume that instead of color you measure a vector Xu on the extracted marble and use g to designate groups.

P(Xu∩g) = P(Xu) P(g|Xu) = P(g) P(Xu|g)

We are interested in calculating P(g|Xu) for all g’s, so we can assign Xu (the marble) to the group g with the maximum P(g|Xu). The P(g) are called prior probabilities, or priors, and reflect the probability of getting a unit at random from group g before we know anything about the unit. (We write P(g) = pg.)
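Applying Bayes’ rule to the 2000-urn version shows how the priors can overwhelm the evidence from the marble itself; a quick check in Python:

```python
# Posterior P(group | white) with very unequal priors (1 A urn, 2000 B urns).
p_A, p_B = 1 / 2001, 2000 / 2001         # prior probabilities of each urn type
p_white_A, p_white_B = 9 / 10, 1 / 10    # P(white | urn type)

p_white = p_A * p_white_A + p_B * p_white_B   # total probability of white
post_A = p_A * p_white_A / p_white            # P(A | white), Bayes' rule
post_B = p_B * p_white_B / p_white            # P(B | white)
print(post_A, post_B)  # ~0.0045 vs ~0.9955: the white marble is assigned to B
```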

16:3.3  Costs of making errors

The cost of incorrectly classifying an individual from group 1 into group 2 may be quite different from the cost of incorrectly putting an individual from group 2 into group 1. A typical example is that of a trial for a serious crime. The truth is not known (perhaps not even to the person on trial). What is the consequence of releasing a guilty subject? What is the consequence of convicting an innocent person? The relative consequences should affect the way in which one weighs the evidence. This is taken into account in discriminant analysis by the decision rule. Note that the following decision rule and figure depict a situation in which we are measuring 2 characteristics of each object, so the whole plane is divided into two regions: assign an object to population 1 if

f1(X)/f2(X) ≥ [C(1|2)/C(2|1)] [p2/p1]

and to population 2 otherwise. This rule indicates that we should classify the object in population 1 if the ratio of probabilities (“heights” of the pdf’s) f1(X)/f2(X) is greater than the ratio of the costs of misclassification times the ratio of priors. C(j|k) is the cost of classifying an object from population k into j; pk is the prior probability for population k.

Figure 16-3. Example of decision rule for classification of two populations based on two characteristics. The line partitions the plane of all possible pairs of values (x1, x2) (the “universe” of events Ω) into two mutually exclusive and exhaustive sets, R1 and R2. This figure shows an unusual shape of the boundary between the two groups, but it is a possible one.
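A minimal numerical sketch of this cost-weighted rule, here with a single variable and invented normal pdf’s, costs and priors:

```python
# Cost-weighted rule: assign to population 1 when
#   f1(x)/f2(x) >= [C(1|2)/C(2|1)] * [p2/p1].
# All populations, costs, and priors below are invented for illustration.
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0)   # pdf of population 1
f2 = norm(loc=3.0, scale=1.0)   # pdf of population 2
p1, p2 = 0.5, 0.5               # priors
C_1_given_2 = 4.0               # cost of classifying a population-2 object as 1
C_2_given_1 = 1.0               # cost of classifying a population-1 object as 2

def classify(x):
    ratio = f1.pdf(x) / f2.pdf(x)
    threshold = (C_1_given_2 / C_2_given_1) * (p2 / p1)
    return 1 if ratio >= threshold else 2

# x = 1.5 lies on the equal-cost boundary, but the high C(1|2)
# shifts the boundary toward population 1, so 1.5 is assigned to 2.
print(classify(0.5), classify(1.5))  # -> 1 2
```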

16:4  Model and Assumptions.

16:4.1  Model

The model is essentially the same as for MANOVA, except that in DA the analysis of the categorical variable is always one-way. Factorial combinations must be “flattened” and viewed as a single set of different groups or treatments.

16:4.2  Assumptions and other issues.

16:4.2.1  Equality of sample size across cells.

Inequality of cell sizes is usually not a problem because DA is one-way.