7.4 Discriminant Rules (More Than Two Populations)

Almost all of the discriminant rules given in Section 7.1 extend to more than two populations. The exception is the linear discriminant function rule, because a single linear discriminant function no longer suffices.
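As a rough sketch of how the distance-based rule generalizes, the formulas below restate what PROC DISCRIM prints in the output for the example in this section. The symbols $k$ (number of populations), $\pi_j$ (prior probability), $\bar{x}_j$ (sample mean vector), and $S$ (pooled covariance matrix) are labels introduced here for illustration and may differ from the notation of Section 7.1. The sketch assumes equal covariance matrices across populations.

```latex
D_j^2(x) = (x - \bar{x}_j)' S^{-1} (x - \bar{x}_j) - 2 \ln \pi_j ,
\qquad
\Pr(j \mid x) = \frac{\exp\{-\tfrac{1}{2} D_j^2(x)\}}
                     {\sum_{i=1}^{k} \exp\{-\tfrac{1}{2} D_i^2(x)\}} ,
\qquad
\text{classify } x \text{ into the population } j \text{ with the smallest } D_j^2(x).
```

Choosing the smallest $D_j^2(x)$ is the same as choosing the largest posterior probability, since the denominator is common to all populations.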

Example: Wheat kernels (wheat.sas, wheat.dat)

A researcher at Kansas State University wanted to classify wheat kernels into “healthy”, “sprout”, or “scab” types. The researcher developed an automated method to take the following measurements: kernel density, hardness index, size index, weight index, and moisture content. The purpose of the discriminant analysis was to see if the wheat kernels could accurately be classified into the three types using the five measurements.

Partial listing of the data set (s=single, k=kernel):

lot / class / kernel / type1 / skden1 / skden2 / skden3 / skhard / sksize / skwt / skmst
5911 / hrw / 1 / Healthy / 1.35 / 1.35 / 1.36 / 60.33 / 2.30 / 24.65 / 12.02
5911 / hrw / 2 / Healthy / 1.29 / 1.29 / 1.29 / 56.09 / 2.73 / 33.30 / 12.17
5911 / hrw / 3 / Healthy / 1.24 / 1.23 / 1.23 / 43.99 / 2.51 / 31.76 / 11.88
5911 / hrw / 4 / Healthy / 1.34 / 1.34 / 1.34 / 53.82 / 2.27 / 32.71 / 12.11
5911 / hrw / 5 / Healthy / 1.26 / 1.26 / 1.26 / 44.39 / 2.35 / 26.07 / 12.06

srw4 / srw / 276 / Scab / 1.03 / 1.03 / 1.03 / -9.57 / 2.06 / 23.82 / 12.65

§ Lot is the batch of the wheat

§ Class is hrw=hard red winter wheat and srw=soft red winter wheat; a new variable called hrw is created to be a binary numerical variable denoting the different classes

§ Kernel is an identification number

§ type1 is the kernel type – Healthy, Sprout, Scab

§ skden1, skden2, and skden3 are the density of the kernel (the measurement was repeated three times); a new variable called skden is created to be the average of the three measurements

§ skhard is the kernel’s hardness

§ sksize is the kernel’s size

§ skwt is the kernel’s weight

§ skmst is the kernel’s moisture content

Initial summary measures and PCA:

First 50 kernels – notice kernel #31

The PRINCOMP Procedure

Observations 276

Variables 6

Simple Statistics

skden skhard sksize

Mean 1.191571450 25.91269460 2.200355072

StD 0.140570141 27.91452693 0.495121266

Simple Statistics

skwt skmst hrw

Mean 27.43524638 11.18883322 0.5217391304

StD 7.97612238 2.03025511 0.5004345937

Correlation Matrix

skden skhard sksize skwt skmst hrw

skden 1.0000 0.0989 0.2073 0.2764 0.0089 -.0151

skhard 0.0989 1.0000 -.0980 -.3395 -.1883 0.3909

sksize 0.2073 -.0980 1.0000 0.7566 0.0046 0.0248

skwt 0.2764 -.3395 0.7566 1.0000 0.1442 -.1410

skmst 0.0089 -.1883 0.0046 0.1442 1.0000 -.6816

hrw -.0151 0.3909 0.0248 -.1410 -.6816 1.0000

Eigenvalues of the Correlation Matrix

Eigenvalue Difference Proportion Cumulative

1 2.14089924 0.45206317 0.3568 0.3568

2 1.68883607 0.68799736 0.2815 0.6383

3 1.00083871 0.30098689 0.1668 0.8051

4 0.69985182 0.41680304 0.1166 0.9217

5 0.28304878 0.09652340 0.0472 0.9689

6 0.18652538 0.0311 1.0000

Eigenvectors

Prin1 Prin2 Prin3 Prin4 Prin5 Prin6

skden 0.182219 0.315600 0.734515 -.557323 0.042130 0.123681

skhard -.394409 0.195882 0.547652 0.630087 -.198030 -.264431

sksize 0.419028 0.501916 -.115373 0.384230 -.232840 0.597778

skwt 0.542541 0.382239 -.126822 0.079601 0.156216 -.716044

skmst 0.394311 -.461683 0.319975 0.372033 0.601871 0.168304

hrw -.431070 0.500849 -.169666 0.001902 0.719821 0.128047

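Three or four principal components appear to capture the true dimension of the data: the first three account for 80.51% of the total variance and the first four account for 92.17%. Note the possible interpretations of the principal components.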
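The Proportion and Cumulative columns can be reproduced from the eigenvalues alone. The following is a minimal Python sketch (Python is used here only for illustration; it is not part of wheat.sas), and the function name variance_proportions is hypothetical.

```python
# Eigenvalues of the correlation matrix, copied from the PROC PRINCOMP output.
eigenvalues = [2.14089924, 1.68883607, 1.00083871,
               0.69985182, 0.28304878, 0.18652538]

def variance_proportions(values):
    """Return (proportions, cumulative proportions) for a list of eigenvalues."""
    total = sum(values)  # equals the number of variables for a correlation matrix
    proportions = [v / total for v in values]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative

proportions, cumulative = variance_proportions(eigenvalues)
for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Prin{i}: proportion={p:.4f} cumulative={c:.4f}")
```

Because the analysis uses the correlation matrix, the eigenvalues sum to the number of variables (6), so each proportion is simply the eigenvalue divided by 6.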

Kernel #31 is the outlier on the plot

Notice there is some separation between the different kernel classes. The division is actually between the hard and soft red winter wheat (hrw variable). It may be of interest to try a separate analysis for each wheat class. I will not do that here and will just use hrw as a variable for discriminating between the wheat kernel types.

What should be done about kernel #31? Talk to the researcher to make sure this kernel’s data values are correct.

Discriminant Analysis:

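title2 'Discriminant analysis on the wheat data set - priors proportional';
* Linear discriminant analysis with crossvalidation;
* OUT= saves the resubstitution results and OUTCROSS= saves the crossvalidation results;
proc discrim data=set1 method=normal crossvalidate
             out=list_set outcross=cross_set;
  class type1;
  var skden skhard sksize skwt skmst hrw;
  priors proportional;
run;

title2 'Missclassifications from crossvalidation';
* Print only the kernels that crossvalidation classifies incorrectly;
proc print data=cross_set;
  where type1 ne _into_;
  var type1 _into_ healthy sprout scab;
run;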

Chris Bilder, STAT 873

Discriminant analysis on the wheat data set - priors proportional

The DISCRIM Procedure

Observations 276 DF Total 275

Variables 6 DF Within Classes 273

Classes 3 DF Between Classes 2

Class Level Information

type1    Variable Name  Frequency  Weight   Proportion  Prior Probability
Healthy  Healthy        96         96.0000  0.347826    0.347826
Scab     Scab           84         84.0000  0.304348    0.304348
Sprout   Sprout         96         96.0000  0.347826    0.347826

Pooled Covariance Matrix Information

Covariance Matrix Rank    Natural Log of the Determinant of the Covariance Matrix
6                         2.83517

Pairwise Generalized Squared Distances Between Groups

$D^2(i \mid j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j) - 2 \ln \mathrm{PRIOR}_j$

Generalized Squared Distance to type1

From

type1 Healthy Scab Sprout

Healthy 2.11211 7.28413 2.75637

Scab 7.01707 2.37917 5.44346

Sprout 2.75637 5.71052 2.11211

Linear Discriminant Function

$\text{Constant} = -0.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \, \bar{X}_j + \ln \mathrm{PRIOR}_j$

$\text{Coefficient Vector} = \mathrm{COV}^{-1} \, \bar{X}_j$

Linear Discriminant Function for type1

Variable Healthy Scab Sprout

Constant -108.89245 -88.62721 -101.75140

skden 92.60313 79.65724 86.94798

skhard -0.06784 -0.07288 -0.08135

sksize 10.47804 10.41204 10.98232

skwt 0.04965 -0.19559 0.01680

skmst 5.68324 5.79647 5.67197

hrw 18.26673 18.63187 18.46743
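To illustrate how the table above is used, here is a minimal Python sketch that scores one kernel with each linear discriminant function and assigns it to the type with the largest score. It is not the PROC IML code from wheat.sas. The function name classify is hypothetical, and coding hrw as 1 for a hard red winter kernel is an assumption about how the hrw variable was created. Because the coefficients were estimated from all 276 kernels, this is a resubstitution-style classification rather than a crossvalidated one.

```python
import math

# Constants and coefficients copied from the "Linear Discriminant Function
# for type1" table (priors proportional). Variable order:
# skden, skhard, sksize, skwt, skmst, hrw.
LDF = {
    "Healthy": (-108.89245, [92.60313, -0.06784, 10.47804, 0.04965, 5.68324, 18.26673]),
    "Scab":    (-88.62721,  [79.65724, -0.07288, 10.41204, -0.19559, 5.79647, 18.63187]),
    "Sprout":  (-101.75140, [86.94798, -0.08135, 10.98232, 0.01680, 5.67197, 18.46743]),
}

def classify(x):
    """Return (predicted type, posterior probabilities) for one kernel."""
    scores = {k: c + sum(b * v for b, v in zip(coef, x))
              for k, (c, coef) in LDF.items()}
    # The posterior is proportional to exp(score); subtract the largest
    # score first so the exponentials stay numerically stable.
    top = max(scores.values())
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    posterior = {k: w / total for k, w in weights.items()}
    return max(scores, key=scores.get), posterior

# Kernel 1 from the partial listing: skden is the mean of the three density
# readings; hrw = 1 for a hard red winter kernel is an ASSUMED coding.
kernel1 = [(1.35 + 1.35 + 1.36) / 3, 60.33, 2.30, 24.65, 12.02, 1]
predicted, posterior = classify(kernel1)
print(predicted, {k: round(p, 3) for k, p in posterior.items()})
```

Normalizing exp(score) gives the same posterior probabilities as the generalized squared distance formula, because the term that is quadratic in the observation is common to all three types and cancels in the ratio.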

Classification Summary for Calibration Data: WORK.SET1

Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j) - 2 \ln \mathrm{PRIOR}_j$

Posterior Probability of Membership in Each type1

$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type1

From

type1 Healthy Scab Sprout Total

Healthy 74 7 15 96

77.08 7.29 15.63 100.00

Scab 10 64 10 84

11.90 76.19 11.90 100.00

Sprout 23 19 54 96

23.96 19.79 56.25 100.00

Total 107 90 79 276

38.77 32.61 28.62 100.00

Priors 0.34783 0.30435 0.34783

Error Count Estimates for type1

Healthy Scab Sprout Total

Rate 0.2292 0.2381 0.4375 0.3043

Priors 0.3478 0.3043 0.3478
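The error count estimates above follow directly from the classification counts. As a minimal Python sketch (the function name error_rates is hypothetical), each rate is the fraction of kernels of a type that were classified into another type, and the total is the prior-weighted sum of those rates.

```python
# Resubstitution counts from the PROC DISCRIM output (priors proportional).
# Rows are the actual type; columns are the type each kernel was classified into.
TYPES = ["Healthy", "Scab", "Sprout"]
COUNTS = {
    "Healthy": [74, 7, 15],
    "Scab":    [10, 64, 10],
    "Sprout":  [23, 19, 54],
}

def error_rates(counts, priors):
    """Return per-type error rates and their prior-weighted total."""
    rates = {}
    for j, t in enumerate(TYPES):
        row = counts[t]
        rates[t] = 1 - row[j] / sum(row)  # share of type t classified elsewhere
    total = sum(priors[t] * rates[t] for t in TYPES)
    return rates, total

n = sum(sum(row) for row in COUNTS.values())
proportional = {t: sum(COUNTS[t]) / n for t in TYPES}
rates, total = error_rates(COUNTS, proportional)
print({t: round(r, 4) for t, r in rates.items()}, round(total, 4))
```

With proportional priors the total equals the overall fraction of misclassified kernels (84 of 276). Passing equal priors of 1/3 instead makes the total the simple average of the three rates.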

Classification Summary for Calibration Data: WORK.SET1

Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)}^{-1} \, (X - \bar{X}_{(X)j}) - 2 \ln \mathrm{PRIOR}_j$

Posterior Probability of Membership in Each type1

$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type1

From

type1 Healthy Scab Sprout Total

Healthy 67 7 22 96

69.79 7.29 22.92 100.00

Scab 11 62 11 84

13.10 73.81 13.10 100.00

Sprout 25 19 52 96

26.04 19.79 54.17 100.00

Total 103 88 85 276

37.32 31.88 30.80 100.00

Priors 0.34783 0.30435 0.34783

Error Count Estimates for type1

Healthy Scab Sprout Total

Rate 0.3021 0.2619 0.4583 0.3442

Priors 0.3478 0.3043 0.3478

Missclassifications from crossvalidation

Obs type1 _INTO_ Healthy Sprout Scab

8 Healthy Sprout 0.45678 0.48732 0.05591

11 Healthy Scab 0.29606 0.25829 0.44564

14 Sprout Healthy 0.65448 0.33797 0.00755

15 Sprout Scab 0.11832 0.25361 0.62806

16 Sprout Healthy 0.54047 0.42140 0.03813



271 Scab Healthy 0.37123 0.28543 0.34334

273 Scab Healthy 0.47232 0.34115 0.18653

title2 'Discriminant analysis on the wheat data set';
* Same linear discriminant analysis, now with equal prior probabilities;
proc discrim data=set1 method=normal crossvalidate
             outcross=cross_set;
  class type1;
  var skden skhard sksize skwt skmst hrw;
  priors equal;
run;

title2 'Missclassifications from crossvalidation';
* Print only the kernels that crossvalidation classifies incorrectly;
proc print data=cross_set;
  where type1 ne _into_;
  var type1 _into_ healthy sprout scab;
run;

The DISCRIM Procedure

Observations 276 DF Total 275

Variables 6 DF Within Classes 273

Classes 3 DF Between Classes 2

Class Level Information

type1    Variable Name  Frequency  Weight   Proportion  Prior Probability
Healthy  Healthy        96         96.0000  0.347826    0.333333
Scab     Scab           84         84.0000  0.304348    0.333333
Sprout   Sprout         96         96.0000  0.347826    0.333333

Pooled Covariance Matrix Information

Covariance Matrix Rank    Natural Log of the Determinant of the Covariance Matrix
6                         2.83517

Pairwise Generalized Squared Distances Between Groups

$D^2(i \mid j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j)$

Generalized Squared Distance to type1

From

type1 Healthy Scab Sprout

Healthy 0 4.90497 0.64426

Scab 4.90497 0 3.33136

Sprout 0.64426 3.33136 0
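With equal priors the table above contains plain squared Mahalanobis distances between the group means. Subtracting 2 ln(prior) for the destination type reproduces the generalized squared distances from the earlier priors-proportional output, as this minimal Python sketch shows (the function name generalized_distance is hypothetical).

```python
import math

TYPES = ["Healthy", "Scab", "Sprout"]
# Squared Mahalanobis distances between group means (priors equal output).
MAHALANOBIS = {
    ("Healthy", "Scab"): 4.90497,
    ("Healthy", "Sprout"): 0.64426,
    ("Scab", "Sprout"): 3.33136,
}
# Proportional priors from the Class Level Information table.
PRIORS = {"Healthy": 96 / 276, "Scab": 84 / 276, "Sprout": 96 / 276}

def generalized_distance(i, j):
    """D^2(i|j): squared distance from group i to group j minus 2 ln(prior of j)."""
    if i == j:
        base = 0.0
    else:
        base = MAHALANOBIS.get((i, j), MAHALANOBIS.get((j, i)))
    return base - 2 * math.log(PRIORS[j])

for i in TYPES:
    print(i, [round(generalized_distance(i, j), 5) for j in TYPES])
```

The prior adjustment depends only on the destination type, which is why the priors-proportional table is not symmetric and has nonzero diagonal entries.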

Linear Discriminant Function

$\text{Constant} = -0.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \, \bar{X}_j$

$\text{Coefficient Vector} = \mathrm{COV}^{-1} \, \bar{X}_j$

Linear Discriminant Function for type1

Variable Healthy Scab Sprout

Constant -107.83640 -87.43762 -100.69535

skden 92.60313 79.65724 86.94798

skhard -0.06784 -0.07288 -0.08135

sksize 10.47804 10.41204 10.98232

skwt 0.04965 -0.19559 0.01680

skmst 5.68324 5.79647 5.67197

hrw 18.26673 18.63187 18.46743

Classification Summary for Calibration Data: WORK.SET1

Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j)$

Posterior Probability of Membership in Each type1

$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type1

From

type1 Healthy Scab Sprout Total

Healthy 74 7 15 96

77.08 7.29 15.63 100.00

Scab 10 65 9 84

11.90 77.38 10.71 100.00

Sprout 23 20 53 96

23.96 20.83 55.21 100.00

Total 107 92 77 276

38.77 33.33 27.90 100.00

Priors 0.33333 0.33333 0.33333

Error Count Estimates for type1

Healthy Scab Sprout Total

Rate 0.2292 0.2262 0.4479 0.3011

Priors 0.3333 0.3333 0.3333

Classification Summary for Calibration Data: WORK.SET1

Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)}^{-1} \, (X - \bar{X}_{(X)j})$

Posterior Probability of Membership in Each type1

$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type1

From

type1 Healthy Scab Sprout Total

Healthy 66 9 21 96

68.75 9.38 21.88 100.00

Scab 10 65 9 84

11.90 77.38 10.71 100.00

Sprout 25 20 51 96

26.04 20.83 53.13 100.00

Total 101 94 81 276

36.59 34.06 29.35 100.00

Priors 0.33333 0.33333 0.33333

Error Count Estimates for type1

Healthy Scab Sprout Total

Rate 0.3125 0.2262 0.4688 0.3358

Priors 0.3333 0.3333 0.3333

Obs type1 _INTO_ Healthy Sprout Scab

8 Healthy Sprout 0.45316 0.48345 0.06339

11 Healthy Scab 0.27834 0.24283 0.47882

14 Sprout Healthy 0.65378 0.33760 0.00862

15 Sprout Scab 0.10858 0.23273 0.65869

16 Sprout Healthy 0.53754 0.41912 0.04334

17 Sprout Scab 0.15010 0.12690 0.72299

18 Sprout Healthy 0.56431 0.35650 0.07919

19 Sprout Scab 0.18301 0.16055 0.65644



268 Scab Healthy 0.52590 0.13806 0.33604

273 Scab Healthy 0.46007 0.33229 0.20764

Notes:

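§ The overall classification error rates are nearly the same under the two PRIORS statements, and they are slightly lower with PRIORS EQUAL (33.58% versus 34.42% under crossvalidation). The similarity is expected because the proportion in each wheat class is approximately the same.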

§ Using the linear discriminant rules is a little more complicated when the number of populations is more than 2. For this example, 2 different linear discriminant functions are needed. See wheat.sas for the PROC IML code used to classify the wheat kernels.

§ Also contained in wheat.sas is the PROC IML code needed to show how the Mahalanobis distance and the posterior probability are found. Examine this on your own.

§ No misclassification costs are used here.

§ The covariance matrices for healthy, sprout, and scab were found to be unequal (p-value<0.0001) using the POOL=TEST option in PROC DISCRIM. When the quadratic discriminant rule is used, the overall classification error rates are a little lower. The actual code and output used to find these rates are excluded from the notes.

§ Examine the 3D plot of the principal components for justification of why some classification error rates are larger than others.

§ Summary of classification errors

Classification Error Rates
Covariance matrices and priors / Method / Actual Healthy / Actual Scab / Actual Sprout / Overall Error
Σ1=Σ2=Σ3, priors=prop. / Resubstitution / 22.92% / 23.81% / 43.75% / 30.43%
Σ1=Σ2=Σ3, priors=prop. / Crossvalidation / 30.21% / 26.19% / 45.83% / 34.42%
Different Σi, priors=equal / Resubstitution / 26.04% / 20.24% / 33.33% / 26.54%
Different Σi, priors=equal / Crossvalidation / 31.25% / 22.62% / 41.67% / 31.85%
Σ1=Σ2=Σ3, priors=equal / Resubstitution / 22.92% / 22.62% / 44.79% / 30.11%
Σ1=Σ2=Σ3, priors=equal / Crossvalidation / 31.25% / 22.62% / 46.88% / 33.58%
Different Σi, priors=prop. / Resubstitution / 25.00% / 21.43% / 31.25% / 26.09%
Different Σi, priors=prop. / Crossvalidation / 31.25% / 23.81% / 40.63% / 32.25%

7.5 Variable Selection Procedures

In regression analysis, variable selection procedures are used to narrow down the number of independent variables and find the most parsimonious model that best estimates the dependent variable. Similar variable selection procedures can be used for discriminant analysis. This helps to eliminate variables that do not help to discriminate between the different populations.

ANCOVA REVIEW (STAT 801)

One-way ANOVA model: $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$

where $\epsilon_{ij} \sim \text{ind. } N(0, \sigma^2)$

$\alpha_i$ is the effect of treatment i

$\mu$ is the grand mean

$Y_{ij}$ is the response of the jth object to treatment i

Example: Wheat kernels

Let $Y_{ij}$ be the hardness of the jth kernel from the ith classification.

$Y_{11}$ = hardness of kernel 1 from healthy class

$\alpha_1$ = healthy effect, $\alpha_2$ = sprout effect, $\alpha_3$ = scab effect

Note that if $\alpha_1 = \alpha_2 = \alpha_3$, there are no mean differences among the kernel types. In this case, would hardness be a good discriminator between the kernel types?

One-way ANCOVA model: $Y_{ij} = \mu + \alpha_i + \beta_i x_{ij} + \epsilon_{ij}$

$\beta_i$ = slope coefficient

$x_{ij}$ = covariate

Example: Wheat kernels

$x_{ij}$ = variable that has an effect on hardness

Note that if $\alpha_1 = \alpha_2 = \alpha_3$, there are no mean differences among the kernel types when $x_{ij}$ is accounted for. In this case, would hardness be a good discriminator between the kernel types?

Forward selection

Find the variable that is the best discriminator among all the variables. This variable produces the largest F statistic value in a one-way ANOVA model.
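The first forward selection step can be sketched in Python as follows. The function name anova_f is hypothetical, and the measurements are made-up toy values rather than the wheat data, so the resulting F statistics say nothing about the real kernels. The sketch only shows the mechanics: compute the one-way ANOVA F statistic for each candidate variable and pick the largest.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of samples, one list per group."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# HYPOTHETICAL toy measurements (not the wheat data): three kernel types,
# four kernels each, for two candidate variables.
toy = {
    "skden":  [[1.30, 1.28, 1.33, 1.31], [1.05, 1.10, 1.02, 1.07], [1.20, 1.22, 1.18, 1.24]],
    "skhard": [[50.0, 42.0, 61.0, 47.0], [30.0, 55.0, 41.0, 48.0], [44.0, 39.0, 58.0, 51.0]],
}

# Step 1 of forward selection: the variable with the largest F enters first.
f_values = {name: anova_f(groups) for name, groups in toy.items()}
first_variable = max(f_values, key=f_values.get)
print(f_values, first_variable)
```

In the toy data the group means of skden are far apart relative to the within-group spread, so it has the larger F statistic and would enter first.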

Example: Wheat kernels (sorry about the notation)

$\text{skden}_{ij} = \mu + \alpha_i + \epsilon_{ij}$

$\text{skhard}_{ij} = \mu + \alpha_i + \epsilon_{ij}$