7.4 Discriminant Rules (More Than Two Populations)
Almost all of the discriminant rules given in Section 7.1 extend to more than two populations. The exception is that there is no longer a single linear discriminant function rule.
Example: Wheat kernels (wheat.sas, wheat.dat)
A researcher at Kansas State University wanted to classify wheat kernels into “healthy”, “sprout”, or “scab” types. The researcher developed an automated method to take the following measurements: kernel density, hardness index, size index, weight index, and moisture content. The purpose of the discriminant analysis was to see if the wheat kernels could accurately be classified into the three types using the five measurements.
Partial listing of the data set (s=single, k=kernel):
lot / class / kernel / type1 / skden1 / skden2 / skden3 / skhard / sksize / skwt / skmst
5911 / hrw / 1 / Healthy / 1.35 / 1.35 / 1.36 / 60.33 / 2.30 / 24.65 / 12.02
5911 / hrw / 2 / Healthy / 1.29 / 1.29 / 1.29 / 56.09 / 2.73 / 33.30 / 12.17
5911 / hrw / 3 / Healthy / 1.24 / 1.23 / 1.23 / 43.99 / 2.51 / 31.76 / 11.88
5911 / hrw / 4 / Healthy / 1.34 / 1.34 / 1.34 / 53.82 / 2.27 / 32.71 / 12.11
5911 / hrw / 5 / Healthy / 1.26 / 1.26 / 1.26 / 44.39 / 2.35 / 26.07 / 12.06
srw4 / srw / 276 / Scab / 1.03 / 1.03 / 1.03 / -9.57 / 2.06 / 23.82 / 12.65
- lot is the batch of the wheat
- class is hrw = hard red winter wheat or srw = soft red winter wheat; a new binary numeric variable called hrw is created to denote the two classes
- kernel is an identification number
- type1 is the kernel type: Healthy, Sprout, or Scab
- skden1, skden2, and skden3 are the density of the kernel (the measurement was repeated three times); a new variable called skden is created as the average of the three measurements
- skhard is the kernel’s hardness
- sksize is the kernel’s size
- skwt is the kernel’s weight
- skmst is the kernel’s moisture content
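The two derived variables described in the list could be created in a DATA step along the following lines. This is only a sketch and is not taken from wheat.sas: the data set name wheat, the use of inline DATALINES (just the six rows from the partial listing) instead of reading wheat.dat, and the coding hrw = 1 for hard red winter wheat are all assumptions.

```sas
/* Sketch only: the six rows shown in the partial listing, entered inline */
data wheat;
  input lot $ class $ kernel type1 $ skden1 skden2 skden3
        skhard sksize skwt skmst;
  datalines;
5911 hrw 1 Healthy 1.35 1.35 1.36 60.33 2.30 24.65 12.02
5911 hrw 2 Healthy 1.29 1.29 1.29 56.09 2.73 33.30 12.17
5911 hrw 3 Healthy 1.24 1.23 1.23 43.99 2.51 31.76 11.88
5911 hrw 4 Healthy 1.34 1.34 1.34 53.82 2.27 32.71 12.11
5911 hrw 5 Healthy 1.26 1.26 1.26 44.39 2.35 26.07 12.06
srw4 srw 276 Scab 1.03 1.03 1.03 -9.57 2.06 23.82 12.65
;
run;

data set1;
  set wheat;
  skden = mean(of skden1-skden3);  /* average of the three density readings */
  hrw   = (class = 'hrw');         /* assumed coding: 1 = hrw, 0 = srw */
run;

proc print data=set1;
  var kernel type1 skden hrw;
run;
```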
Initial summary measures and PCA:
First 50 kernels – notice kernel #31
The PRINCOMP Procedure
Observations 276
Variables 6
Simple Statistics
skden skhard sksize
Mean 1.191571450 25.91269460 2.200355072
StD 0.140570141 27.91452693 0.495121266
Simple Statistics
skwt skmst hrw
Mean 27.43524638 11.18883322 0.5217391304
StD 7.97612238 2.03025511 0.5004345937
Correlation Matrix
skden skhard sksize skwt skmst hrw
skden 1.0000 0.0989 0.2073 0.2764 0.0089 -.0151
skhard 0.0989 1.0000 -.0980 -.3395 -.1883 0.3909
sksize 0.2073 -.0980 1.0000 0.7566 0.0046 0.0248
skwt 0.2764 -.3395 0.7566 1.0000 0.1442 -.1410
skmst 0.0089 -.1883 0.0046 0.1442 1.0000 -.6816
hrw -.0151 0.3909 0.0248 -.1410 -.6816 1.0000
Eigenvalues of the Correlation Matrix
Eigenvalue Difference Proportion Cumulative
1 2.14089924 0.45206317 0.3568 0.3568
2 1.68883607 0.68799736 0.2815 0.6383
3 1.00083871 0.30098689 0.1668 0.8051
4 0.69985182 0.41680304 0.1166 0.9217
5 0.28304878 0.09652340 0.0472 0.9689
6 0.18652538 0.0311 1.0000
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6
skden 0.182219 0.315600 0.734515 -.557323 0.042130 0.123681
skhard -.394409 0.195882 0.547652 0.630087 -.198030 -.264431
sksize 0.419028 0.501916 -.115373 0.384230 -.232840 0.597778
skwt 0.542541 0.382239 -.126822 0.079601 0.156216 -.716044
skmst 0.394311 -.461683 0.319975 0.372033 0.601871 0.168304
hrw -.431070 0.500849 -.169666 0.001902 0.719821 0.128047
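Three or four principal components appear to capture the true dimension of the data: the first three account for 80.51% of the total variance and the first four for 92.17%. Note the possible interpretations of the principal components.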
Kernel #31 is the outlier on the plot
Notice there is some separation between groups of kernels on the plot. The division is actually between the hard and soft red winter wheat classes (hrw variable). It may be of interest to try a separate analysis for each class. I will not do that here and will instead use hrw as a variable for discriminating between the wheat kernel types.
What should be done about kernel #31? Talk to the researcher to make sure this kernel’s data values are correct.
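Output like the PROC PRINCOMP listing above could be produced by a call of roughly the following form. This is a sketch under assumptions: it presumes the data set set1 (the name used in the PROC DISCRIM code in these notes) already contains skden and hrw, the output data set name pc_scores is hypothetical, and the two-dimensional scatter plot is only a stand-in for the plots referred to in the notes.

```sas
/* PCA on the correlation matrix (the PROC PRINCOMP default);
   OUT= saves the principal component scores Prin1, Prin2, ... */
proc princomp data=set1 out=pc_scores;
  var skden skhard sksize skwt skmst hrw;
run;

/* Plot the first two principal component scores by kernel type */
proc sgplot data=pc_scores;
  scatter x=prin1 y=prin2 / group=type1;
run;
```

An unusual kernel such as #31 should stand apart from the other points on a plot of this kind if it is extreme on the first two components.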
Discriminant Analysis:
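title2 'Discriminant analysis on the wheat data set - priors proportional';
/* Normal-theory discriminant analysis; the pooled covariance matrix
   is used by default, which gives linear discriminant rules */
proc discrim data=set1 method=normal crossvalidate
             out=list_set outcross=cross_set;
  class type1;                  /* populations to discriminate among */
  var skden skhard sksize skwt skmst hrw;
  priors proportional;          /* priors equal to the sample proportions */
run;

title2 'Missclassifications from crossvalidation';
/* List kernels whose cross-validated classification differs from the actual type */
proc print data=cross_set;
  where type1 ne _into_;
  var type1 _into_ healthy sprout scab;
run;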
Chris Bilder, STAT 873
Discriminant analysis on the wheat data set - priors proportional
The DISCRIM Procedure
Observations 276 DF Total 275
Variables 6 DF Within Classes 273
Classes 3 DF Between Classes 2
Class Level Information
Variable Prior
type1 Name Frequency Weight Proportion Probability
Healthy Healthy 96 96.0000 0.347826 0.347826
Scab Scab 84 84.0000 0.304348 0.304348
Sprout Sprout 96 96.0000 0.347826 0.347826
Pooled Covariance Matrix Information
Covariance Matrix Rank: 6
Natural Log of the Determinant of the Covariance Matrix: 2.83517
Pairwise Generalized Squared Distances Between Groups
$D^2(i \mid j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} (\bar{X}_i - \bar{X}_j) - 2 \ln \mathrm{PRIOR}_j$
Generalized Squared Distance to type1
From
type1 Healthy Scab Sprout
Healthy 2.11211 7.28413 2.75637
Scab 7.01707 2.37917 5.44346
Sprout 2.75637 5.71052 2.11211
Linear Discriminant Function
Constant $= -0.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \bar{X}_j + \ln \mathrm{PRIOR}_j$
Coefficient Vector $= \mathrm{COV}^{-1} \bar{X}_j$
Linear Discriminant Function for type1
Variable Healthy Scab Sprout
Constant -108.89245 -88.62721 -101.75140
skden 92.60313 79.65724 86.94798
skhard -0.06784 -0.07288 -0.08135
sksize 10.47804 10.41204 10.98232
skwt 0.04965 -0.19559 0.01680
skmst 5.68324 5.79647 5.67197
hrw 18.26673 18.63187 18.46743
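The linear discriminant functions above can be applied to a single kernel as in the following PROC IML sketch. It is not the code from wheat.sas. The coefficients are copied from the output above, the measurements are those of kernel 1 in the partial listing, and hrw = 1 for that kernel is an assumed coding. The kernel is assigned to the type with the largest score, and the posterior probabilities are the normalized exponentials of the scores, which is algebraically the same as the $\exp(-0.5 D_j^2(X))$ formula used by PROC DISCRIM.

```sas
proc iml;
  /* Rows: constant, skden, skhard, sksize, skwt, skmst, hrw.
     Columns: Healthy, Scab, Sprout (proportional priors). */
  coef = {-108.89245  -88.62721 -101.75140,
            92.60313   79.65724   86.94798,
            -0.06784   -0.07288   -0.08135,
            10.47804   10.41204   10.98232,
             0.04965   -0.19559    0.01680,
             5.68324    5.79647    5.67197,
            18.26673   18.63187   18.46743};
  types = {"Healthy" "Scab" "Sprout"};

  /* Kernel 1 from the partial listing; hrw = 1 is an assumption */
  skden = (1.35 + 1.35 + 1.36) / 3;
  x = 1 || skden || 60.33 || 2.30 || 24.65 || 12.02 || 1;

  score = x * coef;                  /* 1 x 3 linear discriminant scores */
  post  = exp(score - max(score));   /* subtract the max for numerical stability */
  post  = post / sum(post);          /* posterior probabilities */
  assigned = types[score[ ,<:>]];    /* type with the largest score */
  print score[colname=types], post[colname=types], assigned;
quit;
```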
Classification Summary for Calibration Data: WORK.SET1
Resubstitution Summary using Linear Discriminant Function
Generalized Squared Distance Function
$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} (X - \bar{X}_j) - 2 \ln \mathrm{PRIOR}_j$
Posterior Probability of Membership in Each type1
$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$
Number of Observations and Percent Classified into type1
From
type1 Healthy Scab Sprout Total
Healthy 74 7 15 96
77.08 7.29 15.63 100.00
Scab 10 64 10 84
11.90 76.19 11.90 100.00
Sprout 23 19 54 96
23.96 19.79 56.25 100.00
Total 107 90 79 276
38.77 32.61 28.62 100.00
Priors 0.34783 0.30435 0.34783
Error Count Estimates for type1
Healthy Scab Sprout Total
Rate 0.2292 0.2381 0.4375 0.3043
Priors 0.3478 0.3043 0.3478
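The error count estimates above follow directly from the classification table. A short PROC IML sketch of the arithmetic, using the resubstitution counts copied from the output (the variable names are illustrative):

```sas
proc iml;
  /* Resubstitution counts: rows = actual type, columns = classified into */
  counts = {74  7 15,
            10 64 10,
            23 19 54};
  types  = {"Healthy" "Scab" "Sprout"};

  n_j     = counts[ ,+];                 /* row totals: 96, 84, 96 */
  rate    = 1 - vecdiag(counts) / n_j;   /* per-type error rates */
  prior   = n_j / sum(n_j);              /* proportional priors */
  overall = sum(prior # rate);           /* prior-weighted overall error rate */
  print (t(rate))[colname=types format=6.4], overall[format=6.4];
quit;
```

With proportional priors the weighted overall rate is simply the total number of misclassified kernels divided by 276.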
Classification Summary for Calibration Data: WORK.SET1
Cross-validation Summary using Linear Discriminant Function
Generalized Squared Distance Function
$D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)}^{-1} (X - \bar{X}_{(X)j}) - 2 \ln \mathrm{PRIOR}_j$
Posterior Probability of Membership in Each type1
$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$
Number of Observations and Percent Classified into type1
From
type1 Healthy Scab Sprout Total
Healthy 67 7 22 96
69.79 7.29 22.92 100.00
Scab 11 62 11 84
13.10 73.81 13.10 100.00
Sprout 25 19 52 96
26.04 19.79 54.17 100.00
Total 103 88 85 276
37.32 31.88 30.80 100.00
Priors 0.34783 0.30435 0.34783
Error Count Estimates for type1
Healthy Scab Sprout Total
Rate 0.3021 0.2619 0.4583 0.3442
Priors 0.3478 0.3043 0.3478
Missclassifications from crossvalidation
Obs type1 _INTO_ Healthy Sprout Scab
8 Healthy Sprout 0.45678 0.48732 0.05591
11 Healthy Scab 0.29606 0.25829 0.44564
14 Sprout Healthy 0.65448 0.33797 0.00755
15 Sprout Scab 0.11832 0.25361 0.62806
16 Sprout Healthy 0.54047 0.42140 0.03813
271 Scab Healthy 0.37123 0.28543 0.34334
273 Scab Healthy 0.47232 0.34115 0.18653
title2 'Discriminant analysis on the wheat data set';
/* Same analysis as before, but with equal prior probabilities */
proc discrim data=set1 method=normal crossvalidate
             outcross=cross_set;
  class type1;
  var skden skhard sksize skwt skmst hrw;
  priors equal;
run;

title2 'Missclassifications from crossvalidation';
/* List kernels whose cross-validated classification differs from the actual type */
proc print data=cross_set;
  where type1 ne _into_;
  var type1 _into_ healthy sprout scab;
run;
The DISCRIM Procedure
Observations 276 DF Total 275
Variables 6 DF Within Classes 273
Classes 3 DF Between Classes 2
Class Level Information
Variable Prior
type1 Name Frequency Weight Proportion Probability
Healthy Healthy 96 96.0000 0.347826 0.333333
Scab Scab 84 84.0000 0.304348 0.333333
Sprout Sprout 96 96.0000 0.347826 0.333333
Pooled Covariance Matrix Information
Covariance Matrix Rank: 6
Natural Log of the Determinant of the Covariance Matrix: 2.83517
Pairwise Generalized Squared Distances Between Groups
$D^2(i \mid j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} (\bar{X}_i - \bar{X}_j)$
Generalized Squared Distance to type1
From
type1 Healthy Scab Sprout
Healthy 0 4.90497 0.64426
Scab 4.90497 0 3.33136
Sprout 0.64426 3.33136 0
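As a check on the two sets of output, the generalized squared distances reported earlier under proportional priors equal the distances in this table plus $-2 \ln \mathrm{PRIOR}_j$ for the column ("to") group $j$. The arithmetic below uses only numbers from the two listings and agrees with the earlier table up to rounding in the last digit.

```latex
\begin{align*}
-2\ln(0.347826) &\approx 2.11211, \qquad -2\ln(0.304348) \approx 2.37917 \\
D^2(\text{Healthy} \mid \text{Scab})   &= 4.90497 + 2.37917 = 7.28414 \\
D^2(\text{Scab} \mid \text{Healthy})   &= 4.90497 + 2.11211 = 7.01708 \\
D^2(\text{Healthy} \mid \text{Sprout}) &= 0.64426 + 2.11211 = 2.75637
\end{align*}
```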
Linear Discriminant Function
Constant $= -0.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \bar{X}_j$
Coefficient Vector $= \mathrm{COV}^{-1} \bar{X}_j$
Linear Discriminant Function for type1
Variable Healthy Scab Sprout
Constant -107.83640 -87.43762 -100.69535
skden 92.60313 79.65724 86.94798
skhard -0.06784 -0.07288 -0.08135
sksize 10.47804 10.41204 10.98232
skwt 0.04965 -0.19559 0.01680
skmst 5.68324 5.79647 5.67197
hrw 18.26673 18.63187 18.46743
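Only the constants change when the priors change; the coefficient vectors are identical in the two analyses. Adding $\ln \mathrm{PRIOR}_j$ (proportional priors) to each constant in this table recovers the constants from the earlier output, up to rounding in the last digit:

```latex
\begin{align*}
\text{Healthy:} \quad -107.83640 + \ln(0.347826) &= -107.83640 - 1.05605 = -108.89245 \\
\text{Scab:}    \quad  -87.43762 + \ln(0.304348) &=  -87.43762 - 1.18958 =  -88.62720 \\
\text{Sprout:}  \quad -100.69535 + \ln(0.347826) &= -100.69535 - 1.05605 = -101.75140
\end{align*}
```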
Classification Summary for Calibration Data: WORK.SET1
Resubstitution Summary using Linear Discriminant Function
Generalized Squared Distance Function
$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} (X - \bar{X}_j)$
Posterior Probability of Membership in Each type1
$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$
Number of Observations and Percent Classified into type1
From
type1 Healthy Scab Sprout Total
Healthy 74 7 15 96
77.08 7.29 15.63 100.00
Scab 10 65 9 84
11.90 77.38 10.71 100.00
Sprout 23 20 53 96
23.96 20.83 55.21 100.00
Total 107 92 77 276
38.77 33.33 27.90 100.00
Priors 0.33333 0.33333 0.33333
Error Count Estimates for type1
Healthy Scab Sprout Total
Rate 0.2292 0.2262 0.4479 0.3011
Priors 0.3333 0.3333 0.3333
Classification Summary for Calibration Data: WORK.SET1
Cross-validation Summary using Linear Discriminant Function
Generalized Squared Distance Function
$D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)}^{-1} (X - \bar{X}_{(X)j})$
Posterior Probability of Membership in Each type1
$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$
Number of Observations and Percent Classified into type1
From
type1 Healthy Scab Sprout Total
Healthy 66 9 21 96
68.75 9.38 21.88 100.00
Scab 10 65 9 84
11.90 77.38 10.71 100.00
Sprout 25 20 51 96
26.04 20.83 53.13 100.00
Total 101 94 81 276
36.59 34.06 29.35 100.00
Priors 0.33333 0.33333 0.33333
Error Count Estimates for type1
Healthy Scab Sprout Total
Rate 0.3125 0.2262 0.4688 0.3358
Priors 0.3333 0.3333 0.3333
Obs type1 _INTO_ Healthy Sprout Scab
8 Healthy Sprout 0.45316 0.48345 0.06339
11 Healthy Scab 0.27834 0.24283 0.47882
14 Sprout Healthy 0.65378 0.33760 0.00862
15 Sprout Scab 0.10858 0.23273 0.65869
16 Sprout Healthy 0.53754 0.41912 0.04334
17 Sprout Scab 0.15010 0.12690 0.72299
18 Sprout Healthy 0.56431 0.35650 0.07919
19 Sprout Scab 0.18301 0.16055 0.65644
268 Scab Healthy 0.52590 0.13806 0.33604
273 Scab Healthy 0.46007 0.33229 0.20764
Notes:
- The overall classification error rates are nearly the same under the two prior specifications (crossvalidation: 34.42% with PRIORS PROPORTIONAL versus 33.58% with PRIORS EQUAL). This is expected because the proportion of kernels in each wheat type is approximately the same.
- Using the linear discriminant rules is a little more complicated when the number of populations is more than 2. For this example, 2 different linear discriminant functions are needed. See wheat.sas for the PROC IML code used to classify the wheat kernels.
- Also contained in wheat.sas is the PROC IML code needed to show how the Mahalanobis distance and the posterior probability are found. Examine this on your own.
- No misclassification costs are used here.
- The hypothesis of equal covariance matrices for healthy, sprout, and scab was rejected (p-value < 0.0001) using the POOL=TEST option in PROC DISCRIM. When the quadratic discriminant rule is used, the overall classification error rates are a little lower. The actual code and output used to find these rates are excluded from the notes.
- Examine the 3D plot of the principal components for justification of why some classification error rates are larger than others.
- Summary of classification errors
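Classification Error Rates (Healthy, Scab, and Sprout columns are the actual kernel types)
Covariance matrices / Priors / Method / Healthy / Scab / Sprout / Overall Error
$\Sigma_1=\Sigma_2=\Sigma_3$ / proportional / Resubstitution / 22.92% / 23.81% / 43.75% / 30.43%
$\Sigma_1=\Sigma_2=\Sigma_3$ / proportional / Crossvalidation / 30.21% / 26.19% / 45.83% / 34.42%
Different $\Sigma_i$ / equal / Resubstitution / 26.04% / 20.24% / 33.33% / 26.54%
Different $\Sigma_i$ / equal / Crossvalidation / 31.25% / 22.62% / 41.67% / 31.85%
$\Sigma_1=\Sigma_2=\Sigma_3$ / equal / Resubstitution / 22.92% / 22.62% / 44.79% / 30.11%
$\Sigma_1=\Sigma_2=\Sigma_3$ / equal / Crossvalidation / 31.25% / 22.62% / 46.88% / 33.58%
Different $\Sigma_i$ / proportional / Resubstitution / 25.00% / 21.43% / 31.25% / 26.09%
Different $\Sigma_i$ / proportional / Crossvalidation / 31.25% / 23.81% / 40.63% / 32.25%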
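The notes exclude the code behind the "Different" covariance matrix rows. A call of roughly the following form would request the test of equal covariance matrices and, if it is rejected, the quadratic rule. This is a sketch rather than the author's code, and it assumes the same data set set1 and variables as the PROC DISCRIM calls shown earlier.

```sas
/* POOL=TEST runs the test of equal covariance matrices. When the test is
   significant at the SLPOOL= level (0.10 by default), the within-group
   covariance matrices are used, giving the quadratic discriminant rule. */
proc discrim data=set1 method=normal pool=test crossvalidate;
  class type1;
  var skden skhard sksize skwt skmst hrw;
  priors proportional;
run;
```

Replacing the PRIORS statement with priors equal would give the equal-prior version of the analysis.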
7.5 Variable Selection Procedures
In order to find the most parsimonious model that best estimates the dependent variable in regression analysis, variable selection procedures are used to narrow down the number of independent variables. Similar variable selection procedures can be used for discriminant analysis. This helps to eliminate variables that do not help to discriminate between the different populations.
ANCOVA REVIEW (STAT 801)
One-way ANOVA model: $Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where $\varepsilon_{ij} \sim$ ind. $N(0, \sigma^2)$
$\alpha_i$ is the effect of treatment i
$\mu$ is the grand mean
$Y_{ij}$ is the response of the jth object to treatment i
Example: Wheat kernels
Let $Y_{ij}$ be the hardness of the jth kernel from the ith classification.
$Y_{11}$ = hardness of kernel 1 from healthy class
$\alpha_1$ = healthy effect, $\alpha_2$ = sprout effect, $\alpha_3$ = scab effect
Note that if $\alpha_1 = \alpha_2 = \alpha_3$, there are no mean differences among the kernel types. In this case, would hardness be a good discriminator between the kernel types?
One-way ANCOVA model: $Y_{ij} = \mu + \alpha_i + \beta_i x_{ij} + \varepsilon_{ij}$
$\beta_i$ = slope coefficient
$x_{ij}$ = covariate
Example: Wheat kernels
$x_{ij}$ = variable that has an effect on hardness
Note that if $\alpha_1 = \alpha_2 = \alpha_3$, there are no mean differences among the kernel types when $x_{ij}$ is accounted for. In this case, would hardness be a good discriminator between the kernel types?
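The two models in this review could be fit with PROC GLM as sketched below. The sketch assumes the data set set1 from the discriminant analysis, and the choice of skden as the covariate is purely illustrative. The F test for type1 in each fit addresses whether the kernel types differ in mean hardness.

```sas
/* One-way ANOVA: does mean hardness differ among the kernel types? */
proc glm data=set1;
  class type1;
  model skhard = type1;
run;
quit;

/* One-way ANCOVA with a separate slope for each kernel type;
   skden as the covariate is an illustrative assumption */
proc glm data=set1;
  class type1;
  model skhard = type1 skden type1*skden;
run;
quit;
```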
Forward selection
- Find the variable that is the best discriminator among all the variables. This variable produces the largest F statistic value in a one-way ANOVA model.
Example: Wheat kernels (sorry about the notation)
$\text{skden}_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
$\text{skhard}_{ij} = \mu + \alpha_i + \varepsilon_{ij}$