SPR Fall 2003
Exam 2
Exam 2 Solution.
Twenty data sets were generated for this exam. Nine were used by students, but results for all twenty are in this document. Each data set contains 100 three-dimensional vectors (3 features per sample). There were three classes, and each class generated normally distributed vectors with its own mean and covariance. The probabilities of occurrence of the classes varied from set to set.
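As an illustration only (this is not the generator actually used for the exam), a data set of this form could be produced as below. The parameter variables pp, mm, and vv and the dat(4,:) row convention follow the Programs section; the function name gendata is hypothetical.

% Sketch of a possible generator for one data set (illustrative only).
% pp: 1x3 class probabilities, mm: 3x3 class means (one column per class),
% vv: 3x3x3 class covariance matrices, n: number of feature vectors.
function dat = gendata(pp, mm, vv, n)
dat = zeros(4, n);
cp  = cumsum(pp);                           % cumulative class probabilities
for i = 1:n
    c = min(find(rand <= cp));              % draw a class label
    R = chol(vv(:,:,c));                    % upper Cholesky factor, R'*R = vv
    dat(1:3,i) = mm(:,c) + R' * randn(3,1); % correlated Gaussian vector
    dat(4,i)   = c;                         % store the 'true' class number
end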
This document consists of three parts. The first part is a description of the methods and results, the second describes the programs, and the third contains tables of results and the data-set parameters.
Methods and Results.
We will discuss the results from the second dataset. The results for all datasets are tabulated in the third section.
Preliminary analysis
The frequencies of occurrence of the three classes are:
probabilities 0.320 0.260 0.420
The means of the three classes are:
means ( -1.084 -1.431 0.082)
( -0.377 -0.391 -0.587)
( -1.486 -0.532 -1.204)
The covariance matrices, estimated from the data, are:
0.340 0.211 0.108 | 0.624 -0.270 -0.140 | 0.277 -0.025 0.024 |
0.211 1.541 -0.096 | -0.270 0.647 0.252 | -0.025 0.977 -0.058 |
0.108 -0.096 0.613 | -0.140 0.252 1.126 | 0.024 -0.058 0.366 |
We plot the three projections of the data below. From the y-z plot (z abscissa, y ordinate) we note that the green circles must be the first class, since this is the only class with a positive z mean. Class 3 has a lower x mean than class 2, so class 2 must be the blue pluses and class 3 the red x’s.
There is substantial overlap between the classes. The variances of the classes are different, and the variances of the components are also different. Class 1 has the largest variance in the y component, and one can see greater dispersion on the plot, though it is not striking.
We first attempt the Fisher classifier, which estimates a common covariance matrix for all classes and uses three discriminant functions. The equations for this are in section 4.3.3. We obtain a resubstitution error frequency of 0.25 and a cross-validation error frequency of 0.27.
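For reference, one standard form of such a classifier computes a linear score for each class from the class mean, the common covariance, and the class probability. A minimal sketch, assuming W, m, and p already hold the pooled covariance, the means (one column per class), and the class probabilities; the variable names are illustrative, not those of the exam code:

% Linear discriminant scores with a common covariance W (sketch).
% x: 3x1 feature vector, m: 3x3 means (one column per class), p: class probs.
Wi = inv(W);
for c = 1:3
    g(c) = m(:,c)' * Wi * x - 0.5 * m(:,c)' * Wi * m(:,c) + log(p(c));
end
[gmax, class] = max(g);   % assign x to the class with the largest score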
In view of the fact that the variances are different, we attempt a quadratic classifier, described in section 2.2.1. We find the resubstitution error frequency is 0.22 and the cross-validation error frequency is 0.27. These error probabilities are not significantly different from those of the linear classifier, which suggests that the differences between the class covariances are not big enough to matter. This is surprising, since the variance of the third coordinate of class 2 is three times that of class 3.
As a point of information, we also compute the error frequency using a quadratic classifier whose parameters (class probabilities, means, covariance matrices) are those used to generate the data. This could not be done by the students. Incidentally, the classifier with these parameters is the optimal classifier. The error frequency obtained is 0.25, a number consistent with the other results.
We apply two methods for finding a confidence interval for the resubstitution error probability. The first is to assume that the resubstitution errors are produced by Bernoulli trials and that the central limit theorem holds. Under these assumptions the 95% confidence interval is
p ± 1.96 s
where p is the resubstitution error frequency and s = sqrt(p(1-p)/n). This gives us a confidence interval [0.139, 0.280]. The second method is to generate 200 bootstrap samples of the data, train a classifier on each sample, and find its error frequency. Each sample has 100 feature vectors, just as in the original data, which yields 200 values of the error frequency. We sort these and take the 5th lowest value and the 5th from the top: this is the bootstrap estimate of the 95% confidence interval, which runs from cumulative probability 2.5% = 5/200 to 97.5% = 195/200. This interval is [0.140, 0.301], quite close to the Bernoulli-trial estimate.
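In code, the first interval is immediate. A minimal sketch, assuming p holds the error frequency and n the number of samples; cl and ch are named as in boot_script:

% Normal-approximation 95% confidence interval for an error frequency.
s  = sqrt(p * (1 - p) / n);   % standard error under the Bernoulli model
cl = p - 1.96 * s;            % low end of the interval
ch = p + 1.96 * s;            % high end of the interval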
Most students in the class estimated the bootstrap interval by taking the bootstrap error frequencies, computing their mean and standard deviation, and then finding the interval under a normal assumption. This is not as powerful as reading the interval from the cumulative distribution of the error frequencies, which is a non-parametric method and works even when the distribution is not normal.
Programs
The programs are in prog.zip, the data sets in data.zip.
test_script computes the assorted error probabilities. Function resub_crossval computes both the resubstitution and cross-validation error probabilities for the Fisher linear classifier, and resub_crossvalq does the same for the quadratic classifier. As a last step, the data-set generation parameters are obtained from variables pp, mm, and vv. These parameters are fed into function make_quad_discrim, which computes arrays suitable for the discriminant functions. Those arrays and the data are fed to function quadclassify, which classifies the data. The number of errors is found by comparing these results with dat(4,:), the row of the data matrix that contains the ‘true’ class numbers. The results of these calculations are stored in array errs and also written to a text file called classresults.txt.
boot_script computes resubstitution error frequency confidence intervals. The class probabilities, means, and covariances are estimated in function quadest, and quadclassify is used to classify the data. Variables cl and ch are the low and high ends of the confidence interval computed under the Gaussian assumption. Subsequently we go through a for loop (200 iterations) that makes a random selection of data vectors (features plus category), performs the estimation and classification, and stores the error probabilities in array perr1. After the for loop the array is sorted and the bootstrap confidence interval values [clb, clh] are read from the appropriate locations.
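A sketch of that loop is below; the quadest and quadclassify call signatures shown here are assumptions made for illustration, not necessarily the exact ones in the zip file.

% Sketch of the bootstrap loop in boot_script (signatures assumed).
B = 200; n = size(dat, 2);
perr1 = zeros(1, B);
for b = 1:B
    idx  = ceil(n * rand(1, n));            % resample with replacement
    boot = dat(:, idx);                     % features plus category
    [mat, vecs, cons] = quadest(boot);      % re-estimate the parameters
    cls  = quadclassify(mat, vecs, cons, boot);
    perr1(b) = sum(cls ~= boot(4,:)) / n;   % error frequency for this sample
end
perr1 = sort(perr1);
clb = perr1(5);        % 5th lowest value (~2.5% point)
clh = perr1(B - 4);    % 5th from the top (~97.5% point)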
Function resub_crossval uses function Fisher to compute the linear discriminant function parameters. The classification and error computation are done with MATLAB statements. First the resubstitution error is found from the full data set. Next a loop generates the leave-one-out datasets for training and classifies the left-out vector (hold).
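The leave-one-out loop has the following shape; classify_one is a hypothetical stand-in for the Fisher train-and-classify step:

% Leave-one-out skeleton as in resub_crossval (sketch).
n = size(dat, 2); nerr = 0;
for i = 1:n
    hold  = dat(:, i);                    % the left-out vector
    train = dat(:, [1:i-1, i+1:n]);       % the remaining n-1 vectors
    c = classify_one(train, hold(1:3));   % hypothetical train-and-classify
    nerr = nerr + (c ~= hold(4));         % count errors on held-out vectors
end
crossval_err = nerr / n;                  % cross-validation error frequency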
Function resub_crossvalq works in a similar way. Training is done in function quadest and classification in function quadclassify. Function quadest pulls out the data for each category from the full dataset, estimates the class probability by counting the number of elements, then computes the mean and covariance. The outputs of the function are mat, a three-index array that contains the three inverse covariance matrices; vecs, which contains the three vectors used for the linear portion of the quadratic classifier; and cons, which has the three constants for the three discriminant functions. Observe that the constants include log(p), the logarithm of the class probabilities, as required by Bayes theory. quadclassify simply plugs the feature vectors into the discriminant functions.
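Written out, each discriminant has the form g_c(x) = -x'*M_c*x/2 + v_c'*x + k_c, with M_c the inverse covariance, v_c the linear-term vector, and k_c the constant. A sketch of the evaluation, with the exact array conventions assumed:

% Evaluating the quadratic discriminants (array conventions assumed).
% x: 3x1 feature vector; mat: 3x3x3 inverse covariances; vecs: 3x3 linear
% terms (one column per class); cons: 1x3 constants including log(p).
for c = 1:3
    g(c) = -0.5 * x' * mat(:,:,c) * x + vecs(:,c)' * x + cons(c);
end
[gmax, class] = max(g);   % the largest discriminant wins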
Function Fisher computes parameters for the Fisher linear discriminant functions. Data are split into categories, and the probabilities, means, and covariances are computed. The same loop also accumulates the within-class covariance from the class covariances. Subsequently the between-class covariance is found from the means. Function eig solves the generalized eigenvalue problem. The rank of the between-class covariance matrix (dcov) is two, so only two rows of the eigenvector matrix are kept.
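A sketch of that last step, with dcov as named above and wcov an assumed name for the within-class covariance:

% Generalized eigenvalue step in Fisher (sketch).
[V, D] = eig(dcov, wcov);      % solves dcov*v = lambda*wcov*v
[dd, order] = sort(-diag(D));  % order eigenvalues, largest first
W = V(:, order(1:2))';         % keep two rows: the rank of dcov is two
y = W * dat(1:3, :);           % project the feature vectors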
Script param_script computes and prints the class parameters, and calls on plotdat to do the plotting.
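As a sketch of plotdat's role, the y-z projection with the marker coding used in the discussion could be drawn as follows (variable conventions assumed):

% One of the three projection plots (sketch).
i1 = find(dat(4,:) == 1); i2 = find(dat(4,:) == 2); i3 = find(dat(4,:) == 3);
plot(dat(3,i1), dat(2,i1), 'go', ...   % class 1: green circles
     dat(3,i2), dat(2,i2), 'b+', ...   % class 2: blue pluses
     dat(3,i3), dat(2,i3), 'rx');      % class 3: red x's
xlabel('z'); ylabel('y');              % z abscissa, y ordinate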
Tables of Results
Classifier Results
Dataset / Bayes classifier / Fisher resub / Fisher crossval / Quadratic resub / Quadratic crossval
1 / 0.270 / 0.250 / 0.290 / 0.260 / 0.310
2 / 0.250 / 0.250 / 0.270 / 0.220 / 0.270
3 / 0.170 / 0.430 / 0.420 / 0.190 / 0.230
4 / 0.080 / 0.500 / 0.530 / 0.050 / 0.070
5 / 0.150 / 0.480 / 0.490 / 0.150 / 0.200
6 / 0.100 / 0.120 / 0.150 / 0.070 / 0.180
7 / 0.070 / 0.060 / 0.070 / 0.040 / 0.070
8 / 0.230 / 0.510 / 0.510 / 0.200 / 0.260
9 / 0.320 / 0.590 / 0.570 / 0.290 / 0.370
10 / 0.190 / 0.430 / 0.430 / 0.190 / 0.230
11 / 0.030 / 0.250 / 0.240 / 0.020 / 0.030
12 / 0.060 / 0.080 / 0.100 / 0.060 / 0.090
13 / 0.170 / 0.380 / 0.380 / 0.180 / 0.200
14 / 0.100 / 0.490 / 0.520 / 0.110 / 0.150
15 / 0.270 / 0.590 / 0.520 / 0.260 / 0.270
16 / 0.140 / 0.420 / 0.420 / 0.110 / 0.130
17 / 0.160 / 0.530 / 0.570 / 0.180 / 0.240
18 / 0.080 / 0.400 / 0.410 / 0.060 / 0.090
19 / 0.210 / 0.370 / 0.380 / 0.160 / 0.200
20 / 0.210 / 0.440 / 0.430 / 0.200 / 0.240
Confidence Intervals
Dataset / Quad resub / Gauss low conf / Gauss high conf / Boot low conf / Boot high conf
1 / 0.260 / 0.174 / 0.310 / 0.160 / 0.346
2 / 0.220 / 0.139 / 0.280 / 0.140 / 0.301
3 / 0.190 / 0.113 / 0.260 / 0.080 / 0.267
4 / 0.050 / 0.007 / 0.130 / 0.010 / 0.093
5 / 0.150 / 0.080 / 0.200 / 0.060 / 0.220
6 / 0.070 / 0.020 / 0.150 / 0.030 / 0.120
7 / 0.040 / 0.002 / 0.100 / 0.010 / 0.078
8 / 0.200 / 0.122 / 0.250 / 0.080 / 0.278
9 / 0.290 / 0.201 / 0.370 / 0.170 / 0.379
10 / 0.190 / 0.113 / 0.250 / 0.080 / 0.267
11 / 0.020 / -0.007 / 0.040 / 0.000 / 0.047
12 / 0.060 / 0.013 / 0.130 / 0.020 / 0.107
13 / 0.180 / 0.105 / 0.210 / 0.070 / 0.255
14 / 0.110 / 0.049 / 0.150 / 0.030 / 0.171
15 / 0.260 / 0.174 / 0.340 / 0.160 / 0.346
16 / 0.110 / 0.049 / 0.160 / 0.040 / 0.171
17 / 0.180 / 0.105 / 0.230 / 0.070 / 0.255
18 / 0.060 / 0.013 / 0.120 / 0.010 / 0.107
19 / 0.160 / 0.088 / 0.240 / 0.070 / 0.232
20 / 0.200 / 0.122 / 0.270 / 0.110 / 0.278
Data set 1
probabilities 0.330 0.330 0.340
means ( 0.022 0.193 1.129)
( -0.482 -0.252 -0.481)
( 0.323 -1.145 1.110)
covariances
0.499 -0.065 0.138 | 0.397 0.089 0.127 | 0.754 0.151 0.192 |
-0.065 0.376 -0.147 | 0.089 0.533 0.087 | 0.151 0.426 0.070 |
0.138 -0.147 0.935 | 0.127 0.087 1.079 | 0.192 0.070 0.505 |
Data set 2
probabilities 0.320 0.260 0.420
means ( -1.084 -1.431 0.082)
( -0.377 -0.391 -0.587)
( -1.486 -0.532 -1.204)
covariances
0.340 0.211 0.108 | 0.624 -0.270 -0.140 | 0.277 -0.025 0.024 |
0.211 1.541 -0.096 | -0.270 0.647 0.252 | -0.025 0.977 -0.058 |
0.108 -0.096 0.613 | -0.140 0.252 1.126 | 0.024 -0.058 0.366 |
Data set 3
probabilities 0.390 0.290 0.320
means ( 1.290 -0.262 0.131)
( -0.099 -0.939 -0.555)
( 1.571 -1.031 -1.608)
covariances
0.556 0.070 0.054 | 0.374 0.012 0.054 | 1.133 0.028 0.158 |
0.070 0.268 0.061 | 0.012 0.365 0.039 | 0.028 0.986 0.033 |
0.054 0.061 0.532 | 0.054 0.039 0.630 | 0.158 0.033 0.702 |
Data set 4
probabilities 0.360 0.290 0.350
means ( 0.305 1.018 -0.788)
( 0.691 1.214 0.940)
( -0.509 -1.509 1.319)
covariances
0.339 0.013 -0.069 | 0.665 0.018 0.105 | 0.679 -0.118 0.117 |
0.013 0.268 -0.015 | 0.018 0.611 0.052 | -0.118 0.401 0.039 |
-0.069 -0.015 0.918 | 0.105 0.052 0.412 | 0.117 0.039 1.421 |
Data set 5
probabilities 0.350 0.360 0.290
means ( 0.207 0.341 -1.053)
( -0.594 -1.479 0.674)
( 0.023 1.348 -0.864)
covariances
0.296 0.122 0.061 | 0.273 0.017 0.078 | 0.633 -0.001 -0.057 |
0.122 0.493 -0.019 | 0.017 0.565 -0.016 | -0.001 0.774 0.089 |
0.061 -0.019 0.450 | 0.078 -0.016 0.302 | -0.057 0.089 0.632 |
Data set 6
probabilities 0.360 0.310 0.330
means ( -0.585 0.540 0.938)
( 0.547 0.854 -0.994)
( 0.546 -0.992 1.525)
covariances
0.458 0.069 0.077 | 0.537 0.104 0.106 | 0.298 -0.083 0.043 |
0.069 1.221 -0.077 | 0.104 0.764 0.062 | -0.083 0.404 -0.025 |
0.077 -0.077 0.644 | 0.106 0.062 0.648 | 0.043 -0.025 0.656 |
Data set 7
probabilities 0.330 0.460 0.210
means ( -0.934 -0.943 -1.518)
( 0.346 1.475 -0.683)
( -0.126 -0.585 0.610)
covariances
0.759 0.036 0.204 | 0.775 0.052 -0.116 | 0.679 -0.071 0.036 |
0.036 0.263 -0.090 | 0.052 0.730 -0.003 | -0.071 0.554 -0.109 |
0.204 -0.090 0.568 | -0.116 -0.003 0.673 | 0.036 -0.109 0.492 |
Data set 8
probabilities 0.260 0.450 0.290
means ( 0.960 0.695 -0.394)
( 1.144 -0.419 -0.896)
( -0.358 0.652 1.171)
covariances
0.916 -0.113 -0.003 | 1.007 -0.112 -0.099 | 1.005 0.007 0.322 |
-0.113 0.578 0.137 | -0.112 1.226 0.009 | 0.007 0.288 0.150 |
-0.003 0.137 0.449 | -0.099 0.009 0.357 | 0.322 0.150 0.749 |
Data set 9
probabilities 0.420 0.300 0.280
means ( 0.241 0.414 -0.883)
( -0.491 -0.358 -0.112)
( -0.254 -0.527 0.619)
covariances
0.926 -0.045 0.002 | 0.761 0.072 -0.158 | 0.810 0.077 -0.086 |
-0.045 0.458 0.135 | 0.072 0.280 -0.146 | 0.077 0.933 0.091 |
0.002 0.135 0.910 | -0.158 -0.146 0.673 | -0.086 0.091 0.523 |
Data set 10
probabilities 0.280 0.280 0.440
means ( -0.782 0.210 -1.466)
( 0.580 -1.039 -0.995)
( 0.134 1.022 -1.591)
covariances
0.541 0.025 0.132 | 0.410 -0.017 0.098 | 0.704 -0.059 0.099 |
0.025 0.610 -0.175 | -0.017 0.434 0.133 | -0.059 0.346 -0.031 |
0.132 -0.175 0.965 | 0.098 0.133 1.026 | 0.099 -0.031 1.191 |
Data set 11
probabilities 0.530 0.240 0.230
means ( -0.880 -0.687 1.432)
( 0.933 0.769 -0.200)
( -1.703 -1.317 -1.182)
covariances
0.265 -0.060 -0.029 | 0.270 -0.091 -0.090 | 0.411 -0.010 -0.003 |
-0.060 0.236 0.094 | -0.091 0.473 0.043 | -0.010 0.350 -0.244 |
-0.029 0.094 0.588 | -0.090 0.043 0.738 | -0.003 -0.244 0.907 |
Data set 12
probabilities 0.390 0.320 0.290
means ( -0.263 -0.290 0.933)
( -0.311 0.772 -1.364)
( 1.235 1.073 0.318)
covariances
0.269 0.003 0.016 | 0.577 -0.251 0.016 | 1.279 -0.160 -0.027 |
0.003 0.602 -0.072 | -0.251 0.843 -0.019 | -0.160 0.995 0.031 |
0.016 -0.072 0.339 | 0.016 -0.019 0.379 | -0.027 0.031 0.525 |
Data set 13
probabilities 0.300 0.430 0.270
means ( -0.501 -1.362 -1.263)
( 0.513 -0.750 0.552)
( 1.142 -1.265 -0.450)
covariances
0.213 -0.039 -0.083 | 0.250 0.011 0.045 | 0.982 0.307 -0.399 |
-0.039 0.479 -0.005 | 0.011 0.505 -0.022 | 0.307 0.357 -0.008 |
-0.083 -0.005 0.613 | 0.045 -0.022 0.550 | -0.399 -0.008 0.847 |
Data set 14
probabilities 0.480 0.290 0.230
means ( -0.027 -1.164 1.481)
( 0.848 -0.047 -0.960)
( 1.690 0.267 -0.103)
covariances
0.772 -0.045 0.173 | 0.396 -0.009 0.108 | 0.548 -0.270 -0.147 |
-0.045 0.430 -0.110 | -0.009 0.997 0.063 | -0.270 1.243 0.086 |
0.173 -0.110 0.441 | 0.108 0.063 0.257 | -0.147 0.086 0.431 |
Data set 15
probabilities 0.320 0.420 0.260
means ( 0.087 0.416 -0.079)
( -0.676 -1.220 0.224)
( -0.787 0.226 0.136)
covariances
0.829 0.016 -0.146 | 0.852 0.382 -0.029 | 0.242 0.026 0.092 |
0.016 0.795 -0.034 | 0.382 1.687 0.018 | 0.026 0.320 -0.088 |
-0.146 -0.034 0.306 | -0.029 0.018 0.258 | 0.092 -0.088 0.609 |
Data set 16
probabilities 0.540 0.220 0.240
means ( 0.632 -0.888 -1.460)
( 0.170 0.915 0.909)
( -0.215 0.753 -1.001)
covariances
0.891 0.184 0.164 | 0.702 -0.102 0.163 | 0.439 0.073 -0.042 |
0.184 0.881 0.096 | -0.102 0.923 0.323 | 0.073 1.013 -0.385 |
0.164 0.096 0.931 | 0.163 0.323 0.887 | -0.042 -0.385 0.460 |
Data set 17
probabilities 0.330 0.390 0.280
means ( -0.679 -0.788 0.861)
( -1.240 1.005 -1.504)
( -0.461 0.316 0.735)
covariances
0.803 -0.047 -0.098 | 0.323 0.094 -0.011 | 1.006 0.061 0.312 |
-0.047 0.894 0.042 | 0.094 0.916 0.103 | 0.061 0.356 0.097 |
-0.098 0.042 0.474 | -0.011 0.103 0.770 | 0.312 0.097 0.809 |
Data set 18
probabilities 0.500 0.250 0.250
means ( -1.346 1.416 1.021)
( 1.086 -1.208 -1.547)
( 0.882 0.511 -0.274)
covariances
0.713 0.040 0.014 | 0.486 0.044 0.169 | 1.001 -0.065 0.171 |
0.040 0.466 -0.090 | 0.044 0.511 0.129 | -0.065 0.465 -0.162 |
0.014 -0.090 1.210 | 0.169 0.129 0.970 | 0.171 -0.162 0.771 |
Data set 19
probabilities 0.220 0.240 0.540
means ( 0.245 -1.201 1.192)
( -1.561 -1.268 -0.399)
( 0.552 -1.040 -0.684)
covariances
0.796 0.021 0.245 | 0.393 0.095 0.049 | 0.411 -0.013 0.049 |
0.021 0.427 -0.043 | 0.095 0.338 0.058 | -0.013 0.449 -0.030 |
0.245 -0.043 1.932 | 0.049 0.058 0.741 | 0.049 -0.030 0.748 |
Data set 20
probabilities 0.310 0.180 0.510
means ( 0.151 0.414 0.392)
( -0.241 -1.056 1.470)
( 0.056 1.216 -1.139)
covariances
0.392 -0.309 -0.036 | 0.603 -0.067 -0.157 | 0.731 -0.009 -0.072 |
-0.309 1.034 -0.090 | -0.067 0.479 -0.157 | -0.009 0.611 -0.055 |
-0.036 -0.090 0.988 | -0.157 -0.157 0.739 | -0.072 -0.055 0.525 |