SPR Fall 2003

Exam 2 Solution.

Twenty data sets were generated for this exam. Nine were used by students, but results for all twenty are included in this document. Each data set contains 100 three-dimensional vectors (three features per sample). There were three classes; each class generated normally distributed vectors with its own mean and covariance, and the probabilities of occurrence of the classes varied from set to set.
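Although the generator itself is not included here, a data set of this form can be produced along the following lines (a hedged sketch of the presumed process, not the actual generator; pp, mm, and vv are the parameter names used in the Programs section):

    % Sketch of the presumed generation process (an assumption).
    % pp: 1x3 class probabilities, mm: 3x3 means (one class per column),
    % vv: 3x3x3 covariance matrices.  Output dat is 4x100: three features
    % plus the 'true' class number in row 4.
    n   = 100;
    cp  = cumsum(pp);
    dat = zeros(4, n);
    for k = 1:n
        c = find(rand <= cp, 1);                  % draw the class label
        R = chol(vv(:,:,c));                      % upper factor, vv = R'*R
        dat(1:3, k) = mm(:,c) + R' * randn(3,1);  % Gaussian with mean mm(:,c)
        dat(4, k)   = c;
    end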

This document consists of three parts: the first describes the methods and results, the second describes the programs, and the third contains tables of results.

Methods and Results.

We will discuss the results for the second data set. The results for all data sets are tabulated in the third section.

Preliminary analysis

The frequencies of occurrence of the three classes are

probabilities 0.320 0.260 0.420

The means of the three classes are:

means ( -1.084 -1.431 0.082)

( -0.377 -0.391 -0.587)

( -1.486 -0.532 -1.204)

The covariance matrices, estimated from the data, are:

0.340 0.211 0.108 | 0.624 -0.270 -0.140 | 0.277 -0.025 0.024 |

0.211 1.541 -0.096 | -0.270 0.647 0.252 | -0.025 0.977 -0.058 |

0.108 -0.096 0.613 | -0.140 0.252 1.126 | 0.024 -0.058 0.366 |

We plot the three projections of the data below. From the y-z plot (z abscissa, y ordinate) we note that the green circles must be the first class, since this is the only class with a positive z mean. Class 3 has a lower x mean than class 2, so the blue pluses are class 2 and the red x’s are class 3.

There is substantial overlap between the classes. The variances differ from class to class, and also from component to component within a class. Class 1 has the largest variance in the y component, and one can see a correspondingly greater dispersion in the plot, though it is not striking.

We first attempt the Fisher classifier, which estimates a common covariance matrix for all classes and uses three discriminant functions. The equations for this are in section 4.3.3. We obtain a resubstitution error frequency of 0.25 and a cross-validation error frequency of 0.27.
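As a rough sketch of what such a classifier computes (one plausible reading of the section 4.3.3 equations, not the course code; X is assumed to be a 3xN feature matrix and y a 1xN label vector):

    % Linear discriminants with a pooled covariance (sketch).
    N = size(X, 2);
    W = zeros(3, 3);
    for i = 1:3
        Xi     = X(:, y == i);
        p(i)   = size(Xi, 2) / N;       % class probability estimate
        m(:,i) = mean(Xi, 2);           % class mean
        W      = W + p(i) * cov(Xi');   % pooled within-class covariance
    end
    Wi = inv(W);
    for i = 1:3                         % g_i(x) = a_i'*x + b_i
        a(:,i) = Wi * m(:,i);
        b(i)   = -0.5 * m(:,i)' * Wi * m(:,i) + log(p(i));
    end
    [~, cls]  = max(a' * X + repmat(b(:), 1, N));  % largest discriminant wins
    err_resub = sum(cls ~= y) / N;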

Since the variances are different, we next attempt a quadratic classifier, described in section 2.2.1. We find a resubstitution error frequency of 0.22 and a cross-validation error frequency of 0.27. These error frequencies are not significantly different from those of the linear classifier, which suggests that the differences between the class covariances are not big enough to matter. This is surprising, since the variance of the third coordinate of class 2 is three times that of class 3.
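For reference, the standard Gaussian quadratic discriminant for class i, with mean m_i, covariance S_i, and class probability p_i, is

g_i(x) = -(1/2) (x - m_i)' inv(S_i) (x - m_i) - (1/2) log det(S_i) + log p_i,

and a vector x is assigned to the class with the largest g_i(x). When all classes share one covariance matrix the quadratic terms are identical for every class and cancel in the comparison, which is why the linear and quadratic classifiers behave alike when the class covariances are similar.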

As a point of information, we also compute the error frequency of a quadratic classifier whose parameters (class probabilities, means, covariance matrices) are those used to generate the data; this could not be done by the students. Incidentally, the classifier with these parameters is the optimal (Bayes) classifier. The error frequency obtained is 0.25, consistent with the other results.

We attempt two methods of finding a confidence interval for the resubstitution error probability. The first assumes that the resubstitution errors are produced by Bernoulli trials and that the central limit theorem holds. Under these assumptions the 95% confidence interval is

p ± 1.96 s

where p is the resubstitution error frequency and s = sqrt(p(1-p)/n). This gives the confidence interval [0.139, 0.301]. The second method is to generate 200 bootstrap samples of the data, train a classifier on each sample, and find its error frequency. Each sample contains 100 feature vectors drawn with replacement, just as many as in the original data. This yields 200 values of the error frequency. We sort them and take the 5th lowest and the 5th highest values: this is the bootstrap estimate of the 95% confidence interval, which runs from cumulative probability 2.5% = 5/200 to 97.5% = 195/200. This interval is [0.140, 0.280], quite close to the Bernoulli-model estimate.
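Both computations can be sketched as follows (assuming p is the resubstitution error frequency, n = 100, and a hypothetical helper train_and_test that refits the quadratic classifier on a resampled data set and returns its error frequency):

    % Normal-approximation (Bernoulli model) interval (sketch).
    n = 100;
    s = sqrt(p * (1 - p) / n);
    ci_normal = [p - 1.96*s, p + 1.96*s];

    % Percentile bootstrap: 200 resamples drawn with replacement.
    B  = 200;
    pe = zeros(1, B);
    for b = 1:B
        idx   = ceil(n * rand(1, n));          % resample columns with replacement
        pe(b) = train_and_test(dat(:, idx));   % hypothetical refit-and-score helper
    end
    pe = sort(pe);
    ci_boot = [pe(5), pe(B-4)];                % 5th lowest, 5th from the top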

Most students in the class estimated the bootstrap interval by taking the bootstrap error frequencies, computing their mean and standard deviation, and then forming the interval under a normal assumption. This is not as powerful as reading the interval from the cumulative distribution of the error frequencies, which is a non-parametric method and works even when the distribution is not normal.

Programs

The programs are in prog.zip, the data sets in data.zip.

test_script computes the assorted error probabilities. Function resub_crossval computes both the resubstitution and cross-validation error frequencies for the Fisher linear classifier, and resub_crossvalq does the same for the quadratic classifier. As a last step, the data-generation parameters are obtained from variables pp, mm, and vv. These parameters are fed into function make_quad_discrim, which computes arrays suitable for the discriminant functions. Those arrays and the data are fed to function quadclassify, which classifies the data. The number of errors is found by comparing these results with dat(4,:), the row of the data matrix that contains the ‘true’ class numbers. The results of these calculations are stored in array errs and also written to a text file called classresults.txt.
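The last step might look roughly like this (the argument lists of make_quad_discrim and quadclassify are assumptions; the variable names come from the description above):

    % Sketch of the final step of test_script (argument lists assumed).
    [mat, vecs, cons] = make_quad_discrim(pp, mm, vv);  % true-parameter arrays
    cls  = quadclassify(mat, vecs, cons, dat(1:3,:));   % classify every vector
    nerr = sum(cls ~= dat(4,:));                        % compare with true labels
    err_bayes = nerr / size(dat, 2);                    % stored in errs in the script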

boot_script computes confidence intervals for the resubstitution error frequency. The class probabilities, means, and covariances are estimated in function quadest, and quadclassify is used to classify the data. Variables cl and ch are the low and high ends of the confidence interval computed under the Gaussian assumption. Subsequently we go through a for loop (200 iterations) that randomly selects data vectors (features plus category) with replacement, performs the estimation and classification, and stores the error frequencies in array perr1. After the loop the array is sorted and the bootstrap confidence interval values [clb, clh] are read from the appropriate locations.

Function resub_crossval uses function Fisher to compute the linear discriminant function parameters. The classification and error computation are done with MATLAB statements: first the resubstitution error is found from the full data set; then a loop generates the leave-one-out data sets for training and classifies the left-out vector (hold).
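The leave-one-out loop can be sketched as follows (a reconstruction: the argument lists of Fisher and the scoring helper fisherclassify are assumptions, and held stands for the script's hold):

    % Leave-one-out cross-validation (sketch).
    n    = size(dat, 2);
    nerr = 0;
    for k = 1:n
        held  = dat(:, k);                       % the left-out vector
        train = dat(:, [1:k-1, k+1:n]);          % the other n-1 vectors
        prm   = Fisher(train);                   % retrain without vector k
        cls   = fisherclassify(prm, held(1:3));  % hypothetical scoring helper
        nerr  = nerr + (cls ~= held(4));
    end
    err_crossval = nerr / n;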

Function resub_crossvalq works in a similar way: training is done in function quadest and classification in function quadclassify. Function quadest pulls out the data for each category from the full data set, estimates the class probability by counting the number of elements, and then computes the mean and covariance. The outputs of the function are mat, a 3-index array that contains the three inverse covariance matrices; vecs, which contains the three vectors used for the linear portion of the quadratic discriminants; and cons, which holds the constants of the three discriminant functions. Observe that the constants include log(p), the logarithm of the class probability, as required by the Bayes theory. quadclassify just plugs feature vectors into the discriminant functions.
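Expanding the quadratic discriminant given earlier shows what the three arrays must hold. A per-class sketch, consistent with this description though not the actual code (Xi is the 3 x n_i data of class i and p its estimated probability):

    % Per-class arrays for the quadratic discriminants (reconstruction).
    m          = mean(Xi, 2);
    S          = cov(Xi');
    mat(:,:,i) = inv(S);                         % inverse covariance matrix
    vecs(:,i)  = inv(S) * m;                     % linear portion
    cons(i)    = -0.5 * m' * inv(S) * m ...
                 - 0.5 * log(det(S)) + log(p);   % constant, including log(p)
    % quadclassify then evaluates, for each feature vector x,
    %   g_i(x) = -0.5 * x' * mat(:,:,i) * x + vecs(:,i)' * x + cons(i)
    % and assigns x to the class with the largest g_i.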

Function Fisher computes parameters for the Fisher linear discriminant functions. The data are split into categories, and probabilities, means, and covariances are computed. The same loop also accumulates the within-class covariance from the class covariances; subsequently the between-class covariance is found from the means. Function eig solves the generalized eigenvalue problem. The rank of the between-class covariance matrix (dcov) is two, so only two rows of the eigenvector matrix are kept.
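The eigenvector step might look like this (a sketch; dcov is the name the text gives the between-class covariance, while wcov for the within-class covariance and X for the 3xN data are assumptions):

    % Fisher directions from the generalized eigenvalue problem (sketch).
    [V, D] = eig(dcov, wcov);                 % solves dcov*v = lambda*wcov*v
    [lam, order] = sort(diag(D), 'descend');  % rank(dcov) = 2: one lambda is ~0
    Wf = V(:, order(1:2))';                   % keep two directions, as rows
    Y  = Wf * X;                              % two-dimensional projection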

Script param_script computes and prints the class parameters, and calls on plotdat to do the plotting.

Tables of Results

Classifier Results

All entries are error frequencies. The Bayes classifier uses the true generation parameters.

Dataset / Bayes classifier / Fisher resub / Fisher crossval / Quadratic resub / Quadratic crossval
1 / 0.270 / 0.250 / 0.290 / 0.260 / 0.310
2 / 0.250 / 0.250 / 0.270 / 0.220 / 0.270
3 / 0.170 / 0.430 / 0.420 / 0.190 / 0.230
4 / 0.080 / 0.500 / 0.530 / 0.050 / 0.070
5 / 0.150 / 0.480 / 0.490 / 0.150 / 0.200
6 / 0.100 / 0.120 / 0.150 / 0.070 / 0.180
7 / 0.070 / 0.060 / 0.070 / 0.040 / 0.070
8 / 0.230 / 0.510 / 0.510 / 0.200 / 0.260
9 / 0.320 / 0.590 / 0.570 / 0.290 / 0.370
10 / 0.190 / 0.430 / 0.430 / 0.190 / 0.230
11 / 0.030 / 0.250 / 0.240 / 0.020 / 0.030
12 / 0.060 / 0.080 / 0.100 / 0.060 / 0.090
13 / 0.170 / 0.380 / 0.380 / 0.180 / 0.200
14 / 0.100 / 0.490 / 0.520 / 0.110 / 0.150
15 / 0.270 / 0.590 / 0.520 / 0.260 / 0.270
16 / 0.140 / 0.420 / 0.420 / 0.110 / 0.130
17 / 0.160 / 0.530 / 0.570 / 0.180 / 0.240
18 / 0.080 / 0.400 / 0.410 / 0.060 / 0.090
19 / 0.210 / 0.370 / 0.380 / 0.160 / 0.200
20 / 0.210 / 0.440 / 0.430 / 0.200 / 0.240

Confidence Intervals

95% confidence intervals for the quadratic resubstitution error frequency, from the normal approximation and from the bootstrap:

Dataset / Quad resub / Normal low / Normal high / Boot low / Boot high
1 / 0.260 / 0.174 / 0.346 / 0.160 / 0.310
2 / 0.220 / 0.139 / 0.301 / 0.140 / 0.280
3 / 0.190 / 0.113 / 0.267 / 0.080 / 0.260
4 / 0.050 / 0.007 / 0.093 / 0.010 / 0.130
5 / 0.150 / 0.080 / 0.220 / 0.060 / 0.200
6 / 0.070 / 0.020 / 0.120 / 0.030 / 0.150
7 / 0.040 / 0.002 / 0.078 / 0.010 / 0.100
8 / 0.200 / 0.122 / 0.278 / 0.080 / 0.250
9 / 0.290 / 0.201 / 0.379 / 0.170 / 0.370
10 / 0.190 / 0.113 / 0.267 / 0.080 / 0.250
11 / 0.020 / -0.007 / 0.047 / 0.000 / 0.040
12 / 0.060 / 0.013 / 0.107 / 0.020 / 0.130
13 / 0.180 / 0.105 / 0.255 / 0.070 / 0.210
14 / 0.110 / 0.049 / 0.171 / 0.030 / 0.150
15 / 0.260 / 0.174 / 0.346 / 0.160 / 0.340
16 / 0.110 / 0.049 / 0.171 / 0.040 / 0.160
17 / 0.180 / 0.105 / 0.255 / 0.070 / 0.230
18 / 0.060 / 0.013 / 0.107 / 0.010 / 0.120
19 / 0.160 / 0.088 / 0.232 / 0.070 / 0.240
20 / 0.200 / 0.122 / 0.278 / 0.110 / 0.270

Data set 1

probabilities 0.330 0.330 0.340

means ( 0.022 0.193 1.129)

( -0.482 -0.252 -0.481)

( 0.323 -1.145 1.110)

covariances

0.499 -0.065 0.138 | 0.397 0.089 0.127 | 0.754 0.151 0.192 |

-0.065 0.376 -0.147 | 0.089 0.533 0.087 | 0.151 0.426 0.070 |

0.138 -0.147 0.935 | 0.127 0.087 1.079 | 0.192 0.070 0.505 |

Data set 2

probabilities 0.320 0.260 0.420

means ( -1.084 -1.431 0.082)

( -0.377 -0.391 -0.587)

( -1.486 -0.532 -1.204)

covariances

0.340 0.211 0.108 | 0.624 -0.270 -0.140 | 0.277 -0.025 0.024 |

0.211 1.541 -0.096 | -0.270 0.647 0.252 | -0.025 0.977 -0.058 |

0.108 -0.096 0.613 | -0.140 0.252 1.126 | 0.024 -0.058 0.366 |

Data set 3

probabilities 0.390 0.290 0.320

means ( 1.290 -0.262 0.131)

( -0.099 -0.939 -0.555)

( 1.571 -1.031 -1.608)

covariances

0.556 0.070 0.054 | 0.374 0.012 0.054 | 1.133 0.028 0.158 |

0.070 0.268 0.061 | 0.012 0.365 0.039 | 0.028 0.986 0.033 |

0.054 0.061 0.532 | 0.054 0.039 0.630 | 0.158 0.033 0.702 |

Data set 4

probabilities 0.360 0.290 0.350

means ( 0.305 1.018 -0.788)

( 0.691 1.214 0.940)

( -0.509 -1.509 1.319)

covariances

0.339 0.013 -0.069 | 0.665 0.018 0.105 | 0.679 -0.118 0.117 |

0.013 0.268 -0.015 | 0.018 0.611 0.052 | -0.118 0.401 0.039 |

-0.069 -0.015 0.918 | 0.105 0.052 0.412 | 0.117 0.039 1.421 |

Data set 5

probabilities 0.350 0.360 0.290

means ( 0.207 0.341 -1.053)

( -0.594 -1.479 0.674)

( 0.023 1.348 -0.864)

covariances

0.296 0.122 0.061 | 0.273 0.017 0.078 | 0.633 -0.001 -0.057 |

0.122 0.493 -0.019 | 0.017 0.565 -0.016 | -0.001 0.774 0.089 |

0.061 -0.019 0.450 | 0.078 -0.016 0.302 | -0.057 0.089 0.632 |

Data set 6

probabilities 0.360 0.310 0.330

means ( -0.585 0.540 0.938)

( 0.547 0.854 -0.994)

( 0.546 -0.992 1.525)

covariances

0.458 0.069 0.077 | 0.537 0.104 0.106 | 0.298 -0.083 0.043 |

0.069 1.221 -0.077 | 0.104 0.764 0.062 | -0.083 0.404 -0.025 |

0.077 -0.077 0.644 | 0.106 0.062 0.648 | 0.043 -0.025 0.656 |

Data set 7

probabilities 0.330 0.460 0.210

means ( -0.934 -0.943 -1.518)

( 0.346 1.475 -0.683)

( -0.126 -0.585 0.610)

covariances

0.759 0.036 0.204 | 0.775 0.052 -0.116 | 0.679 -0.071 0.036 |

0.036 0.263 -0.090 | 0.052 0.730 -0.003 | -0.071 0.554 -0.109 |

0.204 -0.090 0.568 | -0.116 -0.003 0.673 | 0.036 -0.109 0.492 |

Data set 8

probabilities 0.260 0.450 0.290

means ( 0.960 0.695 -0.394)

( 1.144 -0.419 -0.896)

( -0.358 0.652 1.171)

covariances

0.916 -0.113 -0.003 | 1.007 -0.112 -0.099 | 1.005 0.007 0.322 |

-0.113 0.578 0.137 | -0.112 1.226 0.009 | 0.007 0.288 0.150 |

-0.003 0.137 0.449 | -0.099 0.009 0.357 | 0.322 0.150 0.749 |

Data set 9

probabilities 0.420 0.300 0.280

means ( 0.241 0.414 -0.883)

( -0.491 -0.358 -0.112)

( -0.254 -0.527 0.619)

covariances

0.926 -0.045 0.002 | 0.761 0.072 -0.158 | 0.810 0.077 -0.086 |

-0.045 0.458 0.135 | 0.072 0.280 -0.146 | 0.077 0.933 0.091 |

0.002 0.135 0.910 | -0.158 -0.146 0.673 | -0.086 0.091 0.523 |

Data set 10

probabilities 0.280 0.280 0.440

means ( -0.782 0.210 -1.466)

( 0.580 -1.039 -0.995)

( 0.134 1.022 -1.591)

covariances

0.541 0.025 0.132 | 0.410 -0.017 0.098 | 0.704 -0.059 0.099 |

0.025 0.610 -0.175 | -0.017 0.434 0.133 | -0.059 0.346 -0.031 |

0.132 -0.175 0.965 | 0.098 0.133 1.026 | 0.099 -0.031 1.191 |

Data set 11

probabilities 0.530 0.240 0.230

means ( -0.880 -0.687 1.432)

( 0.933 0.769 -0.200)

( -1.703 -1.317 -1.182)

covariances

0.265 -0.060 -0.029 | 0.270 -0.091 -0.090 | 0.411 -0.010 -0.003 |

-0.060 0.236 0.094 | -0.091 0.473 0.043 | -0.010 0.350 -0.244 |

-0.029 0.094 0.588 | -0.090 0.043 0.738 | -0.003 -0.244 0.907 |

Data set 12

probabilities 0.390 0.320 0.290

means ( -0.263 -0.290 0.933)

( -0.311 0.772 -1.364)

( 1.235 1.073 0.318)

covariances

0.269 0.003 0.016 | 0.577 -0.251 0.016 | 1.279 -0.160 -0.027 |

0.003 0.602 -0.072 | -0.251 0.843 -0.019 | -0.160 0.995 0.031 |

0.016 -0.072 0.339 | 0.016 -0.019 0.379 | -0.027 0.031 0.525 |

Data set 13

probabilities 0.300 0.430 0.270

means ( -0.501 -1.362 -1.263)

( 0.513 -0.750 0.552)

( 1.142 -1.265 -0.450)

covariances

0.213 -0.039 -0.083 | 0.250 0.011 0.045 | 0.982 0.307 -0.399 |

-0.039 0.479 -0.005 | 0.011 0.505 -0.022 | 0.307 0.357 -0.008 |

-0.083 -0.005 0.613 | 0.045 -0.022 0.550 | -0.399 -0.008 0.847 |

Data set 14

probabilities 0.480 0.290 0.230

means ( -0.027 -1.164 1.481)

( 0.848 -0.047 -0.960)

( 1.690 0.267 -0.103)

covariances

0.772 -0.045 0.173 | 0.396 -0.009 0.108 | 0.548 -0.270 -0.147 |

-0.045 0.430 -0.110 | -0.009 0.997 0.063 | -0.270 1.243 0.086 |

0.173 -0.110 0.441 | 0.108 0.063 0.257 | -0.147 0.086 0.431 |

Data set 15

probabilities 0.320 0.420 0.260

means ( 0.087 0.416 -0.079)

( -0.676 -1.220 0.224)

( -0.787 0.226 0.136)

covariances

0.829 0.016 -0.146 | 0.852 0.382 -0.029 | 0.242 0.026 0.092 |

0.016 0.795 -0.034 | 0.382 1.687 0.018 | 0.026 0.320 -0.088 |

-0.146 -0.034 0.306 | -0.029 0.018 0.258 | 0.092 -0.088 0.609 |

Data set 16

probabilities 0.540 0.220 0.240

means ( 0.632 -0.888 -1.460)

( 0.170 0.915 0.909)

( -0.215 0.753 -1.001)

covariances

0.891 0.184 0.164 | 0.702 -0.102 0.163 | 0.439 0.073 -0.042 |

0.184 0.881 0.096 | -0.102 0.923 0.323 | 0.073 1.013 -0.385 |

0.164 0.096 0.931 | 0.163 0.323 0.887 | -0.042 -0.385 0.460 |

Data set 17

probabilities 0.330 0.390 0.280

means ( -0.679 -0.788 0.861)

( -1.240 1.005 -1.504)

( -0.461 0.316 0.735)

covariances

0.803 -0.047 -0.098 | 0.323 0.094 -0.011 | 1.006 0.061 0.312 |

-0.047 0.894 0.042 | 0.094 0.916 0.103 | 0.061 0.356 0.097 |

-0.098 0.042 0.474 | -0.011 0.103 0.770 | 0.312 0.097 0.809 |

Data set 18

probabilities 0.500 0.250 0.250

means ( -1.346 1.416 1.021)

( 1.086 -1.208 -1.547)

( 0.882 0.511 -0.274)

covariances

0.713 0.040 0.014 | 0.486 0.044 0.169 | 1.001 -0.065 0.171 |

0.040 0.466 -0.090 | 0.044 0.511 0.129 | -0.065 0.465 -0.162 |

0.014 -0.090 1.210 | 0.169 0.129 0.970 | 0.171 -0.162 0.771 |

Data set 19

probabilities 0.220 0.240 0.540

means ( 0.245 -1.201 1.192)

( -1.561 -1.268 -0.399)

( 0.552 -1.040 -0.684)

covariances

0.796 0.021 0.245 | 0.393 0.095 0.049 | 0.411 -0.013 0.049 |

0.021 0.427 -0.043 | 0.095 0.338 0.058 | -0.013 0.449 -0.030 |

0.245 -0.043 1.932 | 0.049 0.058 0.741 | 0.049 -0.030 0.748 |

Data set 20

probabilities 0.310 0.180 0.510

means ( 0.151 0.414 0.392)

( -0.241 -1.056 1.470)

( 0.056 1.216 -1.139)

covariances

0.392 -0.309 -0.036 | 0.603 -0.067 -0.157 | 0.731 -0.009 -0.072 |

-0.309 1.034 -0.090 | -0.067 0.479 -0.157 | -0.009 0.611 -0.055 |

-0.036 -0.090 0.988 | -0.157 -0.157 0.739 | -0.072 -0.055 0.525 |
