Last Name: Leske

First Name: Cavin

UW ID: 902 110 0962

Email:

ECE 539-1

Final Project Report

Dec. 21, 2001


Table of Contents

Introduction
Work Performed
Results
Discussion
Works Cited
Appendix - Code
    bp.m
    knndemo.m


Introduction

With the human population continuously on the rise, it is becoming increasingly important to manage natural resources effectively. This includes maintaining well-balanced, preserved ecosystems, especially in forests. When poor decisions are made, the consequences can be disastrous. For example, logging operations in Canada have devastated large areas and caused bird and animal populations to dwindle. Accurate information about forests could help prevent similar incidents from occurring, but this kind of detailed information generally does not exist. It might, however, be possible to obtain the necessary data through the use of predictive models.

In 1998, Dr. Jock Blackard of Colorado State University investigated this subject in his Ph.D. dissertation entitled “Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types.” The area he selected for study was the Roosevelt National Forest in northern Colorado, chosen because it is relatively remote and undisturbed by human activity. His data set contained over 581,000 feature vectors, each representing a 30-meter by 30-meter plot of land. Each feature vector consisted of 12 independent cartographic variables, and each plot was labeled with one of 7 major forest cover types. Cartographic variables are simply physical properties of the land, and the forest cover type is the predominant species of tree growing on the plot. The cartographic variables were derived from data originally collected by the US Geological Survey and the US Forest Service, while the forest cover types were determined from US Forest Service Region 2 Resource Information System data. Descriptions of the attributes in the data file are listed below.

Name / Data Type / Measurement / Description
Elevation / Quantitative / Meters / Elevation in meters
Aspect / Quantitative / Azimuth / Aspect in degrees azimuth
Slope / Quantitative / Degrees / Slope in degrees
Horizontal Distance to Hydrology / Quantitative / Meters / Horizontal distance to nearest surface water features
Vertical Distance to Hydrology / Quantitative / Meters / Vertical distance to nearest surface water features
Horizontal Distance to Roadways / Quantitative / Meters / Horizontal distance to nearest roadway
Hillshade – 9 am / Quantitative / 0 to 255 index / Hillshade index at 9 am, summer solstice
Hillshade – noon / Quantitative / 0 to 255 index / Hillshade index at noon, summer solstice
Hillshade – 3 pm / Quantitative / 0 to 255 index / Hillshade index at 3 pm, summer solstice
Horizontal Distance to Fire Points / Quantitative / Meters / Horizontal distance to nearest wildfire ignition points
Wilderness Area (4 binary columns) / Qualitative / 0 (absence) or 1 (presence) / Wilderness area designation
Soil Type (40 binary columns) / Qualitative / 0 (absence) or 1 (presence) / Soil type designation
Cover Type / Integer / 1 to 7 (class) / Forest cover type designation

The forest cover types consisted of 7 classes: spruce/fir, lodgepole pine, ponderosa pine, cottonwood/willow, aspen, Douglas fir, and krummholz. It should be noted that the wilderness area and soil type designations are qualitative rather than quantitative. There were 4 wilderness areas and 40 soil types included in the data file, and each had its own binary column. For any particular feature vector, only 1 column in each of these groups could be a binary 1, while the remaining columns would be a binary 0. As a result, the 12 cartographic variables occupy 54 columns in the data file. This was a very important consideration when using the data as the input to a neural network, and will be discussed in detail in the following pages.

Dr. Blackard concluded in his dissertation that a neural network model predicted the type of forest cover more accurately than the traditional discriminant model. Specifically, he used a back-propagation network that yielded a classification rate of 70%, while the traditional discriminant analysis method demonstrated only 58% accuracy. He made his data set and findings publicly available on a “machine-learning-databases” FTP site at the University of California, Irvine. However, he did not describe the methods he used to configure the neural network.

While Dr. Blackard compared two different predictive models, I focused on just the neural network aspect for my project, using the same data set as in his dissertation. I investigated the effectiveness of two different classifiers: a feedforward network trained with the back-propagation algorithm, and the k-nearest neighbor classifier. My original goal was to match or exceed the best classification rate reported by Dr. Blackard. Using the back-propagation algorithm, I hoped to examine Dr. Blackard’s approach to the problem and possibly improve on it. For contrast, I wanted to examine the k-nearest neighbor classifier to see whether it was faster, simpler, or more accurate.

Work Performed

Before the data could be applied to a neural network, it first had to be separated into a training set and a testing set and then preprocessed. I chose to split the data in a similar manner to Dr. Blackard. I used the same training set he used, consisting of the first 11,340 entries. This number is not as arbitrary as it may seem: it was chosen because the class distribution is identical for all 7 classes, with each class having 1,620 instances in the training set. This way the neural network would be trained evenly. The testing set I used was much smaller than Dr. Blackard’s, however. Instead of using the remaining 570,000 entries, I used only the 10,000 entries immediately following the training set. If I had used all 570,000 entries, the computing time needed to preprocess the testing set would have been several days. By reducing the testing set to 10,000 feature vectors, there is enough information to adequately represent the entire data set while keeping the preprocessing time reasonable.
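A minimal sketch of this split is shown below, assuming the data set has been saved locally as covtype.data; the file name and variable names are illustrative rather than the exact ones used in my m-files.

% Load the full data set: 54 feature columns plus 1 class column per row.
data = csvread('covtype.data');
% The first 11,340 rows form the training set (1,620 instances of each class).
train = data(1:11340, :);
% The next 10,000 rows form the reduced testing set.
test = data(11341:21340, :);
% Separate the features (columns 1-54) from the class label (column 55).
trainX = train(:, 1:54);   trainY = train(:, 55);
testX  = test(:, 1:54);    testY  = test(:, 55);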

To preprocess the data I had to consider the back-propagation algorithm and the k-nearest neighbor classifier separately. The applications I used to implement these pattern classifiers were modifications of MATLAB m-files written by Dr. Yu Hen Hu; the code can be found in the Appendix. I will begin by describing what was involved in using the k-nearest neighbor classifier. This pattern classification technique was the easiest to use, since it accepts a wide range of data. Scaling is not necessary because the algorithm simply defines regions that encompass the areas of concentration for each dimension; whichever region a particular feature vector is closest to determines the class assigned to that vector. Only one modification was needed to preprocess the training and testing files after they were loaded into MATLAB. Since Professor Hu’s knn.m m-file, like most pattern classification utilities, requires that the outputs use 1-of-N encoding, the target had to be changed from one column to seven columns. This is because the forest cover type is originally defined as an integer between 1 and 7, and thus occupies only one column in the data file. To deal with this, I added a segment of code to knndemo.m, the program that sets up the training and testing files before calling knn.m, which replaces the single integer column with seven binary columns containing a 1 in the appropriate position. After this was done, the k-nearest neighbor classifier could be run properly.
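The conversion itself amounts to the following sketch, reusing the illustrative variable names from above rather than those actually used in knndemo.m.

% Convert the integer class column (values 1-7) to 1-of-N encoding:
% each row gets a 1 in the column matching its class and 0 elsewhere.
nclass  = 7;
targets = zeros(length(trainY), nclass);
for i = 1:length(trainY)
    targets(i, trainY(i)) = 1;
end
% The 54 feature columns and these 7 target columns are then passed to knn.m.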

The back-propagation algorithm required significantly more preprocessing than the k-nearest neighbor classifier. To begin with, code had to be added to bpconfig.m to convert the outputs of the training and testing files to 1-of-N encoding. This is exactly the conversion described above, and was necessary because the m-file bptest.m assumes the outputs use 1-of-N encoding. Next, I had to consider the training file. As a rule, the training file should maximize information content; in other words, each training vector applied to the back-propagation algorithm should be chosen so that its information content is the largest possible for the given task (Haykin, 1999). Using the first 11,340 records from the data file satisfied this criterion: since each forest cover type was represented equally, each vector contributed the same amount to the overall network and no class was trained more than any other. Lastly, the inputs had to be normalized. This was simply a matter of computing the mean of each feature dimension and subtracting it from every vector, making all the inputs zero mean. The code to accomplish these objectives was added to bpconfig.m.
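A sketch of the zero-mean step is shown below, using the same illustrative variable names as above; here I assume the mean computed from the training set is also subtracted from the testing inputs.

% Compute the mean of each feature dimension from the training set,
% then subtract it from every input vector so the inputs are zero mean.
mu     = mean(trainX, 1);                          % 1 x 54 row of column means
trainX = trainX - ones(size(trainX, 1), 1) * mu;
testX  = testX  - ones(size(testX, 1), 1) * mu;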

Once these modifications were made to the m-files, the next step was to perform extensive trials to determine the most accurate classifier. The k-nearest neighbor classifier did not require any configuration. However, for every number of nearest neighbors used, the computing time was quite lengthy. For example, it took over 30 minutes just to compute the classification matrix using one nearest neighbor, and similar times were required for other values of k. Since I repeated every trial 5 times to obtain a reasonable average, the time spent for each number of nearest neighbors was almost 3 hours. Because of this, I decided to use no more than 10 nearest neighbors.
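For reference, the core computation behind the k-nearest neighbor classifier is sketched below. This is a generic illustration of the distance-and-vote procedure, not the actual code in Dr. Hu’s knn.m, and it reuses the illustrative variable names from the earlier sketches.

% Classify each testing vector by a majority vote among its k nearest
% training vectors, using squared Euclidean distance.
k = 1;
predicted = zeros(size(testX, 1), 1);
for i = 1:size(testX, 1)
    diff = trainX - ones(size(trainX, 1), 1) * testX(i, :);
    d = sum(diff .^ 2, 2);                 % distance to every training vector
    [junk, idx] = sort(d);                 % nearest neighbors first
    counts = zeros(1, 7);                  % votes for each of the 7 classes
    for j = 1:k
        counts(trainY(idx(j))) = counts(trainY(idx(j))) + 1;
    end
    [junk, predicted(i)] = max(counts);    % class with the most votes
end
rate = mean(predicted == testY);           % overall classification rate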

The back-propagation algorithm required much more experimentation, since there are essentially endless configurations. For all the trials I performed, I scaled the inputs to the range -5 to +5 and the outputs to the range 0.2 to 0.8. Although I tried other activation functions, I found that the best results were obtained with the hyperbolic tangent for the hidden layers and the sigmoid for the output layer. I also found that using a separate fixed tuning set did not improve the classification rate, so I used the entire training set to estimate the training error. I chose to stop training if no improvement was observed after 100 iterations, and to train for no more than 10,000 epochs. I used 10 epochs between convergence checks and an epoch size of 64. With these choices fixed for every trial, the remaining parameters were the number of layers, the number of neurons in each hidden layer, and the learning rate and momentum constants. I varied these parameters so as to cover the widest possible range of configurations.
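The linear scaling can be sketched as follows, applied directly to the input and target matrices from the earlier sketches; the variable names are again illustrative, and the actual scaling is performed inside bpconfig.m.

% Linearly scale each input dimension to the range [-5, +5].
xmin  = min(trainX, [], 1);               % per-column minimum (training set)
xmax  = max(trainX, [], 1);               % per-column maximum
range = xmax - xmin;
range(range == 0) = 1;                    % guard against constant columns
n      = size(trainX, 1);
trainX = -5 + 10 * (trainX - ones(n, 1) * xmin) ./ (ones(n, 1) * range);
% The targets use 1-of-N encoding, so mapping {0, 1} onto [0.2, 0.8] is simply:
targets = 0.2 + 0.6 * targets;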

Results

The classification rates of the k-nearest neighbor classifier are listed in the table below for 1 up to 10 nearest neighbors. Each classification rate was averaged over five trials.

# of nearest neighbors / Average classification rate
1 / 72.51%
2 / 70.11%
3 / 70.18%
4 / 68.61%
5 / 66.71%
6 / 67.62%
7 / 67.89%
8 / 67.67%
9 / 66.83%
10 / 66.62%

As the table illustrates, the classification rate tended to decrease as more nearest neighbors were used; hence, I concluded that using more than 10 nearest neighbors would not give better results. The best classification rate, an average of 72.51%, was obtained with one nearest neighbor.

For the many trials I performed with the back-propagation algorithm, I discovered that a learning rate of 0.4 or greater prevented any learning from happening, so all the results included here use learning rates below 0.4. Further, the algorithm never converged within 10,000 epochs, the maximum number of epochs I allowed. I did not expect convergence, however, since I did not anticipate training or testing classification rates much higher than 70%. In fact, the training process was usually terminated after about 7,000 epochs, because by that point the training error had leveled off and was no longer decreasing significantly. I decided that if no improvement was observed after 100 iterations, further training would not appreciably affect the final results.

For each configuration included below, I averaged the classification rates over 10 trials. I used 10 trials, as opposed to the 5 used for the k-nearest neighbor classifier, because the results were much less consistent with the back-propagation algorithm and therefore required more averaging. The training classification rate was consistently around 70%, but the testing classification rate varied by as much as 15% for the same configuration. My general method for testing different configurations was to first find the best combination of learning rate and momentum constants, and then adjust the number of layers and the number of neurons in each hidden layer. I only performed averaging on a configuration if it gave a decent classification rate on the first trial. The following table lists some of the configurations I used and the corresponding average training and testing classification rates. It would be impractical to list every configuration I tried, so I have only shown the ones that gave reasonable results.

Learning Rate (η) / Momentum Constant (μ) / # of Layers / # of Neurons in each hidden layer / Training Classification Rate / Testing Classification Rate
0.1 / 0 / 4 / 5 / 62.33% / 56.49%
0.2 / 0 / 4 / 5 / 60.21% / 51.47%
0.01 / 0.8 / 4 / 5 / 63.71% / 53.60%
0.01 / 0.6 / 4 / 5 / 64.05% / 49.82%
0.01 / 0.4 / 4 / 5 / 61.40% / 47.77%
0.01 / 0.2 / 4 / 5 / 66.93% / 50.52%
0.04 / 0.2 / 4 / 5 / 66.01% / 64.82%
0.1 / 0.2 / 4 / 5 / 68.89% / 57.51%
0.1 / 0.1 / 4 / 5 / 64.78% / 53.87%
0.1 / 0.05 / 4 / 5 / 66.64% / 59.05%
0.15 / 0.05 / 4 / 5 / 68.22% / 61.93%
0.2 / 0.05 / 4 / 5 / 67.66% / 56.08%
0.15 / 0.05 / 4 / 10 / 71.95% / 56.94%
0.15 / 0.05 / 4 / 20 / 69.01% / 53.86%
0.15 / 0.05 / 5 / 10 / 67.31% / 59.29%
0.15 / 0.05 / 5 / 20 / 70.33% / 64.83%
0.15 / 0.05 / 5 / 40 / 68.20% / 55.94%


The best average classification rate I observed was 64.83% for the configuration with