A Machine Learning Approach to Object Recognition in the Context of Visual Road Scene Analysis from a Moving Vehicle

Jivko Sinapov

Dept. of Computer Science

Iowa State University

I. Introduction

This project addresses an object recognition task related to visual road scene analysis in the context of a moving vehicle. The goal is to implement and evaluate a robust object recognition technique for the detection of various objects of interest. Object recognition tasks such as detecting other cars and traffic signs are very important when designing driving assistive systems or autonomous driving agents. In this project we implement and evaluate an object detection scheme utilizing a cascade of Haar feature classifiers, as well as a boosting technique utilizing an SVM.

II. Background and Motivation

The success of several of the challengers in last year’s DARPA Grand Challenge shows that computer vision can be used effectively in solving many of the problems associated with autonomous driving. Many problems remain unsolved, however. For example, the computers that processed the data in the autonomous vehicles participating in the challenge are far superior to the average PC, and it is unlikely that people at large would be able to outfit their cars with such systems. In the coming years we are likely to see various driving assistive technologies appear on the market, and there is currently a large overlap between the set of problems associated with autonomous driving and the set of problems in the area of driving assistive technology. A large fraction of accidents occurs because the driver is not paying attention to the road and the cars in front of their vehicle. For example, a driver not paying attention can easily veer off course and enter an undesired lane, or fail to stop at a traffic light. As such, real-time traffic light and vehicle detection are very appropriate problems to tackle, since any driving assistive or autonomous driving system would have to be able to perform those tasks.

The primary goal for this project is to provide appropriate solutions for these problems which can work in real time on a regular PC. In particular, we’ll take a look at the problems of detecting traffic lights and other cars in the field of view. It is conceivable that in the near future cars would come equipped with systems which monitor the road, as well as the driver in order to determine if he or she is not paying attention to the road. In such cases, the system must be able to detect situations which demand the driver’s immediate attention – for example, if the car is approaching a red light at high speed or if the car in front is suddenly slowing down. In order for such a system to work, it will need to be able to accurately detect the objects of interest and a machine learning approach is likely to provide such a solution.

III. Object Recognition Using a Haar Cascade Classifier

In the task of object recognition, we implement an approach which classifies objects based on an extended set of Haar features. This approach was originally proposed by Viola and Jones [1] and extended by Lienhart [2].

The detection scheme uses the values of Haar-like features in an image in order to classify an object as a positive or negative instance. A subset of simple features used in this model is shown in Figure 1.

Figure 1: Some simple examples of Haar-based features

Each of these features consists of a geometric representation of two regions – black and white. The value of each feature at a given position in the image is the difference between the sums of the pixels within the two regions. Haar features can take arbitrarily complex shapes and the size of the full set available in this model is in the order of tens of thousands. In order to compute the value of each feature at a given location of the image, the image is represented in an integral form: the value at position (x, y) in the integral image will contain the sum of pixels that are above y and to the left of x. The general formula for the integral image representation is the following:
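Writing i(x, y) for the pixel value of the original image and ii(x, y) for the integral image (these symbol names are assumed here, following the notation of Viola and Jones [1]), the definition above can be written with the sum taken inclusively as:

```latex
ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')
```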

The integral image representation is chosen for several reasons. First, it allows for efficient computation of a given Haar feature at a given position of the test image. In addition, it allows for robust object detection regardless of global lighting conditions, since the Haar features take into account only the differences of sums of pixels, which are invariant in terms of the global intensity of the image. Last but not least, the integral image representation allows the object detection algorithm to scan for objects at different scales very efficiently, since scaling the integral image can be done much faster than scaling the RGB image [1]. This is a very desirable property since real-time usability is a major goal for this object recognition system.
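The first two points can be illustrated with a minimal Python sketch. It is not the OpenCV implementation used in this project; the function names and the nested-list image format are assumptions made for illustration. Once the integral image is built, the sum over any rectangle takes four lookups, and a two-rectangle feature is the difference of two such sums.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img[y'][x'] for all y' <= y and x' <= x."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the w-by-h rectangle whose top-left corner is (x, y)."""
    def at(xx, yy):
        # Positions outside the image (index -1) contribute zero.
        return ii[yy][xx] if xx >= 0 and yy >= 0 else 0

    x1, y1 = x + w - 1, y + h - 1
    return at(x1, y1) - at(x - 1, y1) - at(x1, y - 1) + at(x - 1, y - 1)


def two_rect_feature(ii, x, y, w, h):
    """Left region minus right region; each region is w wide and h tall."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```

Because the two regions in this sketch have equal area, adding the same constant to every pixel leaves the feature value unchanged, which is one concrete sense in which such features tolerate a uniform change in brightness.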

The classifier is built in stages – at each stage, an AdaBoost-like approach is applied to select one or more Haar features, as well as to determine appropriate thresholds which can be applied to reject a large number of negative training instances. Important input parameters for the training procedure are the minimum hit ratio and the maximum false alarm rate – the search for optimal feature and threshold selection will continue until those two requirements are met, at which point the remaining training examples will be passed on to the next stage. For example, if those parameters are set to 0.995 and 0.5 respectively, then at each stage, feature selection and threshold optimization will be applied until the resulting stage classifies at least 99.5% of the positive instances as positive and does not classify more than 50% of the negative images as positive. For more precise details regarding feature selection and training, consult Viola and Jones [1].

In the extended model of the classifier implemented in the OpenCV C++ library, each stage of the classifier can make use of more than one feature in order to meet the requirements set by the input parameters, in which case each stage can be viewed as a decision tree, rather than a decision stump [3]. It is also important to note that at each stage, the classifier uses a different set of negative training images which are sampled from a given database of images that do not contain the specified object. After training the desired number of stages, the result is a cascade of tree-like classifiers, as shown in Figure 2.

The structure of the resulting classifier is essentially that of a degenerate decision tree or a decision list. Each added stage to the classifier tends to reduce the false positive rate, but also reduces the detection rate [3]. As such, it is essential to train the classifier with the appropriate number of stages for the given task.
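The tradeoff can be made concrete with a back-of-the-envelope calculation. If each stage just meets its training targets and the stages are treated as independent (a simplifying assumption that need not hold in practice), the overall hit rate and false alarm rate of an N-stage cascade are roughly the per-stage values raised to the power N. The function name below is illustrative only.

```python
def cascade_targets(min_hit_rate, max_false_alarm, num_stages):
    """Rough overall rates for a cascade whose stages each just meet their targets.

    Assumes the stages act independently, which is only an approximation.
    """
    overall_hit_rate = min_hit_rate ** num_stages
    overall_false_alarm = max_false_alarm ** num_stages
    return overall_hit_rate, overall_false_alarm


# Per-stage targets from the example in the text: 0.995 and 0.5.
for n in (5, 10):
    hit, fa = cascade_targets(0.995, 0.5, n)
    print(f"{n} stages: hit rate about {hit:.3f}, false alarm rate about {fa:.5f}")
```

Both quantities shrink as stages are added, but the false alarm rate shrinks far faster than the hit rate, which is why the number of stages has to be tuned per task.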

Once a classifier is trained, detection is done by sliding a window across an input image and passing the cropped sub-image through the classifier. In order for classification to be size-invariant, the same procedure is also performed on the input integral image at different scales. Given this scheme, the output of classification is a series of sub-windows of the test image which contain the desired object. In the following two sections we outline how this model was applied to the problems of traffic light and car detection.
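A minimal Python sketch of the scanning loop is given below. It enlarges the window rather than rescaling the image, and the base window size, scale step and stride are assumed values, not the settings used in this project. The classify_window callable is a hypothetical stand-in for the trained cascade.

```python
def sliding_windows(img_w, img_h, base=24, scale_step=1.25, stride_frac=0.1):
    """Yield (x, y, size) for square windows at progressively larger scales."""
    size = float(base)
    while size <= min(img_w, img_h):
        s = int(size)
        stride = max(1, int(s * stride_frac))  # step grows with the window
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                yield x, y, s
        size *= scale_step


def detect(img_w, img_h, classify_window, **kwargs):
    """Return every window accepted by classify_window(x, y, size) -> bool."""
    return [w for w in sliding_windows(img_w, img_h, **kwargs)
            if classify_window(*w)]
```

For example, detect(320, 240, my_cascade) would return the accepted sub-windows for a 320 by 240 frame, where my_cascade is whatever function wraps the trained classifier.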

IV. Traffic Light Detection

The problem of traffic light detection is important in the area of driving assistive technology. A system which is tasked with preventing accidents when a driver is not paying attention must always know whether there is a traffic light in the scene and what its state is.

To solve this task, a Haar cascade classifier for traffic lights was trained. The dataset used in these experiments consists of real time video taken from a camcorder positioned inside a passenger car. The camera resolution is low (320 by 240) and so is the image quality, thus adding another challenge to this problem.

The classifier was trained with 5 stages on 120 positive examples and 120 negative examples. The minimum hit ratio at each stage was set to 0.95 and the maximum false alarm rate to 0.3.

To improve results and decrease computation time, the area of the image being searched is restricted to the portion where a traffic light could actually occur – there is no point in looking for that object on the actual road, for example. Once a traffic light is detected in the input stream, the image is analyzed to determine its state. Ideally, we would want to identify the area of the traffic light which contains the actual color signal. In our case, however, the resolution was low enough that the number of pixels that actually correspond to the light in the traffic light is usually about 5 or 6, which makes it quite difficult to analyze. Nevertheless, a simple scheme for determining the color of the light is implemented, which works the following way:

Input: cropped image of detected traffic light

Steps:

1. G = sum of the green components of the cropped image

2. R = sum of the red components of the cropped image

3. If (G/R) < t1, then output RED

4. Else if (G/R) > t2, then output GREEN

5. Else, output YELLOW

The thresholds t1 and t2 were automatically optimized based on a training set of traffic light images. In practice this scheme worked well in determining the color of a given light, although if we had better resolution, it is conceivable that a much better and more robust algorithm could be devised.
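The steps above can be sketched in Python as follows. The pixel format (a list of (r, g, b) tuples) and the handling of a crop with no red at all are assumptions added for illustration. The threshold values in the usage note are hypothetical, not the optimized values from this project.

```python
def classify_light(pixels, t1, t2):
    """Classify a cropped traffic light by its green-to-red ratio.

    pixels: iterable of (r, g, b) tuples taken from the cropped detection.
    t1, t2: ratio thresholds with t1 < t2.
    """
    pixels = list(pixels)
    g_sum = sum(p[1] for p in pixels)  # step 1: total green
    r_sum = sum(p[0] for p in pixels)  # step 2: total red
    if r_sum == 0:
        # Assumption: the original scheme does not say what to do here.
        return "GREEN" if g_sum > 0 else "UNKNOWN"
    ratio = g_sum / r_sum
    if ratio < t1:
        return "RED"
    if ratio > t2:
        return "GREEN"
    return "YELLOW"
```

With hypothetical thresholds t1 = 0.8 and t2 = 1.2, a crop dominated by red pixels has a ratio well below t1 and is labeled RED, while a crop with balanced red and green falls between the thresholds and is labeled YELLOW.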

The classifier was evaluated on about 20 minutes of continuous input stream recorded while driving in Ames, IA. The detection scheme works quite comfortably in real time due to the small size of the trained classifier and the small area of the image that is being processed. On all the occasions on which a traffic light was passed, the classifier was able to detect it and almost always output the correct color, so long as it was green or red. The low quality of the video input makes it difficult to recognize yellow, since the pixels of the light signal actually appear white in such cases. The good detection performance is likely due to the fact that the traffic light shape is very distinct and there are almost no other objects present in the portion of the image that is being searched. One obvious drawback is that only traffic lights of this particular shape can be detected – while most traffic lights in Ames follow this standard, the same might not be true for other cities. Figure 4 shows some example results. At the end of this paper there is a discussion of some available online demos of this system that demonstrate how it works in practice.

Figure 4: Example results from running the traffic light detection procedure

V. Car Detection

In this particular problem, we are interested in detecting vehicles in front of the observer. A series of Haar cascade classifiers are trained and evaluated on two different datasets.

The first dataset, as in the previous problem, consists of low-quality video taken while driving in Ames and the surrounding areas. The low quality, however, makes it difficult to detect objects further in the distance and as such a second dataset of good quality images was used in order to evaluate the detection scheme in more detail.

5.1. Car Detection in low-quality and low-resolution video stream

As in the previous section, the experiments are performed on a dataset consisting of video recorded from a camera installed in a passenger car while driving in Ames. A classifier with 10 stages is trained on 200 sample images of cars taken from half of the available video. The training parameters minimum hit ratio and maximum false alarm rate are set to 0.995 and 0.3 respectively. The resulting classifier is tested on about 20 minutes of video recorded while driving on the freeway.

Once again, since the position of the road relative to the observer is known in this context, we are able to restrict the image area in which a car is hypothesized to be. Restricting the region of interest allows for greater speed of computation and for elimination of false positives which could not possibly be actual cars due to their location.

Once the region of interest is identified, it is scanned by a window at different scales, and any sub-windows which are marked as positive by the Haar cascade classifier are deemed to be detected cars.
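The region-of-interest step can be sketched in Python as follows. The frame is assumed to be a list of pixel rows and the region a hypothetical (x, y, width, height) tuple; the actual region used in this project is not specified here. Detections found inside the cropped region are shifted back into full-frame coordinates.

```python
def crop_roi(frame, roi):
    """Return the part of the frame (a list of pixel rows) inside roi."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]


def to_frame_coords(detections, roi):
    """Shift (x, y, w, h) detections from ROI coordinates to frame coordinates."""
    rx, ry, _, _ = roi
    return [(dx + rx, dy + ry, dw, dh) for dx, dy, dw, dh in detections]
```

Only the cropped region is passed to the detector, so both the running time and the number of implausible detections drop with the size of the region.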

Restricting the search area helps eliminate almost all false positives. A passing car was always detected as such, although once it gets far enough ahead, the detection scheme fails due to the small size of the object and the low image quality. Even though large semi-trucks were not part of the training set, they generally tended to be recognized as cars by the classifier if close enough. The demos available online give an illustration of how well this detection and classification scheme works. Some sample screenshots are included in Appendix I. Overall, with a large dataset and a good-quality video stream, such a system could be fairly robust, although it will never be perfect, and hence an autonomous driving agent would need a much smarter framework in order to detect vehicles on the road. In the next section, we evaluate this object recognition and detection scheme much more precisely with mid- to good-quality input data.

5.2. Car Detection in mid- to good-quality images

The dataset used in the following experiments consists of 526 images taken from the driver’s seat of a vehicle, each of which contains at least one car in front of the observer. The images are not sequential frames from a video feed. Sample images from this dataset are shown in Figure 5. The dataset was split into 2/3 training and 1/3 test sets. Overall, 300 sample images of cars were extracted and used for training each classifier.

Knowing that the detection rate can decrease as the number of stages in a classifier increases, our task is to determine the optimal number of stages for this problem. The training parameters minimum hit ratio and maximum false alarm rate are set to 0.995 and 0.3 respectively for all trained classifiers. Classifiers with the number of stages ranging from 5 to 10 are then trained and evaluated on the test set.

Evaluation is performed by running the detection scheme on the test set and taking note of the type of results that are output for each image. Each output result falls within one of three categories: positive, negative, or partial. Positive results are those that contain a car in a well-defined box. Negative results are outputs that do not contain any major distinguishable portion of a car. Partial results cover everything in between – if the result contains only a major portion of the car, or if it contains a car but also a large amount of background, then it is labeled as partial. Figure 6 shows examples of each type of output.

Figure 6: Examples of a positive (a), negative (b) and a partial (c) detected object.

Each trained classifier was tasked with detecting cars in the test set and the resulting outputs were saved and manually labeled as positive, negative or partial. Figure 7 shows the results of each run. As we can see from the chart, the 7-stage classifier detects the highest number of cars in the test set, while the 10-stage classifier achieves the lowest false alarm rate, as expected.

Figure 7: Summary of classifiers’ performance.
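A per-classifier summary of this kind can be tallied from the manual labels along the following lines. This is a minimal Python sketch; the function name and the label strings are assumptions, and any label list passed to it in an example is toy data rather than a result from these experiments.

```python
from collections import Counter


def summarize(labels):
    """Count manually assigned labels and report the share of negative outputs."""
    counts = Counter(labels)
    total = len(labels)
    return {
        "positive": counts["positive"],
        "partial": counts["partial"],
        "negative": counts["negative"],
        # Negative outputs are the detector's false alarms.
        "negative_fraction": counts["negative"] / total if total else 0.0,
    }
```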

The results of these experiments illustrate the tradeoff between the hit rate and the false alarm rate of each classifier. Ideally, we want to detect as many actual appearances of the target object in the input stream as possible without reporting too many false positives. In both driving assistive technology and autonomous driving applications, a false positive error is not nearly as bad as a complete miss of an actual object of interest.

Next, we explore an approach to boost the classifier in order to minimize the false alarm rate while maintaining a good hit ratio. One such approach would be to reinsert samples of false positive outputs into the training set and further train the Haar cascade classifier. Retraining the classifier, however, is a highly time-consuming process when compared to other machine learning techniques. A 10-stage Haar cascade classifier, for example, can take up to one hour to train on an average PC, even when faced with only a small dataset of 300 positive and 300 negative samples. If a real-time system is being told by the user that some of its findings are false positives, it would not have the luxury of time to adapt to those results.

Our approach to improving performance in real time utilizes an SVM which is trained on labeled outputs produced by running the Haar cascade detection scheme. This technique proves time-efficient and it improves performance. A good question at this point is why not use an SVM from the very beginning? An SVM approach would likely yield better results than Haar cascade classification. However, we note that it is difficult to efficiently search through an RGB image at different scales for potential candidates. If the SVM makes use of global and local features instead of the raw pixel values, then there would be additional computational overhead (on top of scaling the image) when sliding a window and looking for a match. Real-time usability is a requirement for any driving assistive technology or autonomous driving system. We also have to note that object detection and recognition is only a small portion of such a system and, as such, we need an efficient algorithm which saves computational resources for other tasks such as object tracking and decision making.
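The resulting two-stage arrangement can be sketched in Python as follows. The cascade proposes candidate windows cheaply, and the second-stage classifier is evaluated only on those candidates. The function names, the feature extractor and the linear decision function (the form w · x + b used by a linear SVM) are illustrative assumptions, and the weights in any example are hypothetical.

```python
def linear_decision(weights, bias):
    """Return a scoring function of the form w . x + b, as used by a linear SVM."""
    def score(features):
        return sum(w * f for w, f in zip(weights, features)) + bias
    return score


def two_stage_detect(candidates, extract_features, verifier, threshold=0.0):
    """Keep only the cascade candidates that the second-stage verifier accepts.

    candidates: windows proposed by the cascade, e.g. (x, y, w, h) tuples.
    extract_features: hypothetical function mapping a window to a feature list.
    verifier: scoring function; scores above threshold count as the target object.
    """
    kept = []
    for window in candidates:
        if verifier(extract_features(window)) > threshold:
            kept.append(window)
    return kept
```

Since the verifier only sees the handful of windows that survive the cascade, its cost per frame stays small even if its features are relatively expensive to compute.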