Stereo Vision for Mobile Robotics

Marti Gaëtan, Micro-Engineering Laboratory, Swiss Federal Institute of Technology of Lausanne.


Abstract

The Virtual Reality and Advanced Interfaces (VRAI) group[1] is currently investigating stereo vision for mobile robots. This paper provides an overview of both computational and biological approaches to stereo vision, stereo image processing and robot navigation.

1 Introduction

Two eyes or cameras looking at the same scene from different perspectives provide a means of determining three-dimensional shape and position. Scientific investigation of this effect (called variously stereo vision, stereopsis or single vision) has a rich history in psychology, biology and, more recently, in the computational modeling of perception. Stereo is an important method for machine perception because it leads to direct depth measurements. Additionally, unlike monocular techniques, stereo does not infer depth from weak or unverifiable photometric and statistical assumptions, nor does it require specific detailed object models. Once stereo images have been brought into point-to-point correspondence, recovering depth by triangulation is straightforward.

Current range imagers have achieved real-time or near real-time performance on images of modest size. For example, stereo algorithms on standard hardware are capable of returning dense 128 x 128 range images at 10 Hz, while scanning laser range-finders can operate at 2 Hz on 256 x 256 images. To take advantage of these devices, researchers have proposed numerous methods for extracting 3-D information from range images. These methods operate either in 3-D Cartesian space (volumetric representations) or in a 2.5-D range image space (contour map method). Contour map methods are particularly attractive for computation-bound applications such as mobile robots.

We begin with a discussion of the geometric basis and a computational model for stereo vision. Next, we briefly describe biological aspects of depth perception and a contour map method for depth processing. Finally, we present an obstacle avoidance technique for mobile robots using real-time stereo vision.

2 Stereo vision

The key problem in stereo vision is to find corresponding points in stereo images. Corresponding points are the projections of a single 3D point in the different image spaces. The difference in the position of corresponding points in their respective images is called disparity (see Figure 1). Disparity is a function of both the position of the 3D scene point and of the position, orientation, and physical characteristics of the stereo devices (e.g. cameras).

Figure 1 (a) A system with two cameras is shown. The focal points are Fl and Fr, the image planes are Il and Ir. A point P in the 3D scene is projected onto Pl in the left image and onto Pr in the right image, (b) cyclopean view: the disparity D is the difference in the position of the projection of the point P onto the two stereo image planes.

In addition to providing the function that maps pairs of corresponding image points onto scene points, a camera model can be used to constrain the search for corresponding image points to one dimension. Any point in the 3D world space, together with the centers of projection of two camera systems, defines an epipolar plane. The intersection of such a plane with an image plane is called an epipolar line (see Figure 2). Every point of a given epipolar line must correspond to a single point on the corresponding epipolar line. The search for a match of a point in the first image may therefore be reduced to a one-dimensional neighborhood in the second image plane (as opposed to a 2D neighborhood).

Figure 2 Epipolar lines and epipolar planes.

When the stereo cameras are oriented such that there is a known horizontal displacement between them, disparity can only occur in the horizontal direction and the stereo images are said to be in correspondence. When a stereo pair is in correspondence, the epipolar lines are coincident with the horizontal scan lines of the digitized pictures[2].
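
Once a stereo pair is in correspondence, depth follows directly from the horizontal disparity by triangulation. The sketch below is a minimal illustration, assuming a rectified, parallel-axis rig with a known focal length (in pixels) and baseline; the function name and the numeric values are illustrative, not taken from the system described here.

```python
# Minimal sketch of depth recovery by triangulation for a rectified pair:
# Z = f * B / d, where f is the focal length in pixels, B the baseline and
# d the horizontal disparity in pixels.  All values below are illustrative.

def depth_from_disparity(disparity_px: float, f_px: float, baseline_m: float) -> float:
    """Return the depth (in metres) of a point seen with the given disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return f_px * baseline_m / disparity_px

# Example: a 10-pixel disparity with a 500-pixel focal length and a 12 cm
# baseline corresponds to a point 6 m in front of the cameras.
print(depth_from_disparity(10.0, 500.0, 0.12))  # -> 6.0
```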

Ideally, one would like to find the correspondence of every individual pixel in both images of a stereo pair. However, it is obvious that the information content in the intensity value of a single pixel is too low for unambiguous matching. In practice, continuous areas of image intensity are the basic units that are matched. This approach (called area matching) usually involves some form of cross-correlation to establish correspondences.

2.1  Matching

The main problem in matching is to find an effective definition of what we call a valid correlation.
Correlation scores are computed by comparing a fixed window in the first image against a shifting window in the second. The second window is moved in the second image by integer increments along the corresponding epipolar line and a correlation score curve is generated for integer disparity values. The measured disparity can then be taken to be the one that provides the largest peak.

To quantify the similarity between two correlation windows, we must choose, among many different criteria, one that produces reliable results in a minimum of computation time. We denote by I1(x,y) and I2(x,y) the intensity values at pixel (x,y) in the first and second image respectively. The correlation window has dimensions (2n+1) x (2m+1); the indexes i and j used in such a criterion therefore vary between -n and +n and between -m and +m respectively.

It is important to know whether a match is reliable or not. The form of the correlation curve (for example, that of a criterion such as C1) can be used to decide whether the probability that the match is an error is high. Indeed, errors occur when a wrong peak slightly higher than the right one is chosen. Thus, if the correlation curve contains several peaks of approximately the same height, the risk of choosing the wrong one increases, especially if the image is noisy. However, a confidence coefficient s, proportional to the difference in height between the most important peaks, may be defined. Other important information may also be extracted from the correlation curve, for instance the presence of bland (low-texture) areas.
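
To make the matching procedure concrete, the sketch below implements area matching along an epipolar line. The correlation criterion used here is normalized cross-correlation, chosen purely for illustration; the window size, disparity range and confidence measure are likewise illustrative and are not the ones used in the actual system.

```python
import numpy as np

def match_along_epipolar(I1, I2, x, y, n=3, m=3, d_max=32):
    """Area-matching sketch: correlate a (2n+1) x (2m+1) window of I1 centred
    on (x, y) against windows of I2 shifted by integer disparities d along the
    same scan line, and keep the disparity giving the largest correlation peak.
    Normalized cross-correlation is used as a stand-in for the criteria
    discussed in the text; (x, y) is assumed to lie away from the image border
    (the sign convention assumes I1 is the left image)."""
    ref = I1[y - m:y + m + 1, x - n:x + n + 1].astype(float)
    scores = []
    for d in range(d_max + 1):
        if x - n - d < 0:                          # window would leave the image
            break
        cand = I2[y - m:y + m + 1, x - d - n:x - d + n + 1].astype(float)
        num = np.sum(ref * cand)
        den = np.sqrt(np.sum(ref ** 2) * np.sum(cand ** 2)) + 1e-12
        scores.append(num / den)                   # correlation score for disparity d
    scores = np.array(scores)
    best = int(np.argmax(scores))
    # Confidence in the spirit of the coefficient s: difference in height
    # between the best score and the next best one (a more careful version
    # would compare distinct local peaks only).
    second = float(np.max(np.delete(scores, best))) if scores.size > 1 else 0.0
    confidence = float(scores[best]) - second
    return best, confidence
```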

3 Human depth perception

For human beings, correlation (as described in the previous section) is only a local mechanism of stereoscopic vision [Bruce and Green 1990]. However, imagine the following experiment:

Figure 3 Stereoscopic fusion false target problem.

A stereoscopic system displays the set of points D1, D2, D3 to the right eye (see Figure 3) and G1, G2, G3 to the left one. An observer could in principle perceive any of the possible intersections (the grey points in the light grey area) but, instead, perceives only the dark ones. This experiment shows that a global mechanism, based on criteria other than local correlation, is also used. Among these criteria, the following are taken into account:

·  A principle of correlation based on the contours of the image (see Figure 4a);

·  A mechanism of cognitive interpretation which, in some cases, takes priority over the local mechanism of stereo vision [Marr 1982];

·  A mechanism of pictorial cues of depth (relative size, relative height, perspective, shading, “fog” effect and interposition (see Figure 4b));

·  A principle of dynamic cues, such as motion parallax;

·  Other mechanisms, such as correlation of frequency-filtered images [Poggio and Poggio 1984].

Figure 4 (a) “Illusory” contours defining a square giving the impression that this shape is placed in front of 4 circles, (b) interposition principle (cognitive interpretation).

This list is not exhaustive but presents the most significant criteria belonging to the global system of the human stereoscopic vision[3].

In summary, the human stereo system uses a number of interesting methods, which work together to recover depth. On the one hand, this system is very powerful: even with one eye, it is possible to perceive depth. On the other hand, it is also very subjective, because a trompe-l’œil can fool our perceptual system.

In the next section, we will focus on an example of a computational model for range (disparity) image parsing.

4 Contour map method

For many applications, working in 3D Euclidean space turns out to be unnecessary and difficult to manage. To reduce the amount of data which has to be processed, we introduce a method of quantizing volumes that allows us to manipulate range images (see Figure 5a) directly, without having to first transform to 3-D space. The method is similar to the use of contour maps to represent elevations; hence, we call it the contour method [Chauvin, Marti and Konolige 1997]. A contour represents the elevation at a particular height (see Figure 5b); all terrain between one contour line and the next is at an elevation between those represented by the contour lines. Contour creation can be visualized as a set of planes parallel to the ground at specified heights, intersecting the terrain. These cutting planes induce a quantization of the 3-D space based on elevation. Our approach uses this basic idea, and consists of the following steps:

  1. Constructing a set of volumes in 3-D space using a set of cutting surfaces (not necessarily planar);
  2. Projecting the cutting surfaces back to the range image to induce a quantization of the range data (see Figures 5c and 5d);
  3. Using the quantized range image to construct terrain models or other abstractions.

In cases where the desired segmentation is relative to the sensor viewpoint, the first two steps can be achieved off-line, leading to significant computational savings, especially when the cutting surfaces are complicated. In addition, step 3 can often be performed in the range image space, which is much more efficient than working in the volumetric space. Also, in contrast to grid-based approaches, the cutting surfaces need not be regular, and can be sized to take into account the precision and error characteristics of the range data.
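
A minimal sketch of steps 2 and 3 is given below, assuming that the off-line projection of the cutting surfaces has already produced one per-pixel disparity-threshold map per surface, ordered from the lowest cutting surface to the highest. The function name and the threshold representation are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def quantize_range_image(disparity, cut_thresholds):
    """Contour-method sketch: cut_thresholds is a list of per-pixel disparity
    maps, one per cutting surface, precomputed off-line by projecting each
    surface back into the range image (step 2).  Each pixel of the live
    disparity image is labelled with the index of the last cutting surface
    whose threshold its disparity exceeds (surfaces are assumed ordered so
    that a larger index means a larger threshold), which yields the quantized
    contour image used as input to step 3."""
    contour = np.zeros(disparity.shape, dtype=np.uint8)
    for k, threshold in enumerate(cut_thresholds, start=1):
        contour[disparity >= threshold] = k    # pixel's 3-D point is on the near side of surface k
    return contour
```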

Figure 5 (a) Range image where light (green) pixels represent points closer than dark ones, (b) Elevation map composed of the superposition of 8 contours (a single contour is represented with a special pattern), (c) and (d) projection of the cutting surfaces back into the range image for the single contour highlighted in image (b).

5 Mobile robot obstacle avoidance

The contour method is well suited for use with the vector field histogram (VFH) algorithm for mobile robot obstacle avoidance [Borenstein and Koren 1991]. Originally developed with sonar sensors, the method uses three steps:

  1. A regular 2-D histogram grid in plan view holds the results of sonar sensor readings around the robot. The value of each grid cell represents the number of sonar readings that indicated an object within that cell (see Figure 7a);
  2. A polar histogram is computed from the histogram grid, with k regular angular sectors instead of a rectilinear grid. The value hk of each sector in the polar histogram represents the obstacle density in that direction (see Figure 8);
  3. Steering and velocity values are extracted from the polar histogram.

Figure 6 Representation of an obstacle (a) in the Cartesian space, (b) in the contour image space.

Figure 7 Two obstacles (a) in the polar grid, (b) in the contour image grid.

In range images, each column of the image represents a polar sector whose angular width is determined by the camera parameters (see Figure 7b). We let the k sectors correspond to the columns of the range image. Thus, we can construct the polar histogram directly from the contour representation, without having to convert to Cartesian space.

The key step is calculating the histogram value hk for each sector. Roughly speaking, this value represents the probability of finding an obstacle close to the robot in the direction of sector k. The simplest idea is to use a single cutting surface at an elevation above the ground plane sufficient to constitute an obstacle for the robot. Any points in the resulting contour (see Figure 6b) are obstacles, and we can use the number of such points in a column and their distance to determine a histogram value.
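
A sketch of this step, under the same single-cutting-surface assumption, is given below. The obstacle contour is treated as a binary image, the sector density is simply the number of obstacle pixels per column, and the steering choice shown is an illustrative stand-in for the VFH valley-selection step rather than the actual implementation; the threshold value is arbitrary.

```python
import numpy as np

def polar_histogram(obstacle_contour):
    """Each column of the obstacle contour (a binary image marking pixels that
    lie above the obstacle cutting surface) is one angular sector k; the value
    h_k is the number of obstacle pixels found in that column."""
    return obstacle_contour.astype(np.int32).sum(axis=0)

def steering_sector(hist, threshold=5):
    """Illustrative valley selection: return the centre of the widest run of
    sectors whose obstacle density is below the threshold, or None if every
    sector is blocked."""
    free = hist < threshold
    best_len, best_mid, start = 0, None, None
    for k, is_free in enumerate(np.append(free, False)):   # sentinel closes the last run
        if is_free and start is None:
            start = k
        elif not is_free and start is not None:
            if k - start > best_len:
                best_len, best_mid = k - start, (start + k - 1) // 2
            start = None
    return best_mid
```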

Figure 8 Polar histogram corresponding to the obstacles of Figure 7.

The details of the weighting scheme we use are not critical; we expect almost any reasonable method that combines distance and number of points to work reasonably well. In our implementation, we used a stereo system with disparity as the range metric, and let each contour cell contribute to its sector value as a function of its disparity. This weighting compensates for the fact that disparity increases hyperbolically as an object gets closer.
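
Since the exact weighting is not critical, the variant sketched below simply weights each obstacle pixel by its disparity, so that nearer obstacles contribute more to the sector value; the linear dependence on disparity is an assumption made for illustration, not the formula used in the actual system.

```python
import numpy as np

def weighted_polar_histogram(obstacle_contour, disparity):
    """Illustrative weighting only: each obstacle pixel contributes its
    disparity to the sector value, so closer obstacles (larger disparities)
    dominate.  The weighting used by the actual system is not reproduced here."""
    weights = np.where(obstacle_contour > 0, disparity.astype(float), 0.0)
    return weights.sum(axis=0)
```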


Figure 9 Experimental results. (a) image of the scene, (b) corresponding disparity image, (c) from left to right: contour, object detection and polar histogram (the vertical line corresponds to the direction followed by the robot and the horizontal line, its speed).

The VFH method was implemented using a small stereo system[4] for range images and a PC for processing the VFH algorithm. The stereo system returned images at a 5 Hz rate, and the VFH processing took less than 10 ms per image to form the polar histogram and extract the desired direction and speed of travel. Data were then sent to a robot navigation program[5] in order to steer a Koala[6] or Pioneer[7] robot.

Figure 9 shows a calculation of the polar histogram from a typical range image and a single-contour segmentation. In this case, the sensor covers about a 70 degree angle, and each sector is about 0.5 degrees. Image (a) is an intensity image of the scene, and (b) shows the disparity map computed by a stereo system. Brighter green values are higher disparities, hence closer to the camera. The final set of images (c) shows the contour (left side), a segmentation of some obstacles (middle), and finally the polar histogram. The middle of the histogram corresponds to the direction straight along the camera optical axis, and the vertical line indicates the direction of travel that the VFH algorithm has found. From the picture, this is the direction through the open door.