Section 2.1.2 Realtime Feature Tracking and Segmentation

upper body, hands, objects (LG)

1 page - Other Relevant Research

A list of virtual reality research groups can be found at:

Media Lab , Gesture and Narrative Language

work on life-like avatars, not very relevant: their avatars are autonomous.

Maybe we don't want an avatar; we want virtual presence (the same person at a remote location).

NTI: From observation of their webpage, it seems like a very similar idea.

(In fact, I recognize a lot of the drawings...)

The website doesn't provide any detailed information, so I can't

really say too much about it. At the level that it is presented,

I would say it's exactly the same thing, except that they also claim

that they want to define future standards in order to

"not be hindered in the future by inadequate standards or improper

infrastructure decisions that could have been avoided"

(from their webpage).

ATR

two relevant projects, one on recognition and generation of humans, the other on recognition and generation of scenes

however, no detailed information there...

Scattered, individual 'pigeon-holed' efforts study sub-problems; i.e., what can be done so far:

Rehg & Kanade hand pose estimation (LG)

J. Rehg and T. Kanade. ‘Model-based tracking of self-occluding articulated objects’. In Proc. of Fifth Intl. Conf. on Computer Vision, pages 612-617, Boston, MA, June 1995. IEEE Computer Society Press.

Rehg has developed a system able to estimate hand pose. At each time step, the previous time step's estimated pose is used as the starting point for a gradient descent search for the current pose. The gradient descent minimizes the sum of squared differences between the current image and a comparison image generated from the current pose estimate. The generated image is created by warping pre-acquired templates of the finger segments according to the pose, and compositing the entire hand image from several such segments, taking inter-digital occlusions into account. Under the conditions of 1) high-contrast images with a plain dark background, 2) a high-resolution image of the hand with no scale, orientation, or position changes (solely internal pose changes of the fingers), and 3) slow hand movement, the system is able to track changes in hand pose robustly.
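The core loop above can be sketched in a toy 1-D form (a hypothetical simplification, not Rehg's implementation: the "pose" is a single shift parameter, the "generated image" is a shifted template, and the SSD cost is minimized by numerical gradient descent warm-started from the previous frame's estimate):

```python
import numpy as np

def render(template, shift, width):
    """Generate a comparison image: the template placed at a (float) shift."""
    xs = np.arange(width)
    return np.interp(xs - shift, np.arange(len(template)), template,
                     left=0.0, right=0.0)

def ssd(a, b):
    """Sum of squared differences between two images."""
    return float(np.sum((a - b) ** 2))

def track(observed, template, pose0, lr=0.05, iters=200):
    """Numerical gradient descent on the SSD cost, starting at pose0."""
    pose, eps = pose0, 1e-3
    for _ in range(iters):
        c_plus = ssd(render(template, pose + eps, len(observed)), observed)
        c_minus = ssd(render(template, pose - eps, len(observed)), observed)
        grad = (c_plus - c_minus) / (2 * eps)
        pose -= lr * grad
    return pose

template = np.array([0.0, 1.0, 2.0, 1.0, 0.0])   # pre-acquired "segment"
observed = render(template, 12.3, 40)            # current frame: true shift 12.3
estimate = track(observed, template, pose0=11.0) # previous frame's pose as seed
```

The warm start from the previous frame is what makes the slow-movement assumption necessary: gradient descent only finds the correct pose when the previous estimate lies in the cost function's basin of attraction.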

Pfinder (MIT Medialab) (LG)

C. Wren, A. Azarbayejani, T.Darrell, A. Pentland

‘Pfinder: Real-Time Tracking of the Human Body’

published in IEEE Transactions on Pattern Analysis and Machine

Intelligence July 1997, vol 19, no 7, pp. 780-785

This system is able to track a single user in real-time (10Hz) on an SGI Indy computer. The technique is to create second-order statistical models of the background and the user, and at each time step to perform a pixel-wise MAP (maximum a posteriori) decision to determine the background/figure segmentation. The pixels belonging to the user are further MAP-classified as belonging to one of several blobs which represent parts of the body with different colors (head, hands, feet, torso, and legs). Based on the classification, new statistics for the background and user body parts are recomputed at each frame, in order to adapt to changes in lighting conditions. Hands and head are found and tracked reliably due to the inclusion of prior knowledge of human skin color (which is largely invariant to the amount of skin pigmentation). Robustness is further increased by including some a priori knowledge of the outline of the human body, and by using the detection of certain poses to label certain body-part blobs as the head or hands. This system is relatively computationally inexpensive, and can provide a reliable body segmentation as well as localization of the hands and head. However, it does not provide 3D information (although, for small blob regions, the same algorithm run from a different camera viewpoint could be used to triangulate 3D positions), and it also does not provide the pose of the body (pose of the face, arms, hands).
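The pixel-wise MAP decision at the heart of Pfinder can be illustrated with a minimal sketch (a hypothetical simplification: scalar "color" values and made-up class statistics, where the real system uses multi-dimensional second-order models):

```python
import numpy as np

# Each class is a Gaussian color model with a prior: (mean, variance, prior).
# The values below are illustrative, not Pfinder's actual statistics.
classes = {
    "background": (0.1, 0.01, 0.6),
    "skin_blob":  (0.8, 0.02, 0.2),
    "shirt_blob": (0.5, 0.02, 0.2),
}

def log_posterior(x, mean, var, prior):
    """Log prior plus log Gaussian likelihood (constant term dropped)."""
    return np.log(prior) - 0.5 * np.log(var) - (x - mean) ** 2 / (2 * var)

def map_classify(pixel):
    """MAP decision: pick the class with the highest posterior."""
    return max(classes, key=lambda c: log_posterior(pixel, *classes[c]))

labels = [map_classify(p) for p in (0.05, 0.78, 0.52)]
```

In the full system this decision runs for every pixel of every frame, and the winning class's statistics are then re-estimated from the pixels assigned to it, which is what lets the models adapt to lighting changes.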

Caltech Arm Tracker (LG)

Luis Goncalves, Enrico Di Bernardo, Pietro Perona

‘Monocular Tracking of the Human Arm in 3D’, in the proceedings of the IEEE

Fifth International Conference on Computer Vision, June 1995, pp. 764-770

This system is able to determine the 3D pose of a human arm in real-time (30Hz) from a monocular view, using a recursive estimator and a 3D model of the arm. At each frame, an error vector quantifying the distance between the expected arm outline and the real arm's outline in a background-subtracted image is computed. This error vector is then used by the recursive estimator to update the expected pose of the arm. The system can resolve the position of the arm with a resolution of 1 cm, under the assumptions of 1) a fairly accurate model of the arm (consisting of truncated cones), 2) no loose clothing, 3) limited self-occlusion of the arm by the body (since only a foreground/background segmentation is used to compute the error vector, if the arm is in front of the torso its boundary information will be lost), and 4) a shoulder position known in 3D.

JY's 3DPhotography (JYB) (this goes in section 2.1.1 - realtime 3D recon)

Commercial systems for tracking (optical, magnetic), and 3D reconstruction

(laser scanners) are not viable since they are invasive techniques and

require a structured/controlled environment.

(NOTE HOWEVER: I recall seeing a picture where the users have 3D goggles/glasses on. If they have to wear that special equipment, the glasses could be used to provide some head position/pose information.)

---

2 pages -

Description of modules available for sensing

(INTRO)

There are several known methods of extracting from image streams various types of information useful for tracking. These can be incorporated and further developed into a system capable of tracking and segmenting the upper body and hands, as well as objects brought into the scene. Below is a brief description of the methods we wish to experiment with, evaluate, and possibly integrate into our system.

(These all belong in section 2.1.1 - Real-time 3D Reconstruction)

?image-based rendering? (JYB)

?video mosaicing and image-based panoramic reconstruction? (JYB)

3D stereo reconstruction (JYB)

Weak Structured Lighting, structured lighting (JYB)

structure from motion (JYB)

shape from shading ... (JYB)

(?any techniques at all available?)

Skin color blob trackers (LG)

Human skin has a distinctive hue which makes it relatively easy to detect.

- largely invariant to race,

- color histogram, image color statistics, and continuity can be used to robustly segment areas of skin color in an image (Forsyth, Berkeley)

- (Pentland, MIT Media Lab) can detect the location of the face and hands in an image, and also provide a rough estimate of hand pose by fitting a 3D elliptical blob to the segmented skin patch, using skin region information from two cameras.

Strength: fast - has been demonstrated in real-time

fairly robust to lighting conditions and skin complexion

Weakness: provides only medium-resolution estimates of position and pose.

Specification: ??how accurate is Pentland's pose/position of ellipsoids??
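A minimal sketch of the hue-based detection idea (the thresholds below are illustrative assumptions, not calibrated values; a real tracker would learn them from a skin-color histogram):

```python
import colorsys

def is_skin(r, g, b):
    """Classify one RGB pixel (components in 0..1) as skin or not,
    using the fact that skin hue sits in a narrow red-orange band
    regardless of complexion."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    # hue near red-orange, with moderate saturation and enough brightness
    return (h < 0.11 or h > 0.95) and 0.2 < s < 0.8 and v > 0.3

pixels = [(0.85, 0.60, 0.50),   # light skin tone
          (0.45, 0.30, 0.22),   # darker skin tone
          (0.20, 0.50, 0.90)]   # blue background
flags = [is_skin(*p) for p in pixels]
```

Note that both skin tones pass the same hue gate while the background does not; this is the invariance to complexion mentioned above, since pigmentation mostly changes brightness and saturation rather than hue.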

Blob trackers (LG)

(Bregler, Berkeley) Recursive techniques for segmenting regions based on optical flow and/or other features. Differs from active contours in that region rather than boundary information is used to keep track of the object. Also, no explicit representation of the shape is kept, other than the image centroid and moments of inertia of the object.

Strength: can track distinct regions without any a priori model of what is to be tracked.

Weakness: (optical flow) measurements are computationally expensive. Due to the lack of a model, errors can accumulate and the actual physical region being tracked may shift over time.
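The blob representation described above, a centroid plus moments of inertia with no explicit shape model, can be sketched as follows (a generic illustration, not Bregler's implementation):

```python
import numpy as np

def blob_stats(mask):
    """Centroid and 2x2 second-moment (inertia) matrix of a binary region."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    dx, dy = xs - cx, ys - cy
    inertia = np.array([[np.mean(dx * dx), np.mean(dx * dy)],
                        [np.mean(dx * dy), np.mean(dy * dy)]])
    return (cx, cy), inertia

mask = np.zeros((20, 20), dtype=bool)
mask[5:9, 3:13] = True                 # a 4x10 rectangular blob
(cx, cy), inertia = blob_stats(mask)   # centroid (7.5, 6.5)
```

The eigenvectors of the inertia matrix give the blob's principal axes, so orientation and elongation are recoverable even though no boundary is stored; this is also why the tracked region can drift, since nothing anchors the statistics to a particular physical surface.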

Active contours (LG)

(Blake, Oxford, UK)

A method of tracking a region in an image, allowing for 2D position, orientation, scale, and shape changes. Given the location of the object's/body part's contour in the previous frame, measurements to detect the location of the contour in the current frame are made, and the optimum expected contour is computed.

Strength: fast, recursive estimation, applicable to real-time implementation

Weakness: may not be particularly robust; requires a clearly visible contour; does not provide 3D information

Kalman filtering (LG)

(Kalman, 1960s)

This classical control-theoretic technique for efficiently estimating the state of a dynamical system, given a time sequence of measurements of some of the system's outputs, is in widespread use in computer vision, especially for object tracking.

Strength: effective recursive technique; can provide accurate estimates in real-time.

Weakness: requires an accurate dynamical model of the system being observed in order to provide accurate results. Also requires the acquisition of reliable and robust measurements of the system's state. This is typically not an issue in mechanical feedback control applications, but is the main cause of failure of the method in computer vision applications.
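A minimal textbook sketch of the technique (a generic 1-D constant-velocity model with position measurements; the matrices F, H, Q, R and noise levels below are illustrative assumptions, not any particular tracker from this document):

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # dynamics: position += velocity
H = np.array([[1.0, 0.0]])               # we measure position only
Q = 1e-4 * np.eye(2)                     # process noise covariance
R = np.array([[0.25]])                   # measurement noise covariance

def kalman_step(x, P, z):
    """One predict/update cycle of the Kalman filter."""
    x = F @ x                            # predict state
    P = F @ P @ F.T + Q                  # predict covariance
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (z - H @ x)              # correct with measurement
    P = (np.eye(2) - K @ H) @ P
    return x, P

rng = np.random.default_rng(0)
x, P = np.zeros(2), np.eye(2)
true_pos = 0.0
for _ in range(100):
    true_pos += 0.5                      # object moves 0.5 units per frame
    z = np.array([true_pos + 0.5 * rng.standard_normal()])
    x, P = kalman_step(x, P, z)
```

After enough frames the state estimate recovers both the position and the unobserved velocity; the weakness noted above shows up when F no longer matches the true motion, or when the measurements z are outliers rather than Gaussian-noisy observations.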

Condensation filter (LG)

(Blake, Oxford, UK)

The condensation filter is an alternate way of performing recursive estimation. It is based on the Kalman filter, but has greater robustness to scene clutter. This robustness is achieved by avoiding the parameterized Gaussian representation of the estimated state that a Kalman filter uses, and instead representing the distribution implicitly as an ensemble of possible states. This ensemble is updated through time by evolving each sample according to a dynamical model of the system, then sampling this new underlying distribution, and assigning the new samples a likelihood based on how well they agree with the data in the observed image. Blake et al. have demonstrated robust tracking of a hand on a cluttered desktop, and of a leaf on a bush shaking vigorously in the wind.

Strength: can track with great robustness with respect to scene clutter, especially if a good dynamical model of the system being observed is available.

Weakness: a large number of sample states may be required for effective tracking, requiring many more image measurements than a Kalman filter. There is a tradeoff between the number of sampled states and the achievable robustness.
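The sample-ensemble idea can be sketched as a generic 1-D particle filter (a heavy simplification of Blake's formulation; the dynamical model, noise levels, and ensemble size below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
particles = rng.normal(0.0, 5.0, n)      # ensemble representing the state pdf

def likelihood(states, observation, noise=1.0):
    """How well each sample state agrees with the observed data."""
    return np.exp(-0.5 * ((states - observation) / noise) ** 2)

true_state = 0.0
for _ in range(50):
    true_state += 0.3
    # 1) evolve each sample with the dynamical model plus process noise
    particles = particles + 0.3 + rng.normal(0.0, 0.5, n)
    # 2) weight samples by agreement with the (noisy) observation
    obs = true_state + rng.normal(0.0, 1.0)
    w = likelihood(particles, obs)
    w /= w.sum()
    # 3) resample the ensemble according to the weights
    particles = rng.choice(particles, size=n, p=w)

estimate = particles.mean()
```

Because the ensemble can be multi-modal, clutter that momentarily looks like the target only captures some of the samples, which is the source of the robustness claimed above; the cost is that every sample requires its own image measurement.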

Motion (optical flow) (LG)

Estimation of optical flow is a well-known problem in computer vision, and numerous techniques have been developed and are being investigated. The basic idea is to build up a dense motion field over the entire image given two consecutive images. Corner features provide complete 2D optical flow information, and edge boundaries provide estimates perpendicular to the edge. Using continuity assumptions and regularization terms, it is possible to compute a dense optical flow field. This kind of information could be useful for image segmentation or for tracking applications.

Strength: can provide a dense dataset to be used as input for both segmentation and tracking.

Weakness: computationally expensive; can run in real-time only with dedicated hardware. Works best with textured surfaces.
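The brightness-constancy idea underlying these techniques can be illustrated with a toy single-window, pure-translation estimate in the least-squares style of Lucas and Kanade (the synthetic pattern, window, and one-pixel motion are assumptions of this sketch):

```python
import numpy as np

ys, xs = np.mgrid[0:64, 0:64]
frame0 = np.sin(xs / 3.0) + np.cos(ys / 4.0)   # a smooth textured pattern
frame1 = np.roll(frame0, shift=1, axis=1)      # whole image moves 1 px right

# spatial and temporal derivatives
Iy, Ix = np.gradient(frame0)
It = frame1 - frame0

# least-squares solution of  Ix*u + Iy*v = -It  over a central window
win = (slice(16, 48), slice(16, 48))
ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
A = np.array([[ix @ ix, ix @ iy],
              [ix @ iy, iy @ iy]])
b = -np.array([ix @ it, iy @ it])
u, v = np.linalg.solve(A, b)                   # u: x-flow, v: y-flow
```

The 2x2 matrix A is well-conditioned only where the window contains gradients in both directions, which restates the observations above: corners give full 2D flow, edges give only the perpendicular component, and untextured regions give nothing.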

Elastic Graphs (EYE)

Feature Tracking (EYE)

?Dynamical Motion models (LG)

(no solid results yet)

Background subtraction (LG)

A procedure for performing figure-ground segmentation based on background color statistics. If the pixel-wise color statistics of the background are known, and the colors of the user's body and clothes are sufficiently distinct from those of the background, a pixel-wise segmentation can be calculated. Adaptive acquisition of the background statistics can increase robustness to lighting changes.

Strength: simple, fast

Weakness: provides only a figure-ground segmentation; cannot segment different body parts or objects
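A sketch of the procedure (the image size, scalar pixel values, and 4-sigma decision threshold are illustrative assumptions):

```python
import numpy as np

# Learn per-pixel background statistics from frames of the empty scene.
rng = np.random.default_rng(3)
bg_frames = 0.4 + 0.02 * rng.standard_normal((30, 16, 16))
mean = bg_frames.mean(axis=0)
std = bg_frames.std(axis=0) + 1e-6     # avoid division by zero

# A new frame in which a "user" region appears against the background.
frame = mean.copy()
frame[4:10, 4:10] = 0.9

# A pixel is foreground if it deviates from its background model by > 4 sigma.
foreground = np.abs(frame - mean) / std > 4.0
```

To make the scheme adaptive, pixels classified as background would be folded back into the running mean and variance each frame, which is what provides the robustness to slow lighting changes mentioned above.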

Articulated model-based trackers (LG)

(Perona, Caltech)

Recursive estimation of position and pose of an articulated body can be computed using a 3D model of the body and by making measurements on the image at locations specified by the current estimate of pose. Such a system tracking the human arm in realtime from a monocular view has been built, and has a resolution of 1 cm.

Strength: efficient recursive estimation. Implementable in real-time. Provides 3D estimates of position and pose directly.

Weakness: dependent on an accurate 3D model of the body, current system cannot handle loose clothing. May need to integrate different types of image measurements to increase robustness.

2D realtime video (for segmentation) (?LG/EYE?)

- not really a vision module, but just the raw data.

subpixel image segmentation (?JYB?)

(?how?)

Texture (segmentation) (LG)

(Perona, Caltech) Efficient multiscale methods of performing texture segmentation based on the collective response of a set of convolution kernels have been developed. Region growing of image areas with similar responses leads to the detection of texture boundaries and the labelling of separate texture regions.

Strength: provides detailed 2D segmentation of the image.

Weakness: computationally expensive; requires body/objects to be textured
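A toy sketch of the filter-bank idea (two tiny hand-made kernels and synthetic stripe textures, nothing like the actual multiscale kernel set): each texture is summarized by the local energy of the kernel responses, and textures with different response vectors can be separated.

```python
import numpy as np

def local_energy(img, kernel):
    """Mean squared response of a 3x3 kernel over the whole patch."""
    h, w = img.shape
    responses = [np.sum(img[y:y+3, x:x+3] * kernel) ** 2
                 for y in range(h - 2) for x in range(w - 2)]
    return float(np.mean(responses))

kernels = [
    np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], float),  # vertical structure
    np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], float),  # horizontal structure
]

ys, xs = np.mgrid[0:16, 0:16]
vertical_stripes = (xs % 4 < 2).astype(float)    # texture A
horizontal_stripes = (ys % 4 < 2).astype(float)  # texture B

feat_a = [local_energy(vertical_stripes, k) for k in kernels]
feat_b = [local_energy(horizontal_stripes, k) for k in kernels]
```

The two textures produce opposite response vectors, so region growing on these features would place a boundary wherever the dominant response changes.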

Edge detection (LG)

The classical computer vision operation of edge detection can provide features to be used by higher-level systems. An explicit search for edges over the entire image is unlikely to be very useful; typically, edge detection is done implicitly in the measurement process of a tracking or segmentation algorithm.
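For reference, the classical operation itself, here sketched with Sobel kernels on a synthetic step edge (the tiny image and brute-force filtering loop are for illustration only):

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
sobel_y = sobel_x.T

def apply_kernel(img, k):
    """Slide a 3x3 kernel over the image (valid region only)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(img[y:y+3, x:x+3] * k)
    return out

img = np.zeros((10, 10))
img[:, 5:] = 1.0                  # a vertical step edge between columns 4 and 5

gx = apply_kernel(img, sobel_x)
gy = apply_kernel(img, sobel_y)
magnitude = np.hypot(gx, gy)

# columns of the (valid) output where edge energy appears
edge_columns = np.nonzero(magnitude.max(axis=0))[0]
```

Only the two output columns whose windows straddle the step respond, and gy is zero everywhere since the edge is purely vertical; this per-window locality is why trackers evaluate edge measurements only along predicted contours rather than over the whole image.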

registration/alignment techniques (LG)

(Viola, MIT, among many others) Several techniques exist for determining the position and pose of an object. Viola's technique obtains registration by maximizing mutual information (an information-theoretic approach). Very robust to lighting conditions, but computationally expensive.
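A toy illustration of registration by maximizing mutual information (heavily simplified relative to Viola's method: integer shifts only, a synthetic scene, and a histogram-based MI estimate): even when the object's intensities are remapped, as under changed lighting, MI still peaks at the correct alignment because the statistical dependence between template and image survives the remapping.

```python
import numpy as np

def mutual_information(a, b, bins=8):
    """Histogram-based estimate of the mutual information of two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

rng = np.random.default_rng(4)
scene = rng.random((32, 32))
# The template is the patch at (10, 8) with INVERTED intensities,
# mimicking a drastic lighting change.
template = 1.0 - scene[10:26, 8:24]

scores = {}
for dy in range(8, 13):
    for dx in range(6, 11):
        patch = scene[dy:dy+16, dx:dx+16]
        scores[(dy, dx)] = mutual_information(patch, template)

best = max(scores, key=scores.get)   # alignment maximizing MI
```

An SSD-based matcher would fail outright here, since the inverted template matches nothing in intensity; the MI score instead rewards any deterministic relationship between the two, which is the source of the lighting robustness, and also of the expense, since a joint histogram must be built per candidate pose.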

------

2 pages - chart of possible methods for hand, object, and upper body tracking

Upper body tracking:

Integration of various low-level image measurement modules with techniques for bootstrapping, registration, and 3D tracking should be able to deliver a reliable and robust system for estimating the pose and position of the upper torso and arms. Several independent research groups throughout the world have solved subproblems of this task, and it seems reasonable to expect that, over the next few years, further research coupled with integration and development on computational substrates of ever-increasing power will yield reliable systems working in real-time.

Below is a block diagram outlining a possible integration scheme for 3D body position and pose estimation.

(Insert Upper Body Tracking block diagram here)

Hand Tracking:

Estimating the position and pose of a human hand is a particularly difficult task for several reasons:

- it typically occupies a very small region of a video image, so that the resolution with which it is observed is quite low.

- it has a complicated geometrical and mechanical structure with upwards of 17 degrees of freedom, and its appearance in the image can vary drastically depending on orientation and internal pose.

- it can change pose quite rapidly, as well as undergo very quick translations within the scene.

For these reasons, there are no known methods capable of tracking the hand in an environment as unrestricted as that of our intended application domain. Research to date on hand tracking and hand gesture recognition has been done under the simplifying assumptions that the hand occupies a large portion of the image, that hand poses are stationary or very slowly changing, and that a frontal or side view of the hand is available.

The chart below outlines possible approaches to tackle this difficult problem.

(Insert Hand Tracking block diagram here)

As a fallback, the estimation process can be circumvented by direct transmission of a small area of interest in the video where the hand is known to lie (via color blob tracking, or other low resolution tracking and detection schemes).

Object Tracking

In our intended application domain, hand-held object tracking is also an unsolved problem. The following chart outlines possible approaches to be investigated and developed. The key steps required are:

- detection and identification based on a pre-established catalog of known objects

- tracking and registration, taking into account occlusion generated by the hand holding the object.

As with the hand pose estimation scheme, the fallback scheme consists of direct transmission of the small area of interest in the video containing the hand and object.

(Insert Object Tracking block diagram here)

------

Roadmap of development.

stages of increasing difficulty

1) upper body and arms tracking

2) hand tracking / pose estimation

3) hand-held object tracking

main stages of development

1) feasibility studies and prototyping of subsystems in off-line mode

2) integration of subsystems, off-line and real-time, on a high-powered computational substrate (ONYX machine)

3) integration of subsystems on custom hardware.

------

1/4 page bios.

Luis Goncalves is currently a PhD student at the California Institute of Technology, expecting to graduate in December 1998. His thesis work deals with the analysis and synthesis of human motion from visual input. In conjunction with Enrico Di Bernardo (postdoc at Caltech) and Pietro Perona (professor at Caltech), he has developed a system able to track the movement of a human arm in real-time (30Hz) and in 3D using only the input from a single camera. Currently he is developing methods of learning dynamical and probabilistic models of human motion from large datasets of visual input. These models will be useful both for making body-tracking systems more robust and for automatically synthesizing realistic human motion.