
eNTERFACE’10 Project Proposal
Vision based Hand Puppet

Leaders: Lale Akarun, Rainer Stiefelhagen, Hazım Kemal Ekenel
Participants: Cem Keskin, İsmail Arı, Lukas Rybok, Furkan Kıraç, Yunus Emre Kara, Neşe Alyüz


In this project, a virtual 3D puppet will be animated by visually tracking the bare hand of a performer. The performer will manipulate the digital puppet via predefined hand movements that are akin to the ones used in traditional hand puppetry. Hand puppets are traditionally worn like a glove on the hand and controlled by moving fingers, which are directly translated into movements of the limbs and body parts of the puppet. In this digital version of the same act, the performer will not use actual puppets, gloves or markers. Instead, the bare hand of the performer will be tracked by one or more cameras in real-time, and the estimated hand posture will be mapped in an intermediary step to the animation parameters of the digital puppet. Also, upon performing some predefined hand gestures, i.e. some combinations of changes in hand posture and motion, the system will initiate complex sequences of animations that will enrich the performance. The puppet will be immediately animated, allowing the performer to have visual feedback. Moreover, using a separate camera, the facial expressions of the performer will be tracked, recognized and mapped onto the puppet as well.

Project Objectives

The main objective of this project is to design and implement a real-time vision-based hand tracking system that does not rely on other sensors or markers. To establish a working system, a large database will be created during the workshop. To visualize the results and emphasize the real-time efficiency of the system, a digital hand puppetry system will be realized.

Correctly tracking all the joints and angles of the hand is a hard problem, since hands are articulated objects with many degrees of freedom. Moreover, digital puppetry implies that the process should be real-time and very robust (as the actual aim is entertainment), further complicating the problem. Therefore, most approaches in the literature involve some kind of markers (colored bands, gloves) or sensors (data gloves, accelerometers).

Some recent works suggest that it is possible to estimate the hand parameters with a high degree of precision in real-time. These works mostly rely on data-driven modeling of the hand parameters, and their success therefore depends mainly on the dataset. Inspired by these approaches, we will collect (or synthetically generate) a large database of precise hand postures as a part of the project.

Such a database will prove to be useful not only for this project, but for many other applications that rely on hand gesture tracking and recognition, like sign language and HCI.

In order to ensure that the end result is eye pleasing, noise will be filtered using traditional methods such as the Kalman filter. Moreover, if the tracking and estimation process is not fast enough (which is the likelier scenario, given the complexity of the problem), its frame rate will cause the animation to stutter or jump. Therefore, a control hierarchy will be defined. For instance, at the coarsest level, the centroid and rotation of the hand, i.e. the global parameters, will be estimated. These parameters can be directly mapped to the puppet to guarantee a fluent animation with a higher response rate. The rest of the parameters will then be estimated in parallel with a lower priority and mapped onto the puppet. The virtual puppet can be designed to make the most of this approach, allowing the same control hierarchy to be directly mapped onto its limbs.
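The smoothing step can be illustrated with a minimal one-dimensional constant-velocity Kalman filter applied to a single animation parameter; the class, noise values and measurement stream below are illustrative placeholders, not part of the planned implementation.

```python
# Minimal 1D constant-velocity Kalman filter for smoothing one noisy
# animation parameter (e.g. a joint angle). All numbers are illustrative.

class Kalman1D:
    def __init__(self, q=1e-3, r=1e-2):
        self.x = [0.0, 0.0]                 # state: [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]   # state covariance
        self.q = q                          # process noise
        self.r = r                          # measurement noise

    def step(self, z, dt=1.0):
        # Predict with F = [[1, dt], [0, 1]]: x <- F x, P <- F P F^T + Q.
        px = self.x[0] + dt * self.x[1]
        pv = self.x[1]
        P = self.P
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        # Update with a scalar measurement z of the position component.
        s = p00 + self.r                    # innovation covariance
        k0, k1 = p00 / s, p10 / s           # Kalman gain
        y = z - px                          # innovation
        self.x = [px + k0 * y, pv + k1 * y]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]

kf = Kalman1D()
noisy = [0.0, 0.1, 0.22, 0.28, 0.41, 0.5, 0.61, 0.7]
smoothed = [kf.step(z) for z in noisy]
```

In the full system, one such filter (or a joint multivariate one) would run per animation parameter, with the coarse global parameters updated at a higher rate than the finger joints.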

Mapping the hand parameters to the puppet allows the performer to control the limbs of the puppet separately. Yet this mode of operation does not simplify the task of creating interesting and eye-pleasing animations. As an extra challenge, we will therefore recognize communicative hand gestures, which will trigger predefined animation sequences and thus allow high-level manipulation of the puppet.

Finally, we will track the face of the performer (or a secondary performer) using a separate camera and recognize certain facial expressions, which will then be mapped onto the face of the puppet directly. This will further enhance the quality of the performance and enable the performer to incorporate expressions in the act.

So, the objectives of this project are:

  1. Collection of a hand posture database and creation of its ground truth
  2. Simple, fast tracking of the hand and face from multiple cameras
  3. Estimation of the approximate 3D parameters of the hand
  4. Tracking of the facial points
  5. Development of a digital hand puppet that can be controlled by hand movements, hand gestures and facial expressions.

Background Information

There are several phases associated with this project. We will list these phases and provide possible approaches for each step.

The first phase consists of the selection of hand posture and movement classes to be included in the database, collecting this data and generating the ground truth for it. Ground truth generation is especially hard for an object as articulated as the human hand. Given the time constraints of the workshop, we may opt to generate this data synthetically using photorealistic hand models, which directly solves the ground truth generation problem. The main concern with this approach is how well the synthetic models will represent real hands, as the end product will have to interpret actual hand images.

The second phase is the most complex part of the project. As we specifically want to omit markers and special devices and still operate in real-time, we will consider methods that are very fast and robust and that rely only on vision, possibly using multiple cameras. [1] is an extensive survey on vision-based full-DOF hand motion estimation. According to this survey, the fastest approach that is also accurate is the work of Shimada et al. [2]. This paper proposes a system that estimates arbitrary 3D human hand postures in real-time; it accepts not only predetermined hand signs but also arbitrary postures, and it works with a monocular camera. The method is based on the 2D appearance of the hand: the system extracts appearance-based features from the hand image for each frame and looks up the corresponding 3D hand shape in a database consisting of several thousand 2D projections of 3D hand models with different parameters.
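The retrieval idea behind such appearance-based methods can be sketched as a nearest-neighbor lookup; the feature vectors, joint-angle labels and distance metric below are made-up placeholders, not Shimada et al.'s actual features or database.

```python
# Sketch of appearance-based posture retrieval: each database entry
# pairs a 2D appearance feature vector with the 3D joint angles that
# produced it; at runtime the nearest entry yields the posture estimate.
import math

def nearest_posture(query, database):
    """database: list of (feature_vector, joint_angles) pairs."""
    best, best_d = None, float("inf")
    for feat, angles in database:
        d = math.dist(query, feat)   # Euclidean distance in feature space
        if d < best_d:
            best, best_d = angles, d
    return best

db = [
    ([0.1, 0.9, 0.3], {"index_mcp": 10.0, "index_pip": 5.0}),   # open hand
    ([0.8, 0.2, 0.7], {"index_mcp": 80.0, "index_pip": 95.0}),  # closed fist
]
print(nearest_posture([0.75, 0.25, 0.6], db))  # closest to the fist entry
```

A practical system would replace the linear scan with an indexing structure, since the database holds thousands of entries and the lookup must run once per frame.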

A variant of this approach that uses multiple cameras to first estimate the volumetric distribution of the hand has been proposed by Chik et al. [3]. Reportedly, however, this approach does not work in real-time, taking about two seconds per frame.

We will follow the same approach as Shimada et al., using the database generated in the first phase to match the 2D parameters of the hand to the 3D ground truth. To extend this work, we will make use of multiple cameras and particle filters. Using the result of Shimada’s work as an initial guess, we will further enhance the results using particle filters.

Particle Filters (PF) have recently been successfully incorporated into model-based tracking scenarios. Bandouch et al. [4, 5] used the PF methodology for tracking 3D human pose configurations in near real-time using multiple cameras. Given an initial 3D object configuration, the same method is suitable for fine-tuning the 3D model parameters, owing to its high convergence rate. We will use this method to enhance and filter the results. Using several cameras will also help resolve the ambiguities that cause most of the erroneous results in Shimada's work.
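The refinement idea can be sketched for a single parameter as follows; the Gaussian likelihood, parameter values and sampling schedule are illustrative stand-ins, since a real system would score rendered hand silhouettes against the camera images rather than evaluate a known target.

```python
# Toy particle-filter refinement around an initial parameter guess:
# particles are perturbed copies of the guess, weighted by an
# observation likelihood, then resampled and jittered.
import math
import random

def refine(initial, observe_likelihood, n=300, spread=1.0, iters=3):
    particles = [initial + random.gauss(0.0, spread) for _ in range(n)]
    for _ in range(iters):
        weights = [observe_likelihood(p) for p in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Resampling concentrates particles on likely regions.
        particles = random.choices(particles, weights=weights, k=n)
        particles = [p + random.gauss(0.0, spread / 4) for p in particles]
    return sum(particles) / n      # posterior mean as the refined estimate

true_angle = 42.0
likelihood = lambda p: math.exp(-0.5 * (p - true_angle) ** 2)
print(refine(40.0, likelihood))    # should move toward 42.0
```

In the multi-camera setting, the likelihood would combine the evidence from all views, which is what suppresses the single-view ambiguities mentioned above.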

Since we have multiple cameras and a distributed system, we will also pursue the next logical step and try to estimate the 3D point cloud corresponding to the hand in real-time. For this, the cameras will need to be calibrated. Calibration is a well-researched problem, and most vision libraries have dedicated functions for this task. Once the cameras are calibrated, the remaining task is to find matching feature point pairs from separate views and reconstruct the 3D coordinates.
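The reconstruction step amounts to standard linear (DLT) triangulation from two calibrated views; the two toy projection matrices and the test point below are illustrative, not an actual camera setup.

```python
# Linear (DLT) triangulation of one 3D point from two calibrated views.
# P1, P2 are 3x4 projection matrices; (u, v) are matched pixel
# coordinates of the same physical point in each image.
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    # Each view contributes two rows of A X = 0 for homogeneous X.
    rows = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                     # null vector of A
    return X[:3] / X[3]            # de-homogenize

# Two toy cameras: identity intrinsics, second camera shifted 1 unit in x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
uv1 = (X_true[0] / X_true[2], X_true[1] / X_true[2])
uv2 = ((X_true[0] - 1.0) / X_true[2], X_true[1] / X_true[2])
print(triangulate(P1, P2, uv1, uv2))  # recovers approximately [0.5, 0.2, 4.0]
```

With calibrated cameras, applying this to every matched skin-colored feature pair yields the 3D point cloud used by the tracker.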

The third phase consists of detecting the hand and tracking its position in 3D, as well as 3D gesture recognition. Since we want our system to work in an unrestricted scenario, the tracker needs to be robust to the presence of other skin-colored objects. To achieve this, we will use a particle filter framework with the extracted 3D point clouds of skin-colored objects as part of the observation model. Similar approaches [10, 11] show promising results but cannot deal with occlusions, i.e. when the tracked hand occludes another skin-colored object, the track may be lost to the occluded object.

Occlusion maps have been introduced by Lanz et al. [12] to cope with this problem in the context of multi person tracking. However, since their approach is computationally too expensive to be employed in our system, we will simplify it to fit to our scenario.

For initialisation purposes and as an additional cue for the hand tracker, we will build a hand detector using the database from the first phase. This can be done in a fashion similar to the approach of Ong et al. [8], by training a boosted classifier to detect the hand in different poses. Recently, Froba et al. [9] presented a fast and accurate object detection algorithm that is part of publicly available libraries and can therefore be used to build such a hand detector. Since the detector outputs not only the 2D position of the hand but also a rough estimate of the hand posture, its results may also be used to reduce the search space of the hand posture estimation component.
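The decision rule of such a boosted classifier can be sketched as a weighted vote of weak learners (decision stumps on simple image features); the stumps, weights and feature values below are made-up placeholders, not a trained detector or the census-transform features of [9].

```python
# Sketch of a boosted classifier's decision rule: weighted decision
# stumps on scalar features vote on whether a window contains a hand.

def stump(feature_index, threshold, polarity):
    """Weak learner: +1 if polarity * feature exceeds polarity * threshold."""
    def h(x):
        return 1 if polarity * x[feature_index] > polarity * threshold else -1
    return h

def boosted_score(x, weak_learners):
    """weak_learners: list of (alpha, stump) pairs; sign of the sum decides."""
    return sum(alpha * h(x) for alpha, h in weak_learners)

detector = [
    (0.8, stump(0, 0.5, +1)),   # e.g. an edge response across the palm
    (0.5, stump(1, 0.3, -1)),   # e.g. low response in the background
    (0.3, stump(2, 0.7, +1)),
]
window = [0.9, 0.1, 0.8]        # made-up feature values for one window
is_hand = boosted_score(window, detector) > 0
print(is_hand)                  # prints True for this window
```

Training picks the stumps and their weights alpha from labeled examples; arranging several such classifiers into a cascade is what makes the sliding-window search fast enough for real-time use.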

The hand features to be extracted will be chosen according to the exact method used in phase two. For hand gesture recognition, we will use a changepoint model that is currently under development and that, in preliminary tests, performed much faster than HMMs with greater recognition accuracy.

The face of the performer will also be detected in each frame, and the facial expression will be tracked. Public libraries already include a robust face detector [6], which will be used for face initialization in this project. The same approach applies to detecting facial components such as the eyes and mouth. Since the puppet's positioning and mouth movements will be animated using the hands, we are mainly interested in tracking the upper face (especially the eyes and eyebrows) to control the emotional mood of the puppet. Therefore, a few facial expressions, such as neutral, happy, sad, surprised and angry, will be selected and mapped to the puppet's expression. We will track facial features using an enhanced multi-view, multi-resolution Active Shape Model [7] that we have already developed. The model will be trained mainly on the upper faces of the people in the database we collect, and real-time tracking of the expression will be performed.

The rest of the phases are considerably easier, as they are either common practices in most gesture recognition applications, or simple design choices regarding the puppet.

Technical Description:

We have explained the phases of the project above. These can be summarized as:

  1. Database design, data collection and ground truth generation.
  2. Training of the hand posture models and implementation of the estimation methods.
  3. Implementation of robust, vision-based hand and face tracking methods.
  4. Filtering of the parameters and estimation of missing data.
  5. Design of a digital puppet with sufficient degrees of freedom that inherently makes use of some control hierarchy.
  6. Determining a map between the hand and expression parameters and the control parameters of the puppet.
  7. Animation of the puppet in real-time.

We have organized these tasks into the following workpackages:

WP1: Data collection and ground-truth creation

WP2: Hand posture modeling and training

WP3: Vision based hand tracking and gesture recognition

WP4: Vision based facial expression recognition

WP5: Creating and animating puppets

WP6: Management

Resources needed:

WP1: The synthetic database will be created before the workshop starts. For the actual data collection phase, multiple simple cameras and a plain background will be needed. The cameras can be standard 640x480, 30 fps webcams, preferably with high image quality. Ground truth labeling is a time-consuming process, and dedicated staff will therefore be needed for this task.

WP2: Shimada’s work will be thoroughly examined and also implemented before the start of the workshop, if at all possible. All implementations should attempt to make use of GPUs. A dedicated PC with a good GPU is necessary, as well as the cameras used for the data collection task.

WP3 & WP4: For vision based algorithms a separate multicore PC will be used. The cameras will be directly connected to this PC. For communication between the PCs, a sensor network framework such as Smartflow will be used. As face and hand tracking are independent processes, the multicore architecture will be utilized accordingly.

WP5: Staff with graphics and animation background is needed for this work package. The final animation should also be run on a separate server, which collects all the animation parameters separately. This PC should also have a decent graphics card.

Work plan and implementation schedule

Pre-workshop / Week 1 / Week 2 / Week 3 / Week 4
WP1: Data / Synthetic database (DB) creation / Database design for the actual DB / Ground truth (GT) labeling / GT labeling / GT labeling
WP2: Modeling / Implementation of Shimada’s work on the synthetic database / Extending the algorithm using particle filters / Extending the algorithm using 3D point clouds / Enhancements and tests / Enhancements and tests on the actual DB
WP3: Tracking and gesture recognition / Implementation of the changepoint model for gesture recognition / System setup; hand and face detection; camera calibration / Feature extraction for hands and faces; 3D point cloud generation / Filtering / Enhancements and tests
WP5: Animation / Selection of a control hierarchy and a suitable puppet / Creation and visualization of the puppet model / Mapping hand parameters to the puppet; tests with synthetic data / Animation with GT / Animation with tracked points
WP6: Management / – / Coordination meeting (CM) / CM, midterm presentation / CM, report writing / CM, report writing, final presentation

Benefits of the Research

The project will produce the following deliverables:
D1: Synthetic hand database

D2: Hand database from multiple cameras with partial Ground truth

D3: Hand tracker

D4: Hand gesture recognizer

D5: Facial point tracker and gesture recognizer

D6: Puppet model and vision based animation


[1] Erol, A., Bebis, G.N., Nicolescu, M., Boyle, R.D., Twombly, X.: “A review on vision-based full DOF hand motion estimation”. In: Vision for Human-Computer Interaction, vol. III, p. 75 (2005)

[2] N. Shimada, K. Kimura, and Y. Shirai. “Real-time 3d hand posture estimation based on 2d appearance retrieval using monocular camera”. In ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pages 23–30, Vancouver, BC, Canada, 2001. IEEE.

[3] Desmond Chik, Jochen Trumpf, and Nicol N. Schraudolph. “3D Hand Tracking in a Stochastic Approximation Setting”. In 2nd Workshop on Human Motion: Understanding, Modeling, Capture and Animation, 11th IEEE Intl. Conf. Computer Vision (ICCV), pp. 136–151, Springer Verlag, Berlin, Rio de Janeiro, Brazil, 2007.

[4] Bandouch J, Beetz M. “Tracking Humans Interacting with the Environment Using Efficient Hierarchical Sampling and Layered Observation Models.” IEEE Int. Workshop on Human-Computer Interaction (HCI). 2009.

[5] Bandouch J, Engstler F, Beetz M. “Evaluation of hierarchical sampling strategies in 3d human pose estimation”. Proceedings of the 19th British Machine Vision Conference (BMVC). 2008.

[6] P. Viola and M. J. Jones, "Robust Real-Time Face Detection", International Journal of Computer Vision (IJCV), 2004.

[7] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models-their training and application,” Computer Vision Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.

[8] Ong, E.J., Bowden, R. "A boosted classifier tree for hand shape detection". Proceedings of the IEEE Intl. Conf. on Automatic Face and Gesture Recognition, 2004.

[9] Froba, B., Ernst, A. "Face detection with the modified census transform". Proceedings of the IEEE Intl. Conf. on Automatic Face and Gesture Recognition, 2004.

[10] Azad, P. "Visual Perception for Manipulation and Imitation in Humanoid Robots", pages 164-169, 2009.

[11] Nickel, K., Stiefelhagen, R. "Visual Recognition of Pointing Gestures for Human-robot Interaction". Image and Vision Computing, Vol. 25, Issue 12, pages 1875-1884, 2007

[12] Lanz, O. "Approximate Bayesian Multibody Tracking". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, Issue 9, pages 1436-1449, 2006

Profile of team:

Dr. Rainer Stiefelhagen is a Professor at the Universität Karlsruhe (TH), where he is directing the research field on "Computer Vision for Human-Computer Interaction". He is also head of the research field "Perceptual User Interfaces" at the Fraunhofer Institut for Information and Data Processing (IITB) in Karlsruhe. His research focuses on the development of novel techniques for the visual and audio-visual perception of humans and their activities, in order to facilitate perceptive multimodal interfaces, humanoid robots and smart environments. In 2007, Dr. Stiefelhagen was awarded one of the currently five German Attract projects in the area of Computer Science funded by the Fraunhofer Gesellschaft. His work has been published in more than one hundred publications in journals and conferences. He has been a founder and Co-Chair of the CLEAR 2006 and 2007 workshops (Classification of Events, Activities and Relationships) and has been Program Committee member and co-organizer in many other conferences. Dr. Stiefelhagen received his Doctoral Degree in Engineering Sciences in 2002 from the Universität Karlsruhe (TH).

Dr. Lale Akarun is a professor of Computer Engineering at Bogazici University. Her research interests are face recognition and HCI. She has been a member of the FP6 projects Biosecure and SIMILAR, and of national projects on 3D Face Recognition and Sign Language Recognition. She currently has a joint project with Karlsruhe University on the use of gestures in emergency management environments, and one with the University of Saint Petersburg on an Info Kiosk for the Handicapped. She has actively participated in eNTERFACE, leading projects in eNTERFACE06 and eNTERFACE07, and organizing eNTERFACE07.