Gesture Recognition in a Classroom Environment
By Michael Wallick
Submitted as partial requirement for
CS 766 (Computer Vision)
Fall 2002
1. Introduction
Gestures are a natural and intuitive way for people to communicate non-verbally. Gestures can also be used to interact with a computer system, without having to learn how to operate different devices. Gesture recognition in a surveillance system can provide a computer with information about what people are doing in a given scene.
In an automated editing system, the gestures of the actors can help drive the focus of the camera. This becomes especially important in a classroom environment. The gestures of the lecturer (in essence the only actor) help to indicate what is important in the scene, such as specific parts of the board, during the lecture. While these gestures give clues about what is important, the gesture itself can also obscure that information (for example, a pointing hand covering the writing it points at). A good automated editing system needs to be aware of this and correct for it.
In this paper, I will present a gesture recognition algorithm that can be used in a classroom environment for the purposes of lecture understanding, with an emphasis on automated video editing of the classroom lecture. In particular, the method recognizes “pointing and reaching” gestures. Because both of these gestures can be confused with writing, and in order to avoid situations where the lecturer blocks important information with his or her gestures, I will introduce the concept of “board regions.” These regions are a partitioning of the board into semantically linked groups of writing. With knowledge of the attributes of these regions (such as the times a region is first drawn, stops being drawn, or is erased), pointing and reaching can be disambiguated from writing, and the system can get a better idea of what part of the board is being pointed at.
This paper is structured in the following way. The next section is a brief survey of existing gesture recognition techniques, including template based approaches (the approach used in this paper). Following that is a discussion of Virtual Videography, the automated video editing project that this work is being done for. After the Virtual Videography section is a description of the “board regions.” This will include a formal definition of the region as well as a computer vision technique for finding these regions. A description of the gesture recognition algorithm that was implemented follows the region section. Finally, I will conclude with a discussion of how the gestures and the regions can be used together to get a better understanding of the classroom lecture and improve the automatic editing results.
2. Gesture Recognition
Gesture recognition has been a hot topic in computer science in recent years. There are many different techniques for achieving this goal, including template matching, neural networks, and statistical approaches. Any of these approaches can use either special tracking equipment or tracking based on computer vision [Watson93]. Special equipment will generally give better results and be easier to implement than computer vision techniques; however, it will also tend to be more expensive, intrusive, and restrictive. In this section, I will briefly outline the different techniques.
The first means of gesture recognition is a template-based approach. The raw input data is fed to the computer, compared against a set of known gestures, and classified as the gesture it best fits. This approach has many advantages, including that such a system is simple to implement and maintain. The disadvantages are that it can be thrown off by noise, either in the templates or in the data being classified, and that it can be very dependent on the background as well as the clothing worn by the actor whose gestures are being recognized [Watson93]. For this project I have constructed a template-based gesture recognition system, and have addressed several of these problems in my implementation.
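To make the template-based approach concrete, the following is a minimal Python sketch (an illustration only, not the system implemented for this project) that classifies an unknown gesture as the nearest stored template. It assumes each gesture has already been reduced to a fixed-length feature vector; the labels and toy data are hypothetical.

import numpy as np

# Each known gesture is stored as a feature vector (e.g. a flattened
# silhouette image); an unknown sample is assigned the label of the
# closest template.
def classify_gesture(sample, templates):
    # templates: dict mapping gesture label -> feature vector
    best_label, best_dist = None, float("inf")
    for label, template in templates.items():
        dist = np.linalg.norm(sample - template)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy example: two 4-element "templates" and one unknown sample.
templates = {
    "point": np.array([1.0, 0.0, 0.0, 1.0]),
    "reach": np.array([0.0, 1.0, 1.0, 0.0]),
}
print(classify_gesture(np.array([0.9, 0.1, 0.0, 0.8]), templates))  # point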
A method for gesture recognition that is more advanced than template-based matching is the statistical approach. With a statistical approach, models of the gestures are employed to help classify unknown gestures. These models can include Hidden Markov Models (HMMs) or Dynamic Time Warping (DTW). The known gestures are stored in the models, and pattern recognition algorithms are used to identify the gestures. The difficulties with this scheme are that the models can be hard to implement and slow to look up gestures [Martin99].
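As an illustration of one of these models, the following is a minimal sketch of the classic dynamic-programming formulation of the DTW distance between two one-dimensional feature sequences; it is not drawn from [Martin99] or from the system described here.

import numpy as np

def dtw_distance(a, b):
    # D[i, j] holds the cheapest cost of aligning a[:i] with b[:j].
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignment moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Sequences tracing similar shapes at different speeds align cheaply.
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 2, 1, 0]))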
Similar to statistical methods of gesture recognition is the use of a neural network. The network is trained on different gestures and can then apply the pattern recognition techniques that neural networks are known for. As with other statistical approaches, a neural network is difficult to implement. Additionally, it requires many training examples, sometimes in the thousands, for each gesture [Fels93].
3. Virtual Videography
The main motivating factor for this project is Virtual Videography [Gleicher00, Gleicher02]. In Virtual Videography, a camera is placed in an unobtrusive location in a classroom and aimed at the board; the resulting footage is boring and difficult to watch. It is our goal to have a system that automatically edits this footage producing a useful video that is easy to watch and learn from.
For both the region finding (described in the next section) and the gesture recognition, the data set consisted of video taken from CS 559 (Introduction to Computer Graphics) in the Fall of 2001, taught by Michael Gleicher. Two cameras were placed close to each other in the back of the classroom and pointed towards the board. The professor is the only obstruction between the camera and the chalkboard.
In the following sections, I will formally describe a region, and then give an algorithm for extracting the regions from a videotape of a classroom lecture. After that, I present a method for using the same video to analyze the gestures that are being made by the professor.
4. Regions and Region Finding
In their work, Onishi et al. [Onishi00] introduce a method for segmenting blackboards based on an idea called written rectangles. The concept of regions can be viewed as an extension of the written-rectangles idea, as it differs from and expands on their work in several ways. First, it gives a more formal definition of a region. Second, the region concept applies over a more general domain of images; theirs is only applied and tested on a blackboard. Finally, a different algorithm is used for extracting regions from marker boards and chalkboards, and it can easily be extended to other domains.
Intuitively, we can think of a region as a partitioning of the surface over time and space, where each partition represents a single thought or idea that was written down. More formally, we define a region to be a collection of strokes such that all strokes in the region are related to each other.
Notice that the region definition requires a semantic understanding of what is written. Current technology does not provide a means of automatically extracting this information. Instead, we make the observation that regardless of the overall structure of the writing, related ideas will still be grouped together. Using this observation, the regions can be approximated by grouping together strokes that occur close together in either time or space.
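One way to realize this approximation is sketched below, assuming each stroke has been reduced to a timestamp and a board position; the stroke representation and the two thresholds are assumptions for illustration, not the implemented system.

TIME_GAP = 5.0    # seconds between strokes of one region
SPACE_GAP = 50.0  # pixels between strokes of one region

def group_strokes(strokes):
    # strokes: list of (time, x, y) tuples in drawing order
    regions = []  # each region is a list of strokes
    for t, x, y in strokes:
        placed = False
        for region in regions:
            lt, lx, ly = region[-1]  # most recently added stroke
            near_time = (t - lt) <= TIME_GAP
            near_space = ((x - lx) ** 2 + (y - ly) ** 2) ** 0.5 <= SPACE_GAP
            if near_time or near_space:
                region.append((t, x, y))
                placed = True
                break
        if not placed:
            regions.append([(t, x, y)])  # first stroke of a new region
    return regions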
A specific “lifespan” is applied to the region model. Each region is “born,” “grows,” “matures,” and eventually “dies.” A region will be in each of the states, in the above order, for some time during its life. When a stroke appears close to an existing region, the state of the region combined with the distance the stroke is from the region determines how the stroke is processed. The following is an in-depth explanation of each of these states:
Birth: A region is born when the first stroke belonging to a new region is drawn. In our model we assume that the writing surface either starts clean or originally contains irrelevant writing left over from the previous use. Therefore, all regions have a birth and cannot exist prior to the start of the event that is making use of the surface.
Growth: Birth is immediately followed by growth, which is characterized by writing being added to the region for up to some fixed amount of time. A region is still considered to be growing if a part of it is erased. However, the spatial extent of a region will generally not shrink: either a small amount is erased to make a correction, or the region is erased entirely.
Maturity: A region is mature once it stops growing or once another region is born; this means that only one region can be growing at a time.
Death: A region dies when it is erased, when it merges with another region, or when the event that is using the surface ends.
If a region that has entered maturity appears to be growing again, we say that a new region is born and merge the old region into the new one.
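A minimal sketch of this lifecycle, assuming region state is updated as each stroke arrives, might look as follows; the class, its methods, and the growth threshold are hypothetical illustrations rather than the implemented system.

GROWTH_LIMIT = 30.0  # seconds a region may keep growing

class Region:
    def __init__(self, birth_time):
        self.state = "growing"   # birth immediately enters growth
        self.birth_time = birth_time
        self.strokes = []

    def add_stroke(self, stroke, now):
        if self.state != "growing":
            # Apparent regrowth of a mature region: the caller should
            # instead birth a new region and merge this one into it.
            raise RuntimeError("only growing regions accept strokes")
        self.strokes.append(stroke)
        if now - self.birth_time > GROWTH_LIMIT:
            self.state = "mature"

    def mature(self):
        self.state = "mature"    # e.g. because another region was born

    def die(self):
        self.state = "dead"      # erased, merged, or the lecture ended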
5. Algorithm for Region Finding
The region finding algorithm consists of three steps. The first step is segmentation and refilling. The person or people (who will be referred to as the presenter) are removed from the video and replaced with the area of the board that is being obstructed. The second step is to find where and when the board was changed, either by writing or erasing. Finally, this information is combined to form regions.
5.1 Segmentation and Refilling
Throughout the video, the presenter will be blocking some part of the board, and he or she will often block important information, i.e., what is being written. Therefore, the first step is to remove the presenter from the video and use in-painting to replace the part of the board that is obstructed. To identify the presenter and board, color classification is used to segment the video. I chose color segmentation for several reasons: the board is generally a unique color, which makes finding non-board pixels easy; segmentation based on color works when there is little motion, and even on single images, such as those acquired with a high-resolution digital camera; and color classification works with multiple occlusions (i.e., other participants or more than one presenter). That said, any segmentation technique could be used instead.
In order to perform color segmentation, I extended the algorithm described in [Jones98]. The algorithm uses a three-dimensional array indexed by RGB values. The array is “trained” with pixels that are known to be the color of the object in question. To classify a pixel, it is checked against the trained array: if the value of the array at the pixel's RGB index is greater than some threshold, then the pixel is considered the same color as the object in question. In this particular method, each dimension of the array has fewer “bins” than the number of possible colors, so similar colors are grouped together. Our implementation uses 32 bins per dimension. Training is done once per video, using the video itself; this compensates for lighting and camera configuration changes that occur throughout the data set.
The first two extensions have to do with the color arrays. First I allow for multiple arrays, in order to represent several objects. Additionally, each array has an associated confidence function:
Ci(R,G,B) ∈ [0,1] (1)
Given an R, G, B vector, a color array i will return a confidence between 0 and 1. More specifically, the function is implemented as:
Ci(R,G,B) = Ai[R,G,B] / Ni (2)
In other words, the confidence returned by array i is the fraction of its Ni training pixels that fall into the bin Ai[R,G,B] indexed by the R,G,B value in question. A pixel belongs to the object whose array returns the highest confidence value. Our current implementation uses two arrays, one for the board and one for the presenter.
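To make the classification concrete, the following sketch implements the binned color arrays and the confidence function of equations (1) and (2), using the 32 bins per dimension described above; the helper names are hypothetical, and a production version would vectorize the training loop.

import numpy as np

BINS = 32
SHIFT = 8 - 5  # map 8-bit channel values onto 32 (2^5) bins

def make_array():
    return np.zeros((BINS, BINS, BINS), dtype=np.int64)

def train(array, pixels):
    # pixels: (n, 3) uint8 array of RGB training samples
    for r, g, b in pixels >> SHIFT:
        array[r, g, b] += 1

def confidence(array, pixel):
    # Equation (2): fraction of this array's training pixels in the bin.
    r, g, b = pixel >> SHIFT
    total = array.sum()
    return array[r, g, b] / total if total else 0.0

def classify(arrays, pixel):
    # arrays: dict mapping object name (e.g. "board") -> trained array
    return max(arrays, key=lambda name: confidence(arrays[name], pixel))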
The final extension to this method is in training the arrays. In [Jones98], the array is trained manually. Since I assume that the location of the board in the video is known, the extension allows automatic training of the color arrays. As the presenter is likely to move around during the lecture, pixels that change drastically from one training frame to the next are considered to be presenter, and those that remain constant are board. While this method does not work with 100% accuracy, the error tends to remain low enough that it does not cause misclassification in the end.
The following is the algorithm for training the color segmentation program:
1. Select a set of frames to be used for the training, e.g., every xth frame of the video
2. Enumerate these frames from 1 to n, maintaining their order
3. For each pixel in the board region of frame 1, increment the board array at the appropriate location
4. For i = 2 to n do :
a. for each pixel in the board region of frame i, compare it to the pixel at the same location in frame i-1
b. if the two pixels are close in value, then increment the board array, otherwise increment the presenter array
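The following sketch illustrates this training loop, reusing make_array and train from the previous sketch; the frame source and the closeness threshold are assumptions.

import numpy as np

CLOSE = 16  # maximum per-channel difference for "close in value"

def train_arrays(frames):
    # frames: list of (h, w, 3) uint8 images cropped to the board area
    board, presenter = make_array(), make_array()
    train(board, frames[0].reshape(-1, 3))      # steps 1-3
    for prev, cur in zip(frames, frames[1:]):   # step 4
        diff = np.abs(cur.astype(int) - prev.astype(int)).max(axis=2)
        still = diff <= CLOSE                   # unchanged pixels: board
        train(board, cur[still])
        train(presenter, cur[~still])
    return board, presenter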
After the training is complete, we segment the video and refill the pixels that have been removed. The following is our segmentation and refilling algorithm:
Let n be the number of frames in the video
For i = n-1 down to 1 do:
1. Perform color segmentation on frame i, placing all presenter pixels into a mask temp
2. Perform a series of morphological erode and dilate operations on temp to remove noise in the classification and include chalk or marker that was not classified as board
3. Replace each pixel that is marked in temp (i.e. not board) with the pixel in frame i+1
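The following sketch illustrates this backward segment-and-refill pass, reusing classify from the color-classification sketch above. OpenCV is used only for the morphological step; the kernel size is an assumption, and the per-pixel classification loop is written naively for clarity.

import numpy as np
import cv2  # OpenCV, used here only for the morphology

KERNEL = np.ones((5, 5), np.uint8)

def segment_and_refill(frames, arrays):
    # frames: list of (h, w, 3) uint8 images; arrays: trained color arrays
    for i in range(len(frames) - 2, -1, -1):  # frame n-1 down to 1
        frame = frames[i]
        h, w, _ = frame.shape
        temp = np.zeros((h, w), np.uint8)
        for y in range(h):                    # step 1: color segmentation
            for x in range(w):
                if classify(arrays, frame[y, x]) == "presenter":
                    temp[y, x] = 1
        # Step 2: opening removes thin chalk strokes and speckle noise
        # from the mask; a dilation then pads the presenter's boundary.
        temp = cv2.morphologyEx(temp, cv2.MORPH_OPEN, KERNEL)
        temp = cv2.dilate(temp, KERNEL)
        # Step 3: pull obstructed pixels from the already-clean next frame.
        frame[temp > 0] = frames[i + 1][temp > 0]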