Integrating cyberinfrastructure into existing e-Social Science research
Svenja Adolphs, 1 Bennett Bertenthal,2 Steve Boker,3 Ronald Carter,1 Chris Greenhalgh,4 Mark Hereld,5,6 Sarah Kenny,5 Gina Levow,7 Michael E. Papka,5,6,7 Tony Pridmore4
1 Centre for Research in Applied Linguistics, School of English, Nottingham University, UK
2 Department of Psychology, Indiana University, USA
3 Department of Psychology, University of Virginia, USA
4 School of Computer Studies and IT, Nottingham University, UK
5 Computation Institute, Argonne National Laboratory and The University of Chicago, USA
6 Mathematics and Computer Science Division, Argonne National Laboratory, USA
7 Department of Computer Science, The University of Chicago, USA
Abstract. This study has been facilitated by an NSF/ESRC exchange programme between researchers at the University of Chicago and the University of Nottingham. At the University of Nottingham the National Centre for e-Social Science node seeks to explore, understand and demonstrate the salience of new forms of digital record as they emerge from and for e-Social Science. The Nottingham Multimodal Corpus (NMMC) is a corpus of multimodal data that marries established coding schemes with visual mark-up systems to foster a richer understanding of the embodied nature of language use and its manifold relations to the production of distinctive social contexts. At the University of Chicago, the NSF-funded Social Informatics Data Grid (SIDGrid)[1] is being built to enable researchers to collect real-time multimodal behaviour at multiple time scales. Multimedia data (voice, video, images, text, numerical) is stored in a distributed data warehouse that employs Web and Grid services to support data storage, access, exploration, annotation, integration, analysis, and mining of individual and combined data sets. With particular reference to the analysis and mark-up of hand gestures in spoken discourse, this paper explores some basic steps in integrating cyberinfrastructure into existing e-Social Science research, undertaken by an interdisciplinary team with perspectives from linguistics, psychology, data and computing systems, and machine analysis of multi-modal data.
Introduction
Information technology advances have already made it possible to develop multi-million word databases or ‘corpora’ of spoken conversation as well as software tools to analyse this data quantitatively (see, for example, the 5 million word CANCODE[2] corpus developed at Nottingham). However, social interactions are multi-modal in nature, combining both verbal and non-verbal components and units (Kress and van Leeuwen 2001). Non-verbal, multi-modal behaviour (e.g. body language, gestures, eye contact, etc.) plays an integral part in determining the meaning and function of spoken language (Baldry and Thibault 2004).
We report here on how we begin to exploit the emerging e-Science infrastructure to extend early research in the field [3,4] and develop an integrated resource for interdisciplinary research into natural language use that responds to the challenges of multi-modal analysis. Selected sequences from twenty-five hours (approximately 125,000 words) of video data taken from naturally occurring dyadic tutorial supervision sessions are analysed using standard corpus linguistic search tools, and key clusters of lexico-grammatical features are isolated. The relevant sections of the data containing these clusters are then marked up by means of visual tracking based on computer vision technologies and subjected to further analysis which, manually and in semi-automated fashion, marries the verbal and visual components of the video data.
Previous spoken corpus research has highlighted discourse markers as key structuring devices of particular significance (Carter and McCarthy 2006). These units provide a particular site for analysis of the way in which key components of body language, such as, in the case of the Nottingham research, head nods and hand gestures, are utilised by speakers to provide visual support for discourse management. Ongoing research, reported at the 2006 e-Social Science conference, has focused on head nod movements as backchannels in conversation.[5]
The collaborative task that we set ourselves is to integrate SIDGrid cyberinfrastructure with the Handtalk gesture-in-communication project to enable us to explore the associated scientific, technical and social issues. The discussion is organized as follows: description of the problem and approach, the viewing tool, integration with SIDGrid, special issues encountered, and concluding remarks.
From Headtalk to Handtalk
The problems associated with identifying and classifying head-nods are multiplied when considering hand gestures, even when gesture is only considered in the form of hand and arm movement. A head-nod can be identified from movement along a single axis. Hands and arms can move much more freely, either in tandem with or independently of one another. Hands also perform several other practical functions during talk, including scratching, adjusting clothing and hair, writing, reaching for and moving objects, etc. Unlike head-nods, however, there already exists a substantial body of research into how hand gestures support and supplement spoken utterances. Much of this existing work on gesture (McNeill 1992) has attempted to represent the full scope of this complexity by employing teams of transcribers who manually code large amounts of video data and then cross-check each other’s work. Researchers start by identifying everything they consider to be a gesture by the speaker they are observing and then add these to the orthographic transcript, placing square brackets around the words where a gesture takes place and setting the words where the stroke of the gesture occurs in bold type. The gesture is then located in space as illustrated here:
Figure 1: Division of the gesture space for transcription purposes (from McNeill 1992: 378).
From this point the gestures can be coded according to classification systems for gesture type, form and meaning.
For this project it was decided to take a more ‘bottom-up’ approach and to work with a very simple set of gestures to which additional layers of complexity could be added at a later stage. To simplify the types and forms of the gestures for the study we decided to focus on the movement of arms only and not to look at hand shape. We also wanted to adapt the existing gesture region model shown in Figure 1. Preliminary examination of the supervision data we have collected suggested that, if we were to simplify this model, dividing the area in front of the speaker where the gestures are ‘played out’ along vertical axes would give some interesting results, as illustrated in Figure 2.
Figure 2 also illustrates the way in which the relevant sections of the data containing key linguistic clusters are marked up by means of video analysis based on computer vision technologies. An interactive program allows users to apply a visual tracking algorithm to selected targets in an input video clip. In the current prototype the user indicates, with a mouse, the position of the head and hands in the first frame of the video. The system then automatically locates the torso and identifies the four regions of interest shown in Figure 2. An augmented version of the Kernel Annealed Mean-Shift (KAMS) algorithm of Naeem et al. (2007) tracks the position of the head and hands through the remainder of the video. The blue circles in Figure 2 denote the tracker’s estimates of the positions of the head and hands. Although no attempt has been made in NMMC to automate the clustering and recognition systems, the independent success of each strongly suggests that the automatic recognition of head and hand gestures is feasible in the type of image data considered here.
Figure 2: Computer image tracking applied to video.
Tracking is complicated when speakers bring their hands together, or to their face. Khan et al.’s (2004) interaction filter was therefore incorporated into the standard KAMS algorithm to prevent the trackers losing their targets when this happens. Torso and zone positions are updated as a function of head location, and a text file is produced which summarises the movement of the hands into and out of the four zones. Though the method could be applied to a wide variety of object representations, it has long been known that human skin colour clusters very tightly in some colour spaces, making the colour histogram ideal for tracking the face (and hands) of the speaker. The current system therefore represents each target object (hand or face) as a normalized 3D histogram of colour values; see Naeem et al. (2007) for details.
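As an illustrative sketch only (not the augmented KAMS tracker used here, for which see Naeem et al. 2007), the underlying idea of tracking a colour-histogram target with mean-shift can be expressed in a few lines of Python using OpenCV. The clip name, the initial head rectangle, the histogram binning and the use of a single hue channel (rather than the 3D colour histogram of the actual system) are assumptions made for brevity.

# Minimal sketch of colour-histogram mean-shift tracking (illustrative only;
# the project uses an augmented Kernel Annealed Mean-Shift tracker with an
# interaction filter, neither of which is reproduced here).
import cv2

cap = cv2.VideoCapture("supervision_clip.avi")   # hypothetical input clip
ok, frame = cap.read()

# In the prototype the user clicks the head and hands in the first frame;
# here we assume a hand-picked rectangle (x, y, w, h) around the head.
track_window = (300, 80, 60, 60)
x, y, w, h = track_window
roi = frame[y:y + h, x:x + w]

# Represent the target as a normalised colour histogram (hue channel only),
# exploiting the fact that skin colour clusters tightly in this colour space.
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 30, 32), (180, 255, 255))
hist = cv2.calcHist([hsv_roi], [0], mask, [32], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

positions = []   # per-frame centre estimates, analogous to the circles in Figure 2
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # Mean-shift moves the search window towards the mode of the back-projection.
    _, track_window = cv2.meanShift(back_proj, track_window, term_crit)
    x, y, w, h = track_window
    positions.append((x + w // 2, y + h // 2))

cap.release()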
The vertical axes were chosen because, during supervisions, the speakers spend a good deal of time comparing and contrasting different ideas. These ideas often appear to exist in metaphorical compartments in front of the speaker. When these ideas are compared or contrasted the speaker will often move one or both hands along a horizontal axis to support the verbal element of the communication. Dividing the horizontal plane with vertical lines allows us to track this movement and to link it to the talk. A 4-point coding scheme was constructed (a sketch of how such codes might be derived from tracker output follows the list):
1) Left hand moves to the left
2) Left hand moves to the right
3) Right hand moves to the left
4) Right hand moves to the right
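A minimal sketch, assuming per-frame x positions for each hand from the tracker, an arbitrary pixel threshold and image coordinates in which x increases to the right of the frame, of how the four movement codes might be derived automatically; it is not the coding procedure actually used in the project.

# Illustrative derivation of the 4-point movement codes from per-frame hand
# positions produced by a tracker. The threshold, the frame indexing and the
# left/right perspective are assumptions for this sketch only.

MOVE_THRESHOLD = 15  # minimum horizontal displacement in pixels (assumed)

def code_movements(left_xs, right_xs):
    """Return (frame_index, code) pairs.

    Codes: 1 = left hand moves to the left, 2 = left hand moves to the right,
           3 = right hand moves to the left, 4 = right hand moves to the right.
    'Left' and 'right' are taken in image coordinates here; mapping them to
    the speaker's own left and right is a separate, assumed convention.
    """
    events = []
    for i in range(1, len(left_xs)):
        dl = left_xs[i] - left_xs[i - 1]
        dr = right_xs[i] - right_xs[i - 1]
        if dl <= -MOVE_THRESHOLD:
            events.append((i, 1))
        elif dl >= MOVE_THRESHOLD:
            events.append((i, 2))
        if dr <= -MOVE_THRESHOLD:
            events.append((i, 3))
        elif dr >= MOVE_THRESHOLD:
            events.append((i, 4))
    return events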
This initial coding scheme focuses purely on the movement of the arms. Additional or more complex schemes will allow us to make further remarks about the way in which the gesture supports and/or supplements the verbal part of the speaker’s message (initial analysis of correlations with various types of linguistic discourse markers is already under way). They will also allow us to marry this analysis with work undertaken by colleagues specializing in computational linguistics and psychology at the University of Chicago and the University of Virginia, who have particular expertise in prosodic analysis and the relationship between pitch, intonation and gesture (Levow, 2005) and who have undertaken preliminary research on the role of gender in gesture communication.[6]
Applications and representation of data
The final concern of the corpus development is with how the multiple streams of coded data are physically represented in a re-usable operational interface format. With a multi-media corpus it is difficult to exhibit all features of the talk simultaneously. If all characteristics of specific instances where a word, phrase or coded gesture occurs in the talk are displayed, the corpus would have to involve multiple windows of data, including concordance viewers, text viewers, and audio and video windows. This may make the corpus impractical to use: it may be slow and sometimes prone to failure if the computer system is unable to deal with storing and replaying such high volumes of video data. It is also difficult to ‘read’ any large quantity of multiple tracks of such data simultaneously, as current corpora allow with text. The notion of selecting, for example, a section of text or a search token and retrieving the exact point in the data at which it occurs is, itself, not straightforward. Moreover, the fact that gestures are not discrete units in the same way as words and utterances means that it is difficult to align the different modes exactly according to the time at which different actions or words occur.
In order to represent the data in a way that allows for the different data streams to be analyzed alongside one another, the multi-modal data is viewed through the Digital Replay System (DRS) interface (French et al. 2006). The software also enables the researcher to code and search the corpus data.
The Digital Replay System allows video data to be imported and a digital record to be created that ties sequences of video to a transcribed text log, accompanied, where appropriate, by samples of data that are also subjected to visual tracking (indicated by the blue circles in Figure 2). The text log is linked by time to the video from which the transcript is derived so that the text log plays alongside the video. Further annotations can be added to the log to show where gestures – head nods in the above example – occur, and these annotations are also tied to the video. An index of annotations is produced and each entry can be used to go to the part of the log and video at which it occurs. The annotation mechanism provides an initial means of marking up multi-modal data and of maintaining the coherence between spoken language and accompanying gestural elements. Note that in Figure 3 the second concordance line of the search term ‘yeah’ has been selected (shown on the right side of the DRS interface). The corresponding video clip in which this utterance is spoken is shown to the left of the interface, and the utterance can be played on the audio track positioned at the bottom of the screen. Using this concordance viewer, the analyst can search across a large database of multi-modal data utilising specific word types, phrases, patterns of language, or gesture codes as ‘search terms’. Once presented with a concordance view, the analyst may jump directly to the temporal location of each occurrence within the associated video or audio clip, as well as to the full transcript.
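DRS handles this time alignment internally; purely as an illustration of the underlying idea, a transcript can be held as a list of time-stamped tokens so that each concordance hit maps back to a playback offset in the associated video. The data structure, function and example times below are assumptions for the sketch, not the DRS format.

# Illustrative time-aligned concordance lookup (not the DRS implementation).
# Each token carries the start and end time, in seconds, of the utterance it
# belongs to, so a concordance hit can be replayed in its video context.
from dataclasses import dataclass

@dataclass
class Token:
    start: float   # utterance start time in the video (seconds)
    end: float     # utterance end time (seconds)
    word: str

def concordance(tokens, search_term, context=4):
    """Return (snippet, start_time) pairs for every hit on search_term."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.word.lower() == search_term.lower():
            left = " ".join(t.word for t in tokens[max(0, i - context):i])
            right = " ".join(t.word for t in tokens[i + 1:i + 1 + context])
            hits.append((f"{left} [{tok.word}] {right}", tok.start))
    return hits

# Hypothetical fragment of a supervision transcript.
tokens = [Token(12.4, 13.1, w) for w in "so the first chapter is".split()]
tokens += [Token(13.1, 13.4, "yeah"), Token(13.4, 14.9, "exactly")]

for snippet, t in concordance(tokens, "yeah"):
    print(f"{t:7.2f}s  {snippet}")   # a viewer would jump the video to time t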
Figure 3: A screenshot of the concordance tool in use within the DRS software interface.
SIDGRID development
The Social Informatics Data (SID) Grid (Bertenthal et al. 2007) is designed to enable researchers to collect multimodal behavior, and then to store and analyze different data types (e.g. voice, video, images, text, numerical) in a distributed multimedia data warehouse that employs web and grid services to support data storage, access, exploration, annotation, integration, analysis, and mining of individual and combined data sets.
The diagram in Figure 4 illustrates how SIDGrid connects data, analysis, and researchers. Previously collected corpora and data archives in raw or partially analyzed forms are supported, as are existing applications such as ELAN and DRS, with the addition of suitable (and usually straightforward) interface code. An essential component of the SIDGrid is a set of transparent data integration services, so that these distributed data can be used simply and effectively by any infrastructure components and services. SIDGrid query, exploration and analysis services are based upon web and grid services.
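The actual SIDGrid service interfaces are documented elsewhere (Bertenthal et al. 2007); the fragment below is a purely hypothetical sketch of the kind of thin interface code referred to above, posting a media file and a small JSON metadata record to an assumed HTTP endpoint. The URL, field names, credential and metadata layout are placeholders invented for illustration, not the real SIDGrid API.

# Hypothetical sketch of pushing a clip plus annotation metadata to a
# web-service endpoint. The URL, field names and token are NOT the real
# SIDGrid interface; they stand in for whatever the integration code targets.
import json
import requests

UPLOAD_URL = "https://example.org/sidgrid/upload"   # placeholder endpoint
API_TOKEN = "REPLACE_ME"                             # placeholder credential

metadata = {
    "experiment": "NMMC-supervisions",
    "clip": "supervision_clip.avi",
    "annotations": [
        {"start": 13.1, "end": 13.4, "tier": "gesture", "value": "code_2"},
    ],
}

with open("supervision_clip.avi", "rb") as media:
    response = requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"media": media},
        data={"metadata": json.dumps(metadata)},
    )
response.raise_for_status()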
Integrating access to SIDGrid services into DRS is following a course similar to that taken with ELAN[7], an open source annotation and viewing tool already in use by practitioners around the world. We have begun work on the main phases of the integration effort.