Abe Kazemzadeh, Samuel Kim, and Yoonji Kim
Final Project for CSCI 534: Affective Computing
5/2/07
Profs. Gratch and Marsella
The Emotion Mirror:
Recognizing Emotion in Speech and Displaying the Emotion in an Avatar
Abstract
This paper describes the Emotion Mirror, a demonstration that recognizes emotion in a user's utterance and displays the emotion through the facial expressions of an animated avatar that repeats what the user has said. The demonstration is discussed in terms of system architecture and component design. In addition to the technical details of the system, we also look at general engineering and theoretical issues that pertain to the demo system. Finally, we describe our experience in creating the system to highlight what worked and what did not.
Introduction
Our project aimed to develop an integrated demonstration of several affective computing technologies. The theme of an emotional mirror served to tie the separate technologies into a coherent interaction. The emotion in the user's spoken utterance is "reflected" back as the facial emotion of an animated avatar that speaks back what the user said. Since human emotion often operates at a subconscious level, it is hoped that the Emotion Mirror may eventually be used to help people reflect upon their emotions. The intuition behind such an application comes from two sources. First, it has been postulated that there is a point in a child's development at which the child comes to recognize its own reflection in a mirror. Second, many people find it funny or unusual to hear their recorded voice played back.
The emotion mirror demonstration consists of a speech and acoustic emotion recognizer, a text-based emotion recognizer that operates on the speech recognition output, and a facial animation system that controls the face and lip synchronization. The figure on the following page shows a diagram of the overall system architecture. One goal of our design was to reuse as much existing software and methodology as possible, both to ensure that our project was feasible in a semester time frame and to focus our efforts on the emotion aspects of the design. The details of each component are discussed in the following sections and cited in the reference section. After that, we look at engineering and theoretical issues that pertain to this demonstration.
Figure 1. The input to the system is the user's spoken utterance, which is assumed to be of an unconstrained vocabulary. The lexical and emotional content are recognized and the wave file is saved. The lexical output of this stage (text) is classified into emotional categories based on word distribution statistics and lexical resources, and a final decision can be made by also considering the acoustic emotion recognition results (this is not yet implemented, and the ideal way to combine these components is left to future work). The resulting emotional decision is sent with the text and wave file to the face, which adjusts the facial expression and synchronizes the lips with the wave file. The lip gestures are generated using the phone sequence from the TTS module. All of the components are separate processes on the same machine connected by TCP, but the face, face control, and TTS module are connected by an API provided by the CSLU toolkit that hides the TCP implementation details. For more information about the separate components, see the following sections and the references for proper citations.
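To make the inter-process plumbing concrete, the following Python sketch shows how the recognizer's output (text, emotion label, and wave file path) might be passed to the face control process over TCP. The port number and the tab-separated message format are assumptions made for this illustration; they are not the actual protocol of the CSLU toolkit or of our components.

    # Illustrative sketch of the TCP plumbing between components. The port
    # number and the tab-separated message format are assumptions for this
    # example, not the toolkit's or the recognizer's actual protocol.
    import socket

    FACE_HOST, FACE_PORT = "localhost", 9000  # hypothetical face-control endpoint

    def send_to_face(text, emotion, wav_path):
        """Send the recognizer output to the face control process."""
        message = "\t".join([text, emotion, wav_path]) + "\n"
        with socket.create_connection((FACE_HOST, FACE_PORT)) as sock:
            sock.sendall(message.encode("utf-8"))

    def face_control_loop():
        """Receive messages and hand them to the (hypothetical) face controller."""
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind((FACE_HOST, FACE_PORT))
        server.listen(1)
        while True:
            conn, _ = server.accept()
            with conn:
                text, emotion, wav_path = conn.makefile().readline().rstrip("\n").split("\t")
                print("animate:", emotion, "say:", text, "wave file:", wav_path)

    # Example: send_to_face("hello there", "happy", "utt0001.wav")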
Speech and Acoustic Emotion Recognition
In the proposed real-time emotion detection system, we extract and fuse emotional information encoded at different timescales: the supra-frame, intra-frame, and lexical levels. The rationale behind this approach is that emotion is encoded at each of these levels of speech and that the features at each timescale are complementary.
There have been many studies that attempt to classify emotional states using features at each of these levels. In this work, we first extract the emotion encoded at the supra- and intra-frame levels using acoustic information and later combine it with lexical information. MFCCs and statistics of prosodic information (pitch and energy) are used to extract emotion at the intra- and supra-frame levels, respectively. The output of the speech recognizer is used to extract emotion at the lexical level.
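The sketch below illustrates the two acoustic timescales in Python, computing MFCCs on short frames (intra-frame level) and utterance-wide pitch and energy statistics (supra-frame level). The librosa library stands in here for the IT++-based front end of the actual system, and the window sizes and choice of statistics are illustrative assumptions.

    # Illustrative sketch of intra-frame (MFCC) and supra-frame (prosody
    # statistics) feature extraction. librosa stands in for the IT++-based
    # front end of the actual system; frame sizes and statistics are assumptions.
    import numpy as np
    import librosa

    def acoustic_features(wav_path):
        y, sr = librosa.load(wav_path, sr=16000)

        # Intra-frame level: MFCCs on short (25 ms) frames with a 10 ms hop.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),
                                    hop_length=int(0.010 * sr))

        # Supra-frame level: utterance-wide statistics of pitch and energy.
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
        energy = librosa.feature.rms(y=y)[0]
        prosody = np.array([f0.mean(), f0.std(), energy.mean(), energy.std()])

        return mfcc, prosody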
Since emotion is extracted from acoustic information by two separate classifiers, their outputs need to be combined to reach a single decision. Here we use a modified version of a weighted likelihood sum; details are described in [Kim, 2007]. For real-time operation, we utilize several libraries such as IT++ and TORCH [IT++][TORCH]. Before textual emotion can be extracted, speech recognition is required to produce a word sequence. The SONIC speech recognizer, a Hidden Markov Model (HMM) based continuous speech recognizer developed at the University of Colorado, Boulder, is used for this purpose [SONIC].
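A minimal sketch of this kind of weighted likelihood combination is given below; the exact modified weighting scheme is described in [Kim, 2007], and the weight used here is a placeholder.

    # Minimal sketch of weighted log-likelihood fusion of the intra-frame (MFCC)
    # and supra-frame (prosody) classifiers. The modified weighting actually used
    # is described in [Kim, 2007]; the weight below is a placeholder.
    import numpy as np

    EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]

    def fuse_scores(loglik_mfcc, loglik_prosody, w=0.6):
        """Combine per-emotion log-likelihoods from the two acoustic streams."""
        fused = w * np.asarray(loglik_mfcc) + (1.0 - w) * np.asarray(loglik_prosody)
        return EMOTIONS[int(np.argmax(fused))], fused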
Textual Emotion Recognition
The benefit of analyzing the emotion in the text of the speech recognizer output is that it is possible to use aspects of an utterance's meaning. The approach used here is a shallow, lexical one that extracts the aspects of meaning that carry emotional information. Other, deeper ways to analyze the meaning of an utterance could provide a richer description of the utterance's emotion, but there is a trade-off between depth of analysis and breadth of coverage. Since we wanted to allow unconstrained user input, we opted for a shallow analysis with broad coverage.
The first method we tried was using Cynthia Whissell's Dictionary of Affect in Language (DAL) (Whissell 1986, Whissell 1989). This dictionary provides a table of 9000 words (the references quote 4000 words, but the dictionary that Whissell provided to us contains almost 9000 entries) with values for three dimensions: valence, activation, and imagery (the references we cite also make no mention of the imagery dimension), which are normalized to a range of 1 to 3. The dictionary was compiled from ratings provided by experimental subjects and evaluated in repeated experiments for validity. Some attested uses are evaluating the emotional tone of texts, authorship attribution, and discourse analysis. One limitation noted in (Whissell 1989) is that the activation dimension has lower reliability. This is interesting because acoustic emotion recognition shows the opposite tendency: valence is the less reliable dimension. Therefore, these two modalities have the potential to complement each other. However, we did not end up using the features from the DAL in our demonstration, because the main inputs to the face's emotion controls were categorical emotions. In future work, it may be possible to control the face at the word level instead of the sentence level. With this increase in resolution, it may become possible to use the DAL features to control emotions word by word.
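Although the DAL features were not used in the final demonstration, the sketch below shows how word-level ratings could be aggregated into utterance-level valence, activation, and imagery scores. The whitespace-separated file format assumed here is an illustration, not the dictionary's actual layout.

    # Sketch of aggregating DAL valence/activation/imagery ratings over an
    # utterance. The assumed file format ("word valence activation imagery",
    # values in [1, 3]) is an illustration, not the dictionary's actual layout.
    def load_dal(path):
        dal = {}
        with open(path) as f:
            for line in f:
                word, val, act, img = line.split()
                dal[word.lower()] = (float(val), float(act), float(img))
        return dal

    def utterance_affect(text, dal):
        """Average the three DAL dimensions over the words found in the dictionary."""
        hits = [dal[w] for w in text.lower().split() if w in dal]
        if not hits:
            return None  # no covered words; fall back to acoustic cues
        return tuple(sum(dim) / len(hits) for dim in zip(*hits))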
The second method we tried was using a document retrieval technique known as TFIDF (term frequency * inverse document frequency) (Salton and Buckley 1988). This approach rests on the intuition that words (terms) that occur frequently within a document but infrequently across documents are the most relevant terms for queries. The frequency of a term in a document, term frequency, is captured by the tf component. The infrequency of a term's occurrence across documents, inverse document frequency, is captured by the idf component. Our use of this approach extended the metaphor of document query to emotion categorization by considering the utterance's lexical content to be the query and an ordered list of relevant emotions to be the documents returned. To do this, we created lists of hypothetical emotional utterances in documents for each of the 7 emotions accepted by the face (angry, disgusted, fearful, happy, neutral, sad, surprised). This approach gave us initial data, but it is still very sparse and does not provide a rigorous base for the categorization. With more data, it is reasonable to assume that performance will improve by virtue of greater coverage, and n-gram features would also become possible; currently only unigrams are used. One way we circumvented the lack of data was by using the stems of words, rather than the word tokens, as features for the TFIDF analysis. The stemming was performed using the WordNet resource as implemented by a Perl library available on CPAN (Pedersen and Banerjee 2007). This reduced the number of words for which statistics must be maintained, but it slowed down performance, especially in the word statistics learning phase. Another method, the Porter stemming algorithm, offered better speed but used a more naive approach. Since the slowness of the WordNet stemming was less noticeable in the querying phase, we chose it over the Porter stemming algorithm; however, more empirical evidence is needed to advocate one over the other.
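The following condensed sketch shows the emotion-as-document TFIDF scheme. It assumes one text file of example sentences per emotion and uses NLTK's Porter stemmer as a stand-in for the WordNet-based stemming used in the demonstration.

    # Condensed sketch of the emotion-as-document TFIDF classifier. One text file
    # of example sentences per emotion is assumed, and NLTK's Porter stemmer
    # stands in for the WordNet-based stemming used in the demonstration.
    import math
    from collections import Counter
    from nltk.stem import PorterStemmer

    EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]
    stemmer = PorterStemmer()

    def tokenize(text):
        return [stemmer.stem(w) for w in text.lower().split()]

    def build_index(docs):
        """docs: {emotion: raw text of its example sentences}."""
        tf = {e: Counter(tokenize(t)) for e, t in docs.items()}
        df = Counter(term for counts in tf.values() for term in counts)
        idf = {term: math.log(len(docs) / df[term]) for term in df}
        return tf, idf

    def classify(utterance, tf, idf):
        """Treat the utterance as a query and rank the emotion documents."""
        query = tokenize(utterance)
        scores = {}
        for emotion in EMOTIONS:
            counts, total = tf[emotion], sum(tf[emotion].values())
            scores[emotion] = sum((counts[t] / total) * idf.get(t, 0.0) for t in query)
        return max(scores, key=scores.get)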
Another tradeoff in the TFIDF approach is the representation of emotions as documents. In our demonstration the documents were lists of approximately 100 sentences. An alternative would be to consider each sentence a document. The tradeoff can be seen in the two terms of the TFIDF equation. With the first method, one document per emotion, the estimation of idf becomes very discrete since it can only take one of eight values (0/7, 1/7, 2/7, ... 7/7). With the second method, one document per sentence, the tf term becomes very discrete instead: because the number of words per sentence is relatively small, the small denominator of the tf term produces the same problem. One possible way around this would be to use multiple neutral documents. This would take advantage of the fact that more neutral data is available, so the estimation of idf would improve while the document size stays large enough to estimate tf.
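To make this quantization concrete, the snippet below enumerates the possible idf values under the one-document-per-emotion setup, assuming the standard idf = log(N/df) form (the exact idf formula is an assumption left open above); with seven documents, only seven distinct values are possible for terms that occur in the collection.

    # With one document per emotion (N = 7), a term's document frequency df can
    # only be 1..7, so idf = log(N / df) takes just seven distinct values.
    import math

    N = 7  # one document per emotion category
    print([round(math.log(N / df), 3) for df in range(1, N + 1)])
    # [1.946, 1.253, 0.847, 0.56, 0.336, 0.154, 0.0]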
One way we want to improve the textual emotion recognition is by getting more data, both for use in the emotional documents and for empirical evaluation of this approach. With more data in the emotional documents, higher-order n-grams could be used. With empirical evaluation, we could make meaningful assertions about the strengths and weaknesses of this methodology. One possible way to get more data is to use the demonstration to gather user data and learn from this input. Another opportunity for furthering this research is to work towards better integration with other modalities. Our approach was geared towards the CSLU face; with other modalities we might have considered other emotional representations, perhaps more along the lines of the DAL. More data would help here as well, by allowing us to determine optimal ways to combine the textual and acoustic emotion recognition components. The ultimate improvement to textual emotion recognition would be true understanding of semantics, context, and agent reasoning, combined with a dynamic emotion model; beyond our initial approach, then, there are ample opportunities for further work.
Facial Animation of Spoken Emotions
For the facial expressions, we follow discrete emotion theory, which posits a small number (six) of basic emotions (Ekman; Izard; Tomkins): angry, disgusted, fearful, happy, sad, and surprised. In this theory, each basic emotion is genetically determined, universal, and discrete. The CSLU face adds neutral to represent the default state, and it is used only for the default expression. In this project we followed the same set of categories.
We first get the emotion from the speech recognizer and then decide the final emotion, which serves as the input to our program, through text analysis. The input consists of the text of the sentence, the emotion, and the audio file, passed as a text file. We read this file and apply the animation. The difference from the original CSLU system is that the emotion is rendered in real time for every utterance and the lip synchronization is performed automatically, without the user explicitly entering the text.
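The sketch below shows how the facial animation component's input might be read and dispatched. The three-line file layout (sentence, emotion label, wave file path) and the apply_to_face function are assumptions standing in for the actual file format and the CSLU toolkit calls that drive the face.

    # Sketch of reading the input handed to the facial animation component. A
    # three-line text file (sentence, emotion label, wave file path) is assumed;
    # apply_to_face() is a placeholder for the CSLU toolkit calls that set the
    # expression and run lip synchronization against the recorded wave file.
    def read_face_input(path):
        with open(path) as f:
            sentence, emotion, wav_path = (f.readline().strip() for _ in range(3))
        return sentence, emotion, wav_path

    def apply_to_face(sentence, emotion, wav_path):
        # Placeholder: the real system sets the facial expression to `emotion`
        # and aligns lip gestures (from the TTS phone sequence) with the wave
        # file while it is played back.
        print(f"face <- {emotion}; lips sync to {wav_path}; text: {sentence}")

    # Example: apply_to_face(*read_face_input("face_input.txt"))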
Engineering and Theoretical Issues
The overarching theoretical issue that subsumes the emotion mirror demonstration is the representation and modeling of human emotion. Closely related to this is the engineering issue of measuring, categorizing, and synthesizing emotions. It may be possible to measure, categorize, and synthesize emotions in an ad hoc way, but without theoretical motivation such applications will not have sufficient generality. Without taking theoretical considerations into account, applications will not be able to generalize across different input and output modalities, interpersonal idiosyncrasies, situational context, and cross-cultural differences. Conversely, without reliable and precise ways of measuring, categorizing, and synthesizing emotions, theoretical claims cannot be empirically tested. One of the important themes of affective computing is this complementary nature of theoretical and engineering issues.
The way that our demonstration tackled the theoretical issue of representing emotions was by using different emotional representations for different modalities in a principled way. Research has shown [reference] that emotions expressed in speech contain reliable information about the activation dimension of emotions, but are less reliable predictors of valence. Therefore, in previous studies, neutral and sad emotions are reliably differentiated from happy and angry, but within these groups the acoustic measurements provide less discrimination. The lexical and semantic aspects of language contain information that is useful in capturing fine detail about emotions from different connotations of similar words and through descriptions of emotion-causing situations. However, the richness of language can cause ambiguities and blur distinctions between emotional categories. Facial expression of emotion has the benefit of providing clear emotional categories in ways that a human subject can reliably identify. However, automatic recognition of facial emotions can be confounded by different camera angles, individual physiognomy, and concurrent speech.
By combining acoustic and lexical information, we made it possible to obtain both activation measures and categorical labels. Moreover, a technical problem in speech recognition is that emotional speech differs from neutral speech and can cause recognition errors. Using both acoustic and lexical modalities could circumvent this problem if unreliable recognition could be detected, perhaps by a language model score. Also, because the animated character plays back the speaker's own recorded utterance, such errors are reduced to, at worst, lip synch errors. The TFIDF approach provided a mapping between the expressiveness of emotion in an utterance's meaning and the categorical input to the facial animation.
Another theoretical issue of this demonstration is the psychological idea of a mirror providing introspection and self-awareness. Since emotions are often sub-conscious [reference], such an application may have a therapeutic use in helping people "get in touch with" their emotions. Hypothetically, this could increase emotional expressiveness in shy people and decrease it in obnoxious or hysterical people. Whether it could actually be such a psychological panacea is a question that needs more work. Related to this are the philosophy of mind and the notion of other-agent reasoning, which deal with the question of how people understand the actions and motives of others. It is an open question whether people understand other agents' actions by putting themselves in the other agents' shoes or by some type of causal reasoning about psychological states. Experiments using variations on the emotion mirror could provide insight into this issue.
Some possible applications of the technology behind this demonstration include automatic animation, actor training, call center training, therapy for introspection or interpersonal skills, indexing multimedia content, and human-centered computing. For automatic animation, this technology would provide both lip synch and emotion synch and could save animators time; corrections and extra artistic effects could modify or add to the automatically generated emotions. Actors could use this demonstration to practice their lines. This would allow them to see the emotions they are actually conveying, and it could make practicing more fun. Similarly, the demonstration could serve as therapy for patients who have trouble conveying emotion, whether they convey too little emotion, in the case of shy people, or too much, in the case of hysterical or hot-tempered people. Shy people might benefit from attempting to act emotional around a computer, which they might find less intimidating than a human friend or psychiatrist. People who over-display emotion may benefit from seeing their emotion displayed back to them, making them realize how they appear to others. (Tate and Zabinski 2003) report on the clinical prospect of using computers in general for psychological treatment, and there is of course the famous ELIZA program, realized today in the Emacs Psychiatrist (Weizenbaum 1966).
Finally, as with many affective computing applications, there can be ethical concerns with this demonstration. It has been noted that users may be more abusive toward virtual characters than toward real people. Furthermore, to enable the demonstration to recognize anger, it may be necessary for the system to recognize swear words and racial epithets. The law of unintended consequences may be realized in many unpleasant ways.