- Title page:
Name of the project………..: Affect-responsive interactive photo-frame
Principal investigators…….: Dr. Albert Ali Salah, Dr. İlkay Ulusoy, Hamdi Dibeklioğlu
Candidate participants…….: Hamdi Dibeklioğlu, Orsan Aytekin
Date………………………………..: 07.12.2009
Abstract………………………….:
The project aims to develop an interactive photo-frame system in which the user can upload a series of videos of a single person (or a child) as a model. The system will be composed of three parts. The first part will analyze the uploaded videos and prepare segments for interactive play. The second part will use multi-modal input (sound analysis, facial expression, etc.) to generate a user state. The third part will synthesize continuous video streams from the prepared segments in accordance with the modeled state of the user, responding with related gestures of the loaded model subject.
2. Project objectives (max 1 page):
The main goal of the project is to research human behavior understanding, segmentation of emotive video sequences, and robust facial expression recognition. For robust expression recognition and segmentation of emotive videos, this work enables research on different classifiers and features. Additionally, generic image processing methods will be studied for modeling emotive video segments with smooth transitions. As an extension of the project, prosody in videos will be used to support mood recognition; thus, speech and sound signal patterns will be studied.
- Background information (max 1 page): a brief review of the related literature, so as to let potential participants prepare themselves for the workshop;
This project attempts to create a responsive photograph, addressing human behavior modeling and the automatic segmentation and synthesis of video sequences. The core algorithms can be used to elicit computer responses in this kind of system, as well as in simple expression recognition systems.
A similar system, called Audiovisual Sensitive Artificial Listeners, has been proposed in [Schröder]. It is a system in which virtual characters react to real users. Facial images and voice information in the videos are used to extract features, which are then submitted to analysers and interpreters that estimate the user’s state and determine the response of the virtual character. Generally, Hidden Markov Models (HMMs) are used in sequential recognition and synthesis problems. In [Trung], a dialogue model is proposed that is able to recognize the user’s emotional state and decide on related dialogue acts. A Partially Observable Markov Decision Process approach is used, with the user’s emotional states and actions as observations.
There are several works on mood recognition in the literature, but most of these are offline approaches. The bottleneck in our system is real-time interaction; therefore we will start with simple and easy-to-recognize signals, and move to recognition of more complex stimuli. The eMotion system developed in our lab can recognize six basic emotional expressions in real time. Additionally, Sebe et al. [Sebe], [Valenti] proposed real-time emotion recognition software, which uses a Bezier volume-based tracker for the face and a naive Bayesian classifier for expression classification. In a similar vein, Kaliouby et al. [Kaliouby] proposed the MindReader API, which models head and facial movements over time by Dynamic Bayesian Networks to infer a person's affective-cognitive state in real time. For additional information on facial expression recognition, see [Fasel], [Pantic] and [Tian].
Figure 1. A sample finite state machine diagram for the synthesis of video. The system has segments labeled with automatically identified facial actions (happy, angry, neutral, sleeping), hand-picked segments (response to "bravo!"), and non-labeled segments (facial action clusters).
- Detailed technical description (max 3 pages):
o Technical description
o Resources needed: facility, equipment, software, staff etc.
o Project management
a. Technical description: The project will focus on the research and
development of the following modules and subsystems:
• Wizard of Oz experiment (Data Collection): In the first part of the project, we will assume a manual segmentation of the input videos. They will be labeled with indicators such as affect and activity level. Also, the analysis part will be encapsulated in a Wizard of Oz approach; a researcher will be responsible for determining the acts of the model. The acquired multimodal input will be evaluated on a separate terminal by a researcher, who will analyze the user’s face and sound and determine the next segment to be played by the system. These observations and acts will be gathered to learn a proper set of reactions with respect to the expected behavior patterns.
• Facial Expression Module: This module analyzes the facial image sequences and extracts features, classifying them into expression categories or facial action units. This module will be used for both offline and online analysis, and its output feeds the module that models the emotion-specific patterns. For offline analysis, learned segments will be stored with their average facial motion vectors. Basic expressions (anger, happiness, neutral, disgust, puffy cheek, sleep) will have special detectors. We will seek them in the input videos, and fill their place with the most similar segments. Other segments will depend on the input size. The selection of a segment during synthesis can occur with a small random probability, as well as by observing a facial expression event that most resembles the segment’s mean motion vector (a sketch of this selection rule is given after the module list).
• Voice Feature Extraction Module: This module selects several vocal cues that can be used to trigger a response from the system, and implements robust recognition algorithms to identify the presence of these cues. Voice features support the modeling of the emotion-specific patterns. We will experiment with unsupervised sound event categorization for automatic transition selection, as in the facial expression case (see the clustering sketch after the module list).
• Emotion Patterns Modeling Module: Models the emotion-specific patterns and is used in the emotion recognition and video segmentation tasks (one possible modeling approach is sketched after the module list).
• Pattern Recognition Module: The user’s behaviors are analyzed and the current state is recognized.
• Video Segmentation Module: This is the core of the system, involving multimodal fusion of the face and voice patterns of the subject’s behavior. In this module, collected videos of the model character are analyzed offline, divided into short segments, and stored with indicators of affective content and activity levels (see the segmentation sketch after the module list). Segmentation errors here are not of great importance, as the synthesis module will use all the material.
• Video Synthesis Module: After recognition of the user’s current state, this module synthesizes and plays the responsive video segments online. Although interactive, we do not expect anything resembling true communication from this system, and the model that drives the synthesis will be a simple, sparsely connected finite state machine, with each state representing one short segment played in a loop (see Figure 1 and the FSM sketch below). The fall-back for this part of the system is producing static frames that are displayed for short periods, instead of synthesizing full video.
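The following sketches illustrate, under stated assumptions, how some of the modules above could be realized; they are starting points for the workshop, not final implementations. The first is a minimal sketch of the segment-selection rule of the facial expression module: with a small probability a random segment is chosen, otherwise the segment whose stored mean motion vector is closest to the observed facial motion. All names (Segment, select_segment) are illustrative only.

```python
# Illustrative segment selection: epsilon-random choice, otherwise
# nearest stored mean motion vector to the observed facial motion.
import random
from dataclasses import dataclass

import numpy as np


@dataclass
class Segment:
    label: str               # e.g. "happiness", "neutral", or an unlabeled cluster id
    mean_motion: np.ndarray  # average facial motion vector, computed offline


def select_segment(segments, observed_motion, epsilon=0.05, rng=random):
    """Pick the next segment to play given the observed facial motion."""
    # A small random probability of an arbitrary transition keeps playback lively.
    if rng.random() < epsilon:
        return rng.choice(segments)
    # Otherwise choose the segment whose mean motion vector is most similar.
    distances = [np.linalg.norm(s.mean_motion - observed_motion) for s in segments]
    return segments[int(np.argmin(distances))]
```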
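The second sketch concerns the unsupervised sound event categorization mentioned for the voice module. It assumes, purely for illustration, the librosa and scikit-learn libraries: short audio windows are described by mean MFCC vectors and clustered, and the cluster id of a new window can serve as a transition trigger. Function names and parameter values are assumptions, not part of the proposed system.

```python
# Illustrative unsupervised sound event categorization: mean-MFCC window
# features clustered with k-means (assumes librosa and scikit-learn).
import numpy as np
import librosa
from sklearn.cluster import KMeans


def window_features(path, win_sec=1.0, sr=16000, n_mfcc=13):
    """Return one mean-MFCC feature vector per non-overlapping window.
    Assumes the recording is longer than one window."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(win_sec * sr)
    feats = []
    for start in range(0, len(y) - hop, hop):
        mfcc = librosa.feature.mfcc(y=y[start:start + hop], sr=sr, n_mfcc=n_mfcc)
        feats.append(mfcc.mean(axis=1))
    return np.vstack(feats)


def categorize_sound_events(paths, n_events=8):
    """Cluster audio windows from the collected recordings into event types."""
    X = np.vstack([window_features(p) for p in paths])
    kmeans = KMeans(n_clusters=n_events, n_init=10).fit(X)
    return kmeans  # kmeans.predict(new_features) gives the event category
```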
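The third sketch shows one possible realization of the emotion patterns modeling module, following the HMM-based approach mentioned in the background: one Gaussian HMM per emotion label, with recognition by maximum likelihood. The hmmlearn library and the chosen model sizes are assumptions; the actual modeling technique is an open research question for the workshop.

```python
# Illustrative emotion pattern modeling: one Gaussian HMM per emotion,
# recognition by highest log-likelihood (assumes hmmlearn).
import numpy as np
from hmmlearn.hmm import GaussianHMM


def train_emotion_models(sequences_by_label, n_states=3):
    """sequences_by_label: dict mapping label -> list of (T_i, D) feature arrays."""
    models = {}
    for label, seqs in sequences_by_label.items():
        X = np.vstack(seqs)               # stacked observation vectors
        lengths = [len(s) for s in seqs]  # per-sequence lengths
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[label] = model
    return models


def recognize(models, sequence):
    """Return the label whose model gives the (T, D) sequence the highest
    log-likelihood."""
    return max(models, key=lambda label: models[label].score(sequence))
```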
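The fourth sketch outlines the offline segmentation step in its simplest form: fixed-length segments scored with a crude activity indicator (mean inter-frame difference). The affect indicator would come from the facial expression module and is omitted here; OpenCV and the fixed segment length are assumptions for illustration only.

```python
# Illustrative offline segmentation: fixed-length segments with a mean
# frame-difference activity score (assumes OpenCV).
import cv2
import numpy as np


def segment_by_activity(path, seg_len=50):
    """Split a video into fixed-length segments and score their activity."""
    cap = cv2.VideoCapture(path)
    segments, frames, diffs, prev = [], [], [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diffs.append(cv2.absdiff(gray, prev).mean())
        prev = gray
        frames.append(frame)
        if len(frames) == seg_len:
            segments.append({"frames": frames, "activity": float(np.mean(diffs))})
            frames, diffs = [], []
    cap.release()
    return segments
```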
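Finally, a minimal sketch of the sparsely connected finite state machine that drives synthesis (cf. Figure 1): each state is one short segment played in a loop, and transitions are taken when a recognized user event matches an outgoing edge, or at random with a small probability. State and trigger names are illustrative.

```python
# Illustrative synthesis FSM: states are looping segments, transitions are
# triggered by recognized user events or, rarely, chosen at random.
import random


class SynthesisFSM:
    def __init__(self, transitions, start, epsilon=0.05):
        # transitions: {state: {trigger: next_state}}
        self.transitions = transitions
        self.state = start
        self.epsilon = epsilon

    def step(self, trigger):
        """Advance the machine given a recognized user event (or None)."""
        edges = self.transitions.get(self.state, {})
        if trigger in edges:
            self.state = edges[trigger]
        elif edges and random.random() < self.epsilon:
            self.state = random.choice(list(edges.values()))
        return self.state  # the segment to keep looping until the next event


fsm = SynthesisFSM(
    {"neutral": {"happy": "happy", "bravo!": "bravo_response", "sleep": "sleeping"},
     "happy": {"neutral": "neutral"},
     "sleeping": {"noise": "neutral"},
     "bravo_response": {"neutral": "neutral"}},
    start="neutral",
)
print(fsm.step("happy"))  # -> "happy": play the 'happy' segment in a loop
```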
b. Resources needed: facility, equipment, software, staff etc.
Facility: Enough space for the set-up and enrolment.
Equipment: Webcams and PCs.
Staff:
-One or more researchers experienced in video processing and synthesis
-Two or more researchers for speech and/or sound based interfaces
-One researcher for usability engineering
c. Project management: The project leader will be responsible for this task.
6. Work plan and implementation schedule (max 1 page): a tentative timetable detailing the work to be done during the workshop
WP1 (First Week): Wizard of Oz study, data collection, data annotation, system requirements elicitation, platform selection
WP2 (Second Week): Development of program logic and FSM for the synthesis. Experiments with multiple FSMs. Parallel work on analysis from each modality. Data collection for real-time analysis
WP3 (Third Week): Incremental system integration. Prototype with manual segmentation. Segmentation by analysis.
WP4 (Fourth Week): Integration and demonstrators. Code optimization. Reporting.
7. Benefits of the research (max 1 page): expected outcomes of the project. Please describe what the tangible results are to be achieved at the end of the Workshop
We insist that all the software components used for the project, and all the software built during the project, should be free for use, and available as such to all participants (after the workshop too).
An affect-responsive interactive photo-frame prototype will be developed. Present digital frame systems work with a collection of photographs that are cycled repeatedly. We would like to develop a system that is more responsive, requiring some interaction to engage the viewer. A typical use-case is a collection of baby videos sent over to a relative who lives far away (for example, grandparents can interact with their newborn grandchildren by just downloading their videos). A second use-case is a professionally produced set of segments for a famous person. An automatic segmentation system for behavioral videos will be studied and implemented, which can be used in human-computer interaction and behavior understanding applications. The system is conceived of as modular, and fall-back strategies are planned for each module that fails to function properly.
- Profile of team leaders:
Leader (with a 1-page-max CV)
Dr. İlkay Ulusoy:
İlkay Ulusoy received her B.S. degree from the Electrical and Electronics Engineering (EEE) Department of Middle East Technical University (METU), Ankara, Turkey in 1994 and her M.Sc. degree from the Biomedical Engineering Department of the Ohio State University, Columbus, Ohio, USA in 1996. She completed her Ph.D. at the EEE Department of METU in 2003. During and after her Ph.D. she did research at the Computer Science Department of York University, York, UK and at Microsoft Research Cambridge, UK. Currently, she is an Asst. Prof. at the EEE Department of METU. She is a member of the IEEE. She is interested in computer vision and pattern recognition.
Dr. Albert Ali Salah:
Albert Ali Salah received his PhD at the Perceptual Intelligence Laboratory of Boğaziçi University, where he was part of the team working on machine learning, face recognition and human-computer interaction. He joined the Signals and Images research group at Centrum voor Wiskunde en Informatica (CWI) in Amsterdam in 2007 as a BRICKS/BSIK scholar, and the Intelligent Systems Laboratory Amsterdam (ISLA-ISIS) of the University of Amsterdam in 2009. With his work on facial feature localization, he received the inaugural EBF European Biometrics Research Award in 2006.
Staff proposed by the leader (with 1-page-max CVs):
Hamdi Dibeklioğlu:
Hamdi Dibeklioğlu received his B.Sc. degree from the Computer Engineering Department of Yeditepe University in June 2006, and his M.Sc. degree from the Computer Engineering Department of Boğaziçi University in July 2008. He was part of the Perceptual Intelligence Laboratory of Boğaziçi University during his M.Sc. education, where he worked on 3D facial feature localization and part-based 3D face recognition. He is currently a research assistant and a Ph.D. student at the Intelligent Systems Laboratory Amsterdam (ISLA-ISIS) of the University of Amsterdam. He works with Professor Theo Gevers and Dr. Ali Salah on human behavior analysis.
You may propose some members of your future team. If possible, try to avoid having too many people from your group: part of the benefit of eNTERFACE is to let people meet and share experiences from different places, and possibly in different languages.
o Other researchers needed (describing the required expertise for each, max 1 page):
-One or more researchers experienced in video processing and synthesis
-One or more researchers for speech and/or sound based interfaces
- References
This is very important for future members of your team; it will help them to prepare themselves for collaborating with you.
[Schröder] M. Schröder, E. Bevacqua, F. Eyben, H. Gunes, D. Heylen, M. Maat, S. C. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. Sevin, M. Valstar, M. Wöllmer, “A Demonstration of Audiovisual Sensitive Artificial Listeners”, Proc. International Conference on Affective Computing & Intelligent Interaction, Amsterdam, Netherlands, IEEE, 2009.
[Trung] T. Bui, J. Zwiers, M. Poel, A. Nijholt, “Toward affective dialogue modeling using partially observable Markov decision processes”, in D. Reichardt, P. Levi and J.-J. Ch. Meyer (Eds.), 1st Workshop on Emotion and Computing - Current Research and Future Impact, 29th Annual German Conference on Artificial Intelligence, pp. 47-50, 2006.
[Kaliouby] R. Kaliouby, P. Robinson, Real-Time Vision for HCI, chapter “Real-time Inference of Complex Mental States from Facial Expressions and Head Gestures”, pages 181–200, 2005.
[Sebe] N. Sebe, M. S. Lew, I. Cohen, Y. Sun, T. Gevers, T. S. Huang, “Authentic Facial Expression Analysis”, in Automatic Face and Gesture Recognition, 2004.
[Valenti] R. Valenti, N. Sebe, T. Gevers, “Facial Expression Recognition: A Fully Integrated Approach”, in Image Analysis and Processing Workshops (ICIAPW 2007), 14th International Conference on, pp. 125-130, 10-13 Sept. 2007.
[Fasel] B. Fasel and J. Luettin, “Automatic facial expression analysis: a survey”, Pattern Recognition, Volume 36, Issue 1, January 2003, Pages 259-275.
[Pantic] M. Pantic and L. J. M. Rothkrantz, “Automatic analysis of facial expressions: The state of the art”, IEEE Trans. Pattern Anal. Mach. Intell., 22(12):1424-1445, December 2000.
[Tian] Y. Tian, T. Kanade, J. F. Cohn, Handbook of Face Recognition, chapter “Facial Expression Analysis”, pages 247-275, 2005.
11. Other information (not mandatory)
Please insert here any other information you consider useful for the proposal evaluation.