Constantinos Boulis

RESEARCH STATEMENT

My research interests lie at the intersection of three fields: automatic speech recognition (ASR), natural language processing (NLP) and data mining. All three fields come together in Spoken Language Understanding (SLU) systems, i.e. systems that go beyond surface recognition to the interpretation of spoken language, and therefore have the potential to revolutionize human-computer and human-human interactions. The promise of SLU is to deliver natural and efficient human-computer spoken interaction and to augment human-human spoken communication. Since the mid-1990s, the core technologies of ASR and NLP have reached a level of maturity that has allowed basic commercial and research SLU systems to be developed. Examples include the DARPA Communicator project, an attempt to build a dialogue system for travel reservation over the phone, and AT&T’s “How May I Help You?”© project, a call-routing application where customers naturally describe the questions/problems they are facing and are automatically routed to the relevant department. These two examples demonstrate that a wide variety of applications with different research requirements fall under the general term of SLU.

Although the success of SLU systems depends on the success of the underlying core technologies, the research community has begun to realize that SLU offers unique research opportunities. Speech-to-concept is not a mere concatenation of speech-to-text and text-to-concept. Even if the underlying technologies reached perfection, we would still lose an important part of the information by ignoring the prosodic aspects of the speech signal, i.e. the way we speak rather than what we say. Moreover, bringing the core technologies to perfection may not actually be needed in many cases. For example, in call-routing applications a 30% word error rate in the ASR component degrades routing classification performance only minimally.

RESEARCH ACCOMPLISHMENTS

Over the past 8 years of my undergraduate and graduate research, I have focused on various problems in ASR, NLP and data mining. In my Diploma and Master’s theses, I explored various approaches to ASR speaker adaptation with small amounts of data. In my first year of PhD studies, I worked on a project to speech-enable mobile devices by designing source and channel coders specifically tailored for ASR. In my last 4 years of PhD research, I focused on topic learning in text and conversational speech, an area that gave me the opportunity to work on data mining problems, such as combining multiple clustering partitions, and NLP problems, such as classifying conversations by topic. My PhD dissertation rests on two main pillars: general algorithms for topic learning, and specific factors that can improve topic extraction in speech.

Algorithms for topic learning

Most past work on topic learning has implicitly made two major assumptions: that a certain amount of training data is always available, and that the most suitable representation of text is the bag-of-words, i.e. that only the counts of words, not their order, matter. In my research, I have challenged both of these assumptions. I explored unsupervised approaches to text learning where no training examples are provided. I formulated and implemented several novel clustering combination approaches that were shown to offer performance improvements over a variety of text corpora. In addition, the same methodology of combining clustering partitions was used to find the most prevalent clusters in the data, rather than outputting clusters that may not map to real entities. My work on clustering combination was published at the highly selective conference “Principles and Practice of Knowledge Discovery in Databases” (18% acceptance rate out of 581 submissions) and was honored with the Best Student Paper Award. In addition, I designed a way to incorporate selected word pairs into supervised topic learning. In the past, alternative representations to the bag-of-words have had mixed results. Using a new feature selection criterion, a small number of word pairs are chosen to augment the traditional bag-of-words representation. The new representation obtained better performance over a variety of corpora and learning methods.
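One common way to combine clustering partitions, sketched below in simplified form, is a co-association matrix that counts how often two documents land in the same cluster across several base partitions; the resulting pairwise similarities can then be re-clustered. The data and function names here are hypothetical illustrations of the general idea, not the specific algorithms developed in the dissertation.

```python
# Co-association sketch: how often does each pair of items co-occur
# in the same cluster, averaged over several base partitions?
from itertools import combinations

def co_association(partitions, n_items):
    """partitions: list of label lists, one integer label per item."""
    counts = [[0.0] * n_items for _ in range(n_items)]
    for labels in partitions:
        for i, j in combinations(range(n_items), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    # Normalize by the number of partitions -> similarity in [0, 1].
    m = float(len(partitions))
    return [[c / m for c in row] for row in counts]

# Three base partitions of five documents (label ids are arbitrary).
parts = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
sim = co_association(parts, 5)
print(sim[0][1])  # documents 0 and 1 agree in every partition -> 1.0
```

A standard clustering algorithm applied to `sim` would then yield the consensus partition; pairs with high co-association are the "prevalent" groupings supported by many base partitions.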

Modeling factors pertinent to topic learning in speech

Topic learning in speech is not the same as topic learning in text. Speech is a richer medium than text, but also a more challenging one. Topic learning in natural human-human conversations introduces a number of new issues, for example the role of prominence, dealing with disfluencies, gender or speaker idiosyncrasies, and the effect of word errors from the ASR component. In my research, I have shown that integrating automatically detected prominence can help design better feature subsets. Words with low average prominence, i.e. a low average prominence value over all occurrences of a word, are likely to be irrelevant to topics and can be filtered out. Combining prosodic and lexical measures of word saliency produced the best results for topic classification and clustering. Another phenomenon present in spoken language is disfluencies. Disfluencies distort the counts of words and can be modeled as a noisy channel. My experiments have shown that removing disfluencies does not improve topic classification performance when the bag-of-words representation is used, but when more complex representations are employed, such as word pairs, removing disfluencies is important. Gender and speaker idiosyncrasies are another phenomenon distinctive of spoken language. Although I found that training gender-dependent topic models or stop-word lists does not help topic classification, I was able to show that there are lexical differences between genders and that the gender of one conversation side influences the language of the other side. These findings have implications for a number of natural language processing tasks, such as language modeling and dialog act detection. Lastly, I have looked into the issue of dealing with word errors introduced by the ASR component. There has been considerable work in this area before, but my approach was to decompose the problem into two spaces, the semantic and the acoustic, and to build different feature selection and clustering techniques for each space.
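The prominence-based filtering idea can be sketched as follows: compute each word's average prominence over all of its occurrences and drop words that fall below a threshold, on the assumption that rarely stressed words are unlikely to be topic-bearing. The token scores, function names and threshold here are hypothetical, chosen only to illustrate the idea.

```python
# Prominence-based feature filtering sketch: keep only words whose
# average automatically detected prominence clears a threshold.
from collections import defaultdict

def average_prominence(tokens):
    """tokens: list of (word, prominence_score) pairs, e.g. from a prosody detector."""
    totals, counts = defaultdict(float), defaultdict(int)
    for word, score in tokens:
        totals[word] += score
        counts[word] += 1
    return {w: totals[w] / counts[w] for w in totals}

def filter_vocabulary(tokens, threshold=0.3):
    """Return the set of words whose average prominence is >= threshold."""
    avg = average_prominence(tokens)
    return {w for w, p in avg.items() if p >= threshold}

# Hypothetical token stream: content words carry higher prominence scores.
tokens = [("flight", 0.9), ("the", 0.1), ("flight", 0.7),
          ("booking", 0.8), ("uh", 0.05), ("the", 0.2)]
print(sorted(filter_vocabulary(tokens)))  # -> ['booking', 'flight']
```

In practice such a prosodic filter would be combined with a lexical saliency measure (e.g. an information-theoretic feature selection score), matching the finding above that the combination of the two performs best.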

FUTURE DIRECTIONS

SLU is in its early stages, and there are a number of promising avenues to explore. My research on SLU will follow three main directions.

Reducing the cost of deploying SLU systems

Collecting, annotating and analyzing training data are laborious, time-intensive and expensive steps that currently account for the majority of the costs of deploying an SLU system. Moreover, new data must be collected for each new task. Therefore, methods that use less annotated data without significantly compromising performance are extremely attractive. Equally important are reusing data from different domains and designing systems that are portable across different operating conditions and tasks. In addition, since it is usually easy to obtain large amounts of unannotated data, methodologies that exploit unannotated data, or even train an SLU system without any annotated data, are crucial. Research on semi-supervised learning for SLU has recently emerged, and initial results show the promise of such attempts.

SLU on human-human communication

The main focus of SLU has been human-computer interaction. Human-human communication, on the other hand, has not received the same amount of attention, despite occurring naturally and ubiquitously in environments such as business meetings and customer call-centers. The objective of SLU on human-human communication is to augment rather than replace the interaction, and it is therefore very different from SLU in human-computer interaction. SLU can be used to extract the topics of business meetings, through topic segmentation, clustering and characterization, which can in turn be used to retrieve relevant portions of meetings or produce meeting summaries. In customer call-centers, SLU can be used to assess the degree to which customers were satisfied with their interaction; detecting frustration or satisfaction can be an important component of such systems. Another interesting application of SLU to human-human communication is the 311 line, a phone line for city residents to make non-urgent requests/comments about city services. People place a phone call and converse with a city employee about the information they want to relay. A recent February 7th article in TIME magazine described the 311 line as “a way of harnessing the collective needs of an entire city” and cited numerous cases where the 311 line helped uncover knowledge previously unattainable by other means. The objective of such a system is to mine the vast amounts of speech data collected daily (in New York City alone there are 41,000 calls to 311 every day) for useful and actionable knowledge. The 311 line is a prime target for SLU to augment human-human communication: while a conversation is happening, the system can suggest new topics or questions for the city employee to ask, deemed relevant on the basis of similar past conversations.

PROPOSED COLLABORATIONS

IBM Research is a great place to pursue the future of SLU systems. A number of labs at IBM are already pursuing research directly or peripherally related to SLU. Under the Natural Language Processing research area there are two groups I would be especially fortunate to work with: the Human Language Technologies group and the Text Analysis and Language Engineering group, both at the IBM T.J. Watson Research Center. In the Human Language Technologies group, Dr. Roberto Pieraccini is a leading authority in SLU research, and I would be privileged to have the opportunity to collaborate with him. Also, Dr. Yuqing Gao’s leading-edge work on speech-to-speech translation shares many research issues with my PhD work. In the Text Analysis and Language Engineering group, Dr. Salim Roukos is an established researcher with experience in all aspects of natural language understanding.