Conversational browser for accessing VoiceXML-based IVR services via multi-modal interactions on mobile devices

JIEUN PARK, JIEUN KIM, JUNSUK PARK, DONGWON HAN

Computer & Software Technology Lab.

Electronics and Telecommunications Research Institute

161 Kajong-Dong, Yusong-Gu, Taejon, 305-350

KOREA

Abstract: - Users can access VoiceXML-based IVR (Interactive Voice Response) systems with mobile devices such as smart phones at any time and in any place. Even though their mobile devices have small screens, they have to interact with the services through a voice-only modality. As a result of this uni-modality, the services have some fundamental problems: 1) users cannot know which service items can be selected before the TTS (Text-to-Speech) engine reads them; 2) users always have to pay attention so as not to forget the items they can select and the item they will select; 3) users cannot immediately confirm whether their speech input is valid, so they always have to wait for new questions from the server in order to confirm. Because of this inconvenience and cumbersomeness, users generally prefer to connect directly with a human operator.

In this paper, we propose a new conversational browser that enables users to access existing VoiceXML-based IVR (Interactive Voice Response) services via multi-modal interactions on a small screen. The conversational browser fetches voice-only web pages from web servers, converts them into multi-modal web pages using a multi-modal markup language, and interprets the converted pages.

Key-Words: - VoiceXML, Multi-modal interaction, XHTML+Voice, WWW, Internet, Interactive voice response services

1 Introduction

VoiceXML allows Web developers to use their existing Java, XML, and Web development skills to design and implement IVR (Interactive Voice Response) services. They no longer have to learn proprietary IVR programming languages [1]. Many companies have rewritten their proprietary IVR systems in VoiceXML. In general, voice-only web services have simpler flows and more specific domains than existing visual web services because they must rely on a voice-only modality. Users can access voice-only web services with mobile devices such as smart phones at any time and in any place. Even though their mobile devices have small screens, they have to interact with the services through a voice-only modality. As a result of this uni-modality, the services have some fundamental problems: 1) users cannot know which service items can be selected before the TTS (Text-to-Speech) engine reads them; 2) users always have to pay attention so as not to forget the items they can select and the item they will select; 3) users cannot immediately confirm whether their speech input is valid, so they always have to wait for new questions from the server and reply to them in order to confirm. Because of this inconvenience and cumbersomeness, users generally prefer to connect directly with a human operator. In that case, the original purpose of IVR services, namely automation, may not be achieved.

In this paper, we propose a new conversational browser that enables users to access existing VoiceXML-based IVR services via multi-modal interactions on a small screen. Users can both see and hear the items they can select, and they can confirm the result of speech input through the displayed text instead of starting a new dialogue with the server for confirmation. The organization of this paper is as follows. In Section 2, we describe the concept of multi-modal access to VoiceXML-based IVR services. In Section 3, we describe the conversational browser architecture and its overall execution flow. In Sections 4 and 5, we describe related work and conclusions.

2 Multi-modal access to VoiceXML-based IVR services

2.1 VoiceXML-based IVR services

Fig. 1 illustrates the architectural model of VoiceXML-based IVR systems [1]. A conversation between a user and a system begins when a telephone call is initiated. Once a call is connected over the phone network, the VoiceXML Infrastructure acts as a “browser” and begins making a series of HTTP requests to a traditional web server for VoiceXML, audio, grammar, and ECMAScript documents. The web server responds with these simple documents over HTTP.

Fig. 1 The architecture model of VoiceXML-based IVR services

Once retrieved, the actual VoiceXML “interpreter” within the VoiceXML Infrastructure executes the IVR applications and engages in a conversation with the end user. All software and resources necessary to “execute” a particular IVR service – such as voice recognition, computer-generated text-to-speech, ECMAScript execution, etc. – are embedded within the VoiceXML Infrastructure. The following is a simple IVR service example [5] for ordering pizzas via a conversation between a human and a computer.

Computer: How many pizzas would you like?

Human: one

Computer: What size of pizza would you like? Say one of small, medium, or large

Human: medium

Computer: Would you like extra cheese? Say one of yes or no.

Human: yes.

Computer: What vegetable toppings would you like? Say one of Olives, Mushrooms, Onions, or Peppers.

Human: Um…help.

Computer: What vegetable toppings would you like? Say one of Olives, Mushrooms, Onions, or Peppers.

Human: Mushrooms

Computer: What meat toppings would you like? Say one of Bacon, Chicken, Ham, Meatball, Sausage, or Pepperoni.

Human: Help.

Computer: Say one of Bacon, Chicken, Ham, Meatball, Sausage, or Pepperoni

Human: Sausage

Computer: Thank you for your order.

Example 1. An IVR Service example
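For concreteness, the following fragment is a minimal, hedged sketch of how the first part of such a dialogue could be written in VoiceXML 2.0; the form and field names, the inline grammar, and the submit URL are illustrative choices of ours and are not taken from the cited example.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="pizza_order">
    <!-- First question: the quantity, using the built-in number grammar -->
    <field name="quantity" type="number">
      <prompt>How many pizzas would you like?</prompt>
    </field>
    <!-- Second question: the size, with an inline SRGS grammar -->
    <field name="size">
      <prompt>What size of pizza would you like?
        Say one of small, medium, or large.</prompt>
      <grammar type="application/srgs+xml" version="1.0" root="size"
               xmlns="http://www.w3.org/2001/06/grammar">
        <rule id="size">
          <one-of>
            <item>small</item><item>medium</item><item>large</item>
          </one-of>
        </rule>
      </grammar>
      <!-- Played when the user says "help", as in the dialogue above -->
      <help>Say one of small, medium, or large.</help>
    </field>
    <!-- Remaining questions omitted; finally the order is submitted -->
    <block>
      Thank you for your order.
      <submit next="http://example.com/order" namelist="quantity size"/>
    </block>
  </form>
</vxml>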

In the above example, users must answer the questions in the order pre-defined by the service provider and remember which items they can select. This voice-only interaction style is inefficient, especially since many users already access these services with mobile phones that have small screens. A small screen lets users see, as well as hear, the information about which items can be selected. It also lets them select items in any order, provided the selected item has no dependency on the others.

In the following section, we describe how the service of Example 1 can be accessed via multi-modal user interactions.

2.2 The access to IVR services via multi-modal user interactions

Fig. 2 shows the same service as in Example 1, accessed via multi-modal user interactions. If the user clicks the textbox below the label “Quantity”, he hears “How many pizzas...”, as in the first dialogue turn of Example 1. At this point, the user can reply via voice or text input. If the user says “one”, the textbox will show the character “1”. If the user clicks the label “Size”, he hears “What size of pizza...”, as in the second dialogue turn. The user can then select one of the radio buttons or say one of “small”, “medium”, or “large”. If the user says “small”, the radio button “small 12” will be selected. After answering all these questions, the user clicks the “Submit Pizza Order” button to send the data to the server.

By supplementing the voice modality with a visual modality, users know in advance which items can be selected and can choose their preferred modality according to the circumstances. Users also do not need to answer additional questions to validate their speech input, because they see the recognition result immediately in the displayed text.

Fig. 2 A multi-modal web page


There has been much previous research on multi-modal browsers [6, 7, 8, 9]. That research focuses on adding another modality (mainly voice) to existing visual browsers. We believe, however, that the benefit of adding a voice modality to visual-only web applications is smaller than that of adding a visual modality to voice-only web applications such as VoiceXML-based IVR services. For visual-only applications, voice support is not indispensable. For voice-only applications, however, visual support is a critical issue for users who are already accustomed to visual environments.

We use the XHTML+Voice (X+V) markup language [4], proposed to the W3C by IBM and Opera Software, to describe multi-modality. X+V extends XHTML Basic with a subset of VoiceXML 2.0 [10], XML Events, and a small extension module.

In X+V, the modularized VoiceXML does not include the “non-local transfer” elements such as “exit”, “goto”, “link”, “script”, and “submit”; the “menu” elements such as “menu”, “choice”, “enumerate”, and “object”; the “root” elements such as “vxml” and “meta”; or the “telephony” elements such as “transfer” and “disconnect”. The small extension module includes the important “sync” and “cancel” elements. The “sync” element supports synchronization of data entered via either speech or visual input.

In Fig. 2, the “Quantity” value entered via speech is displayed in the textbox below the “Quantity” label by means of a “sync” element. The “cancel” element allows a user to stop a running speech dialogue when voice interaction is not wanted.
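As a hedged illustration of how the synchronization in Fig. 2 might look in X+V markup, the fragment below ties a VoiceXML field to an XHTML text box with a “sync” element; the namespace prefixes follow the X+V profile, while the ids and the grammar file name are our own illustrative choices rather than the actual markup of the page in Fig. 2.

<!-- Voice part: a VoiceXML field that collects the quantity -->
<vxml:form id="voice_quantity">
  <vxml:field id="quantity" name="quantity">
    <vxml:prompt>How many pizzas would you like?</vxml:prompt>
    <vxml:grammar src="quantity.grxml" type="application/srgs+xml"/>
  </vxml:field>
</vxml:form>

<!-- Visual part: the text box below the "Quantity" label -->
<label for="quantity_box">Quantity</label>
<input type="text" id="quantity_box" name="quantity_box"/>

<!-- Sync: a value recognized for the VoiceXML field is mirrored
     into the XHTML input, so the user sees the result at once -->
<xv:sync xv:input="quantity_box" xv:field="#quantity"/>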

The structure of an XHTML+Voice application is shown in Fig. 3 [5].

Fig. 3 The components of an XHTML+Voice application

A basic XHTML+Voice multi-modal application consists of a Namespace Declaration, a Visual Part, a Voice Part, and a Processing Part. The Namespace Declaration of a typical XHTML+Voice application is written in XHTML, with additional declarations for VoiceXML and XML Events. The Visual Part of an XHTML+Voice application is XHTML code that displays the various form elements on the device's screen, if available. This can be ordinary XHTML code and may include check boxes and other form items found in a typical form. The Voice Part of an application is the section of code that prompts the user for a desired field within a form.

This VoiceXML code utilizes an external grammar to define the possible field choices. If there are many choices, or a combination of choices is required, the external grammar can be used to handle the valid combinations. The Processing Part of the application contains the code that performs the needed instructions for each of the various events [5].
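The skeleton below sketches how these four parts might be arranged in a single X+V document; the namespace URIs follow the X+V profile, while the ids, file names, and form action are purely illustrative.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Namespace Declaration: XHTML host language plus VoiceXML and XML Events -->
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Pizza order</title>
    <!-- Voice Part: a VoiceXML form kept in the document head -->
    <vxml:form id="voice_size">
      <vxml:field name="size">
        <vxml:prompt>What size of pizza would you like?</vxml:prompt>
        <vxml:grammar src="size.grxml" type="application/srgs+xml"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- Visual Part: ordinary XHTML form controls -->
    <form action="order.cgi">
      <!-- Processing Part: XML Events attributes run the voice dialogue
           when the "Size" label is clicked -->
      <p ev:event="click" ev:handler="#voice_size">Size</p>
      <input type="radio" name="size" value="small"/> small
      <input type="radio" name="size" value="medium"/> medium
      <input type="radio" name="size" value="large"/> large
      <input type="submit" value="Submit Pizza Order"/>
    </form>
  </body>
</html>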

3 The Conversational Browser

3.1 Conceptual Model

The conversational browser transforms voice-only web pages into multi-modal web pages that include visual as well as voice elements to support multi-modal user interaction. Using our conversational browser, a mobile user with a small screen can access existing VoiceXML-based IVR applications via visual as well as voice interactions. This supplement – adding visual interaction to voice-only applications – is more convenient for users than the reverse case of adding voice interaction to visual-only applications. Fig. 4 describes the conceptual model of our conversational browser.

Fig. 4 The conceptual model of the conversational browser

The conversational browser fetches the VoiceXML pages the user wants to access from web servers and analyzes which elements in those pages can be visualized. It then converts the original VoiceXML pages into XHTML+Voice pages with the same scenario.

The conversion process is divided into four parts – a VoiceXML part, an XHTML part, an event part, and a namespace part. In the VoiceXML part, the conversational browser transforms the elements of the original VoiceXML pages into the modularized VoiceXML elements allowed in XHTML+Voice. In the XHTML part, the conversational browser adds new XHTML elements that visualize certain VoiceXML elements. For example, a “prompt” element in VoiceXML presents information to the user by speaking it through a TTS engine; it can be converted to a “label” element in XHTML+Voice and shown as a text string on the screen. The “field” element in VoiceXML specifies an input to be gathered from the user; it can be converted to an XHTML input element of type “text” and shown as a text box on the screen. The event part combines visual and voice elements so that inputs generated from different modalities are synchronized. The namespace part makes the variables defined in the original VoiceXML pages available to the newly generated XHTML elements. An example of such a conversion is sketched below.
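The following before/after fragment sketches this mapping for a single question of the pizza service; it follows the rules just described, but the ids and the grammar file name are illustrative choices rather than literal converter output.

<!-- Original VoiceXML (input to the converter): prompt and field, voice-only -->
<form id="order">
  <field name="quantity">
    <prompt>How many pizzas would you like?</prompt>
    <grammar src="quantity.grxml" type="application/srgs+xml"/>
  </field>
</form>

<!-- New XHTML elements added by the XHTML part: the prompt becomes visible
     text and the field becomes a text box; the event part would then attach
     a sync element (as in the earlier fragment) to keep both inputs consistent -->
<label for="quantity_box">How many pizzas would you like?</label>
<input type="text" id="quantity_box" name="quantity_box"/>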

The conversion produces XHTML+Voice pages that include voice as well as visual elements. Finally, the conversational browser executes the XHTML+Voice pages.

3.2 Architecture

Fig. 5 describes the architecture of the conversational browser. It consists of three modules – a VoiceXML parser, a VoiceXML-to-X+V converter, and an XHTML+Voice interpreter. The conversational browser also needs external systems for voice interaction, such as a text-to-speech (TTS) engine and a speech recognizer, and a JavaScript engine for executing scripts. In the case of mobile devices, the TTS engine and speech recognizer have to be located on other platforms.

Fig. 5 The architecture of the conversational browser

3.2.1 VoiceXML Parser

The VoiceXML parser generates a DOM tree by parsing an input VoiceXML page. The generated DOM tree is passed to the VoiceXML-to-X+V converter, which transforms the voice-only application into a multi-modal application.

3.2.2 VoiceXML-to-X+V Converter

The VoiceXML-to-X+V converter creates an XHTML+Events DOM tree by referencing the visualizable elements of the VoiceXML DOM tree, and deletes or edits some elements in the original VoiceXML DOM tree.

Fig. 6 roughly shows the VoiceXML-to-X+V converter's execution flow. First, the converter creates a new XHTML+Events DOM tree that contains only a head element and a body element. Then the converter performs the following steps until all elements in the VoiceXML DOM tree have been visited.

In the case of a “block” element, which contains executable content, there are two sub-cases – one where the block contains PCDATA and another where it contains a “submit” element.

In the PCDATA case, the original meaning is that the TTS engine reads the content aloud. The converter therefore adds <p> elements to the created DOM tree so that the content is also visualized as text. In the “submit” case, the meaning is to submit values to a specific server. The converter adds an <input> node to the created tree and deletes the “submit” node from the original VoiceXML tree. The node is deleted because the host language of the multi-modal description is XHTML, not VoiceXML, and the same “submit” function is already provided by XHTML forms. A sketch of this mapping is given below.
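The fragment below sketches both cases; the markup illustrates the rule just described, and the action URL is an invented placeholder rather than one taken from a real service.

<!-- Original VoiceXML block: PCDATA to be spoken, followed by a submit -->
<block>
  Thank you for your order.
  <submit next="http://example.com/order" namelist="quantity size"/>
</block>

<!-- Converted XHTML fragment: the PCDATA becomes a visible paragraph and the
     submit node is removed, its role taken over by an XHTML form submission -->
<form action="http://example.com/order" method="post">
  <p>Thank you for your order.</p>
  <input type="submit" value="Submit Pizza Order"/>
</form>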