The Business of Voice Hosting with VoiceXML

David L. Thomson and John M. Hibel

Lucent Speech Solutions

Room 4G-163

2000 Naperville Rd.

Naperville IL 60566

Thomson: 630-713-5283

Hibel: 630-979-8268

Amidst growing interest in voice access to the Internet, a new language, VoiceXML (Voice eXtensible Markup Language), promises to speed development and expand markets for web-based services and for speech recognition and synthesis services. It is also likely to spawn a new industry: "voice hosting." This novel business model follows the pattern of web hosting and allows developers to build new telephone-based services rapidly, without purchasing or installing new equipment. The voice hosting service provider leases telephone lines to the client and voice-enables a URL that the client specifies and programs in VoiceXML. This model will make it possible to build speech and telephony services in a fraction of the time, and at a fraction of the cost, of traditional methods.

What is Voice Hosting?

Provide telephone access to web sites

Fueled by explosive growth and mainstream adoption of the Internet, a new business of providing Web Hosting services is emerging. Industry players are expanding aggressively, and all are seeking ways to differentiate themselves in what promises to be a fiercely competitive industry for many years to come. At the same time, the web itself is entering a new phase of expansion. Working in a medium initially accessible only by computer, Internet properties have focused on attracting more "eyeballs" to their web sites. Now the Internet is expanding from "eyeballs" to "ears" as technology becomes available to provide access to Internet content via the telephone.

A voice hosting service provider (VSP) offers services that enable client web sites to be accessed by telephone, as shown in Figure 1. Through the use of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), voice hosting service providers can speech-enable client web sites in a manner that results in a pleasurable caller experience. Voice hosting can be a way for Web Hosting incumbents to differentiate themselves, and it can also provide an opportunity for new entrants to establish a market position in anticipation of the new voice hosting business.

New developments enable voice hosting

A key enabler that makes voice hosting economically practical is the new markup language, VoiceXML, designed to render web content over the phone. Clients, or "application owners," develop VoiceXML pages to present their web content using speech recognition and text-to-speech. The business of voice hosting is dramatically altered because VoiceXML allows the application to be separated from the speech resources, and it is no longer necessary for the application owner to also own the platform and software engines that provide ASR and TTS. With VoiceXML, application owners retain control of their applications while taking advantage of economies of scale by outsourcing necessary speech resources to a voice hosting service provider. The voice hosting service provider deploys high density speech servers and spreads the cost of shared ASR and TTS resources across many clients.

Figure 2 traces the evolution of VoiceXML, the culmination of over four years of research and development effort originating in Bell Labs. VoiceXML is backed by industry leaders, including Lucent, AT&T, Motorola, and IBM, who together founded the VoiceXML Forum in March 1999 for the purpose of establishing and promoting VoiceXML as an industry standard. Within three months, over 60 companies had joined the VoiceXML Forum, and many of them are already developing VoiceXML applications. To learn more about the VoiceXML Forum, visit

As an industry standard, VoiceXML can provide a common high level language that facilitates the development of speech enabled services in a manner that is attractive to developers. VoiceXML allows speech and telephony resources to be controlled in a uniform manner, it shields developers from platform-specific APIs, and it provides a common service creation environment for applications that are portable across many platforms. Furthermore, HTML developers will be able to learn VoiceXML with little effort. This allows application developers to leverage the same resources to develop and maintain both their web applications and their VoiceXML applications, and ensures availability of a large pool of potential developers.

VoiceXML allows the application to be separated from the speech resources

In traditional telephony or IVR services, the application is closely tied to a telephony platform. The application must run on or be co-located with the telephony platform, and if speech resources are used they must be closely integrated as well. The tight integration of the application and platform makes hosting services difficult because clients must depend on the hosting service provider for application integration, maintenance, and upgrades.

VoiceXML enables a web-based architecture for interactive speech services where the application and service logic are separated from the speech resources and telephony interface. Since these functions may even be owned by different companies, this separation opens the door for voice hosting service providers, who offer speech resources and telephony interface to clients who develop and maintain their own VoiceXML applications without service provider intervention.

VoiceXML scope

VoiceXML provides application developers input, output, and telephony control functions necessary to build speech interactive applications. These include:

VoiceXML input:

  • record spoken input
  • recognize spoken input
  • collect character input

VoiceXML output:

  • play audio files
  • produce synthesized speech

VoiceXML telephony control:

  • transfer a user to another destination, such as a live agent
  • disconnect a user
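
The functions listed above can be illustrated with a short sketch. The Python below assembles a minimal VoiceXML page that touches each category: an audio file with a synthesized-speech fallback, a field that accepts spoken or keyed digits, a recording, a transfer to a live agent, and a disconnect. The element and attribute names are taken from the W3C VoiceXML 2.0 vocabulary, which is an assumption here and may differ from the draft of the language this paper describes. The form name, prompts, audio file, and telephone number are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_prompt(parent, text):
    """Attach a <prompt> that is rendered with text-to-speech."""
    prompt = ET.SubElement(parent, "prompt")
    prompt.text = text
    return prompt


def build_order_status_page(agent_number):
    """Return a small VoiceXML page as a string (all names are hypothetical)."""
    vxml = ET.Element("vxml", version="2.0")
    form = ET.SubElement(vxml, "form", id="order_status")

    # Output: play an audio file; the enclosed text is the spoken fallback.
    greeting = ET.SubElement(ET.SubElement(form, "block"), "prompt")
    audio = ET.SubElement(greeting, "audio", src="welcome.wav")
    audio.text = "Welcome to the order status line."

    # Input: the built-in "digits" type accepts spoken or touch-tone digits.
    field = ET.SubElement(form, "field", name="order_number", type="digits")
    add_prompt(field, "Please say or key in your order number.")

    # Input: record the caller's spoken message.
    record = ET.SubElement(form, "record", name="message",
                           beep="true", maxtime="30s")
    add_prompt(record, "After the tone, leave a message for your sales contact.")

    # Telephony control: transfer the caller to a live agent.
    transfer = ET.SubElement(form, "transfer", name="agent",
                             dest=agent_number, bridge="true")
    add_prompt(transfer, "Connecting you to an agent.")

    # Telephony control: hang up once the transfer has ended.
    ET.SubElement(ET.SubElement(form, "block"), "disconnect")

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_order_status_page("tel:+16305550100"))
```

In practice an application owner would more likely write such a page by hand or generate it from the same templates that produce the HTML site. Building it programmatically here simply keeps the example checkable.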

VoiceXML can be used to build both very simple and complex services. It is designed to make common dialogs easy to write while at the same time enabling developers to create very complex dialogs if desired. User interaction can be structured and simple as with many of today's IVR systems, or it can be open and natural to produce speech-enabled interactions that are closer to human dialog.

By separating the application from speech resources, VoiceXML allows speech resources to be put in the network where economies of scale can be leveraged by many applications at once. This carries a significant economic advantage compared to premise-based speech interactive services. Voice hosting service providers can leverage this advantage into lower costs.

To fully realize these economies of scale, voice hosting service providers must aggregate large amounts of client traffic onto a highly scalable platform. Given the likelihood that VoiceXML applications will proliferate rapidly, voice hosting service providers will have to handle very large call volumes. From a business perspective, this makes scalability the central issue for voice hosting service providers.
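
One way aggregation can produce savings is that different clients' busy periods rarely coincide, so a shared pool can be smaller than the sum of dedicated installations. The sketch below illustrates that effect only. The traffic figures are invented for the example and are not drawn from this paper, and the paper itself does not commit to this particular mechanism.

```python
def dedicated_ports(demand):
    """Ports needed if each client sizes equipment for its own busiest period."""
    return sum(max(periods) for periods in demand.values())


def pooled_ports(demand):
    """Ports needed if one shared pool is sized for the combined busiest period."""
    combined = [sum(period) for period in zip(*demand.values())]
    return max(combined)


# Hypothetical peak concurrent calls per client in four periods of the day.
demand = {
    "client_a": [20, 5, 5, 10],
    "client_b": [5, 25, 10, 5],
    "client_c": [5, 5, 30, 10],
}

print(dedicated_ports(demand))  # 75
print(pooled_ports(demand))     # 45
```

With these made-up numbers the shared pool needs 45 ports where three separate installations would need 75. The gap shrinks as clients' peaks line up and grows as they spread out.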

Inside a Voice Hosting Deployment

The Players

Voice Hosting Service Provider (VSP)- The VSP provides speech resources and interconnects to the telephone network on behalf of its application owner clients. VSPs may be web hosting incumbents, telephony service providers, or entrepreneurial ventures with no established telephony or Internet business.

Application Owner- Application owners may be enterprises, service providers, or even individuals who own a VoiceXML application and wish to speech-enable their web sites for telephone access.


Caller- The caller is the end user in the voice hosting business model. Callers can access VoiceXML services simply by dialing a phone number provided by the VSP and most likely publicized by the application owner. The caller needs no special equipment, and unless restricted by the VSP can access the VoiceXML service via landline or wireless phone.

Architecture Elements

Figure 5 gives an overview of the architectural elements needed for voice hosting.

VoiceXML application/web server- Application owners write their applications in VoiceXML. The application itself may be one or many VoiceXML pages residing on a web server. The web server may be located on the application owner's premises, or anywhere else as long as the VSP can access it via Internet Protocol.

VoiceXML interpreter- The VoiceXML interpreter is enabling software which interprets VoiceXML commands for the speech server. The VoiceXML interpreter also maps a phone number to a URL so the speech server knows which VoiceXML page to request for an inbound caller. The VoiceXML interpreter software runs on a speech server.

Speech Server- The speech server is a high density ASR/TTS VoiceXML gateway capable of supporting speech recognition, text-to-speech, or any combination of ASR and TTS. The speech server is deployed by the VSP, and acts as a gateway between the telephone network and the data network.

Internet/LAN connection to web server- The speech server includes a LAN interface. The speech server requests VoiceXML pages via the LAN. If the VoiceXML pages are hosted locally, they are delivered to the speech server directly via the LAN. If hosted remotely, the VoiceXML pages are delivered to the speech server through the LAN via the Internet.

T1 connection to PSTN switch- The speech server includes connections such as T1 or VoIP cards for PSTN connectivity. Telephone calls are delivered to the speech server via this telephony interface.

To enter the voice hosting business, a VSP would install a speech server on its premises, and obtain T1 lines for PSTN connectivity. The VSP would obtain a set of phone numbers that would be routed through the T1 lines to the speech server. A hunt group of many T1 lines can be set up to route calls to multiple speech servers. For data network connectivity, the VSP would interface the speech server to a LAN, which would in turn be tied to the Internet.

The VSP can then approach prospective clients. The clients would be responsible for writing applications in VoiceXML. The client/application owner provides the VSP with a URL for the VoiceXML web page, and the VSP provides the client/application owner with a telephone number that will access that URL.
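
The exchange described above, in which the client supplies a URL and the VSP returns a telephone number, amounts to maintaining a lookup table that the VoiceXML interpreter consults on each inbound call. A minimal sketch follows. The class name, method names, telephone numbers, and URL are all hypothetical, and a production system would also need persistence and access control.

```python
from dataclasses import dataclass, field


@dataclass
class NumberDirectory:
    """Hypothetical VSP-side table mapping dialed numbers to VoiceXML URLs."""
    free_numbers: list
    routes: dict = field(default_factory=dict)

    def provision(self, start_url):
        """Assign the next free telephone number to a client's start page."""
        if not self.free_numbers:
            raise RuntimeError("no telephone numbers left to assign")
        number = self.free_numbers.pop(0)
        self.routes[number] = start_url
        return number

    def resolve(self, dialed_number):
        """Return the URL of the VoiceXML page to request for an inbound call."""
        if dialed_number not in self.routes:
            raise LookupError(f"no application provisioned for {dialed_number}")
        return self.routes[dialed_number]


directory = NumberDirectory(free_numbers=["+16305550101", "+16305550102"])
number = directory.provision("http://www.example.com/voice/start.vxml")
print(number, "->", directory.resolve(number))
```

Because the table holds only a URL, the client can change the application behind that URL at any time without asking the VSP to touch the mapping.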

Business Model

Voice hosting service providers supply application owners with PSTN connectivity and speech resources (ASR, TTS, or both). Application owners can pay the VSP on a per-call basis, or pay a monthly fee based on the number of speech resources (ports) or telephone lines consumed. In return, the application owner does not have to pay for its own T1 access or purchase premise-based ASR or TTS platforms. The application owner only needs to build and maintain a VoiceXML page on a web server. This arrangement can save the application owner months of development, procurement, and installation, as well as startup costs on the order of $100,000 or more. If the VSP also supplies web hosting, the application owner's VoiceXML pages can be hosted by the VSP as well.

For example, Spacely's Sprockets has a web site its customers use to obtain the latest information on new products and to check the status of current sprocket orders. Many of Spacely's customers are on the go, and frequently find themselves in dire need of sprocket information when they don't have a computer handy to access Spacely's web site. When this happens, they pick up a phone and call Spacely's IVR system, which treats them to a twisted maze of little prompts and touch-tone responses. Most callers opt out of Spacely's IVR system in favor of talking to one of Spacely's expensive, gregarious customer service representatives, who always converse happily with Spacely's customers and sometimes provide them with the information they are seeking.

Not one to miss out on a golden opportunity, Spacely sets out immediately to improve his bottom line. He mobilizes his web developers and puts them to work on a VoiceXML application intended to present the data in his web site over the telephone. Instead of a maze of touch-tone menus, the VoiceXML developers are able to create a natural, frictionless interaction using speech recognition that allows Spacely's customers to get the information they want much more quickly than ever before. In the event that a caller has a request that cannot be handled by the automated VoiceXML system, the caller can still opt out of the system and be connected to one of Spacely's costly, but friendly, customer service representatives.

Next, Spacely calls his local voice hosting service provider. The VSP provides Spacely with a phone number, to which Spacely re-routes his current customer service 800 number. Spacely provides the VSP with the front page URL of his VoiceXML application. Spacely tells the VSP he expects that about three T1 lines and 72 channels of ASR will be required to handle his traffic. The VSP quotes Spacely a monthly fee based on this amount of traffic, and Spacely hangs up the phone a happy man. He shuts down his IVR system, cancels the three T1 lines that fed his IVR system, and puts most of his expensive, gregarious customer service representatives to work making outbound calls that generate more revenue. Now his HTML/VoiceXML developers can easily maintain both his web site and the VoiceXML application, and they no longer have the headaches of a separate web site and IVR system. The developers are happy, Mr. Spacely is happy, Spacely's customers are happy, and the VSP is happy.
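
Spacely's estimate of three T1 lines for 72 channels follows from the fact that a North American T1 carries 24 voice channels. The short sketch below shows that sizing arithmetic. The function name is hypothetical, and the calculation assumes one ASR channel per simultaneous call, which the example implies but does not state.

```python
import math

CHANNELS_PER_T1 = 24  # a North American T1 carries 24 voice channels


def t1_lines_needed(concurrent_calls):
    """Return the number of T1 lines needed to carry the given simultaneous calls."""
    if concurrent_calls < 0:
        raise ValueError("concurrent_calls must be non-negative")
    return math.ceil(concurrent_calls / CHANNELS_PER_T1)


print(t1_lines_needed(72))  # 3
print(t1_lines_needed(73))  # 4
```

A single additional simultaneous call beyond 72 would require a fourth line, which is one reason a client might prefer to buy ports from a VSP's shared pool rather than provision whole T1 lines itself.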

The Web Meets Telephony

VoiceXML enables more than telephone access to the web

Even if you ignore its ability to access Internet content, VoiceXML can change the way speech interactive services are developed. In fact, virtually any telephony service can be built with VoiceXML, including auto attendants, voice dialing, information portals, and IVR systems. This means voice hosting service providers can also effectively act as service bureaus that provide a wide range of enhanced services for telephony.

Consider some of the possibilities. A web hosting company already has a base of business customers who pay for its web hosting services. It's a natural extension of that business to enter voice hosting and offer that base of business customers telephone access to their hosted web sites. It's also a natural extension to build a VoiceXML auto attendant application that allows each customer to administer its own corporate directory and dialog using a VoiceXML page. Each customer's VoiceXML application could be hosted along with its web site, and the VSP could offer a customized auto attendant service to each of its business customers.

Likewise, a telephony service provider can offer network-based enhanced services that can be self-customized by clients who maintain their own VoiceXML applications. Instead of just offering T1 connectivity to a client's premise-based IVR systems, the service provider can let the client build and maintain its own IVR system in VoiceXML while moving up the value chain and offering voice hosting services in addition to T1 connectivity.

The Future of IVR

VoiceXML accelerates IVR migration from premises to the network

It used to make a lot of sense for only about 10% of the IVR market to be network-based. Enterprises liked to maintain control of their IVR applications, and it was impractical to get a service provider involved every time they wanted to make a minor change. Touch-tone IVR systems were relatively inexpensive even for small-scale deployments, so it made sense to deploy them on the premises. VoiceXML and voice hosting can radically change the assumption that speech-based IVR systems make sense on the premises.

First, VoiceXML allows the enterprise to retain control of its IVR application, even though the IVR service is network-based through a VSP. It's not necessary for the enterprise to involve the VSP for development or changes to the IVR application written in VoiceXML.

Second, premises-based speech-enabled IVR systems can cost up to $2000 or more per speech port. In the network, speech ports can be deployed in very high volume at significant cost savings of 50% or more on a cost-per-channel basis. A VSP can deploy large volumes of speech ports and pass part of the cost savings to its clients who build their IVR applications in VoiceXML. This represents a cost saving opportunity for the enterprise, and a significant business opportunity for the VSP. For example, it would be less expensive for the enterprise to pay the VSP $50 per month per speech port than to buy premises equipment at $2000 per speech port[1].
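
The comparison in the example above can be made concrete with simple arithmetic. The sketch below uses only the two figures quoted in the text ($2000 per purchased port and $50 per month per hosted port) and deliberately ignores maintenance, financing, telephone line charges, and equipment lifetime. Those omissions are assumptions of this sketch, and the paper's own footnote may rest on different ones.

```python
def breakeven_months(purchase_price_per_port, monthly_fee_per_port):
    """Months of hosting fees that add up to the purchase price of one port."""
    if monthly_fee_per_port <= 0:
        raise ValueError("monthly_fee_per_port must be positive")
    return purchase_price_per_port / monthly_fee_per_port


def hosted_cost(months, monthly_fee_per_port):
    """Cumulative hosting fees for one port over the given number of months."""
    return months * monthly_fee_per_port


print(breakeven_months(2000, 50))  # 40.0
print(hosted_cost(24, 50))         # 1200
```

On purchase price alone, hosting stays cheaper for the first 40 months. Including the premises system's operating and maintenance costs would push the break-even point later.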

In addition to a lower cost per speech port, the enterprise would realize cost savings in operations and maintenance as well. Since the VSP is aggregating T1 lines as well as speech resources, it is likely that the VSP will have lower costs for maintenance and PSTN access. Furthermore, the VoiceXML application can be easily maintained along with the enterprise's web site, and there may no longer be a need to maintain both a web site and an IVR system.