Demystifying 10 Common Misconceptions About VoiceXML

Wednesday Nov 6th 2002 by Jonathan Eisenzopf

If you're in the process of deciding on using VoiceXML technologies for a critical business application, you've probably made many assumptions. Before you take the plunge, or even if you have already done so, you might want to read how you can avoid the 10 most common pitfalls with VoiceXML implementations.

Hold it. Stop right there. If you're deciding whether to use VoiceXML for a critical business application, you've probably made many assumptions. Most of them are probably wrong. Before you take the plunge, or even if you've already done so, read on to learn how to avoid the ten most common pitfalls of VoiceXML implementations.


Drawing on my experience over the past couple of years, I've noticed a common set of misconceptions held by customers, start-ups, and vendors alike. While I provide a set of best practices for VoiceXML practitioners in my VoiceXML Bootcamp training course, I haven't yet written about some of the common mistakes made by customers, developers, and vendors. These mistakes usually originate from a flawed set of expectations, which are in turn based on incorrect assumptions. I imagine that many of you have already experienced the fallout of some of these misconceptions. For those who are new to VoiceXML, listen up. This list might save you some grief.

Speech Recognition is 98% accurate

This is a common figure touted by speech recognition vendors, but in reality it can be misleading. It is true that speech recognition can be as much as 98% accurate, as long as the speech grammars are limited and optimal. Limited means that the total number of possible grammatical combinations is relatively small. A grammar that must match five hundred first names from a database will have a less than 98% accuracy rate. A list of twenty names would be limited and could potentially reach 98%.

What I mean by optimal is that the possible phrases users can speak are dissimilar from each other. An optimal grammar cannot allow speakers to provide single letters or numbers, which have a higher failure rate than longer words or phrases because they contain fewer phonemes (the basic sounds that make up a language). Additionally, 98% accuracy is rare in a noisy environment. For example, a caller using a cell phone in their car in traffic with the window rolled down and the radio playing Puff Daddy is a noisy, problematic environment.

The solution is to fall back to simpler grammars and step callers through a set of directed prompts rather than allowing them to speak more naturally; or to transfer them to a live representative. Your application must be prepared to offer alternatives when speech recognition fails--because it will fail at some point.
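As a sketch of that fallback pattern, a VoiceXML 2.0 field can escalate through successive nomatch events and finally hand the caller to a live agent. (The grammar file name and transfer number below are hypothetical, and exact syntax varies by gateway.)

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="lookup">
    <field name="firstname">
      <prompt>Please say the customer's first name.</prompt>
      <grammar src="first-names.grxml" type="application/srgs+xml"/>
      <!-- First miss: re-ask with a simpler, directed prompt -->
      <nomatch count="1">
        <prompt>Sorry, I didn't get that. Please say just the first name.</prompt>
        <reprompt/>
      </nomatch>
      <!-- Second miss: stop fighting the recognizer, go to an agent -->
      <nomatch count="2">
        <prompt>Let me connect you to a representative.</prompt>
        <goto next="#agent"/>
      </nomatch>
    </field>
  </form>
  <form id="agent">
    <transfer name="call" dest="tel:+15555550100"/>
  </form>
</vxml>
```

Note that the transfer lives in its own form, since a catch block can only jump to a form item, not perform the transfer itself.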

I don't think callers will like speech recognition

There are various opinions on this along with a few studies that provide data on this issue. It is true that callers usually prefer to speak with a real person instead of a speech recognition system. However, when given the option between a touch tone IVR and a speech recognition IVR, most callers will prefer a speech system. 

Interestingly enough, one study by AT&T showed that older callers preferred speech while younger callers preferred touch-tone. However, in applications that contain more than three levels of menus or a complex series of prompts, most callers will prefer speech over touch-tone, because speech can get them to their destination faster and more easily.

For example, let's consider an IVR system that allows a car dealer to check their inventory. In a touch-tone IVR system, the caller would either have to know the code for the given car make and model, or they would have to wait for the system to provide them with the corresponding number:

"For Ford, press 1. Acura, press 2. Honda, press 3."

A touch-tone system would also require three separate prompts and inputs: make, model, and year.

With a speech recognition system, the task could be accomplished faster and more conveniently:

"How many 2002 Ford Explorers do we have in stock?"

There are many more practical examples where speech provides a more convenient alternative to otherwise overly complex touch-tone interfaces.
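The contrast can be sketched in VoiceXML 2.0 draft syntax (the grammar file and submit URL below are invented for illustration): the touch-tone version needs one menu per slot, while a mixed-initiative speech form can fill all three slots from a single utterance.

```xml
<!-- Touch-tone style: one menu per slot, one keypress each -->
<menu id="make-dtmf">
  <prompt>For Ford, press 1. Acura, press 2. Honda, press 3.</prompt>
  <choice dtmf="1" next="#model-dtmf">Ford</choice>
  <choice dtmf="2" next="#model-dtmf">Acura</choice>
  <choice dtmf="3" next="#model-dtmf">Honda</choice>
</menu>

<!-- Speech style: one mixed-initiative form, one utterance -->
<form id="inventory">
  <grammar src="inventory.grxml" type="application/srgs+xml"/>
  <initial>
    <prompt>What vehicle would you like to check?</prompt>
  </initial>
  <field name="year"/>
  <field name="make"/>
  <field name="model"/>
  <filled mode="all">
    <submit next="http://example.com/stock" namelist="year make model"/>
  </filled>
</form>
```

A caller who says "2002 Ford Explorer" fills all three fields at once; the form only prompts for whichever slots the utterance left empty.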

VoiceXML gateways are all the same

For the purpose of evaluating VoiceXML gateway vendors, it's easy to think, "Hey, they all support VoiceXML, so they'll all function the same." It's been my experience that even though VoiceXML is a common standard, there are still areas of the specification that are left to interpretation, and certain limitations that vendors must address through proprietary mechanisms. For example, Nuance's TTS interprets the VoiceXML TTS tags differently than IBM's TTS. If you've timed and tuned the prosody for one, it'll sound completely different on the other.
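For example, a prompt marked up with the draft's SSML-style elements (the pause and rate values below are arbitrary) may be rendered quite differently by two engines:

```xml
<prompt>
  Your order total is
  <break time="300ms"/>
  <emphasis>forty-two dollars</emphasis>.
  <prosody rate="slow">Please have your card ready.</prosody>
</prompt>
```

One engine may honor the 300-millisecond break exactly while another rounds it to its own pause granularity, and "slow" has no standardized absolute rate, so the same markup needs re-tuning per engine.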

A second area in which gateways differ is how they integrate with enterprise applications and databases. Some may provide tighter integration through application integration components, while others will leave the task to you.

A third area in which gateways differ is how they integrate with existing telephony infrastructures. Some gateways were really designed to stand alone and do not integrate well with an existing PBX, IVRs, ACD or telephony switch. Others will provide tightly integrated support for very specific equipment vendors.

Make sure you understand the telephony equipment that the gateway will need to integrate with. Make sure you understand how the gateway will integrate with your applications and databases. Finally, assume that switching to a different vendor's gateway will require modifications to code. 

It's easy to write VoiceXML applications

Because VoiceXML is based on existing Web standards, many of the techniques and skills that Web developers have amassed over the past few years will translate into developing speech applications. However, Web developers too often underestimate the learning curve required to develop voice user interfaces and the difficulties that arise when integrating VoiceXML applications with telephony equipment.

For example, how do you route callers from the PBX to the VoiceXML gateway? Or how do you transfer a VoiceXML caller into the ACD? To become an effective speech application developer, you'll need to have a foundation in Web development, telephony, and networking.

Mastering speech applications also requires knowledge and experience in designing speech interfaces. That skill is part art, part science. There are few resources on designing Voice User Interfaces (VUIs), and only a handful of people and even fewer companies have any significant experience in this area. One book I can recommend, however, is:

"Designing Effective Speech Interfaces" by by Susan Weinschenk, Dean T. Barker, published by Wiley.

VoiceXML as a specification is fairly easy to learn, but don't think that means you can easily develop a good speech application. The best way to test your success is to have a friend test it in their car, on their cell phone, in traffic.

VoiceXML is portable if I use the standard tags

Wrong. Even though I wish it were so, I can't copy my applications from Tellme, to BeVocal, to Voxeo, to VoiceGenie and have them work without any changes. I'm not sure that I will EVER be able to, because of the subtle differences in how vendors implement the standard.

What this means is that you can't develop and test your application on Tellme for free and then go out and buy a dedicated gateway from VoiceGenie without any code changes. Fortunately, the code changes will be minor in scope compared to, say, porting a C application to Java. Still, it's best to select your platform before you start developing the application so you know it will work when it's deployed. If you know you'll be going with a VoiceGenie gateway, develop and test your application in their hosted development environment. Then you know that your application will work exactly the same when you install it on the dedicated platform.

I've programmed IVRs so speech should be a breeze

Whoa there! This is equivalent to a Web developer saying that they can develop VoiceXML applications with no training. Experience with touch-tone IVRs will give you a good perspective on how a speech system will function in your existing development environment; however, you will need to become familiar with Web protocols and programming environments.

Fortunately, IVR programmers have a leg up on understanding how to design a VUI. Most of this experience does translate to speech; however, you will have to throw out some of the design criteria and assumptions that you would normally make for a touch-tone interface. You'll have to switch from thinking in terms of a menu tree to thinking in terms of speech dialog progressions.

Since VoiceXML is an open standard, integrating a gateway with our PBX, ACD, or call center will be easier

Actually, the exact opposite is probably true, but for a different reason. Yes, it is true that VoiceXML is an open standard, which means that you will have more options in the future, but openness doesn't necessarily have anything to do with maturity. IVR systems that have had years to develop and mature will likely have features, tools, and integration capabilities that VoiceXML gateways lack. Also, VoiceXML has limited call control functionality and no CTI integration capabilities. Gateway vendors either provide this functionality through proprietary APIs or rely on a third-party product such as Intel's CT Connect. If you have a complex telephony environment, be very careful about which vendor you select, and make sure the vendor can explain exactly how they will integrate their product into your environment.

With VoiceXML, callers will be able to just talk to the system naturally and it will understand

This misconception has to do with continuous speech recognition products like Dragon Dictate and IBM ViaVoice, which allow users to dictate Word documents and email into existence. The speech recognition used in VoiceXML typically requires developers to create grammars, which define everything that a caller can say. If the caller says something that's not in the grammar, it will not be recognized. Furthermore, there isn't anything in VoiceXML that allows the speech recognition engine to take action based on an interpretation of what was said; the actions are all coded into the VoiceXML application.

Recently, however, Nuance and SpeechWorks have introduced versions of their respective speech recognition engines that allow callers to speak more naturally by using statistical models instead of strictly defined grammars. This technology is still experimental from a VoiceXML standpoint, and the Voice Browser working group at the W3C is still working out how to handle semantic interpretation for speech recognition. Within a year or so, it may be possible for a system to ask, "How may I help you?" Until then, grammars must be hand-coded, restricting the level of natural language that can be used in VoiceXML applications.
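As a concrete sketch (the field name and word list are invented for illustration), an inline XML-form SRGS grammar enumerates the complete set of acceptable utterances; anything outside it throws a nomatch event rather than being "understood":

```xml
<field name="color">
  <prompt>What color would you like?</prompt>
  <grammar version="1.0" root="colors" mode="voice"
           xmlns="http://www.w3.org/2001/06/grammar">
    <rule id="colors">
      <one-of>
        <item>red</item>
        <item>green</item>
        <item>blue</item>
      </one-of>
    </rule>
  </grammar>
  <!-- A caller saying "purple" triggers <nomatch/>, not recognition -->
</field>
```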

VoiceXML is too new and isn't well supported

Well, this may have been true a year and a half ago, but things have changed rapidly since then. Here's a partial list of recognizable companies offering VoiceXML capabilities. You be the judge as to whether VoiceXML is being supported:

  • Lucent
  • Cisco
  • IBM
  • Sun
  • Oracle
  • Siemens
  • Nortel
  • Intel
  • Motorola
  • AT&T

As to VoiceXML being new: yes, it's fairly new. However, it's based on stable technologies that have been developed over the last 30 years or so.

There really isn't a demand for VoiceXML yet and analysts haven't recommended it

To debunk the myth that VoiceXML is not getting traction, I talked with four speech recognition and IVR vendors. All four told pretty much the same story: customers are including VoiceXML as a requirement in their Requests For Proposals (RFPs) and are in the early stages of evaluating or developing VoiceXML applications.

As to analyst coverage, there has been some. Gartner published "IVR Magic Quadrant for 1H02 - Challenges for Incumbents," which identifies speech recognition and VoiceXML as two drivers for the IVR market. This briefing can be downloaded from the InterVoiceBrite Web site.


I hope these insights will save you from some of the flawed assumptions that I've made in the past. If you have stories or tidbits of advice that you'd like to share, send them over and I might publish them in the future.

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC, which specializes in Voice Web consulting and training. He will be teaching the VoiceXML Bootcamp June 10-13 in Washington, D.C. Feel free to send an email with questions or comments about this or any article, or for more information about training and consulting services.
