Hold it. Stop right there. If you're in the process of deciding
on using VoiceXML technologies for a critical business application,
you've probably made many assumptions. Most of them are probably
wrong. Before you take the plunge, or even if you have already done so, you
might want to read how you can avoid the 10 most common pitfalls with
VoiceXML implementations.
Drawing on my experience over the past couple of years,
I've noticed a common set of misconceptions that customers,
start-ups, and vendors often hold. While I provide a set of best
practices for VoiceXML practitioners in my VoiceXML Bootcamp
training course, I haven't yet written about the common mistakes
made by customers, developers, and vendors alike. These mistakes
usually originate from a flawed set of expectations, which are
based on incorrect assumptions. I imagine that many of you have
already experienced the fallout of some of these misconceptions.
For those who are new to VoiceXML, listen up. This list might save
you some grief.
Speech Recognition is 98% accurate
This is a common figure touted by speech recognition vendors. The
number can be a bit misleading in reality. It is true that speech
recognition can be as much as 98% accurate as long as the speech
grammars are limited and optimal. Limited means that the total
possible grammatical combinations are relatively small. Having to
match five hundred first names from a database is an example of a
grammar that will have a less than 98% accuracy rate. A list of
twenty names would be limited and could potentially reach 98%.
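To make "limited" concrete, here is a rough sketch in plain Python (not VoiceXML) that counts how many distinct phrases a grammar allows when it is built from independent slots. The function name grammar_size and the sample lists are hypothetical, chosen only for illustration, and the count is a crude proxy at best: it says nothing about how similar the phrases sound to one another.

```python
from math import prod

def grammar_size(slots):
    """Count the distinct phrases a grammar built from independent slots allows.

    Each slot is a list of alternatives; one alternative is spoken per slot.
    """
    return prod(len(choices) for choices in slots)

# Hypothetical sample data, for illustration only.
short_list = [f"name{i}" for i in range(20)]    # a list of twenty names
long_list = [f"name{i}" for i in range(500)]    # five hundred first names

print(grammar_size([short_list]))              # 20
print(grammar_size([long_list]))               # 500
print(grammar_size([long_list, short_list]))   # 10000: two slots multiply
```

The last line shows why grammars grow quickly: adding a second slot multiplies the number of phrases the recognizer has to tell apart.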
What I mean by optimal is that the possible phrases that users
can speak are dissimilar from each other. An optimal grammar cannot
allow speakers to provide single letters or numbers, which have a
higher failure rate than a longer word or phrase because they
contain fewer phonemes (the basic sounds that make up a language).
Additionally, a 98% accuracy is rare in a noisy environment. For
example, a caller using a cell phone in their car in traffic with
the window rolled down and the radio playing Puff Daddy would be a
noisy, problematic environment.
The solution is to fall back to simpler grammars and step callers
through a set of directed prompts rather than allowing them to speak
more naturally; or to transfer them to a live representative. Your
application must be prepared to offer alternatives when speech
recognition fails--because it will fail at some point.
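The fallback ladder described above can be sketched as follows. This is plain Python, not VoiceXML, and the strategy names and thresholds (two failed attempts per stage) are assumptions made for illustration; a real application would express the same idea with its own platform's dialog and transfer mechanisms.

```python
def next_strategy(failures, max_open=2, max_directed=2):
    """Pick a dialog strategy from the number of consecutive recognition failures.

    The strategy names and default thresholds are hypothetical.
    """
    if failures < max_open:
        return "open_prompt"          # let the caller speak naturally
    if failures < max_open + max_directed:
        return "directed_prompts"     # simpler grammar, one question at a time
    return "live_agent"               # give up on speech and transfer


def run_call(outcomes):
    """Walk a list of recognition results (True = recognized) and return
    the strategy in use when the call is resolved."""
    failures = 0
    for recognized in outcomes:
        strategy = next_strategy(failures)
        if strategy == "live_agent" or recognized:
            return strategy
        failures += 1
    return next_strategy(failures)


print(run_call([True]))                  # open_prompt
print(run_call([False, False, True]))    # directed_prompts
print(run_call([False] * 4))             # live_agent
```

The point of the design is that the caller is never stuck in a loop: every failure moves the call one step closer to an alternative that does not depend on recognition.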
I don't think callers will like speech recognition
Opinions on this vary, and a few studies provide data on the
issue. It is true that callers usually prefer to speak
with a real person instead of a speech recognition system. However,
when given the choice between a touch-tone IVR and a speech
recognition IVR, most callers will prefer a speech system.
Interestingly enough, one study by AT&T showed that older
callers preferred speech while younger callers preferred touch-tone.
However, in applications that contain more than three levels of
menus or a complex series of prompts, most callers will
prefer speech over touch-tone when speech can get the caller to
their destination faster and more easily.
For example, let's consider an IVR system that allows a car
dealer to check their inventory. In a touch-tone IVR system, the
caller would either have to know the code for the given car make and
model, or they would have to wait for the system to provide them with
the corresponding number:
"For Ford, press 1. Acura, press 2. Honda, press 3."
A touch-tone system would also require three prompts and inputs:
make, model, and year.
With a speech recognition system, the task could be accomplished
faster and more conveniently:
"How many 2002 Ford Explorers do we have in stock?"
There are many more practical examples where speech provides a
more convenient alternative to otherwise overly complex touch-tone
VoiceXML gateways are all the same
For the purpose of evaluating VoiceXML gateway vendors, it's easy
to think, "Hey, they all support VoiceXML so they'll all
function the same." It's been my experience that even though
VoiceXML is a common standard, there are still areas of the
specification that are left to interpretation, and certain
limitations that vendors must address through proprietary
mechanisms. For example, Nuance's TTS interprets the VoiceXML TTS
tags differently than IBM's TTS. If you've timed and tuned the
prosody for one, it'll sound completely different in the other.
A second area in which gateways differ is how they integrate with
enterprise applications and databases. Some may provide tighter
integration through application integration components, while others
will leave the task to you.
A third area in which gateways differ is how they integrate with
existing telephony infrastructures. Some gateways were really
designed to stand alone and do not integrate well with an existing
PBX, IVR, ACD, or telephony switch. Others will provide tightly
integrated support for very specific equipment vendors.
Make sure you understand the telephony equipment that the
gateway will need to integrate with. Make sure you understand how
the gateway will integrate with your applications and databases.
Finally, assume that switching to a different vendor's gateway will
require modifications to code.
It's easy to write VoiceXML applications
Because VoiceXML is based on existing Web standards, many of the
techniques and skills that Web developers have amassed over the past
few years will translate into developing speech applications.
However, Web developers too often underestimate the learning curve required
to develop voice user interfaces and the difficulties that arise
when integrating VoiceXML applications with telephony equipment.
For example, how do you route callers from the PBX to the
VoiceXML gateway? Or how do you transfer a VoiceXML caller into the
ACD? To become an effective speech application developer, you'll
need to have a foundation in Web development, telephony, and
Mastering speech applications also requires knowledge and
experience in designing speech interfaces. That skill is part art, part
science. There are few resources on designing Voice User Interfaces
(VUIs) and there are only a handful of people and even fewer
companies that have any significant experience in this area. One
book I can recommend, however, is:
"Designing Effective Speech Interfaces" by
Susan Weinschenk and Dean T. Barker, published by Wiley.
VoiceXML as a specification is fairly easy to learn, but don't
think that means you can easily develop a good speech application. The best
way to test your success is to have a friend test it in their car,
on their cell phone, in traffic.
VoiceXML is portable if I use the standard tags
Wrong. Even though I wish it were so, I can't copy my
applications from Tellme, to BeVocal, to Voxeo, to VoiceGenie and
have them work without any changes. I'm not sure that I will EVER be
able to because of the subtle differences in how vendors implement
What this means is that you can't develop and test your
application on Tellme for free, and then go out and buy a dedicated
gateway from VoiceGenie without any code changes. Fortunately, the
code changes will be minor in scope compared to, say, porting a C
application to Java. Even so, it's best to select your platform
before you start developing the application so you know it will work
when it's deployed. So if you know you'll be going with a VoiceGenie
gateway, then go ahead and develop and test your application in
their hosted development environment. Then you know that your
application will work exactly the same when you install it on the
I've programmed IVRs so speech should be a breeze
Whoa there! This is equivalent to a Web developer saying that they can
develop VoiceXML applications with no training. Experience with
touch-tone IVRs will give you a good perspective on how a speech
application will function in your existing development environment.
However, you will need to become familiar with Web protocols and programming
Fortunately, IVR programmers have a leg up on understanding how
to design a VUI. Most of this experience does translate to speech.
However, you will have to throw out some of the design criteria and
assumptions that you would normally make for a touch-tone interface.
You'll have to switch from thinking in terms of a menu tree to
thinking more about speech dialog progressions.
Since VoiceXML is an open standard, integrating a gateway with
our PBX, ACD, or call center will be easier
Actually, the exact opposite is probably true, but for a
different reason. Yes, it is true that VoiceXML is an open standard,
which means that you will have more options in the future, but
openness doesn't necessarily have anything to do with maturity. What
I mean is that IVR systems that have had years to develop and mature
will likely have features, tools, and integration options that
VoiceXML gateways lack. Also, VoiceXML has limited call control
functionality and no CTI integration capabilities. Gateway vendors
either provide this functionality using proprietary APIs or
use a third-party product such as Intel's CT Connect. If you have
a complex telephony environment, you will want to be very careful
about which vendor you select. Make sure the vendor can explain
exactly how they will integrate their product into your environment.
With VoiceXML, callers will be able to just talk to the system
naturally and it will understand
This misconception stems from continuous speech recognition
products like Dragon Dictate and IBM ViaVoice, which allow users to
speak Word and email documents into existence. The speech
recognition that's used in VoiceXML typically requires developers to
create grammars. These grammars define everything that a caller can
say. If the caller says something that's not in the grammar, it
will not be recognized. Furthermore, there isn't anything in
VoiceXML that allows the speech recognition engine to take some
action based upon an interpretation of what was said. The
actions are all written into the VoiceXML code. Recently, however,
Nuance and SpeechWorks have introduced versions of their respective
speech recognition engines that allow callers to speak more
naturally by using statistical models instead of strictly defined
grammars. This technology is still experimental from a VoiceXML
standpoint and the voice browser working group at the W3C is still
working out how to handle semantic interpretation for speech
recognition. Within a year or so, it may be possible for a system to
ask, "How may I help you?" Until then, grammars must be
hand-coded, restricting the level of natural language that can be
used in VoiceXML applications.
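A toy model of grammar-constrained recognition, in plain Python rather than VoiceXML, shows the consequence. The phrases and action names are hypothetical, and a real recognizer matches audio against the grammar rather than comparing text, but the shape is the same: anything outside the grammar is a no-match, and the application code, not the recognizer, decides what each match does.

```python
from typing import Optional

# Hypothetical grammar: every phrase the application is prepared to hear,
# mapped to the action the application code will take.
GRAMMAR = {
    "check inventory": "inventory_dialog",
    "sales": "transfer_sales",
    "service department": "transfer_service",
}

def recognize(utterance: str) -> Optional[str]:
    """Return the action for an in-grammar phrase, or None for a no-match."""
    return GRAMMAR.get(utterance.strip().lower())

print(recognize("Sales"))                                   # transfer_sales
print(recognize("I need somebody to look at my brakes"))    # None
```

The second caller said something perfectly reasonable, but because the sentence is not in the grammar, the application has nothing to act on and must reprompt or fall back.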
VoiceXML is too new and isn't well supported
Well, this may have been true a year and a half ago, but things
have changed rapidly since then. Here's a partial list of
recognizable companies offering VoiceXML capabilities. You be the
judge as to whether VoiceXML is being supported:
As to VoiceXML being new: yes, it's fairly new. However, it's
based on stable technologies that have been developed over the last
30 years or so.
There really isn't a demand for VoiceXML yet and analysts
haven't recommended it
To debunk the myth that VoiceXML is not getting traction, I
talked with four speech recognition and IVR vendors. All four
told pretty much the same story. Customers are including VoiceXML as
a requirement in their Request For Proposals (RFPs) and are in the
early stages of evaluating or developing VoiceXML applications.
As to analyst coverage, there has been some. Gartner published
"IVR Magic Quadrant for 1H02 - Challenges for Incumbents"
in which speech recognition and VoiceXML are two drivers for IVRs.
This briefing can be downloaded from the InterVoiceBrite
I hope these insights will save you from some of the flawed
assumptions that I've made in the past. If you have stories or
tidbits of advice that you'd like to share, send them over and I
might publish them in the future.
About Jonathan Eisenzopf
Jonathan is a member of the Ferrum Group, LLC, which specializes in Voice Web consulting and training.
He will be teaching the VoiceXML
Bootcamp June 10-13 in Washington, D.C. Feel free to send an
email to email@example.com
regarding questions or comments about this or any article,
or for more information about training and consulting