VoiceXML is an excellent tool for developing voice applications that meet
particular criteria. However, contrary to what might be claimed by VoiceXML
enthusiasts (including those who sell VoiceXML services), it is not the perfect
tool for every project. Just as there is no perfect programming language for
every software application or perfect database for every database application,
there is no perfect platform choice for all voice-enabled applications.
A number of factors will influence which architecture, hardware, operating
system, language, and off-the-shelf software you should use for a particular
project. This article will give you a basis for understanding the strengths and
weaknesses of VoiceXML in order to help you determine if VoiceXML is the right
tool for your project.
To fully benefit from reading this article, you should be familiar with
general web application development principles as well as XML. It would also
help to be familiar with the basics of VoiceXML. A full reference of the
VoiceXML 2.0 specification can be found at http://www.w3.org/TR/voicexml20. If
you aren't inclined to read the entire spec, it would be worthwhile to at least
read the overview and background sections of that document before continuing on.
The strengths of VoiceXML lend it to specific types of applications. First,
VoiceXML is designed to be platform-independent (on the gateway side, not on the
application server side). VoiceXML is designed around the same server-side-pull
model used for HTML applications. In fact, VoiceXML applications can be, and
often are, run in conjunction with traditional web applications, accessing the same
data and performing the same essential tasks, even residing on the same
machines. VoiceXML allows a programmer to write a basic voice application
without having to know or learn anything about the voice hardware on which the
application will run.
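
To make the server-side model concrete, here is a minimal sketch in Python of an application server rendering a VoiceXML page from data, much as a web application would render HTML. The function name `render_prompt_page` is hypothetical, and the sketch assumes the gateway simply fetches the returned document over HTTP; the elements used (`vxml`, `form`, `block`, `prompt`) come from the VoiceXML 2.0 specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace


def render_prompt_page(message: str) -> str:
    """Return a VoiceXML 2.0 document that speaks `message` and then ends.

    Hypothetical helper: a real application would return this string as the
    body of an HTTP response requested by the VoiceXML gateway.
    """
    ET.register_namespace("", VXML_NS)  # serialize with a default namespace
    vxml = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    form = ET.SubElement(vxml, f"{{{VXML_NS}}}form")
    block = ET.SubElement(form, f"{{{VXML_NS}}}block")
    prompt = ET.SubElement(block, f"{{{VXML_NS}}}prompt")
    prompt.text = message  # ElementTree escapes &, < and > on output
    body = ET.tostring(vxml, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(render_prompt_page("Flight 12 is on time."))
```

Note that nothing in the sketch refers to voice hardware; the gateway decides how the prompt is actually spoken.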
VoiceXML also has a number of limitations. Its hardware independence comes at
a price: only a limited set of telephony functions is available in the VoiceXML
API (e.g., on-hook/off-hook call control, touch-tone synthesis and recognition).
Some occasionally essential (though admittedly less-used) functions are
simply not available in the VoiceXML API. For example, complex frequency
analysis used for outbound call progress detection is not available; speed and
volume control for audio file playback are also unavailable. Audio files cannot
be played beginning at an arbitrary point; this feature is necessary, for
example, when resuming playback of a paused or interrupted voicemail message.
A VoiceXML platform typically consists of a gateway and an application
server. The gateway almost always resides on the same machine as the voice
hardware, while the application server interfaces with any data and control
sources and houses the programming logic. In most cases, all programming takes
place on the application server side. For our purposes, you should treat the
VoiceXML gateway as a black box that interfaces with the phone network, the
caller, and the caller's telephone.
There are several platform options for
development and deployment of voice applications.
- Use a VoiceXML service bureau. This is the most common option for
less elaborate voice applications with relatively modest volume requirements.
You will probably still need to host the logic for your application on your own
server.
- Use a non-VoiceXML service bureau. You will have to pay
them to develop your application. There are fewer of these available as VoiceXML
takes over as the industry standard, but they may be less expensive, and can
provide you with some of the features missing from VoiceXML.
- Purchase hardware and build your own non-VoiceXML
application. This is by far the most difficult path to pursue, and will require
significant specialized training in telephony and speech recognition (if your
application requires it).
- Purchase a VoiceXML system to reside with
your equipment. You will probably still treat it largely as a black box, and may
need assistance ordering phone lines and connecting the system to the phone
network.
Your application may be suited for
VoiceXML if the following conditions are true:
The application only requires basic input from the user, and will only
deliver basic audio information to the user. The specification allows for
the playback of audio voice files as well as text-to-speech audio. VoiceXML
applications can gather touch-tones (DTMF) as well as recognize speech
interaction from the user. A flight status information line might fit this
Little, if any, interaction with the phone network is required. The
application should answer calls, interact with the user, and hang up at the end
of the call. There is a large amount of functionality available from the phone
network, but most of this is handled behind-the-scenes by the VoiceXML gateway
for you. However, if you find that you need more sophisticated phone network
functionality, such as access to automatic number identification (ANI, like
caller ID), billing telephone numbers (BTN), or the ability to set these
attributes for an outgoing call transfer, VoiceXML may not be right for your
application. In addition, there are a number of functions available in the ISDN
and SS7 network specifications which simply aren't available in VoiceXML. You
probably won't need these, but if you do, you're out of luck with VoiceXML.
Your voice application is no more critical than your web site. The
server-side logic for your voice application must reside on a web server. If
your application gives the caller access to the same data used by your web site,
it may be a good decision to run your voice application from the same server.
But if your server has occasional busy periods or outages, this will affect your
voice application too.
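
As a rough illustration of how little is needed for an application that meets these conditions, the following Python sketch generates a VoiceXML form that collects a flight number by speech or touch-tone and submits it back to the application server. The function name `render_flight_form`, the field name `flight`, and the submit URL are assumptions made for the example; `field` with `type="digits"`, `filled`, and `submit` are elements defined in the VoiceXML 2.0 specification.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace


def _q(tag: str) -> str:
    """Qualify a tag name with the VoiceXML namespace."""
    return f"{{{VXML_NS}}}{tag}"


def render_flight_form(submit_url: str) -> str:
    """Return a VoiceXML page that asks for a flight number (hypothetical)."""
    ET.register_namespace("", VXML_NS)
    vxml = ET.Element(_q("vxml"), {"version": "2.0"})
    form = ET.SubElement(vxml, _q("form"), {"id": "flight_status"})

    # The built-in "digits" type accepts spoken digits or DTMF key presses.
    field = ET.SubElement(form, _q("field"), {"name": "flight", "type": "digits"})
    prompt = ET.SubElement(field, _q("prompt"))
    prompt.text = "Please say or key in your flight number."

    # Once the field is filled, send its value back to the application server.
    filled = ET.SubElement(field, _q("filled"))
    ET.SubElement(filled, _q("submit"), {"next": submit_url, "namelist": "flight"})

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    # example.com stands in for your own application server.
    print(render_flight_form("http://example.com/flight-status"))
```

The server that receives the submitted `flight` value would respond with another VoiceXML page that reads the status to the caller.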
The following situations may indicate that
your application is not a good match for VoiceXML:
Playback of very long audio files is required. VoiceXML does not allow
for playing voice files at different speeds, or for beginning playback at a
specific point in a voice file. For example, most voice messaging systems allow
the user to press a key to fast-forward 10 seconds in a message. VoiceXML would
not support this.
Frequent call transfers to other phone numbers are required. For
example, applications designed to front-end call center transfers (for example,
to gather information from a caller before the call is sent to a live agent) are
typically better handled by hardware integrated with a call center's telephone
Outbound calling functionality is required. The VoiceXML specification
does not handle outbound dialing requirements. Several VoiceXML service providers
allow outbound calls outside of the VoiceXML spec, but these are proprietary
extensions, and applications written for one provider's platform will have to be
ported to work with another provider. Determining the outcome of call attempts
is a particularly difficult challenge in the development of outbound
applications. This function is usually achieved by complex frequency analysis,
which is not supported in VoiceXML. This function will probably be available
from a VoiceXML service provider, but each provider will have its own approach
and effectiveness claims for solving this problem (as well as its own outbound
There are a number of other
considerations to be taken into account when deciding if VoiceXML is for your
Cost: If you host your application with a service bureau, you will pay
by the minute for phone time, considerably more than you will pay per minute
for long distance or local phone service into your own system.
Call Volume: If you anticipate very high volumes of calls, you may
find a quicker ROI on an equipment purchase. If your call volume will fluctuate
and occasionally spike (for example, a vote-for-your-favorite-contestant phone
line during a television special), purchasing equipment may not be wise: you
will need enough capacity to support your peak call volumes, and your equipment
may sit idle the rest of the time. Service bureaus may charge more for spiky
volume, but with hundreds or thousands of phone lines at their disposal, they
will probably be able to handle it better than you can.
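
The cost and call-volume trade-off can be roughed out with a little arithmetic. The Python sketch below is only an estimate under stated assumptions: every parameter name and figure is hypothetical, the break-even calculation ignores financing and staffing costs, and the line count is the offered load in erlangs (calls per hour times average call length in hours) rounded up, which ignores blocking probability and so understates what a real peak would need.

```python
import math
from typing import Optional


def break_even_months(equipment_cost: float,
                      monthly_upkeep: float,
                      minutes_per_month: float,
                      bureau_rate_per_min: float,
                      own_rate_per_min: float) -> Optional[int]:
    """Months until buying equipment beats a service bureau, or None if never."""
    bureau_monthly = minutes_per_month * bureau_rate_per_min
    own_monthly = minutes_per_month * own_rate_per_min + monthly_upkeep
    monthly_saving = bureau_monthly - own_monthly
    if monthly_saving <= 0:
        return None  # at this volume the service bureau is always cheaper
    return math.ceil(equipment_cost / monthly_saving)


def lines_needed(peak_calls_per_hour: float, avg_call_minutes: float) -> int:
    """Rough simultaneous-line estimate: offered load in erlangs, rounded up."""
    return math.ceil(peak_calls_per_hour * avg_call_minutes / 60)


if __name__ == "__main__":
    # All figures below are made up for illustration.
    print(break_even_months(50_000, 1_000, 100_000, 0.10, 0.03))
    print(break_even_months(50_000, 1_000, 1_000, 0.10, 0.03))
    print(lines_needed(6_000, 3))
```

If the break-even point is years away, or the peak line count is far above your typical load, a service bureau is probably the safer choice.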
Connection to your equipment: A system hosted off-site can only be as
reliable as the link between your server equipment, data storage, and the
off-site voice gateway. If it's okay for your system to be occasionally
unavailable, you can use the Internet for this connectivity. If outages aren't
acceptable, you may have to lease a point-to-point data line between your site
and your vendor's gateway. Purchasing your own voice system avoids these issues;
you can co-locate it with your application server equipment and have a fast,
reliable connection between them.
Your expertise level (and that of your IT staff): Programming your own
non-VoiceXML application will require specialized skills, including acquiring
detailed knowledge of telephony, and mastering a daunting C or C++ API. In
addition, telephony equipment requires special maintenance skills. If your staff
consists of web programmers and general IT personnel, a hosted VoiceXML solution
may be better.
Capabilities of potential VoiceXML providers: Larger providers will
probably be more expensive but may offer valuable capabilities, such as a
higher maximum call capacity. Choosing the right
VoiceXML service provider is a topic for a separate discussion.
Portability of phone numbers: Most VoiceXML service bureaus will
"lend" you phone numbers for your application. If your company makes a
substantial investment in marketing the numbers for your application, the phone
number(s) the provider has lent you may become valuable to you. Unless you
negotiate a different arrangement up-front, a decision to switch providers may
cost you your existing phone numbers.
Availability requirements: If the system is very critical to your
business, and outages would be catastrophic, you will want a highly redundant
system as well as redundant connections from the gateway to the system. This
will drive up costs whether you go the equipment or service bureau route, and
may force you toward using a service bureau due to the expense of redundant
Probability of "feature creep": If it is likely that additional
requirements for your voice application may come later, remember that these new
features might not be supported by the decision you're making. Feature creep
often presents more of a challenge in voice applications because of the
diversity of supported feature sets; the cost of changing platforms to support
new requirements may be quite high.
Finally, here are a few examples of applications
and issues involved when creating each with VoiceXML:
Voice Messaging (as well as unified messaging): Several increasingly
common features of voice messaging are unavailable in the VoiceXML
specification, including the ability to pause and resume message playback, speed
up or slow down message playback, and fast-forward or rewind message playback.
Additionally, reading email messages to users via text-to-speech almost demands
these features. Prognosis: NOT a good match for VoiceXML; get closer to the
hardware with C++ or something else.
Order status: Assuming that you provide your customers with a means
for checking order status on the web, writing a narrow VoiceXML front-end to
this application could be fairly easy. You can use the same logic from your web
application to develop your VoiceXML application, and no special voice
functionality is required. You will need to provide some special provisions to
identify your callers (traditional usernames and passwords do not translate well
to phone usage). Prognosis: A pretty good match for VoiceXML, if you can give
them an easy way to log in on the phone.
Non-user-specific status information (this may include flight status,
road conditions, etc.): When callers do not need to be uniquely and positively
identified (i.e., any information can be provided to any caller), no prior setup
needs to be completed. Assuming the data you can provide to your callers is
readily accessible via your back-end web server, you should be able to quickly
build an easy-to-use application. Prognosis: probably as close to an ideal match
for VoiceXML as you will find.
Product availability or ordering system: Systems capable of providing
hundreds of pages of information (or hundreds or thousands of possible search
targets) are particularly difficult to develop for narrow interfaces such as
voice. The availability of a keyboard, mouse and visual display makes the
traditional HTML web interface a good mechanism for selecting an item from a
large number of search results. The limited input bandwidth of a voice
interface makes this task difficult. Speech
recognition systems can help crack this nut, but building these requires
considerable effort in the interface design. Prognosis: A difficult application
for any voice system, but if you can get a good speech interface designed,
VoiceXML may work just fine.
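
The warning signs discussed in this article can be collected into a rough checklist. The Python sketch below is one hypothetical way to do so; the `Requirements` fields and the wording of each concern are illustrative assumptions drawn from the points above, not part of any VoiceXML tool or specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    """Hypothetical requirement flags; the names are illustrative only."""
    playback_seek_or_speed: bool = False  # fast-forward, resume, speed control
    outbound_calling: bool = False        # dialing out, call progress detection
    frequent_transfers: bool = False      # e.g. front-ending a call center
    network_signaling: bool = False       # ANI/BTN control, ISDN or SS7 features
    mission_critical: bool = False        # outages would be catastrophic


def voicexml_concerns(req: Requirements) -> List[str]:
    """List the reasons, per this article, to think twice about VoiceXML."""
    concerns = []
    if req.playback_seek_or_speed:
        concerns.append("playback speed and position control are unavailable")
    if req.outbound_calling:
        concerns.append("outbound dialing relies on proprietary extensions")
    if req.frequent_transfers:
        concerns.append("transfers are better handled by call center hardware")
    if req.network_signaling:
        concerns.append("advanced phone network functions are not exposed")
    if req.mission_critical:
        concerns.append("availability depends on your web server and link")
    return concerns


if __name__ == "__main__":
    voicemail = Requirements(playback_seek_or_speed=True)
    for concern in voicexml_concerns(voicemail):
        print("-", concern)
```

An empty list does not prove VoiceXML is the right choice; it only means none of the warning signs covered here applies.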
If, after reading this article, you think your
application would work well with VoiceXML, sign up for a free development
account with one of the larger VoiceXML service providers and begin to
experiment with the methodology. If you have doubts, start to look at the APIs
provided by telecommunications hardware vendors such as Intel's Dialogic,
Lucent, or NMS. If your application may use speech recognition, investigate the
programming methodologies used by Speechworks and Nuance.
VoiceXML is best suited for applications which require relatively little
input from the user, deliver highly-targeted output, and in particular, provide
a set of data which is already (or easily could be) available via an HTML web
interface. When your application requires substantial content delivery, needs
complex navigation or broad ranges of input, or is mission-critical, give
careful consideration to your decision, seeking out a voice expert if you're not
About the Author
Brian Brown has been designing and building telecommunications and telephony
systems for 10 years, in various roles as employee, manager, company
founder, and outside consultant. Brian is currently Vice President of
Technology for a Denver-based transaction fulfillment startup. He holds a
bachelor's degree in Computer Science from the Massachusetts Institute of