VoiceXML Developer Series, Introduction

Tuesday Oct 1st 2002 by Jonathan Eisenzopf

The first edition of the VoiceXML Developer series will provide you with a synopsis of VoiceXML and a glimpse into the technology used to develop VoiceXML applications. Subsequent editions will go into the specific details of creating VoiceXML applications.

The goal of the VoiceXML Developer series is to provide a complete set of tutorials that gives developers the insight they need to build professional-quality VoiceXML applications. Because the concepts increase in difficulty, it's recommended that you read the editions in order. A thorough understanding of XML basics and server-side scripting is assumed.


VoiceXML is an XML format that utilizes existing telephony technology to interact with users over the telephone through speech recognition, speech synthesis, and standard Web technologies.


Background

The VoiceXML 1.0 specification was released in March 2000 by the VoiceXML Forum, which was founded by technologists from Lucent, AT&T, IBM, and Motorola. The group was formed out of the need to create a unified standard for voice dialogs rather than requiring customers to learn several XML specifications that had been developed internally within each member's research labs (starting as early as 1995). Companies outside the founding group had also experimented with voice dialog XML formats, including HP's TalkML and Sun's Java Speech Markup Language (JSML).

All of this led up to October 2000, when the VoiceXML Forum released VoiceXML 1.0 to the Voice Browser Working Group (founded in 1998) of the World Wide Web Consortium (W3C), the recognized standards body for the Web. This independent body has been working on the second version of the specification and has announced that it will release a revised specification sometime towards the end of 2001.

The nascent industry has grown rapidly since its millennium debut into a market that is expected to reach $200 million in 2001 and $24 billion by 2005. The industry has been driven in part by an existing marketplace that has used Interactive Voice Response (IVR) systems for call center automation; think "Press 1 for your account balance. Press 2 to transfer funds." You've probably used such a system to check your bank or credit card balances.

VoiceXML therefore fills an existing need for automation by improving upon the current technology and making it simpler to implement and integrate into the rest of the enterprise. It also opens a new opportunity for companies that have not been able to afford the cost or complexity of an IVR system, because it uses standard telephony components and lets them leverage their existing Web infrastructure, applications, and developer skills.

Technologies

A VoiceXML system is made up of a VoiceXML gateway that accesses static or dynamic VoiceXML content on the Web. The gateway contains a VoiceXML browser (interpreter), Text-To-Speech (TTS), Automatic Speech Recognition (ASR), and the telephony hardware that connects to the Public Switched Telephone Network (PSTN) via a T1, POTS, or ISDN telephone connection. A Plain Old Telephone Service (POTS) line is the type that's installed in your home and can only handle a single connection, whereas a T1 contains 24 individual phone lines.

A voice dialog typically consists of the following steps:

  1. The caller dials the system from a fixed or mobile telephone. The telephony hardware picks up the call and passes it to the VoiceXML browser.
  2. The VoiceXML gateway retrieves a VoiceXML document from the specified Web server and plays a pre-recorded or synthesized prompt.
  3. The user speaks into the telephone or presses a key on the phone keypad (which generates DTMF tones).
  4. The telephony equipment passes the recorded sound to the ASR engine (if it's speech), which uses a predefined grammar contained in the VoiceXML document.
  5. The VoiceXML browser executes the commands in the document based upon the ASR results (a match against the grammar or not) and plays another pre-recorded or synthesized prompt and waits for the user's response.
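
As a rough sketch, the steps above map onto a small VoiceXML 1.0 document like the one below. The form id, field name, and prompt wording are made up for illustration, and the built-in boolean field type stands in for a custom grammar, so treat this as an outline rather than a tested application.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <!-- Hypothetical form; the id and field name are illustrative only -->
  <form id="balance_check">
    <!-- The built-in boolean type supplies a yes/no grammar (DTMF 1 or 2 also works) -->
    <field name="wants_balance" type="boolean">
      <!-- Step 2: the prompt is rendered by the TTS engine -->
      <prompt>Would you like to hear your account balance?</prompt>
      <!-- Steps 3 and 4: silence or an utterance outside the grammar triggers a retry -->
      <noinput>I did not hear you. <reprompt/></noinput>
      <nomatch>Please say yes or no. <reprompt/></nomatch>
      <!-- Step 5: the browser acts on the recognition result -->
      <filled>
        <if cond="wants_balance">
          <prompt>One moment while I look that up.</prompt>
        <else/>
          <prompt>Okay. Goodbye.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The noinput and nomatch handlers cover the "match against the grammar or not" branch from step 5; how many retries to allow before giving up is a design choice left out here.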

Speech Recognition (ASR)

There are three leading products in the speech recognition world today: ViaVoice from IBM, Nuance 7 from Nuance, and OpenSpeech Recognizer from SpeechWorks. The leading non-commercial ASR is Sphinx, a project maintained by the speech group at Carnegie Mellon University. ASR works by taking recorded audio from a telephony card and using advanced algorithms to match it against a dictionary and a set of grammars. A grammar defines the words and phrases that the system expects users to speak.

Let's use a stock trading example. We might want to define a grammar that recognizes the action the user wants to take (buy or sell), the number of shares to trade, and the name or stock symbol of the company to trade. So we would need to break the grammar down into the following parts:

  • Recognize whether the user wants to buy or sell stock
  • Recognize the company name and associate it with a stock symbol
  • Recognize the number of shares to trade

We would create a grammar rule for each item above and associate each with a VoiceXML form field so that when the user says something like:

I want to sell 2000 shares of Microsoft stock.

the system will recognize that:

  • the user wants to sell rather than buy
  • the user wants to trade 2000 shares
  • the user wants to trade Microsoft stock

This information would then be returned to the VoiceXML interpreter, which copies the results into the VoiceXML field values. These are in turn submitted to a back-end script for processing.
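
A minimal sketch of such a form follows, assuming a platform that supports VoiceXML option lists and the built-in number type. The field names, the company list, and the submit URL are hypothetical. Note that this version collects the three values one field at a time; recognizing the whole sentence in a single utterance would need a form-level grammar, whose syntax depends on the ASR vendor.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form id="trade">
    <!-- Buy or sell: each option contributes a phrase to the field grammar -->
    <field name="action">
      <prompt>Would you like to buy or sell?</prompt>
      <option value="buy">buy</option>
      <option value="sell">sell</option>
    </field>
    <!-- Number of shares: the built-in number type supplies the grammar -->
    <field name="shares" type="number">
      <prompt>How many shares?</prompt>
    </field>
    <!-- Company name mapped to a stock symbol through the value attribute -->
    <field name="symbol">
      <prompt>Which company?</prompt>
      <option value="MSFT">Microsoft</option>
      <option value="IBM">I B M</option>
    </field>
    <!-- Hypothetical back-end script; replace with your own URL -->
    <block>
      <submit next="http://www.example.com/cgi-bin/trade"
              namelist="action shares symbol"/>
    </block>
  </form>
</vxml>
```

The final block only runs once all three fields are filled, at which point the back-end script receives action, shares, and symbol as ordinary request parameters.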

Text-To-Speech (TTS)

It is less clear who the eventual leaders in the speech synthesis (TTS) world will be, but the current front-runners are Speechify from SpeechWorks, Vocalizer from Nuance, and Fonix. Even the best TTS engine is still sub-par for most listeners, so limit TTS use to dynamic content that can't be pre-recorded by a professional voice talent. It's possible that we'll see high-quality speech synthesis using limited domain synthesis techniques from companies like Cepstral, but it remains unclear when this technology will reach the mainstream.

TTS engines work using a number of algorithms that assemble pre-recorded speech into the sounds for words. As a starting point, the basic phonemes of the language to be spoken (English, for example) are recorded and filed away. These phonemes are then combined to form words using a lexicon that tells the TTS which phonemes make up a particular word. The words are combined to form sentences and so on until the TTS has built the entire phrase, which is usually returned as a WAV file.

VoiceXML contains elements that control things such as volume, speed, and pitch. Unfortunately, vendors implement these features differently, so tuning to your specific platform is required.
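
For illustration, here is roughly what such markup looks like, assuming a platform that follows the VoiceXML 2.0 drafts and their SSML-style prosody element. VoiceXML 1.0 platforms expose similar controls under different element names, so check your vendor's documentation before relying on this fragment.

```xml
<!-- Fragment only: place inside a form or block of a VoiceXML 2.0 document -->
<prompt>
  Your balance is
  <!-- Slow down and raise the volume for the important part -->
  <prosody rate="slow" volume="loud">two hundred dollars</prosody>.
  <break/>
  <!-- Raise the pitch for the closing phrase -->
  <prosody pitch="high">Thank you for calling.</prosody>
</prompt>
```

How a given engine renders "slow" or "high" varies, which is exactly why per-platform tuning is needed.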

Telephony Equipment

VoiceXML gateways contain one or more telephony cards that handle things such as digital signal processing, call control, and call bridging. The leading card manufacturers are Dialogic (owned by Intel), Natural MicroSystems, Brooktrout, and Aculab. For the most part, VoiceXML abstracts away the existence of this hardware. The developer is able to focus completely on developing the VoiceXML content generated by the Web server rather than programming telephony cards. Most of the vendors support a wide range of connection options including T1, E1, ISDN, and POTS.

VoiceXML Documents

Although this use is less popular than interactive dialogs, VoiceXML can also be used to read out texts such as books, articles, or even Web pages with synthesized speech.

<?xml version="1.0" encoding="iso-8859-1"?>

<vxml version="1.0">
  <form id="form1">
    <block name="block1">Hello,
this is an example of a Voice XML document
using synthesized text. As you can hear,
it's a bit choppy. But I might be able to
pass as a silon from battle star galactica.
    </block>
  </form>
</vxml>

The listing above is about as simple as a VoiceXML document gets. vxml is the root element for VoiceXML documents in the same way that html is the root element for HTML documents. Most documents also contain a form element that holds a combination of recorded or synthesized prompts as well as form fields that users fill in with DTMF tones from keypad selections or with spoken input. This example contains no fields, just a paragraph of text. Text like this is usually wrapped in a block element.

Voice Dialogs

The steps listed earlier under Technologies sum up the activities that make up a single dialog interaction. In practice, most VoiceXML applications allow the user to hold a continuous dialog until they hang up. VoiceXML handles two types of voice dialog: directed and mixed initiative.

A directed dialog is one in which the system controls when and how the user can interact with it. Good examples are the numerous IVR systems that allow us to check our account balances. The system plays a pre-recorded prompt, giving us a menu of selections and prompting us to push a number for a given item. Once the selection has been made, the system either gives us the information we've requested or plays another prompt for a sub-menu. For example:

Computer: For account balance, press one. 
For recent transactions posted to your 
account, press two. To transfer funds, 
press three.
User: 3 (DTMF)
Computer: To transfer from savings, press 
one. To transfer from checking, press two.
User: 1 (DTMF)
Computer: Please enter the amount to 
transfer using your keypad...

These systems are effective but not friendly. They don't allow the user to control the call flow other than to select a pre-defined choice. VoiceXML provides the <menu> tag, which gives us the same essential functionality as a standard IVR system.

The value would be high enough if it only gave us equivalent functionality, but VoiceXML also lets us leverage recent advances in speech recognition quality so that users can interact with systems in a more natural way: through conversation. A mixed initiative dialog lets the user make requests in the same way you might ask a co-worker for a piece of information. It's up to the VoiceXML developer to guide the user towards the right verbal commands and then to recognize them. For example:

User: Transfer two hundred dollars from savings 
to checking.
Computer: Please verify that you want to transfer 
two hundred dollars from savings to checking by 
saying yes, or say no to start over.
User: Yes.

Whether to use a directed dialog with menu selections or a mixed initiative dialog depends on the need, so let's look more closely at what VoiceXML provides for menu-driven dialogs versus more open-ended ones. First, like HTML forms, VoiceXML forms may contain multiple fields that can be filled out in any order the user chooses (though you could force the order through JavaScript). In fact, VoiceXML supports mixed initiative dialogs through the <form> and <field> elements.
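
A mixed initiative form along the lines of the funds transfer exchange might be sketched as follows. The grammar file transfer.gram is hypothetical and its format depends on your ASR platform; it is assumed to fill the amount, source, and target fields from a single utterance. The <initial> element asks the open question, and the individual fields act as a directed fallback.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form id="transfer">
    <!-- Hypothetical form-level grammar; assumed to fill amount, source, and target -->
    <grammar src="transfer.gram"/>
    <initial name="start">
      <prompt>What would you like to do?</prompt>
      <!-- After a second failed attempt, fall back to field-by-field prompts -->
      <nomatch count="2">
        Let's take it one step at a time.
        <assign name="start" expr="true"/>
        <reprompt/>
      </nomatch>
    </initial>
    <field name="amount" type="currency">
      <prompt>How much would you like to transfer?</prompt>
    </field>
    <field name="source">
      <prompt>From which account?</prompt>
      <option value="savings">savings</option>
      <option value="checking">checking</option>
    </field>
    <field name="target">
      <prompt>To which account?</prompt>
      <option value="savings">savings</option>
      <option value="checking">checking</option>
    </field>
  </form>
</vxml>
```

If the caller's first utterance fills only some of the fields, the interpreter prompts for the ones that remain, which is what gives the dialog its mixed initiative feel. A confirmation step and the submit to a back-end script are left out to keep the sketch short.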

Despite the flexibility of a VoiceXML form, menus can also utilize voice recognition technology in addition to recognizing phone key presses (or DTMF tones).

<menu dtmf="true">
  <prompt>What is your favorite color? For red, 
say red or press 1. For blue, say blue or press 2. For 
Yellow, say yellow or press 3.</prompt>
  <choice next="red.vxml">red</choice>
  <choice next="#blue.vxml">blue</choice>
  <choice next="yellow.vxml#yel">yellow</choice>
</menu>

The code segment above lets the user make a selection either with the phone keypad or by simply saying the color they prefer. The text inside each choice element specifies the string that the ASR should try to match. In a real application you could (and should) prompt for DTMF tones ("press 1") or spoken text ("say red"), but not both.

VoiceXML Deployment Costs

I'm often asked how much a VoiceXML system costs to deploy. Fortunately, the range is wide and it depends on whether you need a dedicated system or are willing to outsource to a Voice Service Provider (VSP). A dedicated VoiceXML gateway usually starts around $100,000 for the hardware, software, and installation depending on how many concurrent callers you need to handle.

On the low end, VSPs usually charge you per minute so you only have to pay for actual use. Prices are a few cents more than you're probably paying for long distance service and the top providers (TellMe, BeVocal, and Voxeo) are all quite good in terms of national coverage and pricing.

There really isn't a firm middle ground yet (below $100,000), but we should expect to see offerings in the $30,000 to $50,000 range as competition heats up and competitors move to serve demand in the mid-sized enterprise space. We will be looking at specific products in a future article and product reviews so that you have a better sense of what the options are.

Developing VoiceXML Applications

As was mentioned previously, VoiceXML gateways retrieve VoiceXML files over HTTP from any standard Web server. This also means that dynamic applications can be built with the same languages and technologies that you're using to build Web applications today. This is truly one of the great advantages of the technology. Furthermore, if you've gone to the trouble of separating your business logic from the presentation logic, you can reuse that same business logic in VoiceXML applications by swapping out the HTML presentation layer for VoiceXML content. JavaBeans, CORBA, and .NET are all technology architectures that encourage this type of logic/presentation separation.

If all of your code is still embedded in a JSP, ASP, or ColdFusion page, don't fret. You can move the existing code into new templates or take this opportunity to separate the code logic into libraries or components. We will address this process in a future article.

Vendors and Tools Support

Support for VoiceXML is nonexistent in most of the Web development tools that you might be using now, such as Dreamweaver and BBEdit. However, you can use an XML tool like XMLSpy to develop and validate VoiceXML documents. There are also several VoiceXML editors available from independent providers, such as Voice Studio from Cambridge VoiceTech and V-Builder from Nuance, that are shaping up fast.

Support from big vendors is on the horizon, however. IBM is one of the few vendors that has integrated VoiceXML into its code editor for WebSphere. This isn't surprising, though, since IBM is one of the leading VoiceXML platform providers.

The future of VoiceXML

The W3C hasn't made it totally clear what the next steps are beyond VoiceXML 2, other than the specification drafts that have been published in the past year. It seems likely that VoiceXML will be broken up into several specifications that control various aspects of a voice dialog, like speech synthesis or grammars. This will provide clarity and drive industry adoption. It will also create complexity. We'll have to wait and see what balance is struck in moving the VoiceXML standard forward. What is clear, however, is that VoiceXML (or whatever it becomes) is here to stay. One large technology vendor that has remained silent for some reason is Microsoft. I expect that we'll see something like Voice.Net in the future. It's worth noting that Microsoft licensed technology from Lernout & Hauspie, which was the leading voice technology vendor until it filed for bankruptcy after creatively inventing some revenues in Asia.

Conclusion

Well, I hope you've enjoyed reading this introduction to VoiceXML as much as I have enjoyed writing it. I hope that you'll come back for the next edition of the VoiceXML Developer series as we learn more about VoiceXML.

Resources

    VoiceXML Development Tools

  • Nuance V-Builder - http://extranet.nuance.com
  • IBM WebSphere
  • Cambridge VoiceTech Voice Studio - http://www.cambridgevoicetech.com

    Voice Portal MSP

  • Voxeo - http://www.voxeo.com
  • BeVocal - http://www.bevocal.com

    Turnkey Systems

  • Voice Genie - http://www.voicegenie.com
  • Cambridge VoiceTech - http://www.cambridgevoicetech.com

    Articles

  • CTLabs VoiceXML Portal Report - img.cmpnet.com/commweb2000/whites/VXMLreport.pdf
  • Tellme More - http://www.voicexmlplanet.com
  • VoiceXML Adventure - http://www.voicexmlplanet.com

    Web Sites

  • VoiceXML Planet - http://www.voicexmlplanet.com
  • VoiceXML Forum - http://www.voicexml.org

    Training

  • The Ferrum Group, LLC - http://www.ferrumgroup.com

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC based in Reston, Virginia that specializes in Voice Web consulting and training. He has also written articles for other online and print publications including WebReference.com and WDVL.com. Feel free to send an email to eisen@ferrumgroup.com regarding questions or comments about the VoiceXML Developer series, or for more information about training and consulting services.
