VoiceXML Developer Series: A Tour Through VoiceXML

by Jonathan Eisenzopf

In this second edition of the VoiceXML Developer, we'll begin a complete walk-through of all the elements included in the VoiceXML 1.0 specification. This issue introduces the basic elements used to mark up content for the voice Web. We will focus primarily on the functionality that allows VoiceXML to control Text-To-Speech (TTS) output.

Root Element

The root element of a VoiceXML document is the <vxml> element, which is similar to the <html> tag in HTML. The root element is preceded by an XML declaration and an optional document type declaration.

<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC '-//Nuance/DTD VoiceXML 1.0//EN' 
<vxml version="1.0">
  <form>
    <block>can anybody hear me?</block>
  </form>
</vxml>

The DTD above points to the Nuance version of the VoiceXML 1.0 specification and is needed for the document to run properly on the Nuance platform. You will need to change this DTD to match your vendor, or you can remove it altogether, since it's not required. The <form> element is similar to an HTML form in that it can contain multiple fields, which are filled out and submitted by a user. VoiceXML forms operate in a similar manner, albeit through a different user interface. The <block> element, which is roughly the VoiceXML equivalent of the HTML <p> tag, synthesizes the enclosed text via a Text-To-Speech (TTS) engine.
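
Because the document type declaration is optional, the same kind of document can be written without it. The following is a minimal sketch under that assumption; the id value and the greeting text are made up for illustration, and an individual platform may still expect its own DTD.

```xml
<?xml version="1.0"?>
<!-- No DOCTYPE here: the declaration is optional -->
<vxml version="1.0">
  <form id="hello">
    <!-- The text inside <block> is read aloud by the TTS engine -->
    <block>Hello from the voice Web.</block>
  </form>
</vxml>
```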

A VoiceXML document

The following is a first look at a complete VoiceXML document that uses the elements we'll be learning about today. If you are using a VoiceXML editor such as V-Builder, you should be able to cut and paste the example into your editor and play it. To demo this VoiceXML example, call VoiceXML Planet at 510-315-6666. At the first menu, press 1. At the demo menu, press 1 again to hear the example below.

<?xml version="1.0" encoding="iso-8859-1"?>

<vxml version="1.0">
  <form id="form1">
    <!-- Block 1: plain text, synthesized as-is -->
    <block name="block1">Hello,
      this is an example of a Voice XML document using
      synthesized text. As you can hear, it's a bit choppy.
      But I might be able to pass as a silon from battle
      star galactica.
    </block>

    <!-- Block 2: a pause and an emphasized word -->
    <block name="block2">
      <prompt>Voice XML provides some features for
      controlling how I pronounce words and phrases.
      For example, you can create a pause.
      <break size="large" msecs="5000" />
      I can also emphasize a phrase. John Bigbootae,
      I <emp level="strong">must</emp>
      have that overthruster!
      </prompt>
    </block>

    <!-- Block 3: a pre-recorded prompt followed by synthesized text -->
    <block name="block3">
      <pros vol="1" rate="-50%"><audio
      src="../prompts/prompt1.wav" />
      synthesized prompts.</pros>
    </block>

    <!-- Block 4: special character classes and substitutions -->
    <block name="block4">
      <prompt>Sometimes, you may need to tell me how
      to pronounce a phrase such as a date, currency
      or abbreviation.
      Please mail
      <sayas class="currency">$10,000.55</sayas> into
      <sayas sub="world wide web consortium">W3C</sayas>
      account number
      <sayas class="digits">55432</sayas> by,
      <sayas class="date">October 11, 2001</sayas> or call,
      <sayas class="phone">800-555-1212</sayas>
      </prompt>
    </block>

    <!-- Block 5: nested prosody controls -->
    <block name="block5">
      You can also control the <pros pitch="+50%">
      prosity of <pros vol="1" rate="-50%">
      my speech including volume,
      pitch, and speaking rate.</pros></pros>
    </block>
  </form>
</vxml>

The example above contains five <block> elements. The first block contains nothing but text, which is synthesized by the TTS engine. The second block creates a pause with the <break> element and adds emphasis to a synthesized word with the <emp> element. The third block plays a pre-recorded prompt with the <audio> element, followed by synthesized text; both are wrapped in <pros>, which raises the volume and slows the speaking rate. The fourth block uses <sayas> to pronounce common character classes (in this case currency, digits, a date, and a phone number) and to substitute spoken text for an abbreviation. The fifth block nests <pros> elements to change the pitch, volume, and speaking rate within a single sentence.

Playing pre-recorded prompts with <audio>

<audio src="hi.wav">Hello there</audio>

The <audio> element plays a pre-recorded prompt. The src attribute specifies the URL of the audio file (usually a WAV file). The <audio> element may also contain text, which the TTS engine synthesizes if the server cannot retrieve the sound file.

We will be covering the process of recording prompts in more detail in a future article.

Controlling pitch, volume, and speed of TTS

You can emphasize synthesized words and phrases with the <emp> element. For example:

<emp level="strong">Officer</emp>, 
you must have mistaken my Dodge Dart with another 
lime green automobile.

The level attribute can be set to strong, moderate, or reduced based upon the emphasis you desire. The default is moderate.
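
To hear the difference between the levels, you can speak the same sentence at each setting in turn. This is only a sketch: the sentence and the form id are invented, and how strongly each level is rendered depends on the TTS engine.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="emphasis_demo">
    <block>
      <prompt>
        I <emp level="strong">never</emp> said that.
        I <emp level="moderate">never</emp> said that.
        I <emp level="reduced">never</emp> said that.
        <!-- With no level attribute, the default (moderate) applies -->
        I <emp>never</emp> said that.
      </prompt>
    </block>
  </form>
</vxml>
```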

The <pros> (short for prosody) element, on the other hand, controls pitch, volume, and speed. For example:

<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form>
    <block name="block5">
      <!-- The rabbit: higher pitch, faster rate -->
      <pros pitch="+90%" rate="+40%">Hey turtle, you wanna race.
      Come on.</pros>
      <!-- The turtle: lower pitch, slower rate -->
      <pros pitch="-40%" rate="-30%">Now rabbit, how many
      times do I have to win before you give up?</pros>
    </block>
  </form>
</vxml>

In the example above, we increase the pitch and speaking rate when the rabbit speaks and reduce the pitch and rate when the turtle speaks. The attributes of the prosody element can be raised or lowered by a percentage. The rate attribute specifies the number of words that the TTS engine will speak per minute, while the vol attribute controls the volume (1 is the maximum). The controls for defining prosody were borrowed from the Java Speech Markup Language (JSML) developed by Sun (see the Resources section at the end of the article).
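
The rabbit and turtle example adjusts only pitch and rate, so here is a sketch that sets the volume as well. It assumes that a fraction of 1 produces a quieter voice; the exact scale, the prompt text, and the form id are illustrative guesses rather than anything a particular platform guarantees.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="volume_demo">
    <block>
      <prompt>
        <!-- vol="1" is the maximum volume -->
        <pros vol="1">This sentence is spoken at full volume.</pros>
        <!-- Assumed: a fraction of 1 lowers the volume; the rate is slowed by 20% -->
        <pros vol="0.5" rate="-20%">This one is quieter and a little slower.</pros>
      </prompt>
    </block>
  </form>
</vxml>
```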

Use <sayas> to pronounce special character classes

I mentioned a little earlier that VoiceXML is capable of pronouncing certain classes of text. For example, you wouldn't want the TTS engine to pronounce $220.25 as "dollar-two-two-zero-period-two-five". Rather, you would want it to say, "Two hundred twenty dollars and twenty-five cents". VoiceXML borrows the <sayas> element from JSML for this purpose. The five built-in classes defined in the JSML specification are date, digits, literal, number, and time; the examples in this article also use the currency and phone classes. Let's take a look at a couple of examples:

Your speeding ticket comes to 
<sayas class="currency">$250.00</sayas> 
plus tip.

You must pay the fine by 
<sayas class="date">December 1, 2002</sayas>.

<sayas class="digits">5164</sayas>, what are you in for?

The <sayas> element also provides a sub attribute, which allows us to control how the TTS engine pronounces words, phrases or abbreviations. For example:

<sayas sub="world wide web consortium">W3C</sayas>
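
A fuller sketch of the sub attribute inside a prompt follows. The second abbreviation, its spoken replacement, and the form id are illustrative choices; the idea is simply that the TTS engine speaks the sub value in place of the enclosed text.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="sub_demo">
    <block>
      <prompt>
        <!-- The engine speaks the sub value instead of the element content -->
        Visit the <sayas sub="world wide web consortium">W3C</sayas>
        site to read more about
        <sayas sub="text to speech">TTS</sayas>.
      </prompt>
    </block>
  </form>
</vxml>
```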

Control pauses with <break>

The <break> element forces a pause in the execution flow. It can be used inside <audio>, <prompt>, and <pros> elements. The length of the pause is specified by the msecs attribute. For example:

  <prompt>The current
  temperature in San Francisco is fifty eight degrees.
  <break msecs="5000"/>
  The traffic on the golden gate bridge is . . .
  </prompt>
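
Since <break> may also appear inside <pros>, a pause can be dropped into the middle of a slowed-down passage. The sketch below assumes a one-second pause is wanted; the text and the form id are made up, and engines may differ in how they time the pause.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="pause_demo">
    <block>
      <prompt>
        <pros rate="-20%">
          Please listen carefully.
          <!-- Pause for roughly one second (1000 milliseconds) -->
          <break msecs="1000"/>
          Your flight has been delayed.
        </pros>
      </prompt>
    </block>
  </form>
</vxml>
```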


We will continue our tour of VoiceXML in the next issue. For now, some closing thoughts on the elements that have been introduced so far. First, be forewarned that each TTS engine is different. For example, it seems that one TTS engine counts milliseconds differently for the <break> element than another. In addition, support for the TTS components of the VoiceXML specification remains spotty and inconsistent. Some implementations may not recognize certain elements at all. Finally, when using elements such as <pros> and <sayas>, make sure that the platform you're testing on is the same platform you're deploying on, or you will be in for big surprises. Well, that's it for now. I'll see you next time as VoiceXML Developer continues to dig deep into the voice Web.


About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC, a firm based in Reston, Virginia that specializes in Voice Web consulting and training. He has also written articles for other online and print publications, including WebReference.com and WDVL.com. Feel free to send an email to eisen@ferrumgroup.com with questions or comments about the VoiceXML Developer series, or for more information about training and consulting services.

This article was originally published on Wednesday Oct 2nd 2002