In this issue of the VoiceXML Developer, we'll begin a complete walkthrough of all the elements included in the VoiceXML 1.0 specification. This issue introduces the basic elements used to mark up content for the voice Web. We will focus primarily on the elements that let VoiceXML control Text-To-Speech (TTS) output.
The root element of a VoiceXML document is the <vxml> element, which is similar to the <html> tag in HTML. The root element is preceded by an XML declaration and an optional document type declaration.
<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC '-//Nuance/DTD VoiceXML 1.0//EN'
  'http://voicexml.nuance.com/dtd/nuancevoicexml-1-2.dtd'>
<vxml version="1.0">
  <form>
    <block>Hello, can anybody hear me?</block>
  </form>
</vxml>
The document type declaration above points to the Nuance version of the VoiceXML 1.0 DTD and is needed for the document to run properly on the Nuance platform. You will need to change it to match your vendor's DTD, or you can remove it altogether, since it's not required. The <form> element is similar to an HTML form in that it can contain multiple fields, which are filled out and submitted by a user. VoiceXML operates in a similar manner, albeit with a different user interface. The <block> element, which plays a role loosely similar to the <p> tag in HTML, synthesizes the enclosed text via a TTS (Text-To-Speech) engine.
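To make the comparison with HTML forms a little more concrete, here is a minimal sketch of a form with a single field that is filled in by the caller and then submitted. It is only an illustration under assumptions: the form id, the field name, the prompt wording, and the example.com URL are all hypothetical, and it relies on the built-in "digits" field type being available on your platform. Fields and submission are beyond the scope of this issue, so treat it as a preview rather than a reference.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical form: one field, then a block that submits it -->
  <form id="order">
    <!-- The field is filled by the caller's spoken or keyed-in digits -->
    <field name="quantity" type="digits">
      <prompt>How many tickets would you like?</prompt>
    </field>
    <!-- Runs once the field is filled; the URL is a placeholder -->
    <block>
      <submit next="http://www.example.com/order" namelist="quantity"/>
    </block>
  </form>
</vxml>
```

As with an HTML form, the value collected in the field is sent to the server named in the submit element; the difference is that the caller supplies it by voice or keypad instead of typing into a page.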
A VoiceXML document
The following is a first look at a complete VoiceXML document that uses the elements we'll be learning about today. If you are using a VoiceXML editor such as V-Builder, you should be able to copy and paste the example into your editor and play it. To hear this VoiceXML example, call VoiceXML Planet at 510-315-6666. At the first menu, press 1. At the demo menu, press 1 again to hear the example below.
<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="1.0">
  <form id="form1">
    <block name="block1">
      Hello, this is an example of a Voice XML document using synthesized text.
      As you can hear, it's a bit choppy. But I might be able to pass as a silon
      from battle star galactica.
    </block>
    <block name="block2">
      <prompt>
        Voice XML provides some features for controlling how I pronounce words
        and phrases. For example, you can create a pause.
        <break size="large" msecs="5000" />
        I can also emphasize a phrase. John Bigbootae, I
        <emp level="strong">must</emp> have that overthruster!
      </prompt>
    </block>
    <block name="block3">
      <pros vol="1" rate="-50%"><audio src="../prompts/prompt1.wav" /> synthesized prompts.</pros>
    </block>
    <block name="block4">
      <prompt>
        Sometimes, you may need to tell me how to pronounce a phrase such as a
        date, currency or abbreviation. Please mail
        <sayas class="currency">$10,000.55</sayas> into
        <sayas sub="world wide web consortium">W3C</sayas> account number
        <sayas class="digits">55432</sayas> by,
        <sayas class="date">October 11, 2001</sayas> or call,
        <sayas class="phone">800-555-1212</sayas>
      </prompt>
    </block>
    <block name="block5">
      <prompt>
        You can also control the
        <pros pitch="+50%"> prosody of
          <pros vol="1" rate="-50%"> my speech including volume, pitch, and speaking rate.</pros>
        </pros>
      </prompt>
    </block>
  </form>
</vxml>
The example above contains five <block> elements. The first block contains nothing but text, which is synthesized by the TTS engine. The second block creates a pause with the <break> element and adds emphasis to a synthesized phrase with the <emp> element. The third block plays a pre-recorded prompt with the <audio> element, followed by synthesized text, which uses <pros> to increase the volume and decrease the speaking rate. The fourth block uses <sayas> to tell the engine how to pronounce common classes of text (in this case currency, digits, a date, and a phone number) and, through the sub attribute, to speak substitute text in place of the abbreviation W3C. The fifth block nests two <pros> elements to raise the pitch and then adjust the volume and speaking rate of the enclosed text.
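The speech markup elements described above can also be tried in isolation. The following is a small sketch that combines <sayas>, <break>, and <emp> in a single prompt; the wording, amounts, and pause lengths are made up for illustration, and how each one actually sounds will depend on your platform's TTS engine.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- Read the amount as money rather than as a string of characters -->
        Your order total is <sayas class="currency">$24.95</sayas>.
        <!-- Pause for an explicit length of time, in milliseconds -->
        <break msecs="500"/>
        <!-- Read the number one digit at a time -->
        Your confirmation number is <sayas class="digits">4071</sayas>.
        <!-- Pause for a relative size instead of a fixed duration -->
        <break size="small"/>
        Please <emp level="strong">write it down</emp>.
      </prompt>
    </block>
  </form>
</vxml>
```

Note that this sketch gives each <break> either a msecs value or a size, never both, so it is clear which setting produces the pause you hear.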
Playing pre-recorded prompts with <audio>
<audio src="hi.wav">Hello there</audio>
The <audio> element plays a pre-recorded prompt. The src attribute specifies the URL of the audio file (usually a wav file). The <audio> element may also contain text, which is synthesized by the TTS engine as a fallback if the audio file cannot be retrieved.
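As a rough sketch of how this fallback might be used in a complete document, the example below plays two recordings in sequence, each with alternate text. The file names welcome.wav and menu.wav and the prompt wording are hypothetical; substitute the locations of your own recordings.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- If welcome.wav cannot be fetched, the enclosed text is synthesized instead -->
        <audio src="welcome.wav">Welcome to the demo line.</audio>
        <break size="small"/>
        <!-- Each audio element carries its own fallback text -->
        <audio src="menu.wav">Press 1 to hear the example.</audio>
      </prompt>
    </block>
  </form>
</vxml>
```

Keeping the fallback text identical to what was recorded means the caller hears the same message either way, only in a synthesized voice when the recording is unavailable.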
We will be covering the process of recording prompts in more detail in a future article.