VoiceXML Developer Series: A Tour Through VoiceXML, Part VII

by Jonathan Eisenzopf

In the last two editions of the VoiceXML Developer, we learned how to create VoiceXML grammars in both GSL and JSGF formats. In this edition, we're going to learn how to record and play back speech and how to transfer callers to another phone number.

Record speech with the <record> element

The <record> element records spoken input and assigns the contents to a VoiceXML variable defined by the name attribute.

<record name="caller_name" beep="true" maxtime="10s"
    finalsilence="2000ms" type="audio/wav" dtmfterm="true" />

The beep attribute determines whether an audible tone is played before the gateway begins recording. Most people are used to hearing a tone on answering machines and voice mail systems as a signal to begin speaking. By default, this is set to false.

The maxtime attribute specifies the maximum length of the recording, written as a time value such as 10s. The system will automatically stop recording when this limit is reached if the user hasn't stopped speaking or otherwise indicated that they've completed the recording.

The finalsilence attribute sets how long a stretch of silence, in milliseconds, signals the system to stop recording input. If you set this value too low, the system might stop recording when the speaker pauses between sentences or takes a breath, so be careful, cowboy.

The type attribute contains the MIME type of the audio format that the recording will be saved in. The supported formats differ depending on the VoiceXML gateway platform you're using; however, the audio/wav format should be standard on most if not all VoiceXML platforms.

When the dtmfterm attribute is set to true, the system will stop recording input when it hears a DTMF tone. This can be any button on a standard telephone keypad. It can be used instead of or in addition to the finalsilence attribute, which stops recording input when it hears a pause.

Because the <record> element is essentially a form field that contains recorded audio input rather than text, it can contain prompts and event handlers. The example below collects two recordings: first the customer's name, then their message. These recordings are then sent to a back-end Perl script for processing.

Click here to see example 1

On lines 8-11 in the example above, we're recording the customer's name. If we don't get any input, the noinput event is triggered and the <noinput> handler on line 10 runs, which reprompts the user. Once we have the customer's name, we record their emergency on lines 12-15. We submit the recordings to a script with the <submit> element on line 18 and end the call.
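Since the linked listing isn't reproduced on this page, here is a minimal sketch of the pattern just described. It is not the original example, so its line numbers won't match the ones cited above. The form id, the emergency field name, the prompt wording, and the 30s limit on the second recording are all assumptions made for illustration; the script URL is borrowed from the second example later in this article, and the remaining attributes are copied from the <record> snippet shown earlier.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="dispatch">
    <!-- First recording: the customer's name -->
    <record name="caller_name" beep="true" maxtime="10s"
            finalsilence="2000ms" type="audio/wav" dtmfterm="true">
      <prompt>After the tone, please say your name.</prompt>
      <!-- Nothing heard: apologize and play the prompt again -->
      <noinput>
        Sorry, I didn't hear anything.
        <reprompt/>
      </noinput>
    </record>

    <!-- Second recording: the customer's emergency (30s is an assumed limit) -->
    <record name="emergency" beep="true" maxtime="30s"
            finalsilence="2000ms" type="audio/wav" dtmfterm="true">
      <prompt>After the tone, please describe your emergency.</prompt>
      <noinput>
        Sorry, I didn't hear anything.
        <reprompt/>
      </noinput>
    </record>

    <!-- Reached once both recordings are filled: post the audio to the script -->
    <block>
      <submit next="/cgi-bin/dispatch.pl" method="post"
              enctype="multipart/form-data"
              namelist="caller_name emergency"/>
    </block>
  </form>
</vxml>
```

The sketch posts the recordings as multipart/form-data because they are binary audio rather than text. Whether your gateway accepts that encoding, and how it names the uploaded parts, is something to confirm in your vendor's documentation.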

Transfer a caller to another line with <transfer>

There are many instances where we will need to transfer a customer to a live operator for assistance if they are having problems with the VoiceXML interface. In the case of our previous example, we will ask the customer to confirm their name and emergency request by saying yes. If they say no, then we know that there is a problem, at which point we'd want to transfer them to Commie the Clown for assistance. We will also transfer the caller to a customer support representative if the noinput event gets triggered more than once for either of the two prompts.

<transfer name="transfer" dest="phone://8005551212" 
bridge="false" connecttimeout="30s" maxtime="0" />

The name attribute names the variable that holds the result of the transfer. If the transfer succeeds, the VoiceXML gateway drops out of the call and the customer continues the conversation with the customer service representative (CSR). If the transfer fails, the named variable will hold one of the following values:

  • busy
  • noanswer
  • network_busy
  • near_end_disconnect
  • far_end_disconnect
  • network_disconnect

There are two types of call transfers: a blind transfer and a bridged call. In a blind transfer, the gateway ends its part in the call as soon as the call has been transferred successfully. In a bridged call, the caller resumes interaction with the VoiceXML application after the transferred call has been completed. Most call transfers will be blind transfers. To make a bridged call, set the bridge attribute to true. To make a blind transfer, set bridge to false. Support for bridged transfers is spotty at best and largely depends on whether the hardware/software platform you're using supports it. If you're not sure and you'd like to explore this feature, you'll need to contact your VoiceXML gateway provider (if you have one). If you're using a Voice ASP, contact their technical support for help.

The dest attribute defines the URI that you wish to connect to. This will probably be a phone number, though future options will likely include SIP. The VoiceXML 1.0 spec does not explicitly define the URI options for the dest attribute, so you will need to refer to your vendor documentation to find out exactly what format you should be using. The value of the dest attribute above looks a bit like a Web URL, but instead of http:// we have phone://, and instead of an IP address we have a 10-digit phone number. This format should work on most if not all VoiceXML platforms, by the way.

The connecttimeout attribute defines how long we should wait for the call to connect. If the time expires and a connection hasn't been made, the variable named by the name attribute is set to one of the values listed above. It's up to you to evaluate the result and do something with the call if it doesn't get connected. You might try the transfer again, or give the customer a warning message and disconnect the call.

The maxtime attribute determines the maximum length of the call. Setting this attribute to zero removes the limit on the length of the call. Note that this attribute is only relevant when the bridge attribute is set to true.
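To make the difference concrete, here is a rough sketch of a bridged transfer with a time cap. The form id, the variable name expert_call, the prompt wording, and the 120s cap are assumptions chosen for illustration; the dest value is the one used above. Since bridged transfers are not supported everywhere, treat this as something to try against your own gateway rather than as a guaranteed recipe.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="ask_expert">
    <!-- bridge="true": the application stays on the line and resumes afterwards.
         maxtime="120s" (assumed value) caps the bridged call at two minutes. -->
    <transfer name="expert_call" dest="phone://8005551212"
              bridge="true" connecttimeout="30s" maxtime="120s">
      <prompt>Please hold while we connect you.</prompt>
      <filled>
        <!-- Runs when the interpreter gets control back, for example
             after the other party hangs up or the call fails to connect -->
        <if cond="expert_call == 'far_end_disconnect'">
          <prompt>Welcome back. Let's continue.</prompt>
        <else/>
          <prompt>Sorry, we could not complete that call.</prompt>
        </if>
      </filled>
    </transfer>
  </form>
</vxml>
```

How a platform reports a call that hits the maxtime cap is not covered by the result values listed earlier, so check your vendor documentation before relying on it.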

Getting back to our clown dispatch example, the example below transfers the customer to a CSR if they trigger the noinput event more than once or say no when asked to confirm their dispatch:

Click here to see example 2

You'll notice that on lines 11-13 and 18-20 we've added a second <noinput> element which, when triggered, jumps to the form that transfers the customer to a CSR. Also, lines 22-34 contain the confirm boolean <field>, which prompts the user to say yes or no. If they say yes, the customer receives a confirmation message on line 28, and the two recordings are submitted to /cgi-bin/dispatch.pl on line 29. If the user says no, they are transferred to a CSR on line 31 via a <goto> element.
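The linked listing isn't shown here, so the sketch below only approximates the pieces just described, and its line numbers will not match. The form id dispatch, the emergency field name, the prompt wording, and the 30s recording limit are assumptions. Playing the recordings back inside the confirmation prompt with <value> is also an assumption about how the confirmation might be worded; it follows VoiceXML 1.0, and newer platforms may expect <audio expr="..."/> instead.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="dispatch">
    <record name="caller_name" beep="true" maxtime="10s"
            finalsilence="2000ms" type="audio/wav" dtmfterm="true">
      <prompt>After the tone, please say your name.</prompt>
      <!-- First silence: reprompt -->
      <noinput>
        Sorry, I didn't hear anything.
        <reprompt/>
      </noinput>
      <!-- Second silence: hand the caller to a CSR -->
      <noinput count="2">
        <goto next="#call_transform"/>
      </noinput>
    </record>

    <record name="emergency" beep="true" maxtime="30s"
            finalsilence="2000ms" type="audio/wav" dtmfterm="true">
      <prompt>After the tone, please describe your emergency.</prompt>
      <noinput>
        Sorry, I didn't hear anything.
        <reprompt/>
      </noinput>
      <noinput count="2">
        <goto next="#call_transform"/>
      </noinput>
    </record>

    <!-- Built-in boolean type: the caller answers yes or no -->
    <field name="confirm" type="boolean">
      <prompt>
        Your name is <value expr="caller_name"/>
        and your emergency is <value expr="emergency"/>.
        Is that correct?
      </prompt>
      <filled>
        <if cond="confirm">
          <prompt>Thank you. Help is on the way.</prompt>
          <submit next="/cgi-bin/dispatch.pl" method="post"
                  enctype="multipart/form-data"
                  namelist="caller_name emergency"/>
        <else/>
          <goto next="#call_transform"/>
        </if>
      </filled>
    </field>
  </form>

  <!-- Stub of the transfer form, using the <transfer> snippet shown earlier -->
  <form id="call_transform">
    <transfer name="transfer" dest="phone://8005551212"
              bridge="false" connecttimeout="30s" maxtime="0"/>
  </form>
</vxml>
```

The count attribute is what separates the two handlers: the plain <noinput> catches the first silence, and the count="2" handler takes over from the second one onward.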

Lines 36-50 contain the call_transform <form>, which contains the <transfer> element on lines 38-46. When the VoiceXML interpreter reaches the <transfer> element, it will dial 1-800-555-1212 and wait for an answer. If the call does not connect, the transfer variable is filled with one of the transfer values listed above. We check for busy on line 41 and noanswer on line 42. We also set a local variable called duration, which is assigned the length of the call in seconds. Both of these values are then sent to /cgi-bin/log.pl to be recorded in a log file for further processing.
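Again, the original listing isn't reproduced here, so the following is only a sketch of what such a form might look like, with line numbers that won't match the ones above. The prompt wording is assumed, and reading the call length from the transfer$.duration shadow variable follows VoiceXML 1.0 as I understand it; confirm that your platform provides it.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="call_transform">
    <!-- Local variable that will hold the length of the call in seconds -->
    <var name="duration" expr="0"/>

    <transfer name="transfer" dest="phone://8005551212"
              bridge="false" connecttimeout="30s" maxtime="0">
      <prompt>Please hold while I transfer you to a representative.</prompt>
      <filled>
        <!-- Copy the shadow variable into our own variable for logging -->
        <assign name="duration" expr="transfer$.duration"/>

        <!-- Tell the caller what went wrong for the two outcomes we check -->
        <if cond="transfer == 'busy'">
          <prompt>The line is busy. Please try again later.</prompt>
        <elseif cond="transfer == 'noanswer'"/>
          <prompt>No one is answering. Please try again later.</prompt>
        </if>

        <!-- Send the result and the duration to the logging script -->
        <submit next="/cgi-bin/log.pl" method="post"
                namelist="transfer duration"/>
      </filled>
    </transfer>
  </form>
</vxml>
```

Only busy and noanswer get their own message in this sketch; the other result values fall through to the <submit>, so they are still logged even though the caller hears nothing specific about them.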


It's important to re-emphasize that the <transfer> element is vendor dependent. It is actually optional: a vendor does not have to implement it to be VoiceXML compliant. The examples provided here are general and may or may not work in your environment.

As for the <record> element, while recording spoken input is simple enough, we stopped short of actually saving the WAV files to disk via a server-side script. This can be accomplished with any back-end scripting language such as Perl, PHP, ASP, Python, Java, etc. We will save this exercise for a later article. For now, we're sticking with the syntax of the VoiceXML elements.

This brings up a good point, however: VoiceXML by itself is not sufficient for developing voice applications, even with the ability to make documents more dynamic through JavaScript. It's only when we combine VoiceXML with a Web application that the application becomes dynamic and capable of storing input and retrieving information from a database. Some might call this an oversight. After all, if you think about it, to grapple with VoiceXML, we have to learn several languages:

  1. VoiceXML
  2. JavaScript
  3. GSL or JSGF
  4. Perl, Java, ASP, or other

So why did the VoiceXML authors decide to do it this way rather than just adopting one language and being done with it? I'm not one of the authors, but I think part of the answer is that VoiceXML is for Web developers who are already used to this kind of environment. Developers who have been writing voice applications in C++ or VB might be better off sticking to their guns. On the other hand, coding voice applications in XML (at least partly) makes them more portable and potentially easier for non-programmers to write. Oops, I just opened up a can of worms. I'd better go now. See you in the next edition of the VoiceXML Developer.

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC, a firm based in Reston, Virginia that specializes in Voice Web consulting and training. He has also written articles for other online and print publications including WebReference.com and WDVL.com. Feel free to send an email to eisen@ferrumgroup.com with questions or comments about the VoiceXML Developer series, or for more information about training and consulting services.

This article was originally published on Monday Oct 7th 2002