VoiceXML Developer Series: A Tour Through VoiceXML, Part V

by Jonathan Eisenzopf

In the last edition of the VoiceXML Developer, we created a full VoiceXML application using form fields, a subdialog, and internal grammars. In this edition, we will learn more about one of the most important, but rarely covered components of a VoiceXML application, grammars.


Now that we've built a few applications, it's time to talk about grammars. Grammars tell the speech recognition software which combinations of words and DTMF tones it should be listening for. Grammars intentionally limit what the ASR engine will recognize. Recognizing speech without the constraint of grammars is called "continuous speech recognition," or CSR. IBM's ViaVoice is an example of a product that uses CSR technology to let a user dictate an email or a document. While CSR technologies have improved, they're not accurate enough to use without the user training the system to recognize their voice. Also, the recognition success rate drops greatly in noisy environments, such as over a cell phone or in a crowded shopping mall. Pre-defining the scope of words and phrases that the ASR engine should be listening for can increase the recognition rate to well over 90%, even in noisy environments.

The VoiceXML 1.0 standard uses grammars to recognize spoken and DTMF input. It doesn't, however, define the grammar format. This is changing with the release of VoiceXML 2, which defines a standard XML-based grammar format and an alternate BNF notation. Still, the fact that VoiceXML relies heavily on grammars means that we must create or reuse grammars each time we want to gather input from the user.

In fact, the time required to create, maintain, and tune VoiceXML grammars will likely be several magnitudes greater than the time you will take to develop the VoiceXML interfaces. Not having high-quality and complete grammars means that the user will spend too much of their time repeating themselves. A system that cannot recognize input the first time, every time, will alienate users and cause them to abandon the system altogether. Therefore, we are going to spend a bit of time talking about grammars for VoiceXML 1.0 (and now VoiceXML 2) in the coming articles so that you will be armed with the knowledge you need to create successful VoiceXML applications. The first grammar format we are going to learn is GSL, which is used by the Nuance line of products.

Grammar Scopes

The ASR engine activates grammars based upon the scope in which the grammar was declared and the current scope of the VoiceXML interpreter. Declaring a grammar in the root document means that the grammar will be active throughout the execution of the VoiceXML application. A good use for this technique is a root grammar that defines global voice commands such as "operator" for connecting to an operator or "goodbye" to exit the call.

We can also have grammars that are active within a particular document, form, field, or menu. Field grammars will be used the most where we need to collect specific types of information, such as a phone number, address, or social security number. What you don't want is to have all grammars active at the same time unless it is a mixed initiative dialog. The more grammars that are active, the higher the chance that the ASR will misinterpret what the user is saying. For example, when we ask the user for their phone number, only a global menu and the phone number grammars should be active. If the social security grammar were active at the same time, the system may accidentally recognize a social security number rather than a phone number.
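
The scoping rule described above can be modeled in a few lines of Python. This is only a sketch under simplifying assumptions, not part of any VoiceXML interpreter: the Grammar class, the active_grammars() function, and the names GLOBAL_COMMANDS and SSN are hypothetical, and each grammar is assumed to be active only in the exact scope that declares it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    name: str
    scope: str   # "application", "document", "form", or "field"
    owner: str   # name of the document, form, or field that declares it


def active_grammars(grammars, document, form, field):
    """Return the names of the grammars active while `field` is collected.

    Simplified model (an assumption, not the full VoiceXML rules):
    application-scoped grammars are always active; any other grammar is
    active only when its owner is the current document, form, or field.
    """
    current = {"document": document, "form": form, "field": field}
    active = []
    for g in grammars:
        if g.scope == "application" or current.get(g.scope) == g.owner:
            active.append(g.name)
    return active


# Hypothetical grammar set for the phone number example in the text.
GRAMMARS = [
    Grammar("GLOBAL_COMMANDS", "application", "root"),
    Grammar("PHONE", "field", "phone"),
    Grammar("SSN", "field", "ssn"),
]

print(active_grammars(GRAMMARS, document="order", form="pizza", field="phone"))
```

While the phone field is being collected, only the global commands and the PHONE grammar are listening, so a social security number cannot be matched by mistake.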

When developing a mixed initiative dialog, this problem can become especially tricky where we may have similar grammars active at the same time. It's especially important in this case to differentiate the grammars in a way that minimizes the possibility of input being matched by the wrong grammar.

Inline grammars versus external grammars

VoiceXML allows developers to include grammars directly into the VoiceXML documents using the <grammar> element.

<grammar type="text/gsl">
  [small medium large]
</grammar>

The inline grammar above would match on the words small, medium, or large. The value that was matched by the grammar would be returned and stored as the form field value.

An external grammar exists in a separate file, which is referenced by the src attribute of the <grammar> element.
<grammar src="PHONE.gsc" />

The <grammar> above would load the grammar named PHONE.gsc.

Inline grammars are good for small VoiceXML applications that have simple grammars, but should be avoided for larger applications that have multiple grammars. First of all, you will likely be able to reuse grammars many times, so it's best to keep them in an external file where you can easily access them from within other applications. Secondly, you may find yourself tuning the grammars on a more or less frequent basis than the VoiceXML content, so it's a good idea to componentize your VoiceXML applications to minimize errors that could result from a change to a grammar in a VoiceXML file. Other than their location, inline grammars work just like external grammars.

Example 4

We will be referring to this example in the rest of this article. To test this application, call VoiceXML Planet at 510-315-6666; press 1 to listen to the demos, then press 4 to hear this example.

The example is an application for Joe's Pizza Palace. Joe's store gets overloaded with pizza pie orders during the lunch hour. Joe doesn't want to hire more staff to take phone orders just for lunch, but he does want to give customers who call in their orders the opportunity to place them automatically. This is especially desirable for repeat customers who order pizzas for their office lunches and meetings on a regular basis. This first version of the application collects the information for one pizza order and submits it to a back end ASP script for processing. The information that the store needs to place an order is the customer's phone number, the size and type of the pizza, and the toppings.

An example dialog for the application might be as follows:

Computer: Joe's pizza palace. May I have your phone number please.
Customer: huh?
Computer: Sorry, I didn't get that. Please say your phone number.
Customer: 7 0 3 5 5 5 1 2 1 2.
Computer: I heard 7 0 3 5 5 5 1 2 1 2. Would you like a hand tossed, deep dish, or stuffed crust pizza?
Customer: Deep dish.
Computer: I heard deep dish. Would you like a small, medium, or large?
Customer: Large.
Computer: I heard large. What toppings would you like on your deep dish pizza?
Customer: Pepperoni and mushrooms and anchovies.
Computer: I heard pepperoni and mushrooms and anchovies.
Computer: I have a large deep dish pizza with pepperoni and mushrooms and anchovies. Your order will be delivered within thirty minutes or the pizza is free. Thanks for calling Joe's pizza palace.

Once the order has been confirmed, the form field values are submitted via an HTTP POST to placeOrder.asp by the <submit> element. The example contains two inline grammars and two external grammars, which are used to recognize spoken input. The two inline grammars occur on lines 23-29 and 41-47. The two external grammars occur on lines 10 and 59.

Keyword grammars

Let's take a look at the inline grammar on lines 41-47 first. This is probably the simplest form of a grammar. It contains three words, each representing a different selection. The ASR will attempt to recognize one of these three words after the prompt is played on line 48. If none of the words was recognized, or if the user didn't say anything, the <catch> element on lines 49-51 will tell the user that there was a problem and play the prompt again until the user says one of the options: small, medium, or large. Once the user provides valid input, the <filled> element for the size field is executed on lines 52-56. Notice that this grammar only contains single words rather than phrases.
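
The prompt, catch, and filled cycle described above can be simulated in Python. This is a rough sketch rather than VoiceXML: collect_size() is a hypothetical helper, the list of utterances stands in for what the ASR engine hears, and None stands in for silence.

```python
SIZES = {"small", "medium", "large"}  # the three keywords in the grammar


def collect_size(utterances):
    """Simulate the size field: reprompt until a keyword is recognized.

    `utterances` is a list of simulated recognition inputs, where None
    means the caller said nothing. Returns (size, transcript); size is
    None if the simulated caller never says a valid keyword.
    """
    transcript = []
    for heard in utterances:
        transcript.append("Would you like a small, medium, or large?")
        if heard is not None and heard.strip().lower() in SIZES:
            size = heard.strip().lower()
            transcript.append(f"I heard {size}.")   # the <filled> step
            return size, transcript
        transcript.append("Sorry, I didn't get that.")  # the <catch> step
    return None, transcript


size, transcript = collect_size([None, "extra large", "Large"])
print(size)
for line in transcript:
    print("Computer:", line)
```

Silence and out-of-grammar input take the same path here, which mirrors a single <catch> that handles both cases before the prompt is replayed.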

Phrase grammars

The second inline grammar on lines 23-29 works within the scope of the pizza_type form field and will recognize one of three phrases only:

  • hand tossed
  • deep dish
  • stuffed crust

The three phrases are surrounded by parentheses. This indicates that all words inside the parentheses must be spoken for a match to occur. We can specify optional words in the phrase by prepending them with a ? character. For example, to make hand, deep, and crust optional, we would change the grammar so it looked like the following:

       ( ?hand tossed )
       ( ?deep dish )
       ( stuffed ?crust )

So if the user just said "tossed", we would match hand tossed. We can add alternatives for each selection as well. For example, someone might say "Chicago" instead of deep dish. We might also want to allow someone to specify hand thrown or hand stretched as alternatives to hand tossed. We can do this by specifying the options inside a set of square brackets.

       ( ?hand [tossed stretched thrown] ) 
       ( ?deep [dish chicago] )
       ( stuffed ?crust ) 
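
To see exactly which utterances a phrase like these accepts, it can help to expand it by hand or with a small script. The Python sketch below is an illustration only, not a GSL parser: expand() is a hypothetical helper that handles just the constructs used in this section (a flat parenthesized sequence, ?optional words, and bracketed alternatives made of single words).

```python
from itertools import product


def expand(phrase):
    """List every utterance accepted by a flat GSL-style phrase.

    Supported (by assumption): plain words, ?optional words, and
    [alternatives] made of single words. Nested groups are not handled.
    """
    for ch in "()":
        phrase = phrase.replace(ch, " ")
    tokens = phrase.replace("[", " [ ").replace("]", " ] ").split()

    slots = []  # one list of choices per position in the phrase
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "[":
            end = tokens.index("]", i)
            slots.append(tokens[i + 1:end])   # pick exactly one alternative
            i = end + 1
        elif tok.startswith("?"):
            slots.append([tok[1:], ""])       # the word, or nothing at all
            i += 1
        else:
            slots.append([tok])               # required word
            i += 1
    return [" ".join(w for w in combo if w) for combo in product(*slots)]


for phrase in ("( ?hand [tossed stretched thrown] )",
               "( ?deep [dish chicago] )",
               "( stuffed ?crust )"):
    print(phrase, "->", expand(phrase))
```

The first phrase expands to six utterances, from "hand tossed" down to a bare "thrown", which shows how quickly optional words and alternatives multiply what the recognizer has to listen for.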


Now we're going to take a look at the external grammar that we reference on line 10, which is used to recognize the user's phone number. This particular grammar is made up of several subgrammars that recognize the area code, the exchange (the first three digits of the local phone number), and the last four digits of the phone number. These subgrammars, or phone number parts, are referenced in the PHONE grammar on lines 1-6 of the listing below. The PHONE grammar matches a number when the AREA_CODE, EXCHANGE, and NUMBER grammars are matched in that order, since they're inside a set of parentheses, which requires that all elements of the grammar match.

Line 6 concatenates the three phone number components into a single number and returns it to the field, which uses it as the value for the phone. Notice that each subgrammar called on lines 2-4 includes a colon and a second string, which names a local variable to store the results of the subgrammar. For example, on line 2, we call the AREA_CODE subgrammar and store the number that was matched in the $area variable. These variables are referenced later on line 6, which returns the phone number. Line 6 uses the strcat() function to piece the numbers into one number. The strcat() function takes two parameters, the second of which is concatenated to the first. To concatenate all three number segments, we join $exchange and $number in an inner strcat() call, and an outer call joins $area with the result of the inner call.

The AREA_CODE grammar on lines 8-13 is made up of exactly three DIGITs. The DIGIT grammar on lines 30-41 consists of a single number, zero through nine. Zero can either be pronounced zero or oh. Similarly, the EXCHANGE grammar is made up of three DIGITs, while the NUMBER grammar is made up of four DIGITs.

1  PHONE [
2     ( AREA_CODE:area
3       EXCHANGE:exchange
4       NUMBER:number
5     )
6  ] { return(strcat($area strcat($exchange $number))) }
8  AREA_CODE [
9    ( DIGIT:a
10      DIGIT:b
11      DIGIT:c
12    ) { return(strcat($a strcat($b $c))) }
13  ]
15  EXCHANGE [
16    ( DIGIT:a
17      DIGIT:b
18      DIGIT:c
19    ) { return(strcat($a strcat($b $c))) }
20  ]
22  NUMBER [
23    ( DIGIT:a
24      DIGIT:b
25      DIGIT:c
26      DIGIT:d
27    ) { return(strcat(strcat($a $b) strcat($c $d))) }
28  ]
30  DIGIT [
31    [zero oh] {return(0)}
32    one   {return(1)}
33    two   {return(2)}
34    three {return(3)}
35    four  {return(4)}
36    five  {return(5)}
37    six   {return(6)}
38    seven {return(7)}
39    eight {return(8)}
40    nine  {return(9)}
41  ]
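
The way the subgrammars cooperate can be mimicked in Python. The sketch below is an analogy, not GSL: strcat() is written here as an ordinary two-argument function to mirror the nesting on line 6, while match_digits() and match_phone() are hypothetical helpers standing in for the DIGIT and PHONE grammars.

```python
# Spoken word -> digit, mirroring the DIGIT grammar (zero may be said "oh").
DIGIT = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
         "four": "4", "five": "5", "six": "6", "seven": "7",
         "eight": "8", "nine": "9"}


def strcat(first, second):
    """Two-argument concatenation: `second` is appended to `first`."""
    return first + second


def match_digits(words, count):
    """Consume exactly `count` digit words; return (digits, remaining words)."""
    if len(words) < count:
        raise ValueError("not enough digits")
    digits = ""
    for word in words[:count]:
        if word not in DIGIT:
            raise ValueError(f"not a digit: {word!r}")
        digits = strcat(digits, DIGIT[word])
    return digits, words[count:]


def match_phone(utterance):
    """Match area code, exchange, and number in order, like PHONE."""
    words = utterance.lower().split()
    area, rest = match_digits(words, 3)       # AREA_CODE:area
    exchange, rest = match_digits(rest, 3)    # EXCHANGE:exchange
    number, rest = match_digits(rest, 4)      # NUMBER:number
    if rest:
        raise ValueError("extra words after the phone number")
    # Same nesting as line 6: the inner call joins exchange and number,
    # the outer call joins area with that result.
    return strcat(area, strcat(exchange, number))


print(match_phone("seven oh three five five five one two one two"))
```

Because strcat() only takes two arguments, joining three pieces always needs one nested call, and joining four (as in NUMBER) needs two.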

As you can see from the example above, more complex grammars are made up of subgrammars, which may in turn call on other subgrammars, so that we can match any form of speech by breaking the possibilities down into their most elementary components. You might also be surprised at how large our grammar turned out to be for a simple phone number. In fact, dealing with numbers can be a lot more difficult than dealing with words.

Lists in grammars

In the grammar referenced on line 59, we must be able to match one or more toppings without knowing exactly how many toppings the user will select. What we do know is what the available toppings are. Fortunately, GSL includes a number of built-in list operators to make this possible.

1  TOPPINGS [
2    +( TOPPING:topping {insert-end(list $topping)} )
3  ] {return($list)}
5  TOPPING [
6    (?and pepperoni)
7    (?and olives)
8    (?and green peppers)
9    (?and mushrooms)
10   (?and pineapple)
11   (?and anchovies)
12 ] {return($string)}

The TOPPINGS grammar above begins with a + sign outside of a set of parentheses. This means match one or more occurrences of the TOPPING grammar. The second part of line 2 calls the built-in insert-end function, which adds the new topping that was matched in the TOPPING grammar to the list of toppings that will be returned to the toppings form field in the VoiceXML document.

The TOPPING grammar on lines 5-12 contains our topping selections: pepperoni, olives, green peppers, mushrooms, pineapple, and anchovies. We're also expecting that the user might separate their selections with the word and, which has been flagged as an optional word by prepending it with a ? character. That concludes our exploration of GSL grammars for now.
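
The one-or-more matching with an optional "and" can be sketched in Python as follows. This models the behavior described above and is not how a GSL engine is implemented: match_toppings() is a hypothetical helper, and the Python list append plays the role that insert-end plays in the grammar.

```python
# The available toppings, mirroring the TOPPING grammar.
TOPPINGS = ["pepperoni", "olives", "green peppers",
            "mushrooms", "pineapple", "anchovies"]


def match_toppings(utterance):
    """Match one or more toppings, each optionally preceded by "and"."""
    words = utterance.lower().split()
    selected = []
    i = 0
    while i < len(words):
        if words[i] == "and":          # the optional ?and
            i += 1
        for topping in TOPPINGS:
            parts = topping.split()    # "green peppers" is two words
            if words[i:i + len(parts)] == parts:
                selected.append(topping)   # plays the role of insert-end
                i += len(parts)
                break
        else:
            raise ValueError(f"no topping matched at word {i}")
    if not selected:
        raise ValueError("at least one topping is required")  # the + operator
    return selected


print(match_toppings("pepperoni and mushrooms and anchovies"))
print(match_toppings("green peppers olives"))
```

The returned list keeps the toppings in the order they were spoken, which is what appending each match to the end of the list provides.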


I want to reflect on some of the things that I've learned about grammars while developing new VoiceXML applications over the past year. First, grammars can be difficult to develop and time consuming to tune. And things don't stop there. As you begin to collect data on where the ASR application is failing, you will probably need to tune the dictionary that the system is using to include alternate word pronunciations. It's very important that the application be able to recognize what the user is saying most of the time.

Because DTMF input is almost 100% accurate, it should be preferred over speech for things like phone and credit card numbers. However, some voice interface designers recommend that you don't mix touch-tone input with speech input. I'd say it's better than the alternative if you are having problems recognizing number sequences. Remember, speech recognition has gotten much better, but it still takes a great deal of work and care to reach the success rates in the high 90s that vendors often mention.

Thanks again for joining us for another edition of the VoiceXML Developer. In the next edition, we will continue our exploration of grammars as part of our tour of the VoiceXML 1.0 specification. And don't forget to send me feedback on this series. I'd like to know how I'm doing and how I can improve this column. You can send feedback directly to eisen@ferrumgroup.com. Until next time.

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC, based in Reston, Virginia, which specializes in Voice Web consulting and training. He has also written articles for other online and print publications including WebReference.com and WDVL.com. Feel free to send an email to eisen@ferrumgroup.com regarding questions or comments about the VoiceXML Developer series, or for more information about training and consulting services.

This article was originally published on Saturday Oct 5th 2002