It's Only Natural: Evaluating Natural Language Dialogs

by Jonathan Eisenzopf

The decision on whether to use a natural dialog approach instead of a directed dialog in an IVR application will directly affect the cost, effort, and maintenance of the system. This article will give you a process that you can use to make the right decision.


Natural Dialogs versus Directed Dialogs

A natural dialog is one in which the prompts, grammars and dialog flow are modeled and designed to more closely simulate a real conversation between two people. Natural dialogs allow the human to participate in controlling the dialog flow. Directed dialogs, on the other hand, use a pre-defined set of steps and usually occur in a sequential, linear fashion.

Directed dialogs are modeled in a dialog-flow fashion, similar to a call-flow for touch tone IVRs. Natural dialogs, on the other hand, typically utilize a finite state model where dialogs are executed based on the state of one or more variables.
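
The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under assumed names (the SLOTS list and both functions are hypothetical, not part of any IVR platform): the directed version picks the next prompt from its position in a fixed flow, while the state-driven version picks it from whichever variables are still empty.

```python
# Hypothetical slot names used only for illustration.
SLOTS = ["year", "make", "model"]


def next_directed_step(step_index):
    """Directed dialog: the next prompt depends only on the position in the flow."""
    if step_index < len(SLOTS):
        return SLOTS[step_index]
    return None  # flow is finished


def next_state_driven_step(slots):
    """State-driven dialog: the next prompt depends on which variables are still unset."""
    for name in SLOTS:
        if slots.get(name) is None:
            return name
    return None  # every variable is filled


# A caller who already gave the year and model is only asked for the make.
print(next_state_driven_step({"year": 1990, "make": None, "model": "escort"}))
```

In the state-driven version the caller can supply information in any order, which is what lets the human share control of the dialog flow.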

There isn't a clear dividing line between directed dialogs and natural dialogs, however, nor is there an agreed-upon approach to how they should be developed. Directed dialogs may certainly use a mixed initiative dialog as a shortcut mechanism for power users, and natural dialogs will certainly use directed dialogs.

Developers usually differentiate between directed and natural dialogs in terms of whether or not the dialog is mixed initiative. A mixed initiative dialog allows callers to fill multiple form field slots in a single utterance. For example, the utterance "nineteen ninety ford escort" might fill the year, make and model fields at once, instead of requiring three separate prompts. This illustrates one potential description of a natural dialog versus a directed dialog: not necessarily because it's mixed initiative, but because it is probably more natural to speak all three pieces of information together during a phone conversation if you were describing the type of car you were selling (for example).
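
As a rough sketch of the slot-filling idea, the Python below fills the year, make and model fields from one utterance. The vocabularies and the fill_slots and missing_slots functions are assumptions made up for this example; a real speech grammar would be far larger and would be handled by the recognizer rather than by string matching.

```python
# Hypothetical vocabularies; a real grammar would be far larger.
YEARS = {"nineteen ninety": 1990, "nineteen ninety one": 1991}
MAKES = {"ford", "honda"}
MODELS = {"escort", "civic"}


def fill_slots(utterance):
    """Fill as many of the year/make/model slots as one utterance allows."""
    slots = {"year": None, "make": None, "model": None}
    padded = f" {utterance.lower()} "
    # Try longer year phrases first so "nineteen ninety one" is not read as 1990.
    for phrase in sorted(YEARS, key=len, reverse=True):
        if f" {phrase} " in padded:
            slots["year"] = YEARS[phrase]
            break
    for word in padded.split():
        if word in MAKES:
            slots["make"] = word
        elif word in MODELS:
            slots["model"] = word
    return slots


def missing_slots(slots):
    """Slots that a follow-up directed prompt would still have to ask for."""
    return [name for name, value in slots.items() if value is None]


print(fill_slots("nineteen ninety ford escort"))
print(missing_slots(fill_slots("ford escort")))
```

The second call shows the fallback: when the caller leaves a field out, the application can drop back to a directed prompt for just the missing slot.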

What is important to understand is that while researchers have tested various natural dialog approaches, there is no "right way" or set of guidelines that will help you create natural dialogs. I think this may have something to do with the fact that scientists don't understand the natural human speech machine in enough detail to effectively model it on a computer. So, remember, the definition of a natural dialog is a bit ambiguous and certainly open to interpretation. In fact, in some cases, a directed dialog may be more natural than a mixed initiative dialog.

Evaluate the Existing Environment

There are three common environments that I see which affect design decisions:

The company has an existing touch-tone IVR that they want to upgrade to speech to reduce call times or reduce the number of callers that bail out of the IVR to a live rep.
In this situation, introducing a new approach to callers that use the system regularly may actually cause more harm than good, because while the number of new callers pressing zero may be reduced, more experienced callers may become frustrated with the new interface and end up bailing out. If the ratio of repeat callers to new callers is high, then the end result could actually be more callers bailing out than before. Great care must be taken in transitioning to a new system. A recommended approach would be to gradually introduce speech into the application in a way that allows callers to slowly become acclimated to the new interface. Of course, the speech interface must be as good as or better than the existing touch-tone system. For example, if the speech system is nothing more than a voice activated menu, callers may become frustrated with the new interface when they experience recognition errors (which rarely occur with a touch-tone system). When upgrading a touch-tone system, a natural language system may be too radical a change for users and could result in more bail-outs.
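
The arithmetic behind that warning can be sketched as follows. All of the numbers are invented for illustration (they are not measurements from any real system), and the bailouts function is a hypothetical helper: it simply weights each group's bail-out rate by the number of calls in that group.

```python
def bailouts(repeat_calls, new_calls, repeat_rate, new_rate):
    """Expected number of callers who bail out to a live rep."""
    return repeat_calls * repeat_rate + new_calls * new_rate


# Assumed call mix: 800 repeat callers and 200 new callers per day.
# Touch-tone: repeat callers rarely bail out, new callers often do.
before = bailouts(800, 200, repeat_rate=0.10, new_rate=0.40)  # 80 + 80

# Speech: new callers do better, but frustrated repeat callers do worse.
after = bailouts(800, 200, repeat_rate=0.20, new_rate=0.20)   # 160 + 40

print(f"bail-outs before: {before:.0f}, after: {after:.0f}")
```

With these assumed rates the new-caller bail-out rate is cut in half, yet total bail-outs rise, because repeat callers dominate the call volume.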

The speech IVR system is going to automate what a real person does on the phone now.
In this case it will be important to analyze the existing call flow. Unless the calls are entirely scripted, most human dialogs tend to be composed of a series of open-ended dialogs that each have a goal or milestone, which may also depend on the results of a previous dialog. When analyzing the calls, you want to be listening for these dialog milestones. You can usually spot a dialog milestone by listening for a dialog transition. A dialog transition occurs when one participant changes the topic or focus of the conversation. For example, "OK, now I need to get your credit card information". When a transition occurs, it usually signals the end of a previous dialog. Dialogs will normally have data points in which one participant communicates a piece of information that translates into a form field value. This sounds easy enough because you can translate this into VoiceXML dialogs. The difficulty is that human speech is usually much more complicated. The caller may change their mind about a previous data point during a subsequent dialog, after the transition has already occurred. For example, "Actually, I need two widgets instead of one, and I want to pay with a Visa card, but only if you can ship it overnight, otherwise I'd like to pay COD with something like UPS". This instruction is easy enough for a human to process, but incredibly complex for a computer. Another caller might flip back and forth between logical dialogs in a conversation.

There are two things to consider when evaluating how to automate an existing human dialog interface. First, will it be feasible to break down the human dialog interaction into discrete directed dialog components so that callers will be able to communicate the same information to the computer instead of a person in about the same amount of time? When analyzing the conversation, it may seem at first that this would be impossible, but as you listen to many dialogs, you will be able to identify common dialogs that can be broken down into discrete dialog components. 

Second, will a directed dialog be usable from the caller's perspective? Will the directed dialog be so different from the human dialog that callers will become very confused and simply not use the system? Maybe not. If the human agent asks the same series of questions for every call, then a directed dialog would actually be more natural than trying to consolidate the conversation into a mixed initiative dialog.

On the other hand, if the entire conversation is dynamic and unscripted, it may actually be impossible to create a directed dialog. In these cases, creating a series of open-ended mixed initiative dialogs may make the most sense.

The IVR is a new application that will stand on its own or extend a Web application.

This is potentially the most difficult environment to work in, because you have to make a lot of assumptions about how the speech dialog "should" occur, whereas in the previous two scenarios we have existing calls that can be analyzed. Additionally, when integrating an IVR system with an existing Web application, some people have a tendency to think of the IVR as a telephone mirror of the Web application, when in fact the interfaces may necessarily be very different. The one advantage of integrating a speech IVR with an existing Web application is that somebody has already gone through the pain of breaking down the business logic into a programmatic structure. There is also the benefit of not having an existing call flow to analyze; there isn't a legacy of expectations that has to be considered in deciding whether or not to use a natural mixed initiative dialog. If it turns out that an open-ended dialog will be the most natural and efficient way to program the application, we are free to do that (except for resource and time constraints, of course).

What is Feasible?

As we've already discussed, some applications are naturally suited to become directed dialogs. Others that are open or conversational may require a mixed initiative dialog. In some cases, it may simply be impossible to practically create the application using one of the styles. This should be identified early on in the process. If that happens, then the decision is clear and it becomes a matter of whether the cost and effort will justify the end result.

Examine the Difference in Effort

In terms of measuring the difference in effort between natural dialogs and directed dialogs, we actually need to think about several different factors.

The average mixed initiative natural dialog will take several orders of magnitude longer to develop and maintain than a directed dialog. The learning curve is also steep: developing a natural mixed initiative dialog requires some knowledge of linguistics and speech recognition.

The four areas of development that we can compare are:

  • grammars
  • prompts
  • error handling
  • maintenance

A directed grammar will contain a rather limited number of possible utterances. For example, when you ask a caller for their credit card type and give them a list of their options, there are only a handful of possible responses. However, if you ask the caller a more open-ended question like, "How would you like to pay for this?", the number of possibilities goes up quite dramatically. Writing a grammar for an open-ended question requires us to represent all of the possible answers that we might get from the caller. Even for such a simple question, this is no small task. In my humble opinion, however, programming natural dialogs is more about how you handle recognition errors than about catching every possible utterance. In either case, natural language grammars will always be larger and thus will take more time to develop.
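
To get a feel for how quickly an open-ended grammar grows, here is a small Python sketch that expands two toy grammars into the utterances they accept. The phrase lists and the expand function are assumptions invented for this example; the open-ended grammar is deliberately loose and even accepts some awkward phrasings, which a real grammar would need to constrain.

```python
from itertools import product


def expand(parts):
    """Return every utterance formed by choosing one alternative per part.

    An empty string stands for an optional part that the caller skips.
    """
    return {" ".join(w for w in combo if w) for combo in product(*parts)}


CARDS = ["visa", "mastercard", "american express"]

# Directed prompt: "Please say Visa, MasterCard or American Express."
directed = expand([CARDS])

# Open-ended prompt: "How would you like to pay for this?"
open_ended = expand([
    ["", "i'd like to pay with", "i'll use", "can i pay with"],
    ["", "my"],
    CARDS + ["check", "cash", "cod"],
    ["", "please"],
])

print(len(directed), "directed utterances")
print(len(open_ended), "open-ended utterances")
```

Even with only a few optional phrases on either side of the payment type, the open-ended grammar accepts many times more utterances than the directed one, and every one of them has to be written, tested and tuned.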

Prompts in directed dialogs should be clear enough to eliminate most ambiguities. Doing so limits the number of error handlers you have to write. However, in a natural dialog, you will have to write error handlers and prompts to go along with each possible utterance. This will of course require more programming time and more time in the recording studio to record the prompts.

A natural mixed initiative dialog will also require more maintenance, because the grammars and prompts must be regularly tuned to account for utterances that haven't already been accounted for. Directed dialogs need the same kind of maintenance, but not as frequently, and it won't require as much work.

What Value in Natural Dialogs?

Sure, natural dialogs are cool compared to touch-tone or directed speech menus, but coolness is not a final measure of whether a natural dialog should be employed. Yes, people who are not familiar with speech recognition may have an initial wow factor, but that inevitably wears off and then the question is, why and when is a natural dialog better?

The measure I use is this: If a natural dialog isn't better from a usability standpoint (which translates into fewer bail-outs) or faster (callers get the job done more quickly, which reduces call time) when compared to a directed dialog, then go with the directed dialog, which is quicker, easier and cheaper to build.
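
Written out as code, the rule looks something like the sketch below. The function name and its four inputs are hypothetical; in practice the bail-out rates and call times would come from usability tests or pilot data for each design.

```python
def choose_dialog_style(natural_bailout_rate, directed_bailout_rate,
                        natural_call_seconds, directed_call_seconds):
    """Pick natural only if it wins on usability or on speed; otherwise directed."""
    more_usable = natural_bailout_rate < directed_bailout_rate
    faster = natural_call_seconds < directed_call_seconds
    return "natural" if (more_usable or faster) else "directed"


# Natural dialog is neither more usable nor faster here, so directed wins.
print(choose_dialog_style(0.25, 0.20, 95, 90))
```

The default is the directed dialog: the natural dialog has to earn its extra cost on at least one of the two measures.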


The difference in effort between directed dialogs and natural mixed initiative dialogs can easily be 3 orders of magnitude or more, so care should be taken in making your decision, especially where cost and time are concerned. This article should give you some ideas on where to look in evaluating which approach makes sense in your case. If you're still not sure which approach to take after reading this, or if you still have more questions, send me an email at eisen@ferrumgroup.com.

About Jonathan Eisenzopf

Jonathan is a member of the Ferrum Group, LLC, which specializes in Voice Web consulting and training. Feel free to send an email to eisen@ferrumgroup.com regarding questions or comments about this or any article.

This article was originally published on Saturday Nov 9th 2002