Introduction to SALT (Part 3): Applying SALT

by Hitesh Seth

So far in our SALT (Speech Application Language Tags) series we have learned about the syntax of key SALT elements and their usage. We have also seen a preview of some of the applications that can be developed using SALT. In this article, "Applying SALT," we are going to go back to the drawing board and learn about the various elements and the architecture of a SALT-based speech solution.

Multimodal & Telephony

In IVR (Interactive Voice Response), touch-tone systems and telephony-based speech applications, most applications take speech or touch-tone input and produce prerecorded or synthesized speech output. What we are really using here is a single modality, "speech" (either as both input and output in the case of interactive speech recognition, or as output only, paired with touch-tone input, in the case of touch-tone style applications). Multimodality is where we can use more than one mode of interaction with the application, much as we do in normal human communication with each other.

For instance, consider an application which allows us to get driving directions. While it is typically easier to speak the start and destination addresses aloud (or better yet, even shortcuts like: my home, my office, my doctor's office--based on my previously established profile), the turn-by-turn and overall directions are typically best viewed through a map and probably a summary of turn-by-turn directions as well, something similar to what we are used to seeing at MapQuest's web site.

In essence, a multimodal application, when executed on a desktop device, would be very similar to MapQuest but would also allow the user to talk and listen to the system for parts of the application's input/output. For instance, the starting and destination addresses could be either typed or spoken. Now imagine this same application using the same interface on a wireless PDA. Now we are talking about a true mobile/multimodal application. If we let our imaginations go a little bit wilder, we could easily extend the same application to the dashboard of our car or any other device we can imagine working with. That's really the vision, which, given the current state of technology, isn't far away. Another modality that could be added to the example application is a pointing device, which would zoom the map and focus on the location being pointed at.

So how does SALT fit in with all of this? Well, SALT has been built upon the technology that is required for applications to be deployed in a telephony and/or multimodal context.

SALT Architecture

Let's say we are all set to go and implement our next-generation, interactive, speech-driven SALT-based application. How should the architecture be designed? As we can see in the diagram below, the application architecture for deploying SALT-based applications is similar to that of a web application, with two major differences. First, the web application is also capable of delivering SALT-based dynamic speech applications (provided the browser can handle SALT, either natively or through an add-on). Second, there is a stack of technologies that broadly represents the integration of speech recognition/synthesis and telephony platforms.

Note: this diagram is really a conceptual representation, and where the SALT browser/interpreter and speech recognition/synthesis components specifically fit in depends on the capabilities of the end-user device/browser. Actual implementation of the SALT stack may vary based on vendor implementations.

The speech recognition component (popularly referred to as Automatic Speech Recognition (ASR)) is focused on recognizing spoken user utterances and matching them to a list of possible interpretations using a specified grammar. The speech synthesis component (popularly referred to as Text to Speech (TTS)) is focused on dynamically converting text messages into voice output.
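
To make the idea of "matching an utterance to an interpretation using a grammar" more concrete, here is a minimal sketch in Python. It is a toy and not how an ASR engine works internally: a real engine scores audio against the grammar, whereas this example starts from text that has already been transcribed. The grammar contents and the interpret function are hypothetical, chosen only to echo the driving-directions shortcuts mentioned earlier.

```python
from typing import Optional

# Hypothetical grammar: each accepted phrase maps to an interpretation.
# A real speech application would reference a grammar file instead.
GRAMMAR = {
    "my home": {"place": "home"},
    "my office": {"place": "office"},
    "my doctor's office": {"place": "doctor"},
}


def interpret(utterance: str) -> Optional[dict]:
    """Return the interpretation for an utterance, or None if out of grammar."""
    # Normalize case and whitespace so "My  Home" matches "my home".
    key = " ".join(utterance.lower().split())
    return GRAMMAR.get(key)
```

Under these assumptions, interpret("My Home") returns the "home" interpretation, while a phrase the grammar does not list, such as "the moon", returns None. That second case corresponds to what a speech platform would treat as a failed recognition.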

The telephony integration component is focused on connecting the speech platform with the world of telephones--the Public Switched Telephone Network (PSTN). This is typically achieved using telephony cards from vendors such as Intel/Dialogic, connected via analog or digital telephony lines to your telephony provider (i.e., your phone company).

When multimodality is used, the regular web application delivery framework (based on TCP/IP, HTTP, HTML, JavaScript, etc.) is used for delivering the web application. The speech/telephony platform is used for the "speech/voice" aspect of the whole interaction, depending on the nature of the connection and the location of the speech recognition/synthesis components. Of course, both of these interactions can happen together seamlessly, as part of the same user session, depending on the user's choice.
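
One way to picture "both interactions as part of the same user session" is a single piece of session state that either channel can update. The following Python sketch assumes a hypothetical DirectionsSession class for the driving-directions example; the class, its field names and the modality labels are illustrative and do not come from SALT or any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class DirectionsSession:
    """One user session shared by the visual and voice channels (hypothetical)."""
    fields: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def update(self, name: str, value: str, modality: str) -> None:
        # Either channel writes into the same state; the latest input wins.
        self.fields[name] = value
        self.history.append((modality, name))

    def ready(self) -> bool:
        # Directions can be computed once both addresses are known.
        return all(k in self.fields for k in ("start", "destination"))
```

A user could speak the starting address and type the destination. The session does not care which modality supplied which value, only that both are eventually present.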

.NET Speech SDK

You might be wondering where the .NET Speech SDK fits in. The current preview, which is available from Microsoft's site, really has two components: (a) an add-in for Microsoft Internet Explorer which recognizes SALT tags and allows the user to interact with the application using the desktop's microphone and speakers/headphones and (b) a set of ASP.NET-based Speech controls which allow developers using Microsoft Visual Studio .NET to create multimodal/telephony applications and/or add speech interactivity to existing web applications developed using the Microsoft .NET and ASP.NET frameworks.

I would like to point out that a SALT-based application could just as well be delivered using a non-ASP.NET web application framework (e.g. Perl or JavaServer Pages), since SALT tags are simply part of the page markup. What the .NET Speech SDK really provides is ease of development when adding speech to your existing web applications or creating new ones.
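
As a rough illustration of that point, the sketch below uses plain Python string formatting, with no ASP.NET involved, to generate a page containing SALT markup. The prompt, listen, grammar and bind elements and the namespace URI follow the SALT 1.0 specification as I understand it. The render_city_page function, the element ids and the grammar file name are hypothetical. Browser-specific setup (such as the declarations an add-in may require) and the script that starts the prompt and the listen are deliberately left out.

```python
from html import escape

# SALT namespace URI (as given in the SALT 1.0 specification).
SALT_NS = "http://www.saltforum.org/2002/SALT"


def render_city_page(prompt_text: str, grammar_url: str) -> str:
    """Build an HTML page with a SALT prompt and a SALT listen element."""
    # escape() guards against markup characters in the dynamic values.
    return f"""<html xmlns:salt="{SALT_NS}">
  <body>
    <input name="txtBoxCity" type="text" />
    <salt:prompt id="askCity">{escape(prompt_text)}</salt:prompt>
    <salt:listen id="listenCity">
      <salt:grammar src="{escape(grammar_url)}" />
      <salt:bind targetelement="txtBoxCity" value="//city" />
    </salt:listen>
  </body>
</html>"""
```

Any server-side technology that can emit text like this could, in principle, deliver a SALT page. The ASP.NET Speech controls generate comparable markup for you rather than making it possible in the first place.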

To be Continued

We will continue our exploration of SALT in the next article, where we will actually start developing a SALT-based multimodal and telephony application using the Microsoft .NET Speech SDK, an extension to Microsoft Visual Studio .NET that is focused on building dynamic speech applications based on the SALT specification. You might want to get prepared by ordering the .NET Speech SDK Beta from the Microsoft site (a link is provided below).


About Hitesh Seth

A freelance author and well-known speaker, Hitesh is a columnist on VoiceXML technology in XML Journal and regularly writes for other technology publications on emerging technology topics such as J2EE, Microsoft .NET, XML, Wireless Computing, Speech Applications, Web Services & Enterprise/B2B Integration. He is the conference chair for VoiceXML Planet Conference & Expo. Hitesh received his Bachelor's degree from the Indian Institute of Technology Kanpur (IITK), India. Feel free to email any comments or suggestions about the articles featured in this column to hks@hiteshseth.com.

This article was originally published on Thursday Nov 7th 2002