So far in our SALT (Speech Application Language Tags) series we have learned the syntax of the key SALT elements and their usage, and we have previewed some of the applications that can be developed with SALT. In this article, "Applying SALT," we go back to the drawing board and look at the elements and architecture of a SALT-based speech solution.
Multimodal & Telephony
As we know from IVR (Interactive Voice Response), touch-tone systems and telephony-based speech applications, the majority of these applications work with speech or touch-tone input and prerecorded or synthesized speech output. What we are really using here is a single modality, speech: both input and output in the case of interactive speech recognition, or touch-tone input and speech output in the case of touch-tone style applications. Multimodality is where we can use more than one mode of interaction with an application, much as we do in normal human communication with each other.
For instance, consider an application that gives driving directions. While it is typically easier to speak the start and destination addresses aloud (or, better yet, use shortcuts such as "my home," "my office," or "my doctor's office," based on a previously established profile), the directions themselves are usually best viewed as a map along with a summary of the turn-by-turn directions, similar to what we are used to seeing on MapQuest's web site.
In essence, a multimodal version of such an application, executed on a desktop device, would look very much like MapQuest but would also allow the user to talk and listen to the system for parts of the application's input and output; the starting and destination addresses, for instance, could be either typed or spoken. Now imagine the same application with the same interface on a wireless PDA, and we are talking about a true mobile multimodal application. If we let our imaginations run a little wilder, we could easily extend the same application to the dashboard of our car or any other device we can imagine working with. That is really the vision, and given the current state of the technology it is not far away. Yet another modality we could add to the example application is a pointing device that zooms the map in on a particular location.
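To make the idea concrete, here is a minimal sketch of what the "tap-and-talk" address field of such an application might look like in SALT. The element names (txtOrigin, recoOrigin), the grammar file address.grxml, and the result path are all illustrative assumptions, not part of any particular product:

```html
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body>
  <!-- Tap-and-talk: clicking the text box starts speech recognition,
       but the user is equally free to type into it -->
  <input name="txtOrigin" type="text" onclick="recoOrigin.Start()" />

  <salt:listen id="recoOrigin">
    <!-- Hypothetical grammar constraining what can be spoken -->
    <salt:grammar src="address.grxml" />
    <!-- Copy the recognized address into the GUI text box -->
    <salt:bind targetelement="txtOrigin" value="//origin_address" />
  </salt:listen>
</body>
</html>
```

The same visual form works with or without speech, which is the essence of multimodality: speech augments the GUI rather than replacing it.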
So how does SALT fit in with all of this? SALT is built upon the technology required for applications to be deployed in a telephony and/or multimodal context.
Let's say we are all set to implement our next-generation, interactive, speech-driven SALT-based application. How should the architecture be designed? As the diagram below shows, the architecture for deploying SALT-based applications is similar to that of a web application, with two major differences: the web application is also capable of delivering SALT-based dynamic speech applications (provided the browser can handle SALT, either natively or through an add-on), and there is an additional stack representing the set of technologies that integrate the speech recognition/synthesis and telephony platforms.
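In the telephony case there is no screen at all, so the page the web server delivers is pure SALT dialog flow. The sketch below shows what such a voice-only page might look like; the ids, the grammar file address.grxml, and the handler function are assumptions for illustration:

```html
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<body onload="askOrigin.Start()">
  <!-- Synthesized (or prerecorded) speech output -->
  <salt:prompt id="askOrigin" oncomplete="recoOrigin.Start()">
    Where would you like to start from?
  </salt:prompt>

  <!-- Speech input, constrained by a hypothetical address grammar -->
  <salt:listen id="recoOrigin" onreco="handleResult()">
    <salt:grammar src="address.grxml" />
  </salt:listen>

  <script>
    function handleResult() {
      // post the recognized value back to the web server,
      // which responds with the next page of the dialog
    }
  </script>
</body>
</html>
```

Because the page is ordinary markup plus script, the same web server, session management, and dynamic-page machinery used for visual web applications can drive the dialog; only the SALT interpreter and the speech/telephony stack are new.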
Note: this diagram is a conceptual representation; where the SALT browser/interpreter and the speech recognition/synthesis components specifically fit depends on the capabilities of the end-user device and browser, and actual implementations of the SALT stack vary by vendor.