Overview of VoiceXML 2.0

The Dialogic® Media Server Release 2.4.0 supports both VoiceXML Version 2.0 and VoiceXML 1.0; the default browser is VoiceXML 2.0. VoiceXML 2.0 provides a significant advance over VoiceXML 1.0, not so much in terms of functionality as in terms of interoperability and clarity.

The sections below describe the major enhancements in VoiceXML 2.0.

<grammar>

VoiceXML 2.0 is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. VoiceXML 2.0 supports the XML form of the W3C Speech Recognition Grammar Specification for both speech and DTMF grammars. The <grammar> element has a mode attribute whose value indicates whether it is a voice grammar or a DTMF grammar; the <dtmf> element of VoiceXML 1.0 is replaced in VoiceXML 2.0 by a <grammar> element with its mode attribute set to dtmf. By using a single mandatory XML grammar format, VoiceXML 2.0 lets application developers write portable speech and DTMF grammars.
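For example, a DTMF grammar that VoiceXML 1.0 would have expressed with <dtmf> can be written in VoiceXML 2.0 as a <grammar> element with mode="dtmf". This is a minimal sketch using the SRGS XML form; the field name and prompt wording are illustrative only:

```xml
<!-- A field that accepts the DTMF digits 1, 2, or 3. The same <grammar>
     element with mode="voice" would hold a speech grammar instead. -->
<field name="option">
  <prompt>Press 1 for sales, 2 for support, or 3 for billing.</prompt>
  <grammar mode="dtmf" version="1.0" root="digit">
    <rule id="digit">
      <one-of>
        <item>1</item>
        <item>2</item>
        <item>3</item>
      </one-of>
    </rule>
  </grammar>
</field>
```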

MRCP

VoiceXML 2.0 supports Media Resource Control Protocol (MRCP) version 1.0. MRCP controls media service resources such as speech synthesizers, recognizers, signal generators, signal detectors, and fax servers over a network. The protocol is designed to work with streaming protocols such as RTSP (Real Time Streaming Protocol) or SIP (Session Initiation Protocol), which establish control connections to external media streaming devices, and with media delivery mechanisms such as RTP (Real-time Transport Protocol). MRCP integrates speech recognition and text-to-speech engines from ScanSoft, Nuance, and IBM. A speech recognition application can listen to words spoken over a telephone, recognize them, and pass the recognized words as text to a Text-To-Speech (TTS) application, which synthesizes speech from that text for playback over the telephone.

application.lastresult$

VoiceXML 2.0 defines a new variable, application.lastresult$, that provides information about the last recognition in the application. VoiceXML 1.0 provides no mechanism for inspecting the recognition result arising from grammars in the <link> element: if user input in an active dialog (typically a <field> inside a <form>) matched a grammar in a <link>, there was no way to evaluate the recognition result before executing the link's action. In VoiceXML 2.0, when user input matches a grammar in a <field>, the result can be evaluated through the field itself; for example, the developer can check the confidence by inspecting the field's confidence shadow variable. The <log> element, also new in VoiceXML 2.0, is typically used during debugging to generate a debug message.
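The following sketch shows both features together: a field whose shadow variable is checked for recognition confidence, with <log> used to record the recognized utterance. The field name, grammar file name (city.grxml), and the 0.5 threshold are illustrative assumptions, not values mandated by the specification:

```xml
<field name="city">
  <prompt>Which city?</prompt>
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <filled>
    <!-- Reject low-confidence results using the field's shadow variable -->
    <if cond="city$.confidence &lt; 0.5">
      <log>Low confidence for: <value expr="application.lastresult$.utterance"/></log>
      <clear namelist="city"/>
      <reprompt/>
    </if>
  </filled>
</field>
```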

bargeintype

VoiceXML 2.0 gives the developer more control over the type of bargein performed by the platform through the new bargeintype attribute of the <prompt> element. Its values, speech and hotword, determine how aggressively bargein is performed: with speech, the prompt is stopped as soon as any input is detected, while with hotword the prompt is stopped only when the input matches an active grammar.
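A minimal sketch of the two settings (the prompt wording is illustrative):

```xml
<!-- speech: playback stops as soon as the caller starts speaking -->
<prompt bargein="true" bargeintype="speech">
  Please say the name of the person you are calling.
</prompt>

<!-- hotword: playback continues until the input matches an active grammar -->
<prompt bargein="true" bargeintype="hotword">
  Listen to the full menu, or say operator at any time.
</prompt>
```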

<choice>

VoiceXML 1.0 uses the text content of <choice> elements in <menu> to generate a grammar specifying sub-phrases. For example,

<menu>
  <prompt>
    Welcome home. Say one of: <enumerate/>
  </prompt>
  <choice ... > Sports news </choice>
  <choice ... > Weather news </choice>
  <choice ... > Stargazer astrophysics news </choice>
</menu>

The last <choice> would be matched if the user said phrases such as "Stargazer", "Stargazer News", "astrophysics news", and so forth. The exact grammar generation mechanism may be language and platform dependent. While there are some use cases for this mechanism, there is also a strong use case for a strict form of grammar generation in which a <choice> is matched if and only if the user says exactly its content. If alternative phrases are required, they can be specified in multiple <choice> elements. This option gives the developer more control over what is recognized, making application behavior more consistent across platforms developed by different vendors.

accept

To provide this control, VoiceXML 2.0 introduces an accept attribute on <menu>: with a value of exact (the default), the <choice> content defines the exact phrase to be recognized, while a value of approximate selects the earlier 'approximate' matching. The attribute is also defined on <choice>, so specific <choice> elements can override the general <menu> strategy.
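A sketch of the two modes in one menu; the next targets (#astro, #sports) are hypothetical dialog anchors:

```xml
<menu accept="approximate">
  <prompt>Say one of: <enumerate/></prompt>
  <!-- Inherits approximate matching: "astrophysics", "Stargazer news",
       and similar sub-phrases would match -->
  <choice next="#astro">Stargazer astrophysics news</choice>
  <!-- Overrides the menu: only the exact phrase "sports news" matches -->
  <choice accept="exact" next="#sports">Sports news</choice>
</menu>
```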

<throw> and <catch>

VoiceXML 2.0 enhances <throw> and <catch> to carry additional information. <throw> now has attributes that let developers specify additional information beyond the event name: the message attribute specifies this information statically, while messageexpr specifies it dynamically. In VoiceXML 1.0, it was impossible to specify a handler for a general event type and then process specific event types in different ways. In VoiceXML 2.0, <catch> handlers have two new anonymous variables: _event, which contains the name of the event that was thrown, and _message, which contains the value of its message, if any.
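A minimal sketch of the round trip; the event name error.lookup and the account variable referenced by messageexpr are illustrative assumptions:

```xml
<catch event="error.lookup">
  <!-- _event holds the event name; _message holds the thrown message -->
  <log>Caught <value expr="_event"/>: <value expr="_message"/></log>
</catch>

<block>
  <throw event="error.lookup"
         messageexpr="'balance lookup failed for ' + account"/>
</block>
```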

<audio>

The <audio> element has been enhanced with an expr attribute. In addition to allowing the audio to be played back to be set dynamically, this enhancement allows an <audio> element to be silently ignored if its expr attribute evaluates to ECMAScript undefined. Application developers can use this feature to specify a list of <audio> elements in a document where each <audio> element is activated only if its expr has a defined value.
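A sketch of the pattern; recordedGreeting is a hypothetical ECMAScript variable, and welcome.wav a hypothetical audio file:

```xml
<!-- If recordedGreeting is undefined, the first <audio> is silently
     skipped and only the second prompt is heard. -->
<prompt>
  <audio expr="recordedGreeting"/>
  <audio src="welcome.wav">Welcome to the service.</audio>
</prompt>
```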

xml:lang

In VoiceXML 2.0, the lang attribute of <vxml> has been replaced with xml:lang to bring VoiceXML into alignment with other W3C XML languages. The application developer can specify the language for both spoken input and output by assigning this attribute a language value defined in [RFC1766].

<subdialog>

The <subdialog> element provides a mechanism for decomposing complex sequences of dialogs to better structure them or to create reusable components. Its description in VoiceXML 1.0 was unclear about the relationship between the calling dialog and the subdialog itself, as well as the relationship between the subdialog and its document context (this was not helped by the mention of a modal attribute in the <subdialog> description when no such attribute appeared in the DTD).

In VoiceXML 2.0, a subdialog context is independent of its calling dialog but follows normal scoping rules for grammars, events, and variables.
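The calling pattern can be sketched as follows; collect4digits.vxml, the field name, and the PIN prompt are illustrative assumptions:

```xml
<!-- Calling document: invoke the subdialog and read its returned value -->
<form id="main">
  <subdialog name="pin" src="collect4digits.vxml">
    <filled>
      <prompt>You entered <value expr="pin.digits"/>.</prompt>
    </filled>
  </subdialog>
</form>

<!-- collect4digits.vxml: runs in its own independent context and hands
     the collected value back to the caller via <return> -->
<form id="collect">
  <field name="digits" type="digits?length=4">
    <prompt>Please enter your four digit PIN.</prompt>
    <filled><return namelist="digits"/></filled>
  </field>
</form>
```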

Root and Leaf Documents

VoiceXML 1.0 lacks clarity in the definition of, and the transitions between, root and leaf documents. VoiceXML 2.0 explicitly defines these transitions in terms of the <choice>, <goto>, <link>, <subdialog>, and <submit> elements and states whether the application root context is preserved or initialized.

Conformance

VoiceXML 2.0 clarifies conformance both in terms of VoiceXML documents and in terms of VoiceXML processors. This aligns VoiceXML with other W3C specifications, and the definitions are generally aligned with those in the Speech Recognition Grammar and Speech Synthesis specifications.

A conforming VoiceXML 2.0 document must be a well-formed XML document and must provide a namespace declaration on the <vxml> element.
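A minimal conforming document might look like the following sketch; the namespace URI is the one defined by the W3C VoiceXML 2.0 specification, while the form content is illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <block>
      <prompt>Hello world.</prompt>
    </block>
  </form>
</vxml>
```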