Overview of VoiceXML 2.0
The Dialogic(r) Media Server Release 2.4.0 supports both VoiceXML Version 2.0 and VoiceXML 1.0; the default browser is VoiceXML 2.0. VoiceXML 2.0 is a significant
advance over VoiceXML 1.0, not so much in terms of functionality as in terms
of interoperability and clarity.
The sections below describe the major enhancements in VoiceXML
2.0.
<grammar>
VoiceXML 2.0 is designed for creating audio dialogs that feature
synthesized speech, digitized audio, recognition of spoken and DTMF key
input, recording of spoken input, telephony, and mixed initiative conversations.
VoiceXML 2.0 supports the XML format of the W3C Speech Recognition Grammar
Specification for both speech and DTMF grammars. The <grammar> element
has a mode attribute that indicates whether
the grammar is a voice grammar or a DTMF grammar. The
<dtmf> element of VoiceXML 1.0 is replaced in VoiceXML 2.0 by
a <grammar> element with its mode attribute
set to dtmf; the VoiceXML 2.0 platform supports
this XML format for DTMF grammars. Because VoiceXML 2.0 mandates a single
XML grammar format, application developers can write portable speech
and DTMF grammars.
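As an illustration, a minimal DTMF grammar in the XML grammar format might look like the following sketch; the field name, digits, and prompt text are arbitrary:

```xml
<field name="option">
  <!-- mode="dtmf" marks this as a key-press grammar rather than a voice grammar -->
  <grammar mode="dtmf" version="1.0" root="option">
    <rule id="option">
      <one-of>
        <item>1</item>
        <item>2</item>
        <item>3</item>
      </one-of>
    </rule>
  </grammar>
  <prompt>Press 1 for sales, 2 for support, or 3 for billing.</prompt>
</field>
```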
MRCP
VoiceXML 2.0 supports Media Resource Control Protocol (MRCP)
version 1.0. MRCP controls media service resources such as speech synthesizers,
recognizers, signal generators, signal detectors, and fax servers over
a network. The protocol is designed to work with streaming protocols such as
RTSP (Real Time Streaming Protocol) or SIP (Session Initiation Protocol),
which establish control connections to external media streaming devices,
and with media delivery mechanisms such as RTP (Real-time Transport Protocol).
MRCP integrates speech recognition and text-to-speech engines from ScanSoft,
Nuance, and IBM. A speech recognition application that listens to words spoken
over a telephone and recognizes them can pass the recognized words
as text to a Text-To-Speech (TTS) application, which synthesizes
speech from that text for playback over the telephone.
application.lastresult$
VoiceXML 2.0 defines a new variable, application.lastresult$,
that provides information about the last recognition in the application.
VoiceXML 1.0 provides no mechanism for inspecting the recognition
result arising from grammars in the <link>
element: if user input in an active dialog (typically a
<field> inside a <form>)
matched a grammar in a <link>, there
was no way to evaluate the recognition result before executing the link's action.
In VoiceXML 2.0, when user input matches a grammar in a
<field>, the result can be evaluated through the
<field> element; for example, the developer can check the confidence
by inspecting the field's confidence shadow
variable. The <log> element is new in
VoiceXML 2.0. Developers typically use it during debugging to
generate a debug message.
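A sketch combining these features; the grammar URI, field name, threshold, and prompt text are placeholders:

```xml
<field name="city">
  <grammar src="cities.grxml" type="application/srgs+xml"/>
  <filled>
    <!-- log the last recognition result for debugging -->
    <log>Recognized <value expr="application.lastresult$.utterance"/>
      with confidence <value expr="application.lastresult$.confidence"/></log>
    <!-- reject low-confidence results via the field's confidence shadow variable -->
    <if cond="city$.confidence &lt; 0.5">
      <prompt>I'm not sure I heard you correctly.</prompt>
      <clear namelist="city"/>
    </if>
  </filled>
</field>
```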
bargeintype
VoiceXML 2.0 gives the developer more control over the
type of bargein performed by the platform through the new
bargeintype attribute of the <prompt>
element. The attribute's values, speech and hotword, determine how aggressively
bargein is performed: with speech, any audible input stops the prompt,
while with hotword the prompt stops only when the input matches an active grammar.
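A minimal illustration of the attribute; the prompt text is arbitrary:

```xml
<!-- the prompt keeps playing until the caller's input matches an active grammar -->
<prompt bargeintype="hotword">
  Please say your account number at any time.
</prompt>
```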
<choice>
VoiceXML 1.0 uses the text content of
<choice> elements in
<menu>
to generate a grammar specifying sub-phrases. For example,
<menu>
<prompt>
Welcome home. Say one of: <enumerate/>
</prompt>
<choice ... > Sports news </choice>
<choice ... > Weather news </choice>
<choice ... > Stargazer astrophysics
news </choice>
</menu>
The last
<choice> would be
matched if the user said phrases such as "Stargazer", "Stargazer
News", "astrophysics news", and so forth. The exact grammar
generation mechanism might be language and platform dependent. While there
are some use cases for this mechanism, there is also a strong use case
for a strict form of grammar generation in which a
<choice> is matched if and only if the user says exactly its
content. If alternative phrases are required, they can be specified in
multiple
<choice> elements. This option gives
the developer more control over what is recognized, making application
behavior more consistent across platforms from different vendors.
accept
To provide this control, an accept
attribute on <menu> was introduced in
VoiceXML 2.0: with the value exact (the default),
each <choice> defines the exact phrase to be recognized, while the value
approximate selects the earlier 'approximate' matching. The
attribute is also defined on <choice>,
so specific <choice> elements can override
the general <menu> strategy.
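For example, the earlier menu could require exact phrases overall while allowing approximate matching for one choice; the next URIs are placeholders:

```xml
<menu accept="exact">
  <prompt>Say one of: <enumerate/></prompt>
  <choice next="sports.vxml">Sports news</choice>
  <choice next="weather.vxml">Weather news</choice>
  <!-- this choice still matches sub-phrases such as "Stargazer" -->
  <choice accept="approximate" next="astro.vxml">Stargazer astrophysics news</choice>
</menu>
```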
<throw>
<catch>
VoiceXML 2.0 enhances <throw>
and <catch> to convey additional information.
<throw> now has attributes that allow
developers to supply information beyond the event name: the
message attribute specifies
the additional information statically, while messageexpr
specifies it dynamically. In VoiceXML 1.0, it was impossible
to specify a handler for a general event type and then process specific
event types in different ways. In VoiceXML 2.0,
<catch> handlers have two new anonymous variables:
- _event Containing the full
name of the event that was thrown
- _message Containing the value
of the message string from the corresponding
<throw> .
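A sketch of how a general handler can dispatch on the specific event; the event name suffix and message text are illustrative:

```xml
<throw event="error.com.example.payment" message="card declined"/>

<!-- catches every event whose name begins with "error" -->
<catch event="error">
  <log>Caught <value expr="_event"/>: <value expr="_message"/></log>
  <if cond="_event == 'error.com.example.payment'">
    <prompt>There was a problem with your payment.</prompt>
  <else/>
    <prompt>An error occurred.</prompt>
  </if>
</catch>
```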
<audio>
The <audio> element has been
enhanced with an expr attribute. In addition
to allowing the audio to be selected dynamically,
this enhancement allows an <audio>
element to be silently ignored when the expr
attribute evaluates to ECMAScript undefined. Application developers can
use this feature to specify a list of <audio>
elements in a document where each <audio>
element is activated only if its expr has a defined
value.
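For example (the variable names and file name are illustrative), only the clip whose variable is defined is played:

```xml
<block>
  <var name="greetingClip" expr="'welcome.wav'"/>
  <var name="promoClip"/>  <!-- undefined: this clip is silently skipped -->
  <audio expr="greetingClip"/>
  <audio expr="promoClip"/>
</block>
```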
xml:lang
In VoiceXML 2.0, the lang attribute of the
vxml element has been replaced with xml:lang
to bring VoiceXML into alignment with other W3C XML languages. The application
developer can specify the language for both spoken input and output by assigning
this attribute a language value defined in [RFC1766].
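For example, a minimal document localized to US English:

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <block><prompt>Hello.</prompt></block>
  </form>
</vxml>
```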
<subdialog>
The
<subdialog> element provides
a mechanism for decomposing complex sequences of dialogs, either to structure
them better or to create reusable components. Its description in VoiceXML 1.0 was
unclear about the relationship between the calling dialog and the
subdialog, as well as the relationship between the subdialog and
its document context (this was not helped by the inclusion of a
modal attribute in the
<subdialog>
description but no such attribute in the DTD).
In VoiceXML 2.0, a subdialog context is independent of its calling
dialog, but the subdialog context follows normal scoping rules for grammars,
events, and variables.
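A sketch of a subdialog call and its return; the URIs, dialog name, and variable names are placeholders:

```xml
<!-- calling document -->
<form id="order">
  <subdialog name="confirm" src="confirm.vxml#ask">
    <filled>
      <prompt>You said <value expr="confirm.answer"/>.</prompt>
    </filled>
  </subdialog>
</form>

<!-- confirm.vxml: runs in its own context, independent of the caller -->
<form id="ask">
  <field name="answer">
    <grammar mode="voice" src="yesno.grxml"/>
    <filled>
      <return namelist="answer"/>
    </filled>
  </field>
</form>
```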
Root and Leaf Documents
VoiceXML 1.0 lacks clarity in the definition of, and transitions
between, root and leaf documents. VoiceXML 2.0 explicitly defines these
transitions in terms of the <choice>, <goto>,
<link>, <subdialog>, and <submit>
elements and explains whether the application root context is preserved
or initialized.
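A leaf document names its root document through the application attribute on <vxml>; this sketch assumes a root document app-root.vxml that declares a variable greeting, both of which are placeholders:

```xml
<!-- leaf.vxml: the application root context is preserved while this executes -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" application="app-root.vxml">
  <form>
    <block>
      <!-- greeting is declared in app-root.vxml and remains in scope here -->
      <prompt><value expr="greeting"/></prompt>
    </block>
  </form>
</vxml>
```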
Conformance
VoiceXML 2.0 clarifies conformance both in terms of VoiceXML
documents and in terms of VoiceXML processors. This aligns VoiceXML with
other W3C specifications, and the definitions are generally aligned with
those in the Speech Grammar and Speech Synthesis specifications.
A conforming VoiceXML 2.0 document must be a well-formed
XML document and must provide a namespace declaration on the
<vxml> element.
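A minimal conforming document, showing the required namespace declaration on <vxml>:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block><prompt>Hello world.</prompt></block>
  </form>
</vxml>
```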