SSML and HTML Tags Recognized by eSpeak

SSML Tags

The following SSML markup tags and attributes are recognised:

`<speak>` (the root element)

xml:base: the document's base URI, against which relative URIs are resolved. A base URI may come from a variety of sources; this attribute lets authors specify it explicitly. eSpeak does not interpret the value itself: it is just passed back as a parameter with the UriCallback() function.
xml:lang: a required attribute specifying the language of the root document. Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
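
As an illustration of the inheritance rule, the following Python sketch parses a small SSML document and works out the language that applies to each sentence. The sample document and the helper name `effective_lang` are assumptions made for this example, and the helper only handles one level of nesting; it is not how eSpeak itself resolves languages.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ssml = (
    '<speak xml:lang="en">'
    '<s>Good morning.</s>'
    '<s xml:lang="fr">Bonjour.</s>'
    '</speak>'
)

def effective_lang(root):
    """Yield (text, language) for each <s>, falling back to the root's xml:lang."""
    inherited = root.get(XML_LANG)
    for s in root.iter("s"):
        # An inner xml:lang overrides the value inherited from <speak>.
        yield s.text, s.get(XML_LANG, inherited)

root = ET.fromstring(ssml)
result = list(effective_lang(root))
```

The first sentence picks up "en" from `<speak>`, while the second keeps its own "fr".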

`<voice>`

This element requests a change in speaking voice.

Although each attribute individually is optional, it is an error if no attributes are specified when the voice element is used.

Attributes are:

xml:lang:
name: a processor-specific voice name to speak the contained text; may be a space-separated list of names ordered from top preference down. As a result a name must not contain any white space.
age: a non-negative integer (xsd:nonNegativeInteger)
variant: a positive integer (xsd:positiveInteger)
gender: any of "male", "female", "neutral"
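
The rule that at least one attribute must be present can be enforced when the markup is generated. The sketch below is a hypothetical helper (the name `voice` and its dictionary argument are assumptions for this example, not part of eSpeak) that builds a `<voice>` element and rejects an empty attribute set.

```python
from xml.sax.saxutils import escape, quoteattr

def voice(text, attrs):
    """Wrap text in a <voice> element; attrs maps attribute names to values."""
    if not attrs:
        # Each attribute is optional, but an attribute-less <voice> is an error.
        raise ValueError("<voice> needs at least one attribute")
    rendered = " ".join(
        f"{name}={quoteattr(str(value))}" for name, value in attrs.items()
    )
    return f"<voice {rendered}>{escape(text)}</voice>"

markup = voice("Hello", {"gender": "female", "age": 30})
```

Here `markup` holds `<voice gender="female" age="30">Hello</voice>`, and calling `voice("Hello", {})` raises `ValueError`.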

`<prosody>`

rate: a change in the speaking rate for the contained text. Legal values are: a relative change or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When a number is used to specify a relative change it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. It should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
volume: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying "silent"). Legal values are: number, a relative change or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The volume scale is linear amplitude. The default is 100.0. Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels.
pitch: the baseline pitch for the contained text. Although the exact meaning of baseline pitch will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by Hz, a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
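
The multiplier arithmetic for rate can be sketched in a few lines of Python. The default rate of 160 used here is a made-up figure for illustration only (the real default depends on the voice), and `effective_rate` is a hypothetical helper name.

```python
def effective_rate(default_rate, multiplier):
    """Apply a numeric rate value as a multiplier of the voice's default rate."""
    if multiplier <= 0:
        raise ValueError("rate multiplier must be positive")
    return default_rate * multiplier

HYPOTHETICAL_DEFAULT = 160  # assumed default rate, not a documented eSpeak value

unchanged = effective_rate(HYPOTHETICAL_DEFAULT, 1)    # no change
doubled = effective_rate(HYPOTHETICAL_DEFAULT, 2)      # twice the default
halved = effective_rate(HYPOTHETICAL_DEFAULT, 0.5)     # half the default
```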

`<say-as>`

This element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.

Due to the variety of languages that have to be considered and because of the innate flexibility of written languages, SSML only specifies the <say-as> element, its attributes, and their purpose. It does not enumerate the possible values for the attributes.

interpret-as="characters"
interpret-as="characters" format="glyphs"
interpret-as="tts:key"
interpret-as="tts:char"
interpret-as="tts:digits"
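
A short usage sketch may help show where these values go in a document. The Python snippet below embeds two `<say-as>` elements in a made-up sentence and reads the attribute back with the standard XML parser, which also confirms the markup is well formed. The sample text is an assumption for this example; how each value is actually rendered is up to eSpeak.

```python
import xml.etree.ElementTree as ET

ssml = (
    '<speak xml:lang="en">'
    'The code is <say-as interpret-as="characters">XJ9</say-as>, '
    'PIN <say-as interpret-as="tts:digits">2048</say-as>.'
    '</speak>'
)

root = ET.fromstring(ssml)
# Collect (interpret-as value, enclosed text) for every <say-as> element.
pairs = [(e.get("interpret-as"), e.text) for e in root.iter("say-as")]
```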

`<mark>`

name:

A mark element is an empty element that places a marker into the text/tag sequence. It has one required attribute, name, which is of type xsd:token. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, a synthesis processor must do one or both of the following:

inform the hosting environment with the value of the name attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.
when audio output of the SSML document reaches the mark, issue an event that includes the required name attribute of the element. The hosting environment defines the destination of the event.

The mark element does not affect the speech output process.
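
From the hosting environment's side, it can be useful to know in advance which mark names a document will report. The following Python sketch lists them in document order; the sample document and mark names are assumptions for this example, and the snippet does not call eSpeak or receive real events.

```python
import xml.etree.ElementTree as ET

ssml = (
    '<speak xml:lang="en">'
    'Chapter one.<mark name="ch1_end"/> '
    'Chapter two.<mark name="ch2_end"/>'
    '</speak>'
)

root = ET.fromstring(ssml)
# <mark> is an empty element; only its required name attribute carries data.
mark_names = [m.get("name") for m in root.iter("mark")]
```

A host could compare the names it receives in mark events against `mark_names` to track how far the audio output has progressed.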

`<s>`

xml:lang:

`<p>`

xml:lang:

`<sub>`

The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be spoken instead of the enclosed string.

alias

Example:

<sub alias="World Wide Web Consortium">W3C</sub>

`<tts:style>`

field="punctuation" mode={none|all|some}
field="capital_letters" mode={no|spelling|icon|pitch}

`<audio>`

src:

`<emphasis>`

level: any of "strong", "moderate", "none" and "reduced". The default level is "moderate".

`<break>`

strength: any of "none", "x-weak", "weak", "medium" (default value), "strong", or "x-strong". This attribute indicates the strength of the prosodic break in the speech output. The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break that the processor would otherwise produce.
time: the duration of a pause to be inserted in the output in seconds or milliseconds. It follows the time value format from the Cascading Style Sheets Level 2 Recommendation [CSS2], e.g. 250ms, 3s.
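
The time format can be illustrated with a small converter to milliseconds. This Python sketch handles only the simple forms shown above (an unsigned number followed by `ms` or `s`), which is a simplifying assumption rather than the full CSS2 grammar; `break_time_ms` is a hypothetical helper name.

```python
import re

# An unsigned decimal number followed by a unit, e.g. "250ms" or "3s".
_TIME = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")

def break_time_ms(value):
    """Convert a break time such as '250ms' or '3s' to milliseconds."""
    match = _TIME.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    return number * (1000.0 if unit == "s" else 1.0)

short_pause = break_time_ms("250ms")
long_pause = break_time_ms("3s")
```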

HTML

eSpeak can speak HTML text directly, or text containing both SSML and HTML markup.

Any unrecognised tags are ignored.

The following tags cause a sentence break:

<br> <dd> <li> <img> <td>

The following tags cause a paragraph break:

<h1> <h2> <h3> <h4> <hr>

Text between the following tags is ignored:

<script> ... </script>

<style> ... </style>
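
The HTML rules above can be sketched as a small classifier built on Python's standard `html.parser` module. This is an illustration of the rules as listed, not eSpeak's implementation: the class name `BreakClassifier` and the event labels are assumptions for this example, and recording the break at the opening tag is likewise an assumption.

```python
from html.parser import HTMLParser

SENTENCE_BREAK = {"br", "dd", "li", "img", "td"}
PARAGRAPH_BREAK = {"h1", "h2", "h3", "h4", "hr"}
IGNORED_CONTENT = {"script", "style"}

class BreakClassifier(HTMLParser):
    """Record text and break events for the tags listed above."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_CONTENT:
            self._skip_depth += 1
        elif tag in SENTENCE_BREAK:
            self.events.append(("sentence-break", tag))
        elif tag in PARAGRAPH_BREAK:
            self.events.append(("paragraph-break", tag))
        # Any other tag is unrecognised and simply ignored.

    def handle_endtag(self, tag):
        if tag in IGNORED_CONTENT and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.events.append(("text", text))

parser = BreakClassifier()
parser.feed("<h1>Title</h1><script>var x = 1;</script>One<br>Two<b>bold</b>")
parser.close()
events = parser.events
```

For this input the script body is dropped, `<h1>` yields a paragraph break, `<br>` yields a sentence break, and the unrecognised `<b>` tag is skipped while its text is kept.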
