Festival (TTS)

The Festival Speech Synthesis System is a general multi-lingual speech synthesis system. It offers a full text-to-speech system with various APIs, as well as an environment for development and research of speech synthesis techniques. It is written in C++ with a Scheme-like command interpreter for general customization and extension. (Scheme is a dialect of Lisp.)

Festival is designed to support multiple languages, and comes with support for English (British and American pronunciation), Welsh, and Spanish. Voice packages exist for several other languages, such as Castilian Spanish, Czech, Finnish, Hindi, Italian, Marathi, Polish, Russian and Telugu.

Festival is designed as a speech synthesis system for at least three levels of user. First, those who simply want high quality speech from arbitrary text with the minimum of effort. Second, those who are developing language systems and wish to include synthesis output. In this case, a certain amount of customization is desired, such as different voices, specific phrasing, dialog types etc. The third level is in developing and testing new synthesis methods.

On Ubuntu, the documentation is found at /usr/share/doc/festival-doc/html/.

Installing `Festival`

I select quite a few packages, which take up more than 80 MB of disk space. Besides the large and comprehensive documentation (found at /usr/share/doc/festival-doc/html/), I include a male Castilian Spanish voice and a male Catalan voice.

The manual is found at https://www.cstr.ed.ac.uk/projects/festival/manual/, dated 1999.

Running Scheme `(tts MYFILE 'MODE)`

I set out to play doremi.xml, found in the current directory:

(tts "doremi.xml" 'singing)

Select the symbol 'singing for this example, or for any other file whose DOCTYPE is SINGING. Analogously, SABLE files (DOCTYPE=SABLE) need 'sable.

Some of the modes recognized in my installation:

'singing
'sable
'text
nil (just a null value in the Scheme language)

Running `Festival`

I start the (Scheme) interpreter:

$ festival

and type:

(set! utt1 (Utterance Text "Hello world"))
(utt.synth utt1)

to make an utterance then synthesize it.

I get no sound: utt.synth only builds the waveform, and playing it takes a further call, (utt.play utt1). So I type something simpler:

(SayText "Good morning, welcome to Festival")

I remain slightly disappointed with the results so far.

You may select other voices for synthesis by calling the appropriate function. For example:

(voice_cmu_sls_diphone)

This will set a female US English voice (if installed).

Any Scheme command may be typed at the command line, for example:

(Parameter.set 'Duration_Stretch 1.5)

will make all durations longer for the current voice (making the voice speak slower).

Calling any specific voice will reset this value (or you may do it by hand).

SayText is just a simple function that takes the given string, constructs an utterance object from it, synthesizes it and sends the resulting waveform to the audio device. This isn't really suitable for synthesizing anything but very short utterances.

The TTS process involves the more complex task of splitting text streams into utterances, synthesizing them and sending them to the audio device so that they may play while Festival is at the same time working on the next utterance, so that the audio output is continuous.

Festival does this through the tts function (which is what actually gets called when Festival is given the --tts argument on the command line). In Scheme the tts function takes two arguments, a filename and a mode. Modes can be used to allow special processing of text, such as honouring markup or particular styles of text like email etc.

In the simple case the mode will be nil, which denotes the basic raw or fundamental mode.

(tts "WarAndPeace.txt" nil)

Commands can also be stored in files, which is common when a number of function definitions and parameter settings are required. These Scheme files can be loaded by the function load, as in

(load "commands.scm")

Arguments to Festival at startup time will normally be treated as command files and loaded:

$ festival commands.scm

However, if the argument starts with a left parenthesis, "(", the argument is interpreted directly as a Scheme command.

$ festival '(SayText "a short example.")'

If the -b (batch) option is specified Festival does not go into interactive mode and exits after processing all of the given arguments.

$ festival -b mynewvoicedefs.scm '(SayText "a short example.")'

Thus we can use Festival interactively or simply as a batch scripting language. The batch format will be used often in the voice building process, though the interactive mode is useful for testing new voices.

Utterance structure

The basic building block for Festival is the utterance. The structure consists of a set of relations over a set of items. Each item represents an object such as a word, segment, syllable, etc., while relations relate these items together. An item may appear in multiple relations; for example, a segment will be in a Segment relation and also in the SylStructure relation. Relations define an ordered structure over the items within them. In general these may be arbitrary graphs but in practice so far we have only used lists and trees. Items may contain a number of features.

There are no built-in relations in Festival, and the names and use of them are controlled by the particular modules used to do synthesis. Language, voice and module specific relations can be easily created and manipulated. However, within our basic voices we have followed a number of conventions that should be followed if you wish to use some of the existing modules.

The relation names used will depend on the particular structure chosen for your voice. So far most of our released voices have the same basic structure though some of our research voices contain quite a different set of relations. For our basic English voices the relations we have used are as follows:

Text: Contains a single item which contains a feature with the input character string that is being synthesized
Token: A list of trees where the root of each tree is a whitespace-separated tokenized object from the input character string. Punctuation and whitespace have been stripped and placed on features on these token items. The daughters of each of these roots are the list of words that the token is associated with. In many cases this is a one-to-one relationship, but in general it is one to zero or more. For example tokens comprising digits will typically be associated with a number of words.
Word: The words in the utterance. By word we typically mean something that can be given a pronunciation from a lexicon (or letter-to-sound rules). However in most of our voices we distinguish pronunciation by the words and a part of speech feature. Words may also be leaves of the Token relation, leaves of the Phrase relation and roots of the SylStructure relation.
Phrase: A simple list of trees representing the prosodic phrasing on the utterance. In our voices we only have one level of prosodic phrase below the utterance (though you can easily add a deeper hierarchy if your models require it). The tree roots are labeled with the phrase type and the leaves of these trees are in the Word relation.
Syllable: A simple list of syllable items. These syllable items are intermediate nodes in the SylStructure relation allowing access to the words these syllables are in and the segments that are in these syllables. In this format no further onset/coda distinction is made explicit but can be derived from this information.
Segment: A simple list of segment (phone) items. These form the leaves of the SylStructure relation through which we can find where each segment is placed within its syllable and word. By convention silence phones do not appear in any syllable (or word) but will exist in the segment relation.
SylStructure: A list of tree structures over the items in the Word, Syllable and Segment items.
IntEvent: A simple list of intonation events (accents and boundaries). These are related to syllables through the Intonation relation.
Intonation: A list of trees whose roots are items in the Syllable relation, and daughters are in the IntEvent relation. It is assumed that a syllable may have a number of intonation events associated with it (at least accents and boundaries), but an intonation event may only be associated with one syllable.
Wave: A relation consisting of a single item that has a feature with the synthesized waveform.
Target: A list of trees whose roots are segments and daughters are F0 target points. This is only used by some intonation modules.
Unit, SourceSegments, Frames, SourceCoef, TargetCoef: A number of relations used by the UniSyn module.
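
The idea of one item living in several relations can be modelled outside Festival. The following is a minimal Python sketch, not Festival's API: the classes Item and Utterance, their methods, and the phone names are all hypothetical, chosen only to show a segment item that is reachable both through a flat Segment list and through a SylStructure tree.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity: the same item object is shared
class Item:
    """Hypothetical stand-in for a Festival item: a bag of features."""
    features: dict = field(default_factory=dict)


@dataclass
class Utterance:
    """Hypothetical utterance: named relations over shared items."""
    lists: dict = field(default_factory=dict)  # relation name -> ordered items
    trees: dict = field(default_factory=dict)  # relation name -> {id(parent): children}

    def append(self, relation, item):
        """Add an item to the end of a list relation."""
        self.lists.setdefault(relation, []).append(item)
        return item

    def add_daughter(self, relation, parent, child):
        """Link child under parent within a tree relation."""
        self.trees.setdefault(relation, {}).setdefault(id(parent), []).append(child)
        return child

    def daughters(self, relation, parent):
        return self.trees.get(relation, {}).get(id(parent), [])


# Build a toy utterance: one word, one syllable, two segments.
utt = Utterance()
word = utt.append("Word", Item({"name": "hi"}))
syl = utt.append("Syllable", Item({"stress": 1}))
segs = [utt.append("Segment", Item({"name": p})) for p in ("h", "ay")]

# SylStructure links the same item objects into a tree: word -> syllable -> segments.
utt.add_daughter("SylStructure", word, syl)
for seg in segs:
    utt.add_daughter("SylStructure", syl, seg)

# The first segment is the same object whichever relation we reach it through.
first_in_list = utt.lists["Segment"][0]
first_in_tree = utt.daughters("SylStructure", syl)[0]
print(first_in_list is first_in_tree)
```

Storing the trees as links between existing items, rather than as copies, is what lets a change to a segment's features be seen from every relation that contains it.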

Modules

The basic synthesis process in Festival is viewed as applying a set of modules to an utterance. Each module will access various relations and items and potentially generate new features, items and relations. Thus as the modules are applied the utterance structure is filled in with more and more relations until ultimately the waveform is generated.

Modules may be written in C++ or Scheme. Which modules are executed is defined in terms of the utterance type, a simple feature on the utterance itself. For most text-to-speech cases this is defined to be of type Tokens. The function utt.synth simply looks up an utterance's type, then looks up the definition of the synthesis process for that type and applies the named modules. Synthesis types may be defined using the function defUttType. For example, the definition for utterances of type Tokens is:

(defUttType Tokens
  (Token_POS utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Pauses utt)
  (Intonation utt)
  (PostLex utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt)
)

A simpler case is when the input is phone names and we don't wish to do all that text analysis and prosody prediction. Then we use the type Phones, which simply loads the phones, applies fixed prosody and then synthesizes the waveform:

(defUttType Phones
  (Initialize utt)
  (Fixed_Prosody utt)
  (Wave_Synth utt)
)

In general the modules named in the type definitions are general and actually allow further selection of more specific modules within them. For example the Duration module respects the global parameter Duration_Method and will call the desired duration module depending on this value.

When building a new voice you will probably not need to change any of these definitions, though you may wish to add a new module and we will show how to do that without requiring any change to the synthesis definitions elsewhere.

There are many modules in the system, some of them simply wrappers that choose between other modules. However, the basic modules used for text-to-speech have the following basic functions:

Token_POS: basic token identification, used for homograph disambiguation
Token: Apply the token to word rules building the Word relation.
POS: A standard part of speech tagger (if desired)
Phrasify: Build the Phrase relation using the specified method. Various are offered, from statistically trained models to simple CART trees.
Word: Lexical look up building the Syllable and Segment relations and the SylStructure relation that relates these together.
Pauses: Prediction of pauses, inserting silence into the Segment relation, again through a choice of different prediction mechanisms.
Intonation: Prediction of accents and boundaries, building the IntEvent relation and the Intonation relation that links IntEvents to syllables. This can easily be parameterized for most practical intonation theories.
PostLex: Post lexicon rules that can modify segments based on their context. This is used for things like vowel reduction, contractions, etc.
Duration: Prediction of durations of segments.
Int_Targets: The second part of intonation. This creates the Target relation representing the desired F0 contour.
Wave_Synth: A rather general function that in turn calls the appropriate method to actually generate the waveform.
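
The way an utterance type selects an ordered list of modules can be sketched in a few lines. This is a toy Python model under stated assumptions, not Festival code: def_utt_type and utt_synth only mimic the roles of defUttType and utt.synth, the utterance is a plain dictionary, and the three modules are hypothetical placeholders that each add one relation.

```python
# Hypothetical registry: utterance type name -> ordered list of modules.
UTT_TYPES = {}


def def_utt_type(name, modules):
    """Register the modules to run for utterances of this type."""
    UTT_TYPES[name] = list(modules)


def utt_synth(utt):
    """Look up the utterance's type and apply its modules in order."""
    for module in UTT_TYPES[utt["type"]]:
        module(utt)
    return utt


# Toy modules: each reads what earlier modules built and adds a new relation.
def token(utt):
    utt["relations"]["Token"] = utt["text"].split()


def word(utt):
    tokens = utt["relations"]["Token"]
    utt["relations"]["Word"] = [t.lower().strip(".,!?") for t in tokens]


def wave_synth(utt):
    # Placeholder "waveform": one marker per word, not real audio.
    utt["relations"]["Wave"] = [f"<audio:{w}>" for w in utt["relations"]["Word"]]


def_utt_type("Tokens", [token, word, wave_synth])

utt = utt_synth({"type": "Tokens", "text": "Hello world.", "relations": {}})
print(list(utt["relations"]))
```

Because the synthesis function only consults the registry, a new module can be added to a type's list without touching the code that runs the pipeline.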

Festival Scheme Specifics

There are a number of additions to SIOD (Scheme In One Defun, the Scheme interpreter that Festival builds on) that are Festival specific, though still part of the Lisp system rather than the synthesis functions per se.

Documentation Strings

By convention if the first statement of a function is a string, it is treated as a documentation string. The string will be printed when help is requested for that function symbol.

:backtrace and set_backtrace

In interactive mode, if the function :backtrace is called (within parentheses) the previous stack trace is displayed. Calling :backtrace with a numeric argument will display that particular stack frame in full. Note that any command other than :backtrace will reset the trace. You may optionally call:

(set_backtrace t)

which will cause a backtrace to be displayed whenever a Scheme error occurs. This can be put in your .festivalrc if you wish. This is especially useful when running Festival in non-interactive mode (batch or script mode) so that more information is printed when an error occurs.

Hooks

A hook in Lisp terms is a position within some piece of code where a user may specify their own customization. The notion is used heavily in Emacs. In Festival there are a number of places where hooks are used. A hook variable contains either a function or a list of functions that are to be applied at some point in the processing. For example the after_synth_hooks are applied after synthesis to allow specific customization such as resampling or modification of the gain of the synthesized waveform. The Scheme function apply_hooks takes a hook variable and an object as arguments and applies the function or list of functions in turn to the object.
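
The hook mechanism described above can be sketched in Python. This is an illustration, not Festival's implementation: it assumes that each hook returns the (possibly modified) object and that the result is passed to the next hook. The two hook functions and the list-of-samples "waveform" are hypothetical.

```python
def apply_hooks(hooks, obj):
    """Apply a single function, or a list of functions in order, to obj.

    Assumption: each hook returns the object that the next hook receives.
    """
    if hooks is None:
        return obj
    if callable(hooks):
        hooks = [hooks]
    for hook in hooks:
        obj = hook(obj)
    return obj


# Toy "waveform": a list of samples. Real hooks would act on an utterance.
def halve_gain(samples):
    return [s * 0.5 for s in samples]


def drop_every_other_sample(samples):
    return samples[::2]


after_synth_hooks = [halve_gain, drop_every_other_sample]
result = apply_hooks(after_synth_hooks, [2.0, 4.0, 6.0, 8.0])
print(result)
```

Accepting either one function or a list keeps the common case (a single customization) short while still allowing several customizations to be chained.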

Errors: unwind-protect

When an error occurs in either Scheme or within the C++ part of Festival, by default the system jumps to the top level, resets itself and continues. Note that errors are usually serious things, pointing to bugs in parameters or code. Every effort has been made to ensure that the processing of text never causes errors in Festival. However, when using Festival as a development system, errors often occur in code.

Sometimes in writing Scheme code you know there is a potential for an error but you wish to ignore that and continue on to the next thing without exiting or stopping and returning to the top level. For example you are processing a number of utterances from a database and some files containing the descriptions have errors in them but you want your processing to continue through every utterance that can be processed rather than stopping 5 minutes after you left for home after setting a big batch job for overnight.

Festival's Scheme provides the function unwind-protect, which allows the catching of errors and then continuing normally. For example suppose you have the function process_utt which takes a filename and does things which you know might cause an error. You can write the following to ensure you continue processing even if an error occurs.

(unwind-protect
 (process_utt filename)
 (begin
   (format t "Error found in processing %s\n" filename)
   (format t "continuing\n")))

The unwind-protect function takes two arguments. The first is evaluated and if no error occurs the value returned from that expression is returned. If an error does occur while evaluating the first expression, the second expression is evaluated. unwind-protect may be used recursively. Note that all files opened while evaluating the first expression are closed if an error occurs. All global variables outside the scope of the unwind-protect will be left as they were set up until the error. Care should be taken in using this function but its power is necessary to be able to write robust Scheme code.
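
For readers more used to exception handling, the overnight batch scenario above corresponds roughly to a try/except inside a loop. The Python sketch below is only an analogy: process_utt here is a hypothetical function that fails on file names ending in .bad, and the except branch plays the role of unwind-protect's second expression.

```python
def process_utt(filename):
    """Hypothetical processing step that fails on bad input."""
    if filename.endswith(".bad"):
        raise ValueError(f"cannot parse {filename}")
    return f"processed {filename}"


def process_all(filenames):
    """Process every file, recording failures instead of stopping."""
    done, failed = [], []
    for filename in filenames:
        try:
            done.append(process_utt(filename))
        except Exception as err:
            # Plays the role of unwind-protect's second expression.
            print(f"Error found in processing {filename}: {err}")
            print("continuing")
            failed.append(filename)
    return done, failed


done, failed = process_all(["utt1.utt", "utt2.bad", "utt3.utt"])
print(done, failed)
```

Collecting the failed file names, rather than only printing them, makes it easy to rerun just the problem cases the next morning.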

Scheme I/O

Different Schemes may have quite different implementations of file I/O functions, so in this section we will describe the basic functions in Festival SIOD regarding I/O.

Simple printing to the screen may be achieved with the function print which prints the given s-expression to the screen. The printed form is preceded by a new line. This is often useful for debugging but isn't really powerful enough for much else.

Files may be opened and closed and referred to by file descriptors in a direct analogy to C's stdio library. The SIOD functions fopen and fclose work in exactly the same way as their equivalently named partners in C.

The format command follows the command of the same name in Emacs and a number of other Lisps. C programmers can think of it as fprintf. format takes a file descriptor, format string and arguments to print. The file descriptor argument may be a file descriptor as returned by the Scheme function fopen; it may also be t, which means the output will be directed to standard out (cf. printf). A third possibility is nil, which will cause the output to be printed to a string which is returned (cf. sprintf).

The format string closely follows the format strings in ANSI C, but it is not the same. Specifically the directives currently supported are %%, %d, %x, %s, %f, %g and %c. All modifiers for these are also supported. In addition %l is provided for printing of Scheme objects as objects.

For example

(format t "%03d %3.4f %s %l %l %l\n" 23 23 "abc" "abc" '(a b d) utt1)

will produce

023 23.0000 abc "abc" (a b d) #<Utterance 32f228>

on standard output.
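
The three kinds of destination that format accepts (a file descriptor, t, or nil) can be mimicked in Python to make the fprintf/printf/sprintf analogy concrete. This is a sketch under assumptions: the function name fmt is hypothetical, True and None stand in for t and nil, and Python's % operator is used for the C-style directives. The %l directive has no Python counterpart and is left out.

```python
import io
import sys


def fmt(dest, template, *args):
    """Format args into template and send the text to dest.

    dest is True  -> write to standard output (cf. printf)
    dest is None  -> return the text as a string (cf. sprintf)
    otherwise     -> write to the given file object (cf. fprintf)
    """
    text = template % args
    if dest is None:
        return text
    stream = sys.stdout if dest is True else dest
    stream.write(text)
    return None


# Return a string instead of printing it.
line = fmt(None, "%03d %3.4f %s", 23, 23, "abc")

# Write to an in-memory file object.
buffer = io.StringIO()
fmt(buffer, "%d items\n", 5)

# Print to standard output.
fmt(True, "%s\n", line)
```

The string built here uses the same first three directives and arguments as the format example above, so its fields can be compared directly with the start of that output line.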

When large Lisp expressions are printed they are difficult to read because of the parentheses. The function pprintf prints an expression to a file descriptor (or t for standard out). It prints so that the s-expression is nicely lined up and indented. This is often called pretty printing in computing.

For reading input from terminal or file, there is currently no equivalent to scanf. Items may only be read as Scheme expressions. The command

(load FILENAME t)

will load all s-expressions in FILENAME and return them, unevaluated, as a list. Without the second argument the load function will load and evaluate each s-expression in the file.

To read individual s-expressions use readfp. For example

(let ((fd (fopen trainfile "r"))
      (entry)
      (count 0))
    (while (not (equal? (set! entry (readfp fd)) (eof-val)))
     (if (string-equal (car entry) "home")
        (set! count (+ 1 count))))
    (fclose fd))

To convert a symbol whose print name is a number to a number use parse-number. This is the equivalent of atof in C.

Note that all I/O from Scheme input files is assumed to be basically some form of Scheme data (though it can be just numbers or tokens). For more elaborate analysis of incoming data it is possible to use the text tokenization functions, which offer a fully programmable method of reading data.

Text to Speech (TTS)

Festival supports text to speech for raw text files. If you are not interested in using Festival in any other way except as a black box for rendering text as speech, the following method is probably what you want.

festival --tts myfile

This will say the contents of myfile. Alternatively text may be submitted on standard input

echo "hello world" | festival --tts
cat myfile | festival --tts

Festival supports the notion of text modes, where the text file type may be identified, allowing Festival to process the file in an appropriate way. Currently only two types are considered stable, STML and raw, but other types such as email, HTML, LaTeX, etc. are being developed and are discussed below. This follows the idea of buffer modes in Emacs where a file's type can be utilized to best display the text. Text mode may also be selected based on a filename's extension.

Within the command interpreter the function tts is used to render files as speech; it takes a filename and the text mode as arguments.

Utterance Chunking: From text to utterances
Text Modes: Mode specific text analysis
Example Text Mode: An example mode for reading email

Utterance Chunking: From text to utterances

Text to speech works by first tokenizing the file and chunking the tokens into utterances. The definition of utterance breaks is determined by the utterance tree in variable eou_tree. A default version is given in lib/tts.scm. This uses a decision tree to determine what signifies an utterance break. Obviously blank lines are probably the most reliable, followed by certain punctuation. The confusion of the use of periods for both sentence breaks and abbreviations requires some more heuristics to best guess their different use. The following tree is currently used which works better than simply using punctuation:

(defvar eou_tree
'((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; 2 or more newlines
  ((1))
  ((punc in ("?" ":" "!"))
   ((1))
   ((punc is ".")
    ;; This is to distinguish abbreviations vs periods
    ;; These are heuristics
    ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
     ((n.whitespace is " ")
      ((0))                  ;; if abbrev single space isn't enough for break
      ((n.name matches "[A-Z].*")
       ((1))
       ((0))))
     ((n.whitespace is " ")  ;; if it doesn't look like an abbreviation
      ((n.name matches "[A-Z].*")  ;; single space and non-cap is no break
       ((1))
       ((0)))
      ((1))))
    ((0)))))

The token items this is applied to will always (except in the end of file case) include one following token, so look ahead is possible. The "n." and "p." and "p.p." prefixes allow access to the surrounding token context. The features name, whitespace and punc allow access to the contents of the token itself. At present there is no way to access the lexicon from this tree, which unfortunately might be useful if certain abbreviations were identified as such there.

Note these are heuristics and written by hand not trained from data, though problems have been fixed as they have been observed in data. The above rules may make mistakes where abbreviations appear at end of lines, and when improper spacing and capitalization is used. This is probably worth changing for modes where more casual text appears, such as email messages and USENET news messages. A possible improvement could be made by analysing a text to find out its basic threshold of utterance break (i.e. if no full stop, two spaces, followed by a capitalized word sequences appear and the text is of a reasonable length then look for other criteria for utterance breaks).

Ultimately what we are trying to do is to chunk the text into utterances that can be synthesized quickly and start to play them quickly to minimise the time someone has to wait for the first sound when starting synthesis. Thus it would be better if this chunking were done on prosodic phrases rather than chunks more similar to linguistic sentences. Prosodic phrases are bounded in size (that is, not very long), while sentences are not.
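
The eou_tree heuristics can be restated as ordinary conditionals, which may make the branch order easier to follow. The Python function below is a hand translation, not code from Festival: the function name and argument names are hypothetical, and it assumes that the tree's "matches" test must match the whole string, which is why re.fullmatch is used.

```python
import re


def is_utterance_end(name, punc, next_whitespace, next_name):
    """Return True if the token ends an utterance, following eou_tree."""
    # Two or more newlines after the token: always a break.
    if re.fullmatch(r".*\n.*\n(.|\n)*", next_whitespace):
        return True
    if punc in ("?", ":", "!"):
        return True
    if punc != ".":
        return False

    next_is_capitalized = re.fullmatch(r"[A-Z].*", next_name) is not None
    looks_like_abbrev = re.fullmatch(
        r"(.*\..*|[A-Z][A-Za-z]?[A-Za-z]?|etc)", name) is not None

    if looks_like_abbrev:
        # A single space after an abbreviation is not enough for a break.
        if next_whitespace == " ":
            return False
        return next_is_capitalized
    if next_whitespace == " ":
        # Single space: break only if the next token is capitalized.
        return next_is_capitalized
    return True


# Each call describes one token and the token that follows it.
print(is_utterance_end("home", ".", " ", "The"))    # ordinary sentence end
print(is_utterance_end("Dr", ".", " ", "Smith"))    # abbreviation, single space
print(is_utterance_end("end", ".", "\n\n", "next")) # blank line follows
```

Written this way, the weakness noted above is easy to see: a short capitalized word at the end of a sentence, followed by a single space, is treated as an abbreviation and never produces a break.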

Text Modes: Mode specific text analysis

We do not believe that all texts are of the same type. Often information about the general contents of a file will aid synthesis greatly. For example in LaTeX files we do not want to hear "left brace, backslash e m" before each emphasized word, nor do we necessarily want to hear formatting commands. Festival offers a basic method for specifying customization rules depending on the mode of the text. By type we are following the notion of modes in Emacs and eventually will allow customization at a similar level.

Modes are specified as the second argument to the function tts. When using the Emacs interface to Festival the buffer mode is automatically passed as the text mode. If the mode is not supported a warning message is printed and the raw text mode is used.

Our initial text mode implementation allows configuration both in C++ and in Scheme. Obviously in C++ almost anything can be done but it is not as easy to reconfigure without recompilation. Here we will discuss those modes which can be fully configured at run time.

A text mode may contain the following:

filter: A Unix shell program filter that processes the text file in some appropriate way. For example for email it might remove uninteresting headers and just output the subject, from line and the message body. If not specified, an identity filter is used.
init_function: This (Scheme) function will be called before any processing will be done. It allows further set up of tokenization rules and voices etc.
exit_function: This (Scheme) function will be called at the end of any processing, allowing resetting of tokenization rules etc.
analysis_mode: If analysis mode is xml the file is read through the built-in XML parser rxp. Alternatively if analysis mode is xxml the filter should be an SGML normalising parser and the output is processed in a way suitable for it. Any other value is ignored.

These mode specific parameters are specified in the a-list held in tts_text_modes.

When using Festival in Emacs the Emacs buffer mode is passed to Festival as the text mode.

Note that the above mechanism is not really designed to be re-entrant; this should be addressed in later versions.

Following the use of auto-selection of mode in Emacs, Festival can auto-select the text mode based on the filename given when no explicit mode is given. The Lisp variable auto-text-mode-alist is a list of dotted pairs of regular expression and mode name. For example to specify that the email mode is to be used for files ending in .email we would add to the current auto-text-mode-alist as follows:

(set! auto-text-mode-alist
      (cons (cons "\\.email$" 'email)
            auto-text-mode-alist))

If the function tts is called with a mode other than nil that mode overrides any specified by the auto-text-mode-alist. The mode fundamental is the explicit null mode, it is used when no mode is specified in the function tts, and no match is found in auto-text-mode-alist or the specified mode is not found.
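
The precedence between an explicit mode, the filename patterns and the fundamental fallback can be summarized in a short Python sketch. It is an illustration only: resolve_text_mode and known_modes are hypothetical names, the pattern list merely imitates auto-text-mode-alist, the patterns are assumed to match anywhere in the filename (hence re.search), and the automatic loading of mode files is not modelled.

```python
import re

# Hypothetical stand-in for auto-text-mode-alist: (pattern, mode name) pairs.
auto_text_mode_alist = [
    (r"\.html?$", "html"),
    (r"\.email$", "email"),
]

# Hypothetical set of modes that are defined or loadable.
known_modes = {"html", "email", "fundamental"}


def resolve_text_mode(filename, mode=None):
    """Pick the text mode for a file.

    An explicit mode wins; otherwise the first matching filename pattern
    is used; a missing or unknown mode falls back to fundamental.
    """
    if mode is None:
        for pattern, candidate in auto_text_mode_alist:
            if re.search(pattern, filename):
                mode = candidate
                break
    if mode not in known_modes:
        return "fundamental"
    return mode


print(resolve_text_mode("example.email"))
print(resolve_text_mode("example.email", "html"))
print(resolve_text_mode("notes.txt"))
```

Because new pairs are consed onto the front of the list in the Scheme example, the most recently added pattern is tried first; the sketch keeps that first-match-wins behaviour.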

By convention if a requested text mode is not found in tts_text_modes the file MODENAME-mode will be required. Therefore if you have the file MODENAME-mode.scm in your library then it will be automatically loaded on reference. Modes may be quite large and it is not necessary to have Festival load them all at start up time.

Because of the auto-text-mode-alist and the auto loading of currently undefined text modes you can use Festival like

festival --tts example.email

Festival will then automatically synthesize example.email in text mode email.

If you add your own personal text modes you should do the following. Suppose you've written an HTML mode. You have named it html-mode.scm and put it in /home/awb/lib/festival/. In your .festivalrc first identify your personal Festival library directory by adding it to lib-path:

(set! lib-path (cons "/home/awb/lib/festival/" lib-path))

Then add the definition to the auto-text-mode-alist that file names ending .html or .htm should be read in HTML mode:

(set! auto-text-mode-alist
      (cons (cons "\\.html?$" 'html)
            auto-text-mode-alist))

Then you may synthesize an HTML file either from Scheme:

(tts "example.html" nil)

Or from the shell command line:

festival --tts example.html

Anyone familiar with modes in Emacs should recognise that the process of adding a new text mode to Festival is very similar to adding a new buffer mode to Emacs.

An example mode for reading email

XML/SGML Mark-Up*

(From XML/SGML Mark-Up)

Phonesets*

(From Phonesets)

Lexicons*

(From Lexicons)

Utterances*

(From Utterances)

You can change voices like this:

text2wave -o output.wav text.to.speak.txt -eval "(voice_us1_mbrola)"

voice_us1_mbrola: voice type
output.wav: output audio file
text.to.speak.txt: input text file