Text To Speech (TTS) Software

Text-to-speech (TTS) technology is a form of assistive technology that converts written text into spoken words. This technology has been widely used in various applications, including screen readers, voice assistants, and language translation tools. TTS engines work by processing text input and generating synthetic speech output that resembles human speech.

Mbrola

Mbrola is Thierry Dutoit's diphone synthesizer for multilingual speech synthesis. The various diphone databases are distributed as separate packages, but for licensing reasons they may be used only with Mbrola. Read the accompanying copyright files for details.

Mbrola itself doesn't provide full TTS. It is a speech synthesizer based on the concatenation of diphones. It takes a list of phonemes as input, together with prosodic information (the duration of each phoneme and a piecewise linear description of pitch), and produces 16-bit linear PCM speech samples at the sampling frequency of the diphone database.

Use Mbrola together with a front end such as Freephone, Cicero, or eSpeak to obtain complete text-to-speech in English.
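
As a minimal illustration of this interface, the following Python sketch writes a tiny .pho file and feeds it to the mbrola binary. The phoneme symbols, durations, and pitch targets are invented for illustration, and the database argument must point at an installed diphone database (for example, /usr/share/mbrola/en1/en1 on Debian-style systems):

    import subprocess

    # A .pho file lists one phoneme per line: its name, a duration in
    # milliseconds, and optional (position %, pitch Hz) pairs describing
    # the pitch contour. Lines starting with ';' are comments.
    # The symbols and values below are purely illustrative.
    PHO = """\
    ; "hello", roughly
    _ 50
    h 80
    @ 60 50 120
    l 70
    @U 180 20 130 80 110
    _ 50
    """

    with open("hello.pho", "w") as f:
        f.write(PHO)

    # Usage: mbrola <diphone database> <pho file> <output file>.
    # Adjust the database path to your installation.
    subprocess.run(["mbrola", "/usr/share/mbrola/en1/en1",
                    "hello.pho", "hello.wav"],
                   check=True)

In practice these .pho files are generated by a front end rather than written by hand; eSpeak, for example, can drive Mbrola voices directly (they appear as voices named mb-en1, mb-es1, and so on).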

A Survey of TTS Software

I initially settled on Festival, then chose eSpeak because I wanted Castilian Spanish speech generation and output to a file, albeit a .wav one. Now, both Festival and eSpeak recognize SSML tags. I think Festival is suitable for research in sound and linguistics, whereas currently I just want decent speech from marked-up text.

This way I can concentrate on SSML markup instead of on command-line switches and Festival's Scheme niceties.
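
As a concrete example of that workflow, this Python sketch (flags and file names chosen for illustration) passes SSML-marked text to espeak, selects the Spanish voice, and writes the output to a .wav file; espeak implements a subset of SSML and ignores tags it does not support:

    import subprocess

    ssml = ('<speak>Hola.<break time="500ms"/>'
            '<prosody rate="slow">Buenos días.</prosody></speak>')

    # -v es : Castilian Spanish voice
    # -m    : interpret the input as SSML/HTML markup
    # -w    : write a WAV file instead of playing the audio
    subprocess.run(["espeak", "-v", "es", "-m", "-w", "hola.wav", ssml],
                   check=True)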

Mozilla TTS (Mozilla TTS at GitHub)

Mozilla TTS is an open-source text-to-speech engine developed by Mozilla Research. It offers developers a high-quality and customizable text-to-speech solution. Supporting multiple languages and voices, Mozilla TTS is a versatile option for a wide range of applications.

Some key features of Mozilla TTS include:

  1. Cross-platform compatibility: Mozilla TTS is designed to work across different operating systems, including Windows, macOS, and Linux, making it widely accessible and versatile.
  2. Multilingual support: The engine supports multiple languages, enabling developers to create speech synthesis applications that cater to diverse linguistic needs.
  3. High-quality voices: Mozilla TTS employs advanced speech synthesis techniques to generate natural-sounding voices, ensuring a seamless and pleasant user experience.
  4. Open source: Mozilla TTS is an open-source project that allows developers to access, modify, and contribute to the codebase, fostering collaboration and innovation within the speech synthesis community.
  5. Integration with web technologies: Mozilla TTS is well suited to web-based applications and services; the project ships a small demo server through which the synthesizer can be queried from web pages.
Mozilla TTS is part of Mozilla's broader efforts to promote open standards, accessibility, and innovation on the web. By providing an open-source speech synthesis engine, Mozilla aims to empower developers and researchers to create speech-enabled applications and to advance text-to-speech technology.
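
Mozilla TTS is a Python project, and its development has continued in the Coqui TTS fork, which exposes a compact Python API. A minimal sketch of synthesis from Python, assuming the Coqui package (pip install TTS) and one of the English models published in its model zoo (the model tag below is one example):

    from TTS.api import TTS

    # Load a published model by its model-zoo tag (downloads on first use).
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

    # Synthesize straight to a WAV file.
    tts.tts_to_file(text="Hello from an open-source TTS engine.",
                    file_path="hello.wav")
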
eSpeak

eSpeak is an open-source TTS engine originally developed for Linux and other Unix-like operating systems; its actively maintained successor is eSpeak NG. As a compact and lightweight solution, eSpeak NG offers basic speech synthesis with support for more than 50 languages and their pronunciation rules. Its simplicity and ease of use make it an attractive choice for Linux users who want a straightforward TTS solution for basic text-to-speech conversion, command-line utilities, and accessibility features.

While eSpeak may not offer the advanced customization options or premium voice quality of some commercial TTS engines, its open-source nature and extensive language support make it a valuable addition to the Linux software ecosystem.

Festival

Festival is a comprehensive TTS system developed by the University of Edinburgh, offering extensive support for Linux and other Unix-based platforms. Festival distinguishes itself with its modular architecture and flexible design, allowing users to customize and extend its functionality through a variety of plugins, language models, and voice synthesis techniques. It currently supports five languages (British English, American English, Spanish, Czech, and Italian), with many more available in prototype form.

With its powerful scripting capabilities and extensive documentation, Festival is well-suited for advanced users, researchers, and developers seeking to explore the depths of speech synthesis technology on Linux. Despite its steep learning curve, Festival remains a popular choice among Linux enthusiasts and academics for its robustness, extensibility, and support for cutting-edge research in TTS and natural language processing.
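
For simple conversion jobs, Festival is easiest to drive through its command-line helpers rather than the Scheme interpreter. A small sketch using the text2wave script that ships with Festival (file names are placeholders):

    import subprocess

    # text2wave reads text from a file and writes a WAV file.
    with open("input.txt", "w") as f:
        f.write("Festival converts this sentence to speech.")

    subprocess.run(["text2wave", "-o", "output.wav", "input.txt"],
                   check=True)

    # To speak text directly instead, pipe it to the main binary:
    #   echo "Hello" | festival --tts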

Flite

Flite is a lightweight and fast open source TTS engine developed by Carnegie Mellon University. It is designed for embedded systems and mobile devices, making it a popular choice for resource-constrained environments. Flite offers clear and natural-sounding speech synthesis for various applications.

Some key points about Flite TTS:

  • Lightweight: Flite is designed to be a small engine suitable for embedded systems and devices with limited resources. The entire engine is around 5MB in size.
  • Open Source: Flite is an open source project released under a permissive license allowing free commercial and non-commercial use.
  • Multilingual: While English is by far the best-supported language, voices for other languages (for example, several Indic languages) have also been built with Flite.
  • Synthesis Technique: It uses concatenative synthesis combined with deterministic unit selection to generate speech output.
  • Input Formats: Flite can process plain text, SSML markup, and its own custom XML format.
  • Programming APIs: It provides C/C++, Python and other programming language APIs for integrating TTS into applications.
  • Multiple Voices: For some languages, like English, multiple voices with varying characteristics (age, gender, etc.) are provided.
  • Fast Performance: Flite aims to maximize CPU execution speed while keeping output intelligibility high.
Flite is suitable for applications needing a small, lightweight, and efficient embedded TTS engine that can run on low-resource devices such as smartphones, embedded systems, and IoT devices. Its open nature allows customization for specific use cases.
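
The quickest way to try Flite is its command-line tool; a minimal sketch follows (the voice name and file names are illustrative), with the same operation also available from C through the flite library:

    import subprocess

    # -voice slt : one of the bundled English voices
    # -t         : text to synthesize
    # -o         : output WAV file ("play" would send it to the speakers)
    subprocess.run(["flite", "-voice", "slt",
                    "-t", "Flite is small and fast.",
                    "-o", "flite.wav"],
                   check=True)
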
Pico TTS

Pico TTS is a small and efficient open-source TTS engine optimized for mobile devices. It offers high-quality speech synthesis with minimal resource usage, making it ideal for smartphones and tablets, and a reliable option for developers who need a compact TTS solution. It originated as SVOX Pico, an embeddable text-to-speech engine developed by the SVOX company.

Here are some key points about Pico TTS:

  • Small Footprint: One of Pico TTS's distinguishing features is its very small size. The complete engine is just around 0.5MB, making it suitable for embedded systems.
  • Cross-Platform: It is written in C and can run on multiple platforms and architectures, such as ARM, x86, and MIPS.
  • Multilingual: Pico provides voices for several widely spoken languages, including English, German, French, Spanish, and Italian.
  • Open Source: The Pico engine was open sourced under the Apache 2.0 license as part of the Android Open Source Project (SVOX itself was later acquired by Nuance).
  • Synthesis Technique: It uses a compact form of concatenative synthesis coupled with prosodic modelling.
  • APIs: C/C++ APIs are provided to integrate Pico into applications and devices.
  • Low Resource Usage: It is designed for low memory usage and minimal CPU requirements during runtime.

Pico TTS is optimized for applications and products that require a small TTS engine footprint while retaining reasonable speech quality, such as IoT devices, wearables, embedded systems, or mobile apps where disk space and memory are limited. Its open-source nature also allows customization.
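
On Debian-style systems the engine is exposed through a small pico2wave front end; a minimal sketch, assuming that tool is installed (package names vary by distribution):

    import subprocess

    # -l : language/voice; Pico ships en-US, en-GB, de-DE, es-ES, fr-FR, it-IT
    # -w : output WAV file (required)
    subprocess.run(["pico2wave", "-l", "en-US", "-w", "pico.wav",
                    "Pico keeps its footprint small."],
                   check=True)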

Mimic

Mimic is a lightweight and fast open source TTS engine developed by Mycroft AI. It offers natural-sounding speech synthesis with support for multiple languages and voices. Mimic is designed for voice assistants and other interactive applications requiring real-time speech output.

Here are some key points about Mimic TTS:

  • Neural TTS: Current versions of Mimic (Mimic 3) use neural network models and deep learning for speech synthesis rather than the older concatenative approach of the original, Flite-derived Mimic 1. This allows them to produce more natural-sounding speech.
  • Open Source: The engine and pre-trained models are released under an open-source Apache 2.0 license.
  • Multi-Speaker: In addition to standard TTS voices, Mimic can generate audio in the voice style and characteristics of specific speakers by training on that person's voice data.
  • Low Footprint: Mimic is designed to have a small disk and memory footprint suitable for running on devices like smartphones, IoT hardware etc.
  • Cross-Platform: It supports multiple platforms, including Linux, Windows, and macOS.
  • Customizable: Mimic is open source; developers can retrain its models on custom data to build new voices or fine-tune existing ones.
  • Multi-Lingual: While English is currently the primary focus, Mimic supports other languages, such as Spanish, French, and German, to varying degrees.
  • Integrations: Mimic can be integrated into applications through its command-line tools, a local HTTP server, and a Python API.

Mimic aims to provide an open, customizable, and natural-sounding neural TTS engine that can be embedded into smart devices, voice assistants, audio apps, and other use cases that require low footprint but high-quality speech synthesis.
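
For example, Mimic 3 ships a command-line tool that writes WAV audio to standard output; a sketch, assuming the mimic3 package is installed (the voice tag below is one of its published voices, used here purely for illustration):

    import subprocess

    # mimic3 prints WAV data on stdout; redirect it into a file.
    with open("mimic.wav", "wb") as out:
        subprocess.run(["mimic3", "--voice", "en_US/ljspeech_low",
                        "Mimic speaks."],
                       stdout=out, check=True)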

Tacotron

Tacotron is not a packaged TTS engine but a neural network architecture for speech synthesis developed by Google's AI research team. It uses deep learning techniques to generate natural-sounding speech, and extensions of it support expressive and emotional speech styles, making it suitable for advanced applications. Its successor, Tacotron 2, described below, is the best-known variant.

Some key points about Tacotron 2:

  • Neural TTS: It is an end-to-end neural network model trained to map text directly to speech audio, replacing the hand-engineered linguistic features and signal-processing pipelines of earlier systems.
  • Sequence-to-Sequence Model: Tacotron 2 uses an encoder-decoder architecture with attention, treating speech synthesis as a sequence-to-sequence problem.
  • Natural Synthesis: It produces highly natural-sounding synthesized speech compared to older concatenative or statistical parametric methods.
  • Speaker Adaptation: The model can be fine-tuned on a new speaker's voice data to generate audio mimicking that speaker's vocal characteristics.
  • WaveNet Integration: Tacotron 2 generates mel spectrograms fed to a modified WaveNet model to produce the final time-domain waveform audio.
  • Published Model: Pre-trained Tacotron 2 models for English, capable of generating high-quality speech, have been released publicly.
  • Open Source: Open-source implementations of Tacotron 2 are available in both TensorFlow and PyTorch.
  • Further Extensions: Researchers have built upon Tacotron 2 to create multi-speaker, multi-lingual and other extensions of the base model.

While not a full production-ready system, Tacotron 2 demonstrated significant advances in neural speech synthesis based on sequence-to-sequence models. Its open-source implementations enabled further research into highly natural and controllable TTS systems.
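
As a concrete starting point, NVIDIA publishes pre-trained Tacotron 2 and WaveGlow checkpoints on PyTorch Hub (WaveGlow standing in for the paper's WaveNet vocoder). A sketch, assuming those hub entry points, PyTorch, and a CUDA GPU:

    import torch

    hub = "NVIDIA/DeepLearningExamples:torchhub"
    tacotron2 = torch.hub.load(hub, "nvidia_tacotron2",
                               model_math="fp32").to("cuda").eval()
    waveglow = torch.hub.load(hub, "nvidia_waveglow",
                              model_math="fp32").to("cuda").eval()
    utils = torch.hub.load(hub, "nvidia_tts_utils")

    # Text -> padded character IDs -> mel spectrogram -> waveform.
    sequences, lengths = utils.prepare_input_sequence(
        ["Tacotron two maps text to audio."])
    with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)
        audio = waveglow.infer(mel)  # waveform tensor at 22050 Hz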

ESPnet-TTS

ESPnet-TTS is an open-source text-to-speech (TTS) toolkit developed by Nagoya University and others. It is based on the ESPnet framework, initially designed for speech recognition but extended to support TTS tasks. ESPnet-TTS provides a unified framework for various TTS models and allows researchers to easily train, evaluate, and deploy different TTS models.

Here are some key points about ESPnet-TTS:

  • Part of ESPnet: It is a specialized module within the larger ESPnet (End-to-End Speech Processing Toolkit) framework, which also covers tasks such as automatic speech recognition (ASR), speech translation (ST), and voice conversion (VC).
  • End-to-End TTS: ESPnet-TTS implements various end-to-end neural network models for text-to-speech synthesis without relying on traditional concatenative/statistical parametric components.
  • Model Architectures: It implements popular models such as Tacotron 2, Transformer TTS, FastSpeech, ParaNet, and others.
  • Multi-Task Training: The toolkit supports multi-task learning, allowing TTS models to be optimized jointly with related tasks such as speech recognition.
  • Multi-Lingual: While it initially focused on English, it supports building TTS systems for other languages, aided by techniques such as data augmentation.
  • Open Source: ESPnet-TTS is an open-source toolkit under the Apache 2.0 license on GitHub.
  • Used in Research: Researchers at NICT and other institutions actively use it to develop new TTS techniques and models.

So, in essence, ESPnet-TTS aims to provide an open framework for developing, training, and evaluating state-of-the-art end-to-end neural text-to-speech models across languages, leveraging techniques such as transfer learning, multi-task optimization, and data augmentation. It complements the broader speech-processing capabilities of the ESPnet toolkit.
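
For inference, ESPnet provides a high-level Text2Speech wrapper that can pull published models from its model zoo; a sketch, assuming the espnet and espnet_model_zoo packages and one of the publicly listed English model tags:

    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    # Download and load a published English model by its zoo tag.
    tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")

    result = tts("ESPnet wraps several end-to-end models behind one interface.")
    sf.write("espnet.wav", result["wav"].numpy(), tts.fs)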

Opinions and Comparisons

A 2008 mailing-list message by Jonathan Duddington, the author of eSpeak, explains:

> It would be great if somebody who thinks that Festival
> is actually worse than eSpeak in quality of speech
> could try to elaborate more about the reasons.

It depends what you mean by "quality".

There is no doubt that the good Festival voices sound more human than
eSpeak.

I'm not blind, but I use text-to-speech a lot for reading blogs, news
articles, etc.  The main reasons why I prefer to listen to eSpeak
rather than Festival are:

1.  Clarity.  The eSpeak voice (I use British English) sounds clearer,
sharper, and more articulated.  An alternative description might be
"artificial and harsh".

The perceived quality of eSpeak may depend on your loudspeakers.  I use
a domestic sound system with big speakers and it sounds good to me.
But eSpeak has less "bass" and more mid-frequencies than other
synthesizers, and perhaps that's less suitable for small computer
speakers where it sounds more "harsh"?  People have experimented with
new eSpeak "voice variants" with changes to the "tone" and "formant"
parameters to change the tonal balance.

2.  Intonation (the changes in pitch during a sentence).  Festival
seems more "flat" or "boring".  I prefer eSpeak's more lively
intonation (although that may not sound good for some languages).
Perhaps it's possible to make a new improved intonation algorithm in
Festival.

Note that you can use eSpeak as a front-end to a Mbrola diphone voice,
so you get eSpeak's intonation with a more natural sounding voice
(intonation with Mbrola was improved in eSpeak version 1.31 and later).
http://espeak.sf.net/mbrola.html.
Try comparing Festival with eSpeak+Mbrola.

> This is why eSpeak is the current default in Speech Dispatcher
> because it is initially easier to get running and it covers a great
> span of languages. The documentation, however, strongly suggests that
> users whose language is supported by Festival try it as their
> primary synthesizer for better voice quality.

That is good advice, especially since the quality of different
languages in eSpeak is very variable.

[...]