Innovative algorithm for automatic conversion of text into natural-sounding speech
Speech synthesis, also known as Text-to-Speech (TTS), is the automatic generation of speech from arbitrary textual input. Philips has developed a state-of-the-art TTS engine that combines highly natural speech quality with customizable voice and emotion expression, at a low implementation complexity.
The increasing need for natural interfaces, together with developments in linguistics, speech technology, and IC technology, makes the introduction of speech synthesis into everyday life possible. Anticipating the trend of people interacting with complex, multimodal, and personalized systems, we expect TTS to play an important role in the user interface of many applications.
Example applications include:
- Natural user interface for complex consumer products
- Hands and eyes free when on the move
- User interface for products with no or small display, e.g.:
  - personal healthcare devices
  - spoken artist and song titles for MP3 players
- Mobile gaming
- Spoken e-books
- Car navigation systems
- Communication aid for disabled and dyslexic people
- Context aware, spoken user manual/installation guide
How it works
Our algorithm delivers highly natural speech quality. It uses diphone synthesis: the concatenation of prerecorded speech segments (diphones) from a database. A diphone is the transition from one basic sound (phoneme) to the next.
Traditionally, diphone synthesis suffers from artifacts. These mainly come from mismatched joins between recorded diphones and from the modifications applied to the synthesized speech to meet prosodic requirements. Our unique IP enables us to generate artifact-free, very natural speech.
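To make the idea of diphone concatenation concrete, the sketch below is a toy illustration (not the Philips engine or its API, whose internals are proprietary): each diphone is a short run of audio samples covering a phoneme-to-phoneme transition, and synthesis strings them together, cross-fading at each join to soften the discontinuities described above.

```python
# Toy sketch of diphone concatenation, for illustration only.
# Each "diphone" is a list of audio samples covering the transition
# between two phonemes; real engines also adjust pitch and duration.

def crossfade_concat(diphones, overlap=4):
    """Concatenate diphone sample lists, blending `overlap` samples
    at each join with a linear crossfade."""
    out = list(diphones[0])
    for seg in diphones[1:]:
        n = min(overlap, len(out), len(seg))
        # Blend the tail of the output with the head of the next segment.
        for i in range(n):
            w = (i + 1) / (n + 1)
            out[-n + i] = (1 - w) * out[-n + i] + w * seg[i]
        out.extend(seg[n:])
    return out

# Hypothetical example: "cat" -> phonemes /k/ /ae/ /t/
# -> diphones k-ae and ae-t (sample values are made up).
k_ae = [0.0, 0.2, 0.5, 0.5]
ae_t = [0.5, 0.5, 0.2, 0.0]
samples = crossfade_concat([k_ae, ae_t], overlap=2)
```

The crossfade stands in for the much more sophisticated join-smoothing that an artifact-free engine performs; the point is only to show where joins occur and why mismatched segments would be audible without such treatment.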
Users can define their own personalized voice from a single database, and new voices can be added rapidly with an advanced recording tool.
There is a set of predefined characters: man, old man, old woman, boy, young girl, robot, giant, dwarf, and alien.
There is also a set of predefined emotions: friendly, angry, furious, drill, scared, emotional, weepy, excited, surprised, sad, disgusted and whisper.
Currently, supported languages are: American English, British English, French, German, Dutch, Italian, Castilian Spanish, Brazilian Portuguese, Russian, Turkish, and Mandarin Chinese.
The compact TTS engine suits embedded systems:
- the CPU load on a low-cost ARM7 processor is only 20-60 MHz,
- with 10-30 KB RAM and 450-3000 KB ROM usage.
- It runs on Windows (PC), Windows CE (PDA), ARM, TriMedia and Linux.
Natural, artifact-free speech quality
Flexible emotion control and personalization from a single database
- The voice can be precisely customized in pitch, speed, spectral shape, formant sharpening, etc.
- Set of predefined emotions (friendly, angry, furious, drill, scared, emotional, weepy, excited, surprised, sad, disgusted, whisper) and characters (man, old man, old woman, boy, young girl, robot, giant, dwarf, alien)
- Speech Synthesis Markup Language (SSML) support
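To illustrate the kind of control SSML exposes, the fragment below is standard W3C SSML 1.0 (not a Philips-specific format); it emphasizes a word, inserts a pause, and adjusts speaking rate and pitch:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <s>Your destination is on the <emphasis>right</emphasis>.</s>
  <break time="300ms"/>
  <s><prosody rate="slow" pitch="+10%">Recalculating route.</prosody></s>
</speak>
```

Engine-specific features such as the predefined characters and emotions listed above would typically be selected through the engine's own API or vendor extensions rather than through core SSML elements.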
Compact TTS engine, ideal for embedded systems
- Low complexity, e.g. on ARM7, for a quality level ranging from 3.5 kHz bandwidth narrowband speech to 15 kHz bandwidth ultra-wideband speech:
  - CPU load: 20-60 MHz
  - RAM usage: 10-30 KB
  - ROM usage: 450-3000 KB
- Highly scalable: trade-offs can be made between speech quality, memory size, and processing power
- Generic C++ code, portable to various embedded processors
- Supported platforms: Windows (PC), Windows CE (PDA), ARM, TriMedia, Linux
- Available languages: American English, British English, French, German, Dutch, Italian, Castilian Spanish, Brazilian Portuguese, Russian, Turkish, and Mandarin Chinese
- Easy to add new application specific dictionaries
- Cross-language speaker support, i.e. a voice can also speak in other (non-native) languages
- Advanced recording tool to rapidly add new voices