Re: Creating a new synthesizer

Reece H. Dunn

I'll have more later, but here is a start (welcome to the rabbit hole).

A speech synthesizer voice typically consists of two parts:
1.  the text-to-phonemes part;
2.  the phonemes-to-audio part.

The text-to-phonemes part typically consists of a dictionary mapping words to phonemes, plus a set of rules for pronouncing certain letter patterns (like "EE" in English).
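To make the dictionary-plus-rules idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `LEXICON` and `RULES` tables are toy data, the phoneme symbols are ARPABET-style, and the "try the dictionary first, then fall back to rules" order is just one possible arrangement, not how any particular synthesizer does it.

```python
# Toy pronunciation dictionary (hypothetical data, ARPABET-style symbols).
LEXICON = {
    "see": ["S", "IY"],
    "the": ["DH", "AH"],
}

# Toy letter-to-sound rules (hypothetical). Multi-letter patterns come
# first so that "ee" and "sh" win over single-letter rules.
RULES = [
    ("ee", ["IY"]),
    ("sh", ["SH"]),
    ("t", ["T"]),
    ("r", ["R"]),
    ("p", ["P"]),
    ("s", ["S"]),
    ("d", ["D"]),
]


def text_to_phonemes(word):
    """Return a phoneme list for one word: dictionary first, rules second."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])

    phonemes = []
    i = 0
    while i < len(word):
        for pattern, output in RULES:
            if word.startswith(pattern, i):
                phonemes.extend(output)
                i += len(pattern)
                break
        else:
            # No rule matched this letter; a real rule set would cover it.
            i += 1
    return phonemes


print(text_to_phonemes("see"))    # found in the dictionary
print(text_to_phonemes("sheep"))  # built from the rules
```

A real rule set also needs context (what comes before and after the pattern), which is where most of the work goes.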

Phonemes (General)
1. -- used by linguists for transcribing languages (see also all the references from this for phoneme theory)
1. (English)
1. -- see also the different IPA references for a given language

Phoneme Transcription Schemes
1. -- Language-specific SAMPA transcriptions; used by MBROLA voices
1. -- Used as the basis of the CMU/FestVox voices
1. -- Conlang X-SAMPA
1. -- Kirshenbaum / ASCII-IPA
1. -- X-SAMPA

Pronunciation Dictionaries
1. -- Python tools for working with CMU-dictionary-like pronunciation dictionaries
1. -- historical view of the CMU pronunciation dictionary for American English
1. -- my attempts to clean up and extend the cmudict to make it more consistent
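
As a rough illustration of what these dictionaries look like, here is a small Python sketch that reads entries in the classic CMU dictionary layout: a word, whitespace, then space-separated phonemes, with `(1)`, `(2)`, ... suffixes marking alternative pronunciations and `;;;` starting a comment line. The function name `parse_cmudict` and the `SAMPLE` lines are my own for this example, and other releases of the dictionary may use slightly different conventions (for example, different comment markers or lowercase words).

```python
import re


def parse_cmudict(lines):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        # "HELLO(1)" is an alternative pronunciation of "HELLO".
        word = re.sub(r"\(\d+\)$", "", head).lower()
        entries.setdefault(word, []).append(phones)
    return entries


# Illustrative sample in the classic layout.
SAMPLE = """\
;;; a comment line
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
""".splitlines()

for word, prons in parse_cmudict(SAMPLE).items():
    print(word, prons)
```

The digits on the vowels are stress markers (0 = unstressed, 1 = primary, 2 = secondary).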

Formant Synthesizers
1. -- Dennis Klatt's original 1980 paper
1. -- Dennis Klatt's follow-up 1990 paper
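
To give a feel for what a formant synthesizer is built from, here is a sketch of the second-order digital resonator that, as I recall it, the 1980 paper uses as its basic building block: y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with the coefficients derived from the formant frequency and bandwidth. The function names and the example values (500 Hz formant, 60 Hz bandwidth, 10 kHz sample rate) are mine, so check the formulas against the paper before relying on them.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients A, B, C for a single formant resonator."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through one resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # previous two outputs
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Impulse response of a single 500 Hz formant (example values only).
impulse = [1.0] + [0.0] * 99
response = resonate(impulse, 500.0, 60.0, 10000.0)
print(response[:5])
```

A full synthesizer chains or sums several of these (one per formant) and drives them with a voicing or noise source; this sketch only covers the single resonator.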

Creating a Voice
1. -- A set of 7 English voices with US, Canadian, Indian, and Scottish accents
1. -- FestVox documentation on building a voice
1. -- MBROLA documentation on creating a voice
1. -- eSpeak NG docs on adding a language; the other docs in the docs folder contain more information, and the documentation can definitely be improved

