Re: Creating a new synthesizer

Reece H. Dunn

More links, resources, and information...

# Natural Language Processing -- Lexers

A lexer or tokenizer (typically implemented as a Finite State Machine, or FSM) tokenizes a string of text, splitting it into sequences of characters that each represent a full stop, comma, word, number, etc. Things to consider:
1.  Unicode General_Category property.
1.  Unicode Script property (e.g. for mixed Hiragana, Katakana, and Kanji in Japanese).
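
As a rough illustration of using General_Category, here is a minimal sketch in Python. It relies on the standard `unicodedata` module, which exposes General_Category but not the Script property, so script handling is left out. The token class names and the `classify`/`tokenize` helpers are my own assumptions, not part of any existing synthesizer.

```python
import unicodedata


def classify(ch):
    """Map one character to a coarse token class via its General_Category."""
    cat = unicodedata.category(ch)  # two-letter code, e.g. "Lu", "Nd", "Po", "Zs"
    if cat[0] in "LM":  # letters and combining marks
        return "word"
    if cat[0] == "N":  # decimal digits and other numeric characters
        return "number"
    if cat[0] == "Z" or ch.isspace():  # separators, plus tab/newline controls
        return "space"
    if cat[0] == "P":
        return "punctuation"
    return "other"


def tokenize(text):
    """Group runs of word or number characters; emit other characters singly."""
    tokens = []
    prev = None
    for ch in text:
        kind = classify(ch)
        if kind in ("word", "number") and kind == prev:
            # Extend the current run.
            tokens[-1] = (kind, tokens[-1][1] + ch)
        elif kind != "space":
            tokens.append((kind, ch))
        prev = kind
    return tokens


print(tokenize("Said John, 214 times."))
```

This sketch treats the apostrophe as punctuation, so a contraction like "we'll" is split into three tokens; a real lexer would need extra rules for that.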

Be aware that some word sequences can be contracted in speech ("we will" to "we'll") and there can be other suprasegmental (across word) pronunciations (e.g. the "d" and "j" (/dZ/) in "Said John" are typically geminated [1]).


1. -- Includes Unicode character and emoji data
1. -- Unicode Common Locale Data Repository (includes emoji translations)

# Natural Language Processing -- Part of Speech Tagging

This is used to differentiate words with the same spelling but different pronunciations or stresses ("the object" vs "I object", read, lead, "St Noun" vs "Noun St" vs "East St Noun St", etc.). This includes context/usage (as in Chinese "Tai Chi" vs Greek "Chi squared").

In the modern state of the art, this is typically done by a Hidden Markov Model (HMM).
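
To make the HMM idea concrete, here is a toy sketch of Viterbi decoding for "the object" vs "I object". Every probability, the four-tag tag set, and the `STRESS` lookup are invented for illustration; a real tagger would estimate these from a tagged corpus. It assumes a non-empty, lowercased list of words.

```python
# Toy HMM: all probabilities below are made up for illustration only.
TAGS = ["DET", "PRON", "NOUN", "VERB"]
START_P = {"DET": 0.5, "PRON": 0.4, "NOUN": 0.05, "VERB": 0.05}
TRANS_P = {
    "DET": {"DET": 0.0, "PRON": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "PRON": {"DET": 0.0, "PRON": 0.0, "NOUN": 0.1, "VERB": 0.9},
    "NOUN": {"DET": 0.25, "PRON": 0.25, "NOUN": 0.25, "VERB": 0.25},
    "VERB": {"DET": 0.25, "PRON": 0.25, "NOUN": 0.25, "VERB": 0.25},
}
EMIT_P = {
    "DET": {"the": 1.0},
    "PRON": {"i": 1.0},
    "NOUN": {"object": 1.0},
    "VERB": {"object": 1.0},
}
# Hypothetical lookup from (word, tag) to a stress pattern.
STRESS = {("object", "NOUN"): "OB-ject", ("object", "VERB"): "ob-JECT"}


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (START_P[t] * EMIT_P[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            column[t] = max(
                (best[-1][p][0] * TRANS_P[p][t] * EMIT_P[t].get(word, 0.0), p)
                for p in TAGS
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


for sentence in (["the", "object"], ["i", "object"]):
    tags = viterbi(sentence)
    print(sentence, tags, STRESS[(sentence[-1], tags[-1])])
```

The tag chosen for "object" then selects which pronunciation to use.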

1. -- Natural Language Processing in Python (see chapter 5 for part of speech tagging)

# Numbers, Abbreviations, etc.

This involves identifying the correct context and inserting words in place of the numbers, abbreviations, or other contractions. For example, replacing "214" with "two hundred and fourteen".
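
A minimal sketch of the number case, assuming British-style English ("and" after the hundreds) and cardinal integers from 0 to 999,999 only. The function names are my own; ordinals, years, decimals, currency, and other languages each need their own rules.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_thousand(n):
    """Spell out 1..999, e.g. 214 -> 'two hundred and fourteen'."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words.append(ONES[hundreds] + " hundred")
    if rest:
        if rest < 20:
            words.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            words.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    return " and ".join(words)


def number_to_words(n):
    """Spell out a cardinal number in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(below_thousand(thousands) + " thousand")
    if rest:
        # "one thousand and five": add "and" when there is no hundreds part.
        joiner = "and " if thousands and rest < 100 else ""
        parts.append(joiner + below_thousand(rest))
    return " ".join(parts)


print(number_to_words(214))
```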

1. -- English large numbers
1. -- Dutch large numbers

# Natural Language Processing -- Stemmers

A stemmer is an algorithm that identifies and removes prefixes and suffixes. This can be used to identify the prefixes, suffixes, and base words to be pronounced. The classic stemmer is the Porter stemmer.
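
The following is a deliberately naive affix-splitting sketch, not the Porter algorithm (which strips suffixes only and applies conditions on the remaining stem). The prefix and suffix lists, the `split_affixes` name, and the minimum stem length are all assumptions chosen for illustration.

```python
# Tiny, hypothetical affix lists; longer suffixes are listed first.
PREFIXES = ["dis", "un", "re"]
SUFFIXES = ["ness", "ing", "ed", "ly", "s"]


def split_affixes(word, min_stem=3):
    """Split a word into (prefix, base, suffix), stripping at most one of each.

    An affix is only stripped if at least min_stem characters remain.
    """
    prefix = suffix = ""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix


for w in ("unkindness", "replayed", "walking", "red"):
    print(w, split_affixes(w))
```

A purely string-based approach like this over-strips: it mis-splits words such as "reading", where the leading "re" is not a prefix. That is why real stemmers add conditions on the stem or use a lexicon.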

1. -- An algorithm for suffix stripping, 1980.
1. -- stemmers in other languages

# Grapheme to Phoneme Translation

1. -- Automatic Translation of English Text to Phonetics by Means of Letter-to-Sound Rules (NRL Report 7948), 1976.

# Audio Synthesis / Vocoders

A vocoder (voice encoder/decoder) is a device or application that encodes and decodes voice audio. This can be used in telephone systems, music (e.g. Cher's Believe), or speech synthesis.

A diphone database (e.g. MBROLA) stores audio for phonemes in pairs, from the midpoints of each. This makes it easier to join the audio with reduced artifacts.
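
As a small sketch of the pairing idea, the snippet below turns a phoneme sequence into the list of diphone units that would need to be looked up and joined. The `_` silence symbol, the `a-b` unit naming, and the example transcription are illustrative assumptions, not the naming scheme of MBROLA or any particular database.

```python
def to_diphones(phonemes, silence="_"):
    """Pair each phoneme with its successor, padding both ends with silence.

    Each unit spans from the midpoint of one phoneme to the midpoint of the next.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Illustrative SAMPA-like transcription of "hello".
print(to_diphones(["h", "@", "l", "@U"]))
```

A sequence of n phonemes yields n + 1 diphones, since the leading and trailing silences each contribute one unit.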

1. -- Linear Predictive Coding (LPC) is the basis for a number of diphone synthesizers
1. Residual-excited LPC (RELP) vocoder. This is the model used by the festvox/festival diphone voices.
1. -- This is the other method typically used in concatenative synthesizers (adding audio segments together). A variant of this is used by MBROLA.

More recent vocoders (like WaveNet) use neural networks. I'm not as familiar with this approach.

It is also useful to be familiar with Digital Signal Processing (DSP) techniques and terminology, e.g. spectrum and formants.

# Some YouTube Resources

1. -- Prof. Simon King - Using Speech Synthesis to give Everyone their own Voice
1. -- Rachel's English for American English pronunciation. Includes demonstrations on how the mouth moves when producing vowels, etc.
1. -- NativLang. Has information about the structure and pronunciation of different languages, especially ancient and difficult languages.
1. -- The Virtual Linguistics Campus. Has resources on different aspects of linguistics and phonology. It also has a series on the evolution of English.
1. -- Jackson Crawford. Old Norse with some information on related languages (Old English, Old Icelandic, Old Norwegian).

Kind regards,
