We may see some really great sounding TTS in the near future


Josh Kennedy
 

Yes, this does apply to NVDA, or I hope it will in the very near future!




-------- Forwarded Message --------
Subject: We may see some really great Sounding TTS in the near feature
Date: Sun, 11 Sep 2016 22:01:48 -0700
From: Warren Carr <warcarr@...>
Reply-To: eyes-free@...
To: eyes-free@...


I was reading DeepMind’s blog post about WaveNet, and I was blown away by some of the stuff that they are doing.

I can’t wait to have those voices on our devices!

Here is the extract, followed by the URL to the page. Be sure to head over to that page and take a listen to some of those voices.

If you don’t want to read while you are on the page, you can simply press the letter B to jump to the “play” button.

The first samples demonstrate how the current Google TTS sounds, and the later ones demonstrate the more modern-sounding voices.

There are a couple of other languages in there besides U.S. English and Chinese.

Quote:

 

This post presents WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.

We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces.

 

Talking Machines

 

Allowing people to converse with machines is a long-standing dream of human-computer interaction. The ability of computers to understand natural speech has been revolutionised in the last few years by the application of deep neural networks (e.g., Google Voice Search). However, generating speech with computers, a process usually referred to as speech synthesis or text-to-speech (TTS), is still largely based on so-called concatenative TTS, where a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database.

 

This has led to a great demand for parametric TTS, where all the information required to generate the data is stored in the parameters of the model, and the contents and characteristics of the speech can be controlled via the inputs to the model. So far, however, parametric TTS has tended to sound less natural than concatenative, at least for syllabic languages such as English. Existing parametric models typically generate audio signals by passing their outputs through signal processing algorithms known as vocoders.

 

WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time. As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music.

 

WaveNets

 

[Image: Wave animation]

 

Researchers usually avoid modelling raw audio because it ticks so quickly: typically 16,000 samples per second or more, with important structure at many time-scales. Building a completely autoregressive model, in which the prediction for every one of those samples is influenced by all previous ones (in statistics-speak, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.
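
To put that sample rate in perspective, here is a rough sketch (not from the post) of how many one-sample predictions a clip needs, and how little time each prediction may take if synthesis has to keep pace with playback. The function names are made up for illustration; the only figure taken from the post is 16,000 samples per second.

```python
SAMPLE_RATE_HZ = 16_000  # figure quoted in the post: 16,000 samples per second


def predictions_needed(duration_seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of one-sample predictions an autoregressive model makes for a clip."""
    return round(duration_seconds * sample_rate_hz)


def realtime_budget_microseconds(sample_rate_hz=SAMPLE_RATE_HZ):
    """Time available per sample if generation must keep up with playback."""
    return 1_000_000 / sample_rate_hz


print(predictions_needed(2.0))         # 32000
print(realtime_budget_microseconds())  # 62.5
```

So a two-second utterance means 32,000 sequential predictions, each of which would need to finish in about 62.5 microseconds to run in real time.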

 

However, our PixelRNN and PixelCNN models, published earlier this year, showed that it was possible to generate complex natural images not only one pixel at a time, but one colour-channel at a time, requiring thousands of predictions per image. This inspired us to adapt our two-dimensional PixelNets to a one-dimensional WaveNet.

 

[Image: Architecture animation]

 

The above animation shows how a WaveNet is structured. It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of timesteps.
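
The exponential growth can be checked with a little arithmetic. The sketch below is not from the post: it assumes a kernel size of 2 and a dilation that doubles at every layer, whereas the post only says "various dilation factors". Under those assumptions, each layer with dilation d widens the receptive field by d timesteps.

```python
def receptive_field(dilations, kernel_size=2):
    """Input timesteps visible to one output of a stack of dilated convolutions.

    Each layer with dilation d and kernel size k widens the field by (k - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed schedule: the dilation doubles at every layer (1, 2, 4, ..., 512).
doubling = [2 ** i for i in range(10)]
# For comparison: ten ordinary layers with no dilation.
plain = [1] * 10

print(receptive_field(plain))         # 11
print(receptive_field(doubling))      # 1024
print(receptive_field(doubling * 3))  # 3070
```

Ten undilated layers see only 11 timesteps, while ten doubling layers see 1,024, and repeating that block three times reaches the "thousands of timesteps" range.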

 

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
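
The sampling loop described above can be sketched in a few lines. This is only an illustration of the draw-and-feed-back pattern, not DeepMind's code: `toy_next_sample_distribution` is a hypothetical stand-in for the trained network, and the 256 amplitude levels are an assumption made for the example.

```python
import random

NUM_LEVELS = 256  # assumed number of discrete amplitude levels (illustrative)


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the trained network.

    Returns one weight per amplitude level, favouring levels close to the
    previous sample so that the output varies smoothly.
    """
    previous = history[-1] if history else NUM_LEVELS // 2
    return [1.0 / (1 + abs(level - previous)) for level in range(NUM_LEVELS)]


def generate(num_samples, seed=0):
    """Autoregressive sampling: draw a value, append it, then predict again."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        weights = toy_next_sample_distribution(samples)
        value = rng.choices(range(NUM_LEVELS), weights=weights, k=1)[0]
        samples.append(value)  # fed back in as input for the next step
    return samples


audio = generate(100)
print(len(audio), min(audio), max(audio))
```

Because every step waits for the previous sample, the loop cannot be parallelised across time, which is where the computational expense comes from.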

 

Improving the State of the Art

 

We trained WaveNet using some of Google’s TTS datasets so we could evaluate its performance. The following figure shows the quality of WaveNets on a scale from 1 to 5, compared with Google’s current best TTS systems (parametric and concatenative), and with human speech using Mean Opinion Scores (MOS). MOS are a standard measure for subjective sound quality tests, and were obtained in blind tests with human subjects (from over 500 ratings on 100 test sentences). As we can see, WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese.
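
For anyone wondering how a "gap reduction" figure like that is worked out, here is a minimal sketch. The formula is the usual reading of the phrase (the share of the distance between the best existing system and human speech that the new system closes); the scores in the example are invented for illustration and are NOT the figures from the post.

```python
def gap_reduction(baseline_mos, new_mos, human_mos):
    """Fraction of the baseline-to-human MOS gap closed by the new system."""
    original_gap = human_mos - baseline_mos
    if original_gap <= 0:
        raise ValueError("baseline must score below human speech")
    return (new_mos - baseline_mos) / original_gap


# Hypothetical scores on the 1-to-5 scale, not taken from the post.
baseline, new, human = 3.0, 4.2, 5.0
print(f"{gap_reduction(baseline, new, human):.0%}")  # 60%
```

With these made-up numbers the old system sits 2.0 points below human speech and the new one 0.8 below, so 60% of the gap has been closed.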

 

For both Chinese and English, Google’s current TTS systems are considered among the best worldwide, so improving on both with a single model is a major achievement.

 

Here are some samples from all three systems so you can listen and compare yourself:

End of quote from:

 

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

 

What do you think?

 

Warren

--
To report violations of our ground rules or content guidelines, contact eyes-free+owners@.... -- https://goo.gl/rDveM8
---
You received this message because you are subscribed to the Google Groups "eyes-free" group.
To unsubscribe from this group and stop receiving emails from it, send an email to eyes-free+unsubscribe@....
To post to this group, send email to eyes-free@....
For more options, visit https://groups.google.com/d/optout.


Devin Prater
 

I do hope that screen readers get this option for synthesizers. We still don't have the Google TTS as an NVDA addon.

Devin Prater
Sent from Gmail.

On Mon, Sep 12, 2016 at 6:47 AM, Josh Kennedy <joshknnd1982@...> wrote:

Yes, this does apply to NVDA, or I hope it will in the very near future!







Damien Sykes-Lindley <damien@...>
 

Hi there,
I may have misunderstood the article, but it says that generating the audio is a computationally expensive task. To me, that suggests we’ll be facing a fair amount of the lag that we currently face with concatenative voices. While the voices themselves sound remarkable, I can’t help feeling that they are better suited to reading text aloud than to interacting with a screen reader for day-to-day computer use, which includes input as well as output.
Synthesisers are a sticky, tricky topic, especially for me. I am very picky about which voices I will use: they have to have extremely low latency, be audibly and vocally clear, and be customisable, and in my view none of the voices currently supported by NVDA meets all three criteria. This leaves me using SAPI, and even that has its major drawbacks.
Kind regards,
Damien.
 

Sent: Monday, September 12, 2016 12:47 PM
Subject: [nvda] Fwd: We may see some really great Sounding TTS in the near feature
 

Yes, this does apply to NVDA, or I hope it will in the very near future!

 


