AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

ArXiv: arXiv:2107.02530

Accepted by INTERSPEECH 2021

Author

Audio Samples

All of the audio samples use MelGAN as the vocoder.

Spontaneous Quality

cecily package in all of that um yeah so …

GT | GT Mel+Vocoder | AdaSpeech | AdaSpeech 3

lot of stuff with the even CSA had igaming tournament they sponsored teams and their really get into the gaming it’s that’s coming.

GT | GT Mel+Vocoder | AdaSpeech | AdaSpeech 3

Filled Pause (FP) Insertion

Note: In actual practice, we did not use punctuation marks (, or .).

Input: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

Output: Six spoons of fresh snow peas, um, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

GT | Before Insertion | After Insertion

Input: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.

Output: When the sunlight strikes raindrops in the air, um, they act as a prism and form a rainbow.

GT | Before Insertion | After Insertion

Input: Some have accepted it as a miracle without physical explanation.

Output: Some have accepted it um as a miracle without physical explanation.

GT | Before Insertion | After Insertion

Speaker Adaptation (VCTK)

cecily package in all of that um yeah so something any jamaican upgrade yeah it like said.

Original Sound | Male Adaptation | Female Adaptation

lot of stuff with the even CSA had igaming tournament they sponsored teams and their really get into the gaming it’s that’s coming.

Original Sound | Male Adaptation | Female Adaptation

amplifier from them was a kinda interesting a jackson he really liked it it’s a good way to get bluetooth.

Original Sound | Male Adaptation | Female Adaptation

Change the threshold to control the FP intensity

The rainbow um is a division of white light into many beautiful colors.

The rainbow um is a division of white light uh into many beautiful colors.

Threshold=0.1 | Threshold=0.5
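
To illustrate how a threshold could change the number of inserted filled pauses (FPs), here is a minimal sketch. This page does not say how the threshold is applied, so the decision rule is an assumption: an FP is inserted after a word whenever the predicted probability of "no FP" at that position falls below the threshold, so a larger threshold yields more FPs. The function name `insert_filled_pauses` and all probability values are hypothetical, chosen only so that the output matches the two transcripts in this section.

```python
from typing import List, Sequence, Tuple


def insert_filled_pauses(
    words: Sequence[str],
    gap_probs: Sequence[Tuple[float, float, float]],
    threshold: float,
) -> str:
    """Insert "um"/"uh" after words[i] when P(no FP) for that gap is below threshold.

    gap_probs[i] is a hypothetical triple (P(no FP), P(um), P(uh)) for the gap
    that follows words[i]. The rule is an illustrative assumption, not the
    method used to produce the samples on this page.
    """
    if len(words) != len(gap_probs):
        raise ValueError("need exactly one probability triple per word")
    out: List[str] = []
    for word, (p_none, p_um, p_uh) in zip(words, gap_probs):
        out.append(word)
        if p_none < threshold:
            # Choose whichever filled pause is more probable at this gap.
            out.append("um" if p_um >= p_uh else "uh")
    return " ".join(out)


# Made-up probabilities for illustration only.
words = "The rainbow is a division of white light into many beautiful colors.".split()
probs = [(0.9, 0.05, 0.05)] * len(words)
probs[1] = (0.05, 0.80, 0.15)  # gap after "rainbow": strong "um" candidate
probs[7] = (0.30, 0.20, 0.50)  # gap after "light": weaker "uh" candidate

print(insert_filled_pauses(words, probs, threshold=0.1))
print(insert_filled_pauses(words, probs, threshold=0.5))
```

With these made-up values, the first call inserts only "um" after "rainbow", and the second call also inserts "uh" after "light".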
