AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
ArXiv: arXiv:2107.02530
Accepted by INTERSPEECH 2021
Authors
- Yuzi Yan (EE, Tsinghua University) yan-yz17@mails.tsinghua.edu.cn
- Xu Tan (Microsoft Research Asia) xuta@microsoft.com
- Bohan Li (Microsoft Azure Speech) bohan.li@microsoft.com
- Guangyan Zhang (EE, The Chinese University of Hong Kong) gyzhang@link.cuhk.edu.hk
- Tao Qin (Microsoft Research Asia) taoqin@microsoft.com
- Sheng Zhao (Microsoft Azure Speech) sheng.zhao@microsoft.com
- Yuan Shen (EE, Tsinghua University) shenyuan_ee@tsinghua.edu.cn
- Wei-Qiang Zhang (EE, Tsinghua University) wqzhang@tsinghua.edu.cn
- Tie-Yan Liu (Microsoft Research Asia) tie-yan.liu@microsoft.com
Audio Samples
All audio samples use MelGAN as the vocoder.
Spontaneous Quality
cecily package in all of that um yeah so …
GT | GT Mel+Vocoder | AdaSpeech | AdaSpeech 3 |
---|---|---|---|
lot of stuff with the even CSA had igaming tournament they sponsored teams and their really get into the gaming it’s that’s coming.
GT | GT Mel+Vocoder | AdaSpeech | AdaSpeech 3 |
---|---|---|---|
FP Insertion
Note: In practice, we did not use punctuation marks (, or .).
Input: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
Output: Six spoons of fresh snow peas, um, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
GT | Before Insertion | After Insertion |
---|---|---|
Input: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.
Output: When the sunlight strikes raindrops in the air, um, they act as a prism and form a rainbow.
GT | Before Insertion | After Insertion |
---|---|---|
Input: Some have accepted it as a miracle without physical explanation.
Output: Some have accepted it um as a miracle without physical explanation.
GT | Before Insertion | After Insertion |
---|---|---|
Speaker Adaptation (VCTK)
cecily package in all of that um yeah so something any jamaican upgrade yeah it like said.
Original Sound | Male Adaptation | Female Adaptation |
---|---|---|
lot of stuff with the even CSA had igaming tournament they sponsored teams and their really get into the gaming it’s that’s coming.
Original Sound | Male Adaptation | Female Adaptation |
---|---|---|
amplifier from them was a kinda interesting a jackson he really liked it it’s a good way to get bluetooth.
Original Sound | Male Adaptation | Female Adaptation |
---|---|---|
Change the threshold to control the FP intensity
The rainbow um is a division of white light into many beautiful colors.
The rainbow um is a division of white light uh into many beautiful colors.
Threshold=0.1 | Threshold=0.5 |
---|---|
Our Related Works
AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data