AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents
1. Abstract
Learning accent from crowd-sourced data is a feasible way to build a target-speaker TTS system that can synthesize accented speech. Two challenging problems must be solved to this end. First, directly using the crowd-sourced data, which has poor acoustic quality, together with the target speaker data for accent transfer leads to synthetic speech of degraded quality. To mitigate this problem, we take a bottleneck feature (BN) based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module that learns accent and a BN-to-Mel (BN2Mel) module that learns speaker timbre; the neural-network-based BN feature serves as an intermediate representation that is robust to noise interference. Second, directly training T2BN on the crowd-sourced data in this two-stage system produces accented speech of the target speaker with poor prosody, because the crowd-sourced recordings are contributed by ordinary, non-professional speakers. To tackle this problem, we extend the two-stage approach to a novel three-stage approach, in which T2BN and BN2Mel are trained on the high-quality target speaker data and a new BN-to-BN (BN2BN) module is plugged in between the two to perform accent transfer. To train the BN2BN module, parallel unaccented and accented BN features are obtained with a proposed data augmentation method. As a result, the three-stage approach produces accented speech for the target speaker with good prosody: the prosody pattern is inherited from the professional target speaker, while accent transfer is achieved by the BN2BN module. The proposed approach, named AccentSpeech, is validated on a Mandarin TTS accent transfer task.
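The data flow of the three-stage approach can be illustrated with a minimal sketch. This is not the authors' implementation: the class `ThreeStageTTS`, the `toy_*` functions, and the list-of-lists frame format are hypothetical stand-ins chosen only to show how T2BN, BN2BN, and BN2Mel are chained, and how skipping BN2BN falls back to unaccented synthesis. In particular, the toy conversion keeps the frame count unchanged, which is an assumption of this sketch rather than a claim about the real module.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical frame format: one feature vector (list of floats) per frame.
Frames = List[List[float]]


@dataclass
class ThreeStageTTS:
    """Chains the three stages: text -> BN -> (accented) BN -> mel."""
    t2bn: Callable[[str], Frames]      # stands in for a model trained on target speaker data
    bn2bn: Callable[[Frames], Frames]  # stands in for the accent transfer model
    bn2mel: Callable[[Frames], Frames] # stands in for a model trained on target speaker data

    def synthesize(self, text: str, accented: bool = True) -> Frames:
        bn = self.t2bn(text)       # unaccented BN carrying the target speaker's prosody
        if accented:
            bn = self.bn2bn(bn)    # accent transfer happens in BN space only
        return self.bn2mel(bn)     # timbre is rendered in the final stage


# Toy stand-ins (assumptions for illustration, not real models).
def toy_t2bn(text: str) -> Frames:
    # One two-dimensional "BN" frame per character.
    return [[float(ord(ch) % 7), 1.0] for ch in text]


def toy_bn2bn(bn: Frames) -> Frames:
    # Shift every value; the number of frames is left unchanged.
    return [[value + 0.5 for value in frame] for frame in bn]


def toy_bn2mel(bn: Frames) -> Frames:
    # Collapse each BN frame into a one-dimensional "mel" frame.
    return [[sum(frame)] for frame in bn]


pipeline = ThreeStageTTS(t2bn=toy_t2bn, bn2bn=toy_bn2bn, bn2mel=toy_bn2mel)
plain_mel = pipeline.synthesize("abc", accented=False)
accent_mel = pipeline.synthesize("abc", accented=True)
```

Because only the middle stage differs between the two calls, the unaccented and accented outputs share the same frame count in this toy setup; in a real system each callable would wrap a trained network.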
2. Dataset
High-quality unaccented dataset -- DB1 (Standard Mandarin)
Text | 卡尔普陪外孙玩滑梯 | 假语村言别再拥抱我 | 宝马配挂跛骡鞍,貂蝉怨枕董翁榻 |
Audio |
Low-quality accented dataset -- KeSpeech (Accented Mandarin)
City: Chengdu-成都
Text | 该行将推出无卡化时代电子支付 | 色彩从清新的水蓝到稳重的黑 | 五十七万扶贫村实施光伏 |
Audio |
City: Xi'an-西安
Text | 但在感知路况的周围环境方面 | 评论家白烨也看不上春树能获奖 | 但是滚动播放姚贝娜的音乐 |
Audio |
City: Zhengzhou-郑州
Text | 为素食主义为素食主义者开发的一款班尼迪蛋 | 为让余家店六十六户村民住上好的房子 | 为天津发展提供舆论支持聚正能量 |
Audio |
3. Demos
Synthetic accented speech for the DB1 speaker (accent learned from KeSpeech)
City: Chengdu-成都
1、Text:那些庄稼田园在果果眼里感觉太亲切了
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
2、Text:她把鞋子拎在手上光着脚丫故意踩在水洼里
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
3、Text:如大堂东有海蚀穴二层,穴顶呈弧形,表面多蜂窝状浪花风化而成的小圆穴
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
4、Text:分水岭把水流分成两个水系,西入四川盆地的内陆湖,东入湖北宜昌附近的湖泊
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
City: Xi'an-西安
1、Text:那些庄稼田园在果果眼里感觉太亲切了
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
2、Text:她把鞋子拎在手上光着脚丫故意踩在水洼里
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
3、Text:如大堂东有海蚀穴二层,穴顶呈弧形,表面多蜂窝状浪花风化而成的小圆穴
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
4、Text:分水岭把水流分成两个水系,西入四川盆地的内陆湖,东入湖北宜昌附近的湖泊
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
City: Zhengzhou-郑州
1、Text:那些庄稼田园在果果眼里感觉太亲切了
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
2、Text:她把鞋子拎在手上光着脚丫故意踩在水洼里
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
3、Text:如大堂东有海蚀穴二层,穴顶呈弧形,表面多蜂窝状浪花风化而成的小圆穴
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |
4、Text:分水岭把水流分成两个水系,西入四川盆地的内陆湖,东入湖北宜昌附近的湖泊
DB1(Standard Mandarin): | Accent-Hieratron: | AccentSpeech: |