AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents

Yongmao Zhang1, Zhichao Wang1, Peiji Yang2, Hongshen Sun2, Zhisheng Wang2, Lei Xie1
1 Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2 Tencent, Shenzhen, China

1. Abstract

Learning accents from crowd-sourced data is a feasible way to build a target-speaker TTS system that can synthesize accented speech. Two challenging problems must be solved to this end. First, directly using the crowd-sourced data, which has poor acoustic quality, together with the target speaker data for accent transfer leads to synthetic speech with degraded quality. To mitigate this problem, we take a bottleneck feature (BN) based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module that learns accent and a BN-to-Mel (BN2Mel) module that learns speaker timbre; the neural-network-based BN feature serves as an intermediate representation that is robust to noise interference. Second, directly training T2BN on the crowd-sourced data in this two-stage system produces accented speech for the target speaker with poor prosody, because the crowd-sourced recordings are contributed by ordinary, non-professional speakers. To tackle this problem, we extend the two-stage approach to a novel three-stage approach, in which T2BN and BN2Mel are trained on the high-quality target speaker data and a new BN-to-BN (BN2BN) module is inserted between the two to perform accent transfer. To train the BN2BN module, parallel unaccented and accented BN features are obtained with a proposed data augmentation method. As a result, the three-stage approach produces accented speech for the target speaker with good prosody: the prosody pattern is inherited from the professional target speaker, while accent transfer is handled by the BN2BN module. The proposed approach, named AccentSpeech, is validated on a Mandarin TTS accent transfer task.
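The data flow of the three-stage design described in the abstract can be sketched as a simple composition of modules. This is only an illustrative sketch, not the authors' implementation: the class name `ThreeStageTTS`, the `Frames` type, and the toy stand-in functions are all hypothetical, and the real T2BN, BN2BN, and BN2Mel modules are trained neural networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Placeholder type (assumption): a sequence of feature frames.
Frames = List[List[float]]


@dataclass
class ThreeStageTTS:
    """Hypothetical wrapper showing how the three modules are chained."""
    t2bn: Callable[[str], Frames]        # text -> unaccented BN features
    bn2mel: Callable[[Frames], Frames]   # BN features -> mel-spectrogram
    bn2bn: Optional[Callable[[Frames], Frames]] = None  # unaccented BN -> accented BN

    def synthesize(self, text: str, accented: bool = True) -> Frames:
        bn = self.t2bn(text)             # stage 1: trained on target speaker data
        if accented and self.bn2bn is not None:
            bn = self.bn2bn(bn)          # stage 2: accent transfer in BN space
        return self.bn2mel(bn)           # stage 3: target speaker timbre


# Toy stand-ins (assumptions) so the sketch runs; real modules are neural networks.
def toy_t2bn(text: str) -> Frames:
    return [[float(ord(ch) % 7), 1.0] for ch in text]


def toy_bn2bn(bn: Frames) -> Frames:
    # The toy version keeps the frame count; this is a choice of the sketch only.
    return [[x + 0.5 for x in frame] for frame in bn]


def toy_bn2mel(bn: Frames) -> Frames:
    return [[2.0 * x for x in frame] for frame in bn]


tts = ThreeStageTTS(t2bn=toy_t2bn, bn2mel=toy_bn2mel, bn2bn=toy_bn2bn)
plain_mel = tts.synthesize("abc", accented=False)   # two-stage path: T2BN -> BN2Mel
accent_mel = tts.synthesize("abc", accented=True)   # three-stage path with BN2BN
```

Skipping the BN2BN stage recovers the unaccented two-stage path, which is why the accent module can be treated as a plug-in between two modules trained only on the target speaker.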



2. Dataset

High-quality unaccented dataset -- DB1 (Standard Mandarin)

Text 卡尔普陪外孙玩滑梯 假语村言别再拥抱我 宝马配挂跛骡鞍,貂蝉怨枕董翁榻
Audio

Low-quality accented dataset -- KeSpeech (Accented Mandarin)

City: Chengdu-成都

Text 该行将推出无卡化时代电子支付 色彩从清新的水蓝到稳重的黑 五十七万扶贫村实施光伏
Audio

City: Xi'an-西安

Text 但在感知路况的周围环境方面 评论家白烨也看不上春树能获奖 但是滚动播放姚贝娜的音乐
Audio

City: Zhengzhou-郑州

Text 为素食主义为素食主义者开发的一款班尼迪蛋 为让余家店六十六户村民住上好的房子 为天津发展提供舆论支持聚正能量
Audio

3. Demos

Synthetic accented speech for DB1 (accent learned from KeSpeech)

City: Chengdu-成都

1、Text:那些庄稼田园在果果眼里感觉太亲切了
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

2、Text:她把鞋子拎在手上光着脚丫故意踩在水洼里
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

3、Text:如大堂东有海蚀穴二层,穴顶呈弧形,表面多蜂窝状浪花风化而成的小圆穴
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

4、Text:分水岭把水流分成两个水系,西入四川盆地的内陆湖,东入湖北宜昌附近的湖泊
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

City: Xi'an-西安

1、Text:那些庄稼田园在果果眼里感觉太亲切了
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

2、Text:她把鞋子拎在手上光着脚丫故意踩在水洼里
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

3、Text:如大堂东有海蚀穴二层,穴顶呈弧形,表面多蜂窝状浪花风化而成的小圆穴
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

4、Text:分水岭把水流分成两个水系,西入四川盆地的内陆湖,东入湖北宜昌附近的湖泊
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

City: Zhengzhou-郑州

1、Text:那些庄稼田园在果果眼里感觉太亲切了
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

2、Text:她把鞋子拎在手上光着脚丫故意踩在水洼里
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

3、Text:如大堂东有海蚀穴二层,穴顶呈弧形,表面多蜂窝状浪花风化而成的小圆穴
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech:

4、Text:分水岭把水流分成两个水系,西入四川盆地的内陆湖,东入湖北宜昌附近的湖泊
DB1(Standard Mandarin): Accent-Hieratron: AccentSpeech: