AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents

Yongmao Zhang¹, Zhichao Wang¹, Peiji Yang², Hongshen Sun², Zhisheng Wang², Lei Xie¹
¹ Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
² Tencent, Shenzhen, China

1. Abstract

Learning accent from crowd-sourced data is a feasible way to build a target-speaker TTS system that can synthesize accented speech. Two challenging problems must be solved to this end. First, directly using the crowd-sourced data, which has poor acoustic quality, together with the target speaker data for accent transfer leads to synthetic speech with degraded quality. To mitigate this problem, we take a bottleneck feature (BN) based TTS approach, in which TTS is decomposed into a Text-to-BN (T2BN) module that learns accent and a BN-to-Mel (BN2Mel) module that learns speaker timbre; the neural-network-based BN features serve as an intermediate representation that is robust to noise interference. Second, directly training T2BN on the crowd-sourced data in this two-stage system produces accented speech for the target speaker with poor prosody, because the crowd-sourced recordings are contributed by ordinary, non-professional speakers. To tackle this problem, we extend the two-stage approach to a novel three-stage approach, in which T2BN and BN2Mel are trained on the high-quality target speaker data and a new BN-to-BN (BN2BN) module is plugged in between the two modules to perform accent transfer. To train the BN2BN module, parallel unaccented and accented BN features are obtained by a proposed data augmentation method. As a result, the proposed three-stage approach produces accented speech for the target speaker with good prosody: the prosody pattern is inherited from the professional target speaker, while accent transfer is achieved by the BN2BN module. The proposed approach, named AccentSpeech, is validated on a Mandarin TTS accent transfer task.
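The three-stage data flow described in the abstract (text to unaccented BN, BN to accented BN, BN to mel) can be sketched as a simple composition of three callables. This is only an illustrative sketch, not the authors' implementation: the class name ThreeStageTTS, the toy_* stand-in functions, and the feature sizes BN_DIM and MEL_DIM are all hypothetical, and the stand-ins are arithmetic placeholders rather than neural networks. The toy BN2BN keeps the frame count unchanged, which is an assumption made here to mirror the statement that prosody is inherited from the target speaker.

```python
from dataclasses import dataclass
from typing import Callable, List

Frames = List[List[float]]  # one feature vector per frame

BN_DIM = 4   # hypothetical bottleneck size, chosen only for this demo
MEL_DIM = 8  # hypothetical mel size, chosen only for this demo


@dataclass
class ThreeStageTTS:
    """Composition of the three modules named in the abstract."""
    t2bn: Callable[[str], Frames]       # text -> unaccented BN features
    bn2bn: Callable[[Frames], Frames]   # unaccented BN -> accented BN
    bn2mel: Callable[[Frames], Frames]  # BN -> mel (target speaker timbre)

    def synthesize(self, text: str, accented: bool = True) -> Frames:
        bn = self.t2bn(text)
        if accented:
            # Accent transfer happens only at the BN level.
            bn = self.bn2bn(bn)
        return self.bn2mel(bn)


def toy_t2bn(text: str) -> Frames:
    # Placeholder: one frame per character, value derived from the code point.
    return [[(ord(ch) % 7) / 7.0] * BN_DIM for ch in text]


def toy_bn2bn(bn: Frames) -> Frames:
    # Placeholder: shift every value; the number of frames is preserved
    # (an assumption of this sketch, see the note above).
    return [[v + 0.1 for v in frame] for frame in bn]


def toy_bn2mel(bn: Frames) -> Frames:
    # Placeholder: map each BN frame to a MEL_DIM-sized frame.
    return [[sum(frame) / len(frame)] * MEL_DIM for frame in bn]


tts = ThreeStageTTS(toy_t2bn, toy_bn2bn, toy_bn2mel)
mel_plain = tts.synthesize("你好", accented=False)
mel_accent = tts.synthesize("你好", accented=True)
print(len(mel_plain), len(mel_accent), len(mel_accent[0]))
```

Skipping the middle stage (accented=False) gives the two-module path from text to mel for the target speaker, so in this sketch the accent is switched on or off without touching the other two stages.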

2. Dataset

High-quality unaccented dataset -- DB1 (Standard Mandarin)

Text	卡尔普陪外孙玩滑梯	假语村言别再拥抱我	宝马配挂跛骡鞍,貂蝉怨枕董翁榻
Audio

Low-quality accented dataset -- KeSpeech (Accented Mandarin)

City: Chengdu-成都

Text	该行将推出无卡化时代电子支付	色彩从清新的水蓝到稳重的黑	五十七万扶贫村实施光伏
Audio

City: Xi'an-西安

Text	但在感知路况的周围环境方面	评论家白烨也看不上春树能获奖	但是滚动播放姚贝娜的音乐
Audio

City: Zhengzhou-郑州

Text	为素食主义为素食主义者开发的一款班尼迪蛋	为让余家店六十六户村民住上好的房子	为天津发展提供舆论支持聚正能量
Audio