

    Zero-Shot Accent Conversion Model Based on the kNN Regression of Content Features


       

Abstract: Accent Conversion (AC) aims to convert speech from a source accent to a target accent while preserving the source speaker's timbre and the speech content. Existing AC models generalize poorly to speech that falls outside the distribution of their training data, which seriously limits their applications. To this end, a zero-shot AC model based on k-nearest-neighbor (kNN) regression of speech content features is proposed. On the one hand, the 23rd layer of WavLM is adopted as the content encoder to extract content features from both source- and target-accented speech, and kNN regression replaces each source-accented content feature with its nearest neighbors in a pool constructed from target-accented content features, thereby achieving accent conversion. On the other hand, to preserve the source speaker's timbre in the converted speech, a multi-speaker vocoder is constructed to fuse the resulting target-accented content features with the source speaker's timbre feature, extracted by the speaker encoder, and synthesize speech with the target accent. The proposed model requires no source-accented speech during training, so it can convert speech from various source accents to the target accent; that is, it achieves good generalization. Experimental results demonstrate that the proposed model achieves better objective and subjective evaluation results than existing parallel and non-parallel AC models.
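The core conversion step described above — replacing each source-accented content frame with the average of its nearest neighbors in a pool of target-accented content features — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimension, the value of k, and the use of cosine similarity are assumptions, and in practice the features would come from the 23rd layer of WavLM.

```python
import numpy as np

def knn_regression(src_feats, tgt_pool, k=4):
    """Replace each source-accent content frame with the mean of its
    k nearest neighbors (by cosine similarity) in the target-accent pool.

    src_feats: (T, D) content features of the source-accented utterance
    tgt_pool:  (N, D) pool of target-accented content features
    Returns a (T, D) array of converted content features.
    (k and the similarity metric are illustrative assumptions.)
    """
    # L2-normalize rows so that a dot product equals cosine similarity
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    pool = tgt_pool / np.linalg.norm(tgt_pool, axis=1, keepdims=True)
    sim = src @ pool.T                       # (T, N) frame-to-pool similarities
    idx = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar pool frames
    return tgt_pool[idx].mean(axis=1)        # average the neighbors per source frame
```

The converted features would then be passed, together with the source speaker's timbre embedding, to the multi-speaker vocoder to synthesize the target-accented waveform.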

       
