
    One-Shot Voice Conversion Model Based on Combined Contrastive Learning

      Abstract: To address the difficulty of balancing content preservation against speaker characteristic transfer in one-shot voice conversion, this paper proposes a one-shot voice conversion model based on a joint contrastive learning strategy, built on the Variational Inference Text-to-Speech (VITS) framework. The model applies contrastive constraints in the content and speaker feature spaces simultaneously. For content feature extraction, the output of the 23rd layer of WavLM (Waveform Language Model) is passed through K-Nearest Neighbors (KNN) matching and then combined with the output of the 24th layer of WavLM to form sample pairs, which yields phoneme-level alignment and semantic consistency. For speaker feature extraction, contrastive samples are generated by temporal perturbation, and a supervised contrastive loss is used to extract speaker embeddings that are stable and independent of content. The generator combines a Conditional Variational Autoencoder (CVAE) with an adversarial loss for training. Experimental results on the VCTK (Voice Cloning Tool Kit) corpus show that the proposed model outperforms recent baseline models on both objective and subjective evaluation metrics, confirming improvements in speech naturalness, content preservation, and speaker similarity.
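The abstract does not give the exact form of the supervised contrastive loss used for the speaker embeddings. As an illustration only, the sketch below implements the commonly used supervised contrastive formulation in plain Python: embeddings that share a speaker label (for example, an utterance and a temporally perturbed copy of it) are treated as positives, and all other embeddings in the batch act as negatives. The function names, the temperature of 0.1, and the toy 2-D embeddings are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over anchors that have at least one positive.

    embeddings: list of equal-length, non-zero float vectors (hypothetical speaker embeddings)
    labels:     list of speaker identifiers, one per embedding
    temperature: assumed value; the paper's setting is not stated in the abstract
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled cosine similarity between the anchor and every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits.values())
        )
        # Mean negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denominator for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two views per speaker, e.g. an original and a perturbed segment.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_speakers = ["spk_a", "spk_a", "spk_b", "spk_b"]
toy_loss = supervised_contrastive_loss(toy_embeddings, toy_speakers)
```

In this toy batch the two views of each speaker lie close together, so the loss is small. Assigning the same labels to dissimilar embeddings raises the loss, which is the pressure that pulls same-speaker embeddings together regardless of their content.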

       
