Abstract:
To address the challenge of balancing content preservation with the transfer of target-speaker characteristics in one-shot voice conversion, this paper proposes a one-shot voice conversion model based on a joint contrastive learning strategy, built upon the Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) framework. The model applies contrastive constraints in the content and speaker feature spaces simultaneously. For content feature extraction, the output of the 23rd layer of the Waveform Language Model (WavLM) is paired with the output of the 24th layer of WavLM after K-Nearest Neighbors (KNN) matching to form sample pairs, achieving phoneme-level alignment and semantic consistency. For speaker feature extraction, contrastive samples are generated via temporal perturbation, and a supervised contrastive loss is adopted to extract speaker embeddings that are stable and content-independent. The generator is trained with a combination of a Conditional Variational AutoEncoder (CVAE) objective and an adversarial loss. Experimental results on the Voice Cloning Toolkit (VCTK) corpus show that the proposed model outperforms state-of-the-art baseline models on both objective and subjective evaluation metrics, confirming its effectiveness in improving speech naturalness, content preservation, and speaker similarity.
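As an illustration of the supervised contrastive loss mentioned for speaker feature extraction, the following is a minimal NumPy sketch of the standard formulation (Khosla et al., 2020). It is not the paper's implementation: the function name, the temperature value, and the toy embeddings are assumptions made for this example, and the exact variant used in the model is not specified in the abstract. Each row is treated as one speaker embedding, and rows that share a speaker label (for instance, an utterance and a perturbed view of it) are treated as positives.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch (hypothetical helper).

    embeddings: (N, D) array, one embedding per sample.
    labels:     (N,) array of speaker ids; equal ids mark positive pairs.
    temperature: assumed scaling value, not taken from the paper.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalise so the dot product is a cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax of each row over all *other* samples.
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp_sim = np.exp(sim) * not_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Average the log-probability over each anchor's positives,
    # skipping anchors that have no positive in the batch.
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return float(-mean_log_prob_pos.mean())


# Toy batch: two tight clusters of 2-D embeddings.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = supervised_contrastive_loss(toy, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy, [0, 1, 0, 1])
```

In this toy batch the loss is lower when the labels agree with the clusters (`loss_matching`) than when they cut across them (`loss_mismatched`), which is the behaviour that pushes embeddings of the same speaker together regardless of content.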