Zero-Shot Accent Conversion Model Based on the kNN Regression of Content Features
-
Graphical Abstract
-
Abstract
Accent conversion (AC) aims to convert speech from a source accent to a target accent while preserving both the source speaker's timbre and the speech content. Existing AC models generalize poorly to speech that does not follow the distribution of the training data, which seriously limits their applicability. To address this, a zero-shot AC model based on kNN regression of speech content features is proposed. First, the 23rd layer of WavLM is adopted as the content encoder to extract content features from both source-accented and target-accented speech, and kNN regression replaces each source-accented content feature with its nearest neighbors in a pool built from the target-accented content features, thereby achieving accent conversion. Second, to preserve the source speaker's timbre in the converted speech, a multi-speaker vocoder fuses the resulting target-accented content features with the source speaker's timbre feature, extracted by a speaker encoder, to synthesize speech with the target accent. Because the proposed model requires no source-accented speech at the training stage, it can convert speech with various source accents to the target accent, and thus generalizes well. Experimental results demonstrate that the proposed model achieves better objective and subjective evaluation results than existing parallel and non-parallel AC models.
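The kNN regression step over content features can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knn_regress`, the use of cosine similarity, the default `k=4`, and the uniform averaging of neighbors are all assumptions made here for illustration, and random arrays stand in for the WavLM content features.

```python
import numpy as np


def knn_regress(source_feats, target_pool, k=4):
    """Map each source frame to the mean of its k nearest frames in the pool.

    source_feats: (T_src, D) content features of the source-accented utterance.
    target_pool:  (T_pool, D) content features pooled from target-accented speech.
    Cosine similarity and uniform averaging are assumptions of this sketch.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    pool = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)
    sim = src @ pool.T  # (T_src, T_pool)

    # Indices of the k most similar pool frames for every source frame.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (T_src, k)

    # Average the selected (unnormalized) pool frames: (T_src, k, D) -> (T_src, D).
    return target_pool[idx].mean(axis=1)


# Stand-in data; 1024 is an assumed feature dimension, not taken from the text.
rng = np.random.default_rng(0)
source_feats = rng.standard_normal((50, 1024))
target_pool = rng.standard_normal((2000, 1024))
converted = knn_regress(source_feats, target_pool, k=4)
print(converted.shape)
```

Every output frame is built only from target-accented frames, which is why no source-accented speech is needed to train this step. In the described model, the converted features would then be passed, together with the source speaker's timbre feature, to the multi-speaker vocoder.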
-
-