GAN-Based Domain Adaptation Algorithm for Speaker Verification
Abstract: A key problem in the speaker verification task is the condition mismatch between training data and testing data, which can significantly degrade verification performance. In most speaker recognition application scenarios it is impossible to obtain enough samples to retrain the model, while the samples used to train the original model may differ considerably from those collected in real applications because of variability caused by intrinsic factors (e.g., changes in emotion, language, vocal effort, speaking style, and aging) or extrinsic ones (e.g., background noise, transmission channel, microphone, room acoustics, and distance from the microphone). To address this, an adversarial domain adaptation strategy is designed and applied to the X-Vector-based speaker verification scheme to enhance its domain adaptation ability. First, the X-Vector scheme is trained on the source dataset (AISHELL1). Then, the adaptation strategy is applied to the trained X-Vector scheme, enabling it to adapt to the target dataset (VoxCeleb1 or CN-Celeb). Finally, the performance of the X-Vector scheme before and after adaptation is compared on the target test set, showing that the proposed strategy reduces the Equal Error Rate (EER) by 21.46 and 19.24 percentage points (absolute) on VoxCeleb1 and CN-Celeb, respectively.
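The pipeline the abstract outlines follows the usual adversarial-adaptation recipe: freeze the source-trained embedding extractor, clone it as a target extractor, and train the clone against a domain discriminator until source and target embeddings become indistinguishable. The sketch below is a minimal PyTorch rendering of that loop, not the paper's exact training code; the module and data-loader names, learning rate, and epoch count are assumptions.

```python
# Minimal sketch of the adversarial adaptation loop described in the abstract.
# `source_encoder` and `discriminator` are assumed PyTorch modules; the
# data loaders are assumed to yield (features, labels) batches.
import copy
import torch
import torch.nn.functional as F

def adapt(source_encoder, discriminator, source_loader, target_loader,
          epochs=20, lr=1e-4, device="cpu"):
    source_encoder.eval()  # frozen source-domain embedding extractor
    target_encoder = copy.deepcopy(source_encoder).to(device).train()
    opt_e = torch.optim.Adam(target_encoder.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)

    for _ in range(epochs):
        for (src, _), (tgt, _) in zip(source_loader, target_loader):
            src, tgt = src.to(device), tgt.to(device)

            # 1) Train the discriminator to tell source embeddings (class 0)
            #    from target embeddings (class 1); embeddings are detached.
            with torch.no_grad():
                e_src = source_encoder(src)
                e_tgt = target_encoder(tgt)
            logits = discriminator(torch.cat([e_src, e_tgt]))
            labels = torch.cat([torch.zeros(len(e_src), dtype=torch.long),
                                torch.ones(len(e_tgt), dtype=torch.long)]).to(device)
            loss_d = F.cross_entropy(logits, labels)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # 2) Train the target encoder to fool the discriminator:
            #    target embeddings are labelled as "source" (class 0).
            logits_t = discriminator(target_encoder(tgt))
            fool = torch.zeros(len(logits_t), dtype=torch.long, device=device)
            loss_e = F.cross_entropy(logits_t, fool)
            opt_e.zero_grad(); loss_e.backward(); opt_e.step()
    return target_encoder
```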
Key words: speaker verification, domain adaptation, adversarial network
Table 1. Network structure of X-Vector model

Layer                  Context        Dim
TDNN-ReLU              t−2, t+2       512
TDNN-ReLU              t−2, t, t+2    512
TDNN-ReLU              t−3, t, t+3    512
TDNN-ReLU              t              512
TDNN-ReLU              t              1500
Pooling (mean+stddev)  Full-seq       3000
Dense-ReLU             -              512
Dense-ReLU             -              512
Dense-Softmax          -              Speakers
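Table 1 maps naturally onto dilated 1-D convolutions over the frame axis. Below is one possible PyTorch transcription; the input feature dimension n_mfcc and the speaker count n_speakers are placeholders, not values taken from the paper.

```python
# One possible PyTorch rendering of Table 1; each TDNN layer is a dilated
# Conv1d. `n_mfcc` and `n_speakers` are assumed placeholders.
import torch
import torch.nn as nn

class XVector(nn.Module):
    def __init__(self, n_mfcc=30, n_speakers=1000):
        super().__init__()
        self.tdnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 512, kernel_size=5, dilation=1), nn.ReLU(),  # t-2..t+2
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),     # {t-2, t, t+2}
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),     # {t-3, t, t+3}
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),                 # t
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),                # t
        )
        self.fc1 = nn.Linear(3000, 512)  # embedding layer (the "x-vector")
        self.fc2 = nn.Linear(512, 512)
        self.out = nn.Linear(512, n_speakers)  # softmax is folded into the loss

    def embed(self, x):                  # x: (batch, n_mfcc, frames)
        h = self.tdnn(x)
        # mean+stddev statistics pooling over the full sequence -> 3000-d
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.fc1(stats)           # 512-d speaker embedding

    def forward(self, x):
        emb = torch.relu(self.embed(x))
        return self.out(torch.relu(self.fc2(emb)))  # speaker logits
```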
Table 2. Network structure of discriminator
Layer        Input dim   Output dim
Dense1-ReLU  512         512
Dense2-ReLU  512         512
Dense3-ReLU  256         64
Softmax      64          2
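The discriminator is a small feed-forward classifier over 512-dimensional embeddings, trained with a two-class cross-entropy loss as in the adaptation loop sketched earlier. A direct PyTorch transcription follows; note that Table 2 lists Dense2's output as 512 but Dense3's input as 256, so the sketch assumes 512 to keep the layers composable.

```python
import torch.nn as nn

# Discriminator from Table 2. The printed table gives Dense3's input as 256
# although Dense2 outputs 512; 512 is assumed here so the layers compose.
discriminator = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),  # Dense1-ReLU
    nn.Linear(512, 512), nn.ReLU(),  # Dense2-ReLU
    nn.Linear(512, 64),  nn.ReLU(),  # Dense3-ReLU
    nn.Linear(64, 2),                # Softmax over {source, target},
)                                    # applied inside the cross-entropy loss
```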
Table 3. Performance comparison before and after domain adaptation
Schemes                    EER/% (VoxCeleb1)   EER/% (CN-Celeb)
PLDA (Before adaptation)   30.57               35.07
PLDA (After adaptation)    9.11                15.83
CDS (Before adaptation)    32.69               43.58
CDS (After adaptation)     15.41               20.36
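The two back ends in Table 3 score trials differently: PLDA is a learned probabilistic scorer, while CDS is plain cosine similarity between embeddings. A minimal NumPy sketch of CDS scoring and the EER metric reported in the tables (helper names are ours, not from the paper):

```python
# Cosine distance scoring (CDS) and equal error rate (EER).
import numpy as np

def cds(e1, e2):
    """Cosine similarity between two embedding vectors."""
    return np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))

def eer(scores, labels):
    """EER: the operating point where false-accept and false-reject
    rates meet. `labels` is 1 for target trials, 0 for impostor trials."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(scores)[::-1]          # accept highest scores first
    labels = labels[order]
    far = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)      # false accepts
    frr = 1 - np.cumsum(labels == 1) / max((labels == 1).sum(), 1)  # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2          # multiply by 100 for EER/%
```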
Table 4. Performance comparison between the proposed algorithm and DANN
Schemes                        VoxCeleb1              CN-Celeb
                               EER/%   DCF            EER/%   DCF
DANN (After adaptation)        12.97   0.5363×10⁻²    16.50   0.6962×10⁻²
This paper (After adaptation)  9.11    0.3478×10⁻²    15.83   0.6744×10⁻²
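Table 4 additionally reports the detection cost function (DCF), a weighted combination of miss and false-alarm probabilities minimized over all score thresholds. A common way to compute it is sketched below; the cost parameters (c_miss, c_fa, p_target) are conventional NIST-style defaults, not values stated in the paper.

```python
# Minimum detection cost over all thresholds (NIST-style weighting assumed).
import numpy as np

def min_dcf(scores, labels, c_miss=10.0, c_fa=1.0, p_target=0.01):
    scores, labels = np.asarray(scores), np.asarray(labels)
    best = np.inf
    for t in np.unique(scores):
        p_miss = np.mean(scores[labels == 1] < t)   # targets rejected
        p_fa = np.mean(scores[labels == 0] >= t)    # impostors accepted
        cost = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
        best = min(best, cost)
    return best
```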