GAN-Based Domain Adaptation for Gender Identification
-
Abstract: Gender identification is an important task in speaker verification and can also serve as an auxiliary tool in automatic speech recognition (ASR) to improve model performance. To increase the accuracy of gender identification, several deep-learning-based schemes have recently been reported. However, compared with the acoustically controlled data used for training, speech data in real application scenarios is usually masked by background noise such as music, environmental sounds, and background chatter. As a result, the performance of audio-based gender identification models degrades severely because of the large mismatch between real speech data and the training data. To solve this problem, we propose a domain-adaptive model that combines a generative adversarial network (GAN) with a GhostVLAD layer. The GhostVLAD layer effectively reduces the interference of noise and irrelevant information in speech, while the GAN-based training method adapts the model to the target-domain data. During adversarial training, an auxiliary loss is introduced to preserve the network's ability to represent gender characteristics. Taking the VoxCeleb1 dataset as the source domain and the Audioset and Movie datasets as the target domains, we evaluate the performance of the proposed domain-adaptive model. The results show that, compared with a gender identification model based on a convolutional neural network, the proposed model improves gender identification accuracy by 5.13% and 7.72%, respectively.
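For concreteness, the adversarial adaptation scheme summarized above can be sketched as follows. This is a minimal PyTorch illustration assuming an ADDA-style split into a feature encoder, a gender classifier, and a domain discriminator; the function names, the loss weight `lam`, and the optimizer handling are illustrative assumptions rather than the paper's exact training recipe.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # domain (source vs. target) discrimination loss
ce = nn.CrossEntropyLoss()     # auxiliary gender classification loss


def adaptation_step(encoder, classifier, discriminator, opt_d, opt_e,
                    src_x, src_y, tgt_x, lam=0.1):
    """One adversarial adaptation step: update the domain discriminator,
    then update the encoder to fool it while keeping gender accuracy."""
    ones = lambda x: torch.ones(len(x), 1, device=x.device)
    zeros = lambda x: torch.zeros(len(x), 1, device=x.device)

    # 1) Discriminator: tell source embeddings (label 1) from target ones (label 0).
    with torch.no_grad():
        src_e, tgt_e = encoder(src_x), encoder(tgt_x)
    d_loss = bce(discriminator(src_e), ones(src_x)) + \
             bce(discriminator(tgt_e), zeros(tgt_x))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Encoder: make target embeddings look like source ones (GAN objective),
    #    while an auxiliary loss on labelled source speech preserves the
    #    gender-discriminative structure of the embedding space.
    adv_loss = bce(discriminator(encoder(tgt_x)), ones(tgt_x))
    aux_loss = ce(classifier(encoder(src_x)), src_y)
    e_loss = adv_loss + lam * aux_loss
    opt_e.zero_grad()
    e_loss.backward()
    opt_e.step()
    return d_loss.item(), e_loss.item()
```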
-
Table 1. Structure of the CNN-GV model
Layer (input size 96 × 64 × 1)        Output size
Conv2d, 3 × 3, 32, stride (1, 1)      96 × 64 × 32
Conv2d, 3 × 3, 32, stride (1, 1)      96 × 64 × 32
Conv2d, 3 × 3, 64, stride (2, 2)      48 × 32 × 64
Conv2d, 3 × 3, 64, stride (1, 1)      48 × 32 × 64
Max pool, 1 × 2, stride (1, 2)        48 × 16 × 64
Conv2d, 3 × 3, 128, stride (2, 2)     24 × 8 × 128
Conv2d, 3 × 3, 128, stride (1, 1)     24 × 8 × 128
Max pool, 2 × 2, stride (2, 2)        12 × 4 × 128
Conv2d, 3 × 3, 256, stride (2, 2)     6 × 2 × 256
GhostVLAD                             5120
Dense, 1024                           1024
Dense, 256                            256
Softmax, 2                            2
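To make Table 1 concrete, the sketch below reimplements the CNN-GV layout in PyTorch, with a GhostVLAD-style aggregation layer standing in for the "GhostVLAD" row. The number of VLAD clusters (20, so that 20 × 256 = 5120), the number of ghost clusters, the batch normalization, and the padding choices are assumptions inferred from the table, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GhostVLAD(nn.Module):
    """NetVLAD with extra 'ghost' clusters that absorb noisy descriptors
    and are dropped from the final representation."""

    def __init__(self, dim=256, clusters=20, ghost_clusters=2):
        super().__init__()
        self.dim, self.clusters = dim, clusters
        total = clusters + ghost_clusters
        self.assign = nn.Conv2d(dim, total, kernel_size=1)      # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(total, dim) * 0.1)

    def forward(self, x):                        # x: (B, D, H, W)
        a = F.softmax(self.assign(x), dim=1)     # (B, K+G, H, W)
        a = a.flatten(2)                         # (B, K+G, N), N = H*W
        x = x.flatten(2)                         # (B, D, N)
        # residuals to every centroid, weighted by the soft assignments
        res = x.unsqueeze(1) - self.centroids.unsqueeze(0).unsqueeze(-1)  # (B, K+G, D, N)
        vlad = (a.unsqueeze(2) * res).sum(-1)    # (B, K+G, D)
        vlad = vlad[:, : self.clusters]          # drop the ghost clusters
        vlad = F.normalize(vlad, dim=2)          # intra-normalization
        return F.normalize(vlad.flatten(1), dim=1)   # (B, K*D) = (B, 5120)


class CNNGV(nn.Module):
    """CNN front-end from Table 1 followed by GhostVLAD aggregation."""

    def __init__(self, num_classes=2):
        super().__init__()

        def conv(cin, cout, stride):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        self.features = nn.Sequential(
            conv(1, 32, 1), conv(32, 32, 1),
            conv(32, 64, 2), conv(64, 64, 1),
            nn.MaxPool2d((1, 2), (1, 2)),
            conv(64, 128, 2), conv(128, 128, 1),
            nn.MaxPool2d(2, 2),
            conv(128, 256, 2),
        )                                         # (B, 1, 96, 64) -> (B, 256, 6, 2)
        self.vlad = GhostVLAD(dim=256, clusters=20, ghost_clusters=2)
        self.embed = nn.Sequential(nn.Linear(5120, 1024), nn.ReLU(inplace=True),
                                   nn.Linear(1024, 256), nn.ReLU(inplace=True))
        self.head = nn.Linear(256, num_classes)  # softmax applied inside the loss

    def forward(self, x):
        e = self.embed(self.vlad(self.features(x)))
        return self.head(e), e                    # logits and 256-d embedding


if __name__ == "__main__":
    logits, emb = CNNGV()(torch.randn(4, 1, 96, 64))
    print(logits.shape, emb.shape)                # torch.Size([4, 2]) torch.Size([4, 256])
```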
Table 2. Performance comparison of models
Model                              UWA (VoxCeleb1)    UWA (Audioset)    UWA (Movie)
Literature [9]                     0.9470             0.7988            0.7657
CNN-GV-DA (without GhostVLAD)      0.9656             0.8270            0.8064
CNN-GV                             0.9539             0.8111            0.7734
CNN-GV-DA                          0.9657             0.8501            0.8429
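The comparison above is reported in UWA. Assuming UWA denotes unweighted (class-balanced) accuracy, i.e. the mean of per-class recalls, it can be computed as in the short sketch below; that reading of the abbreviation is an assumption, since the table does not expand it.

```python
import numpy as np

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class counts equally regardless of size."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

print(unweighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1]))  # (2/3 + 1/1) / 2 ≈ 0.833
```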
[1] HARB H, CHEN L. Gender identification using a general audio classifier[C]//2003 International Conference on Multimedia and Expo. USA: IEEE, 2003: 733-736.
[2] ABDULLA W H, KASABOV N K. Improving speech recognition performance through gender separation[C]//Fifth Biannual Conference on Artificial Neural Networks and Expert Systems (ANNES). USA: IEEE, 2001: 218-222.
[3] ORE B M, SLYH R E, HANSEN E G. Speaker segmentation and clustering using gender information[C]//2006 IEEE Odyssey - The Speaker and Language Recognition Workshop. USA: IEEE, 2006: 1-8.
[4] KUMAR N, NASIR M, GEORGIOU P, et al. Robust multichannel gender classification from speech in movie audio[C]//Interspeech 2016. [s.l.]: [s.n.], 2016: 2233-2237.
[5] WU K, CHILDERS D G. Gender recognition from speech: Part I. Coarse analysis[J]. The Journal of the Acoustical Society of America, 1991, 90(4): 1828-1840. doi: 10.1121/1.401663
[6] YÜCESOY E, NABIYEV V V. Gender identification of a speaker using MFCC and GMM[C]//2013 8th International Conference on Electrical and Electronics Engineering (ELECO). Turkey: IEEE, 2013: 626-629.
[7] SENOUSSAOUI M, KENNY P, BRÜMMER N, et al. Mixture of PLDA models in i-vector space for gender-independent speaker recognition[C]//Interspeech 2011, 12th Annual Conference of the International Speech Communication Association. Italy: DBLP, 2011: 25-28.
[8] EL SHAFEY L, KHOURY E, MARCEL S. Audio-visual gender recognition in uncontrolled environment using variability modeling techniques[C]//IEEE International Joint Conference on Biometrics (IJCB). USA: IEEE, 2014: 1-8.
[9] DOUKHAN D, CARRIVE J, VALLET F, et al. An open-source speaker gender detection framework for monitoring gender equality[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Canada: IEEE, 2018: 5214-5218.
[10] LIEW S S, HANI M K, RADZI S A, et al. Gender classification: A convolutional neural network approach[J]. Turkish Journal of Electrical Engineering and Computer Sciences, 2016, 24(3): 1248-1264.
[11] DU J, NA X, LIU X, et al. AISHELL-2: Transforming Mandarin ASR research into industrial scale[EB/OL]. (2018-08-31)[2020-12-20]. https://arxiv.org/abs/1808.10583.
[12] HEBBAR R, SOMANDEPALLI K, NARAYANAN S S. Improving gender identification in movie audio using cross-domain data[C]//Interspeech 2018. [s.l.]: [s.n.], 2018: 282-286.
[13] GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio Set: An ontology and human-labeled dataset for audio events[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). USA: IEEE, 2017: 776-780.
[14] ZHONG Y, ARANDJELOVIĆ R, ZISSERMAN A. GhostVLAD for set-based face recognition[C]//Asian Conference on Computer Vision. Cham: Springer, 2018: 35-50.
[15] ARANDJELOVIĆ R, GRONAT P, TORII A, et al. NetVLAD: CNN architecture for weakly supervised place recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2016: 5297-5307.
[16] XIE W, NAGRANI A, CHUNG J S, et al. Utterance-level aggregation for speaker recognition in the wild[C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). UK: IEEE, 2019: 5791-5795.
[17] BEN-DAVID S, BLITZER J, CRAMMER K, et al. Analysis of representations for domain adaptation[C]//Advances in Neural Information Processing Systems. USA: MIT Press, 2007: 137-144.
[18] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[J]. Advances in Neural Information Processing Systems, 2014, 27: 2672-2680.
[19] TZENG E, HOFFMAN J, SAENKO K, et al. Adversarial discriminative domain adaptation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2017: 7167-7176.
[20] SHEN J, QU Y, ZHANG W, et al. Wasserstein distance guided representation learning for domain adaptation[EB/OL]. (2017-07-05)[2020-12-20]. https://arxiv.org/abs/1707.01217.
[21] ODENA A, OLAH C, SHLENS J. Conditional image synthesis with auxiliary classifier GANs[C]//International Conference on Machine Learning (PMLR). [s.l.]: [s.n.], 2017: 2642-2651.
[22] NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[EB/OL]. (2017-06-26)[2020-12-20]. https://arxiv.org/abs/1706.08612.