  • ISSN 1006-3080
  • CN 31-1691/TQ

GAN-Based Domain Adaptation for Gender Identification

LV Qiaojian, CHEN Ning

Citation: LV Qiaojian, CHEN Ning. GAN-Based Domain Adaptation for Gender Identification[J]. Journal of East China University of Science and Technology (Natural Science Edition). doi: 10.14135/j.cnki.1006-3080.20210104002

doi: 10.14135/j.cnki.1006-3080.20210104002
Foundation item: General Program of the National Natural Science Foundation of China (61771196)
Author information:

    LV Qiaojian (born 1995), male, a native of Zhejiang Province, is a master's student whose research focuses on audio signal processing. E-mail: y45180171@mail.ecust.edu.cn

Corresponding author:

    CHEN Ning, E-mail: chenning_750210@163.com

  • CLC number: TP391

  • Abstract: In practical application scenarios, real-world speech differs substantially from the data on which audio-based gender identification models are trained, which severely degrades their performance. To address this problem, this paper proposes a domain adaptation model that combines a GAN with a GhostVLAD layer. The GhostVLAD layer effectively suppresses the interference of noise and irrelevant information in speech, while the GAN-style training procedure adapts the model to the target-domain data. During adversarial training, an auxiliary loss is introduced to preserve the network's ability to represent gender features. The model is evaluated with Voxceleb1 as the source domain and Audioset and Movie as the two target domains. Experimental results show that, compared with a convolutional-neural-network-based gender identification model, the proposed model improves gender identification accuracy by 5.13% and 7.72%, respectively.
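
    To make the GAN-based adaptation scheme concrete, the sketch below shows one ADDA-style training step with an auxiliary gender-classification loss, in the spirit of adversarial discriminative domain adaptation [19] and the auxiliary-classifier idea of [21]. It is a minimal PyTorch sketch: the module names (enc_src, enc_tgt, disc, clf) and the loss weight lam are illustrative assumptions, not the paper's exact training procedure.

```python
import torch
import torch.nn.functional as F

def adaptation_step(enc_src, enc_tgt, disc, clf,
                    x_src, y_src, x_tgt, opt_disc, opt_enc, lam=0.1):
    """One ADDA-style adversarial adaptation step (a sketch, not the paper's
    exact algorithm). enc_src is a frozen source-domain encoder; enc_tgt is
    the encoder being adapted; disc is a domain discriminator; clf is an
    auxiliary gender classifier. opt_enc is assumed to cover the parameters
    of both enc_tgt and clf."""
    # 1) Update the discriminator: source embeddings labeled 1, target 0.
    with torch.no_grad():
        e_src, e_tgt = enc_src(x_src), enc_tgt(x_tgt)
    d_src, d_tgt = disc(e_src), disc(e_tgt)
    loss_d = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    opt_disc.zero_grad(); loss_d.backward(); opt_disc.step()

    # 2) Update the target encoder: fool the discriminator (inverted labels),
    #    plus an auxiliary gender loss on labeled source audio so the adapted
    #    embeddings stay discriminative for gender.
    d_fool = disc(enc_tgt(x_tgt))
    loss_adv = F.binary_cross_entropy_with_logits(d_fool, torch.ones_like(d_fool))
    loss_aux = F.cross_entropy(clf(enc_tgt(x_src)), y_src)
    loss_enc = loss_adv + lam * loss_aux
    opt_enc.zero_grad(); loss_enc.backward(); opt_enc.step()
    return loss_d.item(), loss_enc.item()
```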

     

  • Figure 1. Block diagram of the gender identification model

    Table 1. Structure of the CNN-GV model

    Layer (input: 96 × 64 × 1)           Output size
    Conv2d, 3 × 3, 32, stride (1, 1)     96 × 64 × 32
    Conv2d, 3 × 3, 32, stride (1, 1)     96 × 64 × 32
    Conv2d, 3 × 3, 64, stride (2, 2)     48 × 32 × 64
    Conv2d, 3 × 3, 64, stride (1, 1)     48 × 32 × 64
    Max pool, 1 × 2, stride (1, 2)       48 × 16 × 64
    Conv2d, 3 × 3, 128, stride (2, 2)    24 × 8 × 128
    Conv2d, 3 × 3, 128, stride (1, 1)    24 × 8 × 128
    Max pool, 2 × 2, stride (2, 2)       12 × 4 × 128
    Conv2d, 3 × 3, 256, stride (2, 2)    6 × 2 × 256
    GhostVLAD                            5120
    Dense, 1024                          1024
    Dense, 256                           256
    Softmax, 2                           2
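
    As a concrete reading of Table 1, the following PyTorch sketch reconstructs the CNN-GV network together with a simplified GhostVLAD pooling layer [14, 16], treating each of the 6 × 2 spatial positions of the final 256-channel map as a local descriptor. The ReLU activations, padding of 1, and the cluster counts (20 real plus 2 ghost clusters, giving 20 × 256 = 5120 output dimensions) are assumptions chosen so that every size in the table is reproduced; they are not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLAD(nn.Module):
    """Simplified GhostVLAD pooling: local descriptors are soft-assigned to
    K real + G "ghost" clusters; residuals of the ghost clusters are dropped,
    which down-weights noisy or irrelevant frames."""
    def __init__(self, dim=256, clusters=20, ghost=2):
        super().__init__()
        self.clusters = clusters
        self.assign = nn.Conv2d(dim, clusters + ghost, kernel_size=1)
        self.centroids = nn.Parameter(torch.randn(clusters + ghost, dim))

    def forward(self, x):                                    # x: (B, D, H, W)
        soft = F.softmax(self.assign(x), dim=1).flatten(2)   # (B, K+G, N)
        desc = x.flatten(2)                                  # (B, D, N)
        res = desc.unsqueeze(1) - self.centroids[None, :, :, None]
        vlad = (soft.unsqueeze(2) * res).sum(-1)             # (B, K+G, D)
        vlad = F.normalize(vlad[:, :self.clusters], dim=2)   # drop ghost clusters
        return F.normalize(vlad.flatten(1), dim=1)           # (B, K*D) = (B, 5120)

class CNNGV(nn.Module):
    """CNN-GV laid out as in Table 1; padding=1 makes every feature-map size
    match the table."""
    def __init__(self):
        super().__init__()
        def conv(cin, cout, stride):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.ReLU())
        self.features = nn.Sequential(
            conv(1, 32, 1), conv(32, 32, 1),
            conv(32, 64, 2), conv(64, 64, 1),
            nn.MaxPool2d((1, 2), (1, 2)),
            conv(64, 128, 2), conv(128, 128, 1),
            nn.MaxPool2d(2, 2),
            conv(128, 256, 2),                   # -> (B, 256, 6, 2)
        )
        self.pool = GhostVLAD(dim=256, clusters=20, ghost=2)
        self.head = nn.Sequential(
            nn.Linear(5120, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 2),                   # softmax folded into the loss
        )

    def forward(self, x):                        # x: (B, 1, 96, 64)
        return self.head(self.pool(self.features(x)))
```

    A quick shape check: CNNGV()(torch.randn(2, 1, 96, 64)) returns logits of shape (2, 2), and the GhostVLAD output carries the 5120 dimensions listed in Table 1.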

    Table 2. Model performance comparison

    Model                            UWA
                                     Voxceleb1    Audioset    Movie
    Literature [9]                   0.9470       0.7988      0.7657
    CNN-GV-DA (without GhostVLAD)    0.9656       0.8270      0.8064
    CNN-GV                           0.9539       0.8111      0.7734
    CNN-GV-DA                        0.9657       0.8501      0.8429
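
    The gains quoted in the abstract can be read directly off Table 2: on Audioset, CNN-GV-DA improves on the CNN-based model of literature [9] by 0.8501 − 0.7988 = 0.0513, i.e. 5.13 percentage points, and on Movie by 0.8429 − 0.7657 = 0.0772, i.e. 7.72 points.
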
  • [1] HARB H, CHEN L. Gender identification using a general audio classifier[C]//2003 International Conference on Multimedia and Expo. USA: IEEE, 2003: 733-736.
    [2] ABDULLA W H, KASABOV N K. Improving speech recognition performance through gender separation[C]// Fifth Biannual Conference on Artificial Neural Networks and Expert Systems (ANNES). USA: IEEE, 2001: 218-222.
    [3] ORE B M, SLYH R E, HANSEN E G. Speaker segmentation and clustering using gender information[C]//2006 IEEE Odyssey-The Speaker and Language Recognition Workshop. USA: IEEE, 2006: 1-8.
    [4] KUMAR N, NASIR M, GEORGIOU P, et al. Robust multichannel gender classification from speech in movie audio[C]// Interspeech 2016. [s. l. ]: [s. n. ], 2016: 2233-2237.
    [5] WU K, CHILDERS D G. Gender recognition from speech: Part I. Coarse analysis[J]. The Journal of the Acoustical Society of America, 1991, 90(4): 1828-1840. doi: 10.1121/1.401663
    [6] YÜCESOY E, NABIYEV V V. Gender identification of a speaker using MFCC and GMM[C]//2013 8th International Conference on Electrical and Electronics Engineering (ELECO). Turkey: IEEE, 2013: 626-629.
    [7] SENOUSSAOUI M, KENNY P, BRÜMMER N, et al. Mixture of PLDA models in i-vector space for gender-independent speaker recognition[C]//Interspeech 2011: 12th Annual Conference of the International Speech Communication Association. Italy: DBLP, 2011: 25-28.
    [8] El SHAFEY L, KHOURY E, MARCEL S. Audio-visual gender recognition in uncontrolled environment using variability modeling techniques[C]//IEEE International Joint Conference on Biometrics (IJCB). USA: IEEE, 2014: 1-8.
    [9] DOUKHAN D, CARRIE J, VALLET F, et al. An open-source speaker gender detection framework for monitoring gender equality[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Canada: IEEE, 2018: 5214-5218.
    [10] LIEW S S, HANI M K, RADZI S A, et al. Gender classification: A convolutional neural network approach[J]. Turkish Journal of Electrical Engineering and Computer Sciences, 2016, 24(3): 1248-1264.
    [11] DU J, NA X, LIU X, et al. AISHELL-2: Transforming Mandarin ASR research into industrial scale[EB/OL]. arXiv, (2018-08-31)[2020-12-20]. https://arxiv.org/abs/1808.10583.
    [12] HEBBAR R, SOMANDEPALLI K, NARAYANAN S S. Improving gender identification in movie audio using cross-domain data[C]// Interspeech 2018. [s. l. ]: [s. n. ], 2018: 282-286.
    [13] GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: An ontology and human-labeled dataset for audio events[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). USA: IEEE, 2017: 776-780.
    [14] ZHONG Y, ARANDJELOVIĆ R, ZISSERMAN A. GhostVLAD for set-based face recognition[C]// Asian Conference on Computer Vision. Cham: Springer, 2018: 35-50.
    [15] ARANDJELOVIC R, GRONAT P, TORII A, et al. NetVLAD: CNN architecture for weakly supervised place recognition[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2016: 5297-5307.
    [16] XIE W, NAGRANI A, CHUNG J S, et al. Utterance-level aggregation for speaker recognition in the wild[C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). UK: IEEE, 2019: 5791-5795.
    [17] BEN-DAVID S, BLITZER J, CRAMMER K, et al. Analysis of representations for domain adaptation[C]//Advances in Neural Information Processing Systems. USA: MIT Press, 2007: 137-144.
    [18] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[J]. Advances in Neural Information Processing Systems, 2014, 27: 2672-2680.
    [19] TZENG E, HOFFMAN J, SAENKO K, et al. Adversarial discriminative domain adaptation[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2017: 7167-7176.
    [20] SHEN J, QU Y, ZHANG W, et al. Wasserstein distance guided representation learning for domain adaptation[EB/OL]. arXiv, (2017-07-05)[2020-12-20]. https://arxiv.org/abs/1707.01217.
    [21] ODENA A, OLAH C, SHLENS J. Conditional image synthesis with auxiliary classifier GANs[C]//International Conference on Machine Learning (PMLR). [s. l. ]: [s. n. ], 2017: 2642-2651.
    [22] NAGRANI A, CHUNG J S, ZISSERMAN A. VoxCeleb: A large-scale speaker identification dataset[EB/OL]. arXiv, (2017-06-26)[2020-12-20]. https://arxiv.org/abs/1706.08612.

Publication history
  • Received: 2021-01-04
  • Published online: 2021-04-27
