
  • ISSN 1006-3080
  • CN 31-1691/TQ

GAN-based Domain Adaptation Algorithm for Speaker Verification

JI Minfei, CHEN Ning

Citation: JI Minfei, CHEN Ning. GAN-based Domain Adaptation Algorithm for Speaker Verification[J]. Journal of East China University of Science and Technology. doi: 10.14135/j.cnki.1006-3080.20201209001


doi: 10.14135/j.cnki.1006-3080.20201209001
Funding: General Program of the National Natural Science Foundation of China (61771196)
Details
    About the author:

    JI Minfei (1994-), male, from Shanghai, master's student; main research interest: audio signal processing. E-mail: Y45180166@mail.ecust.edu.cn

    Corresponding author:

    CHEN Ning, E-mail: chenning_750210@163.com

  • CLC number: TP391

GAN-based Domain Adaptation Algorithm for Speaker Verification

  • Abstract: In speaker verification, recognition accuracy often degrades because real-world speech differs from the model's training corpus in intrinsic characteristics (emotion, language, speaking style, age) or extrinsic characteristics (background noise, transmission channel, microphone, room reverberation). To address this, a GAN-based domain adaptation algorithm for speaker verification is proposed. First, an X-Vector speaker verification model is trained on source-domain speech; then, a domain adaptation method transfers the source-trained X-Vector model to the target-domain training data; finally, the adapted model is evaluated on the target-domain test data and compared with the model before adaptation. In the experiments, AISHELL1 serves as the source domain, and VoxCeleb1 and CN-Celeb serve as the target domains. The results show that after adaptation with the proposed method, the equal error rate (EER) on the VoxCeleb1 and CN-Celeb target-domain test sets drops by 21.46% and 19.24%, respectively.


  • Figure 1. Block diagram of X-Vector model

    Figure 2. Block diagram of GAN-DASV model

    Figure 3. DET curves comparison before and after domain adaptation

    Table 1. Network structure of X-Vector model

    Layer                  Context       Dim
    TDNN-ReLU              t-2, t+2      512
    TDNN-ReLU              t-2, t, t+2   512
    TDNN-ReLU              t-3, t, t+3   512
    TDNN-ReLU              t             512
    TDNN-ReLU              t             1500
    Pooling (mean+stddev)  Full-seq      3000
    Dense-ReLU             -             512
    Dense-ReLU             -             512
    Dense-Softmax          -             Speakers
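The pooling row in Table 1 is the step that turns frame-level features into a fixed-length utterance representation: the per-dimension mean and standard deviation over all frames are concatenated, mapping the 1500-dim outputs of the last frame-level TDNN layer to the 3000-dim vector listed above. A minimal NumPy sketch of this statistics pooling (the random input is only a stand-in for real TDNN outputs):

```python
import numpy as np

def stats_pooling(frames: np.ndarray) -> np.ndarray:
    """Concatenate per-dimension mean and stddev over the time axis.

    frames: (T, D) frame-level features -> (2*D,) utterance-level vector.
    """
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(0)
# Stand-in for the 1500-dim output of the last frame-level TDNN layer
frames = rng.standard_normal((200, 1500))  # 200 frames
embedding_in = stats_pooling(frames)       # 3000-dim, as in Table 1
print(embedding_in.shape)                  # (3000,)
```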

    Table 2. Network structure of discriminator

    Layer        Input dim  Output dim
    Dense1-ReLU  512        512
    Dense2-ReLU  512        512
    Dense3-ReLU  256        64
    Softmax      64         2
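The discriminator is a small fully connected network ending in a two-way softmax that decides whether an embedding came from the source or the target domain. A minimal NumPy forward pass for such a classifier; the weights are random placeholders, and the hidden sizes 512→512→256→64→2 are an assumption that reads Table 2 as one consistent chain:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random placeholder weights; sizes assume the chain 512 -> 512 -> 256 -> 64
dims = [512, 512, 256, 64]
weights = [rng.standard_normal((i, o)) * 0.05 for i, o in zip(dims[:-1], dims[1:])]
w_out = rng.standard_normal((64, 2)) * 0.05  # final 2-way softmax layer

def discriminate(x: np.ndarray) -> np.ndarray:
    """Forward pass: hidden ReLU layers, then softmax over {source, target}."""
    for w in weights:
        x = relu(x @ w)
    return softmax(x @ w_out)

p = discriminate(rng.standard_normal(512))  # probabilities over the two domains
```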

    Table 3. Performance comparison before and after domain adaptation

    Scheme                    EER/%
                              VoxCeleb1  CN-Celeb
    PLDA (Before adaptation)  30.57      35.07
    PLDA (After adaptation)   9.11       15.83
    CDS (Before adaptation)   32.69      43.58
    CDS (After adaptation)    15.41      20.36
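The EER values in Table 3 are the operating points at which the false-acceptance and false-rejection rates coincide. A small NumPy sketch of how EER can be estimated from genuine (same-speaker) and impostor (different-speaker) trial scores; the toy scores below are invented for illustration:

```python
import numpy as np

def eer(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Equal error rate: sweep thresholds, return the point where FAR ~= FRR."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Toy trial scores (higher = more likely the same speaker)
genuine = np.array([0.9, 0.8, 0.75, 0.6, 0.4])
impostor = np.array([0.5, 0.3, 0.2, 0.1, 0.05])
print(f"EER = {eer(genuine, impostor):.2%}")  # EER = 20.00%
```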

    Table 4. Performance comparison between the proposed algorithm and DANN

    Scheme                         VoxCeleb1             CN-Celeb
                                   EER/%  DCF            EER/%  DCF
    DANN (After adaptation)        12.97  0.5363×10−2    16.5   0.6962×10−2
    This paper (After adaptation)  9.11   0.3478×10−2    15.83  0.6744×10−2
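The DCF column in Table 4 is a detection cost that weights misses and false alarms by application-dependent costs and a target-speaker prior. A sketch of the standard minimum-DCF computation; the parameters C_miss = C_fa = 1 and P_target = 0.01 are assumed common defaults, not values taken from this paper:

```python
import numpy as np

def min_dcf(genuine, impostor, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all score thresholds (NIST-style DCF)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    costs = []
    for t in thresholds:
        p_miss = (genuine < t).mean()   # target trials rejected
        p_fa = (impostor >= t).mean()   # impostor trials accepted
        costs.append(c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target))
    return min(costs)

genuine = np.array([0.9, 0.8, 0.75, 0.6, 0.4])   # toy same-speaker scores
impostor = np.array([0.5, 0.3, 0.2, 0.1, 0.05])  # toy different-speaker scores
print(f"minDCF = {min_dcf(genuine, impostor):.4f}")  # minDCF = 0.0020
```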
Publication history
  • Received: 2020-12-09
  • Published online: 2021-04-07
