Acoustic Scene Classification Model Based on Multi-Instance Analysis of Cochleagram

LIN Qiaoying, CHEN Ning

Citation: LIN Qiaoying, CHEN Ning. Acoustic Scene Classification Model Based on Multi-Instance Analysis of Cochleagram[J]. Journal of East China University of Science and Technology. doi: 10.14135/j.cnki.1006-3080.20201124001


doi: 10.14135/j.cnki.1006-3080.20201124001
Funding: National Natural Science Foundation of China General Program (61771196)
Details
    Author biography:

    LIN Qiaoying (born 1995), female, a native of Fujian, master's student; main research interest: audio signal processing. E-mail: y30180621@mail.ecust.edu.cn

    Corresponding author:

    CHEN Ning, E-mail: chenning_750210@163.com

  • CLC number: TP391


  • Abstract: Acoustic scene classification (ASC) is one of the most challenging tasks in computational auditory scene analysis. Traditional ASC models mostly combine hand-crafted features based on linear-frequency analysis with deep-learning classification models. However, on the one hand, feature extraction based on linear-frequency analysis cannot simulate the nonlinear frequency selectivity of the basilar membrane of the human ear, which results in low feature resolution; on the other hand, existing classification models cannot overcome the low classification accuracy caused by complex sound sources and highly overlapping audio events. To address these problems, an ASC model based on multi-instance analysis of the cochleagram is proposed. First, the signal spectrum is filtered with a cosine filter bank whose center frequencies are uniformly distributed on the equivalent rectangular bandwidth (ERB) scale, so as to simulate human auditory perception. Second, multi-instance learning (MIL) is introduced to characterize the structure of the data as a whole and thereby improve classification accuracy. In addition, to resist the frequency shifts of audio events, average pooling is adopted in the prediction aggregator of the MIL classification model. Experimental results on the Task 1a datasets of the DCASE 2018 and DCASE 2019 challenges show that the proposed model achieves higher classification accuracy than the baseline system of the DCASE 2018 challenge and the traditional model based on log-Mel feature extraction and multi-instance learning, and also verify that average pooling outperforms max pooling.
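    The filterbank design described in the abstract can be sketched concretely. Below is a minimal Python sketch, assuming the standard Glasberg-Moore ERB-rate scale, of center frequencies uniformly spaced on that scale (the helper names, band edges, and channel count are illustrative; the paper's exact cosine filter shapes are not reproduced here):

    ```python
    import numpy as np

    def erb_rate(f_hz):
        # Glasberg-Moore ERB-rate scale: number of ERBs below f_hz
        return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

    def erb_rate_inverse(erb):
        # Invert the ERB-rate scale back to frequency in Hz
        return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

    def erb_spaced_centers(f_low, f_high, n_filters):
        # Center frequencies uniformly spaced on the ERB-rate scale,
        # giving finer resolution at low frequencies, as on the basilar membrane
        erbs = np.linspace(erb_rate(f_low), erb_rate(f_high), n_filters)
        return erb_rate_inverse(erbs)

    # e.g. 65 channels, matching the 65-row cochleagrams in Table 2
    # (the 50 Hz to 8 kHz band edges are assumptions)
    centers = erb_spaced_centers(50.0, 8000.0, 65)
    ```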


  • Figure 1. Block diagram of ASC-MIL model

    Figure 2. Effect of hyper-parameter K of the multi-instance detector on model performance

    Table 1. Effect of sample factor on the number of filters

    Sample factor | Low-pass filters | Band-pass filters | High-pass filters | Total filters
    1             | 1                | n                 | 1                 | n + 2
    2             | 2                | 2n + 1            | 2                 | 2n + 1 + 4
    4             | 4                | 4n + 3            | 4                 | 4n + 3 + 8
    s             | s                | s(n + 1) - 1      | s                 | s(n + 1) - 1 + 2s

    Table 2. Performance comparison based on different lengths of audio segment

    Length/s | Sample factor | Downsampling rate | Input shape | Accuracy/%
    2        | 2             | 400               | (65, 800)   | 68.3
    4        | 2             | 400               | (65, 1600)  | 67.2
    6        | 2             | 400               | (65, 2400)  | 66.1
    8        | 2             | 400               | (65, 3200)  | 65.8
    10       | 2             | 400               | (65, 4000)  | 64.4
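    Every row of Table 2 shares one shape pattern: 65 filterbank channels by 400 time frames per second of audio, so only the frame axis grows with segment length. A small sketch under that assumption (the helper name is hypothetical):

    ```python
    def cochleagram_shape(length_s, n_channels=65, frames_per_second=400):
        # Input shape implied by Table 2: (channels, time frames)
        return (n_channels, frames_per_second * length_s)

    assert cochleagram_shape(2) == (65, 800)
    assert cochleagram_shape(10) == (65, 4000)
    ```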

    Table 3. ASC performance comparison based on multi-instance analysis

    Model           | Feature     | Network                     | Accuracy/% (2018 Task 1A) | Accuracy/% (2019 Task 1A)
    Literature [5]  | Log Mel     | CNN + fully connected layer | 58.9                      | 52.4
    Literature [12] | Log Mel     | VGGNet + MIL (MaxPool)      | 66.2                      | 66.7
    ASC-MIL + SVM   | Cochleagram | ASC-MIL + SVM               | 58.7                      | 53.6
    ASC-MIL         | Spectrogram | ASC-MIL (AvgPool)           | 60.4                      | 59.5
    ASC-MIL         | MFCC        | ASC-MIL (AvgPool)           | 64.4                      | 64.1
    ASC-MIL         | Cochleagram | ASC-MIL (MaxPool)           | 67.2                      | 67.9
    ASC-MIL         | Cochleagram | ASC-MIL (AvgPool)           | 68.3                      | 68.9
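    The MaxPool and AvgPool rows of Table 3 differ only in how instance-level predictions are aggregated into a bag-level (scene-level) prediction. A minimal NumPy sketch of the two aggregators (the instance classifier itself is omitted, and the shapes are assumptions):

    ```python
    import numpy as np

    def bag_prediction(instance_probs, pooling="avg"):
        # instance_probs: (n_instances, n_classes) class probabilities.
        # Average pooling lets every instance vote, which the abstract
        # reports is more robust to frequency-shifted audio events than
        # keeping only the strongest instance (max pooling).
        if pooling == "avg":
            return instance_probs.mean(axis=0)
        if pooling == "max":
            return instance_probs.max(axis=0)
        raise ValueError(f"unknown pooling: {pooling}")

    # e.g. 12 instances over the 10 scene classes of DCASE Task 1a
    probs = np.random.dirichlet(np.ones(10), size=12)
    scene = bag_prediction(probs, pooling="avg").argmax()
    ```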
  • [1] CHU S, NARAYANAN S, KUO C J, et al. Where am I? Scene recognition for mobile robots using audio features[C]//2006 IEEE International Conference on Multimedia and Expo. Canada: IEEE, 2006: 885-888.
    [2] SCHILIT B, ADAMS N, WANT R. Context-aware computing applications[C]//1994 IEEE Workshop on Mobile Computing Systems and Applications. USA: IEEE, 1994: 85–90.
    [3] ERONEN A J, PELTONEN V T, TUOMI J T, et al. Audio-based context recognition[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(1): 321-329. doi: 10.1109/TSA.2005.854103
    [4] MULIMANI M, KOOLAGUDI S G. Acoustic scene classification using MFCC and MP features[C]//2016 Detection and Classification of Acoustic Scenes and Events. DCASE, 2016: Tech. Rep.
    [5] KONG Q, IQBAL T, XU Y, et al. DCASE 2018 challenge baseline with convolutional neural networks[C]//Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop. UK: Tampere University, 2018: 217-221.
    [6] VALENTI M, SQUARTINI S, DIMENT A, et al. A convolutional neural network approach for acoustic scene classification[C]//2017 International Joint Conference on Neural Networks (IJCNN). USA: IEEE, 2017: 1547-1554.
    [7] XU Y, KONG Q, WANG W, et al. Large-scale weakly supervised audio classification using gated convolutional neural network[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). USA: IEEE, 2018: 121-125.
    [8] ABEßER J. A review of deep learning based methods for acoustic scene classification[J]. Applied Sciences, 2020, 10(6): 1-16.
    [9] SHARAN R V, MOIR T J. Pseudo-color cochleagram image feature and sequential feature selection for robust acoustic event recognition[J]. Applied Acoustics, 2018, 140: 198-204. doi: 10.1016/j.apacoust.2018.05.030
    [10] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553): 436-444. doi: 10.1038/nature14539
    [11] PELTONEN V T, ERONEN A J, PARVIAINEN M P, et al. Recognition of everyday auditory scenes: Potentials, latencies and cues[C]//Proceedings of the 110th Convention of the Audio Engineering Society. Amsterdam: AES, 2001: 1-5.
    [12] SONG H W, HAN J Q, DENG S W, et al. Acoustic scene classification by implicitly identifying distinct sound events[C]// Proceedings of the Annual Conference of the International Speech Communication Association. Austria: INTERSPEECH, 2019: 3860-3864.
    [13] AMORES J. Multiple instance classification: Review, taxonomy and comparative study[J]. Artificial Intelligence, 2013, 201: 81-105. doi: 10.1016/j.artint.2013.06.003
    [14] KUMAR A, RAJ B. Audio event detection using weakly labeled data[C]//Proceedings of the 24th ACM International Conference on Multimedia. USA: ACM, 2016: 1038-1047.
    [15] WANG Y. Polyphonic sound event detection with weak labeling[D]. Pittsburgh: Carnegie Mellon University, 2018.
    [16] KUMAR A, RAJ B. Audio event and scene recognition: A unified approach using strongly and weakly labeled data[C]//2017 International Joint Conference on Neural Networks (IJCNN). USA: IEEE, 2017: 3475-3482.
    [17] BRIGGS F, LAKSHMINARAYANAN B, NEAL L, et al. Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach[J]. The Journal of the Acoustical Society of America, 2012, 131(6): 4640-4650. doi: 10.1121/1.4707424
    [18] MCDERMOTT J H, SIMONCELLI E P. Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis[J]. Neuron, 2011, 71: 926-940. doi: 10.1016/j.neuron.2011.06.032
    [19] FENG J, ZHOU Z. Deep MIML network[C]// Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. San Francisco: AAAI, 2017: 1884-1890.
    [20] KONG Q, CAO Y, IQBAL T, et al. Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems[EB/OL]. arXiv.org, (2019-04-06)[2020-11-01]. https://arxiv.org/abs/1904.03476v3.
    [21] SONG H W, HAN J Q, DENG S W. A compact and discriminative feature based on auditory summary statistics for acoustic scene classification[C]//Proceedings of the Annual Conference of the International Speech Communication Association. Hyderabad: INTERSPEECH, 2018: 3294-3298.
Publication history
  • Received: 2020-11-24
  • Published online: 2021-03-24
