
    LIN Qiaoying, CHEN Ning. Acoustic Scene Classification Model Based on Multi-Instance Analysis of Cochleagram[J]. Journal of East China University of Science and Technology, 2022, 48(1): 99-104. DOI: 10.14135/j.cnki.1006-3080.20201124001

    Acoustic Scene Classification Model Based on Multi-Instance Analysis of Cochleagram


      Abstract: Acoustic scene classification (ASC) is one of the most challenging tasks in the field of computational auditory scene analysis (CASA). Most traditional ASC models combine handcrafted features based on linear frequency analysis with deep learning-based classification models. However, linear frequency analysis cannot mimic the nonlinear frequency selectivity of the human basilar membrane, which results in low feature resolution; moreover, existing classification models cannot overcome the low classification accuracy caused by complex sound sources and highly overlapping sound events. To deal with these problems, this paper proposes an ASC model based on multi-instance analysis of the cochleagram. On the one hand, a cosine filter bank whose center frequencies are uniformly distributed on the equivalent rectangular bandwidth (ERB) scale is adopted to filter the signal spectrum, thereby simulating the auditory perception of the human ear. On the other hand, a multi-instance learning (MIL) strategy is introduced to characterize the overall data structure of acoustic scenes and thus improve classification accuracy. In addition, to enhance robustness to frequency shifts of sound events, average pooling is adopted in the prediction aggregator of the MIL classification model. Experimental results on the Task 1A datasets of the DCASE 2018 and DCASE 2019 Challenges show that the proposed model achieves higher classification accuracy than both the baseline system provided by the DCASE 2018 Challenge and the traditional model based on Log Mel feature extraction and multi-instance learning, and also verify that average pooling outperforms max pooling.
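    The front end described in the abstract can be made concrete with a short sketch. The Python code below builds a bank of raised-cosine filters whose center frequencies are uniformly spaced on the Glasberg-Moore ERB-number scale and applies it to an STFT power spectrogram to obtain a cochleagram-like feature. The Hann-shaped filter response, the 50 Hz lower edge, and all sizes are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def hz_to_erb_number(f):
    """Glasberg & Moore ERB-number scale (Cams)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erb_number_to_hz(e):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def erb_cosine_filterbank(n_filters, n_fft, sr, f_min=50.0, f_max=None):
    """Raised-cosine filters with centers uniformly spaced on the ERB scale.

    Returns an (n_filters, n_fft // 2 + 1) matrix mapping an STFT power
    spectrum to an ERB-spaced (cochleagram-like) representation.
    """
    if f_max is None:
        f_max = sr / 2.0
    # Uniform grid on the ERB-number scale, converted back to Hz.
    erb_pts = np.linspace(hz_to_erb_number(f_min), hz_to_erb_number(f_max),
                          n_filters + 2)
    hz_pts = erb_number_to_hz(erb_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)

    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, cf, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rise = (fft_freqs >= lo) & (fft_freqs <= cf)
        fall = (fft_freqs > cf) & (fft_freqs <= hi)
        # Raised-cosine response: 1 at the center, 0 at the band edges.
        fb[i, rise] = 0.5 * (1 + np.cos(np.pi * (fft_freqs[rise] - cf) / (cf - lo)))
        fb[i, fall] = 0.5 * (1 + np.cos(np.pi * (fft_freqs[fall] - cf) / (hi - cf)))
    return fb

# Example: 64-band cochleagram from a (toy) power spectrogram.
fb = erb_cosine_filterbank(n_filters=64, n_fft=1024, sr=44100)
power_spec = np.random.rand(1024 // 2 + 1, 100)   # stand-in STFT power
cochleagram = np.log(fb @ power_spec + 1e-10)     # shape (64, 100)
```

    Because the filter centers are packed more densely at low frequencies on the ERB scale, this front end trades frequency resolution across the spectrum the way the basilar membrane does, which is the property linear frequency analysis lacks.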

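    The multi-instance part can likewise be sketched. In the PyTorch sketch below, each recording is a bag whose instances are fixed-length time chunks of the cochleagram; a shared encoder scores every instance, and the bag-level scene prediction is obtained by average pooling the per-instance probabilities (max pooling is the alternative the paper compares against). The class `MILSceneClassifier`, the layer sizes, and the chunking scheme are hypothetical, not the model from the paper.

```python
import torch
import torch.nn as nn

class MILSceneClassifier(nn.Module):
    """Multi-instance classifier: shared encoder per instance, pooled bag prediction."""

    def __init__(self, n_classes, pooling="avg"):
        super().__init__()
        self.pooling = pooling
        # Small shared encoder applied to every instance (illustrative sizes).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, bags):
        # bags: (batch, n_instances, freq, time) -- time chunks of a cochleagram.
        b, n, f, t = bags.shape
        instances = bags.reshape(b * n, 1, f, t)
        probs = torch.softmax(self.encoder(instances), dim=-1)
        probs = probs.reshape(b, n, -1)      # per-instance class probabilities
        if self.pooling == "avg":
            # Average pooling: every chunk contributes equally to the bag.
            return probs.mean(dim=1)
        return probs.max(dim=1).values       # max pooling baseline

# Example: 8 recordings, each split into 10 instances of a 64-band cochleagram.
model = MILSceneClassifier(n_classes=10, pooling="avg")
bag_probs = model(torch.randn(8, 10, 64, 50))  # -> (8, 10) scene probabilities
```

    Because the mean aggregator weights every instance equally, an event that drifts in time or frequency still contributes the same evidence to the bag prediction, which matches the abstract's rationale for preferring average over max pooling.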