Abstract:
Acoustic scene classification (ASC) is one of the most challenging tasks in the field of computational auditory scene analysis (CASA). Most traditional ASC models combine handcrafted features based on linear frequency analysis with a deep learning-based classification model. However, feature extraction based on linear frequency analysis cannot mimic the nonlinear frequency selectivity of the human basilar membrane, which results in lower feature resolution. Moreover, existing classification models cannot overcome the low classification accuracy caused by complex sound sources and highly overlapping sound events. To address these problems, this paper proposes an ASC model based on cochleagram multi-instance analysis. An equivalent rectangular bandwidth cosine filter bank is adopted to analyze the signal spectrum and simulate the auditory perception properties of humans. Meanwhile, a multi-instance learning strategy is introduced to characterize the entire data structure of acoustic scenes and thereby improve classification accuracy. In addition, to enhance robustness to frequency shifts of sound events, average pooling is adopted in the classification prediction integrator of the multi-instance learning classification model. Finally, experimental results on the DCASE 2018 and DCASE 2019 Challenge Task 1A datasets show that the proposed model achieves higher classification accuracy than both the baseline model provided by the DCASE 2018 Challenge and the traditional model based on the Log Mel spectrogram and multi-instance learning. Moreover, average pooling is verified to outperform maximum pooling.
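The multi-instance prediction-integration step described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the time-patch slicing, the linear per-instance scorer (a stand-in for a learned network), and all variable names are assumptions introduced for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10  # e.g., the 10 scene classes of DCASE Task 1A

def instance_scores(patches, weights, bias):
    # Linear classifier per patch followed by softmax
    # (a stand-in for the paper's learned instance classifier).
    logits = patches @ weights + bias
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def predict_scene(cochleagram, weights, bias, patch_width=8):
    # Slice the cochleagram along time into fixed-width patches:
    # each patch is one "instance" in the multi-instance bag.
    n_bands, n_frames = cochleagram.shape
    patches = [cochleagram[:, t:t + patch_width].ravel()
               for t in range(0, n_frames - patch_width + 1, patch_width)]
    scores = instance_scores(np.stack(patches), weights, bias)
    # Average pooling over instances: every patch contributes, so the
    # scene-level prediction is less sensitive to where a sound event
    # falls than the max-pooling alternative.
    return scores.mean(axis=0)

# Toy example with random data and random weights.
cg = rng.standard_normal((32, 64))              # 32 ERB bands x 64 frames
W = rng.standard_normal((32 * 8, n_classes)) * 0.01
b = np.zeros(n_classes)
probs = predict_scene(cg, W, b)
print(probs.shape, probs.sum())
```

Replacing `scores.mean(axis=0)` with `scores.max(axis=0)` gives the max-pooling integrator that the abstract reports as the weaker alternative.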