Music Emotion Recognition Based on the Broad and Deep Learning Network
Abstract: With the development of artificial intelligence and digital audio technology, music information retrieval (MIR) has gradually become a research hotspot, and music emotion recognition (MER) is becoming an important research direction owing to its value for applications such as video soundtracks. Although some researchers have combined the Mel-frequency cepstral coefficient (MFCC) and residual phase (RP) to extract musical emotion features and improve classification accuracy, training the models used in traditional deep learning takes a long time. To improve the efficiency of mining musical emotion features, MFCC and RP are weighted and combined in this work. At the same time, to improve the classification accuracy of music emotion and shorten model training time, the long short-term memory network (LSTM) and the broad learning system (BLS) are integrated: using LSTM as the feature-mapping node of BLS, a new broad and deep learning network (LSTM-BLS) is built to train music emotion recognition and classification. This structure makes full use of the ability of BLS to process complex data quickly, since its simple structure and short training time improve recognition efficiency, while LSTM excels at extracting temporal features from sequential data, so the temporal structure of the music can be extracted and its emotional characteristics preserved to the greatest extent.
Finally, experimental results on the Emotion dataset show that the proposed algorithm achieves higher recognition accuracy than other, more complex networks, providing a new feasible approach for music emotion recognition.
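The weighted combination of MFCC and RP features described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it assumes the per-frame MFCC and RP feature matrices have already been extracted, and the weight `alpha` and the z-score normalization are assumptions for the sketch rather than values from the paper.

```python
import numpy as np

def fuse_features(mfcc, rp, alpha=0.6):
    """Weighted fusion of MFCC and residual-phase (RP) features.

    mfcc, rp: (n_frames, n_coeffs) arrays of per-frame features.
    alpha:    weight given to the MFCC stream (illustrative value).
    """
    # Normalize each stream so the weighting is not dominated
    # by their differing dynamic ranges.
    def zscore(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

    return alpha * zscore(mfcc) + (1.0 - alpha) * zscore(rp)

# Toy example: 100 frames, 13 coefficients per stream.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))
rp = rng.normal(size=(100, 13))
fused = fuse_features(mfcc, rp)
print(fused.shape)  # (100, 13)
```

The fused matrix keeps the per-frame layout, so it can feed a sequence model such as the LSTM mapping nodes directly.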
Key words:
- music emotion recognition /
- residual phase /
- broad learning /
- deep learning /
- long short-term memory
Table 1. Parameters for models

Model      Parameters
LSTM-BLS   k1 = 40, k2 = 80; OutputDim_L1 = 400, OutputDim_L2 = 200; N1 = 10, N2 = 10, N3 = 100
LSTM       OutputDim_L1 = 400, OutputDim_L2 = 200, OutputDim_L3 = 100
BLS        N1 = 10, N2 = 10, N3 = 500
CCFBLS     F = 3 × 3; N1 = 10, N2 = 10, N3 = 500

Table 2. Classification accuracy comparison of different models

Model         Classification accuracy/%
CNN           52.36 ± 2.31
LSTM          56.17 ± 3.53
MCCLSTM       56.33 ± 2.15
MCCBL         55.81 ± 2.64
RCNNLSTM      57.33 ± 3.03
RCNNBL        59.56 ± 2.16
MCCLSTM+BLS   60.71 ± 1.39
CCFBLS        62.33 ± 2.03
LSTM-BLS      66.78 ± 2.12

Table 3. Training efficiency comparison of different models

Model         Training time/s
MCCLSTM       247.96
RCNNLSTM      615.57
MCCLSTM+BLS   285.65
CCFBLS        123.39
LSTM-BLS      169.32