Multi-Task Learning 3D CNN-BLSTM with Attention Mechanism for Speech Emotion Recognition
Abstract: Speech emotion recognition is widely used in fields such as in-vehicle driving systems, the service industry, education, and medical care. To enable computers to recognize a speaker's emotion more accurately, this paper proposes a speech emotion recognition method, 3D CNN-BLSTM, that combines a multi-task three-dimensional convolutional neural network (Convolutional Neural Network, CNN) and a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BLSTM) with an attention mechanism. Deep speech emotion features are extracted by the 3D CNN from a fused group map of multiple spectral features, and a multi-task learning mechanism with gender classification as an auxiliary task is combined to improve recognition accuracy. Experimental results on the CASIA Chinese emotional corpus show that the proposed method achieves higher accuracy.
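To make the architecture concrete, the sketch below outlines the overall pipeline in PyTorch. It is a minimal illustration under assumptions: the channel counts, kernel sizes, hidden width, six-class emotion output, and the additive-attention pooling are placeholders rather than the paper's exact configuration; only the overall flow (3D CNN over the fused spectral maps, then BLSTM, then attention pooling, then an emotion head plus an auxiliary gender head) follows the method described above.

```python
import torch
import torch.nn as nn

class MultiTask3DCNNBLSTM(nn.Module):
    """Minimal sketch of the 3D CNN-BLSTM with attention and two task
    heads. Layer sizes are illustrative assumptions."""

    def __init__(self, n_bins=64, n_emotions=6, n_genders=2, hidden=128):
        super().__init__()
        # 3D convolutions over (spectral map, time, frequency); the input
        # is the fused group map of 3 spectral features: (B, 1, 3, T, F)
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(3, 2, 2)),  # merge the 3 maps
        )
        feat = 32 * (n_bins // 4)  # channels x pooled frequency bins
        self.blstm = nn.LSTM(feat, hidden, batch_first=True,
                             bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)                    # scores
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)   # main task
        self.gender_head = nn.Linear(2 * hidden, n_genders)     # auxiliary

    def forward(self, x):                 # x: (B, 1, 3, T, F)
        h = self.cnn(x)                   # (B, 32, 1, T/4, F/4)
        b, c, d, t, f = h.shape
        h = h.permute(0, 3, 1, 2, 4).reshape(b, t, c * d * f)  # time-major
        h, _ = self.blstm(h)              # (B, T/4, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # weights over time steps
        ctx = (w * h).sum(dim=1)          # attention-pooled utterance vector
        return self.emotion_head(ctx), self.gender_head(ctx)
```

A batch shaped (B, 1, 3, T, F), i.e. the three stacked spectral maps, yields a pair of logits: `e, g = MultiTask3DCNNBLSTM()(torch.randn(2, 1, 3, 64, 64))`.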
Table 1. Data augmentation method parameter settings
Method            Min            Max           Probability/%
AddGaussianNoise  0.0005 times   0.001 times   30
TimeStretch       0.9 times      1.1 times     30
PitchShift        −2 semitones   2 semitones   30
Shift             −0.3 s         0.3 s         30
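The method names in Table 1 match the transforms of the audiomentations library; to keep the parameter semantics explicit, the sketch below reimplements them with numpy and librosa. Treating the noise amplitude as relative to the waveform's peak, and applying each transform independently with probability 0.3, are assumptions.

```python
import numpy as np
import librosa

rng = np.random.default_rng()

def augment(y, sr):
    """Apply each Table 1 augmentation independently with probability 0.3.
    Ranges follow Table 1; uniform sampling within each range and the
    librosa-based transforms are assumptions."""
    if rng.random() < 0.3:  # AddGaussianNoise: 0.0005x to 0.001x amplitude
        amp = rng.uniform(0.0005, 0.001) * np.abs(y).max()
        y = y + rng.normal(0.0, amp, size=y.shape)
    if rng.random() < 0.3:  # TimeStretch: rate 0.9x to 1.1x
        y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    if rng.random() < 0.3:  # PitchShift: -2 to +2 semitones
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    if rng.random() < 0.3:  # Shift: circular shift by -0.3 s to +0.3 s
        y = np.roll(y, int(rng.uniform(-0.3, 0.3) * sr))
    return y.astype(np.float32)
```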
Table 2. Influence of data augmentation on speech emotion recognition accuracy
Experiment  Data augmentation  Accuracy/%
1           No-ops             84.10
2           AddGaussianNoise   84.08
3           TimeStretch        84.75
4           PitchShift         87.83
5           Shift              84.33
Table 3. Comparison results of different voiceprint inputs
Input        Accuracy/%  Recall/%  Precision/%  F1/%
LPC          84.08       84.08     84.15        84.02
SPC          80.75       80.75     80.71        80.63
Mel          87.67       87.67     87.75        87.63
LPC+SPC      88.25       84.50     88.37        88.16
LPC+Mel      90.25       90.25     90.56        90.27
SPC+Mel      90.08       90.08     90.10        90.08
Mel+LPC+SPC  91.08       91.08     91.15        91.10
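The three voiceprint maps of Table 3 can be stacked into the fusion group map that forms the 3D CNN input. The sketch below assumes SPC denotes the linear-frequency log-magnitude spectrogram and that the LPC map is the per-frame LPC spectral envelope; the bin count, window sizes, and LPC order are illustrative choices, not the paper's.

```python
import numpy as np
import librosa
from scipy.signal import freqz

def fusion_group_map(y, sr, n_bins=64, frame=1024, hop=256, lpc_order=12):
    """Stack Mel, spectrogram (SPC) and LPC-envelope maps into the
    3 x T x n_bins fusion group map fed to the 3D CNN."""
    # 1) log-Mel spectrogram with n_bins mel bands
    mel = librosa.power_to_db(librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=frame, hop_length=hop, n_mels=n_bins))
    # 2) log-magnitude STFT spectrogram with n_bins frequency rows
    spc = librosa.amplitude_to_db(np.abs(
        librosa.stft(y, n_fft=2 * (n_bins - 1), hop_length=hop)))
    # 3) LPC spectral envelope per frame, sampled at n_bins frequencies
    frames = librosa.util.frame(y, frame_length=frame, hop_length=hop)
    lpc = []
    for fr in frames.T:
        # tiny dither guards librosa.lpc against all-zero (silent) frames
        a = librosa.lpc(fr + 1e-6 * np.random.randn(fr.size), order=lpc_order)
        _, h = freqz([1.0], a, worN=n_bins)
        lpc.append(20 * np.log10(np.abs(h) + 1e-9))
    lpc = np.stack(lpc, axis=1)                             # (n_bins, T_lpc)
    # align frame counts and stack the maps as channels
    T = min(mel.shape[1], spc.shape[1], lpc.shape[1])
    maps = np.stack([mel[:, :T], spc[:, :T], lpc[:, :T]])   # (3, n_bins, T)
    return maps.transpose(0, 2, 1).astype(np.float32)       # (3, T, n_bins)
```

Adding batch and channel axes, e.g. with np.expand_dims(maps, (0, 1)), gives the (B, 1, 3, T, F) input assumed by the model sketch above.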
Table 4. Speech emotion recognition accuracy of different α values
α    Accuracy/%
0.9  87.67
0.8  86.92
0.7  88.00
0.6  86.83
0.5  91.08
0.4  87.25
0.3  88.02
0.2  87.33
0.1  86.00
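Table 4 sweeps the coefficient α that weights the main emotion task against the auxiliary gender task, with α = 0.5 giving the best accuracy. A minimal sketch of such a weighted joint objective, assuming cross-entropy losses for both tasks (the paper's exact loss form is not reproduced here):

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def multitask_loss(emotion_logits, gender_logits,
                   emotion_labels, gender_labels, alpha=0.5):
    # alpha weights the main emotion task; (1 - alpha) the gender task
    return alpha * ce(emotion_logits, emotion_labels) \
        + (1.0 - alpha) * ce(gender_logits, gender_labels)
```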
Table 5. Accuracy comparison of different models on the CASIA Chinese emotional corpus
Table 6. Comparison of the five models
Model                                    Accuracy/%  Recall/%  Precision/%  F1/%
Modified CNN-BLSTM                       82.50       82.50     82.63        82.49
3D CNN-BLSTM                             83.50       83.50     83.71        83.51
CNN-BLSTM+multi-tasking                  85.17       82.33     85.37        85.18
CNN-BLSTM+augmentation                   87.92       87.92     88.12        87.91
3D CNN-BLSTM+multi-tasking+augmentation  91.08       91.08     91.15        91.10
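For reference, the four metrics reported in Tables 3 and 6 can be computed with scikit-learn; macro averaging over the emotion classes is an assumption, suggested by the accuracy and recall columns usually coinciding.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def report(y_true, y_pred):
    """Accuracy plus macro-averaged recall/precision/F1, in percent."""
    return {
        "Accuracy/%":  100 * accuracy_score(y_true, y_pred),
        "Recall/%":    100 * recall_score(y_true, y_pred, average="macro"),
        "Precision/%": 100 * precision_score(y_true, y_pred, average="macro"),
        "F1/%":        100 * f1_score(y_true, y_pred, average="macro"),
    }
```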