A Chinese Dialect Identification Model Based on Key Feature Extraction from Phonetic Posteriorgram

Feng Gang (冯罡); Chen Ning (陈宁)

doi:10.14135/j.cnki.1006-3080.20221011001

Abstract

Different dialects often pronounce the same character differently, so the probability distribution of phonemes differs considerably from one dialect to another, and this difference is an important expression of what distinguishes dialects. Few existing dialect identification models exploit such phoneme-level features. To make full use of this difference, this paper proposes a dialect identification model based on analysis of the phonetic posteriorgram. Because analyzing attention along a single dimension is limiting, the model uses the attention mechanism of the Convolutional Block Attention Module (CBAM) to extract key features from the frame-level phonetic posteriorgram. To make full use of the information in the intermediate layers of the model and avoid losing dialect information, the Emphasized Channel Attention, Propagation and Aggregation in TDNN (ECAPA-TDNN) model is then used to capture long-range information in the frame-level features and to obtain sentence-level features through feature aggregation and attentive statistics pooling. Finally, to avoid the limitations of a single loss function and to further increase the inter-class distance, an Additive Angular Margin (AAM) loss is introduced on top of the cross-entropy loss, replacing the decision boundary with a decision region so that the distance between dialect classes is maximized. Experimental results on the Aishell2 and Datatang-Dialect datasets show that the proposed model achieves higher classification accuracy than the traditional model. Ablation experiments further show that the phonetic posteriorgram features, CBAM, ECAPA-TDNN, and the AAM loss each contribute to the improvement in identification accuracy.
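The abstract names the Additive Angular Margin loss without stating its form. The following is a minimal sketch of the commonly used formulation, in which the margin is added to the angle between an utterance embedding and the weight vector of its true class before a scaled softmax cross-entropy is applied. The function names `cosine` and `aam_softmax_loss`, and the `scale` and `margin` defaults, are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def aam_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.2):
    """Additive angular margin loss for a single utterance embedding.

    scale and margin are illustrative values, not the paper's settings.
    This sketch does not handle the case theta + margin > pi.
    """
    logits = []
    for j, w in enumerate(class_weights):
        # Clamp to guard acos against rounding slightly outside [-1, 1].
        cos_theta = max(-1.0, min(1.0, cosine(embedding, w)))
        if j == label:
            # Penalize the target class by adding the margin to its angle.
            theta = math.acos(cos_theta)
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * cos_theta)
    # Numerically stable negative log-softmax at the target class.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]
```

With `margin=0.0` the function reduces to ordinary cosine softmax cross-entropy. A positive margin raises the loss for the same embedding, which pushes training toward a larger angular gap between classes.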
