Abstract:
Few existing dialect recognition models exploit phonemic features, even though dialects differ in pronunciation and therefore in the probability distribution of the phonemes they contain. To address this, this paper proposes a dialect identification model based on the phonetic posteriorgram feature. To overcome the limitation of attending along a single dimension, the model uses the attention mechanism of the Convolutional Block Attention Module (CBAM) to extract key features from the frame-level phonetic posteriorgram. To make full use of the information in the intermediate layers of the model and avoid losing dialect information, the Emphasized Channel Attention, Propagation and Aggregation in TDNN (ECAPA-TDNN) model is used to extract long-range information from the frame-level features and to obtain effective sentence-level features through feature aggregation and attentive statistical pooling. Finally, rather than relying on a single loss function, we introduce the Additive Angular Margin (AAM) loss on top of the cross-entropy loss; it replaces the decision boundary with a decision region, which maximizes the inter-class distance between dialects and improves the classification decision. Experimental results on the Aishell2 and Datatang-Dialect datasets show that the proposed model outperforms the traditional model. Ablation experiments further show that each component (the phonetic posteriorgram features, CBAM, ECAPA-TDNN, and the AAM loss) contributes to the improvement in identification accuracy.
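The abstract does not state the loss formula, so the following is only a minimal sketch of the commonly used Additive Angular Margin formulation for a single utterance, not the authors' implementation. The function name `aam_softmax_loss`, the `margin` and `scale` defaults, and the clamping of the shifted angle at pi are assumptions made for illustration.

```python
import math


def aam_softmax_loss(embedding, class_weights, label, margin=0.2, scale=30.0):
    """Cross-entropy over cosine logits, with an angular margin on the true class.

    Hypothetical helper: margin and scale values are illustrative assumptions.
    """
    def unit(v):
        # L2-normalise so that dot products become cosines of angles.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, unit(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if k == label:
            # Add the margin to the true-class angle only; clamp at pi so the
            # cosine stays monotonic (a simplification of this sketch).
            cos = math.cos(min(math.acos(cos) + margin, math.pi))
        logits.append(scale * cos)

    # Numerically stable log-sum-exp, then negative log-likelihood.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# One utterance embedding and three dialect class vectors (toy values).
emb = [1.0, 0.0]
weights = [[1.0, 1.0], [1.0, -1.0], [-1.0, 0.0]]
plain = aam_softmax_loss(emb, weights, label=0, margin=0.0)
with_margin = aam_softmax_loss(emb, weights, label=0, margin=0.2)
```

With `margin=0.0` the function reduces to ordinary cross-entropy over scaled cosine logits. A positive margin lowers the true-class logit, so the embedding must sit well inside its class region, not merely on the correct side of a boundary, to reach the same loss.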