Abstract:
Compared with dialects of other languages, Chinese dialects are numerous and exhibit small inter-class differences but large intra-class differences, which makes Chinese dialect identification particularly challenging. Because the differences between Chinese dialects may appear in both local (short-term) and global (long-term) speech characteristics, as well as at different hierarchical levels of speech, this paper proposes a Chinese dialect identification model that combines local and global speech feature extraction with multi-level feature aggregation. Specifically, the model first extracts local speech features with Res2Block, then extracts global speech features with Conformer, and finally aggregates multi-level features by cascading the outputs of multiple Conformers. Experimental results in both seen-domain and unseen-domain settings show that the proposed model achieves higher identification accuracy than the baseline model.
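The data flow described in the abstract (local extraction, then stacked global extraction, then multi-level aggregation) can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: `local_block` and `global_block` are hypothetical toy stand-ins for Res2Block and Conformer, the names `aggregate_levels` and `embed` are invented for illustration, and reading "cascading the outputs" as frame-wise concatenation along the feature dimension followed by mean pooling over time is an assumption.

```python
from typing import Callable, List

Frame = List[float]
Utterance = List[Frame]  # shape: (num_frames, feature_dim)
Block = Callable[[Utterance], Utterance]


def local_block(x: Utterance) -> Utterance:
    # Toy stand-in (assumption) for a local, short-term extractor such as
    # Res2Block: each frame is averaged with its immediate neighbours.
    out = []
    for t in range(len(x)):
        window = x[max(0, t - 1): t + 2]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def global_block(x: Utterance) -> Utterance:
    # Toy stand-in (assumption) for a global, long-term extractor such as
    # Conformer: every frame is mixed with the mean over the whole utterance.
    mean = [sum(col) / len(x) for col in zip(*x)]
    return [[0.5 * v + 0.5 * m for v, m in zip(frame, mean)] for frame in x]


def aggregate_levels(x: Utterance, blocks: List[Block]) -> Utterance:
    # Apply the blocks in order, keep every intermediate output, and
    # concatenate those outputs frame by frame along the feature dimension.
    levels = []
    h = x
    for block in blocks:
        h = block(h)
        levels.append(h)
    return [[v for level in levels for v in level[t]] for t in range(len(x))]


def embed(x: Utterance, num_global_blocks: int = 3) -> Frame:
    # Local features first, then stacked global blocks with multi-level
    # aggregation, then mean pooling over time (the pooling is an assumption).
    h = local_block(x)
    frames = aggregate_levels(h, [global_block] * num_global_blocks)
    return [sum(col) / len(frames) for col in zip(*frames)]
```

With three global blocks and 2-dimensional input frames, each aggregated frame has 3 x 2 = 6 features, so `embed` returns a 6-dimensional utterance-level vector that a classifier over dialect labels could consume.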