Abstract:
In traditional knowledge distillation, a large gap in parameter scale between the teacher and the student has a negative effect: the student model cannot learn effectively from a much larger teacher. To improve the performance of student models on different tasks in knowledge distillation for BERT, a new multi-teacher distillation scheme for the BERT model is proposed, based on a study of existing model distillation methods and an analysis of the strengths and weaknesses of different teacher models: BERT, RoBERTa, and XLNet are used as teachers to distill a student model with the BERT structure. In addition, the distillation scheme for the knowledge representation of the teacher model's intermediate layers is modified, and distillation on the Transformer layers is added. Finally, experiments on several GLUE datasets show that the distilled student retains 95.1% of the teacher model's accuracy.
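As an illustration of the general idea described above, the following is a minimal sketch of a multi-teacher distillation objective, assuming three teachers (e.g. BERT, RoBERTa, XLNet) whose logits and hidden states have already been computed and aligned to the student's dimensions. The function name, weighting scheme, temperature, and layer averaging are illustrative assumptions, not the exact formulation used in the paper.

```python
# Hypothetical sketch of a multi-teacher distillation loss (not the paper's
# exact method): soft-label KL from several teachers plus an MSE term on
# intermediate Transformer hidden states.
import torch
import torch.nn.functional as F


def multi_teacher_distill_loss(student_logits, teacher_logits_list,
                               student_hidden, teacher_hidden_list,
                               temperature=2.0, alpha=0.5, beta=0.5):
    # Soft-label term: average KL divergence to each teacher's softened output.
    kd = 0.0
    for t_logits in teacher_logits_list:
        kd = kd + F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
    kd = kd / len(teacher_logits_list)

    # Intermediate-layer term: MSE between the student's Transformer hidden
    # states and the average of the teachers' corresponding (already aligned
    # and projected) hidden states.
    avg_teacher_hidden = torch.stack(teacher_hidden_list).mean(dim=0)
    hidden_loss = F.mse_loss(student_hidden, avg_teacher_hidden)

    return alpha * kd + beta * hidden_loss


# Toy usage with random tensors standing in for model outputs.
batch, seq, dim, num_labels = 4, 16, 768, 2
student_logits = torch.randn(batch, num_labels)
teacher_logits = [torch.randn(batch, num_labels) for _ in range(3)]
student_hidden = torch.randn(batch, seq, dim)
teacher_hidden = [torch.randn(batch, seq, dim) for _ in range(3)]
loss = multi_teacher_distill_loss(student_logits, teacher_logits,
                                  student_hidden, teacher_hidden)
```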