Abstract:
In traditional knowledge distillation, a large gap in parameter scale between the teacher and the student has a negative effect: the student model cannot learn effectively from a much larger teacher. To improve the performance of student models on different tasks in knowledge distillation for BERT, a new multi-teacher distillation scheme for the BERT model is proposed, based on a study of existing model distillation methods and an analysis of the strengths and weaknesses of different teacher models: BERT, RoBERTa, and XLNet are used as teachers to distill a student model with the BERT structure. In addition, the distillation scheme for the knowledge representation of the teacher model's intermediate layers is modified, and distillation on the Transformer layers is added. Finally, experiments on several GLUE datasets show that the distilled student retains 95.1% of the teacher model's accuracy.
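As an illustration of the general idea described above, the following is a minimal sketch of a multi-teacher distillation objective, assuming three teachers (e.g. BERT, RoBERTa, XLNet) whose logits and hidden states have already been computed and aligned to the student's dimensions. The function name, weighting scheme, temperature, and layer averaging are illustrative assumptions, not the exact formulation used in the paper.

```python
# Hypothetical sketch of a multi-teacher distillation loss (not the paper's
# exact method): soft-label KL from several teachers plus an MSE term on
# intermediate Transformer hidden states.
import torch
import torch.nn.functional as F


def multi_teacher_distill_loss(student_logits, teacher_logits_list,
                               student_hidden, teacher_hidden_list,
                               temperature=2.0, alpha=0.5, beta=0.5):
    # Soft-label term: average KL divergence to each teacher's softened output.
    kd = 0.0
    for t_logits in teacher_logits_list:
        kd = kd + F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
    kd = kd / len(teacher_logits_list)

    # Intermediate-layer term: MSE between the student's Transformer hidden
    # states and the average of the teachers' corresponding (already aligned
    # and projected) hidden states.
    avg_teacher_hidden = torch.stack(teacher_hidden_list).mean(dim=0)
    hidden_loss = F.mse_loss(student_hidden, avg_teacher_hidden)

    return alpha * kd + beta * hidden_loss


# Toy usage with random tensors standing in for model outputs.
batch, seq, dim, num_labels = 4, 16, 768, 2
student_logits = torch.randn(batch, num_labels)
teacher_logits = [torch.randn(batch, num_labels) for _ in range(3)]
student_hidden = torch.randn(batch, seq, dim)
teacher_hidden = [torch.randn(batch, seq, dim) for _ in range(3)]
loss = multi_teacher_distill_loss(student_logits, teacher_logits,
                                  student_hidden, teacher_hidden)
```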