Chinese Medical Named Entity Recognition Based on RoBERTa and Adversarial Training

GUO Rui (郭瑞); ZHANG Huanhuan (张欢欢)

doi:10.14135/j.cnki.1006-3080.20210909003

Abstract

Methods that combine BERT (Bidirectional Encoder Representations from Transformers) with neural network models are now widely used for Chinese medical named entity recognition. However, BERT tokenizes Chinese text at the character level and does not take Chinese word segmentation into account. In addition, neural network models are often locally unstable: even small perturbations can mislead them, which results in poor robustness. To address these two problems, this paper proposes AT-RBC (Adversarial Training with RoBERTa-wwm-ext-large+BiLSTM+CRF), a Chinese medical named entity recognition model based on RoBERTa (A Robustly Optimized BERT Pre-training Approach) and adversarial training. First, the RoBERTa-wwm-ext-large (A Robustly Optimized BERT Pre-training Approach-whole word masking-extended data-large) pre-trained model produces the initial vector representation of the input text. Second, perturbations are added to this initial representation to generate adversarial samples. Finally, the initial vector representation and the adversarial samples are fed together through a bidirectional long short-term memory network and then a conditional random field to obtain the final prediction. Experiments show that AT-RBC reaches an F1 score of 88.96% on the CCKS 2019 dataset and 97.14% on the Resume dataset, which demonstrates the effectiveness of the model.
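The perturbation step can be illustrated with a small sketch. The abstract does not say how the perturbations are computed, so the code below assumes a gradient-based, L2-normalised perturbation in the style of the fast gradient method; that choice, the `epsilon` parameter, and all function names are assumptions made for illustration. A toy linear scorer with a logistic loss stands in for the BiLSTM and CRF layers, and a plain list of floats stands in for the RoBERTa embedding.

```python
import math


def logistic_loss(w, x, y):
    """Binary cross-entropy of a linear scorer on one embedding x, label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def grad_wrt_input(w, x, y):
    """Gradient of the loss with respect to the embedding x (not the weights)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def fgm_perturbation(grad, epsilon=1.0):
    """Assumed perturbation rule: r = epsilon * g / ||g||_2 (zero if g is zero)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def adversarial_training_loss(w, x, y, epsilon=1.0):
    """Return (clean loss + adversarial loss, perturbed embedding)."""
    r = fgm_perturbation(grad_wrt_input(w, x, y), epsilon)
    x_adv = [xi + ri for xi, ri in zip(x, r)]
    total = logistic_loss(w, x, y) + logistic_loss(w, x_adv, y)
    return total, x_adv


if __name__ == "__main__":
    weights = [0.5, -1.0, 2.0]      # hypothetical scorer weights
    embedding = [1.0, 0.2, 0.3]     # hypothetical token embedding
    total, adv = adversarial_training_loss(weights, embedding, 1, epsilon=0.5)
    print("clean loss:", logistic_loss(weights, embedding, 1))
    print("adversarial loss:", logistic_loss(weights, adv, 1))
    print("combined training loss:", total)
```

In this sketch the clean and perturbed embeddings go through the same scorer and their losses are summed, which mirrors the abstract's description of feeding both the initial representation and the adversarial samples to the downstream layers. How the two losses are weighted in AT-RBC is not stated in the abstract.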
