    Citation: LI Yibin, ZHANG Huanhuan. Chinese Packaging Product Entity Recognition Based on Bidirectional GRU-CRF[J]. Journal of East China University of Science and Technology, 2019, 45(3): 486-490. DOI: 10.14135/j.cnki.1006-3080.20180407001

    Chinese Packaging Product Entity Recognition Based on Bidirectional GRU-CRF

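    • Abstract: To automate information extraction for the packaging industry, packaging products mentioned in text must be identified through named entity recognition. This paper designs a Chinese packaging product entity recognition method based on a bidirectional GRU-CRF model. Taking pre-trained domain word vectors as input, a bidirectional GRU network models the contextual semantic information, and a CRF layer at the output predicts the best label sequence. The model is compared with traditional sequence labeling models and recurrent neural network models on a packaging product text dataset. The experimental results show that the proposed model needs less manual feature engineering and achieves higher precision and recall.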
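
For readers unfamiliar with the bidirectional GRU named in the abstract, the commonly used GRU update equations are sketched below. This is the standard textbook formulation, not necessarily the exact variant used in the paper; the symbols (input word vector x_t, hidden state h_t, weight matrices W and U, biases b) are notation assumed here, and some presentations swap the roles of z_t and 1 - z_t in the last update.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)} \\
h_t^{\mathrm{bi}} &= \bigl[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\bigr] && \text{(forward and backward states concatenated)}
\end{aligned}
```

The concatenated state gives each word a representation that depends on both its left and right context, and this is what a CRF output layer would then score.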

       

      Abstract: As the packaging industry develops, the naming conventions of packaging products have become increasingly diverse, so named entity recognition (NER) of these products is becoming a necessity for packaging information extraction. Chinese product names tend to be long and compositionally complex, which makes them difficult to recognize in text corpora. After analyzing current NER algorithms, this paper proposes a Chinese packaging product NER method based on a bidirectional GRU-CRF model. The GRU (gated recurrent unit) is an improved structure for the hidden-layer nodes of a recurrent neural network (RNN). In the proposed model, a bidirectional GRU network stores and represents the contextual semantic information of each word, while the CRF models the transition probabilities within the output label sequence. Text documents such as news reports, announcements, and regulations are gathered from a packaging vertical website, and word vectors are trained on them as pre-trained distributed representations of the domain vocabulary. After product entities in the text data are labeled automatically, word sequences are fed to the model as vectors, and the best label sequence is generated to mark the product entities in each sentence. Finally, the proposed model is compared with other classical models and state-of-the-art RNN models on the Chinese packaging corpus. The experimental results show that the proposed model achieves a precision of 82.45% and a recall of 80.38%. A further series of comparison experiments, using input vectors of different lengths in both word-level and character-level representations, finds that the word-level representation fits the corpus and model used here better. The method also requires less manual feature engineering than traditional machine learning models such as CRF and HMM. Hence, the bidirectional GRU-CRF method with word-level distributed representation is more suitable for the Chinese packaging product recognition task.
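
To illustrate how a CRF output layer can pick the best label sequence from per-word scores, the following is a minimal Viterbi decoding sketch. It is not the authors' implementation: the function name `viterbi_decode`, the three-tag set O/B/I, and all score values are assumptions made for this example, and the emission scores are hand-written stand-ins for what a bidirectional GRU would produce.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of label indices.

    emissions[t][j]   : score of label j at position t (assumed BiGRU output).
    transitions[i][j] : score of moving from label i to label j (CRF parameters).
    """
    if not emissions:
        return []
    n_labels = len(transitions)
    # best[j] = best total score of any path that ends in label j so far.
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Hypothetical tag set and scores, for illustration only.
LABELS = ["O", "B", "I"]
transitions = [
    [0.0, 0.0, -10.0],  # from O: an entity cannot continue (I) right after O
    [0.0, 0.0, 0.0],    # from B
    [0.0, 0.0, 0.0],    # from I
]
emissions = [
    [2.0, 0.0, 0.0],    # word 1: most likely outside an entity
    [0.0, 1.0, 1.5],    # word 2: I scores slightly higher than B locally
    [0.0, 0.0, 2.0],    # word 3: inside an entity
]
decoded = [LABELS[k] for k in viterbi_decode(emissions, transitions)]
print(decoded)  # ['O', 'B', 'I']
```

Choosing the top-scoring label for each word independently would give O, I, I here, which is an invalid tag sequence. The transition scores steer the decoder to O, B, I instead, which is the kind of label-sequence constraint the abstract attributes to the CRF layer.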

       
