
  • ISSN 1006-3080
  • CN 31-1691/TQ

Chinese Named Entity Recognition Based on Hierarchical Adjustment of Lexicon Information

LI Baochang, GUO Weibin

Citation: LI Baochang, GUO Weibin. Research on Chinese Named Entity Recognition Based on Hierarchical Adjustment of Lexicon Information[J]. Journal of East China University of Science and Technology. doi: 10.14135/j.cnki.1006-3080.20211105003


doi: 10.14135/j.cnki.1006-3080.20211105003
Funding: National Natural Science Foundation of China (61672227, 62076094)
Details
    About the author:

    LI Baochang (1996—), male, born in Qingdao, Shandong; master's student whose main research interest is natural language processing. E-mail: l4853720@163.com

    Corresponding author:

    GUO Weibin, E-mail: gweibin@ecust.edu.cn

  • CLC number: TP391.1

Research on Chinese Named Entity Recognition Based on Hierarchical Adjustment of Lexicon Information

  • Abstract: In Chinese named entity recognition (NER), fusing lexicon information into character representations enriches text features. However, a single character may match multiple candidate words, which easily causes lexical conflicts, and fusing irrelevant word information degrades the model's recognition performance. To address this, a Chinese NER method based on hierarchical adjustment of lexicon information is proposed. First, all potential words are grouped into layers by word length, and feedback from higher-layer words adjusts the weights of lower-layer words so that the more useful information is retained, thereby alleviating semantic bias and reducing the impact of lexical conflicts. Then, the word information is concatenated with the character information to strengthen the text feature representation. Experimental results on the Resume and Weibo datasets show that the proposed method outperforms traditional methods.
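The layering-and-feedback idea in the abstract can be illustrated with a small sketch. This is not the authors' code: the containment-based feedback rule and the 0.5 damping factor are hypothetical stand-ins for the paper's learned weight adjustment, shown only to make the mechanism concrete. Candidate words matched against a lexicon are grouped into layers by length, and each higher layer (longer words) down-weights the shorter, conflicting words it subsumes.

```python
# Illustrative sketch of lexicon layering with feedback adjustment.
# Hypothetical names and damping factor; the paper learns these weights.
from collections import defaultdict

def match_words(sentence, lexicon):
    """Return all (start, end, word) spans of lexicon words (length >= 2)."""
    spans = []
    n = len(sentence)
    for i in range(n):
        for j in range(i + 2, n + 1):
            w = sentence[i:j]
            if w in lexicon:
                spans.append((i, j, w))
    return spans

def layered_weights(spans, damping=0.5):
    """Group spans into layers by word length; each higher layer damps the
    weight of any lower-layer span it fully contains (feedback adjustment)."""
    layers = defaultdict(list)
    for s in spans:
        layers[s[1] - s[0]].append(s)
    weights = {s: 1.0 for s in spans}
    lengths = sorted(layers, reverse=True)          # longest words first
    for hi_idx, hi_len in enumerate(lengths):
        for lo_len in lengths[hi_idx + 1:]:
            for (hs, he, _) in layers[hi_len]:
                for lo in layers[lo_len]:
                    ls, le, _w = lo
                    if hs <= ls and le <= he:       # nested (conflicting) word
                        weights[lo] *= damping
    return weights

lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
sentence = "南京市长江大桥"
spans = match_words(sentence, lexicon)
w = layered_weights(spans)
# "长江大桥" (layer 4) suppresses the nested "长江" and "大桥"
print(w[(3, 5, "长江")], w[(3, 7, "长江大桥")])    # → 0.5 1.0
```

The down-weighted word embeddings would then be concatenated with the character embeddings, as the abstract describes, before entering the BiLSTM encoder.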

     

  • Figure 1. Overall framework of the proposed model

    Figure 2. Hierarchical layering of lexicon information

    Figure 3. Feedback adjustment of lexicon information

Table 1. Dataset statistics

    | Dataset | Type     | Train  | Dev   | Test  |
    |---------|----------|--------|-------|-------|
    | Weibo   | Sentence | 1400   | 270   | 270   |
    |         | Char     | 73800  | 14500 | 14800 |
    | Resume  | Sentence | 3800   | 460   | 480   |
    |         | Char     | 124100 | 13900 | 15100 |

Table 2. Parameter settings

    | Parameter                  | Value |
    |----------------------------|-------|
    | Character vector dimension | 50    |
    | Word vector dimension      | 50    |
    | Dropout                    | 0.5   |
    | Learning rate              | 0.015 |
    | Attenuation rate           | 0.05  |
    | Number of LSTM hidden nodes| 200   |
    | BiLSTM layers              | 1     |
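Table 2's hyper-parameters can be collected in a configuration sketch. The per-epoch decay formula lr/(1 + decay·epoch) is an assumption, since the table does not say how the attenuation rate 0.05 is applied; it is the scheme commonly used in Lattice-LSTM-style implementations.

```python
# Hedged sketch of the Table 2 hyper-parameters; names are hypothetical.
CONFIG = {
    "char_emb_dim": 50,
    "word_emb_dim": 50,
    "dropout": 0.5,
    "learning_rate": 0.015,
    "lr_decay": 0.05,      # "attenuation rate" in Table 2
    "lstm_hidden": 200,
    "bilstm_layers": 1,
}

def lr_at_epoch(epoch, lr=CONFIG["learning_rate"], decay=CONFIG["lr_decay"]):
    """Assumed per-epoch decay schedule: lr / (1 + decay * epoch)."""
    return lr / (1 + decay * epoch)

print(lr_at_epoch(0), round(lr_at_epoch(10), 4))   # → 0.015 0.01
```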

Table 3. Experimental results of different models on the Resume dataset

    | Model                  | P/%   | R/%   | F1/%  |
    |------------------------|-------|-------|-------|
    | Word-based             | 93.72 | 93.44 | 93.58 |
    | Char-based             | 93.66 | 93.31 | 93.48 |
    | Lattice-LSTM           | 94.81 | 94.11 | 94.46 |
    | LR-CNN                 | 95.37 | 94.84 | 95.11 |
    | WC-LSTM+Shortest       | 93.26 | 94.91 | 94.94 |
    | WC-LSTM+Longest        | 95.27 | 95.15 | 95.21 |
    | WC-LSTM+Average        | 95.09 | 94.97 | 95.03 |
    | WC-LSTM+Self-attention | 95.14 | 94.79 | 94.96 |
    | S-LSTM                 | 94.87 | 94.77 | 94.82 |
    | H-LSTM                 | 95.23 | 95.18 | 95.20 |
    | Att-LSTM               | 95.68 | 95.15 | 95.42 |
    | HA-LSTM                | 95.98 | 95.28 | 95.63 |

Table 4. F1 values of different models on the Weibo dataset

    | Model                  | NE F1/% | NM F1/% | Overall F1/% |
    |------------------------|---------|---------|--------------|
    | Peng (2015)            | 51.96   | 61.05   | 56.05        |
    | He (2017)              | 50.60   | 59.32   | 54.82        |
    | Char-based             | 46.11   | 55.29   | 52.77        |
    | Lattice-LSTM           | 53.04   | 62.25   | 58.79        |
    | LR-CNN                 | 57.14   | 66.67   | 59.92        |
    | WC-LSTM+shortest       | 52.99   | 65.75   | 59.20        |
    | WC-LSTM+longest        | 52.55   | 67.41   | 59.84        |
    | WC-LSTM+average        | 53.19   | 64.17   | 58.67        |
    | WC-LSTM+self-attention | 49.86   | 65.31   | 57.51        |
    | S-LSTM                 | 53.91   | 62.27   | 59.17        |
    | H-LSTM                 | 53.55   | 64.41   | 59.24        |
    | Att-LSTM               | 53.24   | 64.02   | 59.35        |
    | HA-LSTM                | 54.02   | 63.31   | 59.96        |

Table 5. Time efficiency of different models on the Resume dataset

    | Model                     | t/s | Articles/s |
    |---------------------------|-----|------------|
    | Lattice-LSTM (batch_size=1) | 980 | 3.9      |
    | HA-LSTM (batch_size=1)      | 346 | 11       |
    | HA-LSTM (batch_size=4)      | 160 | 23.9     |
  • [1] ZHAO S, CAI Z, CHEN H, et al. Adversarial training based lattice LSTM for Chinese clinical named entity recognition[J]. Journal of Biomedical Informatics, 2019, 99(14): 103290.
    [2] OLIVETTI E A, COLE J M, KIM E, et al. Data-driven materials research enabled by natural language processing and information extraction[J]. Applied Physics Reviews, 2020, 7(4): 1-19.
    [3] ZHANG J, ZONG C. Neural machine translation: Challenges, progress and future[J]. Science China Technological Sciences, 2020, 63: 2028-2050. doi: 10.1007/s11431-020-1632-x
    [4] DIEFENBACH D, LOPEZ V, SINGH K, et al. Core techniques of question answering systems over knowledge bases: A survey[J]. Knowledge and Information Systems, 2018, 55(3): 529-569. doi: 10.1007/s10115-017-1100-y
    [5] PATIL N V, PATIL A S, PAWAR B V. HMM based named entity recognition for inflectional language[C]//2017 International Conference on Computer, Communications and Electronics (Comptelix). USA: IEEE, 2017: 565-572.
    [6] 朱颢东, 杨立志, 丁温雪, et al. Named entity recognition in Chinese microblogs based on topic tags and CRF[J]. Journal of Central China Normal University (Natural Science Edition), 2018, 52(3): 316-321. (in Chinese)
    [7] DIN U M U, ANWAR M W, MALLAH G A. Maximum entropy based Urdu part of speech tagging[C]//International Conference on Intelligent Technologies and Applications. Singapore: Springer, 2019: 484-492.
    [8] JU Z, WANG J, ZHU F. Named entity recognition from biomedical text using SVM[C]//2011 5th International Conference on Bioinformatics and Biomedical Engineering. Wuhan, China: IEEE, 2011: 1-4.
    [9] COLLOBERT R, WESTON J, BOTTOU L, et al. Natural language processing (almost) from scratch[J]. Journal of Machine Learning Research, 2011, 12: 2493-2537.
    [10] JIN Y, XIE J, GUO W, et al. LSTM-CRF neural network with gated self attention for Chinese NER[J]. IEEE Access, 2019, 7: 136694-136703. doi: 10.1109/ACCESS.2019.2942433
    [11] HUANG Z, XU W, YU K. Bidirectional LSTM-CRF models for sequence tagging[EB/OL]. (2015-08-09)[2021-10-10]. https://arxiv.org/abs/1508.01991.
    [12] MAHANTA H. A study on the approaches of developing a named entity recognition tool[J]. International Journal of Research in Engineering and Technology, 2013, 2(14): 58-61. doi: 10.15623/ijret.2013.0214011
    [13] LU Y, ZHANG Y, JI D. Multi-prototype Chinese character embedding[C]//Proceedings of the Tenth International Conference on Language Resources and Evaluation. Slovenia: LREC, 2016: 855-859.
    [14] DONG C, ZHANG J, ZONG C, et al. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition[M]//Natural Language Understanding and Intelligent Applications. Cham: Springer, 2016: 239-250.
    [15] TIAN Y, SONG Y, XIA F, et al. Improving Chinese word segmentation with wordhood memory networks[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. USA: ACL, 2020: 8274-8285.
    [16] LIU Y J, ZHANG Y, CHE W X, et al. Domain adaptation for CRF-based Chinese word segmentation using free annotations[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL, 2014: 864-874.
    [17] CHEN X, SHI Z, QIU X, et al. Adversarial multi-criteria learning for Chinese word segmentation[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Canada: ACL, 2017: 1193-1203.
    [18] ZHANG Y, YANG J. Chinese NER using lattice LSTM[C]//56th Annual Meeting of the Association for Computational Linguistics (ACL). Australia: ACL, 2018: 1554-1564.
    [19] LIU W, XU T, XU Q, et al. An encoding strategy based word-character LSTM for Chinese NER[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. USA: ACL, 2019: 2379-2389.
    [20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. USA: MIT Press, 2017: 5998-6008.
    [21] LI J, SUN A, HAN J, et al. A survey on deep learning for named entity recognition[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 34(1): 50-70.
    [22] GUI T, MA R, ZHANG Q, et al. CNN-based Chinese NER with lexicon rethinking[C]//Twenty-Eighth International Joint Conference on Artificial Intelligence. Macao: Springer, 2019: 4982-4988.
    [23] PENG N, DREDZE M. Named entity recognition for Chinese social media with jointly trained embeddings[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Portugal: ACL, 2015: 548-554.
    [24] HE H, SUN X. F-score driven max margin neural network for named entity recognition in Chinese social media[C]//15th Conference of the European Chapter of the Association for Computational Linguistics. Valencia: ACL, 2017: 713-718.
Publication history
  • Received: 2021-11-05
  • Accepted: 2021-12-31
  • Published online: 2022-04-12
