
Text Classification Method of Accident Cases Based on BERT Pre-training Model

TU Yuanlai, ZHOU Jiale, WANG Huifeng

Citation: TU Yuanlai, ZHOU Jiale, WANG Huifeng. Text Classification Method of Accident Cases Based on BERT Pre-training Model[J]. Journal of East China University of Science and Technology. doi: 10.14135/j.cnki.1006-3080.20220223002


doi: 10.14135/j.cnki.1006-3080.20220223002
Funding: Young Scientists Fund of the National Natural Science Foundation of China (61906068); National Key Research and Development Program of China (2018YFC1803306)
Author information:

    TU Yuanlai (b. 1996), male, a native of Nanchang, Jiangxi Province; master's student. Research interests: natural language processing and hazard identification. E-mail: 394901837@qq.com

    Corresponding author:

    ZHOU Jiale, E-mail: zhou.jiale@ecust.edu.cn

  • CLC number: TP183; X45


  • Abstract: The large volume of accident information held in accident case databases, covering when and where each accident occurred, its causes, and how it unfolded, provides rich and valuable experience for the design of safety-critical systems. This information plays a crucial role in hazard identification, but it is usually scattered across the paragraphs of accident documents, making manual extraction inefficient and costly. This paper proposes a text classification method for accident cases based on the BERT (Bidirectional Encoder Representations from Transformers) pre-trained model, which classifies accident case texts into four categories: ACCIDENT, CAUSE, CONSEQUENCE, and RESPONSE. In addition, a dataset of accident case texts was collected and constructed to train the model. Experiments show that the method classifies accident case texts automatically with a precision of 73.44%, a recall of 69.13%, and an F1 score of 0.71.
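The paper does not publish an implementation, but the workflow the abstract describes (fine-tuning a pre-trained BERT encoder with a four-label classification head, then predicting a label for each text segment) can be sketched in a few lines. The sketch below is illustrative only: it assumes the Hugging Face transformers library, the training example is a placeholder in the style of Table 1, and only the checkpoint name, learning rate, and batch size (bert-base-uncased, 0.00002, 32, the best settings in Tables 4-6) come from the paper.

```python
# Minimal sketch (not the authors' released code): fine-tuning BERT for
# 4-way accident-case text classification. Assumes the Hugging Face
# `transformers` library; the training data below is a placeholder.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer

LABELS = ["ACCIDENT", "CAUSE", "CONSEQUENCE", "RESPONSE"]  # classes from the abstract

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

# Placeholder training example in the style of Table 1 (label 0 = ACCIDENT).
train_texts = ["Aerosol cans, packed in cartons on pallets in a store, caught fire."]
train_labels = [0]

enc = tokenizer(train_texts, padding=True, truncation=True,
                max_length=512, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # batch size from Table 6
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate from Table 5

model.train()
for epoch in range(3):  # epoch count is an assumption; the paper only shows the loss curve
    for input_ids, attention_mask, labels in loader:
        loss = model(input_ids=input_ids, attention_mask=attention_mask,
                     labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference: assign the highest-scoring class to a new text segment.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer("About 140 firemen fought the fire for 4 hours.",
                               return_tensors="pt")).logits
print(LABELS[int(logits.argmax(dim=-1))])  # ideally RESPONSE after full training
```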

  • Figure 1.  Pretraining-finetuning process of the BERT model

    Figure 2.  Encoder structure

    Figure 3.  Overall research framework for accident case text classification

    Figure 4.  BERT-based fine-tuning classification model

    Figure 5.  Training loss

    Table 1.  Accident case text examples

    Label 0 (ACCIDENT): Aerosol cans, packed in cartons on pallets in a store, caught fire. The store belongs to a factory in an urban area which prepares this type of product. The store is in the lower basement of the factory. The aerosol cans exploded during the fire, making the firefighting more difficult - a mini-BLEVE. The fire spread very rapidly to all the installation.

    Label 1 (CAUSE): Two principal causes led to the accident: the immediate cause was a fire (seen by the driver) which started under a fork-lift truck when it passed through the store. This was caused by an aerosol can which had fallen earlier and was crushed, with subsequent ignition of the gas. Moreover, the aerosol cans returned by customers had leaks. The fork-lift truck was not a priori of an appropriate type for this area.

    Label 2 (CONSEQUENCE): One employee and 4 firemen were injured, the firemen while firefighting. Fire-fighting water was contained within the site's retention system, so there was no release to the environment.

    Label 3 (RESPONSE): The detection and alarm system worked. About 140 firemen fought the fire for 4 hours. About 100 people were evacuated, since the factory was in an urban area.

    Table 2.  Classification results of different models

    Model                  Precision/%   Recall/%   F1
    BERT                   73.44         69.13      0.71
    SVM                    60.79         58.57      0.60
    Logistic regression    58.39         53.88      0.56
    Naive Bayes            62.34         59.89      0.60
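As a quick sanity check on Table 2 (the paper does not state how the per-class scores are averaged, so only the BERT row is checked here), the F1 column is consistent with the harmonic mean of the reported precision and recall:

$$ F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.7344 \times 0.6913}{0.7344 + 0.6913} \approx 0.71 $$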

    Table 3.  Structure and parameters of different pre-trained models

    Model                Layers   Hidden size   Parameters
    bert-base-uncased    12       768           110M
    bert-base-cased      12       768           110M
    bert-large-uncased   24       1024          340M
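The structural figures in Table 3 match the published configurations of the Google BERT checkpoints; one way to confirm them (a sketch assuming the Hugging Face transformers library) is to read each model's config:

```python
# Sketch: read the layer count and hidden size in Table 3 from each
# checkpoint's published configuration (assumes Hugging Face `transformers`).
from transformers import BertConfig

for name in ["bert-base-uncased", "bert-base-cased", "bert-large-uncased"]:
    cfg = BertConfig.from_pretrained(name)
    print(f"{name}: {cfg.num_hidden_layers} layers, hidden size {cfg.hidden_size}")
# bert-base-*:        12 layers, hidden size 768  (~110M parameters)
# bert-large-uncased: 24 layers, hidden size 1024 (~340M parameters)
```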

    Table 4.  Classification results of different pre-trained models

    Model                Learning rate   Precision/%   Recall/%   F1
    bert-base-uncased    0.00002         73.44         69.13      0.71
                         0.00005         73.39         68.58      0.71
                         0.00009         70.27         67.24      0.69
    bert-base-cased      0.00002         72.55         67.43      0.71
                         0.00005         72.67         66.58      0.71
                         0.00009         71.87         67.09      0.70
    bert-large-uncased   0.00002         72.84         68.83      0.71
                         0.00005         71.49         67.69      0.71
                         0.00009         71.37         68.99      0.71

    Table 5.  Classification results under different learning rates

    Learning rate   Precision/%   Recall/%   F1
    0.00002         73.44         69.13      0.71
    0.00003         72.84         68.34      0.71
    0.00005         73.39         68.58      0.71
    0.00007         72.33         68.91      0.71
    0.00009         70.27         67.24      0.69

    Table 6.  Classification results under different batch sizes (learning rate: 0.00002)

    Batch size   Precision/%   Recall/%   F1
    8            69.79         65.17      0.67
    16           70.64         66.94      0.69
    32           73.44         69.13      0.71
Publication history
  • Received: 2022-02-23
  • Published online: 2022-06-07
