Text Classification Method of Accident Cases Based on BERT Pre-training Model
-
Abstract: The large amount of accident information in accident case databases, including the time and place of occurrence, causes, and course of events, provides rich and valuable experience for the design of safety-critical systems. This information plays an important role in hazard identification, but it is usually scattered across the paragraphs of accident documents, which makes manual extraction inefficient and costly. This paper presents a text classification method for accident cases based on the BERT (Bidirectional Encoder Representations from Transformers) pre-training model, which classifies accident case texts into four categories: ACCIDENT, CAUSE, CONSEQUENCE, and RESPONSE. In addition, an accident case text dataset was collected and constructed for training the model. Experiments show that the method automatically classifies accident case texts with a precision of 73.44%, a recall of 69.13%, and an F1 score of 0.71. Multiple groups of experimental parameters were set up, and the influence of parameter settings on classification performance was explored through experiments to find the best configuration. This classification method helps to better mine the semantic information in accident case texts and provides strong technical support for the subsequent construction of an expert knowledge base and an efficient accident retrieval platform.
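The classifier described above can be sketched with the Hugging Face `transformers` library (an assumption for illustration only; the paper does not name its implementation framework). To keep the example self-contained, a randomly initialised bert-base-sized model is built from a config; in practice the pretrained weights would be loaded with `BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)` before fine-tuning:

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# The four target categories from the paper.
LABELS = ["ACCIDENT", "CAUSE", "CONSEQUENCE", "RESPONSE"]

# Default BertConfig matches bert-base geometry: 12 layers, hidden size 768.
config = BertConfig(num_labels=len(LABELS))
model = BertForSequenceClassification(config)  # random init; sketch only
model.eval()

# One dummy batch: token ids for a "sentence" of length 8.
input_ids = torch.randint(0, config.vocab_size, (1, 8))
with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # shape: (1, 4)

# The predicted class is the argmax over the 4 output logits.
pred = LABELS[int(logits.argmax(dim=-1))]
```

Fine-tuning then consists of minimising cross-entropy over these four logits on the labelled accident case texts.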
-
Key words:
- Hazard identification
- Text classification
- BERT
- Requirement analysis
- Safety-critical system
-
Table 1. Accident case text example
Label  Text type    Text description
0      ACCIDENT     Aerosol cans, packed in cartons on pallets in a store, caught fire. The store belongs to a factory in an urban area which prepares this type of product. The store is in the lower basement of the factory. The aerosol cans exploded during the fire, making the firefighting more difficult - a mini-BLEVE. The fire spread very rapidly to all the installation.
1      CAUSE        Two principal causes led to the accident: the immediate cause was a fire (seen by the driver) which started under a fork-lift truck when it passed through the store. This was caused by an aerosol can which had fallen earlier and was crushed, with subsequent ignition of the gas. Moreover, the aerosol cans returned by customers had leaks. The fork-lift truck was not a priori of an appropriate type for this area.
2      CONSEQUENCE  One employee and 4 firemen were injured, the firemen while firefighting. Fire-fighting water was contained within the site's retention system, so there was no release to the environment.
3      RESPONSE     The detection and alarm system worked. About 140 firemen fought the fire for 4 hours. About 100 people were evacuated, since the factory was in an urban area.
Table 2. Classification results of different model methods
Model                Precision/%  Recall/%  F1
BERT                 73.44        69.13     0.71
SVM                  60.79        58.57     0.60
Logistic regression  58.39        53.88     0.56
Naive Bayes          62.34        59.89     0.60
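The precision, recall and F1 values reported here and in the following tables appear to be macro-averaged over the four classes, with F1 computed from the macro-averaged precision and recall (an inference from the reported numbers: 2 × 0.7344 × 0.6913 / (0.7344 + 0.6913) ≈ 0.71). A minimal sketch of that computation:

```python
LABELS = ["ACCIDENT", "CAUSE", "CONSEQUENCE", "RESPONSE"]

def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over the four classes."""
    precisions, recalls = [], []
    for label in LABELS:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        pred_pos = sum(1 for p in y_pred if p == label)  # predicted as label
        true_pos = sum(1 for t in y_true if t == label)  # actually label
        precisions.append(tp / pred_pos if pred_pos else 0.0)
        recalls.append(tp / true_pos if true_pos else 0.0)
    precision = sum(precisions) / len(LABELS)
    recall = sum(recalls) / len(LABELS)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Tiny worked example on four labelled sentences (hypothetical data).
gold = ["ACCIDENT", "CAUSE", "CAUSE", "RESPONSE"]
predicted = ["ACCIDENT", "CAUSE", "RESPONSE", "RESPONSE"]
p, r, f = macro_prf(gold, predicted)  # each 0.625 on this toy example
```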
Table 3. Structure and parameters of different pre-training models
Model               Layers  Hidden size  Parameters
bert-base-uncased   12      768          110M
bert-base-cased     12      768          110M
bert-large-uncased  24      1024         340M
Table 4. Classification results of different pre-training models
Model               Learning rate  Precision/%  Recall/%  F1
bert-base-uncased   0.00002        73.44        69.13     0.71
bert-base-uncased   0.00005        73.39        68.58     0.71
bert-base-uncased   0.00009        70.27        67.24     0.69
bert-base-cased     0.00002        72.55        67.43     0.71
bert-base-cased     0.00005        72.67        66.58     0.71
bert-base-cased     0.00009        71.87        67.09     0.70
bert-large-uncased  0.00002        72.84        68.83     0.71
bert-large-uncased  0.00005        71.49        67.69     0.71
bert-large-uncased  0.00009        71.37        68.99     0.71
Table 5. Classification results of models under different learning rates
Learning rate  Precision/%  Recall/%  F1
0.00002        73.44        69.13     0.71
0.00003        72.84        68.34     0.71
0.00005        73.39        68.58     0.71
0.00007        72.33        68.91     0.71
0.00009        70.27        67.24     0.69
Table 6. Classification results of models under different batch sizes (Learning rate: 0.00002)
Batch size  Precision/%  Recall/%  F1
8           69.79        65.17     0.67
16          70.64        66.94     0.69
32          73.44        69.13     0.71