Identification Method for Dangerous Gas Cylinder Scenes in the Laboratory Based on Image Captioning
Keywords:
- gas cylinder supervision
- hazard identification
- image captioning
- multimodal embedding space
- Transformer model
Abstract: Gas cylinders are common equipment in laboratories; they are numerous, their risks are easily concealed, and the accidents they cause can be severe, so cylinder supervision is essential to laboratory safety management. Video surveillance is an effective means of laboratory safety management, but the footage must be watched by designated staff, and because the competence of surveillance personnel varies, there is no guarantee that dangerous information in the video will be recognized. This paper therefore proposes an image caption generation method for laboratory gas cylinder scenes that combines object detection with text detection and recognition; it identifies potential danger information in the cylinder scene and warns monitoring personnel in text form. First, features of the scene objects and of the text on the cylinder body are extracted and mapped into a multimodal embedding space. Then, a Transformer model is used to generate the caption. Finally, whether the scene is dangerous is judged from the generated description. Experimental results show that the descriptions generated by this method can effectively identify dangerous objects and the causes of danger in laboratory gas cylinder scenes.
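The pipeline described in the abstract (detector features plus OCR features, mapped into a shared embedding space and decoded by a Transformer) can be summarized in code. The following is a minimal sketch under assumed module names and dimensions; it is not the authors' implementation, and in practice a Faster R-CNN-style detector and a text recognizer would supply the real features.

```python
# Minimal sketch of the captioning pipeline. All names and dimensions are
# illustrative assumptions; real detector/OCR models would supply the inputs.
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    """Fuse detector and OCR features, then decode a caption."""
    def __init__(self, vis_dim=2048, ocr_dim=512, embed_dim=512, vocab_size=1000):
        super().__init__()
        # Project both modalities into one shared (multimodal) embedding space.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.ocr_proj = nn.Linear(ocr_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, region_feats, ocr_feats, caption_tokens):
        # region_feats: (B, N_obj, vis_dim) from the object detector;
        # ocr_feats:    (B, N_txt, ocr_dim) from text detection/recognition.
        memory = torch.cat([self.vis_proj(region_feats),
                            self.ocr_proj(ocr_feats)], dim=1)
        tgt = self.word_embed(caption_tokens)            # (B, T, embed_dim)
        return self.out(self.decoder(tgt, memory))       # (B, T, vocab_size)

# Toy forward pass with random tensors standing in for real features.
model = MultimodalCaptioner()
logits = model(torch.randn(1, 10, 2048), torch.randn(1, 3, 512),
               torch.randint(0, 1000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 1000])
```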
Table 1. Classification of dangerous cylinder scenes

Class | Cause of danger
I     | The cylinder is not fixed
II    | Two incompatible cylinders are placed together
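The final step of the method, judging danger from the generated sentence, can be illustrated with a simple rule check keyed to the Table 1 classes. This is a hypothetical sketch: the cue phrases below are assumptions, since the paper's exact judgment rules are not reproduced here.

```python
# Hypothetical rule-based danger check on a generated caption.
# The cue phrases are illustrative assumptions keyed to Table 1.
DANGER_RULES = {
    "I": ["not fixed", "unfixed"],      # class I: cylinder not fixed
    "II": ["placed together"],          # class II: incompatible cylinders together
}

def classify_danger(caption: str) -> list[str]:
    """Return the Table 1 danger classes whose cues appear in the caption."""
    caption = caption.lower()
    return [cls for cls, cues in DANGER_RULES.items()
            if any(cue in caption for cue in cues)]

print(classify_danger("an oxygen cylinder and a hydrogen cylinder "
                      "are placed together"))                    # ['II']
print(classify_danger("a cylinder is not fixed to the wall"))    # ['I']
```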
Table 2. Selected network training parameters

Batch size | Momentum | Decay | Learning rate
4          | 0.9      | 0.001 | 0.0005
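As a usage note, the Table 2 values map directly onto a standard optimizer configuration. The sketch below assumes SGD with momentum and reads "Decay" as weight decay; both pairings are assumptions, not stated in the table.

```python
# Applying the Table 2 settings, assuming SGD with momentum and that
# "Decay" denotes weight decay (assumptions; the table does not say).
import torch

net = torch.nn.Linear(8, 2)  # stand-in for the detection network
optimizer = torch.optim.SGD(net.parameters(), lr=0.0005,
                            momentum=0.9, weight_decay=0.001)
batch_size = 4  # Table 2 batch size
```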
Table 3. Comparison of different object detection algorithms in cylinder scenes

Scene | Baseline | ResNet | FPN | AP(Cylinder) | AP(Carrier) | AP(Strap) | AP(Cabinet) | mAP
I     | √        |        |     | 0.746        | 0.754       | 0.625     | 0.748       | 0.718
I     | √        | √      |     | 0.759        | 0.751       | 0.628     | 0.774       | 0.728
I     | √        | √      | √   | 0.817        | 0.827       | 0.687     | 0.817       | 0.787
II    | √        |        |     | 0.754        | —           | 0.636     | 0.782       | 0.724
II    | √        | √      |     | 0.767        | —           | 0.641     | 0.821       | 0.743
II    | √        | √      | √   | 0.826        | —           | 0.713     | 0.892       | 0.810
III   | √        |        |     | 0.752        | 0.735       | 0.630     | 0.770       | 0.722
III   | √        | √      |     | 0.763        | 0.730       | 0.634     | 0.812       | 0.734
III   | √        | √      | √   | 0.821        | 0.796       | 0.702     | 0.884       | 0.801
Table 4. Rules for labeling positive and negative samples

Sample class | Rule
Positive     | The candidate box has the highest IoU with a GT box and the included angle is less than 15°
Positive     | The IoU between the candidate box and a GT box is greater than 0.7 and the included angle is less than 15°
Negative     | The IoU between the candidate box and every GT box is less than 0.3
Negative     | The IoU between the candidate box and a GT box is greater than 0.7 but the included angle is greater than 15°
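The Table 4 assignment can be expressed as a small labeling function over a candidate box's IoU with the ground truth and the angle between their orientations. This is a sketch that assumes rotated-box IoU is computed elsewhere; treating boxes that match neither rule as ignored is also an assumption about the unstated case.

```python
# Sketch of the Table 4 positive/negative assignment for rotated proposals.
# Rotated-box IoU computation is assumed to be done elsewhere.
def label_sample(iou: float, is_highest_iou: bool, angle_deg: float) -> str:
    """Label one candidate box against its best-matching ground-truth box."""
    if (is_highest_iou or iou > 0.7) and angle_deg < 15:
        return "positive"
    if iou < 0.3 or (iou > 0.7 and angle_deg > 15):
        return "negative"
    return "ignore"  # neither rule fires (assumption: such boxes are skipped)

print(label_sample(iou=0.80, is_highest_iou=False, angle_deg=10))  # positive
print(label_sample(iou=0.80, is_highest_iou=False, angle_deg=20))  # negative
print(label_sample(iou=0.20, is_highest_iou=False, angle_deg=5))   # negative
```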
Table 5. Experimental results of text detection and recognition on cylinders

Recognized text | Confidence
CO2             | 0.807
OXYGEN          | 0.917
N2              | 0.772
Table 6. Comparison between our method and other algorithms

Algorithm      | BLEU-1 | BLEU-4 | ROUGE | CIDEr
Soft-Attention | 0.630  | 0.248  | —     | 0.653
Adaptive       | 0.642  | 0.345  | 0.539 | 0.788
Ours           | 0.792  | 0.572  | 0.724 | 1.068
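For reference, BLEU-1 and BLEU-4 from Table 6 can be computed with NLTK as shown below; the sentence pair is a toy example rather than the paper's data, and which evaluation toolkit the authors actually used is an assumption.

```python
# Toy BLEU-1/BLEU-4 computation with NLTK (two of the Table 6 metrics).
from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "cylinder", "is", "not", "fixed", "to", "the", "wall"]]
hypothesis = ["a", "cylinder", "is", "not", "fixed"]

print(sentence_bleu(reference, hypothesis, weights=(1, 0, 0, 0)))  # BLEU-1
print(sentence_bleu(reference, hypothesis, weights=(0.25,) * 4))   # BLEU-4
```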