DeepOCSR: A Deep Encoder-Decoder Network for Optical Chemical Structure Recognition
Abstract: Optical chemical structure recognition from scientific publications is an essential step in rediscovering chemical structures and their properties, yet both rule-based approaches and emerging deep learning methods suffer from low recognition rates. This paper proposes DeepOCSR, a deep learning method for optical chemical structure recognition. Built on an encoder-decoder architecture, the method introduces Transformer and ResNeSt models to convert chemical structure images from publications into SMILES sequences. To train and validate the method, two new chemical structure datasets were constructed, one of which contains substituents that are common in the chemical literature. DeepOCSR was compared experimentally with existing publicly available approaches, and the results show that it outperforms them on several key evaluation metrics, including similarity and validity.
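As a rough illustration of the sequence side of this pipeline, the sketch below splits a SMILES string into tokens and wraps them in start and end markers, the kind of target sequence a decoder could be trained to emit. The regular expression, the `<sos>`/`<eos>` marker names, and the function name `tokenize_smiles` are assumptions made for illustration; the tokenization actually used by DeepOCSR is not described here.

```python
import re
from typing import List

# Hypothetical token pattern (an assumption, not taken from the paper):
# bracket atoms, two-letter halogens, two-digit ring closures, then
# single-letter atoms, ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:@*~$]"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens framed by <sos> and <eos>."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If joining the tokens does not reproduce the input, some character
    # was not covered by the pattern, so refuse rather than drop it silently.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return ["<sos>", *tokens, "<eos>"]


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Listing the two-letter halogens before the single-letter alternative keeps `Cl` and `Br` from being split into two tokens, which would otherwise be read as carbon and boron.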
Key words:
- deep learning
- chemical structure recognition
- encoder-decoder
- SMILES
- substituents
Table 1. Comparison of model parameter counts and batch processing times

| Model | Params/MB | Batch time/s |
| --- | --- | --- |
| DA-LSTM | 58.25 | 0.71 |
| DeepOCSR | 64.16 | 0.64 |
Table 2. Model performance with different numbers of stacked layers

| N | Accuracy/% | Similarity | Validity | BLEU |
| --- | --- | --- | --- | --- |
| 2 | 88.65 | 0.977 | 0.9943 | 0.9898 |
| 4 | 90.25 | 0.984 | 0.9943 | 0.9907 |
| 6 | 91.71 | 0.988 | 0.9958 | 0.9920 |
Table 3. Comparison of three methods on two test sets

| Data set | Method | Accuracy/% | Similarity | Validity | BLEU |
| --- | --- | --- | --- | --- | --- |
| CSDD | MolVec | 47.19 | 0.841 | 0.7775 | 0.6780 |
| CSDD | DA-LSTM | 70.90 | 0.917 | 0.9914 | 0.9423 |
| CSDD | DeepOCSR | 82.26 | 0.963 | 0.9940 | 0.9709 |
| CSDD-SUB | MolVec | 17.03 | 0.485 | 0.8647 | 0.5547 |
| CSDD-SUB | DA-LSTM | 76.10 | 0.959 | 0.9585 | 0.9657 |
| CSDD-SUB | DeepOCSR | 91.71 | 0.988 | 0.9958 | 0.9910 |
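To make the gaps in Table 3 easier to read, the following minimal sketch computes the per-metric margin of one method over a baseline on each test set. The scores are copied from Table 3; the dictionary layout and the helper name `margin_over` are illustrative choices rather than anything defined by the paper.

```python
from typing import Dict

# Scores copied from Table 3: (accuracy in %, similarity, validity, BLEU).
TABLE3 = {
    "CSDD": {
        "MolVec": (47.19, 0.841, 0.7775, 0.6780),
        "DA-LSTM": (70.90, 0.917, 0.9914, 0.9423),
        "DeepOCSR": (82.26, 0.963, 0.9940, 0.9709),
    },
    "CSDD-SUB": {
        "MolVec": (17.03, 0.485, 0.8647, 0.5547),
        "DA-LSTM": (76.10, 0.959, 0.9585, 0.9657),
        "DeepOCSR": (91.71, 0.988, 0.9958, 0.9910),
    },
}
METRICS = ("accuracy", "similarity", "validity", "bleu")


def margin_over(dataset: str, baseline: str, method: str = "DeepOCSR") -> Dict[str, float]:
    """Return `method` minus `baseline` for each metric on one test set."""
    ours = TABLE3[dataset][method]
    base = TABLE3[dataset][baseline]
    # Round to four decimals, the finest precision reported in the table.
    return {m: round(o - b, 4) for m, o, b in zip(METRICS, ours, base)}


if __name__ == "__main__":
    for dataset in TABLE3:
        print(dataset, margin_over(dataset, "DA-LSTM"))
```

Against DA-LSTM, the accuracy margin works out to 11.36 percentage points on CSDD and 15.61 on CSDD-SUB, so the advantage is larger on the dataset that includes substituents.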
Table 4. Division of dataset

| Dataset index | Method | Data size/KB | Train data size | Val data size | Test data size |
| --- | --- | --- | --- | --- | --- |
| 1 | DECIMER | 60 | 54000 | 0 | 6000 |
| 1 | DA-LSTM, DeepOCSR | 60 | 48000 | 6000 | 6000 |
| 2 | DECIMER | 100 | 90000 | 0 | 10000 |
| 2 | DA-LSTM, DeepOCSR | 100 | 80000 | 10000 | 10000 |
| 3 | DECIMER | 500 | 450000 | 0 | 50000 |
| 3 | DA-LSTM, DeepOCSR | 500 | 400000 | 50000 | 50000 |
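The sizes in Table 4 are consistent with an 80/10/10 train/validation/test split for DA-LSTM and DeepOCSR and a 90/0/10 split for DECIMER. The sketch below reproduces those counts with integer arithmetic. The percentages are inferred from the table rather than stated by the authors, and `split_sizes` is a hypothetical helper name.

```python
from typing import Tuple


def split_sizes(total: int, train_pct: int, val_pct: int) -> Tuple[int, int, int]:
    """Return (train, val, test) sample counts for integer percentages.

    The test set receives whatever remains after the train and validation
    counts are taken, so the three counts always sum to `total`.
    """
    if train_pct < 0 or val_pct < 0 or train_pct + val_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total - train - val
    return train, val, test


if __name__ == "__main__":
    # Assumed splits, inferred from Table 4.
    for total in (60_000, 100_000, 500_000):
        print(total, "80/10/10:", split_sizes(total, 80, 10))
        print(total, "90/0/10: ", split_sizes(total, 90, 0))
```

Integer percentages avoid floating-point rounding, so the counts match the table exactly for all three dataset sizes.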
Table 5. Performance comparison of three approaches on the test set

| Dataset index | Method | Epoch | Similarity | Validity |
| --- | --- | --- | --- | --- |
| 1 | DA-LSTM | 60 | 0.928 | 0.9925 |
| 1 | DECIMER | 600 | 0.387 | 0.8992 |
| 1 | DeepOCSR | 60 | 0.970 | 0.9977 |
| 2 | DA-LSTM | 60 | 0.960 | 0.9979 |
| 2 | DECIMER | 600 | 0.399 | 0.8913 |
| 2 | DeepOCSR | 60 | 0.984 | 0.9992 |
| 3 | DA-LSTM | 60 | 0.990 | 0.9989 |
| 3 | DECIMER | 600 | 0.470 | 0.9805 |
| 3 | DeepOCSR | 60 | 0.997 | 0.9996 |