Semantic Segmentation Method of Indoor Scene Based on Perceptual Attention and Lightweight Pyramid Fusion Network Model
Abstract: To address the complex backgrounds and variable illumination encountered in laboratory scene understanding, and to exploit the complementary nature of RGB and depth information in scene understanding, a perception attention and lightweight spatial pyramid fusion network model (Perception Attention and Lightweight Spatial Fusion Network, PLFNet) is proposed. In the model's perceptual attention module, the depth information provides multi-level assistance to the RGB information in a weighted manner, exploiting the different weights that the RGB and depth images carry in the network. In the lightweight spatial pyramid pooling module, cascaded atrous convolutions not only aggregate multi-scale features effectively but also use approximately 92% fewer parameters than the conventional spatial pyramid pooling module, allowing the RGB and depth information to be fused more fully. Experimental results on two public indoor-scene datasets show that the model outperforms the classic algorithms. Ablation experiments show that the mean intersection over union of the proposed model increases by 4.3% and 3.5%, respectively. Finally, tests on a more complex biological-laboratory dataset show that the model can effectively realize scene understanding of biological laboratories.
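The parameter savings claimed for the lightweight spatial pyramid pooling module can be illustrated with a rough count. The sketch below is not the paper's exact configuration; the channel widths, the number of branches, and the use of depthwise-separable atrous convolutions are assumptions for illustration only. It shows why replacing standard 3×3 convolutions in a multi-branch pyramid head with depthwise-separable ones cuts the parameter count by roughly an order of magnitude, the same ballpark as the reduction reported in the abstract.

```python
# Back-of-the-envelope parameter comparison for a pyramid pooling head.
# Channel sizes and branch count are hypothetical, not the paper's values.

def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k conv followed by a pointwise 1 x 1 conv (bias omitted).
    Note: dilation (atrous rate) does not change the parameter count."""
    return c_in * k * k + c_in * c_out

c_in, c_out = 2048, 256   # assumed backbone output / branch width
branches = 4              # e.g. atrous rates 1, 6, 12, 18

standard = branches * conv_params(c_in, c_out, 3)
lightweight = branches * separable_conv_params(c_in, c_out, 3)
reduction = 1 - lightweight / standard

print(f"standard: {standard:,}  lightweight: {lightweight:,}  "
      f"reduction: {reduction:.1%}")
```

Under these assumed sizes the separable variant removes close to nine-tenths of the head's parameters, which is why such a substitution is a common route to a "lightweight" pyramid module.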
Table 1. PA, MPA and MIoU of different algorithms on the NYU-Depth V2 dataset

Algorithm               PA/%   MPA/%   MIoU/%
Bayesian SegNet[28]     68.0   45.8    32.4
3DGNN[29]               -      55.7    43.1
Context[30]             70.0   53.6    40.6
LSD-GF[31]              71.9   60.7    45.9
CFN(VGG-16)[32]         -      -       41.7
Dilated FCN[33]         62.6   47.1    32.3
RefineNet-LW-152[34]    -      -       44.4
TD2-PSP50[36]           -      55.2    43.5
DACNN-DSPP[35]          61.7   38.7    28.0
PLFNet                  72.2   60.8    46.2
Table 2. PA, MPA and MIoU of different algorithms on the SUN RGB-D dataset
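The pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (MIoU) reported in Tables 1 to 3 all derive from a per-class confusion matrix. A minimal sketch in pure Python (the 3-class counts are made up for illustration):

```python
def segmentation_metrics(conf):
    """PA, MPA and MIoU from a square confusion matrix, where
    conf[i][j] counts pixels of ground-truth class i predicted as class j."""
    n = len(conf)
    tp = [conf[i][i] for i in range(n)]                 # correctly labelled pixels
    gt = [sum(conf[i]) for i in range(n)]               # ground-truth pixels per class
    pred = [sum(conf[i][j] for i in range(n)) for j in range(n)]  # predicted per class
    pa = sum(tp) / sum(gt)                              # overall pixel accuracy
    mpa = sum(tp[i] / gt[i] for i in range(n)) / n      # mean per-class accuracy
    miou = sum(tp[i] / (gt[i] + pred[i] - tp[i]) for i in range(n)) / n
    return pa, mpa, miou

# Toy 3-class confusion matrix (counts invented for illustration)
conf = [[50, 2, 3],
        [4, 40, 1],
        [2, 3, 45]]
pa, mpa, miou = segmentation_metrics(conf)
print(f"PA={pa:.3f}  MPA={mpa:.3f}  MIoU={miou:.3f}")
```

MIoU penalizes both missed pixels and false detections per class, which is why it is consistently the lowest of the three numbers in the tables.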
Table 3. Impact of the two modules on PA, MPA and MIoU

Model     RGB   Depth   PAM   LSPP   PA/%   MPA/%   MIoU/%
Model_0   √     ×       ×     ×      58.7   48.7    32.2
Model_1   √     √       ×     ×      64.3   54.9    38.4
Model_2   √     √       √     ×      69.3   57.3    42.7
PLFNet    √     √       √     √      72.2   60.8    46.2
Table 4. Intrinsic parameters of the RGB camera and depth camera
Camera   f_x          f_y          c_x         c_y
RGB      1045.91588   1046.39491   944.06530   542.89044
Depth    362.07166    362.06795    257.70148   206.12828
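The intrinsics in Table 4 are what relate a depth pixel to a 3D point when registering depth to RGB. A minimal sketch of the standard pinhole back-projection using the depth-camera values from Table 4; the sample pixel and depth are made up, and this is not necessarily the paper's exact registration pipeline:

```python
# Depth-camera intrinsics from Table 4 (pinhole model, no distortion)
fx, fy = 362.07166, 362.06795
cx, cy = 257.70148, 206.12828

def pixel_to_camera(u, v, z):
    """Back-project pixel (u, v) with depth z (metres) to a point
    (X, Y, Z) in the depth camera's coordinate frame."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z

# Hypothetical pixel near the image centre at 1.5 m depth
x, y, z = pixel_to_camera(320.0, 240.0, 1.5)
print(f"X={x:.3f} m  Y={y:.3f} m  Z={z:.3f} m")
```

The resulting point can then be transformed by the extrinsics between the two sensors and projected through the RGB intrinsics to align the depth map with the colour image.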
[1] GARCIA-GARCIA A, ORTS-ESCOLANO S, OPREA S, et al. A survey on deep learning techniques for image and video semantic segmentation[J]. Applied Soft Computing, 2018, 70: 41-65. doi: 10.1016/j.asoc.2018.05.018
[2] CORTINHAL T, TZELEPIS G, AKSOY E E. SalsaNext: Fast, uncertainty-aware semantic segmentation of LiDAR point clouds[C]//International Symposium on Visual Computing. Cham: Springer, 2020: 207-222.
[3] KANG M M, YANG H, GU X J, et al. Multi-band image semantic segmentation based on fusion path supervision[J]. Journal of East China University of Science and Technology (Natural Science Edition), 2021, 47(2): 233-240.
[4] TEICHMANN M, WEBER M, ZOELLNER M, et al. MultiNet: Real-time joint semantic reasoning for autonomous driving[C]//2018 IEEE Intelligent Vehicles Symposium (IV). USA: IEEE, 2018: 1013-1020.
[5] TESO-FZ-BETOÑO D, ZULUETA E, SÁNCHEZ-CHICA A, et al. Semantic segmentation to develop an indoor navigation system for an autonomous mobile robot[J]. Mathematics, 2020, 8(5): 855. doi: 10.3390/math8050855
[6] MIYAMOTO R, ADACHI M, NAKAMURA Y, et al. Accuracy improvement of semantic segmentation using appropriate datasets for robot navigation[C]//2019 6th International Conference on Control, Decision and Information Technologies (CoDIT). Paris: IEEE, 2019: 1610-1615.
[7] ALONSO I, RIAZUELO L, MURILLO A C. MiniNet: An efficient semantic segmentation ConvNet for real-time robotic applications[J]. IEEE Transactions on Robotics, 2020, 36(4): 1340-1347. doi: 10.1109/TRO.2020.2974099
[8] TANZI L, PIAZZOLLA P, PORPIGLIA F, et al. Real-time deep learning semantic segmentation during intra-operative surgery for 3D augmented reality assistance[J]. International Journal of Computer Assisted Radiology and Surgery, 2021, 16(9): 1435-1445. doi: 10.1007/s11548-021-02432-y
[9] ZHANG H, HAN B, IP C Y, et al. Slimmer: Accelerating 3D semantic segmentation for mobile augmented reality[C]//2020 IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS). USA: IEEE, 2020: 603-612.
[10] AN Z, XU X P, YANG J H, et al. Design of augmented reality head-up display system based on image semantic segmentation[J]. Acta Optica Sinica, 2018, 38(7): 0710004. doi: 10.3788/AOS201838.0710004
[11] VALENTIN J P C, SENGUPTA S, WARRELL J, et al. Mesh based semantic modelling for indoor and outdoor scenes[C]//IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2013: 2067-2074.
[12] SHAO L, HAN J, KOHLI P, et al. Computer Vision and Machine Learning with RGB-D Sensors[M]. UK: Springer International Publishing, 2014.
[13] ZHOU T, FAN D P, CHENG M M, et al. RGB-D salient object detection: A survey[J]. Computational Visual Media, 2021, 7: 37-69. doi: 10.1007/s41095-020-0199-z
[14] COUPRIE C, FARABET C, NAJMAN L, et al. Indoor semantic segmentation using depth information[C]//First International Conference on Learning Representations (ICLR 2013). [s.l.]: [s.n.], 2013: 1-8.
[15] GUPTA S, GIRSHICK R, ARBELÁEZ P, et al. Learning rich features from RGB-D images for object detection and segmentation[C]//European Conference on Computer Vision. Cham: Springer, 2014: 345-360.
[16] HE J, ZHANG C Q, LI X Z, et al. A survey of research on multimodal fusion technology for deep learning[J]. Computer Engineering, 2020, 46(5): 1-11.
[17] HAZIRBAS C, MA L, DOMOKOS C, et al. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture[C]//Asian Conference on Computer Vision. Cham: Springer, 2016: 213-228.
[18] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2016: 770-778.
[19] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2018: 7132-7141.
[20] CHEN L C, ZHU Y, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]//Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer, 2018: 801-818.
[21] RONNEBERGER O, FISCHER P, BROX T. U-Net: Convolutional networks for biomedical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241.
[22] SILBERMAN N, HOIEM D, KOHLI P, et al. Indoor segmentation and support inference from RGBD images[C]//European Conference on Computer Vision. Berlin, Heidelberg: Springer, 2012: 746-760.
[23] SONG S, LICHTENBERG S P, XIAO J. SUN RGB-D: A RGB-D scene understanding benchmark suite[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2015: 567-576.
[24] LONG J, SHELHAMER E, DARRELL T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2015: 3431-3440.
[25] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25: 1097-1105.
[26] GLOROT X, BENGIO Y. Understanding the difficulty of training deep feedforward neural networks[J]. Journal of Machine Learning Research, 2010, 9: 249-256.
[27] LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 318-327. doi: 10.1109/TPAMI.2018.2858826
[28] KENDALL A, BADRINARAYANAN V, CIPOLLA R. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding[EB/OL]. (2015-11-09)[2021-08-30]. https://arxiv.org/pdf/1511.02680.pdf.
[29] QI X, LIAO R, JIA J, et al. 3D graph neural networks for RGBD semantic segmentation[C]//Proceedings of the IEEE International Conference on Computer Vision. Italy: IEEE, 2017: 5199-5208.
[30] LIN G, SHEN C, VAN DEN HENGEL A, et al. Exploring context with deep structured models for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1352-1366.
[31] CHENG Y, CAI R, LI Z, et al. Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2017: 3029-3037.
[32] LIN D, CHEN G, COHEN-OR D, et al. Cascaded feature network for semantic segmentation of RGB-D images[C]//Proceedings of the IEEE International Conference on Computer Vision. Italy: IEEE, 2017: 1311-1319.
[33] KAMRAN S A, SABBIR A S. Efficient yet deep convolutional neural networks for semantic segmentation[C]//2018 International Symposium on Advanced Intelligent Informatics (SAIN). Indonesia: IEEE, 2018: 123-130.
[34] NEKRASOV V, SHEN C, REID I. Light-weight RefineNet for real-time semantic segmentation[EB/OL]. (2018-10-08)[2021-09-20]. https://arxiv.org/abs/1810.03272.
[35] YANG S J, QIU Z A, GAO X N, et al. RGBD semantic segmentation based on depth-sensitive spatial pyramid pooling[J]. Electronics Optics & Control, 2020, 27(12): 84-89. doi: 10.3969/j.issn.1671-637X.2020.12.018
[36] HU P, CABA F, WANG O, et al. Temporally distributed networks for fast video semantic segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. USA: IEEE, 2020: 8818-8827.
[37] BADRINARAYANAN V, KENDALL A, CIPOLLA R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. doi: 10.1109/TPAMI.2016.2644615
[38] WANG W, NEUMANN U. Depth-aware CNN for RGB-D segmentation[C]//Proceedings of the European Conference on Computer Vision (ECCV). UK: Springer, 2018: 135-150.
[39] LU H, DAI Y, SHEN C, et al. Index network[EB/OL]. (2019-08-11)[2021-09-20]. https://arxiv.org/abs/1908.09895.
[40] CUI W, ZHU M Y, YAN K, et al. Application of Kinect-based 3D reconstruction technology in 3D display[J]. Electronic Measurement Technology, 2017, 40(3): 113-116. doi: 10.3969/j.issn.1002-7300.2017.03.024
[41] ZHANG Z. Flexible camera calibration by viewing a plane from unknown orientations[C]//Proceedings of the Seventh IEEE International Conference on Computer Vision. Greece: IEEE, 1999: 666-673.
[42] LIU J, GONG X. Guided depth enhancement via anisotropic diffusion[C]//Pacific-Rim Conference on Multimedia. Cham: Springer, 2013: 408-417.