Action Detection Based on Region Spatiotemporal Two-in-One Network
Abstract: With the explosive growth of video data, intelligent video analysis has become a research hotspot in both academia and industry. Video action detection builds on action recognition to further obtain the spatial location and temporal extent of actions. Combining an RGB spatial stream with an optical-flow temporal stream, this paper proposes a region spatiotemporal two-in-one action detection network based on the single shot multibox detector (SSD). The network improves the non-local spatiotemporal module: a pixel filter designed on the optical flow extracts key motion regions, and the correlation computation is then performed only on those selected regions in the spatial stream. The proposed module captures long-range dependencies of actions effectively, alleviates the high computational cost of the standard non-local module, and suppresses interference from video background noise. Experiments on the benchmark dataset UCF101-24 show that the proposed region spatiotemporal two-in-one network achieves better detection performance, with a video-level mean average precision (video_mAP) of 43.17% at an IoU threshold of 0.5.
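The mechanism described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the flow-magnitude ranking criterion, and the `keep_ratio` parameter are assumptions; the idea shown is only that the non-local correlation is computed against the key motion positions selected from the optical flow rather than against all positions.

```python
import numpy as np

def pixel_filter(flow, keep_ratio=0.1):
    """Select key motion positions from an optical-flow field.

    flow: (H, W, 2) optical-flow field.
    Returns flat indices of the pixels with the largest motion magnitude;
    keep_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    mag = np.linalg.norm(flow, axis=-1).ravel()   # per-pixel motion magnitude
    k = max(1, int(keep_ratio * mag.size))
    return np.argsort(mag)[-k:]                   # indices of key motion regions

def region_nonlocal(x, key_idx):
    """Non-local (self-attention) response restricted to selected positions.

    x: (H, W, C) spatial-stream feature map.
    key_idx: indices from pixel_filter; similarities are computed only
    against these positions, shrinking the attention matrix from
    (HW, HW) to (HW, k).
    """
    H, W, C = x.shape
    q = x.reshape(-1, C)                          # queries: all positions
    kv = q[key_idx]                               # keys/values: key regions only
    attn = q @ kv.T / np.sqrt(C)                  # (HW, k) similarity scores
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over key regions
    y = attn @ kv                                 # aggregated long-range response
    return (q + y).reshape(H, W, C)               # residual connection
```

Restricting the keys and values to the filtered positions is what reduces the quadratic cost of the standard non-local block while ignoring static background pixels.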
Key words:
- video action detection
- SSD
- two-stream network
- non-local module
- UCF101-24
Table 1. Per-class frame_AP comparison on UCF101-24 at an IoU threshold of 0.5
| Class | frame_AP/% (SSD) | frame_AP/% (this paper) | Δ(diff) |
| --- | --- | --- | --- |
| Basketball | 28.91 | 32.37 | 3.46 |
| Basketball_dunk | 49.90 | 49.61 | −0.29 |
| Biking | 78.36 | 78.27 | −0.09 |
| Cliff_diving | 50.19 | 57.95 | 7.76 |
| Cricket_bowling | 27.68 | 31.44 | 3.76 |
| Diving | 78.97 | 80.70 | 1.73 |
| Fencing | 87.95 | 88.16 | 0.21 |
| Floor_gymnastics | 83.38 | 85.44 | 2.06 |
| Golf_swing | 43.44 | 44.83 | 1.39 |
| Horse_riding | 88.57 | 88.41 | −0.16 |
| Ice_dancing | 71.61 | 72.40 | 0.79 |
| Long_jump | 56.77 | 59.44 | 2.67 |
| Pole_vault | 55.04 | 56.72 | 1.68 |
| Rope_climbing | 81.36 | 82.12 | 0.76 |
| Salsa_spin | 69.26 | 69.01 | −0.25 |
| Skate_boarding | 68.63 | 71.71 | 3.08 |
| Skiing | 68.09 | 77.73 | 9.64 |
| Skijet | 84.44 | 87.45 | 3.01 |
| Soccer_juggling | 79.97 | 80.14 | 0.17 |
| Surfing | 82.88 | 86.50 | 3.62 |
| Tennis_swing | 37.26 | 37.18 | −0.08 |
| Trampoline_jumping | 60.63 | 60.54 | −0.09 |
| Volleyball_spiking | 35.51 | 36.50 | 0.99 |
| Walking_with_dog | 74.26 | 74.44 | 0.18 |
| frame_mAP | 64.29 | 66.21 | 1.92 |
Table 2. Comparison of video_mAP of different algorithms on UCF101-24
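For reference, video_mAP scores a detected action tube as correct when its spatiotemporal IoU with a ground-truth tube exceeds the threshold (0.5 here). The sketch below shows one common definition of that overlap, temporal overlap multiplied by the mean spatial IoU over shared frames; this is a standard formulation assumed for illustration, not code from the paper, and the tube representation (a dict mapping frame index to a box) is hypothetical.

```python
def box_iou(a, b):
    """Spatial IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def tube_iou(tube_a, tube_b):
    """Spatiotemporal IoU of two action tubes.

    Each tube maps frame index -> box. The score is the temporal IoU of the
    frame sets multiplied by the mean spatial IoU over the shared frames.
    """
    shared = set(tube_a) & set(tube_b)
    if not shared:
        return 0.0
    t_iou = len(shared) / len(set(tube_a) | set(tube_b))
    s_iou = sum(box_iou(tube_a[f], tube_b[f]) for f in shared) / len(shared)
    return t_iou * s_iou
```

A detection would then be matched greedily to ground-truth tubes with `tube_iou(...) >= 0.5`, and average precision is computed per class from the resulting true/false positives.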