Weight adaptive concept drift detection method based on McDiarmid boundary
-
摘要: 针对概念漂移主动检测方法检测延迟高,易出现漏检、误报的问题,提出了一种基于McDiarmid边界的自适应加权概念漂移检测方法。引入衰减函数对分类结果加权,赋予旧数据更低权值,提升新数据的影响力。利用McDiarmid不等式得到加权分类正确率的置信边界,在检测到分类正确率下降超过置信边界时调节衰减因子,实现权值的动态改变。实验主要与DDM、RDDM、HDDM、FHDDM和ADWIN算法对比,结果表明,该算法具有最低的误报率和漏检率,且平均检测延迟和正确率在6种算法中排前2。
-
关键词:
- 概念漂移 /
- 主动检测方法 /
- 数据流分类 /
- McDiarmid边界 /
- 衰减算法
Abstract: An adaptive weighted concept drift detection method based on McDiarmid boundary (WMDDM) is proposed to solve the problems of high detection delay, missed detection and false alarm in the active detection method of concept drift. WMDDM algorithm has a weight adjustment mechanism. The adaptive attenuation algorithm is introduced as a weight function to give the old data lower weights and dynamically adjust according to the changes in the data stream in order to adapt to the concept drift faster. The warning level and drift level of the weighted classification accuracy are obtained by McDiarmid's inequality. When it is detected that the weighted classification accuracy rate drops outside the drift level, the detection result is fed back to the classifier. When it is detected that the weighted classification accuracy rate drops beyond the warning level, the detector adapts to the change of the data flow through the triggered weight adjustment mechanism. The experiment uses 4 artificial data sets (two mutation drift data sets, two gradual drift data sets) and 1 real data set, which are mainly compared with Fast Hoeffding Drift Detection Method (FHDDM), Drift Detection Method based on the Hoeffding’s inequality (HDDM) and other algorithms. Experimental results show that the WMDDM algorithm has the lowest false alarm rate and missed detection rate, and the average detection delay and accuracy rate rank the top 2 among the six algorithms. Finally, WMDDM algorithm is used to classify real data sets and compared with FHDDM algorithm. The results show that WMDDM algorithm has a higher classification accuracy rate than FHDDM. Therefore, the WMDDM algorithm is suitable for abrupt and gradual conceptual drift, and has strong robustness. -
表 1 数据集特征表
Table 1. Data set feature table
Data set Instances Attributes Category Number Of srifts Noise rate Concept length Drift type SINE 100 000 2 2 4 10% 20 000 Sudden MIXED 100 000 4 2 4 10% 20 000 Sudden LED 100 000 24 10 3 10% 25 000 Gradual CIRCLE 100 000 2 2 3 10% 25 000 Gradual Electricity 45 312 7 2 Unknown Unknown Unknown Unknown 表 2 实验1结果
Table 2. Results of experiment 1
Detector FPR/% FNR/% ADOD Correct rate/% WMDDM 0 0 37.25 85.34 WMDDM#1 80 0 44 85.32 FHDDM 0 0 44.00 85.32 表 3 实验2结果
Table 3. Results of experiment 2
Detector FPR/% FNR/% ADOD Correct rate/% WMDDM 0 0 69.67 86.25 WMDDM#1 81.25 0 49 86.26 FHDDM 0 0 74 86.24 表 4 在SINE数据集上的实验结果
Table 4. Experimental results on the SINE dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 4 1 0 20.0 0 47.00 86.36 FHDDM 4 1 0 20.0 0 49.75 85.37 HDDM 4 1 0 20.0 0 34.50 86.39 DDM 4 0 0 0 0 150.50 86.05 RDDM 4 3 0 42.9 0 100.25 86.08 ADWIN 4 27 0 87.1 0 64.25 84.09 NB WMDDM 4 0 0 0 0 37.25 85.34 FHDDM 4 0 0 0 0 44.00 85.32 HDDM 4 0 0 0 0 33.25 85.36 DDM 4 0 0 0 0 152.75 85.06 RDDM 4 0 0 0 0 101.00 85.17 ADWIN 4 6 0 60.0 0 67.00 84.72 表 5 在MIXED数据集上的实验结果
Table 5. Experimental results on the MIXED dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 4 2 0 33.3 0 41.00 85.98 FHDDM 4 7 0 63.6 0 41.00 85.24 HDDM 4 5 8 59.3 0 35.75 85.38 DDM 4 11 0 73.3 0 130.75 84.27 RDDM 4 9 0 69.2 0 88.25 85.91 ADWIN 4 9 0 69.2 0 76.00 84.82 NB WMDDM 4 0 0 0 0 43.50 86.62 FHDDM 4 0 0 0 0 45.50 86.62 HDDM 4 0 0 0 0 33.75 86.60 DDM 4 0 0 0 0 149.00 86.26 RDDM 4 0 0 0 0 103.00 86.41 ADWIN 4 6 0 60.0 0 67.50 86.03 表 6 在LED数据集上的实验结果
Table 6. Experimental results on the LED dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 3 0 0 0 0 237.67 89.68 FHDDM 3 0 0 0 0 256.00 89.64 HDDM 2 1 0 33.3 0 367.67 89.61 DDM 0 3 3 100.0 100.0 400.0 89.53 RDDM 2 2 1 50.0 33.3 380.67 89.59 ADWIN 0 266 3 100.0 100.0 400.0 87.50 NB WMDDM 3 0 0 0 0 237,67 89.68 FHDDM 3 0 0 0 0 256.00 89.67 HDDM 2 1 1 33.3 33.3 367.67 89.61 DDM 0 3 3 100.0 100.0 400.0 89.53 RDDM 2 2 1 50.0 33.3 380.67 89.59 ADWIN 0 266 3 100.0 100.0 400.0 87.50 表 7 在CIRCLE数据集上的实验结果
Table 7. Experimental results on the CIRCLE dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 3 0 0 0 0 58.33 87.18 FHDDM 3 0 0 0 0 61.33 87.16 HDDM 3 0 0 0 0 44.33 87.19 DDM 2 1 1 33.3 33.3 332.67 86.97 RDDM 3 1 0 25.0 0 246.33 87.03 ADWIN 3 4 0 57.1 0 93.67 86.74 NB WMDDM 3 0 0 0 0 69.67 86.25 FHDDM 3 0 0 0 0 74.00 86.24 HDDM 3 0 0 0 0 41.33 86.26 DDM 2 1 1 33.3 33.3 325.33 86.06 RDDM 3 0 0 0 0 225.00 86.17 ADWIN 3 2 0 40.0 0 113.67 86.20 -
[1] BARROS R S M, SANTOS S G T C. A large-scale comparison of concept drift detectors[J]. Information Sciences, 2018, 451(4): 348-370. [2] Yan MMW. Accurate detecting concept drift in evolving data streams[J]. ICT Express, 2020, 6(4): 332-338. doi: 10.1016/j.icte.2020.05.011 [3] 任思琪. 基于概念漂移的数据流集成分类算法研究[D]. 湖南: 湖南大学, 2018. [4] YU Hang, LIU Tianyu, LU Jie, Guangquan Zhang. Automatic Learning to Detect Concept Drift[EB/OL]. arxiv. org. 2021-04-04 [2021-11-15]. arxiv. org/abs/2105.01419. [5] 周志华. 机器学习[M]. 北京: 清华大学出版社, 2016. [6] GAMA J, MEDAS P, CASTILLO G, et al. Learning with drift detection[C]//Brazilian symposium on artificial intelligence. Berlin Heidelberg: Springer, 2004: 286-295. [7] BARROS R, CABRAL D, GONCALVES P M, et al. RDDM: Reactive drift detection method[J]. Expert Systems with Applications, 2017, 90(8): 344-355. [8] FRIAS-BLANCO I, DEL CAMPO-ÁVILA J, RAMOS-JIMENEZ G, et al. Online and non-parametric drift detection methods based on Hoeffding’s bounds[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(3): 810-823. [9] PESARANGHADER A, VIKTOR H L. Fast Hoeffding drift detection method for evolving data streams[C]//Joint European Conference on Machine Learning & Knowledge Discovery in Databases. Cham: Springer, 2016: 286-295. [10] 李静林, 袁泉. 流数据分析技术[M]. 北京: 北京邮电大学出版社, 2020. [11] MANAPRAGADA C, WEBB G I, SALEHI M. Extremely fast decision tree[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London: ACM. 2018: 1953-1962. [12] 邬东辉, 顾幸生. 基于自适应稀疏表示和保局投影的工业故障检测[J]. 华东理工大学学报(自然科学版), 2021, 47(4): 455-464. [13] 赵刚, 王梦灵, 薛斌强, 等. 基于自适应事件触发的交通网络预测控制[J]. 华东理工大学学报(自然科学版), 2021, 47(03): 316-322. [14] RUTKOWSKI, LESZEK, PIETRUCZUK, et al. Decision trees for mining data streams based on the McDiarmid's bound[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(6): 1272-1279. doi: 10.1109/TKDE.2012.66 [15] RIO E. On McDiarmid's concentration inequality[J]. Electronic Communications in Probability, 2013, 18: 1-11. [16] BIFET A, GAVALDA R. Learning from time-changing data with adaptive windowing[C]//Proceedings of the 2007 SIAM international conference on data mining. Minneapolis: SIAM, 2007: 443-448. [17] Srimani P K, Patil M. M, et al. Mining data streams with concept drift in massive online analysis frame work[J]. Procedia Computer Science, 2015, 15(3): 133-142. -