Adaptive Weighted Concept Drift Detection Method Based on McDiarmid Boundary
-
摘要: 针对概念漂移主动检测方法检测延迟高,易出现漏检、误报的问题,提出了一种基于McDiarmid边界的自适应加权概念漂移检测方法。引入衰减函数对分类结果加权,赋予旧数据更低权值,提升新数据的影响力。利用McDiarmid不等式得到加权分类正确率的置信边界,在检测到分类正确率下降超过置信边界时调节衰减因子时,实现权值的动态改变。实验主要与DDM(Drift Detection Method)、RDDM(Reactive Drift Detection Method)、HDDM(Drift Detection Method based on the Hoeffding's inequality)、FHDDM(Fast Hoeffding Drift Detection Method)和窗口(ADWIN)算法对比,结果表明,该算法具有最低的误报率和漏检率,且平均检测延迟和正确率在6种算法中排前2。
-
关键词:
- 概念漂移 /
- 主动检测方法 /
- 数据流分类 /
- McDiarmid边界 /
- 衰减算法
Abstract: Aiming at the shortcomings that the active detection method of concept drift is subject to high detection delay, missed detection, and false alarm, this paper proposes an adaptive weighted concept drift detection method based on McDiarmid boundary (WMDDM). The proposed algorithm has a weighted adjustment mechanism. The adaptive attenuation algorithm is introduced as a weight function to give old data lower weights that are dynamically adjusted to adapt to the concept drift faster according to the changes in the data stream. McDiarmid's inequality is utilized to obtain the warning level and drift level of the weighted classification accuracy. While the weighted classification accuracy rate is detected to drop outside the drift level, the detection result is fed back to the classifier. While it is detected that the weighted classification accuracy rate drops beyond the warning level, the detector adapts to the change of the data flow by the triggered weight adjustment mechanism. Finally, the experiments are made on 4 artificial data sets and 1 real data set by the comparison with Fast Hoeffding Drift Detection Method (FHDDM), Drift Detection Method based on Hoeffding’s inequality (HDDM) and other algorithms. It is shown via experimental results that the proposed WMDDM algorithm has the lowest false alarm rate and missed detection rate, and its average detection delay and accuracy rate rank second among six algorithms. In addition, the proposed WMDDM algorithm is also used to classify real data sets by comparing with FHDDM algorithm, from which it is shown that WMDDM algorithm has a higher classification accuracy rate than FHDDM. Therefore, the WMDDM algorithm is more suitable for abrupt and gradual conceptual drift, and has strong robustness. -
表 1 数据集特征表
Table 1. Data set feature table
Data set Instances Attributes Category Number of drifts Noise rate Concept length Drift type SINE 100 000 2 2 4 10% 20 000 Sudden MIXED 100 000 4 2 4 10% 20 000 Sudden LED 100 000 24 10 3 10% 25 000 Gradual CIRCLE 100 000 2 2 3 10% 25 000 Gradual Electricity 45 312 7 2 Unknown Unknown Unknown Unknown 表 2 实验1结果
Table 2. Results of experiment 1
Detector FPR/% FNR/% ADOD Correct rate/% WMDDM 0 0 37.25 85.34 WMDDM#1 80 0 44.00 85.32 FHDDM 0 0 44.00 85.32 表 3 实验2结果
Table 3. Results of experiment 2
Detector FPR/% FNR/% ADOD Correct rate/% WMDDM 0 0 69.67 86.25 WMDDM#1 81.25 0 49.00 86.26 FHDDM 0 0 74.00 86.24 表 4 在SINE数据集上的实验结果
Table 4. Experimental results on the SINE dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 4 1 0 20.0 0 47.00 86.36 FHDDM 4 1 0 20.0 0 49.75 85.37 HDDM 4 1 0 20.0 0 34.50 86.39 DDM 4 0 0 0 0 150.50 86.05 RDDM 4 3 0 42.9 0 100.25 86.08 ADWIN 4 27 0 87.1 0 64.25 84.09 NB WMDDM 4 0 0 0 0 37.25 85.34 FHDDM 4 0 0 0 0 44.00 85.32 HDDM 4 0 0 0 0 33.25 85.36 DDM 4 0 0 0 0 152.75 85.06 RDDM 4 0 0 0 0 101.00 85.17 ADWIN 4 6 0 60.0 0 67.00 84.72 表 5 MIXED数据集上的实验结果
Table 5. Experimental results on the MIXED dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 4 2 0 33.3 0 41.00 85.98 FHDDM 4 7 0 63.6 0 41.00 85.24 HDDM 4 5 8 59.3 0 35.75 85.38 DDM 4 11 0 73.3 0 130.75 84.27 RDDM 4 9 0 69.2 0 88.25 85.91 ADWIN 4 9 0 69.2 0 76.00 84.82 NB WMDDM 4 0 0 0 0 43.50 86.62 FHDDM 4 0 0 0 0 45.50 86.62 HDDM 4 0 0 0 0 33.75 86.60 DDM 4 0 0 0 0 149.00 86.26 RDDM 4 0 0 0 0 103.00 86.41 ADWIN 4 6 0 60.0 0 67.50 86.03 表 6 LED数据集上的实验结果
Table 6. Experimental results on the LED dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 3 0 0 0 0 237.67 89.68 FHDDM 3 0 0 0 0 256.00 89.64 HDDM 2 1 0 33.3 0 367.67 89.61 DDM 0 3 3 100.0 100.0 400.0 89.53 RDDM 2 2 1 50.0 33.3 380.67 89.59 ADWIN 0 266 3 100.0 100.0 400.0 87.50 NB WMDDM 3 0 0 0 0 237.67 89.68 FHDDM 3 0 0 0 0 256.00 89.67 HDDM 2 1 1 33.3 33.3 367.67 89.61 DDM 0 3 3 100.0 100.0 400.0 89.53 RDDM 2 2 1 50.0 33.3 380.67 89.59 ADWIN 0 266 3 100.0 100.0 400.0 87.50 表 7 CIRCLE数据集上的实验结果
Table 7. Experimental results on the CIRCLE dataset
Classifier Detector TP FP FN FPR/% FNR/% ADOD Correct rate/% HT WMDDM 3 0 0 0 0 58.33 87.18 FHDDM 3 0 0 0 0 61.33 87.16 HDDM 3 0 0 0 0 44.33 87.19 DDM 2 1 1 33.3 33.3 332.67 86.97 RDDM 3 1 0 25.0 0 246.33 87.03 ADWIN 3 4 0 57.1 0 93.67 86.74 NB WMDDM 3 0 0 0 0 69.67 86.25 FHDDM 3 0 0 0 0 74.00 86.24 HDDM 3 0 0 0 0 41.33 86.26 DDM 2 1 1 33.3 33.3 325.33 86.06 RDDM 3 0 0 0 0 225.00 86.17 ADWIN 3 2 0 40.0 0 113.67 86.20 -
[1] BARROS R S M, SANTOS S G T C. A large-scale comparison of concept drift detectors[J]. Information Sciences, 2018, 451(4): 348-370. [2] YAN M. Accurate detecting concept drift in evolving data streams[J]. ICT Express, 2020, 6(4): 332-338. doi: 10.1016/j.icte.2020.05.011 [3] 任思琪. 基于概念漂移的数据流集成分类算法研究[D]. 长沙: 湖南大学, 2018. [4] YU H, LIU T Y, LU J, et al. Automatic Learning to Detect Concept Drift[EB/OL]. (2021-04-04) [2021-11-15]. https://arxiv.org/abs/2105.01419. [5] 周志华. 机器学习[M]. 北京: 清华大学出版社, 2016. [6] GAMA J, MEDAS P, CASTILLO G, et al. Learning with drift detection[C]//Brazilian Symposium on Artificial Intelligence. Berlin, Heidelberg: Springer, 2004: 286-295. [7] BARROS R, CABRAL D, GONCALVES P M, et al. RDDM: Reactive drift detection method[J]. Expert Systems with Applications, 2017, 90(8): 344-355. [8] FRIAS-BLANCO I, DEL CAMPO-ÁVILA J, RAMOS-JIMENEZ G, et al. Online and non-parametric drift detection methods based on Hoeffding’s bounds[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 27(3): 810-823. [9] PESARANGHADER A, VIKTOR H L. Fast Hoeffding drift detection method for evolving data streams[C]//Joint European Conference on Machine Learning & Knowledge Discovery in Databases. Cham, Switzerland: Springer, 2016: 286-295. [10] 李静林, 袁泉. 流数据分析技术[M]. 北京: 北京邮电大学出版社, 2020. [11] MANAPRAGADA C, WEBB G I, SALEHI M. Extremely fast decision tree[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. London: ACM. 2018: 1953-1962. [12] 邬东辉, 顾幸生. 基于自适应稀疏表示和保局投影的工业故障检测[J]. 华东理工大学学报(自然科学版), 2021, 47(4): 455-464. [13] 赵刚, 王梦灵, 薛斌强, 等. 基于自适应事件触发的交通网络预测控制[J]. 华东理工大学学报(自然科学版), 2021, 47(3): 316-322. [14] RUTKOWSKI, LESZEK, PIETRUCZUK, et al. Decision trees for mining data streams based on the McDiarmid's bound[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(6): 1272-1279. doi: 10.1109/TKDE.2012.66 [15] RIO E. On McDiarmid's concentration inequality[J]. Electronic Communications in Probability, 2013, 18: 1-11. [16] BIFET A, GAVALDA R. Learning from time-changing data with adaptive windowing[C]//Proceedings of the 2007 SIAM international conference on data mining. Minneapolis: SIAM, 2007: 443-448. [17] SRIMANI P K, PATIL M M. Mining data streams with concept drift in massive online analysis frame work[J]. Procedia Computer Science, 2015, 15(3): 133-142. -