Comparison Of Methods For Handling Imbalanced Datasets In Improving Classification Algorithm Performance

Dyah Setyo Rini; Winalia Agwil; Dian Agustina; Ahmad Famuji

doi:10.18860/cauchy.v11i1.35780


Abstract

Big data refers to data with a large number of observations and dimensions. Processing such data raises several problems, one of which is class imbalance, a common challenge in classification modeling. When the data are imbalanced, class predictions tend to be accurate for the majority class and inaccurate for the minority class. The solutions that have been extensively researched fall into three approaches: data-level, algorithm-level, and ensemble. The data-level methods considered are SMOTE, undersampling, and oversampling; the algorithm-level method is NWKNN; and the ensemble methods are UnderBagging, RUSBoosting, SMOTEBoost, and SMOTEBagging. The goal of this study is to determine the best method for each of three imbalance cases: mild, moderate, and extreme. A simulation study was conducted for each case to evaluate the accuracy of each method. Based on the AUC value, SMOTEBagging is the best method for mild imbalance, with an AUC of 0.9581, and also for moderate imbalance, with an AUC of 0.9033. For extreme imbalance, UnderBagging provides the best performance.
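As a rough illustration of the data-level idea and of the AUC criterion mentioned above, the following minimal Python sketch shows random oversampling, random undersampling, and a rank-based AUC. It is not the authors' implementation and does not reproduce their simulation; the function names, the toy data, and the 90/10 class split are assumptions made only for this example.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate rows of smaller classes at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 90 majority rows (label 0), 10 minority rows (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print("oversampled:", Counter(y_over))
print("undersampled:", Counter(y_under))
print("AUC on four toy scores:", auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

SMOTE differs from the random oversampling shown here in that it synthesizes new minority points by interpolating between neighbours rather than duplicating rows, and the ensemble methods named in the abstract combine such resampling with bagging or boosting; neither is covered by this sketch.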

Keywords

Imbalanced; SMOTE; NWKNN; Ensemble; AUC





This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Editorial Office
Mathematics Department,
Maulana Malik Ibrahim State Islamic University of Malang
Gajayana Street 50 Malang, East Java, Indonesia 65144
e-mail: cauchy@uin-malang.ac.id

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
