Optimization of Imbalanced Class Using CTGAN and RSCV-Ensemble Learning for Clean Water Quality Classification

Favian Sis Bagus Febrianto, Umu Sa'adah, Imam Nurhadi Purwanto

Abstract


Class imbalance is a common challenge in classification learning and often leads to poor model performance in identifying minority-class observations. This study evaluates the performance of an Ensemble Learning model optimized with Randomized Search CV (RSCV) for classifying clean water quality under an imbalanced class distribution. To address the imbalance, the CTGAN technique is applied and compared with SMOTENC, another oversampling method that can also handle mixed (numerical and categorical) data. Model performance is assessed by applying a two-sided independent t-test to the difference in mean accuracy, precision, recall, F1-score, and AUC-ROC between two models. The best performance on the testing data is achieved by the Ensemble Learning model with RSCV combined with CTGAN oversampling, with a mean accuracy of 92.8%, mean precision of 90.23%, mean recall of 90.98%, mean F1-score of 90.82%, and mean AUC-ROC of 96.80% over 25 data-split experiments. Almost all differences in mean accuracy were statistically significant under the two-sided independent t-test. Overall, the results indicate that CTGAN outperforms SMOTENC and that RSCV optimization improves performance compared to models without optimization. Beyond accuracy, all differences in mean recall were also significant, further supporting the superiority of the CTGAN and RSCV-enhanced models.
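The comparison described in the abstract, a two-sided independent t-test on a metric averaged over 25 data splits, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the per-split accuracies are randomly generated placeholders rather than results from the study, the helper name pooled_t_statistic is hypothetical, and the pooled-variance (Student) form of the test is assumed because the abstract does not say which variant was used.

```python
import math
import random
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes equal population variances,
    which is an assumption of this sketch, not a detail from the study.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_err, df


# Placeholder per-split accuracies for two hypothetical models
# (25 splits each); these are synthetic values, not the paper's data.
rng = random.Random(0)
acc_model_a = [rng.gauss(0.92, 0.01) for _ in range(25)]
acc_model_b = [rng.gauss(0.90, 0.01) for _ in range(25)]

t_stat, df = pooled_t_statistic(acc_model_a, acc_model_b)

# Two-sided critical value of the t distribution at alpha = 0.05, df = 48.
T_CRIT_DF48 = 2.0106
significant = abs(t_stat) > T_CRIT_DF48

print(f"t = {t_stat:.3f}, df = {df}, significant at 5%: {significant}")
```

The same function can be applied to each metric in turn (precision, recall, F1-score, AUC-ROC) by swapping in the corresponding per-split values. With a statistics library available, a p-value could be reported instead of a comparison against a fixed critical value.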


Keywords


CTGAN; Random Search; Environment; Ensemble Learning; Imbalanced Class.





DOI: https://doi.org/10.18860/cauchy.v11i2.42071



Copyright (c) 2026 Favian Sis Bagus Febrianto, Umu Sa'adah, Imam Nurhadi Purwanto

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.