Coronary Heart Disease Risk Prediction under Class Imbalance Using XGBoost with SHAP-Based Interpretation
Abstract
Coronary heart disease (CHD) risk prediction is challenging because clinical data are heterogeneous and the response variable is imbalanced. This study develops an interpretable predictive framework for CHD risk using Extreme Gradient Boosting (XGBoost), median imputation, IQR-based winsorization, standardization, the Synthetic Minority Over-sampling Technique (SMOTE), bootstrap-based uncertainty assessment, and Shapley Additive Explanations (SHAP). The learning problem is formulated within a regularized empirical risk minimization framework, so the model is viewed as a statistical estimator rather than merely an algorithmic classifier. To avoid information leakage, train–test splitting is performed before any resampling, and SMOTE is applied only to the training data. The primary analysis is fixed a priori at an 80:20 stratified split, whereas 60:40 and 70:30 splits are treated as sensitivity analyses rather than model-selection devices. In the primary analysis, the model attains accuracy of 79.36%, precision of 27.88%, recall of 22.48%, F1-score of 24.89%, and ROC–AUC of 0.6502. The 95% bootstrap confidence interval for ROC–AUC is [0.6017, 0.6981]. SHAP analysis in probability space identifies age, cigsPerDay, male, heartRate, and sysBP as the most influential predictors. These results show that the proposed framework is mathematically well-structured and interpretable, but that its out-of-sample discrimination on this dataset is moderate rather than high.
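The bootstrap interval for ROC-AUC reported above can be illustrated with a minimal sketch. The abstract does not say how the bootstrap was carried out, so this sketch assumes a percentile bootstrap that resamples held-out test predictions with replacement. The function names `roc_auc`, `bootstrap_auc_ci`, and `f1_from`, and the toy labels and scores in the usage lines, are hypothetical and are not taken from the paper.

```python
import random


def roc_auc(y_true, scores):
    """ROC-AUC from the Mann-Whitney rank statistic; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs both classes present")
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROC-AUC (assumed scheme: resample test cases)."""
    n = len(y_true)
    if not 0 < sum(y_true) < n:
        raise ValueError("ROC-AUC needs both classes present")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy, made-up test-set labels and scores (1 = CHD event), imbalanced on purpose.
toy_rng = random.Random(42)
y_test = [1] * 15 + [0] * 85
p_test = [toy_rng.random() * 0.6 + 0.2 * y for y in y_test]

auc = roc_auc(y_test, p_test)
ci_low, ci_high = bootstrap_auc_ci(y_test, p_test, n_boot=500)
print(f"AUC = {auc:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

The helper `f1_from` also gives a quick consistency check on the reported metrics: with precision 0.2788 and recall 0.2248, the harmonic mean is approximately 0.2489, which agrees with the stated F1-score of 24.89%. In a leak-free workflow of the kind the abstract describes, the labels and scores passed to these functions would come only from the untouched test split, never from SMOTE-augmented training data.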
DOI: https://doi.org/10.18860/cauchy.v11i1.37938
Copyright (c) 2026 Siti Amiroch

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.