Coronary Heart Disease Risk Prediction under Class Imbalance Using XGBoost with SHAP-Based Interpretation

Siti Amiroch, Fitri Nur Laili, Awawin Mustana Rohmah, Dicka Yale Kardono

Abstract


Coronary heart disease (CHD) risk prediction is challenging because clinical data are heterogeneous and the response variable is imbalanced. This study develops an interpretable predictive framework for CHD risk that combines Extreme Gradient Boosting (XGBoost) with median imputation, IQR-based winsorization, standardization, the Synthetic Minority Over-sampling Technique (SMOTE), bootstrap-based uncertainty assessment, and Shapley Additive Explanations (SHAP). The learning problem is formulated as regularized empirical risk minimization, so the model is treated as a statistical estimator rather than merely an algorithmic classifier. To avoid information leakage, the train–test split is performed before any resampling, and SMOTE is applied only to the training data. The primary analysis is fixed a priori at an 80:20 stratified split; the 60:40 and 70:30 splits serve as sensitivity analyses rather than model-selection devices. In the primary analysis, the model attains an accuracy of 79.36%, a precision of 27.88%, a recall of 22.48%, an F1-score of 24.89%, and a ROC–AUC of 0.6502, with a 95% bootstrap confidence interval for ROC–AUC of [0.6017, 0.6981]. SHAP analysis in probability space identifies age, cigsPerDay, male, heartRate, and sysBP as the most influential predictors. These results show that the proposed framework is mathematically well-structured and interpretable, but that its out-of-sample discrimination on this dataset is moderate rather than high.
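
The bootstrap confidence interval for ROC–AUC reported above can be illustrated with a minimal sketch. The abstract does not state which bootstrap variant the authors use, so the percentile interval below is an assumption, and the function names (`roc_auc`, `bootstrap_auc_ci`), the number of resamples, and the synthetic labels and scores are all hypothetical. The sketch uses only the Python standard library and is not the authors' implementation.

```python
import random


def roc_auc(y_true, y_score):
    """ROC-AUC from the Mann-Whitney rank statistic; tied scores share an average rank."""
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC-AUC on a fixed test set (assumed variant)."""
    roc_auc(y_true, y_score)  # fail early if the test set has a single class
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        labels = [y_true[i] for i in idx]
        if 0 < sum(labels) < n:  # skip resamples that contain only one class
            stats.append(roc_auc(labels, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[max(0, round(alpha / 2 * n_boot))]
    upper = stats[min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)]
    return lower, upper


# Synthetic, imbalanced toy data (hypothetical; not the study's dataset).
_rng = random.Random(42)
y_true = [1 if _rng.random() < 0.15 else 0 for _ in range(400)]
y_score = [_rng.gauss(0.5 * y, 1.0) for y in y_true]

auc = roc_auc(y_true, y_score)
ci_lower, ci_upper = bootstrap_auc_ci(y_true, y_score)
print(f"AUC = {auc:.4f}, 95% percentile bootstrap CI = [{ci_lower:.4f}, {ci_upper:.4f}]")
```

Only the held-out test predictions are resampled here, so the interval reflects test-set sampling variability and not variability from refitting the model.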


Keywords


Coronary heart disease; Class imbalance; Extreme Gradient Boosting; SHAP; Bootstrap confidence interval






DOI: https://doi.org/10.18860/cauchy.v11i1.37938



Copyright (c) 2026 Siti Amiroch

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Editorial Office
Mathematics Department,
Universitas Islam Negeri Maulana Malik Ibrahim Malang
Gajayana Street 50 Malang, East Java, Indonesia 65144
Fax: (+62) 341 558933
e-mail: cauchy@uin-malang.ac.id

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.