Cross-Dataset Evaluation of Support Vector Machines: A Reproducible, Calibration-Aware Baseline for Tabular Classification

Nurus Syafi'ah, Mohammad Jamhuri, Farahnas Imaniyah Pranata, Ari Kusumastuti, Juhari Juhari, Usman Pagalay, Muhammad Khudzaifah

Abstract


Support Vector Machines (SVMs) remain competitive for small and medium-sized tabular classification problems, yet reported results on benchmark datasets vary widely due to inconsistent preprocessing, validation, and probability calibration. This paper presents a calibration-aware, cross-dataset benchmark that evaluates SVMs against classical baselines (Logistic Regression, Decision Tree, and Random Forest) under leakage-safe pipelines and statistically principled protocols. Using three representative binary datasets (Titanic survival, Pima Indians Diabetes, and UCI Heart Disease), we standardize imputation, encoding, scaling, and nested cross-validation to ensure comparability. Performance is assessed not only with discrimination metrics (accuracy, precision, recall, F1, PR-AUC) but also with probability-reliability metrics (Brier score, Expected Calibration Error) and threshold optimization. Results show that tuned RBF-SVMs consistently outperform Logistic Regression and Decision Trees and perform comparably to Random Forests. Calibration (Platt scaling, isotonic regression) substantially reduces calibration error and improves decision quality, while domain-specific feature engineering further improves Titanic prediction. By embedding all steps in a transparent, reproducible protocol and validating across multiple datasets, this study establishes a rigorous methodological baseline for SVMs in tabular binary classification and provides a reference point for future machine-learning research.
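
To make the evaluation protocol concrete, the sketch below shows one way to assemble a leakage-safe, calibration-aware RBF-SVM pipeline with nested cross-validation in scikit-learn. It is an illustration, not the authors' released code: the synthetic DataFrame, column names, hyperparameter grid, and the Brier-score tuning objective are assumptions made for this example.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for a real tabular dataset (Titanic / Pima / Heart Disease),
# with a numeric column containing missing values and one categorical column.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.normal(50, 12, n),
    "chol": rng.normal(200, 40, n),
    "sex": rng.choice(["male", "female"], n),
})
df.loc[rng.choice(n, 20, replace=False), "chol"] = np.nan
df["target"] = (df["age"] + rng.normal(0, 10, n) > 50).astype(int)

X = df.drop(columns=["target"])
y = df["target"]
numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.columns.difference(numeric_cols)

# All preprocessing sits inside the pipeline, so imputation, encoding, and
# scaling are fit on training folds only (no leakage into validation folds).
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

# RBF-SVM wrapped in Platt scaling ("sigmoid"); "isotonic" is the other
# calibrator discussed in the abstract.
svm = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=5)
pipe = Pipeline([("prep", preprocess), ("clf", svm)])

# Illustrative grid; the parameter path assumes scikit-learn >= 1.2
# (older releases name the wrapped model base_estimator instead of estimator).
param_grid = {"clf__estimator__C": [0.1, 1, 10],
              "clf__estimator__gamma": ["scale", 0.01, 0.1]}

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(pipe, param_grid, scoring="neg_brier_score", cv=inner)

# Nested CV: the outer loop estimates how well the whole tuning-plus-calibration
# procedure generalizes, rather than the score of a single tuned model.
scores = cross_val_score(search, X, y, cv=outer, scoring="neg_brier_score")
print("Nested-CV Brier score: %.3f +/- %.3f" % (-scores.mean(), scores.std()))

Because imputation, encoding, and scaling live inside the Pipeline, they are refit on every training fold, which is what keeps the protocol leakage-safe; swapping method="isotonic" into CalibratedClassifierCV gives the isotonic-regression variant mentioned above.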

Keywords


Tabular classification; Support Vector Machine; Probability calibration; Cross-dataset benchmarking; Small datasets

DOI: https://doi.org/10.18860/jrmm.v4i6.33438
