A Study on Multi-Class Topic Prediction for E-commerce Review Data Using Ensemble Learning

Kevin Alifviansyah

Abstract


The exponential growth of e-commerce platforms has generated massive volumes of unstructured user reviews, which calls for automated analysis methods that can extract actionable insights for strategic decision-making. This study addresses multi-class text classification by integrating BERTopic-based topic modeling with ensemble learning algorithms to analyze Indonesian e-commerce reviews. A dataset of 24,000 customer reviews from the Google Play Store underwent systematic preprocessing and topic extraction using BERTopic, yielding eight distinct thematic clusters reflecting application performance, product quality, pricing, delivery logistics, and service reliability. The dataset exhibited severe class imbalance, with an imbalance ratio of 65:1: the dominant class represented 76.02% of instances, while minority classes constituted less than 2.12%. Hybrid resampling, combining undersampling and oversampling, reduced the imbalance ratio to 1.4:1. TF-IDF vectorization transformed the preprocessed text into numerical features, followed by supervised classification using CatBoost and Extra Trees classifiers optimized through randomized hyperparameter search with stratified k-fold cross-validation. CatBoost performed best, achieving a balanced accuracy of 0.829, a recall of 0.829, and an AUC of 0.965, which the study attributes to its ordered boosting mechanism and its capacity for handling categorical and imbalanced data. Independent validation on 2025 data confirmed robust generalization, with prediction confidence exceeding 0.90, and revealed a significant temporal shift: product-related topics became dominant at 70.35%, pricing concerns rose from 6.58% to 16.57%, and application issues fell from 76.02% to 2.51%.
This research establishes a methodologically rigorous framework that integrates unsupervised topic discovery with supervised ensemble classification, demonstrating computational efficiency while providing a scalable solution for automated review categorization.
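The two imbalance-related quantities named in the abstract can be illustrated with a minimal sketch. It assumes the imbalance ratio is the largest class count divided by the smallest, and that balanced accuracy is the unweighted mean of per-class recall; under that definition balanced accuracy coincides with macro-averaged recall, which would be consistent with the two identical 0.829 figures, although the abstract does not state the averaging used. The function names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (macro-averaged recall)."""
    total = Counter(y_true)   # number of true instances per class
    correct = Counter()       # correctly predicted instances per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels invented for illustration only (not the study's data):
# eight "app" reviews and two "price" reviews, one "price" review misclassified.
y_true = ["app"] * 8 + ["price"] * 2
y_pred = ["app"] * 8 + ["price", "app"]

ratio = imbalance_ratio(y_true)
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

print(ratio, plain_acc, bal_acc)
```

On this toy input, plain accuracy is dominated by the majority class, while balanced accuracy weights the minority class equally, which is why it is the more informative score on heavily skewed label distributions.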

Keywords


Multi-class Classification; Ensemble Learning; BERTopic; CatBoost; Extra Trees





DOI: https://doi.org/10.18860/cauchy.v11i1.37941



Copyright (c) 2026 Kevin Alifviansyah

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.