BERTopic-Based Multi-Class Topic Classification on Indonesian Shopee E-commerce Reviews Using Ensemble Learning

Kevin Alifviansyah, Asep Saefuddin, Septian Rahardiantoro

Abstract


The rapid growth of e-commerce platforms has produced a large volume of unstructured user reviews, which makes scalable analysis difficult. This study proposes a multi-class topic classification framework for Indonesian Shopee application reviews that combines embedding-based topic modeling (BERTopic) with ensemble learning. A total of 23,956 reviews are analyzed. BERTopic is applied exclusively to the 19,167 training reviews to derive eight dominant topic labels, which serve as pseudo-labels for supervised classification with CatBoost and Extra Trees. Model performance is evaluated on a held-out test set under baseline and hybrid resampling settings, the latter intended to address severe class imbalance. The results show that hybrid resampling substantially improves balanced accuracy, particularly for CatBoost, while ROC–AUC remains consistently high, indicating robust class discrimination. Analysis of an unlabeled 2025 dataset, used solely in a deployment-style setting, reveals semantically consistent topic distributions on unseen data. Overall, the findings demonstrate that embedding-based topic modeling combined with ensemble learning provides an effective and scalable solution for multi-class topic classification in highly imbalanced e-commerce review data, with clear separation between training, evaluation, and post-deployment analysis.
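To illustrate why the abstract reports balanced accuracy and a resampled setting, the following is a minimal pure-Python sketch on made-up labels. It is not the authors' pipeline: the abstract does not say which hybrid resampling scheme was used, so `hybrid_resample` below assumes a simple combination of random undersampling and random oversampling toward a hypothetical `target` count per class, and the label counts are invented for illustration only.

```python
import random
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class weighs the same regardless of size."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def hybrid_resample(rows, labels, target, seed=0):
    """Assumed scheme: shrink classes above `target` by sampling without
    replacement, grow classes below it by sampling with replacement."""
    rng = random.Random(seed)
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    out_rows, out_labels = [], []
    for lab, items in sorted(by_class.items()):
        if len(items) >= target:
            chosen = rng.sample(items, target)
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        out_rows.extend(chosen)
        out_labels.extend([lab] * target)
    return out_rows, out_labels


# Toy, invented labels: topic 0 dominates, topics 1 and 2 are rare.
y_true = [0] * 90 + [1] * 6 + [2] * 4
y_pred = [0] * 100  # degenerate model that always predicts the majority topic

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

rows = list(range(len(y_true)))  # stand-ins for feature vectors
res_rows, res_labels = hybrid_resample(rows, y_true, target=30)
counts = Counter(res_labels)

print(f"accuracy={accuracy:.3f} balanced_accuracy={bal_acc:.3f}")
print("class counts after resampling:", dict(counts))
```

On this toy data, plain accuracy looks strong while balanced accuracy exposes that the two rare topics are never predicted, which is the kind of gap a resampled training set is meant to narrow. Resampling of this kind would normally be applied to the training split only, consistent with the held-out evaluation described above.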

Keywords


BERTopic; CatBoost; E-commerce Reviews; Ensemble Learning; Imbalanced Data; Multi-Class Classification; Topic Modeling

DOI: https://doi.org/10.18860/cauchy.v11i1.37941

Copyright (c) 2026 Kevin Alifviansyah

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.