Reliable and Efficient Sentiment Analysis on IMDb with Logistic Regression

Diah Mariatul Ulya, Juhari Juhari, Rossima Eva Yuliana, Mohammad Jamhuri

Abstract


Understanding public opinion at scale is essential for modern media analytics. We present a reproducible, leakage-safe evaluation of logistic regression (LR) for binary sentiment classification on the IMDb Large Movie Review dataset and compare it with five widely used baselines: multinomial Naive Bayes, linear support vector machine (SVM), decision tree, k-nearest neighbors, and random forest. Using a standardized text pipeline (HTML stripping, stopword removal, WordNet lemmatization) with TF–IDF features over unigrams and bigrams and nested, stratified cross-validation, we assess threshold-dependent and threshold-independent performance, probability calibration, and computational efficiency. LR attains the best overall balance of quality and speed, achieving 88.98% accuracy and 89.13% F1, with strong ranking performance (out-of-fold ROC–AUC ≈ 0.9568; PR–AUC ≈ 0.9554) and well-behaved calibration (Brier score ≈ 0.0858). Training completes in seconds per fold, and CPU inference reaches about 2.46×10^6 samples per second. While a calibrated linear SVM yields slightly higher precision, LR delivers higher F1 at markedly lower compute. These results establish LR as a robust, transparent baseline that remains competitive with more complex neural and ensemble approaches, offering a favorable performance–efficiency trade-off for practical deployment and reproducible research on IMDb sentiment classification.
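The abstract distinguishes threshold-dependent metrics (accuracy, F1), a threshold-independent ranking metric (ROC–AUC), and a calibration metric (Brier score). As a minimal sketch of how these differ, the following pure-Python example computes each from a vector of predicted probabilities. It is not the authors' code: the function names, the 0.5 decision threshold, and the six toy labels and probabilities are assumptions made for illustration only, and the pairwise ROC–AUC loop is quadratic, so it suits small examples rather than the full dataset.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_prob: Sequence[float],
                    threshold: float = 0.5) -> Tuple[float, float]:
    """Threshold-dependent metrics: harden probabilities, then count outcomes."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = len(y_true) - tp - fp - fn
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Threshold-independent ranking metric.

    Equals the probability that a randomly chosen positive example receives
    a higher score than a randomly chosen negative one (ties count as half).
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Calibration-sensitive metric: mean squared error of the probabilities."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (made up for illustration): 1 = positive review, 0 = negative.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

acc, f1 = accuracy_and_f1(toy_labels, toy_probs)
auc = roc_auc(toy_labels, toy_probs)
brier = brier_score(toy_labels, toy_probs)
print(f"accuracy={acc:.4f} f1={f1:.4f} roc_auc={auc:.4f} brier={brier:.4f}")
```

Accuracy and F1 change if the threshold moves, whereas ROC–AUC depends only on the ordering of the scores and the Brier score depends on how close each probability is to the true label. Lower Brier scores indicate better-calibrated, sharper probabilities.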

Keywords


classification; IMDb; logistic regression; sentiment analysis; text mining.



DOI: https://doi.org/10.18860/cauchy.v10i2.33809


Copyright (c) 2025 Diah Mariatul Ulya, Juhari Juhari, Rossima Eva Yuliana, Mohammad Jamhuri

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Editorial Office
Mathematics Department,
Universitas Islam Negeri Maulana Malik Ibrahim Malang
Gajayana Street 50 Malang, East Java, Indonesia 65144
Facsimile (+62) 341 558933
e-mail: cauchy@uin-malang.ac.id

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.