Analyzing Household Expenditures with Generalized Random Forests

Eriski Isnanda, Khairil Anwar Notodiputro, Kusman Sadik

Abstract


This study investigates the performance of the Generalized Random Forest (GRF), which is known to be useful for understanding heterogeneous treatment effects (HTE) and non-linear relationships in high-dimensional data. As a continuation of the work of Athey et al. (2019), GRF is compared with Random Forest (RF) and the Generalized Linear Mixed Model (GLMM). Data from the National Socioeconomic Survey (SUSENAS) are used to predict household per capita expenditure in West Java, Indonesia. The models are evaluated on their ability to handle outliers when Winsorization is applied. The results show that RF performed best, yielding the smallest mean squared error (MSE), followed by GRF with reasonably good performance, while GLMM had the highest MSE, indicating its limitations in handling non-linear data patterns. These findings indicate that RF is the most accurate of the three methods for modeling per capita expenditure in West Java. Further research is recommended to develop hybrid methods or to use more specific random effects in GLMM.
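The evaluation idea described above, Winsorizing outliers and then comparing models by MSE, can be illustrated with a minimal sketch. It is not the authors' pipeline: the 5th/95th percentile cut-offs, the synthetic log-normal data, and the mean-only baseline predictor (standing in for RF, GRF, or GLMM) are all assumptions made for illustration.

```python
import random
import statistics


def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to empirical percentiles (cut-offs are assumed, not from the paper)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_pct * (n - 1))]
    hi = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Synthetic, right-skewed "expenditure" values plus a few extreme outliers.
rng = random.Random(0)
expenditure = [rng.lognormvariate(0, 0.5) for _ in range(200)] + [50.0, 80.0, 120.0]

clipped = winsorize(expenditure)

# Hypothetical baseline: predict the mean for every household.
raw_mse = mse(expenditure, [statistics.fmean(expenditure)] * len(expenditure))
win_mse = mse(clipped, [statistics.fmean(clipped)] * len(clipped))

print(f"MSE before winsorizing: {raw_mse:.3f}")
print(f"MSE after winsorizing:  {win_mse:.3f}")
```

In a real comparison, the mean baseline would be replaced by out-of-sample predictions from each fitted model, with the same Winsorized target used for all of them so that the MSE values are comparable.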

Keywords


generalized linear mixed model; generalized random forest; household per capita expenditure; random forest; winsorization


References


[1]

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

[2]

Chaudhary, K., Alam, M., Al-Rakhami, M. S., & Gumaei, A. (2021). Machine learning-based mathematical modeling for prediction of social media consumer behavior using big data analytics. Journal of Big Data, 8(1), 1–20.

[3]

Conover, W. J. (1999). Practical Nonparametric Statistics (3rd ed.). Wiley.

[4]

Freedman, D. A. (2009). Statistical Models: Theory and Practice. Cambridge University Press.

[5]

Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized Random Forests. The Annals of Statistics, 47(2), 1148–1178.

[6]

Wager, S., & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests. Journal of the American Statistical Association, 113(523), 1228–1242.

[7]

Goldman, N. (2022). Nonparametric Estimation of Conditional Densities by Generalized Random Forests. Journal of Statistical Computation, 48(3), 121–145.

[8]

Zhang, Y., Li, H., & Ren, G. (2022). Estimating heterogeneous treatment effects in road safety analysis using generalized random forests. Accident Analysis & Prevention, 165, 106507.

[9]

Wang, M., & Yang, Q. (2022). The heterogeneous treatment effect of low-carbon city pilot policy on stock return: A generalized random forests approach. Finance Research Letters, 47(Part A), 102808.

[10]

Shiraishi, T. (2024). Time Series Quantile Regression Using Random Forests. Machine Learning Journal, 113(4), 789–805.

[11]

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.

[12]

Setiawan, D., Wijayanto, H., & Abdul Rahman, L. O. (2022). Bagging and random forest classification methods for unbalanced data school dropout cases in Lampung province. AIP Conference Proceedings, 2662(1), 020026.

[13]

Amaliah, S., Nusrang, M., & Aswi. (2022). Penerapan metode Random Forest untuk klasifikasi varian minuman kopi di kedai kopi Konijiwa Bantaeng. VARIANSI: Journal of Statistics and Its Application on Teaching and Research, 4(2), 121–127.

[14]

Ilma, H., Notodiputro, K. A., & Sartono, B. (2023). Association rules in random forest for the most interpretable model. Barekeng: Journal of Mathematics and Its Applications, 17(1), 185–196.

[15]

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

[16]

Rusyana, A., Notodiputro, K. A., & Sartono, B. (2021). A generalized linear mixed model for understanding determinant factors of student's interest in pursuing bachelor's degree at Universitas Syiah Kuala. Jurnal Natural, 21(2), 193–205.

[17]

Sunandi, E., Notodiputro, K. A., & Sartono, B. (2022). A study of generalized linear mixed model for count data using hierarchical Bayes method. Media Statistika, 14(2), 194–205.

[18]

Belinda, N. S., Notodiputro, K. A., & Soleh, A. M. (2024). BHF and Copula Models in Small Area Estimation for Household Per Capita Expenditure in Bogor District. Jurnal Natural, 24(2). https://doi.org/10.24815/jn.v24i2.37278

[19]

Chatterjee, S., & Hadi, A. S. (2015). Regression analysis by example (5th ed.). Wiley.

[20]

Dash, C. S. K., Behera, A. K., Dehuri, S., & Ghosh, A. (2023). An outliers detection and elimination framework in classification task of data mining. Decision Analytics Journal, 6, 100164. https://doi.org/10.1016/j.dajour.2023.100164

[21]

Ghosh, D., & Vogt, A. (2012). Outliers: An evaluation of methodologies. Joint Statistical Meetings, 12(1), 3455–3460.

[22]

Zubedi, F., Sartono, B., & Notodiputro, K. A. (2022). Implementation of winsorizing and random oversampling on data containing outliers and unbalanced data with the random forest classification method. Jurnal Natural, 22(2), 108–116. https://doi.org/10.24815/jn.v22i2.25499

[23]

Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models (3rd ed.). Sage Publications.

[24]

Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing (3rd ed.). Academic Press.

[25]

Field, A. (2013). Discovering Statistics Using SPSS (4th ed.). Sage Publications.





DOI: https://doi.org/10.18860/cauchy.v10i1.30104



Copyright (c) 2025 Eriski Isnanda, Khairil Anwar Notodiputro, Kusman Sadik

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.