Comparing Outlier Detection Methods: An Application on Indonesian Air Quality Data

Anwar Fitrianto, Amalia Kholifatunnisa, Anang Kurnia

Abstract


There are many methods for detecting outliers, but only a few account for the distribution of the data. This research compares outlier detection methods on univariate data with a skewed distribution. The methods compared are Tukey's boxplot, the adjusted boxplot, sequential fences, and adjusted sequential fences. The study also identifies areas of concern due to poor air quality during the Implementation of Micro-Community Activity Restrictions, using Indonesian air quality index data.

The adjusted boxplot performs best based on the number of outliers detected, error rate, accuracy, precision, specificity, sensitivity, and robustness. The adjusted boxplot and adjusted sequential fences accurately detect the tails that contain outliers because the skewness coefficient makes them more robust. In contrast, Tukey's boxplot and sequential fences perform poorly because they could not correctly detect the true outliers. Areas that need attention due to poor air quality include South Sumatera, South Sulawesi, West Java, Riau, North Sumatera, Jambi, Jakarta, and East Java.
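
To make the contrast between the symmetric and skewness-adjusted fences concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the commonly cited form of the adjusted boxplot, in which the medcouple (MC) rescales the 1.5 × IQR fences through exponential factors. All function names are hypothetical, the sample values are made up for illustration and are not the study's air quality data, the quartile convention (Python's "inclusive" method) may differ from the one used in the paper, and the medcouple is a naive O(n²) version that simply skips pairs tied at the median.

```python
import math
import statistics


def quartiles(values):
    """Return (Q1, Q3) using the 'inclusive' quantile convention (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def medcouple_naive(values):
    """Naive O(n^2) medcouple, a robust skewness measure in [-1, 1].

    Simplification (assumption): pairs where both points equal the median
    are skipped instead of being handled by the special tie kernel.
    """
    med = statistics.median(values)
    lower = [x for x in values if x <= med]
    upper = [x for x in values if x >= med]
    kernel = [
        ((xj - med) - (med - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi
    ]
    return statistics.median(kernel)


def tukey_fences(values, k=1.5):
    """Symmetric fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def adjusted_fences(values, k=1.5):
    """Skewness-adjusted fences: the medcouple stretches one side, shrinks the other."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    mc = medcouple_naive(values)
    if mc >= 0:
        return q1 - k * math.exp(-4 * mc) * iqr, q3 + k * math.exp(3 * mc) * iqr
    return q1 - k * math.exp(-3 * mc) * iqr, q3 + k * math.exp(4 * mc) * iqr


def flag_outliers(values, fences):
    """Return the values that fall outside the (lower, upper) fences."""
    lower, upper = fences
    return [x for x in values if x < lower or x > upper]


# Made-up right-skewed sample, for illustration only.
sample = [1, 2, 2, 3, 3, 4, 5, 6, 8, 11, 15, 40]

print("medcouple:", medcouple_naive(sample))
print("Tukey fences:", tukey_fences(sample))
print("adjusted fences:", adjusted_fences(sample))
print("Tukey outliers:", flag_outliers(sample, tukey_fences(sample)))
print("adjusted outliers:", flag_outliers(sample, adjusted_fences(sample)))
```

For right-skewed data the medcouple is positive, so the adjusted upper fence moves outward and the lower fence moves inward relative to Tukey's fences. This is the mechanism by which the adjusted methods direct attention to the tail that actually contains outliers. For production use, a tested medcouple implementation with proper tie handling would be preferable to this naive version.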


Keywords


adjusted boxplot, adjusted sequential fences, outlier, sequential fences, Tukey's boxplot.

DOI: https://doi.org/10.18860/ca.v9i2.29434


Copyright (c) 2024 Anwar Fitrianto

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Editorial Office
Mathematics Department,
Universitas Islam Negeri Maulana Malik Ibrahim Malang
Gajayana Street 50 Malang, East Java, Indonesia 65144
Facsimile (+62) 341 558933
e-mail: cauchy@uin-malang.ac.id

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.