Comparison between Statistical Approaches and Data Mining Algorithms for Outlier Detection

Annisa Putri Utami, Anwar Fitrianto, Khairil Anwar Notodiputro

Abstract


Outliers are observations that differ markedly from the majority of the data. Their presence can distort the results of one analysis, yet they may carry important information for another. Identifying outliers before conducting data analysis is therefore crucial. Outlier detection methods were pioneered by researchers in statistics. However, because rapid technological advances have made it easy to collect very large data sets, the development of outlier detection techniques is now driven mainly by researchers in computer science (data mining), who rely on computing facilities. This research uses simulation studies to compare methods for identifying multiple outliers, based on statistical approaches and on data mining algorithms, across a range of predetermined data scenarios. In the scenarios considered, the outlier detection methods based on a statistical approach generally performed better than those based on data mining. For further research, we suggest improving the data mining methods by giving more attention to statistical analysis, in addition to computing time, so that outlier detection becomes both faster and more precise.


Keywords


distance-based methods; masking; outlier; outlier detection method; swamping
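
As an illustration of the keywords "masking" and "distance-based methods", the following minimal Python sketch contrasts three outlier scores on a small toy sample. The abstract does not name the specific methods or data scenarios used in the study, so everything here is an assumption made for illustration: the toy data, the function names, the cutoff of 3, and the choices of k = 2 and n = 2. It is not the authors' simulation design.

```python
import statistics

def z_scores(xs):
    """Classical score: distance from the mean in standard deviations."""
    mean = statistics.fmean(xs)
    sd = statistics.stdev(xs)
    return [abs(x - mean) / sd for x in xs]

def robust_z_scores(xs):
    """Robust score: distance from the median in scaled MAD units."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    # 1.4826 makes the MAD consistent with the SD under normality.
    return [abs(x - med) / (1.4826 * mad) for x in xs]

def knn_distance_scores(xs, k):
    """Distance-based score: distance to the k-th nearest neighbour."""
    scores = []
    for i, x in enumerate(xs):
        dists = sorted(abs(x - y) for j, y in enumerate(xs) if j != i)
        scores.append(dists[k - 1])
    return scores

def flag_above(scores, cutoff):
    """Indices whose score exceeds a fixed cutoff."""
    return [i for i, s in enumerate(scores) if s > cutoff]

def flag_top_n(scores, n):
    """Indices of the n largest scores, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n])

# Hypothetical toy sample: eight regular values and two planted outliers
# (indices 8 and 9).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 9.7, 10.3, 25.0, 26.0]

classical = flag_above(z_scores(data), 3.0)
robust = flag_above(robust_z_scores(data), 3.0)
distance_based = flag_top_n(knn_distance_scores(data, k=2), n=2)

print("classical z-score:", classical)
print("robust z-score:   ", robust)
print("k-NN distance:    ", distance_based)
```

On this sample the two planted outliers inflate the mean and standard deviation enough that the classical z-score flags nothing, which is the masking effect, while the median/MAD score and the k-th nearest neighbour distance both single out indices 8 and 9. The distance-based result depends on the assumed k and n, so the sketch says nothing about which family of methods performs better in general.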

DOI: https://doi.org/10.18860/ca.v9i1.25450


Copyright (c) 2024 Annisa Putri Utami, Anwar Fitrianto, Khairil Anwar Notodiputro

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Editorial Office
Mathematics Department,
Universitas Islam Negeri Maulana Malik Ibrahim Malang
Gajayana Street 50 Malang, East Java, Indonesia 65144
Facsimile (+62) 341 558933
e-mail: cauchy@uin-malang.ac.id

CAUCHY: Jurnal Matematika Murni dan Aplikasi is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.