Imbalanced Data Classification Using the Random Forest Algorithm with SMOTE and SMOTE-ENN (A Case Study on Stunting Data)

Anju Fauziah; Julan Hernadi

doi:10.30787/restia.v3i2.1906

Authors

Anju Fauziah, Universitas Ahmad Dahlan
Julan Hernadi


Keywords:

Informatics Engineering, Information Systems, Distributed Computer Systems, Artificial Intelligence, Artificial Intelligence Systems

Abstract

The random forest algorithm is a widely used machine learning classification method because it reduces the risk of overfitting while improving overall predictive performance. On data with imbalanced classes, however, it often fails to reach its best performance, particularly when predicting the minority class. This article therefore applies two resampling approaches to balance the data: the Synthetic Minority Oversampling Technique (SMOTE) and SMOTE combined with Edited Nearest Neighbors (SMOTE-ENN). The random forest algorithm is applied first to the original data, then to the data resampled with SMOTE and with SMOTE-ENN. The case study uses stunting data consisting of 421 cases in the majority class and 79 in the minority class. Accuracy was 89% on the original data, 90% on the data resampled with SMOTE-ENN, and 91% on the data resampled with SMOTE. Resampling with SMOTE gave the best accuracy, although the gain over the original data was small (2 percentage points).
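To make the oversampling step concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic minority point is placed on the line segment between a minority sample and one of its k nearest minority neighbors. It is not the authors' implementation. The function name `smote_sketch`, the choice `k=5`, the seeds, and the two-dimensional feature values are assumptions made for illustration only; the class counts (421 and 79) are the only figures taken from the abstract. The ENN cleaning step of SMOTE-ENN and the random forest itself are not shown.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the point itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Class counts reported in the abstract.
n_majority, n_minority = 421, 79

# Hypothetical 2-D minority features; the real stunting attributes are not given here.
data_rng = random.Random(42)
minority = [(data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)) for _ in range(n_minority)]

# Oversample the minority class until it matches the majority class size.
synthetic = smote_sketch(minority, n_new=n_majority - n_minority)
balanced_minority = minority + synthetic

print(len(synthetic))          # 342 synthetic minority points
print(len(balanced_minority))  # 421, equal to the majority class size
```

Because every synthetic point is an interpolation between two existing minority samples, it stays within the region already occupied by the minority class. In practice, resampling of this kind is normally applied to the training split only, so that the test data remain untouched by synthetic samples.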
