HOW MACHINE LEARNING METHOD PERFORMANCE FOR IMBALANCED DATA: Case Study: Classification of Working Status of Banten Province


Pardomuan Robinson Sihombing

Abstract

This study examines how several machine learning classification methods perform on imbalanced data. The case study is a classification model of working status in Banten Province in 2020, using data from the National Labor Force Survey conducted by Statistics Indonesia. The methods compared are Classification and Regression Tree (CART), Naïve Bayes, Random Forest, Rotation Forest, Support Vector Machine (SVM), Neural Network Analysis, One Rule (OneR), and Boosting. For imbalanced and large data sets, modeling with resampling techniques improved classification accuracy, especially for the minority class: the sensitivity and specificity values were more balanced than those obtained from the original, untreated data. Of the eight classification models tested, the Boosting model performed best, with the highest sensitivity, specificity, G-mean, and kappa coefficient values. The most influential variables in classifying working status were marital status, education, and age.
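The abstract compares models by sensitivity, specificity, G-mean, and kappa but does not define them. The following is a minimal sketch, assuming the standard confusion-matrix definitions of these metrics; the function name `imbalance_metrics` and all counts are hypothetical illustrations, not the study's code or results.

```python
from math import sqrt


def imbalance_metrics(tp, fn, fp, tn):
    """Compute common imbalanced-data metrics from confusion-matrix counts.

    tp/fn: minority (positive) cases predicted correctly/incorrectly.
    tn/fp: majority (negative) cases predicted correctly/incorrectly.
    """
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    g_mean = sqrt(sensitivity * specificity)  # geometric mean of both rates
    accuracy = (tp + tn) / total  # observed agreement
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (accuracy - expected) / (1 - expected)  # Cohen's kappa
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
        "accuracy": accuracy,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 minority and 900 majority cases.
    always_majority = imbalance_metrics(tp=0, fn=100, fp=0, tn=900)
    after_resampling = imbalance_metrics(tp=80, fn=20, fp=100, tn=800)
    for name, result in [("always majority", always_majority),
                         ("after resampling", after_resampling)]:
        print(name, {key: round(value, 3) for key, value in result.items()})
```

In the first hypothetical case a model that always predicts the majority class scores high accuracy while its sensitivity, G-mean, and kappa are all zero, which is why accuracy alone is a weak criterion for imbalanced data.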


How to Cite
Sihombing, P. R. (2021). HOW MACHINE LEARNING METHOD PERFORMANCE FOR IMBALANCED DATA: Case Study: Classification of Working Status of Banten Province. TEKNOKOM, 4(2), 48–52. https://doi.org/10.31943/teknokom.v4i2.64
