AWBI-LSTM classifier with hybrid ADASYN-GAN oversampling and optimized FCM undersampling for imbalanced data

Sangeetha Palanisamy, Chitra Duraisamy

Article ID: 8283
Vol 39, Issue 4, 2025
DOI: https://doi.org/10.54517/jbrha8283


Abstract

When faced with imbalanced data, classification techniques in the field of artificial intelligence tend to favor the majority-class samples, which lowers the recognition rates of minority-class samples. Undersampling addresses this problem by reducing the number of majority-class samples while trying to preserve the original data distribution. However, the limitations of current clustering-based undersampling techniques strongly affect the original imbalanced dataset and its overall classification accuracy. To address these issues, this work first pre-processes the highly imbalanced dataset using the Non-Negative Matrix Factorization (NMF) algorithm. Next, Hybrid Extremely Randomized Trees (HERT), an efficient ensemble-learning-based method, is employed to select features quickly. To tackle the class-imbalance problem, Generative Adversarial Network (GAN)-based oversampling is then proposed; this method has shown exceptional capacity to handle class imbalance because it can capture the true data distribution of the minority-class samples and generate new ones. For undersampling, Fuzzy C-Means (FCM) clustering is proposed, which selects useful instances from each cluster and avoids information loss. FCM clustering of the majority class and ADASYN-GAN-based oversampling of the minority class are combined to produce better results. Finally, the sampled dataset is classified using the Adaptive Weight Bi-Directional Long Short-Term Memory (AWBi-LSTM) classifier. Three large, imbalanced datasets are used to evaluate the proposed algorithm, and its efficiency is compared with that of state-of-the-art machine learning (ML) techniques such as XGBoost and random forest. The performance assessment with regard to accuracy, recall, precision, and F1-score demonstrates the effectiveness of the proposed method. Furthermore, the proposed scheme requires less training time than state-of-the-art methods.
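The hybrid sampling stage described above can be sketched in a few dozen lines. The code below is a minimal, illustrative NumPy version of the two generic techniques named in the abstract — FCM clustering to undersample the majority class (keeping the highest-membership points of each cluster) and ADASYN-style interpolation to oversample the minority class. The cluster count, the per-cluster quota rule, and the omission of the GAN generator are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fcm(X, c, m=2.0, iters=100, tol=1e-5):
    """Minimal fuzzy c-means: returns cluster centers and the n x c membership matrix."""
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        um = U ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]           # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = d ** (-2.0 / (m - 1))
        U_new /= U_new.sum(axis=1, keepdims=True)                # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U

def fcm_undersample(X_maj, n_keep, c=3):
    """Undersample the majority class: keep the highest-membership points of each FCM cluster."""
    _, U = fcm(X_maj, c)
    labels, quota, keep = U.argmax(axis=1), n_keep // c, []
    for j in range(c):
        idx = np.where(labels == j)[0]
        keep.extend(idx[np.argsort(-U[idx, j])][:quota])         # most central points first
    return X_maj[np.array(keep)]

def adasyn(X_min, X_maj, k=5):
    """ADASYN-style oversampling: generate more synthetic points where a minority sample
    has many majority neighbours, i.e. near the decision boundary."""
    X_all = np.vstack([X_min, X_maj])
    n_min, G = len(X_min), len(X_maj) - len(X_min)               # G: total synthetics to create
    d_all = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # r_i: fraction of majority points among the k nearest neighbours (self excluded)
    r = np.array([(np.argsort(d_all[i])[1:k + 1] >= n_min).mean() for i in range(n_min)])
    r = r / r.sum() if r.sum() > 0 else np.full(n_min, 1.0 / n_min)
    g = np.rint(r * G).astype(int)                               # per-sample synthetic quota
    synth = []
    for i in range(n_min):
        nbrs = np.argsort(d_min[i])[1:k + 1]                     # minority-only neighbours
        for _ in range(g[i]):
            j = rng.choice(nbrs)
            synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synth).reshape(-1, X_min.shape[1])

# Toy demo: 200 majority vs. 30 minority points in 2-D.
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(2.5, 0.6, size=(30, 2))
X_maj_ds = fcm_undersample(X_maj, n_keep=60)         # majority shrunk toward cluster cores
X_min_up = np.vstack([X_min, adasyn(X_min, X_maj)])  # minority grown toward balance
```

In the paper's pipeline the synthetic minority samples would instead come from the ADASYN-seeded GAN, and the resampled, balanced set would feed the AWBi-LSTM classifier.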


Keywords

big data platform; feature selection; imbalanced data classification; neural network; clustering


References

[1]          Mohamed AE. Comparative study of four supervised machine learning techniques for classification. International Journal of Applied Science and Technology, 2017; 7 (2).

[2]          Trifonov R, Gotseva D, Angelov V. Binary classification algorithms. International Journal of Development Research, 2017; 7 (11): 16873-16879.

[3]          Aly M. Survey on multiclass classification methods. Neural Netw, 2005; 19(1-9): 2.

[4]          de Carvalho AC, Freitas AA. A tutorial on multi-label classification techniques. Foundations of Computational Intelligence Volume 5, 2009: 177-195.

[5]          Chawla NV, Japkowicz N, Kotcz A. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 2004; 6(1): 1-6.

[6]          Sohony I, Pratap R, Nambiar U. Ensemble learning for credit card fraud detection. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data; 2018. pp. 289-294.

[7]          Manek AS, Samhitha MR, Shruthy S, Bhat VH, Shenoy PD et al. RePID-OK: spam detection using repetitive preprocessing. In IEEE 2013 International Conference on Cloud & Ubiquitous Computing & Emerging Technologies; 2013. pp. 144-149.

[8]          Gupta S, Gupta MK. A comprehensive data‐level investigation of cancer diagnosis on imbalanced data. Computational Intelligence, 2022; 38 (1): 156-186.

[9]          Padurariu C, Breaban ME. Dealing with data imbalance in text classification. Procedia Computer Science, 2019; 159: 736-745.

[10]      Estabrooks A, Jo T, Japkowicz N. A multiple resampling method for learning from imbalanced datasets. Computational intelligence, 2004; 20 (1): 18-36.

[11]      He H, Garcia EA. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 2009; 21(9): 1263-1284.

[12]      Weiss GM. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter, 2004; 6(1): 7-19.

[13]      Visa S, Ralescu A. Issues in mining imbalanced data sets-a review paper. In Proceedings of the sixteen midwest artificial intelligence and cognitive science conference; 2005. pp. 67-73.

[14]      López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 2013; 250: 113-141.

[15]      Chawla NV. Data mining for imbalanced datasets: An overview. Data mining and knowledge discovery handbook, 2009: 875-886.

[16]      Tsai CF, Lin WC, Hu YH, Yao GT. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences, 2019; 477: 47-54.

[17]      Zhu M, Xia J, Jin X, Yan M, Cai G, Yan J, Ning G. Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access, 2018; 6: 4641-4652.

[18]      Li J, Fong S, Yuan M, Wong RK. Adaptive multi-objective swarm crossover optimization for imbalanced data classification. In International Conference on Advanced Data Mining and Applications; Springer, Cham; 2016. pp. 374-390.

[19]      Li M, Xiong A, Wang L, Deng S, Ye J. ACO resampling: enhancing the performance of oversampling methods for class imbalance classification. Knowledge-Based Systems, 2020; 196: 105818.

[20]      Febriantono MA, Pramono SH, Rahmadwati R, Naghdy G. Classification of multiclass imbalanced data using cost-sensitive decision tree C5.0. IAES International Journal of Artificial Intelligence, 2020; 9(1): 65.

[21]      Babu MC, Pushpa S. Genetic algorithm-based PCA classification for imbalanced dataset. In Intelligent Computing in Engineering; Springer, Singapore; 2020. pp. 541-552.

[22]      Ji S, Zhang Z, Ying S, Wang L, Zhao X, Gao Y. Kullback–Leibler divergence metric learning. IEEE Transactions on Cybernetics, 2020; 52(4): 2047-2058.

[23]      Kalantar B, Ueda N, Idrees MO, Janizadeh S, Ahmadi K, Shabani F. Forest fire susceptibility prediction based on machine learning models with resampling algorithms on remote sensing data. Remote Sensing, 2020; 12(22): 3682.

[24]      Bezdek JC, Hathaway RJ. Optimization of fuzzy clustering criteria using genetic algorithms. In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence; 1994. pp. 589-594.

[25]      He H, Bai Y, Garcia EA, Li S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); 2008. pp. 1322-1328.

Supporting Agencies

This study did not receive any financial support or funding from external sources.



Copyright (c) 2025 Sangeetha Palanisamy*, Chitra Duraisamy

This work is licensed under a Creative Commons Attribution 4.0 International License.

