Lexter T. Aquino, Roi John D. Belmonte, Crisel F. Salva.
An Enhancement of SMOTE algorithm applied in an intelligent system for medical diagnosis.
Undergraduate Thesis : (Bachelor of Science in Computer Science) - Pamantasan ng Lungsod ng Maynila, 2024.
ABSTRACT: Data classification is a field in data science that encompasses the prediction and classification of data instances through training and testing data. Data imbalance, a large difference between the number of minority-class and majority-class instances, commonly occurs in data such as medical datasets. This imbalance causes bias in the synthesis of new instances, which tends to focus on the majority instances because of their density. Synthetic Minority Oversampling Technique (SMOTE) is an oversampling algorithm that uses linear interpolation to synthesize new instances and even out the number of instances per class. However, it has certain limitations: (a) it is prone to overfitting because it synthesizes instances by random linear interpolation; (b) it cannot determine the distribution density of the dataset; (c) it cannot prevent noise from being synthesized among the minority instances, which can lead to further synthesis of noisy instances and cause over-generalization. The proposed algorithm addresses these shortcomings of the existing SMOTE algorithm with three (3) techniques: (a) a triangle centroid method that synthesizes minority samples using triangular structures; (b) Heron's formula to calculate the sparsity of the minority dataset as a measure of its distribution; (c) triangular noise mitigation that takes advantage of the properties of a triangle to reduce the creation of noisy samples. Simulation results on publicly available medical datasets show that the proposed algorithm, Heron-Centroid SMOTE, improved on the existing SMOTE across all performance metrics and outperformed both the 'Imbalanced' tests and SMOTE on most datasets. The proposed algorithm achieved a mean accuracy of 0.8144, an average sensitivity of 0.9067, an average precision of 0.7160, a mean F-score of 0.5792, and an average G-mean of 0.6398.
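The geometric building blocks named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not say how triangles are selected or how the area feeds into sampling, so the function names (`smote_interpolate`, `triangle_centroid`, `heron_area`) and the idea of treating a larger triangle area as a sparser region are assumptions made here for illustration only.

```python
import math
import random


def smote_interpolate(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def triangle_centroid(a, b, c):
    """Centroid of the triangle formed by three minority samples."""
    return [(p + q + r) / 3.0 for p, q, r in zip(a, b, c)]


def heron_area(a, b, c):
    """Triangle area from side lengths only (Heron's formula).

    Works in any number of dimensions because it needs only the
    pairwise Euclidean distances between the three points.
    """
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    s = (ab + bc + ca) / 2.0  # semi-perimeter
    # Clamp at zero: rounding can make the product slightly negative
    # for nearly collinear points.
    return math.sqrt(max(s * (s - ab) * (s - bc) * (s - ca), 0.0))


if __name__ == "__main__":
    # Three hypothetical minority samples forming a 3-4-5 right triangle.
    p1, p2, p3 = [0.0, 0.0], [4.0, 0.0], [0.0, 3.0]
    print(triangle_centroid(p1, p2, p3))
    print(heron_area(p1, p2, p3))
    print(smote_interpolate(p1, p2, random.Random(0)))
```

Unlike the interpolation step, the centroid is deterministic for a given triple of samples, and the Heron area gives one number per triangle that could serve as a sparsity estimate (assumed here: larger area means a sparser region).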