A Comparative Study and Impact Analysis of Different Oversampling Techniques for CIP
Abstract
In data science, machine-learning classification techniques are the first choice for accurate analysis of large volumes of data. However, most machine-learning and data-mining algorithms assume that the data are evenly distributed across classes, whereas many real-world datasets are imbalanced and therefore biased towards the majority class. Class imbalance arises when the observations in the training data belonging to one class substantially outnumber those in the other classes, e.g. insurance claims, forest cover types, fraud detection, rare medical disease diagnosis, or rare-variety classification [1]. The same pattern appears in real-world tasks such as text and video mining, detection of oil spills in satellite radar images, activity recognition, and detection of fraudulent telephone calls [2], [3]. Imbalanced datasets deal with rare classes; such skewed data can degrade the performance of classification algorithms. Class imbalance affects the performance of existing learning systems, which may struggle to learn the concept associated with the minority class. The machine-learning community appears to concur that class imbalance is the major obstacle when inducing classifiers in imbalanced domains. Imbalance problems can be considered of two types: between-class and within-class. In between-class imbalance, the samples of one class substantially outnumber those of another class; in within-class imbalance, some subconcepts of a class are represented by far fewer samples than others. The first requirement when developing such classification techniques is robustness, so that classification is both accurate and efficient. In practice, however, these algorithms often suffer from the "class-imbalance problem", shortly CIP.
Besides class imbalance, the degree of data overlap among the classes is another factor that degrades the performance of learning algorithms. Due to CIP, many difficulties arise during the learning process, which as a whole result in poor classification. Resampling the dataset is one common technique for dealing with CIP, generally by oversampling the rare class. Several oversampling techniques are available in the literature; SMOTE, ADASYN, and random oversampling are the noted ones. In this paper, an effort is made to compare these techniques as well as their impact on classification performance.
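As a concrete illustration of the oversampling idea, the interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. The function name `smote_oversample` and its parameters are ours; this is a didactic sketch of the technique, not the reference implementation (in practice one would use a library such as imbalanced-learn):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each minority sample and one of its k nearest minority neighbours
    (the core idea of SMOTE); a didactic sketch, not library code."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    # Indices of the k nearest minority neighbours of every minority sample.
    nn = np.argsort(d, axis=1)[:, :min(k, n - 1)]
    samples = []
    for _ in range(n_new):
        i = rng.integers(n)                   # pick a minority sample
        j = nn[i, rng.integers(nn.shape[1])]  # pick one of its neighbours
        gap = rng.random()                    # interpolation factor in [0, 1]
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(samples)
```

Random oversampling, by contrast, simply duplicates existing minority samples, while ADASYN extends SMOTE by generating more synthetic samples near minority points that are harder to learn.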
For our comparative study, we used seven datasets from the UCI repository: Yeast, Pima diabetes, Ionosphere, Abalone, YeastME1, YeastME2, and YeastEXC. The imbalance ratio of each dataset, along with the number of observations and the numbers of positive and negative samples, is shown in Table 1.
Table 1: Characteristics and imbalance ratio of the datasets

Dataset     | Observations | Positive | Negative | Imbalance Ratio
----------- | ------------ | -------- | -------- | ---------------
Yeast       |          514 |       51 |      463 |            9.08
PIMA        |          768 |      268 |      500 |            1.87
Ionosphere  |          351 |      126 |      225 |            1.79
Abalone     |         4177 |       62 |     4115 |           66.37
YeastME1    |         1484 |       44 |     1440 |           32.73
YeastME2    |         1494 |       51 |     1443 |           28.29
YeastEXC    |         1484 |       35 |     1449 |           41.40
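The imbalance ratio in Table 1 is the count of majority (negative) samples divided by the count of minority (positive) samples. This can be verified directly from the table's counts (the function name `imbalance_ratio` is ours, for illustration):

```python
# Imbalance ratio as used in Table 1: majority count / minority count.
def imbalance_ratio(n_negative, n_positive):
    return n_negative / n_positive

# (negative, positive) counts taken from Table 1.
datasets = {
    "Yeast":      (463, 51),
    "PIMA":       (500, 268),
    "Ionosphere": (225, 126),
    "Abalone":    (4115, 62),
    "YeastME1":   (1440, 44),
    "YeastME2":   (1443, 51),
    "YeastEXC":   (1449, 35),
}
for name, (neg, pos) in datasets.items():
    print(f"{name}: {imbalance_ratio(neg, pos):.2f}")
```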
Keywords: Data Science, Goals, SDG, Computer Science, Information Technology, AI, Machine Learning
[2] Ying Liu, Han Tong Loh and Aixin Sun, "Imbalanced text classification: A term weighting approach", Expert Systems with Applications, 2007.
[3] Miroslav Kubat, Robert C. Holte and Stan Matwin, "Machine Learning for the Detection of Oil Spills in Satellite Radar Images", Kluwer Academic Publishers, February 1998, https://doi.org/10.1023/A:1007452223027.