In data science, machine learning based classification techniques are the first choice for accurately analysing large amounts of data. Most machine learning and data mining algorithms assume that the data are equally distributed across classes. In practice, however, most datasets and applications are imbalanced, and such data bias the learner towards the majority class. Class imbalance arises when the observations in the training data belonging to one class substantially outnumber the observations in the other classes, as in insurance claims, forest cover types, fraud detection, rare medical disease diagnosis, and rare-variety classification. It also appears in real-world tasks such as text and video mining, detection of oil spills in satellite radar images, activity recognition, and detection of fraudulent telephone calls. Imbalanced datasets thus deal with rare classes; such skewed data can reduce the performance of classification algorithms. Class imbalance degrades the performance achieved by existing learning systems, which may have difficulty learning the concept related to the minority class. The machine-learning community appears to concur that class imbalance is the major obstacle to inducing classifiers in imbalanced domains. Imbalanced problems can be considered of two types: between-class and within-class imbalance. In between-class imbalance, the imbalance exists between two classes of samples, whereas in within-class imbalance some sub-concepts within a single class are represented by far fewer samples than others. The first requirement when developing such classification techniques is robustness, so that classification is both accurate and efficient. However, it is seen that these algorithms often suffer from the "class-imbalance problem", shortly CIP.
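The effect of skew on a naive learner can be illustrated with a small sketch (the 95:5 split below is illustrative, not taken from the paper): a classifier that always predicts the majority class scores high accuracy while never detecting a single minority example.

```python
import numpy as np

# Illustrative skewed label vector: 950 majority (0) vs 50 minority (1) samples.
y = np.array([0] * 950 + [1] * 50)

# A trivial "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y)

accuracy = np.mean(y_pred == y)                 # looks excellent: 0.95
minority_recall = np.mean(y_pred[y == 1] == 1)  # but recall on the rare class is 0.0

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

This is why plain accuracy is a misleading metric under class imbalance, and why resampling or class-sensitive learning is needed.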
Besides class imbalance, the degree of overlap among the classes is another factor that decreases the performance of learning algorithms. Due to the CIP, many difficulties arise during the learning process, and the overall result is poor classification. Resampling the dataset is one common technique for dealing with the CIP; in general, the rare class is oversampled. Several oversampling techniques are available in the literature, of which SMOTE, ADASYN, and random oversampling are the noted ones. In this paper, an effort is made to compare these techniques and their impact on classification performance.
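As a rough sketch of SMOTE's core idea (not a reference implementation): each synthetic minority sample is created by interpolating a minority point toward one of its k nearest minority neighbours. The helper name and the tiny minority set below are made up for illustration.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=1, rng=None):
    """Generate n_new synthetic points by interpolating randomly chosen
    minority samples toward one of their k nearest minority neighbours."""
    rng = rng or np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Distances from x to every minority sample; skip index 0 (x itself).
        d = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(x + gap * (X_min[j] - x))
    return np.array(synthetic)

# Tiny illustrative minority class of six two-dimensional points.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                  [1.1, 1.2], [1.3, 1.0], [1.0, 0.8]])
X_syn = smote_sample(X_min, k=3, n_new=4)
print(X_syn.shape)  # four new synthetic minority samples in 2-D
```

Because each synthetic point lies on the segment between two existing minority points, SMOTE enlarges the minority region rather than merely duplicating samples, which is what distinguishes it from random oversampling; ADASYN follows the same interpolation idea but generates more samples near hard-to-learn minority points.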
For our comparative study, we have used seven datasets from the UCI repository: yeast, pima diabetes, ionosphere, abalone, yeastme1, yeastme1, and yeastexc. The imbalanced ratio of each dataset, together with the number of observations and the numbers of positive and negative samples, is shown in Table 1.
Table 1: Characteristics and Imbalanced Ratio of the datasets (columns: Dataset, Number of Observations, Number of Positive, Number of Negative, Imbalanced Ratio)
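The imbalanced ratio reported in Table 1 is conventionally the majority-class (negative) count divided by the minority-class (positive) count. A minimal sketch, with illustrative counts that are not taken from the table:

```python
def imbalance_ratio(n_negative: int, n_positive: int) -> float:
    """Imbalanced ratio = majority-class count / minority-class count."""
    return n_negative / n_positive

# Illustrative dataset with 1400 negative and 51 positive samples.
print(f"IR = {imbalance_ratio(1400, 51):.2f}")  # IR = 27.45
```

An IR of 1.0 means a perfectly balanced dataset; the larger the IR, the more the learner is biased towards the majority class.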
Keywords: Data Science, Goals, SDG, Computer Science, Information Technology, AI, Machine Learning
References
Ying Liu, Han Tong Loh and Aixin Sun, "Imbalanced text classification: A term weighting approach", Expert Systems with Applications, 2007.
Miroslav Kubat, Robert C. Holte and Stan Matwin, "Machine Learning for the Detection of Oil Spills in Satellite Radar Images", Kluwer Academic Publishers, February 1998, https://doi.org/10.1023/A:1007452223027.