A Comparative Study and Impact Analysis of Different Oversampling Techniques for CIP

Published Sep 14, 2021
Dibyajyoti Bora

Abstract

In data science, machine-learning-based classification techniques are the first choice for accurate analysis of large amounts of data. Most machine learning and data mining algorithms assume that the data are equally distributed across classes. Many real-world datasets and applications, however, are imbalanced, and learning from such data is biased toward the majority class. Class imbalance occurs when the observations in the training data belonging to one class substantially outnumber the observations in the other classes, e.g., insurance claims, forest cover types, fraud detection, rare medical disease diagnosis, or rare variety classification [1]. It also arises in real-world applications such as text and video mining, detection of oil spills in satellite radar images, activity recognition, and detection of fraudulent telephone calls [2], [3]. Imbalanced datasets, also called skewed data, involve rare classes, and the skew can reduce the performance of classification algorithms. Class imbalance influences the performance achieved by existing learning systems, which may have difficulty learning the concept related to the minority class. The machine learning community appears to agree that class imbalance is the major obstacle to inducing classifiers in imbalanced domains. Imbalance problems can be of two types: between-class and within-class imbalance. In between-class imbalance, the imbalance exists between the samples of two classes; in within-class imbalance, the majority samples outnumber the minority samples. The first requirement in developing classification techniques is robustness, which is needed for accurate and efficient classification. However, these algorithms often suffer from the “class-imbalance problem” (CIP).
Besides class imbalance, the degree of data overlap among the classes is another factor that leads to a decrease in the performance of learning algorithms. Because of CIP, many difficulties arise during the learning process, which together result in poor classification. Resampling the dataset is one common technique for dealing with CIP; in general, the rare class is oversampled to increase its size. Several oversampling techniques are available in the literature, of which SMOTE, ADASYN, and random oversampling are the most notable. In this paper, we compare these techniques and their impact on classification performance.
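
To make the idea of oversampling the rare class concrete, the following is a minimal Python sketch of two of the named techniques: random oversampling, which duplicates existing minority samples, and a simplified SMOTE-style interpolation, which creates synthetic points between a minority sample and one of its nearest minority neighbours. The function names, the toy two-dimensional data, and the choice of k = 3 are illustrative assumptions and are not taken from the paper; ADASYN, which additionally weights how many synthetic points each minority sample receives, is not shown.

```python
import random


def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra


def smote_like(minority, target_size, k, rng):
    """Simplified SMOTE-style oversampling (illustrative, not a reference
    implementation): interpolate between a minority sample and one of its
    k nearest minority neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return list(minority) + synthetic


# Toy data (assumed for illustration): 50 majority and 5 minority points.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(5)]

ros = random_oversample(minority, len(majority), rng)
smt = smote_like(minority, len(majority), k=3, rng=rng)
print(len(majority), len(ros), len(smt))
```

Both functions return a minority set as large as the majority set. The difference is that random oversampling only repeats existing points, whereas the SMOTE-style variant adds new points that lie on line segments between existing minority points.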

For our comparative study, we used seven datasets from the UCI repository: Yeast, PIMA diabetes, Ionosphere, Abalone, YeastME1, YeastME2, and YeastEXC. Table 1 shows, for each dataset, the number of observations, the numbers of positive and negative observations, and the imbalance ratio.

Table 1: Characteristics and imbalance ratio of the datasets

Dataset       Number of Observations   Number of Positive   Number of Negative   Imbalance Ratio
Yeast         514                      51                   463                  9.078431373
PIMA          768                      268                  500                  1.865671642
Ionosphere    351                      126                  225                  1.785714286
Abalone       4177                     62                   4115                 66.37096774
YeastME1      1484                     44                   1440                 32.72727273
YeastME2      1494                     51                   1443                 28.29411765
YeastEXC      1484                     35                   1449                 41.4
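
The imbalance ratios in Table 1 appear to be the number of negative (majority) observations divided by the number of positive (minority) observations. This is inferred from the figures rather than stated in the text. A minimal Python sketch under that assumption, with a hypothetical helper name, reproduces the column from the counts:

```python
# (dataset, number of positive, number of negative), copied from Table 1
datasets = [
    ("Yeast", 51, 463),
    ("PIMA", 268, 500),
    ("Ionosphere", 126, 225),
    ("Abalone", 62, 4115),
    ("YeastME1", 44, 1440),
    ("YeastME2", 51, 1443),
    ("YeastEXC", 35, 1449),
]


def imbalance_ratio(n_positive, n_negative):
    """Assumed definition: negative (majority) count over positive (minority) count."""
    return n_negative / n_positive


for name, pos, neg in datasets:
    # Total observations are the sum of the positive and negative counts.
    print(f"{name:<12}{pos + neg:>6}{imbalance_ratio(pos, neg):>12.4f}")
```

Under this definition a ratio near 1 indicates a nearly balanced dataset, while Abalone, at roughly 66 negatives per positive, is the most skewed of the seven.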

 

How to Cite

Bora, D. (2021). A Comparative Study and Impact Analysis of Different Oversampling Techniques for CIP. SPAST Abstracts, 1(01). Retrieved from https://spast.org/techrep/article/view/458

Keywords

Data Science, Goals, SDG, Computer Science, Information Technology, AI, Machine Learning

References
[1] Leevy, J.L., Khoshgoftaar, T.M., Bauder, R.A. et al. A survey on addressing high-class imbalance in big data. J Big Data 5, 42 (2018). https://doi.org/10.1186/s40537-018-0151-6
[2] Liu, Y., Loh, H.T., Sun, A. Imbalanced text classification: A term weighting approach. Expert Systems with Applications (2007).
[3] Kubat, M., Holte, R.C., Matwin, S. Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Kluwer Academic Publishers (February 1998). https://doi.org/10.1023/A:1007452223027
Section
GE3- Computers & Information Technology