# SMOTE, Undersampling, and Oversampling for Imbalanced Dataset

## Accuracy is often the only thing people ask of a machine learning model, yet it can be badly misleading. Many factors determine a model's accuracy, and the most important is the quality of the dataset. Preparing the data is the most fundamental step in building a machine learning model.

Understanding the data you are working with is essential to building a good machine learning model. Sometimes the classes in a dataset are far from evenly represented, which gives a false sense of accuracy. Such a dataset is known as an imbalanced dataset. Let's look at how this problem arises and how to address it.

## What Is an Imbalanced Dataset in Machine Learning?

Suppose you work as a machine learning engineer in a factory, and your task is to develop a model that identifies defective products. In most factories, faulty products are far rarer than good ones. So let's say that, on average, 99.3% of products are non-defective and 0.7% are defective.

In this kind of classification problem, we have many data points for non-defective products and far fewer for defective ones. This creates an imbalance in the dataset, but why is that a problem? Because even a one-line model that labels every product as non-defective will reach an accuracy of 99.3% on average, while catching no defective product at all.
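The effect is easy to reproduce. The following is a minimal sketch, assuming a made-up batch of 1,000 products that matches the 99.3% / 0.7% split above; the helper names `accuracy` and `recall` are illustrative, not taken from any library.

```python
# Hypothetical batch: 993 non-defective (0) and 7 defective (1) products.
y_true = [0] * 993 + [1] * 7

# The "one-line model": label every product as non-defective.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive (defective) items that were detected."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in positives) / len(positives)


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.3f}, recall on defective products = {rec:.3f}")
```

The accuracy looks excellent, yet the recall on the defective class is zero, which is why accuracy alone is a poor guide on imbalanced data.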

This example shows exactly why such datasets can produce misleading accuracy. There are a few methods for dealing with imbalance in a dataset:

1. Undersampling of the majority class
2. Oversampling, or blind copying, of the minority class
3. SMOTE

Let us learn them one by one, starting with undersampling to re-sample the data.

## Re-Sampling Using Undersampling Method

Let us first restate the problem: we have a dataset that contains a minority class, and we need to bring the dataset into balance. A minority class means that, in our example, the number of defective products is very small compared with the number of non-defective ones. Left untreated, this leads to misleading or unfair predictions from our machine learning model.


Suppose there are 10,000 data points, one per product, of which 9,900 are non-defective and 100 are defective. That is 99 percent non-defective and 1 percent defective data points.

In undersampling, we reduce the number of majority-class data points. Say we reduce the non-defective data points from 9,900 to 900. The total number of points becomes 1,000, of which 90 percent are non-defective and 10 percent are defective. In this way we move the dataset toward balance, a process known as resampling.
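The numbers above can be reproduced with a short sketch of random undersampling. It assumes each product is represented by an integer ID, and `random_undersample` is a hypothetical helper written for this illustration, not a library function.

```python
import random


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority points, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Hypothetical data: product IDs stand in for full feature rows.
non_defective = list(range(9_900))     # majority class
defective = list(range(9_900, 10_000)) # minority class, 100 points

kept_majority = random_undersample(non_defective, n_keep=900)

total = len(kept_majority) + len(defective)
majority_share = len(kept_majority) / total
print(total, majority_share)  # 1,000 points in total, 90 percent majority
```

Note that 9,000 of the 9,900 majority points are thrown away here, and which ones survive depends only on the random seed.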

The problem with this approach is clear: we discard a large share of the majority data points to gain some balance. In doing so we lose a lot of information that could have been valuable during training. The method also has a high chance of introducing bias into our model, because the retained sample may not represent the full majority class. Therefore, let's move to the next method and see whether it is more helpful for resampling.

## Random Oversampling In Machine Learning Dataset

At its core, there are only two ways to solve the problem. The first is to shrink the majority class, which is what undersampling does; the second is to grow the minority class. In this method, we increase the number of minority-class data points to bring the dataset into balance.

Take the same example as above: 10,000 observations in total, of which 9,900 are non-defective and 100 are defective. In oversampling, we replicate minority data points, so no information is lost.

Suppose we increase the number of defective data points to 1,100. The total rises to 11,000 observations, of which 90 percent are non-defective and 10 percent are defective. We do not lose any information in this process, but that does not mean the method is free of problems.
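Random oversampling can be sketched the same way. This is only an illustration under the same assumption that integer IDs stand in for data points; `random_oversample` is a hypothetical helper, not a library call.

```python
import random


def random_oversample(minority, n_target, seed=0):
    """Grow the minority class to n_target points by copying existing ones."""
    rng = random.Random(seed)
    # Draw the extra points with replacement from the original minority set.
    extra = rng.choices(minority, k=n_target - len(minority))
    return list(minority) + extra


# Hypothetical data: product IDs stand in for full feature rows.
non_defective_ids = list(range(9_900))      # majority class
defective_ids = list(range(9_900, 10_000))  # minority class, 100 points

oversampled = random_oversample(defective_ids, n_target=1_100)

total_points = len(non_defective_ids) + len(oversampled)
minority_share = len(oversampled) / total_points
distinct_minority = len(set(oversampled))
print(total_points, minority_share, distinct_minority)
```

The minority class now holds 1,100 entries, but `distinct_minority` is still 100: every added point is an exact copy of an existing one.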

Replicating minority data points can lead to overfitting: the model sees the exact same examples many times and may memorize them instead of learning a general pattern, and any noisy minority point is amplified with every copy. We need a method that is less prone to such bias and overfitting, and this is where SMOTE comes in.

## SMOTE

SMOTE, short for Synthetic Minority Oversampling Technique, is a method that helps our model learn a better decision boundary for the minority class. Instead of copying existing points, SMOTE creates new, synthetic minority data points that lie between existing minority examples that are close to each other, and in this way reduces the imbalance in the dataset.

Let us take a dataset with two features:

In the above figure, we can see that the minority class has far fewer points than the majority class, so we will apply SMOTE. For each minority-class data point, we find its nearest minority-class neighbors using k-nearest neighbors (KNN), which gives a map of lines connecting neighboring points. Let's magnify the nearest neighbors in the minority class:

Here each minority data point has three nearest neighbors, so the full map would look like this:

Now SMOTE creates synthetic data points along these connecting lines, which restores balance without discarding any information. After adding the synthetic points to the minority class, the dataset will look like this:

After applying SMOTE, the minority and majority classes are much closer to balance. This is how SMOTE works.
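The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not a reference implementation: the function name `smote_sketch`, the five two-feature minority points, and the choice of `k=3` neighbors are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority point as the base.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbors (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Place a new point at a random position on the line base -> neighbor.
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority class with two features per point.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5), (3.0, 1.5)]
new_points = smote_sketch(minority_points, n_synthetic=10)
print(len(new_points))
```

Unlike random oversampling, the new points are not copies: each one lies somewhere on a line segment between two existing minority points. For real projects, the `imbalanced-learn` package provides a `SMOTE` class with a `fit_resample` method, which is usually a better choice than a hand-written sketch like this one.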

## Conclusion

In this article, we learned why it matters to deal with an imbalanced dataset and looked at a few methods for doing so. We saw how undersampling differs from oversampling even though both aim at the same goal, a more balanced dataset. Finally, we walked through how SMOTE reduces the imbalance in a dataset.

### tanesh

Founder of Aipoint, a very creative machine learning researcher who loves playing with data.