What is overfitting

Overfitting is a property of a statistical model that tells us it will not be able to generalize adequately to data it has not been trained on.

Overfitting happens when, in building a machine learning model, the method used gives too much flexibility to the parameters and ends up producing a model that fits the training data perfectly but cannot perform the basic function of a statistical model: generalizing to new information.

Overfitting is one of the main problems in machine learning, and in artificial intelligence in general. If we are not able to detect it, our model will be of very poor quality even if it achieves good prediction results on the training set.

There are different ways to detect and avoid overfitting in a prediction model. Most of these techniques consist of reducing the complexity of the model so that it fits the training set less tightly and can generalize to new observations.

How to detect overfitting

As we have seen, diagnosing our model should be a mandatory step before putting it into production. Otherwise, its predictions may be inaccurate and our project may not work as it should. Below are some checks we can perform to detect the degree of overfitting or underfitting our statistical model suffers from.

Bias-variance trade-off

It is important to understand the concept of the balance between bias and variance in machine learning. Our goal is to construct a function f' that is as close as possible to the original function f that models the behavior of our data.

When we train a model, basically what we are doing is building the function f' from the data we use as input.

The variance represents how much the function f' changes when the training set changes. If it changes a lot, we say the statistical model has high variance, so it most likely suffers from overfitting: it models the training data perfectly, but fails when generalizing to data it has never seen.

Bias can be seen as the opposite. If the function f' stays practically the same when we use different training sets, the model has low variance and high bias. This indicates underfitting: the model is too simple and fits neither the training data nor the validation data well.

What we should aim for when building our models is a balance between variance and bias.
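To make the trade-off concrete, here is a minimal sketch (assuming NumPy and scikit-learn are installed) that fits polynomials of increasing degree to noisy samples of a known function f, chosen here as sin(2πx) purely for illustration. The low degree underfits (high bias) and the high degree overfits (high variance):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

# Noisy training samples of the "true" function f(x) = sin(2*pi*x).
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

# Clean validation points to measure generalization.
X_val = np.linspace(0, 1, 100).reshape(-1, 1)
y_val = np.sin(2 * np.pi * X_val).ravel()

for degree in (1, 4, 15):  # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
```

The degree-15 fit typically shows a very low training error together with a much larger validation error, which is exactly the high-variance signature described above.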

In the next section we will teach you a technique with which you can visually detect these problems.

Learning curves

Learning curves are one of the best methods to diagnose possible overfitting (high variance) or underfitting (high bias) problems in our model.

In a typical learning curve plot, the vertical axis shows an error metric, for example the MSE (Mean Squared Error), while the horizontal axis shows increasing sizes of the training set.

The learning curves tell us how the model's error varies with the size of the training dataset.

In the case of overfitting, or high variance, the plot shows a large gap between the validation curve and the training curve. This is because the model fits the training data very well, so the error on the training set is very low; however, since it cannot generalize, the error on the validation set is much larger. The following graph shows the typical learning curves of an overfitted model.

[Figure: learning curves of a model with overfitting]

When we have high bias, or underfitting, the gap between the two curves is very small. Furthermore, the error is high on both the validation set and the training set. This indicates that the model is too simple and does not fit the data well. In this case we would need to increase the model's complexity or train it for longer.

[Figure: learning curves of a model with underfitting]

How to solve overfitting

Overfitting is a very common problem that data scientists must deal with constantly. Below we show some of the techniques most widely used around the world to eliminate overfitting and improve model generalization.

1. Simplification of the model

The first step is to reduce the complexity of the model. How to do this depends on the machine learning method used.

In neural networks we can reduce the number of layers or neurons. We can also use regularization techniques such as dropout or early stopping.
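A minimal sketch of these two techniques using Keras (an illustrative choice; the article does not prescribe a library), with arbitrary layer sizes, dropout rate, and patience:

```python
import numpy as np
from tensorflow import keras

# Toy data standing in for a real training set.
rng = np.random.RandomState(0)
X = rng.rand(500, 20)
y = (X.sum(axis=1) > 10).astype(int)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),  # randomly silences 30% of units each training step
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping halts training once the validation loss stops improving,
# keeping the weights from the best epoch.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(X, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```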

If we are using decision trees, we can apply a technique known as pruning.
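As an illustration, scikit-learn's decision trees support cost-complexity pruning through the ccp_alpha hyperparameter; the dataset and the alpha value below are arbitrary choices for the sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree tends to memorize the training set.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 collapses branches that add little value, trading a bit
# of training accuracy for better generalization.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("full tree:  ", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("pruned tree:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))
```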

In other cases, such as Support Vector Machines (SVM) or regression techniques, regularization is achieved through the models' hyperparameters, which add restrictions and limit their flexibility.
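For instance, in scikit-learn the C hyperparameter of an SVM and the alpha of ridge regression control how strongly the model is constrained (the values below are only illustrative):

```python
from sklearn.svm import SVC
from sklearn.linear_model import Ridge

# A large C gives the SVM more flexibility and makes overfitting more
# likely; a small C imposes stronger regularization.
svm_flexible = SVC(C=100.0)
svm_regularized = SVC(C=0.1)

# In ridge regression, a larger alpha shrinks the coefficients more,
# yielding a simpler, less flexible model.
ridge_weak = Ridge(alpha=0.01)
ridge_strong = Ridge(alpha=10.0)
```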

2. Data augmentation techniques

These techniques consist of generating new data from existing data. For example, in an image dataset, transformations that generate new samples include translations, rotations, scaling, filters, and changes in lighting.
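A minimal sketch of some of these transformations using Keras preprocessing layers (an illustrative choice of library; the factors are arbitrary):

```python
from tensorflow import keras

# Random transformations applied to each image during training, so the
# model sees shifted, rotated, zoomed, and flipped variants every epoch.
augmentation = keras.Sequential([
    keras.layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
    keras.layers.RandomRotation(factor=0.1),
    keras.layers.RandomZoom(height_factor=0.2),
    keras.layers.RandomFlip("horizontal"),
])

# Placed as the first block of a model, these layers are only active
# during training and pass images through unchanged at inference time:
# model = keras.Sequential([augmentation, base_model, ...])
```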

3. Eliminate noise from the training set

In some cases, overfitting may be due to poor data cleaning. When we receive raw data, we must perform what is known as data cleaning: removing outliers, standardizing the data, and discarding information that could add noise to our modeling.

By carrying out data cleaning processes we can reduce the variance, improving the final results.
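A minimal sketch of such a cleaning step, assuming pandas and scikit-learn, with a synthetic DataFrame standing in for the raw data: rows farther than three standard deviations from the column mean are dropped, and the rest are standardized.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic raw data with one injected outlier.
rng = np.random.RandomState(0)
df = pd.DataFrame({"x1": rng.normal(0, 1, 200), "x2": rng.normal(5, 2, 200)})
df.loc[0, "x1"] = 40.0

# Drop rows lying more than 3 standard deviations from the column mean.
z = (df - df.mean()) / df.std()
clean = df[(z.abs() < 3).all(axis=1)]

# Standardize the remaining data to zero mean and unit variance.
scaled = StandardScaler().fit_transform(clean)
print(f"{len(df)} rows -> {len(clean)} rows after outlier removal")
```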

4. Get more observations

Getting more data can help solve the problem. However, it may not be enough, and we may have to turn to one of the other methodologies in this section.

5. Transfer learning techniques

In some cases, the overfitting problem may be due to how little data we have, and it may not be possible to acquire more data to enlarge the dataset.

At this point we can resort to other kinds of solutions, such as transfer learning. This consists of taking an already trained, functional model that performs a similar task and retraining it with our small dataset.
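A minimal sketch with Keras (one possible choice; the article does not prescribe a framework): a network pre-trained on ImageNet is frozen and used as a feature extractor, and only a small new head is trained on our data. The dataset name in the last line is a placeholder.

```python
from tensorflow import keras

# Pre-trained convolutional base, without its original classification head.
base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pre-trained weights

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation="sigmoid"),  # new head for our task
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(our_small_dataset, epochs=10)  # `our_small_dataset` is a placeholder
```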

Implementation in Python

In this article on the abdatum blog we have seen what overfitting is, how to detect it, and how to solve it. We have also explained that learning curves are one of the best methods to diagnose a machine learning model. But how do we create learning curves?

We can do it manually. However, the Python package sklearn includes the learning_curve function inside model_selection. We pass it the estimator we will use to build the model and the training dataset.

From the information it returns we can plot the curves with the matplotlib library, as in the following sketch.
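A minimal sketch putting the two pieces together; the estimator (ridge regression) and the dataset are illustrative choices, not the only options:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = load_diabetes(return_X_y=True)

# learning_curve trains the estimator on increasing fractions of the data
# and cross-validates each one.
train_sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign to plot the error.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

plt.plot(train_sizes, train_mse, "o-", label="training error")
plt.plot(train_sizes, val_mse, "o-", label="validation error")
plt.xlabel("training set size")
plt.ylabel("MSE")
plt.legend()
plt.show()
```

A large, persistent gap between the two curves points to overfitting; two high curves that sit close together point to underfitting, as discussed above.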