Bias in statistics and machine learning

In statistics, and more specifically in machine learning, it is important to understand the limitations of our models.

During training, several problems can prevent an algorithm from learning adequately, producing errors that lead to poor predictions.

One of the most frequent of these errors is known as bias. A biased model produces results that deviate systematically from reality.

For this reason, it is important to diagnose and evaluate machine learning models once they are trained. If we detect an error, we can act and resolve it before putting a model into production, where it could affect the business.

In this article we will see what bias is, what a biased model looks like, and how to detect and deal with it to improve a model's accuracy.

What is bias in machine learning?

Bias can be thought of as the error of a model that has not taken into account all the information available in the dataset and is therefore too limited to make accurate predictions.

This is known as underfitting, and it occurs when the model is too simple for the problem it is trying to solve.
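
As a minimal sketch of what underfitting looks like in practice (using scikit-learn and synthetic data invented for this illustration), a straight line fitted to clearly non-linear data keeps a large error even on its own training set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic, clearly non-linear data (invented for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.5, size=200)

# A straight line cannot capture the quadratic pattern: high bias
model = LinearRegression().fit(X, y)
print("Training MSE:", mean_squared_error(y, model.predict(X)))
# The error stays large even on the training data itself,
# the typical symptom of a model that is too simple.
```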

There are different ways to detect bias. One of them is to plot learning curves (they are explained in more detail in the article on what overfitting is).

If the training curve and the validation curve converge with a small gap between them but both show a large error, the model is too simple and may be underfitting, which indicates a bias problem.
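
As a sketch of how this diagnosis could look with scikit-learn's learning_curve (reusing the synthetic X and y from the previous example):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="neg_mean_squared_error",
)

# Average the cross-validation folds and flip the sign back to MSE
train_err = -train_scores.mean(axis=1)
val_err = -val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_err, val_err):
    print(f"n={n:4d}  train MSE={tr:.2f}  val MSE={va:.2f}")
# Small gap between the two errors but both large: underfitting (bias).
```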

How to improve a biased model

Once we have detected the problem, we must act. To improve a biased model we can enrich the training set, above all by adding more informative features (adding more examples alone mainly reduces variance rather than bias). With more information available, the model can learn more complex patterns and underfitting is reduced.

However, data is often limited and obtaining more of it is not always possible.

Another option is to try other machine learning or deep learning techniques that allow for a more flexible, complex model.

Many artificial intelligence algorithms have tunable hyperparameters that can be adjusted to increase model complexity and thus decrease bias.
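
For example, in a decision tree the max_depth hyperparameter directly controls how complex the model can get. A sketch of such a sweep, again on the synthetic data from the earlier examples:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Deeper trees are more flexible, so bias should drop as depth grows
for depth in [1, 2, 4, 8]:
    scores = cross_val_score(
        DecisionTreeRegressor(max_depth=depth, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error",
    )
    print(f"max_depth={depth}: validation MSE={-scores.mean():.2f}")
# Push the depth too far, though, and variance takes over (see below).
```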

Data augmentation and synthetic data generation are another option. As mentioned above, we often cannot obtain more data than we already have because datasets are limited.

However, we can use different techniques to generate synthetic data. One of them is data augmentation, which is especially common with images: we can rotate, crop, zoom, or apply filters to generate new images from existing ones.
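
A sketch of such a pipeline using torchvision's transforms (the file name photo.jpg is a placeholder):

```python
from PIL import Image
from torchvision import transforms

# Hypothetical augmentation pipeline: rotation, crop/zoom, color filter, flip
augment = transforms.Compose([
    transforms.RandomRotation(degrees=20),
    transforms.RandomResizedCrop(size=224),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomHorizontalFlip(),
])

original = Image.open("photo.jpg")  # placeholder path
# Each call applies random transformations, yielding a new synthetic image
new_images = [augment(original) for _ in range(5)]
```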

For other types of data we can use interpolation-based algorithms such as SMOTE or ADASYN, known as oversampling techniques, which generate new minority-class samples by interpolating between existing ones.
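
A minimal sketch with the imbalanced-learn library, on a synthetic imbalanced dataset invented for the example:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic binary dataset where one class is only ~10% of the samples
X_imb, y_imb = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=0
)
print("Before:", Counter(y_imb))

# SMOTE creates new minority-class samples by interpolating
# between neighbouring existing samples
X_res, y_res = SMOTE(random_state=0).fit_resample(X_imb, y_imb)
print("After: ", Counter(y_res))
```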

Differences between bias and variance

When we try to solve a bias problem, we have to be careful not to end up with the opposite one. If we loosen the parameters too much and give the model too much flexibility, we can go from a model that is too simple (underfitting) to one that is too complex (overfitting). The latter is a model with high variance and low bias.

High variance means that the model we have built is too complex and too specific to our training data, so it has a hard time generalizing to data it has not seen during training.

For this reason, it is important to diagnose the models and try to achieve a balance between bias (underfitting) and variance (overfitting).
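
To see this trade-off concretely, here is a sketch with scikit-learn's validation_curve, sweeping the degree of a polynomial regression on the synthetic data from the earlier examples:

```python
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degrees = [1, 2, 3, 5, 10, 15]
train_scores, val_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5, scoring="neg_mean_squared_error",
)

for d, tr, va in zip(degrees,
                     -train_scores.mean(axis=1),
                     -val_scores.mean(axis=1)):
    print(f"degree={d:2d}  train MSE={tr:.3f}  val MSE={va:.3f}")
# Low degrees: both errors high (bias / underfitting).
# High degrees: training error keeps falling while validation error
# rises again (variance / overfitting). The balance point is where
# the validation error is lowest.
```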