What is a dataset

On the abdatum blog we have mentioned the term dataset in several articles. We data scientists have borrowed this word from English and use it constantly to talk about our data projects, whether in machine learning, business intelligence or big data.

In this article I will explain exactly what a dataset is, why it matters in data science, and where you can find example datasets so you can see what they look like and experiment with them.

What datasets are and why they matter

A dataset is, literally, a set of data. The data is normally tabulated in rows and columns to make the information easier to analyze.

In every artificial intelligence project, and in general in any project that uses data, getting a good dataset is the first step of the entire methodology. All data analysis algorithms are highly dependent on the quality of the information. If the data is wrong, the conclusions we draw will also be wrong.

For this reason, getting a reliable source of data is the most difficult part of data science. Data cleaning and transformation processes are often necessary to improve its quality and make the statistical models we generate more reliable.
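
As a rough illustration of what a cleaning step can look like, the sketch below normalizes a text column and drops rows whose numeric value cannot be parsed. The records, the column names and the `clean_rows` helper are all invented for this example; real projects usually need rules specific to their own data.

```python
# Hypothetical raw records: cities and prices are invented for illustration.
raw_rows = [
    {"city": " Madrid ", "price": "250000"},
    {"city": "Madrid", "price": ""},          # missing price
    {"city": "BARCELONA", "price": "310000"},
    {"city": "Barcelona", "price": "n/a"},    # unparseable price
]


def clean_rows(rows):
    """Normalize the city name and drop rows whose price cannot be parsed."""
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        try:
            price = float(row["price"])
        except ValueError:
            continue  # skip rows with a missing or malformed price
        cleaned.append({"city": city, "price": price})
    return cleaned


clean = clean_rows(raw_rows)
print(clean)
```

Dropping bad rows is only one possible policy; depending on the analysis, filling in missing values may be a better choice.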

Types of datasets

We can distinguish several types of datasets depending on how they are structured and stored.

Files

There are several file formats that allow you to save data. Some of the most used formats for datasets are .csv and .tab. Most data analysis tools accept these files as data sources.

Excel formats such as .xlsx are also files that can act as a dataset for a big data or data analysis project.
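
To make this concrete, here is a minimal sketch that reads the same small table from comma-separated and tab-separated text using Python's standard `csv` module. The sample values and the `read_table` helper are assumptions made for the example; with a real file you would pass an opened file object instead of `io.StringIO`.

```python
import csv
import io

# Sample contents invented for illustration; a real file would be opened
# with open("some_file.csv", newline="") instead.
csv_text = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n7.0,3.2,versicolor\n"
tab_text = "sepal_length\tsepal_width\tspecies\n5.1\t3.5\tsetosa\n7.0\t3.2\tversicolor\n"


def read_table(text, delimiter):
    """Parse delimited text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_table(csv_text, ",")
tab_rows = read_table(tab_text, "\t")

print(csv_rows[0]["species"])
print(csv_rows == tab_rows)  # both formats hold the same rows
```

Note that the `csv` module returns every value as a string, so numeric columns still need to be converted before analysis.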

Websites

Websites can be used to store data. The information is saved on the server where the website is hosted and we can access the page and extract the information we need to analyze.
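
The sketch below shows one way to pull a table out of a page's HTML using only Python's standard `html.parser` module. The page content and the `TableExtractor` class are hypothetical; a real page would first have to be downloaded (for example with `urllib.request`), and its markup is usually messier than this.

```python
from html.parser import HTMLParser

# Hypothetical page content, invented for illustration.
page = """
<html><body>
<table>
  <tr><th>country</th><th>population</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>250</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


extractor = TableExtractor()
extractor.feed(page)
header, *records = extractor.rows
print(header)
print(records)
```

Before extracting data from a real website, check its terms of use and whether it already offers the data as a downloadable file.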

Databases

Databases are the most optimized way to store our datasets. When the data has a tabular structure, we normally use so-called relational databases, which rely on the relational model to establish relationships between the different tables stored in the database.

4 of the most popular datasets in data science
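
As a small sketch of the relational idea, the example below creates two related tables in an in-memory SQLite database and combines them with a join. The table names, columns and values are invented for illustration.

```python
import sqlite3

# In-memory database; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Luis');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0);
""")

# The join follows the relationship orders.customer_id -> customers.id.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

print(totals)
```

The result of the query is itself a small table, which is why relational databases fit naturally with tabular datasets.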

  1. Iris dataset: This dataset is widely used in machine learning for testing. It contains four measurements (sepal length, sepal width, petal length and petal width) for flowers of 3 different iris species.
  2. COCO dataset: COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset published by Microsoft. Its goal is to support image recognition work, and computer vision teams use it to train and test their models.
  3. MNIST dataset: A large dataset of images of handwritten digits. It has commonly been used to test multiclass classification techniques. Machine learning models tested on it include linear classifiers, support vector machines, deep neural networks, convolutional neural networks and random forests. It has also been used to test generative models such as generative adversarial networks and autoencoders.
  4. Boston housing dataset: This dataset has been widely used to benchmark different artificial intelligence models. It contains information on housing in the Boston area, and the value usually predicted is the house price.

Where to find free public datasets

If you've come this far, you're probably wondering where you can find real datasets so you can see what they look like and run tests with them. Below are 4 websites where you will find all kinds of free public datasets.

Google Dataset Search

This website offers a search engine where we can enter the kind of information we want a dataset to contain. Google returns the places where it has found matching data. It's a good place to start looking for datasets to play and experiment with.

Kaggle

Kaggle is a platform that hosts machine learning competitions to see who can build the best model for a given problem. Most problems come with their own dataset, which you can download for free.

GitHub

GitHub is a platform specialized in hosting code repositories. However, many of its users also use it to upload other relevant information. Some repositories contain lists of free public datasets that we can download. One of them is Awesome Public Datasets.

FiveThirtyEight

FiveThirtyEight is a website that uses data to inform its readers. So that everyone can verify that what they say is correct, they publish the datasets they use to analyze current events in the United States. You can access this information and use it as a dataset for data analysis or machine learning tests.

Difference between dataframe and dataset

I will use the last section of this article about datasets to clarify a question that I have been asked several times. What is the difference between dataframe and dataset?

We have explained that datasets are simply sets of data that are normally stored in a tabular structure, either in a file, on a website or in a database.

Dataframes are programming objects used in languages such as R or Python. Normally, when we import data into a data analysis package, it transforms the dataset into an abstract internal representation, which many packages call a dataframe.

Simply put, a dataframe is an in-memory representation of a dataset.
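
A short sketch of this distinction, assuming the pandas package is installed: the text below plays the role of the dataset (it could equally be a file, a web page or a database table), and `pd.read_csv` turns it into a dataframe object we can work with in code. The sample values are invented for illustration.

```python
import io

import pandas as pd

# The dataset: plain tabular text, here held in a string for convenience.
dataset_text = "species,petal_length\nsetosa,1.4\nversicolor,4.7\nvirginica,6.0\n"

# The dataframe: the in-memory object pandas builds from that dataset.
df = pd.read_csv(io.StringIO(dataset_text))

print(type(df).__name__)           # name of the object's class
print(df.shape)                    # (number of rows, number of columns)
print(df["petal_length"].mean())   # column operations are available directly
```

The same dataset could be loaded into a different dataframe implementation, for example in R, without the underlying data changing.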