Exploratory data analysis
Data exploration is normally the first step before moving on to more advanced techniques such as inferential statistics or machine learning.
Exploratory data analysis, also known as EDA, is a branch of statistics that uses graphical and numerical tools to describe the main characteristics of a dataset.
Graphs and summary metrics condense the data of interest so that we can draw initial conclusions about how the variables are distributed and how they relate to one another, including possible correlations.
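As a small illustration of the kind of metrics used at this stage, the sketch below summarizes a list of values using only Python's standard library. The `orders` data is made up for the example.

```python
import statistics

# Hypothetical sample: daily order counts (made-up values for illustration)
orders = [12, 15, 11, 14, 13, 40, 12, 16, 14, 13]

summary = {
    "count": len(orders),
    "mean": statistics.mean(orders),
    "median": statistics.median(orders),
    "stdev": statistics.stdev(orders),  # sample standard deviation
    "min": min(orders),
    "max": max(orders),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Here the mean (16) sits well above the median (13.5), which is already a hint that a single large value is pulling the average up.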
Graphic techniques for data exploration
The first step when we start analyzing a new dataset is to plot its variables to begin to understand what information we can extract from them.
Some of the basic exploration and analysis techniques are the following:
Box plots
A box plot summarizes the distribution of a variable as a box with two whiskers.
The box spans the first to the third quartile, a line inside it marks the median, the whiskers extend to the most extreme values that are not considered outliers, and the outliers are drawn as individual points. Some variants also mark the mean, but a standard box plot does not show the standard deviation. This type of graph gives us a first view of the shape of the data and how it is distributed within our dataset.
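To make the elements of a box plot concrete, here is a minimal sketch that computes the numbers such a plot draws, using only Python's standard library. It assumes the common convention of treating points more than 1.5 times the interquartile range beyond the quartiles as outliers; plotting libraries may use slightly different quartile methods. The function name `box_plot_stats` and the sample data are made up for the example.

```python
import statistics

def box_plot_stats(values):
    """Return the numbers a conventional box plot draws."""
    data = sorted(values)
    # Quartiles: the box edges (q1, q3) and the line inside the box (median)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: fences at 1.5 * IQR beyond the quartiles
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme values inside the fences
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical sample: daily order counts (made-up values)
orders = [12, 15, 11, 14, 13, 40, 12, 16, 14, 13]
stats = box_plot_stats(orders)
for name, value in stats.items():
    print(f"{name}: {value}")
```

With this sample, the value 40 falls outside the upper fence and is reported as an outlier, while the whiskers run from 11 to 16.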
Histograms
Histograms describe a numeric variable using bars: the range of values is split into intervals (bins), and the area of each bar is proportional to the number of values that fall in its interval.
There are different types of histograms (for example frequency, relative frequency and cumulative), each suited to a different question about the data.
It is highly recommended to use this type of visualization to understand our variables during the initial phases of data exploration and analysis.
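The binning behind a histogram can be sketched in a few lines of plain Python. This is a simplified version that assumes equal-width bins and prints a text chart instead of drawing one; the function name `histogram` and the `ages` data are made up for the example.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("all values are equal; there is no range to split")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical sample: ages of survey respondents (made-up values)
ages = [22, 25, 27, 31, 33, 34, 35, 38, 41, 42, 45, 52, 58, 61]
edges, counts = histogram(ages, bins=4)

# One text bar per bin; the bar length is the frequency
for i, count in enumerate(counts):
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f} | {'#' * count}")
```

Because the bins have equal width here, bar length and bar area are proportional, so the text bars can be read the same way as a drawn histogram.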
Heat maps
Heat maps are used in many sectors to show the magnitude of a variable through color. A common color scale runs from blue to red, with blue for the lowest values and red for the highest.
This type of visualization is used in fields such as molecular biology, to show the expression level of genes, and digital marketing, to find out which parts of a website users interact with the most.
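The core of a heat map is the mapping from a value to a color. The following sketch assumes a simple linear blue-to-red scale expressed as RGB tuples; real plotting libraries offer many other color scales. The function `to_color` and the `clicks` grid are hypothetical and exist only for this example.

```python
def to_color(value, low, high):
    """Map a value to an (R, G, B) tuple on a linear blue-to-red scale."""
    # t is 0.0 at the lowest value and 1.0 at the highest
    t = 0.0 if high == low else (value - low) / (high - low)
    return (round(255 * t), 0, round(255 * (1 - t)))

# Hypothetical click counts per region of a web page
# (rows run from the top of the page to the bottom)
clicks = [
    [120, 80, 15],
    [60, 200, 30],
    [5, 40, 10],
]

low = min(min(row) for row in clicks)
high = max(max(row) for row in clicks)
colors = [[to_color(v, low, high) for v in row] for row in clicks]

for row in colors:
    print(row)
```

The cell with the fewest clicks comes out pure blue and the cell with the most comes out pure red, with the rest falling in between.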
Scatter plots
This type of graph lets us study the relationship between pairs of variables (x, y) as a cloud of points. It shows whether two variables are related through a direct (positive) or inverse (negative) correlation, that is, whether one tends to increase or decrease as the other increases.
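What the eye picks up in a scatter plot can also be put into a number. The sketch below computes the Pearson correlation coefficient by hand for two made-up pairs of variables; a value near +1 suggests a direct linear relationship and a value near -1 an inverse one. The function name `pearson` and all data are assumptions for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: study hours vs. exam score (tends to rise together)
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 74]

# Hypothetical data: price vs. units sold (one rises as the other falls)
price = [10, 12, 14, 16, 18, 20]
units = [95, 90, 82, 77, 70, 66]

print(f"hours vs score: {pearson(hours, score):+.3f}")
print(f"price vs units: {pearson(price, units):+.3f}")
```

The coefficient only measures linear association, so it complements the scatter plot rather than replacing it: a curved relationship can be obvious in the plot and still produce a coefficient close to zero.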
When to use data exploration
The answer is always. This initial analysis lets us begin drawing conclusions from our data and can guide how we define the rest of the analysis strategy.
Furthermore, in this step we can assess the quality of the dataset we received and design a good methodology for cleaning it, which improves both the data and the results of the analysis.
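As a rough sketch of what a first quality check might look like, the code below counts missing values per column and fully duplicated rows in a small list of records. The records, the column names and the rule that `None` or an empty string counts as missing are all assumptions made for the example.

```python
# Hypothetical records, as they might arrive from a CSV export
rows = [
    {"id": 1, "age": 34, "city": "Madrid"},
    {"id": 2, "age": None, "city": "Sevilla"},
    {"id": 3, "age": 29, "city": ""},
    {"id": 2, "age": None, "city": "Sevilla"},
]
columns = list(rows[0])

def is_missing(value):
    # Assumption: None and empty strings both count as missing
    return value is None or value == ""

# Missing values per column
missing = {col: sum(is_missing(r[col]) for r in rows) for col in columns}

# Rows that repeat an earlier row exactly
seen = set()
duplicates = 0
for r in rows:
    key = tuple(r[col] for col in columns)
    if key in seen:
        duplicates += 1
    seen.add(key)

print("missing per column:", missing)
print("duplicated rows:", duplicates)
```

Counts like these help decide what the cleaning step has to handle, for example whether to drop duplicated rows or how to treat records with no age.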
Tools for data exploration
There are many advanced tools for data analysis, designed for business intelligence or machine learning work.
However, an initial exploratory analysis does not require any specialized tool. A spreadsheet such as Excel or Google Sheets is enough.
These programs let us open the data and create different graphs to get a first idea of what the information we received looks like.
My favorite tools are the Python and R programming languages. Both have libraries aimed at data analysis, and if we master either language we can create graphs quickly and effectively.