What is a data lake
Data extraction and storage is crucial today in all sectors. This stored information can later be used to improve an application, business or company.
Introduction to data lakes or Data Lakes
A data lake is a repository designed to store all types of data without any predetermined schema, that is, we can save it raw without preprocessing it.
These technologies make use of the ELT (Extract, Load and Transform) procedure, which refers to the extraction of data from an original source and loading it into the final source, which is the data lake. Subsequently, those that are of interest can be filtered, grouped or selected.
One of the advantages of using data lakes over other architectures is the speed of data ingestion since they do not have to be cleaned before saving. In addition, we do not lose information either since they are all stored.
Data Lake Characteristics
Distributed environment
Many data lakes support distributed data storage, increasing data ingestion capacity and making them highly scalable.
Obtaining data in real time
By not having a predefined scheme, data collection is very fast, allowing data to be recovered and processed in real time.
Supports all types of formats
Data lakes allow for structured, semi-structured and unstructured formats. This feature allows all types of information to be saved in it regardless of its format.
Layers in a data lake
We have seen that the information stored in data lakes is raw data. However, there may be higher layers where this data is processed to give the client or user the information they need.
Below we show some of the typical layers present in a data lake architecture.
- Data ingestion : This step is optional and involves doing some checks on the information before storing it in the data lake. For example, you can add filters or carry out encryption processes for greater security.
- saved : In this part structured, semi-structured or unstructured data is stored without any prior transformation.
- Prosecution : Once saved, there may be a need to create a layer where the data is processed and transformed to show it to certain users. At this point, processes of data quality to ensure the integrity, reliability and relevance of the ingested data.
Differences between Data Lake and Data Warehouse
Data warehouses only allow data to be stored with a previous structure. On the other hand, data lakes accept all types of formats: structured, semi-structured and unstructured.
In data lakes we can frequently find images, videos or texts which we do not find in data warehouses.
Information processing
Data warehouses follow a process called ETL (Extract, Transform and Load). Data transformation and cleansing is executed before storage on the target system. This makes saving slower.
Instead, data lakes follow ELT (Extract, Load and Transform) where the data is cleaned and processed after saving to the target system.
Ingestion speed
The ingestion speed is higher in ELT processes, that is, in data lakes since no time is wasted in processing the information before storage.
In data warehouses there is a transformation of the information before saving to ensure its reliability and that it complies with the scheme with which the data warehouse has been designed.
Data protection
Data warehouses have a better data protection system since they have been operating on the market for longer.
Adaptation to changes
Data lakes adapt more easily to changes since a data warehouse, having a predefined structure, makes the process of adapting to customer requirements difficult. Data lakes, by not having a predefined structure, allow for greater versatility and agility.
Information reliability
Data Warehouses allow us to obtain more detailed and more reliable information since the data has been filtered and cleaned prior to saving.
On the other hand, in a data lake, the data is raw. If someone accesses the data lake with little experience, they may receive low-quality and unreliable information.
Platforms that use Data Lakes
There are some platforms that allow you to use this type of architecture to store all types of data. The most famous are AWS (Amazon Web Server), Azure, Google Cloud and Cloudera.
All of them have a lot of experience with the implementation of Big Data and Machine Learning technologies, which will help you at all times in the implementation of data lakes.
Learn how to use Data Lakes cloud services
There are courses on learning platforms like Udemy to learn how to manage data lakes and data warehouses and become a big data engineer:
- Data Lake in AWS
- Azure Data Factory for Data Engineers
- Azure Data Lake Storage Gen2