
What is HDFS (Hadoop Distributed Filesystem)?

When we work on our own computer, we save files, images, and videos on the built-in hard drive. However, when we work with a massive amount of data, we cannot store it on a single computer, since it would take up more space than a single machine's disks provide. The data therefore has to be distributed across many computers.

Introduction to HDFS

Distributing data like this raises new problems. What would happen if one of the computers broke? We would lose part of our data. For this reason, distributed file systems have to be fault tolerant to avoid data loss.

Hadoop includes its own file system, the Hadoop Distributed Filesystem, better known as HDFS. It satisfies both requirements above: it can store enormous volumes of data across a distributed network of computers, and it is fault tolerant, preventing data from being lost when one of the nodes fails.

In this blog article we introduce HDFS to understand its importance, its main characteristics, and how data flows between HDFS and big data frameworks.

Hadoop File System Features

Very large files

Hadoop clusters can store files far larger than any single disk. Today there are Hadoop systems running applications on many petabytes of data.

Saved in blocks

Files in HDFS are divided into blocks of 128 MB by default. These blocks are replicated and stored on different nodes. Having multiple copies of the same block allows the data to be recovered if one of the nodes fails.
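As a rough illustration, the Hadoop Java API lets a client query the block size the cluster will use for new files. This is a minimal sketch, assuming the hadoop-client library is on the classpath; the NameNode address hdfs://namenode:9000 is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        FileSystem fs = FileSystem.get(conf);

        // Default block size used for new files (128 MB in recent Hadoop versions).
        long blockSize = fs.getDefaultBlockSize(new Path("/"));
        System.out.println("Default block size: " + (blockSize / (1024 * 1024)) + " MB");

        fs.close();
    }
}
```

The block size is controlled by the dfs.blocksize property and can also be chosen per file when it is created.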

Suitable for all types of hardware

Hadoop can run on clusters built from commodity computers from many different manufacturers. It is designed to be easy to install and simple to use, so that working with the cluster feels like working with a single computer.

High scalability

HDFS scales horizontally: capacity grows by adding more nodes (computers) to the cluster. HDFS supports clusters of thousands of nodes, allowing distributed applications to grow quickly and reliably.

High latency access

Applications that require low-latency access do not work well with HDFS, since the system is designed to deliver large volumes of data with high throughput rather than to serve fast individual reads. Other storage systems offer faster access, such as HBase, a column-oriented, non-relational distributed database.

Fault Tolerant

HDFS is a redundant system: by default, three copies of each block are generated and stored on different servers. If one of the computers loses its data, the information can be recovered quickly from another replica.
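To make the redundancy concrete, here is a minimal sketch of how the replication factor can be set through the Hadoop Java API. The NameNode address and the file path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        // dfs.replication controls how many copies of each block are kept.
        // The HDFS default is 3.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // The replication factor can also be changed for an existing file.
        // "/data/example.csv" is a hypothetical path.
        fs.setReplication(new Path("/data/example.csv"), (short) 3);

        fs.close();
    }
}
```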

HDFS system components

NameNode

The NameNode is the master node in charge of organizing and managing the entire cluster. It holds the metadata that records on which DataNode each block of information is stored.

In addition, the NameNode manages clients' access to the DataNodes.
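This division of labor can be observed from a client: the block locations returned below come from the NameNode's metadata, while the blocks themselves live on the listed DataNodes. A minimal sketch, with the NameNode address and file path as placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        // "/data/example.csv" is a hypothetical file path.
        FileStatus status = fs.getFileStatus(new Path("/data/example.csv"));

        // The NameNode answers this query from its metadata: for each block,
        // it reports which DataNodes hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + ", hosts: " + String.join(", ", block.getHosts()));
        }

        fs.close();
    }
}
```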

If the NameNode crashes, the block metadata cannot be recovered and the information in HDFS is lost. To avoid this, a standby NameNode can be enabled that takes over when the primary one fails.

DataNodes

The DataNodes are the slaves of the system. They store and retrieve data blocks when instructed by the NameNode or by a client.

Data flow between client and HDFS

To understand how an application that uses a distributed file system like HDFS works, it is important to understand how data flows between the client, the NameNode, and the DataNodes.

When the client requests to read a file, it makes a call to the NameNode, which determines the location of the file's blocks on the DataNodes.

Once the blocks are located, the DataNodes holding them are sorted by network topology, according to their proximity to the client.

The client reads the first block from the nearest DataNode; once it is finished, that connection is closed and a connection to the DataNode holding the next block is opened.

This happens transparently to the client, which simply sees a continuous stream of data.

If an error arises with the DataNode serving a specific block, the client switches to the next-closest DataNode holding a replica of that block. This is what allows HDFS reads to be fault tolerant.
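From the client's point of view, all of this machinery is hidden behind an ordinary input stream. The following minimal sketch reads a file and prints it: opening the file triggers the block lookup on the NameNode, and the stream moves from one block's DataNode to the next behind the scenes. The NameNode address and file path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block locations; the returned stream
        // reads each block from a DataNode and switches DataNodes between
        // blocks transparently.
        try (FSDataInputStream in = fs.open(new Path("/data/example.csv"))) {
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, bytesRead);
            }
            System.out.flush();
        }

        fs.close();
    }
}
```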

Big data applications supported by Hadoop Distributed Filesystem

Many big data frameworks run on top of Hadoop's distributed file system. Some of these are Spark, Hive, Pig, and Mahout.

All of these frameworks play the role of clients: they call HDFS to retrieve the data they need and perform different operations on it before presenting the results to the user.

For example, Mahout uses data stored in HDFS to build machine learning models, and Hive extracts data from it with SQL-like statements that read close to natural language.

Best courses to learn Hadoop and HDFS

Now that you know what HDFS is, would you like to learn how to put it into practice in your own projects?

Below are some of the online courses where you can learn big data, HDFS, and Hadoop in depth.

  1. The ultimate hands-on Hadoop: Tame your big data!
  2. Introduction to Big Data with Hadoop from Scratch