This post is about how HDFS handles data in batches and in real-time. This post will address all your queries on how HDFS stores data coming from different sources and in different forms, and also includes the basics of HDFS, starting from what Hadoop is, the different versions of Hadoop, changes in these versions, and how HDFS is modulated in respect to these changes, particularly in batch and real-time processing.
The objective of this post is to give you a clear idea of how HDFS stores data coming from different sources.
As of now we have two versions in Hadoop;Hadoop1.x and Hadoop 2.x.
This is the very first version of Hadoop built to handle BigData. HDFS and MapReduce are the two steps involved in processing the data in Hadoop1.x architecture. It process data in Batches, hence the name ‘Batch processing’.
Hadoop2.x is the enhanced version of Hadoop1.x. This version has introduced to overcome the problems and shortcomings of Hadoop1.x.A new feature called YARN(Yet Another Resource negotiator) has been introduced in Hadoop2.x. Here HDFS is used along with YARN to process the data. With YARN, Hadoop is able to process different forms of data. Along with Hadoop, we can use many other tools to process the data. They are as follows:
- HDFS+MapReduce- Batch processing
- HDFS+Storm – Real-time processing
- HDFS+Spark – Near real-time processing
Similar to this, we can process many forms of data with the help YARN andHadoop2.x
Batch processing is the processing of previously collected data in the form of batches i.e., processing large amounts of accumulated data at one time.Hadoop handles the large amounts of data by storing them in a cluster of nodes connected to each other.
Real time processing is the data processing that takes place, instantaneously upon data entry or receipt of a command. Real-time processing must execute within strict constraints on response time.
Near Real-time Processing
Here there is a slight time delay between the collection of data and processing of data.
If you observe closely, there is something common between all the three types of data processing.
It is none other than HDFS. Let’s see how HDFS handles the different forms of data.
HDFS is popularly known as Hadoop Distributed File System, which is the core component of Hadoop. HDFS is a java-based file system and is the place where all the data in the Hadoop cluster resides. In typical terms, Hadoop has the Master-Slave architecture. This is named in perspective to the HDFS.
It is called as Master-Slave architecture because there is a Master which takes control of all the Slaves. Here the Master is named as NameNode and the Slaves are named as DataNodes.
HDFS has been restructured in the second version of Hadoop to support multiple types of data processing units.
In Hadoop1.x the components of HDFS are as follows:
- Job Tracker
- Task Tracker
- Secondary NameNode
Here NameNode, DataNode, Secondary NameNode are the daemons of HDFS and Job tracker, Task tracker are the daemons of Map reduce
In Hadoop2.x the components are:
- Resource Manager
- Node Manager
- Secondary NameNode
Here NameNode, DataNode, Secondary NameNode are the daemons of HDFS and Resource Manager, Node manager are the daemons of YARN(popularly known as the Map reduce 2.0)
These are called daemons in Hadoop.
The difference between these two versions is that, in the Master, Resource Manager is introduced instead of the Job Tracker, and in the Slave, Node Manager is introduced instead of the Task Tracker.
Job Tracker and Task Tracker are the components of MapReduce, whereas Resource Manager and Node Manager are the components of YARN. Here in Hadoop 2.x HDFS is federated in hadoop 2.x to handle multiple jobs at the same time.
Let us learn about the daemons of HDFS:
NameNode holds the meta data for the HDFS like Namespace information, block information etc. When in use, all this information is stored in main memory. But these information also stored in disk for persistence storage.
There are two different files associated with the NameNode
- fsimage– It is the snapshot of the filesystem
- Edit logs – After the start of NameNode it maintains the every transaction that happened with NameNode.
The NameNode maintains the namespace tree and the mapping of blocks to DataNodes.
The Client communicates with the NameNode and provides data to the HDFS through it. The HDFS then stores data as blocks inside DataNodes.
By default, the block size in a hadoop cluster is 64MB. NameNode maintains the meta data information of all the blocks present inside the hadoop cluster like permissions, modification and access times, namespace and disk space quotas . This is the reason NameNode is also called as the Master node.
The above figure gives us a clear idea of how Secondary NameNode works
- It asks the NameNode for its edit logs in regular intervals and copies them into the fsimage.
- After updating the fsimage, NameNode copy back that fsimage
- NameNode uses this fsimage when it starts, this eventually will reduce the startup time.
The main theme of secondary NameNode is to maintain a checkpoint in HDFS. When a failure occurs SNN won’t become NameNode it just helps NameNode in bringing back its data. SNN is also called as checkpoint node in hadoop’s architecture.
This is the place where actual data is stored in the hadoop cluster in a distributed manner.
Every DataNode will have a block scanner and it directly reports to the NameNode about the blocks which it is handling.
The DataNodes communicate with the NameNode by sending Heartbeats for every 3seconds and if the NameNode does not receive any Heartbeat for 10 minutes, then it treats the DataNode as a dead node and re-replicates the blocks.
During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.
The namespace ID is assigned to the file system instance when it is formatted. The namespace ID is persistently stored on all nodes of the cluster.
The lost blocks are re-replicated as per the replication factor as there is a threshold limit for the number of blocks that are to be maintained.Depending on the heartbeats, we can obtain the replication management. This process is called as block report.If a block is over replicated, then it should be removed and if a block is under replicated, NameNode creates a priority queue and data with the lower number of blocks are put in the higher priority of the queue. The blocks with high priority are replicated first.
This replication factor can be changed by configuring it in the HDFS-site.xml file.
We can specify our own value in between the value tags so that we can have as many copies as we want.
Bandwidth and Latency are very useful for achieving high performance with the cluster. HDFS achieves this by storing the same kind of data in the nearest DataNodes and in the nearest Racks. When storing in the nearest nodes, it becomes easier to exchange the data between the nodes.
Hdfs is very intelligent, it caches the blocks that are frequently used. A DataNode will read the block from disk, but for frequently accessed files, the blocks are cached in the DataNode’s memory. A person can also explicitly specify to a NameNode to cache particular file that is what we call as distributed cache.
Space Reclamation refers to the deletion and un-deletion of a file. If we delete a file from HDFS, it won’t delete the file permanently, it will send the deleted file into trash memory and the file lies in the trash memory for a certain amount of time which can be configured, by default it is 6 hours.
If we delete a file the DataNode space will be freed, but the file data will be in the NameNode’s HDFS namespace until it is in the trash. We can also undelete the file from trash memory within that configured time. After the expiry of file from trash memory, NameNode will automatically delete the file from HDFS namespace.
There are certain protocols using which client will communicate with the NameNode and NameNode will communicate with the DataNodes. All the HDFS communication protocols are based on Tep/IP protocol.
Client talks with the NameNode using the client protocol which is established on a Tep port on the NameNode. DataNodes talk to the NameNode using the DataNode protocol. Between these Client protocol and DataNode protocol there is a Remote procedure call(RPC) abstraction which wraps both of the above specified protocols.
Types of commands
The HDFS can be interacted through the following commands:
FS Shell commands:
The FS shell is popularly known as FileSystem Shell. It allows users to interact with the HDFS through commands. The FS Shell has a profound dictionary of commands by which users can easily interact with HDFS.
Refer the below blog to learn the basic commands in HDFS
The dfsadmin command supports Hadoop administration-related options like handling the nodes in a cluster.
These are the different types of administration commands that are available with HDFS.
We use both the Fs shell and Administration commands to work on HDFS. But the problem comes with storing the data into HDFS.
Let us discuss about how HDFS interacts with different types of data and how to store them in HDFS.
HDFS in Batch Processing
Batch processing means, processing the previously collected data.In Batch processing, the client gives the previously collected data to the Hadoop cluster and the NameNode distributes the data among the DataNodes and maintains the records of all these data.
Hadoop’s primary feature is to handle large amount of data in batches that has been achieved here. Using the HDFS fs shell commands we can able to store the pre collected data.
HDFS in Real-Time Processing
Almost 43% of the Big Data is received from social media. The data from social media are real-time and needs to be processed to obtain more insights from them.
We have seen as far that hadoop can only handle Batch processing i.e., data which is present in batches. In batch processing client is the one who provides the data to hadoop cluster but while coming to Real Time no one is there to give data to the cluster.
The solution for this problem is pipelining the Real time data.
A pipeline is a data processing element where the output of one element is the input to another. So, a middle ware should be introduced between HDFS and the real-time data. Apache Flume is one such middle ware which collects the data in the real-time and give it as an input to the HDFS.
We can obtain this pipelining in two ways i.e., with Flume and Kafka.
How Flume works?
Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store, such as HDFS or HBase. It is more tightly integrated with Hadoop ecosystem.
Source, Channel and Sink are the primary components of Flume.
Source is the one through which data enters into Flume.
Source collects events. An event can be defined as a typical single log entry. Source actively wait for the events to be collected.
A sink is the one which outputs the data. Which means it gives the input of the flume as the output to the next element.
HDFS sink is used to write the flume input data into HDFS.
Channels are the mechanism by which events get transferred from Source to the sink. Channel maintain the event inside it until the sink writes or gives its output to any other unit it may be HDFS. This is very helpful when some failure happens and the data in the sink is lost.
It collects sources, channels and sinks and it can be any physical java virtual machine running flume.
Refer the below blog to understand the streaming twitter data using Flume.
There is another tool called Apache Kafka, which handles the data in real-time. It is very similar to Flume. It was developed by LinkedIn to handle their real-time data.