This article covers two new names introduced in Hadoop 2.0: YARN and MR2.
- What is YARN?
- Why was YARN (Yet Another Resource Negotiator), a new framework in Hadoop 2.0, needed?
- What are the benefits of the YARN framework over the earlier MapReduce framework of Hadoop 1.0?
- What is the difference between MR1 in Hadoop 1.0 and MR2 in Hadoop 2.0?
This article is easier to follow if you have basic knowledge of Hadoop and MapReduce. If you are not familiar with Hadoop and MapReduce, the article below may help you.
YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0.
Let’s look at how the Hadoop architecture changed from Hadoop 1.0 to Hadoop 2.0.
As shown, Hadoop 2.0 introduces a new layer between HDFS and MapReduce.
This layer is the YARN framework, which is responsible for cluster resource management.
Cluster Resource Management:
Cluster resource management means managing the resources of a Hadoop cluster. By resources we mean memory, CPU, and so on.
YARN took over this task of cluster management from MapReduce, leaving MapReduce streamlined to do only data processing, which is what it does best.
Before looking at why YARN was needed, we should understand how cluster resource management was done in Hadoop 1.0 and what the problems with that approach were.
In Hadoop 1.0, cluster resource management is tightly coupled to the MapReduce programming model.
The JobTracker, which does resource management, is part of the MapReduce framework.
In the MapReduce framework, a MapReduce job (MapReduce application) is divided into a number of tasks called mappers and reducers. Each task runs on one of the machines (DataNodes) of the cluster, and each machine has a limited number of predefined slots (map slots and reduce slots) for running tasks concurrently.
Here, the JobTracker is responsible both for managing the cluster’s resources and for driving the execution of the MapReduce job. It reserves and schedules slots for all tasks; configures, runs, and monitors each task; and, if a task fails, allocates a new slot and reattempts the task. After a task finishes, the JobTracker cleans up temporary resources and releases the task’s slot to make it available for other jobs.
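The JobTracker lifecycle described above (reserve a slot, run the task, reattempt on failure, release the slot) can be sketched as a toy loop. This is only an illustration under simplifying assumptions, not Hadoop code: `run_job`, `run_task`, and `max_attempts` are hypothetical names, tasks run one at a time, and a task "runs" by calling a plain Python function.

```python
def run_job(tasks, free_slots, run_task, max_attempts=4):
    """Toy JobTracker loop (hypothetical): one central place does everything."""
    attempts = {}
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            if free_slots == 0:
                raise RuntimeError("no free slot")
            free_slots -= 1              # reserve a slot for this attempt
            try:
                ok = run_task(task, attempt)
            finally:
                free_slots += 1          # release the slot for other tasks
            attempts[task] = attempt     # record the latest attempt number
            if ok:
                break                    # task succeeded, move to the next one
        else:
            raise RuntimeError(f"{task} failed after {max_attempts} attempts")
    return attempts, free_slots


def flaky(task, attempt):
    """Hypothetical task runner: 'map-1' fails on its first attempt only."""
    return not (task == "map-1" and attempt == 1)


attempts, free = run_job(["map-0", "map-1", "reduce-0"], free_slots=2, run_task=flaky)
print(attempts, free)
```

Even in this small sketch, slot bookkeeping, task execution, and retry logic all live in one function, which mirrors how much responsibility sits in a single JobTracker.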
- It limits scalability: The JobTracker runs on a single machine and performs several tasks, such as
- Resource management
- Job and task scheduling
- Monitoring
Although many machines (DataNodes) are available, they are not used for this work. This limits scalability.
- Availability issue: In Hadoop 1.0, the JobTracker is a single point of failure. This means that if the JobTracker fails, all jobs must restart.
- Problem with resource utilization: In Hadoop 1.0, each TaskTracker has a predefined number of map slots and reduce slots. Utilization problems occur because the map slots might be full while the reduce slots are empty (and vice versa). Compute resources reserved for reduce slots can sit idle even when there is an immediate need to use them as map slots.
- Limitation in running non-MapReduce applications: In Hadoop 1.0, the JobTracker was tightly integrated with MapReduce, so only applications that follow the MapReduce programming model could run on Hadoop.
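The resource utilization problem in the list above can be illustrated with a toy model of fixed, typed slots. This is a minimal sketch rather than Hadoop code: the `TaskTracker` class, its fields, the `schedule` helper, and the slot counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskTracker:
    """Toy worker (hypothetical) with a fixed number of map and reduce slots."""
    map_slots: int
    reduce_slots: int
    running_maps: int = 0
    running_reduces: int = 0

    def try_start(self, kind: str) -> bool:
        """Start a task only if a free slot of the matching type exists."""
        if kind == "map" and self.running_maps < self.map_slots:
            self.running_maps += 1
            return True
        if kind == "reduce" and self.running_reduces < self.reduce_slots:
            self.running_reduces += 1
            return True
        return False


def schedule(trackers, pending):
    """Place each task on the first tracker with a matching free slot.

    Returns the tasks that could not be placed and have to wait.
    """
    waiting = []
    for kind in pending:
        if not any(t.try_start(kind) for t in trackers):
            waiting.append(kind)
    return waiting


# Two workers, each with 2 map slots and 2 reduce slots (made-up numbers).
trackers = [TaskTracker(map_slots=2, reduce_slots=2) for _ in range(2)]
waiting = schedule(trackers, ["map"] * 6)
idle_reduce_slots = sum(t.reduce_slots - t.running_reduces for t in trackers)
print(len(waiting), idle_reduce_slots)
```

With six map tasks and only four map slots, two map tasks wait while all four reduce slots stay idle, because a reduce slot cannot be reused for a map task.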
Let’s look at the fourth point, running non-MapReduce applications, in more detail.
The Hadoop Distributed File System (HDFS) makes it cheap to store large amounts of data, and its scalable MapReduce analysis engine makes it possible to extract insights from that data. MapReduce performs batch-driven data analysis, where the input data is partitioned into smaller batches that can be processed in parallel across many machines in the Hadoop cluster. But MapReduce, while powerful enough to express many data analysis algorithms, is not always the best choice of programming paradigm. It’s often desirable to run other computation paradigms in the Hadoop cluster. Here are some examples.
- Problem performing real-time analysis: MapReduce is batch driven. What if you want to perform real-time analysis instead of batch processing, where results may be available only after several hours? Many applications, such as fraud detection, need results in real time. Real-time engines like Apache Storm work better in this case. But in Hadoop 1.0, because of the tight coupling, such engines cannot run on the cluster independently of MapReduce.
- Problem running message-passing approaches: In message passing, a stateful process runs on each node of a distributed network. The processes communicate with each other by sending messages, and alter their state based on the messages they receive. This is not possible in MapReduce.
- Problem running ad-hoc queries: Many users like to query their big data using SQL. Apache Hive can execute a SQL query as a series of MapReduce jobs, but it has shortcomings in terms of performance.
Newer approaches such as Apache Tajo, Facebook’s Presto, and Cloudera’s Impala improve performance drastically, but they need to run services in a form other than MapReduce.
Hadoop 1.0 cannot run such non-MapReduce jobs directly on the cluster. They have to “disguise” themselves as mappers and reducers in order to run on Hadoop 1.0.
YARN took over the task of cluster management from MapReduce, leaving MapReduce streamlined to do only data processing, which is what it does best.
YARN has a central resource manager component that manages resources and allocates them to applications. Multiple applications can run on Hadoop via YARN, and all of them share common resource management.
- YARN uses resources efficiently.
There are no more fixed map and reduce slots. YARN provides a central resource manager. With YARN, you can run multiple applications in Hadoop, all sharing a common pool of resources.
- YARN can run applications that do not follow the MapReduce model.
YARN decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce to do what it does best: process data.
- YARN is backward compatible.
This means that existing MapReduce jobs can run on Hadoop 2.0 without any change.
- JobTracker and TaskTracker are no longer needed in Hadoop 2.0.
The JobTracker and TaskTracker are gone. YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components.
- ResourceManager (central)
- NodeManager (node-specific)
The central ResourceManager and the node-specific NodeManagers together constitute YARN. Scheduling and monitoring of an individual job is handled by a per-application ApplicationMaster rather than by a central daemon.
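To make the contrast with fixed slots concrete, here is a toy model of a central resource manager handing out containers from whatever capacity the nodes have left. It is a sketch under stated assumptions, not the YARN API: the `NodeManager` and `ResourceManager` classes below are hypothetical Python stand-ins, only memory is tracked, and the first node with enough free memory wins.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-node agent (hypothetical) that tracks free memory on one machine."""
    name: str
    free_mb: int


@dataclass
class ResourceManager:
    """Toy central manager (hypothetical) granting containers to any application."""
    nodes: list
    granted: list = field(default_factory=list)

    def allocate(self, app: str, mb: int) -> bool:
        """Grant a container of `mb` megabytes on the first node that can fit it."""
        for node in self.nodes:
            if node.free_mb >= mb:
                node.free_mb -= mb
                self.granted.append((app, node.name, mb))
                return True
        return False


# Two nodes with 4096 MB each (made-up numbers).
rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])

# Six 1024 MB containers for a MapReduce job, then one for a non-MapReduce app.
results = [rm.allocate("mapreduce-job", 1024) for _ in range(6)]
results.append(rm.allocate("streaming-app", 2048))
print(results, [(n.name, n.free_mb) for n in rm.nodes])
```

Because capacity is not tagged as "map" or "reduce", all seven requests are granted and both kinds of application share the same pool; nothing sits idle waiting for a particular task type.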
The earlier version of the MapReduce framework, in Hadoop 1.0, is called MR1. The new version of MapReduce is known as MR2.
With the introduction of YARN in Hadoop 2, the JobTracker and TaskTracker are no longer needed, and the terms have disappeared. MapReduce is now streamlined to do data processing only.
The new model is more isolated and scalable than the earlier MR1 system. MR2 is one kind of distributed application, one that runs the MapReduce framework on top of YARN. MapReduce performs data processing via YARN, and other tools can also perform data processing via YARN. Hence the YARN execution model is more generic than the earlier MapReduce model.