Hadoop, the widely adopted answer to the avalanche of Big Data, has come a long way since Google published its papers on the Google File System in 2003 and MapReduce in 2004. It made waves with its scale-out, rather than scale-up, strategy. Work by Doug Cutting and his team at Yahoo and in the Apache Hadoop project popularized MapReduce programming, which is I/O intensive and poorly suited to interactive analysis and graphics support. These limitations paved the way for the evolution from Hadoop 1 to Hadoop 2. The following table describes the major differences between them:

| Sl No | Hadoop 1 | Hadoop 2 |
|---|---|---|
| 1 | Supports the MapReduce (MR) processing model only. Does not support non-MR tools. | Supports MR as well as other distributed computing models such as Spark, Hama, Giraph, Message Passing Interface (MPI) and HBase coprocessors. |
| 2 | MR does both processing and cluster resource management. | YARN (Yet Another Resource Negotiator) does cluster resource management, and processing is done by different processing models. |
| 3 | Limited scalability: up to 4,000 nodes per cluster. | Better scalability: up to 10,000 nodes per cluster. |
| 4 | Works on the concept of slots. A slot can run either a Map task or a Reduce task only. | Works on the concept of containers. A container can run generic tasks. |
| 5 | A single NameNode manages the entire namespace. | Multiple NameNode servers manage multiple namespaces. |
| 6 | Has a single point of failure (SPOF) because of the single NameNode. A NameNode failure needs manual intervention to overcome. | Overcomes the SPOF with a standby NameNode, which can be configured for automatic recovery when the NameNode fails. |
| 7 | MR API is compatible with Hadoop 1.x. A program written for Hadoop 1 runs on Hadoop 1.x without any additional files. | MR API requires additional files for a program written for Hadoop 1.x to run on Hadoop 2.x. |
| 8 | Limited as a platform for event processing, streaming and real-time operations. | Can serve as a platform for a wide variety of data analytics. Event processing, streaming and real-time operations are all possible. |
| 9 | A NameNode failure affects the whole stack. | The Hadoop stack (Hive, Pig, HBase etc.) is equipped to handle NameNode failure. |
| 10 | Does not support Microsoft Windows. | Adds support for Microsoft Windows. |


Hadoop 2.0 brings quite a few great things. Let’s check out these cool features and compare them with 1.0.

Hadoop 1.0 Vs Hadoop 2.0

1. NameNode in High Availability (HA) mode
The NameNode is the most important node in a Hadoop cluster because it stores all the metadata. If it goes down due to an unplanned event such as a machine crash, the whole Hadoop cluster goes down with it. How do we handle this situation?
Hadoop 2.0 comes with a solution to this problem.
  • HDFS now has a High Availability feature, which provides the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration (one active NameNode and one hot standby NameNode).
  • Both share an edits log. All namespace edits are logged to shared NFS storage, and there is only a single writer to this shared storage at any point in time. The passive NameNode reads from this storage and keeps its metadata for the cluster up to date. If the active NameNode fails, the passive NameNode becomes the active NameNode and starts writing to the shared storage.
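
The single-writer rule and the failover step can be illustrated with a small toy model. This is only a sketch in Python, not Hadoop code: the class names `SharedEditLog` and `NameNode`, the `fail_over` helper and the node names `nn1`/`nn2` are all made up for illustration.

```python
class SharedEditLog:
    """Toy stand-in for the shared edits storage: one writer at a time."""

    def __init__(self):
        self.edits = []      # ordered list of (operation, path) tuples
        self.writer = None   # name of the only node allowed to append

    def append(self, node_name, edit):
        if node_name != self.writer:
            raise PermissionError(f"{node_name} is not the current writer")
        self.edits.append(edit)


class NameNode:
    """Hypothetical NameNode that keeps its namespace in sync with the log."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()
        self.applied = 0     # number of edits already replayed

    def catch_up(self):
        """Replay the edits from the shared log that were not applied yet."""
        for op, path in self.log.edits[self.applied:]:
            if op == "create":
                self.namespace.add(path)
            elif op == "delete":
                self.namespace.discard(path)
        self.applied = len(self.log.edits)

    def write(self, op, path):
        """Log an edit (allowed for the active node only), then apply it."""
        self.log.append(self.name, (op, path))
        self.catch_up()


def fail_over(log, standby):
    """Promote the standby: replay outstanding edits, then become the writer."""
    standby.catch_up()
    log.writer = standby.name


log = SharedEditLog()
active = NameNode("nn1", log)
standby = NameNode("nn2", log)
log.writer = active.name

active.write("create", "/data/a.txt")
active.write("create", "/data/b.txt")
active.write("delete", "/data/a.txt")

fail_over(log, standby)                   # assume nn1 has crashed
standby.write("create", "/data/c.txt")
print(sorted(standby.namespace))          # ['/data/b.txt', '/data/c.txt']
```

After the failover, any write attempted by the old active node is rejected, which is the point of allowing only one writer to the shared storage.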

  2. Ability to run non-MapReduce applications on Hadoop 2.0

In Hadoop 1.0, you can only run MapReduce jobs to process the data stored in HDFS; no other data processing model is available. For other kinds of processing, such as real-time or graph analysis on the same data, you need to move the data out of HDFS to some alternate storage like HBase, because Hadoop 1.0 supports only the MapReduce processing model.

Hadoop 2.0 comes with a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
Hadoop 2.0 provides YARN APIs for writing other frameworks that run on top of HDFS. This enables running non-MapReduce Big Data applications on Hadoop. Spark, MPI, Giraph and Hama are a few of the applications written or ported to run within YARN.

  3. Improved resource utilization

In Hadoop 1.0, the JobTracker is responsible for both managing the cluster’s resources and driving the execution of MapReduce jobs.
YARN splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons:

  • a global ResourceManager, and
  • a per-application ApplicationMaster.

The ResourceManager (RM) focuses on managing the cluster resources, while an ApplicationMaster (AM), one per running application, manages each running application (such as a MapReduce job).
There are no more fixed map and reduce slots. YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common pool of resources.
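
To see why dropping fixed slots helps utilization, here is a rough Python sketch that compares the two models. It is a toy under stated assumptions, not the YARN scheduler: the function names, the memory sizes and the application names are invented for the example.

```python
def run_with_slots(pending, map_slots, reduce_slots):
    """Hadoop 1 style: slots are typed, so idle reduce slots cannot run map tasks."""
    running_maps = min(pending.get("map", 0), map_slots)
    running_reduces = min(pending.get("reduce", 0), reduce_slots)
    return running_maps + running_reduces


def run_with_containers(requests, cluster_memory_mb):
    """YARN style: grant generic containers in request order while memory lasts."""
    granted = []
    free = cluster_memory_mb
    for app, memory_mb in requests:
        if memory_mb <= free:
            granted.append(app)
            free -= memory_mb
    return granted, free


# Four tasks are waiting and none of them is a reduce task.
slot_count = run_with_slots({"map": 4, "reduce": 0}, map_slots=2, reduce_slots=2)

# The same capacity expressed as one shared pool of 4 x 1024 MB.
requests = [
    ("mr-map-1", 1024),
    ("mr-map-2", 1024),
    ("spark-executor", 1024),
    ("giraph-worker", 1024),
]
granted, free_mb = run_with_containers(requests, cluster_memory_mb=4096)

print(slot_count)     # 2
print(len(granted))   # 4
```

With typed slots, half of the capacity sits idle because the two reduce slots cannot take map work. With generic containers, all four requests fit, and they do not even have to come from the same framework.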

  4. Native Windows support

Hadoop was originally developed to support the UNIX family of operating systems. With Hadoop 2, the Windows operating system is natively supported. This extends the reach of Hadoop significantly, to the sizable Windows Server market.

  5. Beyond batch-oriented applications: Hadoop goes beyond its batch-oriented nature in version 2.0 and can now run interactive and streaming applications as well.
  6. HDFS Federation

The Hadoop cluster storage subsystem has been generalized to support other frameworks besides HDFS. Similar to YARN, the new storage architecture generalizes the block storage layer so that it can be used not only by HDFS but also by other storage services. The first use of this feature is HDFS federation, which allows multiple HDFS namespaces to share the underlying storage. In future versions of Hadoop, other storage services (such as key-value storage) will use the same storage layer.
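
The idea of several namespaces sitting on one shared block layer can be sketched as follows. This is a toy model in Python and makes several assumptions: the classes `Namespace` and `SharedBlockStorage`, the `resolve` and `write_file` helpers, the mount points and the namespace names are hypothetical and only illustrate the concept.

```python
class Namespace:
    """One independent namespace; it only knows paths and block IDs."""

    def __init__(self, name):
        self.name = name
        self.files = {}   # path -> list of block IDs


class SharedBlockStorage:
    """Block layer shared by all namespaces, with one block pool per namespace."""

    def __init__(self):
        self.pools = {}   # namespace name -> {block ID: data}

    def put(self, pool, block_id, data):
        self.pools.setdefault(pool, {})[block_id] = data


def resolve(mount_table, path):
    """Pick the namespace whose mount point is the longest prefix of the path."""
    matches = [m for m in mount_table
               if path == m or path.startswith(m.rstrip("/") + "/")]
    if not matches:
        raise KeyError(f"no namespace mounted for {path}")
    return mount_table[max(matches, key=len)]


def write_file(mount_table, storage, path, data):
    """Record the file in its namespace and store the block in the shared layer."""
    ns = resolve(mount_table, path)
    block_id = f"blk_{len(ns.files) + 1}"
    ns.files[path] = [block_id]
    storage.put(ns.name, block_id, data)
    return ns.name


users = Namespace("ns-users")
logs = Namespace("ns-logs")
mount_table = {"/user": users, "/logs": logs}
storage = SharedBlockStorage()

write_file(mount_table, storage, "/user/alice/a.csv", b"id,name")
write_file(mount_table, storage, "/logs/2015/app.log", b"started")
print(sorted(storage.pools))   # ['ns-logs', 'ns-users']
```

Each namespace manages only its own paths, yet both write into the same storage object, which is the property that federation relies on.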

  7. HDFS: multiple storage types

One more fundamental change is the support for heterogeneous storage.
Hadoop 1.0 treated all storage devices (be they spinning disks or SSDs) on a DataNode as a single uniform pool; although one could store data on an SSD, one could not control which data went there. From Hadoop 2.0 onwards, the system distinguishes between storage types and makes the storage type information available to frameworks and applications so that they can take advantage of storage properties. Indeed, the approach is general enough to treat even memory as a storage tier for cached and temporary data.
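
Once storage types are visible, replica placement can be driven by a policy. The Python sketch below shows the general idea only. The policy names, the tier names and the fall-back-to-disk rule are assumptions made for this example and are not taken from the Hadoop configuration.

```python
# Hypothetical policy table: tier for the first replica, tier for the others.
POLICIES = {
    "hot": ("DISK", "DISK"),
    "one_ssd": ("SSD", "DISK"),
    "all_ssd": ("SSD", "SSD"),
    "temporary": ("RAM", "DISK"),
}


def place_replicas(policy, replication, free_slots):
    """Choose a storage tier per replica, falling back to DISK when a tier is full."""
    free = dict(free_slots)          # work on a copy, leave the caller's dict alone
    first, rest = POLICIES[policy]
    wanted = [first] + [rest] * (replication - 1)
    placement = []
    for tier in wanted:
        chosen = tier if free.get(tier, 0) > 0 else "DISK"
        if free.get(chosen, 0) <= 0:
            raise RuntimeError("no storage left for another replica")
        free[chosen] -= 1
        placement.append(chosen)
    return placement


print(place_replicas("one_ssd", 3, {"SSD": 1, "DISK": 5}))   # ['SSD', 'DISK', 'DISK']
print(place_replicas("all_ssd", 3, {"SSD": 2, "DISK": 5}))   # ['SSD', 'SSD', 'DISK']
```

The second call shows the fallback: only two SSD slots are free, so the third replica lands on disk instead of failing.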

  8. Faster access to data: DataNode caching

Users and applications (such as Hive, Pig or HBase) can now identify a set of files that need to be cached. For example, dimension tables in Hive can be configured for caching in the DataNode RAM, enabling quick reads for Hive queries against these frequently looked-up tables.
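
As a rough picture of what pinning files in DataNode memory means for a reader, consider this small Python sketch. The class `CachingDataNode`, its method names and the warehouse paths are hypothetical; it models the behaviour described above, not the actual HDFS interface.

```python
class CachingDataNode:
    """Toy DataNode that serves pinned files from RAM and the rest from disk."""

    def __init__(self, disk_files):
        self.disk = dict(disk_files)   # path -> file contents on disk
        self.cache = {}                # path -> file contents pinned in RAM

    def add_cache_directive(self, path_prefix):
        """Pin every file under the given directory in memory."""
        prefix = path_prefix.rstrip("/") + "/"
        for path, data in self.disk.items():
            if path.startswith(prefix):
                self.cache[path] = data

    def read(self, path):
        """Return the data together with where it was served from."""
        if path in self.cache:
            return self.cache[path], "memory"
        return self.disk[path], "disk"


dn = CachingDataNode({
    "/warehouse/dim_date/part-0": b"2015-01-01",
    "/warehouse/fact_sales/part-0": b"42",
})
dn.add_cache_directive("/warehouse/dim_date")   # the small, hot dimension table

print(dn.read("/warehouse/dim_date/part-0")[1])     # memory
print(dn.read("/warehouse/fact_sales/part-0")[1])   # disk
```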

  9. HDFS Snapshots

Hadoop 2 adds support for file system snapshots. A snapshot is a point-in-time image of the entire file system or of a subtree of the file system. A snapshot has many uses:

  • Protection against user errors: An admin can set up a process to take snapshots periodically. If a user accidentally deletes files, these can be restored from the snapshot that contains the files.
  • Backup: If an admin wants to back up the entire file system or a subtree in the file system, the admin takes a snapshot and uses it as the starting point of a full backup. Incremental backups are then taken by copying the difference between two snapshots.
  • Disaster recovery: Snapshots can be used for copying consistent point-in-time images over to a remote site for disaster recovery.

The snapshots feature supports read-only snapshots; it is implemented only in the NameNode, and no copy of data is made when the snapshot is taken. Snapshot creation is instantaneous. All the changes made to the snapshotted directory are tracked using modified persistent data structures to ensure efficient storage on the NameNode.
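
The restore and incremental backup use cases both come down to comparing snapshots. Here is a minimal Python sketch of that comparison. The helper names `take_snapshot` and `snapshot_diff` and the block IDs are made up, and for simplicity this toy copies the whole path map on every snapshot, whereas the feature described above only tracks the changes.

```python
def take_snapshot(namespace):
    """Freeze a read-only view of path -> blocks. Only metadata is copied, no file data."""
    return {path: tuple(blocks) for path, blocks in namespace.items()}


def snapshot_diff(old, new):
    """List what an incremental backup has to handle between two snapshots."""
    return {
        "created": sorted(p for p in new if p not in old),
        "deleted": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
    }


namespace = {"/data/a.txt": ["blk_1"], "/data/b.txt": ["blk_2"]}
s1 = take_snapshot(namespace)

namespace["/data/a.txt"].append("blk_3")    # file grows by one block
del namespace["/data/b.txt"]                # accidental delete
namespace["/data/c.txt"] = ["blk_4"]        # new file
s2 = take_snapshot(namespace)

diff = snapshot_diff(s1, s2)
print(diff["created"], diff["deleted"], diff["modified"])
# ['/data/c.txt'] ['/data/b.txt'] ['/data/a.txt']

# Protection against user errors: bring the deleted file back from s1.
namespace["/data/b.txt"] = list(s1["/data/b.txt"])
```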
