Hadoop, the solution for making sense of the avalanche of Big Data, has come a long way since Google published its papers on the Google File System in 2003 and MapReduce in 2004. It made waves with its strategy of scaling out rather than scaling up. Work by Doug Cutting and his team at Yahoo and in the Apache Hadoop project popularized MapReduce programming, which is I/O-intensive and poorly suited to interactive analysis and graphics support. These limitations paved the way for the evolution from Hadoop 1 to Hadoop 2. The following table describes the major differences between them:
|No.||Hadoop 1.x||Hadoop 2.x|
|1||Supports the MapReduce (MR) processing model only; does not support non-MR tools.||Supports MR as well as other distributed computing models such as Spark, Hama, Giraph, Message Passing Interface (MPI) and HBase coprocessors.|
|2||MR handles both processing and cluster resource management.||YARN (Yet Another Resource Negotiator) handles cluster resource management, and processing is done by different processing models.|
|3||Scales to a limited number of nodes: up to 4,000 nodes per cluster.||Scales better: up to 10,000 nodes per cluster.|
|4||Works on the concept of slots; a slot can run either a Map task or a Reduce task only.||Works on the concept of containers; a container can run generic tasks.|
|5||A single NameNode manages the entire namespace.||Multiple NameNode servers manage multiple namespaces.|
|6||Has a Single Point of Failure (SPOF) because of the single NameNode; a NameNode failure needs manual intervention to overcome.||Overcomes the SPOF with a standby NameNode; in case of NameNode failure, it can be configured for automatic recovery.|
|7||The MR API is compatible with Hadoop 1.x; a program written for Hadoop 1 runs on Hadoop 1.x without any additional files.||The MR API requires additional files for a program written for Hadoop 1.x to run on Hadoop 2.x.|
|8||Is limited as a platform for event processing, streaming and real-time operations.||Can serve as a platform for a wide variety of data analytics; it is possible to run event processing, streaming and real-time operations.|
|9||A NameNode failure affects the stack.||The Hadoop stack (Hive, Pig, HBase, etc.) is equipped to handle NameNode failure.|
|10||Does not support Microsoft Windows.||Adds support for Microsoft Windows.|
DETAILED DESCRIPTION:
- NameNode in High Availability (HA) mode
The NameNode is the most important node in a Hadoop cluster because it stores all the metadata. If it goes down due to an unplanned event such as a machine crash, the whole Hadoop cluster goes down with it. How can this situation be handled?
Hadoop 2.0 comes with a solution to this problem.
· HDFS now comes with a High Availability feature, which solves this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration (one active NameNode and one hot standby NameNode).
· Both NameNodes share an edits log. All namespace edits are logged to shared NFS storage, and there is only a single writer to this shared storage at any point in time. The passive NameNode reads from this storage and keeps its copy of the cluster metadata up to date. If the active NameNode fails, the passive NameNode becomes the active NameNode and starts writing to the shared storage.
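The single-writer rule and the failover sequence can be illustrated with a small toy model. This is only a sketch of the idea, not Hadoop code: the classes `SharedEditLog` and `NameNode`, the `mkdir` operation and the node names `nn1` and `nn2` are all hypothetical, and the "namespace" is reduced to a set of paths.

```python
class SharedEditLog:
    """Toy stand-in for the shared edits storage: one writer at a time."""

    def __init__(self):
        self.edits = []
        self.writer = None  # name of the node currently allowed to write

    def append(self, node_name, edit):
        if node_name != self.writer:
            raise PermissionError(f"{node_name} is not the active writer")
        self.edits.append(edit)


class NameNode:
    """Toy NameNode (hypothetical): its metadata is just a set of paths."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()
        self.applied = 0  # number of shared edits already replayed

    def mkdir(self, path):
        # Active-only operation: log the edit first, then apply it locally.
        self.log.append(self.name, ("mkdir", path))
        self.namespace.add(path)
        self.applied = len(self.log.edits)

    def catch_up(self):
        # Standby operation: replay the edits that have not been applied yet.
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)

    def become_active(self):
        # Failover: replay outstanding edits, then take over as the only writer.
        self.catch_up()
        self.log.writer = self.name


log = SharedEditLog()
active = NameNode("nn1", log)
standby = NameNode("nn2", log)
active.become_active()

active.mkdir("/data")
active.mkdir("/data/logs")
standby.catch_up()          # the standby follows the shared log

standby.become_active()     # simulate a failure of nn1: nn2 takes over
standby.mkdir("/data/new")  # nn2 is now the single writer
```

After the takeover, a write attempted by `nn1` raises `PermissionError`, which mirrors the requirement that only one NameNode writes to the shared storage at any time.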
- Ability to run non-MapReduce applications on Hadoop 2.0
In Hadoop 1.0, you can run only MapReduce jobs to process the data stored in HDFS; no other data processing model is available. For other kinds of processing, such as real-time or graph analysis on the same data, you need to move the data out of HDFS to some alternate storage such as HBase, because Hadoop 1.0 supports only the MapReduce style of processing.
Hadoop 2.0 introduces a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
Hadoop 2.0 provides YARN APIs for writing other frameworks that run on top of HDFS. This enables running non-MapReduce Big Data applications on Hadoop. Spark, MPI, Giraph and Hama are a few of the applications written or ported to run within YARN.
- Improved Resource Utilization
In Hadoop 1.0, the JobTracker is responsible for both managing the cluster's resources and driving the execution of MapReduce jobs.
YARN splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons:
- a global ResourceManager, and
- a per-application ApplicationMaster.
The ResourceManager (RM) focuses on managing the cluster's resources, while an ApplicationMaster (AM), one per running application, manages each running application (such as a MapReduce job).
There are no more fixed map and reduce slots. YARN provides a central resource manager, so you can now run multiple applications in Hadoop, all sharing a common pool of resources.
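To see why removing fixed slots improves utilization, consider the following toy comparison. It is a simplified sketch, not the actual scheduler logic of either Hadoop version: the function names, the slot counts, the 8192 MB node size and the 2048 MB request size are all assumptions chosen for illustration.

```python
def run_with_slots(tasks, map_slots, reduce_slots):
    """Hadoop 1 style: slots are typed, so a free reduce slot cannot run a map task."""
    free = {"map": map_slots, "reduce": reduce_slots}
    running = []
    for kind in tasks:
        if free.get(kind, 0) > 0:
            free[kind] -= 1
            running.append(kind)
    return running


def run_with_containers(requests, node_memory_mb):
    """YARN style: any task can run as long as the node has enough free memory."""
    free = node_memory_mb
    running = []
    for kind, memory_mb in requests:
        if memory_mb <= free:
            free -= memory_mb
            running.append(kind)
    return running


# Three map tasks and one non-MapReduce task are waiting to run on one node.
pending = ["map", "map", "map", "spark-executor"]

# Two map slots and two reduce slots: the reduce slots stay idle.
slot_result = run_with_slots(pending, map_slots=2, reduce_slots=2)

# One 8192 MB node, each task asking for a 2048 MB container.
container_result = run_with_containers(
    [(kind, 2048) for kind in pending], node_memory_mb=8192
)

print(slot_result)       # ['map', 'map']
print(container_result)  # ['map', 'map', 'map', 'spark-executor']
```

In the slot model, two reduce slots sit unused while a map task waits and the non-MapReduce task cannot be placed at all. In the container model, the same node runs all four tasks because capacity is expressed as a generic resource.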
- Native Windows Support
Hadoop was originally developed to support the UNIX family of operating systems. With Hadoop 2, the Windows operating system is natively supported. This significantly extends the reach of Hadoop to the sizable Windows Server market.
- Beyond batch-oriented applications: In version 2.0, Hadoop goes beyond its batch-oriented nature and can now also run interactive and streaming applications.
- HDFS Federation
The Hadoop cluster storage subsystem has been generalized to support other frameworks besides HDFS. Similar to YARN, the new storage architecture generalizes the block storage layer so that it can be used not only by HDFS but also by other storage services. The first use of this feature is HDFS federation, which allows multiple instances of HDFS namespaces to share the underlying storage. In future versions of Hadoop, other storage services (such as key-value storage) will use the same storage layer.
- HDFS: Multiple Storage Types
One more fundamental change is the support for heterogeneous storage.
Hadoop 1.0 treated all storage devices on a DataNode (be they spinning disks or SSDs) as a single uniform pool; although one could store data on an SSD, one could not control which data went there. From Hadoop 2.0 onwards, the system distinguishes between storage types and makes the storage type information available to frameworks and applications so that they can take advantage of storage properties. Indeed, the approach is general enough to treat even memory as a storage tier for cached and temporary data.
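The difference between a uniform pool and typed storage can be sketched as follows. This is a hypothetical illustration only: the `Volume` class, the `choose_volume` function, the type labels `"SSD"` and `"DISK"`, and the paths and sizes are assumptions, not HDFS identifiers or its actual placement logic.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    path: str
    storage_type: str  # illustrative labels such as "SSD" or "DISK"
    free_bytes: int


def choose_volume(volumes, preferred_types, size):
    """Return the first volume with enough space, trying storage types in preference order."""
    for storage_type in preferred_types:
        for volume in volumes:
            if volume.storage_type == storage_type and volume.free_bytes >= size:
                return volume
    return None  # no volume of any acceptable type has room


volumes = [
    Volume("/disk1", "DISK", free_bytes=1000),
    Volume("/ssd1", "SSD", free_bytes=100),
]

hot = choose_volume(volumes, ["SSD", "DISK"], size=50)       # fits on the SSD
large = choose_volume(volumes, ["SSD", "DISK"], size=500)    # falls back to disk
cold = choose_volume(volumes, ["DISK"], size=50)             # disk only

print(hot.path, large.path, cold.path)  # /ssd1 /disk1 /disk1
```

Because each volume carries its type, a caller can state a preference such as "SSD first, then disk", which is not possible when all devices are treated as one pool.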
- Faster access to data: DataNode caching
Users and applications (such as Hive, Pig or HBase) can now identify a set of files that need to be cached. For example, dimension tables in Hive can be configured for caching in DataNode RAM, enabling quick reads for Hive queries against these frequently looked-up tables.
- HDFS Snapshots
Hadoop 2 adds support for file system snapshots. A snapshot is a point-in-time image of the entire file system or of a subtree of the file system. A snapshot has many uses:
- Protection against user errors: An admin can set up a process to take snapshots periodically. If a user accidentally deletes files, these can be restored from the snapshot that contains the files.
- Backup: If an admin wants to back up the entire file system or a subtree in the file system, the admin takes a snapshot and uses it as the starting point of a full backup. Incremental backups are then taken by copying the difference between two snapshots.
- Disaster recovery: Snapshots can be used for copying consistent point-in-time images over to a remote site for disaster recovery.
The snapshots feature supports read-only snapshots; it is implemented only in the NameNode, and no copy of data is made when the snapshot is taken. Snapshot creation is instantaneous. All the changes made to the snapshotted directory are tracked using modified persistent data structures to ensure efficient storage on the NameNode.
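The recovery and incremental backup uses described above can be sketched with a toy model. This is not how HDFS implements snapshots: the `ToyFileSystem` class and its methods are hypothetical, and for simplicity the toy copies the whole path table at snapshot time, whereas the text above describes tracking only the changes. What the sketch does preserve is that file contents are shared rather than copied.

```python
class ToyFileSystem:
    """Hypothetical in-memory file system with read-only snapshots."""

    def __init__(self):
        self.files = {}      # path -> content (stands in for the data blocks)
        self.snapshots = {}  # snapshot name -> frozen path table

    def write(self, path, content):
        self.files[path] = content

    def delete(self, path):
        del self.files[path]

    def create_snapshot(self, name):
        # Only the path table is copied; the content objects are shared.
        self.snapshots[name] = dict(self.files)

    def restore(self, name, path):
        # Bring a single file back from a snapshot, e.g. after an accidental delete.
        self.files[path] = self.snapshots[name][path]

    def diff(self, old, new):
        """Return (created or modified paths, deleted paths) between two snapshots."""
        before, after = self.snapshots[old], self.snapshots[new]
        changed = sorted(p for p in after if p not in before or before[p] != after[p])
        deleted = sorted(p for p in before if p not in after)
        return changed, deleted


fs = ToyFileSystem()
fs.write("/a", "v1")
fs.write("/b", "v1")
fs.create_snapshot("s1")

fs.delete("/a")        # accidental delete
fs.write("/b", "v2")   # modification
fs.write("/c", "v1")   # new file
fs.create_snapshot("s2")

changed, deleted = fs.diff("s1", "s2")
print(changed, deleted)  # ['/b', '/c'] ['/a']

fs.restore("s1", "/a")   # recover the deleted file from the older snapshot
```

An incremental backup would copy only the paths in `changed` and remove those in `deleted` at the destination, rather than copying the whole tree again.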