Anybody working in the area of Big Data knows what MapReduce does and what its shortcomings are. In fairness, "shortcomings" is a harsh word: MapReduce, along with HDFS, was a phenomenon when it was released. But Spark has since taken the world by storm, so now is a good time to understand the differences between Spark and MapReduce.
What is MapReduce? It is the part of the Hadoop framework responsible for processing large data sets with a parallel, distributed algorithm on a cluster. As the name suggests, the MapReduce algorithm contains two important tasks: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce then takes the output of a Map as its input and combines the data tuples into a smaller set of tuples. In MapReduce, the data is distributed over the cluster and processed.
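The two phases can be sketched as a toy, single-machine illustration in plain Python (this mimics the model, not Hadoop's actual API; the function names here are made up for illustration):

```python
from collections import defaultdict

# Map: break each record into (key, value) tuples.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group values by key (the Hadoop framework does this between phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine each key's values into a smaller set of tuples.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

The classic word count shown here is the "hello world" of MapReduce: the Map output is large, and Reduce collapses it down per key.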
The key difference in Spark is that it processes data in memory. In-memory processing is faster because no time is spent moving data and intermediate results in and out of disk, whereas MapReduce spends a lot of time on exactly these input/output operations, which increases latency.
Real-Time Big Data Analysis:
Real-time data analysis means processing data generated by real-time event streams coming in at the rate of millions of events per second, Twitter data for instance. The strength of Spark lies in its ability to support streaming of data along with distributed processing. This is a useful combination that delivers near real-time processing of data. MapReduce lacks such an advantage, as it was designed to perform batch-oriented distributed processing on large amounts of data. Real-time data can still be processed on MapReduce, but its speed is nowhere close to that of Spark.
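The micro-batch idea behind near real-time processing can be sketched in plain Python (a toy simulation, not Spark Streaming's API; the event stream here is invented):

```python
# Toy micro-batch loop: Spark Streaming chops a live stream into small
# batches and runs the same computation on each batch as it arrives.
stream = [["click", "view"], ["view"], ["click", "click"]]  # pretend arrivals

totals = {}
for batch in stream:                      # one micro-batch at a time
    for event in batch:
        totals[event] = totals.get(event, 0) + 1
    # In a real streaming job, updated totals would be pushed downstream here.

print(totals)  # {'click': 3, 'view': 2}
```

The point is that the running state (`totals`) stays in memory between batches, which is exactly what a batch-only MapReduce job cannot do.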
Spark claims to process data 100x faster than MapReduce in memory, and 10x faster on disk.
Most graph processing algorithms, like PageRank, perform multiple iterations over the same data, and this requires a message passing mechanism. We need to program MapReduce explicitly to handle such multiple iterations over the same data. Roughly, it works like this: read data from disk, and after a particular iteration, write results to HDFS, then read data back from HDFS for the next iteration. This is very inefficient, since it involves reading and writing data to disk, which means heavy I/O operations and data replication across the cluster for fault tolerance. Also, each MapReduce iteration has very high latency, and the next iteration can begin only after the previous job has completely finished.
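The iteration pattern looks roughly like this in plain Python (a toy PageRank over a made-up three-node graph; in MapReduce, each pass of the outer loop would be a separate job reading from and writing to HDFS):

```python
# Toy PageRank: each iteration re-reads the whole graph and the current
# ranks -- the data MapReduce would re-load from disk on every pass.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # node -> outgoing links
ranks = {node: 1.0 for node in graph}
damping = 0.85

for _ in range(20):                       # multiple iterations over the same data
    contribs = {node: 0.0 for node in graph}
    for node, links in graph.items():     # each node "messages" its neighbors
        share = ranks[node] / len(links)
        for dest in links:
            contribs[dest] += share
    ranks = {node: (1 - damping) + damping * c for node, c in contribs.items()}

print(ranks)
```

Keeping `graph` and `ranks` in memory across iterations is precisely the work Spark's caching avoids repeating.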
Also, evaluating the score of a particular node requires messages containing the scores of its neighboring nodes (or data from across multiple stages of the job), a message passing mechanism that MapReduce lacks. Dedicated graph processing tools such as Pregel and GraphLab were designed to address the need for an efficient platform for graph processing algorithms. These tools are fast and scalable, but are not efficient for the creation and post-processing of these complex multi-stage algorithms.
The introduction of Apache Spark solved these problems to a great extent. Spark contains a graph computation library called GraphX which simplifies our life. In-memory computation along with built-in graph support improves the performance of such algorithms by one to two orders of magnitude over traditional MapReduce programs. Spark uses a combination of Netty and Akka for distributing messages throughout the executors. Let’s look at some statistics that depict the performance of the PageRank algorithm using Hadoop and Spark.
Iterative Machine Learning Algorithms:
Almost all machine learning algorithms work iteratively. As we have seen earlier, iterative algorithms involve I/O bottlenecks in MapReduce implementations. MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavy for iterative algorithms. Spark, with the help of Mesos (a distributed systems kernel), caches the intermediate dataset after each iteration and runs multiple iterations on this cached dataset, which reduces the I/O and helps run the algorithm faster in a fault tolerant manner.
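The caching benefit can be illustrated with a toy gradient-descent loop in plain Python (the dataset is loaded once and reused across iterations, which is what Spark's caching enables at cluster scale; a MapReduce implementation would re-read it from disk each pass):

```python
# Toy linear regression via gradient descent on y = 2x.
data = [(x, 2.0 * x) for x in range(1, 6)]    # "cached" dataset, loaded once

w = 0.0                                        # model weight to learn
lr = 0.01                                      # learning rate
for _ in range(200):                           # every iteration reuses the cached data
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # -> 2.0, the true slope
```

On a cluster, each iteration of the loop would be a distributed job; without caching, the `data` term would be re-read from HDFS 200 times.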
Spark has a built-in scalable machine learning library called MLlib which contains high-quality algorithms that leverage iteration and yield better results than the one-pass approximations sometimes used on MapReduce.
Spark RDD: RDD stands for Resilient Distributed Dataset, which can recompute lost data from its lineage and complete a task instead of having to start from scratch in case of failure during distributed/batch processing.
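A toy model of this lineage-based recovery (not Spark's actual implementation; the `ToyRDD` class and its methods are invented for illustration):

```python
# Toy model of RDD lineage: each "RDD" remembers how it was derived, so a
# lost partition can be recomputed from its parent instead of restarting the job.
class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions          # list of lists (one list per partition)
        self.lineage = lineage                # (parent, transform) or None

    def map(self, fn):
        parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(parts, lineage=(self, fn))

    def recompute(self, i):
        parent, fn = self.lineage             # re-derive partition i from the parent
        self.partitions[i] = [fn(x) for x in parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

doubled.partitions[1] = None                  # simulate losing one partition
doubled.recompute(1)                          # rebuild it from lineage alone
print(doubled.partitions)  # [[2, 4], [6, 8]]
```

Only the lost partition is recomputed; the surviving partitions and the rest of the job are untouched.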
Spark MLlib: Spark provides a built-in library of machine learning algorithms which run much faster than MapReduce programs, because they execute in memory while MapReduce programs repeatedly move data in and out of disk.
SparkSQL: SparkSQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. Spark SQL allows Spark programmers to leverage the benefits of relational processing (such as declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark.
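The flavor of combining declarative and functional styles can be illustrated at small scale with the standard library's sqlite3 (an analogy only, not Spark SQL itself, which runs this combination over distributed data):

```python
import sqlite3

# Declarative + functional in one program, in miniature.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, score INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 10), ("bob", 3), ("ann", 7)])

# Declarative step: a relational aggregation, optimized by the engine.
rows = conn.execute(
    "SELECT user, SUM(score) FROM events GROUP BY user ORDER BY user").fetchall()

# Functional step: post-process the relational result in ordinary code.
top = [user for user, total in rows if total > 5]
print(top)  # ['ann']
```

Spark SQL lets the two styles interleave freely, so the "functional step" can itself be another distributed computation.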
SparkR: This is one of Spark’s latest additions, providing an integration with R that allows programmers to run their analyses on big data. The analysis can be performed on a cluster from R, for example using RStudio.
Here are results from a survey on Spark conducted by Typesafe to better understand the trends and growing demand for Spark.
Having said that, Hadoop MapReduce and Apache Spark are not competing with one another. In fact, they complement each other quite well. Spark cannot completely replace Hadoop and the good news is that the demand for Spark is currently at an all-time high.
Got a question for us? Please mention it in the comments section and we will get back to you.