Beginner’s Guide for Spark

In this blog we discuss the basics of Spark’s functionality and how to install it.

Apache Spark is a cluster computing framework that can run on top of the Hadoop ecosystem and handles different types of data. It aims to be a one-stop solution for many data processing problems. Spark has rich libraries for handling data and, most importantly, it can be 10-100x faster than Hadoop’s MapReduce on workloads that fit in memory. It attains this speed through its in-memory primitives: data is cached in memory (RAM), and computations are performed on it there instead of writing intermediate results to disk.

Spark covers most of the workloads that need separate components in Hadoop. For example, we can perform batch processing in Spark, and real-time data processing using its own streaming engine, called Spark Streaming.

We can perform various functions with Spark:

  • SQL operations: Spark has its own SQL engine, called Spark SQL, which supports SQL queries as well as Hive features such as HiveQL.

  • Machine Learning: Spark has a machine learning library, MLlib, so it can perform machine learning without the help of Apache Mahout.

  • Graph processing: Spark performs graph processing using the GraphX component.

All of the above components are built into Spark.

Spark can run on different cluster managers, such as Hadoop YARN and Apache Mesos. It also has its own standalone scheduler to get started if no other framework is available. Spark makes it easy to access and store data: it can work with many storage systems, for example HDFS, HBase, MongoDB and Cassandra, and it can also store data in the local file system.

Resilient Distributed Datasets

A Resilient Distributed Dataset (RDD) is a simple, immutable, distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In Spark, all work is expressed as operations on RDDs.

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.

Let’s now look at the features of Resilient Distributed Datasets:

  • In Hadoop we store the data as blocks on different data nodes. In Spark, we instead split RDDs into partitions, which are held on the worker nodes and computed in parallel across the cluster.

  • In Hadoop we need to replicate the data for fault recovery. In Spark, replication is not required for this purpose, because an RDD can recover lost data by recomputing it.

  • RDDs load the data for us and are resilient, which means they can be recomputed.

  • RDDs support two types of operations: transformations, which create a new dataset from an existing RDD, and actions, which return a value to the driver program after running a computation on the dataset.

  • RDDs keep track of the transformations used to build them (their lineage) and can be checkpointed periodically. If a node fails, Spark can rebuild the lost RDD partitions on other nodes, in parallel.
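To make the ideas of transformations, actions and lineage-based recovery concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's API: the class ToyRDD and its method compute_partition are hypothetical names invented for this illustration, and real Spark distributes the partitions across worker nodes instead of keeping them in one process.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, source: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[list], list]] = None):
        self._source = source   # partitions of the original input (root only)
        self._parent = parent   # the dataset this one was derived from
        self._fn = fn           # per-partition function applied to the parent

    @property
    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute_partition(self, i: int) -> list:
        """Rebuild partition i by replaying the lineage from the source data."""
        if self._parent is None:
            return list(self._source[i])
        return self._fn(self._parent.compute_partition(i))

    # Actions: run the computation and return a value to the caller.
    def collect(self) -> list:
        return [x for i in range(self.num_partitions)
                for x in self.compute_partition(i)]

    def count(self) -> int:
        return len(self.collect())


numbers = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])   # one dataset, two partitions
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

result = evens_squared.collect()                  # action: triggers the work
# Pretend partition 1 was lost with its node: only that partition is recomputed.
rebuilt = evens_squared.compute_partition(1)
```

Because each dataset remembers its parent and the function that produced it, a single lost partition can be recomputed without keeping a replicated copy of the data.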

RDDs can be created in two different ways:

  • By referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

  • By parallelizing a collection of objects (a list or a set) in the driver program.
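As a rough illustration of what parallelizing a collection means, the sketch below splits a list into contiguous partitions that could each be processed independently. The function split_into_partitions is a hypothetical helper written for this post; it is plain Python and makes no claim about how Spark itself slices a collection.

```python
from typing import List, Sequence


def split_into_partitions(items: Sequence, num_partitions: int) -> List[list]:
    """Split items into num_partitions contiguous, roughly equal slices."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(items)
    return [
        list(items[i * n // num_partitions:(i + 1) * n // num_partitions])
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(list(range(1, 11)), 3)

# Each partition can be handled on its own (here: summed),
# and the partial results are then combined.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
```

In a real cluster the per-partition work would run on different worker nodes, and only the small partial results would travel back to the driver program.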

Lazy evaluation in RDD

Creating an RDD from an existing RDD is called a transformation. Until you call an action, the RDD is not materialized: Spark delays the computation until you actually ask for a result. One reason is interactive use: if you type something wrong and have to correct it, eagerly computing every step would waste time and cause unnecessary delays. Another is that, by seeing the whole chain of transformations before running it, Spark can optimize the required calculations and make decisions that are not possible with line-by-line execution. Spark also recovers from failures and slow workers.
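The effect of lazy evaluation can be sketched in plain Python. The class LazyDataset below is a hypothetical toy, not Spark code: it only records transformations, and the action take(n) pulls just enough elements through the recorded steps to produce n results. The list calls records how often the mapped function really runs.

```python
calls = []  # records every time the mapped function actually runs


def traced_double(x):
    calls.append(x)
    return x * 2


class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)   # pending transformations, in order

    def map(self, f):
        return LazyDataset(self._data, self._steps + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def take(self, n):
        """Action: process elements one by one until n results are found."""
        out = []
        if n <= 0:
            return out
        for x in self._data:
            keep = True
            for kind, f in self._steps:
                if kind == "map":
                    x = f(x)
                elif not f(x):       # a filter step rejected this element
                    keep = False
                    break
            if keep:
                out.append(x)
                if len(out) == n:
                    break
        return out


pipeline = (LazyDataset(range(1, 1_000_001))
            .map(traced_double)
            .filter(lambda v: v % 3 == 0))

calls_before_action = len(calls)   # nothing has been computed yet
first_two = pipeline.take(2)       # only now does any work happen
calls_after_action = len(calls)    # a handful of calls, not one million
```

Because the whole chain is known before anything runs, the toy can stop as soon as two results exist. An eager, line-by-line version would have doubled all one million numbers first.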

Architecture of Apache Spark

A Spark application consists of two kinds of programs: a driver program and worker programs. A cluster manager sits in between and is used to reach the workers on the cluster nodes. The SparkContext in the driver communicates with the worker nodes through the cluster manager.

The SparkContext acts like a master and the Spark workers act like slaves. Workers contain the executors that run the job. If any dependencies or arguments have to be passed to them, the SparkContext takes care of that. RDDs reside on the Spark executors. You can also run Spark applications locally using local threads, and if you want to take advantage of a distributed environment you can use S3, HDFS or any other shared storage system.

Life cycle of a Spark program:

  1. Some input RDDs are created from external data or by parallelizing a collection of objects in the driver program.

  2. These RDDs are lazily transformed into new RDDs using transformations like filter() or map().

  3. Spark caches any intermediate RDDs that will need to be reused.

  4. Actions such as count() and collect() are launched to kick off a parallel computation, which is then optimized and executed by Spark.
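The four steps above can be mimicked in a few lines of plain Python, with the focus on step 3. The class CachedDataset and its times_computed counter are hypothetical names used only for this sketch. It is not Spark's API, and the three input lines are made-up sample data.

```python
class CachedDataset:
    """Toy dataset: computed on demand, optionally kept in memory for reuse."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function that builds the data
        self._cache = None
        self._cache_enabled = False
        self.times_computed = 0        # how often the data was built from scratch

    def cache(self):
        self._cache_enabled = True     # only a marker: nothing is computed here
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        self.times_computed += 1
        data = list(self._compute())
        if self._cache_enabled:
            self._cache = data
        return data

    # Actions
    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


# Step 1: input data (illustrative sample lines).
lines = ["spark is fast", "hadoop stores blocks", "spark caches data"]


# Step 2: a lazily defined transformation (filter, then map).
def spark_lines_upper():
    return (line.upper() for line in lines if "spark" in line)


# Step 3: mark the intermediate result for reuse.
cached = CachedDataset(spark_lines_upper).cache()
uncached = CachedDataset(spark_lines_upper)

# Step 4: two actions on each dataset.
n_cached, rows_cached = cached.count(), cached.collect()
n_uncached, rows_uncached = uncached.count(), uncached.collect()
```

Both datasets return the same answers, but the cached one is built once and reused by the second action, while the uncached one is rebuilt for every action.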

Let’s now discuss the steps to install Spark on your cluster:

Step-by-step process to install Spark


Before installing Spark, Scala needs to be installed on the system. Follow the steps below to install Scala.

1. Open a terminal on your CentOS machine.

To download Scala, type the command below:


2. Extract the downloaded tar file using the command

After extracting, add the path of Scala to the .bashrc file.

After setting the path, save the file and type the command below:

The above command completes the Scala installation. Next, we need to install Spark.

Follow the steps below to download and install a single-node Spark cluster on CentOS.

1. Open the browser and go to the link

Download spark-1.5.1-bin-hadoop2.6.tgz

The file will be downloaded into the Downloads folder.

Go to the Downloads folder and untar the downloaded file using the command below:

After untarring the file, move the extracted folder to the home folder using the command below:

Now the folder is in the home folder.

We need to update the path for Spark in .bashrc in the same way as we did for Scala.

Refer to the screenshot below for updating the path in .bashrc.

After adding the path for Spark, type the command source .bashrc; refer to the screenshot for the same.

Make a folder named ‘work’ in HOME using the command below:

Inside the work folder, make another folder named ‘sparkdata’ using the command

Set the permissions of the sparkdata folder to 777 using the command

Now move into the conf directory of the Spark folder using the command below:

Type the command ls to see the files inside the conf folder:

There will be a file by name , we need to copy that file by name using the below command:

Edit the file using the command below

and make the configuration as follows

Note: Make sure that you give the paths of Java and Scala correctly. After editing, save and close the file.

Let’s follow the steps below to start the Spark single-node cluster. Move to the sbin directory of the Spark folder using the command below:

Inside sbin, type the command below to start the Master and Slave daemons.

Now the Spark single-node cluster will start with one Master and two Workers.

You can check whether the cluster is running by using the command below


If the Master and Worker nodes are running, then you have successfully started the Spark single-node cluster.
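The check can also be scripted. The sketch below assumes, and this is only an assumption since the post does not name the command, that the running Java processes are listed with the JDK's jps tool, which prints one "<pid> <name>" pair per line. The helper names count_spark_daemons and cluster_is_up are hypothetical, and the sample text is illustrative with made-up process IDs.

```python
from collections import Counter


def count_spark_daemons(jps_output: str) -> Counter:
    """Count Master and Worker entries in jps-style output ("<pid> <name>" per line)."""
    counts = Counter()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] in ("Master", "Worker"):
            counts[parts[1]] += 1
    return counts


def cluster_is_up(jps_output: str, expected_workers: int = 2) -> bool:
    """True if exactly one Master and at least expected_workers Workers are listed."""
    counts = count_spark_daemons(jps_output)
    return counts["Master"] == 1 and counts["Worker"] >= expected_workers


# Illustrative sample only (assumed format; the process IDs are made up).
sample = """4021 Master
4188 Worker
4190 Worker
4302 Jps"""

daemons = count_spark_daemons(sample)
status = cluster_is_up(sample)
```

The default of two expected workers matches the one Master and two Workers described above; pass a different expected_workers value for another setup.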

We hope this blog helped you get a basic understanding of Spark and how to install it.

