Configuring Replication Factor and Block Size in HDFS

In this post, we will discuss how to configure the replication factor and block size for the entire cluster, as well as for individual directories and files in HDFS.

Hadoop Distributed File System (HDFS) stores files as blocks and distributes them across the entire cluster. Because HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated several times to ensure high data availability.

Before going ahead, it is important to know the basics: what the replication factor, blocks, and block size are. So, let’s get a clear picture of them first.

Blocks and Block Size:

HDFS is designed to store and process very large data sets. The default block size is 64MB in Hadoop 1.x and 128MB in Hadoop 2.x and later, and it can be changed for the Hadoop cluster. All blocks in a file, except the last block, are of the same size. When you store a file in HDFS, the system breaks it down into a set of individual blocks and stores these blocks on various slave nodes in the Hadoop cluster. For example, with a 64MB block size, a 200MB file is stored as three 64MB blocks plus a final 8MB block.

Block Size Configuration for Entire Cluster:

If you want to set a specific block size for the entire cluster, you need to add a property to hdfs-site.xml as shown below.

[Screenshot: hdfs-site.xml with the dfs.block.size property set to 128MB]

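Here, we have set dfs.block.size to 128MB. This applies to the entire cluster. Note that dfs.block.size is the older property name; in Hadoop 2.x and later it is deprecated in favor of dfs.blocksize.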
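As a rough sketch of what that entry could look like, the fragment below sets the block size to 128MB expressed in bytes (128 × 1024 × 1024 = 134217728). It assumes the property sits inside the top-level <configuration> element of hdfs-site.xml and uses the property name from this post, so treat the exact layout as illustrative rather than a copy of any particular cluster's file.

```xml
<!-- hdfs-site.xml (sketch): cluster-wide default block size -->
<configuration>
  <property>
    <!-- Older property name used in this post; newer releases call it dfs.blocksize -->
    <name>dfs.block.size</name>
    <!-- 128MB in bytes: 128 * 1024 * 1024 -->
    <value>134217728</value>
  </property>
</configuration>
```

Any other properties already in your hdfs-site.xml stay alongside this one inside the same <configuration> element.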

Changing the dfs.block.size property in hdfs-site.xml changes the default block size for all files subsequently placed into HDFS. It does not affect the block size of files already in HDFS; it applies only to files written after the setting takes effect.

 

Replication Factor:

The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at the time of creation of the file and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The replication factor is a property set in the HDFS configuration file, which lets you adjust the global default replication factor for the entire cluster. With a replication factor of n, each block stored in HDFS has n – 1 additional copies distributed across the cluster. For example, with a replication factor of 3, each block is stored three times in total: the original plus two copies.

Example:

If you want to set the replication factor to 4 for the entire cluster, you need to specify it in hdfs-site.xml using the dfs.replication property.

[Screenshot: hdfs-site.xml with the replication factor set to 4]
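A minimal sketch of that setting follows, assuming the standard dfs.replication property placed inside the <configuration> element of hdfs-site.xml. The layout is illustrative, not a copy of any particular cluster's file.

```xml
<!-- hdfs-site.xml (sketch): cluster-wide default replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- Each block is stored 4 times in total -->
    <value>4</value>
  </property>
</configuration>
```

As with the block size, this default should apply to files created after the change; files already in HDFS keep the replication factor they were written with until it is changed explicitly.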

We can also change the replication factor of an individual file or directory that is already in HDFS, for example with the hdfs dfs -setrep command.

 

Regards

Anand Pandey

 
