In this post, we will discuss how to configure the replication factor and block size for an entire HDFS cluster, as well as for individual directories and files.
Hadoop Distributed File System (HDFS) stores files as blocks and distributes them across the entire cluster. Because HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated several times to ensure high data availability.
Before going further, it is important to understand what blocks, block size, and the replication factor are, so let's get a clear picture of these terms first.
Blocks and Block Size:
HDFS is designed to store and process huge data sets. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later, and it can be changed for the cluster. All blocks in a file, except the last block, are of the same size. When you store a file in HDFS, the system breaks it down into a set of individual blocks and stores these blocks on various slave nodes (DataNodes) in the Hadoop cluster.
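Assuming you have a running cluster, you can see how a stored file was split into blocks with the hdfs fsck command. This is a sketch; the path used here is hypothetical.

```shell
# Inspect the block layout of a file already stored in HDFS:
# -files lists the file, -blocks lists each block, -locations
# shows which DataNodes hold the replicas.
# /user/demo/data.csv is a hypothetical example path.
hdfs fsck /user/demo/data.csv -files -blocks -locations
```

The output reports the number of blocks, their sizes, and the replica placement, which is a handy way to confirm the settings discussed below.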
Block Size Configuration for Entire Cluster:
If you want to set a specific block size for the entire cluster, you need to add the dfs.block.size property to hdfs-site.xml.
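A minimal hdfs-site.xml entry might look like the following. The value is given in bytes (134217728 bytes = 128 MB); note that in Hadoop 2.x and later the property is named dfs.blocksize, with dfs.block.size kept as a deprecated alias.

```xml
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, applied cluster-wide to new files -->
</property>
```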
Here, we have set dfs.block.size to 128 MB, which applies to the entire cluster.
Changing the dfs.block.size property in hdfs-site.xml changes the default block size for all files subsequently written to HDFS. It does not affect the block size of files already in HDFS; it applies only to files placed after the setting takes effect.
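Since the cluster-wide value only applies to newly written files, you can also override the block size for a single upload by passing a generic -D option to the HDFS shell. This is a sketch; the file and destination paths are hypothetical.

```shell
# Upload one file with a 256 MB block size (268435456 bytes),
# overriding the cluster default just for this file.
# dfs.blocksize is the Hadoop 2.x+ property name.
hdfs dfs -D dfs.blocksize=268435456 -put localfile.txt /user/demo/
```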
The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at the time of creation of the file and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
The replication factor is a property that can be set in the HDFS configuration file, which also lets you adjust the global replication factor for the entire cluster. With a replication factor of n, each block stored in HDFS has n − 1 duplicate copies distributed across the cluster.
If you want to set 4 as the replication factor for the entire cluster, specify it in hdfs-site.xml using the dfs.replication property:

<property>
  <name>dfs.replication</name>
  <value>4</value> <!-- Here you need to set the replication factor for the entire cluster. -->
</property>
We can also change the replication factor of an individual file or directory.
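For example, the hdfs dfs -setrep command changes the replication factor of data already stored in HDFS. The paths below are hypothetical; -w waits until the re-replication completes, and -R applies the change recursively.

```shell
# Set replication factor 2 for one file and wait until HDFS
# has finished adjusting the replicas.
hdfs dfs -setrep -w 2 /user/demo/data.csv

# Apply replication factor 3 recursively to every file in a directory.
hdfs dfs -setrep -R 3 /user/demo
```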