In this blog, we will discuss merging files in HDFS into a single file. Before proceeding further, we recommend you refer to our blogs on HDFS. The links are provided below:
Merging multiple files is useful when you want to retrieve the output of a MapReduce computation with multiple reducers, where each reducer produces a part of the output.
The HDFS getmerge command can copy the files present in a given path in HDFS to a single concatenated file in the local filesystem.
The getmerge command has the following syntax:

hadoop fs -getmerge [-nl] <source directory path> <local destination path>

For example:

hadoop fs -getmerge /user/hadoop/demo_files merged.txt
The getmerge command takes three parameters:
- <source directory path> is the HDFS path to the directory that contains the files to be concatenated
- <local destination path> is the local filename of the merged file
- -nl is an optional flag that appends a newline after the content of each file
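To make the effect of -nl concrete, here is a rough local analogue in plain shell. It does not touch HDFS; the directory and file names are invented for illustration. It just mimics the concatenation getmerge performs, with a newline appended after each file's content:

```shell
#!/bin/sh
# Local sketch of what getmerge -nl does: concatenate every file in a
# directory and append a newline after each file's content.
# Directory and file names here are hypothetical.
set -e
dir=$(mktemp -d)
printf 'first file'  > "$dir/a.txt"
printf 'second file' > "$dir/b.txt"

merged="$dir/merged.txt"
: > "$merged"
for f in "$dir"/*.txt; do
    [ "$f" = "$merged" ] && continue   # skip the output file itself
    cat "$f" >> "$merged"
    printf '\n' >> "$merged"           # the extra newline that -nl adds
done

cat "$merged"
```

Without the `printf '\n'` line, the two files' contents would run together on one line, which is exactly the difference between running getmerge with and without -nl.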
Steps to merge the files
First, we need to place more than one file inside an HDFS directory.
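If you want to stage sample files yourself, they can be uploaded with the standard HDFS shell commands. This assumes a running Hadoop cluster; the /user/hadoop/demo_files path and the local file names are assumptions for this walkthrough:

```shell
# Create a demo directory in HDFS and upload three local files into it.
# The directory path and file names are assumptions for this example.
hdfs dfs -mkdir -p /user/hadoop/demo_files
hdfs dfs -put acadgild hadoop FlumeData /user/hadoop/demo_files/
hdfs dfs -ls /user/hadoop/demo_files
```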
In the figure below, you can see three files named acadgild, hadoop and FlumeData, which we will merge.
The content of the files is shown in the screenshot below.

We now type the command shown in the screenshot to merge the files. We have used the optional -nl flag to add an extra newline after the content of each file.
A file with the merged content will be created at the destination path on your local machine. In this case, a new file named merged_file is created, containing the content of acadgild, hadoop and FlumeData.
You can open the file directly to see the merged content. Refer to the figure below.

As the figure shows, a single file has been created by merging the content of the three individual files.
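As a quick sanity check on the result, note that plain concatenation preserves every byte: without -nl, the merged file's size equals the sum of the source files' sizes. Here is a local analogue of that check, with invented file names and contents:

```shell
#!/bin/sh
# Local check: concatenation preserves every byte, so the merged file's
# byte count equals the sum of the parts. Names and contents are invented.
set -e
dir=$(mktemp -d)
printf 'hello' > "$dir/acadgild"    # 5 bytes
printf 'world!' > "$dir/hadoop"     # 6 bytes

cat "$dir/acadgild" "$dir/hadoop" > "$dir/merged_file"

a=$(wc -c < "$dir/acadgild")
b=$(wc -c < "$dir/hadoop")
m=$(wc -c < "$dir/merged_file")
echo "$a + $b = $m"
```

If you ran getmerge with -nl instead, the merged file would be larger by one byte per source file, one for each appended newline.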