Loading Files To Local File System Using Flume

Apache Flume is one of the most preferred options for providing a distributed, reliable, and available service for the efficient collection, aggregation, and movement of large volumes of data. Moving data in large volumes is a complex task, and Flume is configured to minimize the latency of the transfer.
First, we’ll see how to set up Flume.
Users can follow Installation of Flume and Fetching Data from Twitter Using Flume to understand the Flume installation steps and how to use it.
We also recommend going through our blog on using Flume to copy a file from the local file system to HDFS using a spool directory.

Before we proceed, let us understand the architecture of Flume.

  • Event – A singular unit of data that is transported by Flume (usually a single log entry).
  • Source – The entity through which data enters the Flume channel. Sources either actively poll for data or passively wait for data to be delivered to them.
  • Sink – The unit that delivers the data to the destination by streaming it to a range of destinations. Example: the HDFS sink writes events to HDFS.
  • Channel – The connecting medium between the Source and the Sink. An event is ingested into the Channel from the Source and then drained from the Channel into the Sink. (The location used by the sink is specified in the Flume configuration file.)
  • Agent – A physical Java virtual machine running Flume; it is a collection of Sources, Sinks, and Channels.
  • Client – The component that produces and transmits the Event to a Source operating within the Agent.

In this post, we create a spool directory to transfer files locally. This file transfer is also called “rolling of files.”

  • First, create a configuration file inside the conf directory of the Twitter folder.

In this case, we have named the configuration file as “AcadgildLocal.conf.” Continue reading for more details about the configuration file.

*Note: Create two directories named source_sink_dir and destination_sink_dir, and update their paths in the configuration file.
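On a typical single-node setup, the two directories can be created from the terminal as sketched below. The home-directory locations are only an assumption; use whatever paths you prefer, as long as the configuration file points at the same locations.

    mkdir -p ~/source_sink_dir       # spool directory the source watches
    mkdir -p ~/destination_sink_dir  # directory the file_roll sink writes to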


Refer to the screenshot below, which displays the configuration file with a memory channel defined for the agent “agent1.”
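In case the screenshot does not render, the listing below is a minimal sketch of what such a configuration file can look like. The component names (src1, ch1, snk1) and the /home/acadgild/... paths are assumptions for illustration only; the property keys follow the standard Flume conventions for a spooldir source, a memory channel, and a file_roll sink.

    # AcadgildLocal.conf – sketch of a spooldir-to-file_roll agent (agent1)

    # Name the components of the agent
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = snk1

    # Spooling-directory source: watches source_sink_dir for new files
    agent1.sources.src1.type     = spooldir
    agent1.sources.src1.spoolDir = /home/acadgild/source_sink_dir
    agent1.sources.src1.channels = ch1

    # Memory channel buffering events between source and sink
    agent1.channels.ch1.type                = memory
    agent1.channels.ch1.capacity            = 10000
    agent1.channels.ch1.transactionCapacity = 100

    # file_roll sink: rolls files into destination_sink_dir every 30 seconds
    agent1.sinks.snk1.type              = file_roll
    agent1.sinks.snk1.sink.directory    = /home/acadgild/destination_sink_dir
    agent1.sinks.snk1.sink.rollInterval = 30
    agent1.sinks.snk1.channel           = ch1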

Explanation for the configuration file

Property Name | Default | Description
channel | – | The channel the sink reads events from (a memory channel in this example).
type | – | The component type name; it needs to be file_roll.
sink.directory | – | The directory where the sink will store files.
spoolDir | – | The directory where files will be spooled from (set on the spooldir source).
Optional
rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file.
sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface.
batchSize | 100 | Number of events taken from the channel per batch.

*Note: Make sure Hadoop daemons are up and running.
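A quick way to check this (assuming the JDK’s jps tool is on your PATH) is to run jps, which lists the running Java processes:

    jps
    # Typical output includes entries such as NameNode, DataNode,
    # SecondaryNameNode, ResourceManager and NodeManager (PIDs will vary).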

  • Now start your Flume agent, passing the configuration file with its complete path.
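A typical launch command looks like the sketch below; the conf paths and the agent name agent1 are assumptions matching the sample configuration above.

    flume-ng agent \
      --conf conf \
      --conf-file conf/AcadgildLocal.conf \
      --name agent1 \
      -Dflume.root.logger=INFO,console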

If everything is lined up correctly, a message stating “spool started” will show up. (Please refer to the screenshot below.)

  • Once started, you can find a file inside destination_sink_dir. This indicates that the Flume event flow is running correctly and delivering to the right destination folder.

  • Next, we will roll a few test files.

  • Let us drag and drop the files into the spool directory, source_sink_dir. (Please refer to the screenshot below.)
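If you prefer the terminal over drag and drop, copying the files in works just the same; the file names below are only placeholders.

    cp test1.txt test2.txt ~/source_sink_dir/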

 

  • Once the files have been dropped into source_sink_dir, we find that they are renamed with a new extension, “COMPLETED.”

In destination_sink_dir, we can observe newly generated files. (Please refer to the following screenshot for a better understanding.)
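As a rough sketch of what to expect (exact file names will differ), listing the two directories shows the renamed inputs and the rolled outputs; the file_roll sink names its output files with a timestamp-and-counter pattern.

    ls ~/source_sink_dir/
    # test1.txt.COMPLETED  test2.txt.COMPLETED
    ls ~/destination_sink_dir/
    # e.g. 1496301234567-1  1496301234567-2  ...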

You will find several files in the destination directory, as the default roll interval is 30 seconds. This happens because data keeps flowing into the sink for as long as the Flume agent runs. (Refer to the screenshot below for the result of the rolled test files in destination_sink_dir.)

 

  • To stop the Flume agent, press Ctrl + C inside the terminal running the agent. Every time you need this agent to work, you need to start it manually using the configuration file.