Before we proceed, let us understand the architecture of Flume.
- Event – It is a singular unit of data that is transported by Flume (usually a single log entry).
- Source – It is the entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them, and then place it on a channel.
- Sink – It is the unit that delivers data to its destination. Flume provides sinks for a range of destinations. Example: the HDFS sink writes events to HDFS.
- Channel – It is the connecting medium between the Source and the Sink. An Event is ingested into the Channel from the Source and is then drained from the Channel into the Sink. (The location of the sink is specified in the Flume configuration.)
- Agent – It is a Java virtual machine (JVM) process running Flume. It hosts a collection of Sources, Sinks, and Channels.
- Client – It produces the Event and transmits it to the Source operating within the Agent.
In this post, we create a spool directory to transfer files locally. This file transfer is also called “rolling of files.”
- First, create a configuration file inside the conf directory of the Twitter folder.
In this case, we have named the configuration file as “AcadgildLocal.conf.” Continue reading for more details about the configuration file.
*Note: Create two directories named source_sink_dir and destination_sink_dir, and set their paths in the configuration file.
You can also download the configuration file HERE.
Refer to the screenshot below, which displays the configuration file for an agent named “agent1” that uses a memory channel.
Explanation for the configuration file
| Property | Default | Description |
|---|---|---|
| type | – | The component type name; it needs to be file_roll. |
| sink.directory | – | The directory where the sink stores files. |
| spoolDir | – | The directory from which files are spooled. |
| sink.rollInterval | 30 | Roll the file every 30 seconds. Specifying 0 disables rolling and causes all events to be written to a single file. |
| sink.serializer | TEXT | Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface. |
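Since the screenshot may not be visible, here is a rough sketch of what such a configuration could look like. It creates the two directories and writes a spooling-directory-to-file_roll configuration. The base path /tmp/flume_demo, the agent name agent1, and the component names source1, channel1, and sink1 are assumptions made for illustration; the configuration file you downloaded may use different names.

```shell
#!/bin/sh
# Hypothetical base directory; adjust to your own setup.
BASE_DIR="${BASE_DIR:-/tmp/flume_demo}"

# Create the spool (source) and destination directories.
mkdir -p "$BASE_DIR/source_sink_dir" "$BASE_DIR/destination_sink_dir"

# Write the agent configuration. Agent and component names are assumptions.
cat > "$BASE_DIR/AcadgildLocal.conf" <<EOF
# Name the components of the agent
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Spooling directory source: picks up files placed in spoolDir
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = $BASE_DIR/source_sink_dir
agent1.sources.source1.channels = channel1

# Memory channel: buffers events in memory between source and sink
agent1.channels.channel1.type = memory

# File roll sink: writes events to local files, rolling every 30 seconds
agent1.sinks.sink1.type = file_roll
agent1.sinks.sink1.sink.directory = $BASE_DIR/destination_sink_dir
agent1.sinks.sink1.sink.rollInterval = 30
agent1.sinks.sink1.sink.serializer = TEXT
agent1.sinks.sink1.channel = channel1
EOF

echo "Wrote $BASE_DIR/AcadgildLocal.conf"
```

Note that a source lists its channels in the plural (channels), while a sink is bound to exactly one channel (channel).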
*Note: Make sure Hadoop daemons are up and running.
- Now start the Flume agent, passing it the complete path of the configuration file.
If everything is lined up correctly, a message stating “spool started” will show up. (Please refer to the screenshot below.)
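For reference, a start command along these lines is typical. This is a sketch only: the Flume conf directory, the configuration file path, and the agent name agent1 are assumptions, and the value passed to --name must match the prefix used inside your configuration file.

```shell
#!/bin/sh
# Hypothetical locations; point these at your own Flume install and config file.
FLUME_CONF_DIR="${FLUME_CONF_DIR:-$HOME/flume/conf}"
CONF_FILE="${CONF_FILE:-$FLUME_CONF_DIR/AcadgildLocal.conf}"
AGENT_NAME="${AGENT_NAME:-agent1}"   # assumed agent name; must match the config prefix

# Start the agent only if flume-ng is installed and the config file exists.
if command -v flume-ng >/dev/null 2>&1 && [ -f "$CONF_FILE" ]; then
    # -Dflume.root.logger=INFO,console prints the agent log to the terminal
    flume-ng agent \
        --conf "$FLUME_CONF_DIR" \
        --conf-file "$CONF_FILE" \
        --name "$AGENT_NAME" \
        -Dflume.root.logger=INFO,console
else
    echo "flume-ng not on PATH or $CONF_FILE missing; agent not started" >&2
fi
```

Logging to the console makes it easier to spot the startup message and any configuration errors while testing.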
- Once started, you can find a file inside destination_sink_dir. This indicates that the Flume agent is running and writing events to the correct destination folder.
- Next, we will apply file rolling to a few test files.
- Let us drag and drop the files into source_sink_dir. (Please refer to the screenshot below.)
- Once the files have been dropped into source_sink_dir, their names change: each file now carries a new extension, “.COMPLETED”.
In destination_sink_dir, we find newly generated files. (Please refer to the following screenshot for better understanding.)
You will find several files in the destination directory because the default roll interval is 30 seconds. The sink keeps rolling over to a new file for as long as the Flume agent runs. (Refer to the screenshot below for the rolled test file in the destination_sink_dir directory.)
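The same test can be done from the command line instead of by drag and drop. The sketch below assumes the directories live under /tmp/flume_demo and uses a hypothetical file name, test1.log. The file is written elsewhere first and then moved into the spool directory, because the spooling directory source expects files to stay unchanged once they are placed there.

```shell
#!/bin/sh
# Hypothetical base directory; use the paths from your own configuration.
BASE_DIR="${BASE_DIR:-/tmp/flume_demo}"
SRC_DIR="$BASE_DIR/source_sink_dir"
DEST_DIR="$BASE_DIR/destination_sink_dir"
mkdir -p "$SRC_DIR" "$DEST_DIR"

# Write the test file outside the spool directory, then move it in,
# so the source never picks up a half-written file.
TMP_FILE="$(mktemp)"
printf 'first test line\nsecond test line\n' > "$TMP_FILE"
mv "$TMP_FILE" "$SRC_DIR/test1.log"

# With an agent running, test1.log should shortly be renamed with the
# COMPLETED suffix once it has been read.
ls -l "$SRC_DIR"

# Rolled output files written by the sink appear here.
ls -l "$DEST_DIR"
```

Run the two ls commands again after a minute or so to see the renamed source file and the newly rolled files.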
- To stop the Flume agent, press Ctrl+C in the terminal running the agent. (Ctrl+Z only suspends the process; it does not terminate it.) The agent does not restart on its own: every time you need it, you must start it manually using the configuration file.