Apache Flume is a distributed tool for collecting and moving large amounts of data from different sources to a centralized data store. It is built around two basic concepts I’d like to introduce in this tutorial. The first is the Flume Source, which consumes data from external data sources. Apache Flume currently supports many types of sources, such as the JMS source, Twitter source, Syslog source, Avro source, Thrift source, etc. The second is the Flume Sink, which writes the data to its destination. There are currently many Flume sinks, such as the HDFS sink, Hive sink, Kafka sink, Avro sink, HBase sink, etc. In this post, I’d like to show an example of the Apache Flume HDFS Sink, which moves data from a log file to HDFS by using a tail-based Exec Source and the HDFS Sink.

1. Preparation.

1.1. About the Example

(Figure: Apache Flume HDFS Sink)

Assume that we have a log file, /var/log/messages, on a web server, and we want to use Apache Flume to monitor it and move the log content into our big data system. In this case, we will set up a Flume agent that uses an Exec Source to wrap a tail command on the file /var/log/messages, writes the data temporarily to a Memory Channel, and then uses the HDFS Sink to write the data to HDFS.

1.2. Environment

  • Apache Flume 1.6.0 (installation on Linux CentOS is preferred)
  • Hadoop 2.6.0

You can use the Cloudera QuickStart VM. The current version is CDH 5.7, which already has the Apache Hadoop ecosystem, including Apache Flume, installed.

1.3. Create a Configuration for the Flume Agent

Create a file named flume-hdfs-sink.conf in the folder /usr/lib/flume-ng/conf with the following content:
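Below is a minimal sketch of such a configuration, matching the description that follows. The component names tail-source, memory-channel, and hdfs-sink are illustrative choices, not names required by Flume:

    # Name the source, channel, and sink of agent1
    agent1.sources = tail-source
    agent1.channels = memory-channel
    agent1.sinks = hdfs-sink

    # Exec source: keep tailing /var/log/messages (-F follows across log rotation)
    agent1.sources.tail-source.type = exec
    agent1.sources.tail-source.command = tail -F /var/log/messages
    agent1.sources.tail-source.channels = memory-channel

    # Memory channel: buffer events in memory between source and sink
    agent1.channels.memory-channel.type = memory
    agent1.channels.memory-channel.capacity = 1000

    # HDFS sink: write events as plain text under the given HDFS path
    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.hdfs.path = hdfs://quickstart.cloudera:8020/tmp/system.log
    agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent1.sinks.hdfs-sink.channel = memory-channel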

We name the agent agent1.

We use the exec source type, which executes a tail command on the file /var/log/messages.

We use the hdfs sink and specify the HDFS path to which the log data will be written: hdfs://quickstart.cloudera:8020/tmp/system.log. Note that the HDFS sink treats this path as a directory and writes rolled data files inside it.

And finally, we connect the tail source and the hdfs sink through the memory channel.

2. Run the Apache Flume HDFS Sink example

2.1. Start the Agent

Go to the folder where Apache Flume is installed. In my Cloudera QuickStart VM, it is /usr/lib/flume-ng.

Start the agent by issuing the following command:
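A command along these lines should work, using the standard flume-ng launcher flags with the configuration file and agent name from above:

    cd /usr/lib/flume-ng
    bin/flume-ng agent --conf conf --conf-file conf/flume-hdfs-sink.conf \
        --name agent1 -Dflume.root.logger=INFO,console

The -Dflume.root.logger=INFO,console option is optional; it prints the agent’s log to the console so you can watch events being delivered.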

We have specified the configuration file we just created above and agent1 as the agent name.

2.2. Verify the Result

We can verify the result by checking the directory hdfs://quickstart.cloudera:8020/tmp/system.log on HDFS. You can do that by issuing the command below:
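For example, with the hdfs command-line client (since the sink treats /tmp/system.log as a directory, this lists the data files written inside it):

    hdfs dfs -ls /tmp/system.log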

The output should list one or more data files under that directory, named with the sink’s default FlumeData prefix.

Alternatively, open Hue or any other tool that can browse HDFS content.

3. Conclusion.

We have seen a basic example of using the Apache Flume HDFS Sink to move log data into HDFS. This use case is quite common today, where we need to collect logs from different data sources and centralize them in a single storage location, such as Hadoop or a NoSQL store, for further analysis and processing. In the next post, I’d like to show more examples using different sinks and sources, such as the Kafka sink, Hive sink, JDBC source, etc.
