Apache Kafka Connect Example

1. Introduction to Apache Kafka Connect

Apache Kafka, which is a kind of Publish/Subscribe Messaging system, gains a lot of attraction today. We can see many use cases where Apache Kafka stands with Apache Spark, Apache Storm in Big Data architecture which need real-time processing, analytic capabilities.
To integrate with other applications, systems, we need to write producers to feed  data into Kafka and write the consumer to consume the data.  However, Apache Kafka Connect which is  one of new features has been introduced in Apache Kafka 0.9, simplifies the integration between Apache Kafka and other systems. Apache Kafka Connect supports us to quickly define connectors that move large collections of data from other systems into Kafka and from Kafka to other systems. Let’s take a look at the overview of the Apache Kafka Connect:

Apache Kafka Connect - Overview

Apache Kafka Connect – Overview

The Sources in Kafka Connect are responsible for ingesting the data from other system into Kafka while the Sinks are responsible for writing the data to other systems. Note that another new feature has been also introduced in Apache Kafka 0.9 is Kafka Streams. It is a client library for processing and analyzing data stored in Kafka. We can filter, transform, aggregate, the data streams. By combining the Kafka Connect with Kafka Streams, we can build prefect data pipelines.

2. Some Apache Kafka Connectors

Currently, there are some opensource Sources and Sinks Connectors from community as below:

Connectors References
Apache Ignite Source, Sink
Elastic Search Sink1, Sink2, Sink3
Cassandra Source1, Source 2, Sink1, Sink2
MongoDB Source
HBase Sink
Syslog Source
MQTT (Source) Source
Twitter (Source) Source, Sink
S3 Sink1, Sink2

Or you can find some certified Connectors from Confluent.io via this link

3. Example

3. 1. File Connectors

I’d like to take an example from Apache Kafka 0.10.0.0 distribution and elaborate it. The example is used to demo how to use Kafka Connect to stream data from source which is  file test.txt to destination which is also a file, test.sink.txt. Note that the example will run on the standalone mode.

Apache Kafka Connect Example - FileStream

Apache Kafka Connect Example – FileStream

 

We will need to use 2 connectors:

  • FileStreamSource reads the data from the test.txt file and publish to Kafka topic: connect-test
  • FileStreamSink which will consume data from connect-test topic and write to the test.sink.txt file.

Let’s see configuration file for the Source at kafka_2.11-0.10.0.0\config\connect-file-source.properties

We need to define the connector.class, the maximum of tasks will we created, the file name that will be read by connector and the topic where data will be published.

Here is the configuration file for the Sink at kafka_2.11-0.10.0.0\config\connect-file-sink.properties

In similar to the Source, we need to define the connector.class, the number of tasks, the destination file where the data will be written and the topic which data will be consumed.

One important configuration file located at: kafka_2.11-0.10.0.0\config\connect-standalone.properties we need to define the address of the Kafka broker, the keys, values converters.

3. 2. Run the example

Make sure you have Apache Kafka 0.9.x or 0.10.x deployed and ready. Assume that we have an Kafka distribution at /opt/kafka_2.11-0.10.0.0. Note that we will use some basic Kafka command line below. If you’re not familiar with those, you can reference another post : Apache Kafka Command Line Interface

3.2.1. Start Kafka broker

We first should cd(change directory) to the Kafka distribution folder.

Start ZooKeeper

Start Kafka Server

3.2.1. Start the Source and Sink connectors

Note that after this command, the connector is ready for reading the content from test.txt file which should be located in the execution folder: /opt/kafka_2.11-0.10.0.0

3.2.2. Start the Source connector

Write some content to the test.txt file

3.2.3. Check whether the Source connector feed the test.txt content into the topic connect-test or not

The output on the console is:

3.2.4. Check whether the Sink Connector write content to the test.sink.txt or not

The output on my console:

4. Conclusions

We have seen the overview of Apache Kafka Connect and an simple example that using FileStream connectors. We can leverage Kafka Connectors to quickly ingest data from a lot of sources, do some processing and write to other destinations. Basically, everything can be done by Apache Kafka, we don’t need to use either other libraries, frameworks like Apache Flume or custom producers. In next posts, I will introduce more about using other types of Kafka Connectors like HDFS sink, JDBC sources, etc and how to implement a Kafka Connector.

Below are the articles related to Apache Kafka topic. If you’re interested in them, you can refer to the following links:

Apache Kafka Tutorial

Getting started with Apache Kafka 0.9

Using Apache Kafka Docker

Apache Kafka 0.9 Java Client API Example

Apache Kafka Command Line Interface

How To Write A Custom Serializer in Apache Kafka

Write An Apache Kafka Custom Partitioner

Apache Kafka Command Line Interface

Apache Flume Kafka Source And HDFS Sink Tutorial

Apache Kafka Connect MQTT Source Tutorial

Spring Kafka Tutorial