Scaling out with Kafka Consumer Groups

This blog post will cover Apache Kafka consumers and demonstrate how they are designed for distributed, scale-out architectures. It assumes you have some idea about Kafka in general. We will mainly look at

  • Kafka Consumers groups, and
  • dive into the scale out section
  • a sample application

Source code available on Github

Let’s start with a quick overview of …

Kafka Consumer groups

They ensure two important things

  • Consumer parallelism, and
  • Load balancing
consumer-groups

(from the Kafka documentation) Co-relation b/w Consumer Groups and Topic partitions in Kafka

  • Each Kafka consumer belongs to a consumer group i.e. it can be thought of as a logical container/namespace for a bunch of consumers
  • A consumer group can choose to receive messages from one or more topics
  • Instances in a consumer group can receive messages from zero, one or more partitions within each topic (depending on the number of partitions and consumer instances)
  • Kafka makes sure that there is no overlap as far as message consumption is concerned i.e. a consumer (in a group) receives messages from exactly one partition of a specific topic
  • The partition to consumer instance assignment is internally taken care of by Kafka and this process is dynamic in nature i.e. the work is re-distributed as and when
    • new instances are added to an existing consumer group, or
    • instances are removed from the consumer group

Scaling out

Problem

From the point of view of a Kafka consumer, scalability is the ability to consume messages which are both high volume and velocity. In such a scenario, it makes sense to scale out the kafka broker layer by adding more instances. This will

  • ensure that topic partitions storage are offloaded to the new brokers – maintain system throughput
  • i.e. the new broker(s) instance(s) act as the leader for some partitions and follower (store the replica for) other partitions – maintain high availability

We also need to take care of the consumer layer – the applications should be able to keep up with the rate of production….

Possible solution

Consumer Groups: Kafka transparently load balances traffic from all partitions amongst a bunch of consumers in a group which means that a consuming application can respond to higher performance and throughput requirements by

  • simply spawning additional consumer instances within the same group, and
  • expect the load to be divided amongst them

Things to note

  • Consumer group membership  – assign groupid (in consumer logic)
    • kafkaProps.put(ConsumerConfig.GROUP_ID_CONFIG, CONSUMER_GROUP);
  • Consumer to Partition ratio
    • equal to 1: each consumer will receive messages from exactly one partition i.e. one-to-one co-relation
    • less than 1: some consumers might receive from more than 1 partition
    • more than 1: some consumers will remain idle
  • A Kafka Consumer is not meant for concurrent access by multiple threads

Sample

Scenario

  • single node cluster (keeping things simple)
  • 4 partitions
  • start with 1 consumer and bump up to 4 consumers (increment by 1)

First and foremost

  • download Kafka and unzip it
  • configure partitions
    • edit /config/server.properties
    • find num.partitions and change it to 4
  • start things up
    • cd /bin
    • zookeeper-server-start.sh ../config/zookeeper.properties (zookeeper first)
    • kafka-server-start.sh ../config/server.properties (kafka broker node)
  • setup the Github project
    • fork the code
    • cd
    • mvn clean install

Actions & observations

  • start the producer (we’ll keep one producer for simplicity) –/target/java -jar kafka-scale.jar producer
    • the partition ID for each message is logged
  • start consumer instance #1 /target/java -jar kafka-scale.jar consumer
    • track the logs to confirm that contents of ALL of the partitions (4 in this example) of the topic are consumed by the one and only consumer instance in the group
  • start 2nd consumer instance (same command as above)- the partition load will now get (randomly) distributed across two consumer instances e.g. in the logs, you will notice that consumer 1 is receiving messages from partition 0,1 and consumer 2 will act as the sink for partition 2,3
  • start consumer #3 – Again, the load distribution will be at random, but we can be sure of is that
    • one of the consumers will receive data from any two partitions
    • the remaining (two) consumers will receive data from one partition each
    • e.g. consumer 1 will take partition 0,1 and consumers 2,3 get assigned to partitions 2,3 respectively
  • start consumer #4 – this one is easy. The logs will confirm that the load will get evenly distributed among ALL (the 4) consumers i.e. one consumer per partition

Further reading

Cheers!

Advertisements

About Abhishek

Java EE & distributed systems junkie who frequently blogs at abhirockzz.wordpress.com as well as simplydistributed.wordpress.com. Oh, I have also authored a few (mini) books, articles, Refcards etc. :-)
This entry was posted in Distributed systems, Kafka and tagged , , , , , , . Bookmark the permalink.

One Response to Scaling out with Kafka Consumer Groups

  1. Pingback: Started a new blog: Simply Distributed | Thinking in Java EE (at least trying to!)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s