Kafka Partitioning…

Partitions are the key to scalability attributes of Kafka. Developers can also implement custom partitioning algorithm to override the default partition assignment behavior. This post will briefly cover

  • Partitions in general
  • Data distribution, default partitioning, and
  • Example of custom partitioning logic

Partitions in Kafka

In Kafka, partitions serve as another layer of abstraction – a Partition

Here is a quickie

  • Topic is divided into one (default, can be increased) or more partitions
  • A partition is like a log
  • Publishers append data (end of log) and each entry is identified by a unique number called the offset. The data in the partition is immutable
  • Records in the partition are immutable and stored for a configurable amount of time (after which they are removed from disk)
  • Consumers can read (they pull data) from any position (offset) in a partition, move forward or back
  • Each partition is replicated (as per replication factor configuration) which means that it can have (at most) one primary copy (on the leader node) and and 0 or more copies(follower nodes)
  • Kafka ensures strict ordering within a partition i.e. consumers will receive it in the order which a producer published the data to begin with

Distributing partitions across nodes

In Kafka, spreading/distributing the data over multiple machines deals with partitions (not individual records). Scenario depicted below

kafka-partitions-1

Partition distribution over a Kafka cluster

  • Primary partition placement: 2 partitions per node e.g. N1 will have P1,P3 & N2 will have P2, P4
  • Replica placement: one replica for each partition since the factor is 2 i.e. one replica in addition to a primary copy (total 2)

Data and Partitions: the relationship

When a producer sends data, it goes to a topic – but that’s 50,000 foot view. You must understand that

  • data is actually a key-value pair
  • its storage happens at a partition level

A key-value pair in a messaging system like Kafka might sound odd, but the key is used for intelligent and efficient data distribution within a cluster. Depending on the key, Kafka sends the data to a specific partition and ensures that its replicated as well (as per config). Thus, each record

Default behavior

The data for same key goes to same partition since Kafka uses a consistent hashing algorithm to map key to partitions. In case of a null key (yes, that’s possible), the data is randomly placed on any of the partition. If you only have one partition: ALL data goes to that single partition

kafka-partitions-2

Data in partitions

Custom partitioning scheme

You can plugin a custom algorithm to partition the data

  • by implementing the Partitioner interface
  • configure Kafka producer to use it

Here is an example

Generate random partitions which are within the valid partition range

Further reading

Cheers!

Advertisements

About Abhishek

Loves Java EE, distributed KV stores and messaging systems. Frequently blogs at abhirockzz.wordpress.com as well as simplydistributed.wordpress.com. Oh, I have also authored a few (mini) books, articles, Refcards etc. :-)
This entry was posted in Distributed systems, Kafka and tagged , , , , . Bookmark the permalink.

One Response to Kafka Partitioning…

  1. Pingback: Kafka producer and partitions | Simply Distributed

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s