Kafka Partitioning…

Partitions are the key to Kafka's scalability. Developers can also implement a custom partitioning algorithm to override the default partition assignment behavior. This post will briefly cover

  • Partitions in general
  • Data distribution and default partitioning
  • An example of custom partitioning logic

Partitions in Kafka

In Kafka, partitions serve as another layer of abstraction – a Partition is the unit of parallelism and storage within a Topic.

Here is a quickie

  • A topic is divided into one (the default; can be increased) or more partitions
  • A partition is like a log
  • Producers append data to the end of the log, and each entry is identified by a unique number called the offset
  • Records in a partition are immutable and are stored for a configurable amount of time (after which they are removed from disk)
  • Consumers pull data and can read from any position (offset) in a partition, moving forward or backward (see the sketch after this list)
  • Each partition is replicated (as per the replication factor configuration), which means it has exactly one primary copy (on the leader node) and zero or more replicas (on follower nodes)
  • Kafka guarantees strict ordering within a partition, i.e. consumers receive records in the same order in which the producer published them
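
To make the "read from any offset" point concrete, here is a minimal sketch of a consumer that seeks to a specific offset before polling (the topic name my-topic, partition 0, offset 42 and the broker address are just assumptions for illustration):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekToOffsetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // manually assign partition 0 of the (hypothetical) topic 'my-topic'
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(partition));

            // jump to offset 42 - consumers are free to move forward or backward
            consumer.seek(partition, 42L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d, key=%s, value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}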

Distributing partitions across nodes

In Kafka, spreading/distributing the data over multiple machines happens at the level of partitions (not individual records). Consider the scenario depicted below: a 2-node cluster hosting a topic with 4 partitions and a replication factor of 2.

(Figure: Partition distribution over a Kafka cluster)

  • Primary partition placement: 2 partitions per node, e.g. N1 holds P1 and P3, while N2 holds P2 and P4
  • Replica placement: one replica for each partition, since the replication factor is 2, i.e. one replica in addition to the primary copy (2 copies in total) – see the topic-creation sketch after this list
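
This layout is a direct result of how the topic was created – 4 partitions with a replication factor of 2. As a minimal sketch (the topic name and broker address are assumptions), such a topic could be created programmatically with the AdminClient:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions, replication factor 2 - matches the scenario above
            NewTopic topic = new NewTopic("my-topic", 4, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}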

Data and Partitions: the relationship

When a producer sends data, it goes to a topic – but that’s a 50,000-foot view. You must understand that

  • data is actually a key-value pair
  • it is stored at the partition level

A key-value pair in a messaging system like Kafka might sound odd, but the key is used for intelligent and efficient data distribution within the cluster. Based on the key, Kafka sends the data to a specific partition and ensures that it is replicated as well (as per the configuration). Thus, each record ends up in exactly one partition of the topic (plus its replicas).
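
For illustration, here is a minimal producer sketch (the topic name, keys, values and broker address are made up) showing that what gets sent is really a key-value pair:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // every record is a key-value pair; the key drives partition selection
            producer.send(new ProducerRecord<>("my-topic", "user-42", "first event"));
            producer.send(new ProducerRecord<>("my-topic", "user-42", "second event")); // same key -> same partition
            producer.send(new ProducerRecord<>("my-topic", "user-7", "another event"));
        }
    }
}

With the default partitioner (covered next), both records keyed user-42 will end up in the same partition.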

Default behavior

Data with the same key goes to the same partition, since Kafka's default partitioner hashes the key and maps the hash onto one of the topic's partitions. In case of a null key (yes, that’s possible), the data is simply spread across the available partitions. And if you only have one partition: ALL data goes to that single partition.
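
As a rough illustration of that mapping (not Kafka's actual source; the real DefaultPartitioner uses a murmur2 hash of the serialized key bytes modulo the partition count), the logic looks something like this:

// Simplified sketch of key -> partition mapping (illustrative only)
public class PartitionMappingSketch {

    static int partitionFor(String key, int numPartitions) {
        // hash the key and map it onto the [0, numPartitions) range
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        for (String key : new String[]{"user-42", "user-42", "user-7"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
        // the same key always yields the same partition; with numPartitions = 1,
        // every key maps to partition 0
    }
}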

(Figure: Data in partitions)

Custom partitioning scheme

You can plug in a custom algorithm to partition the data by

  • implementing the Partitioner interface, and
  • configuring the Kafka producer to use it

Here is an example

Generate random partitions which fall within the valid partition range of the topic
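
A minimal sketch of such a partitioner is below (the class name RandomPartitioner is just a placeholder); it ignores the key and picks a random, but valid, partition for the topic:

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// A custom Partitioner that ignores the key and picks a random (but valid) partition
public class RandomPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // any value in [0, numPartitions) is a valid partition for this topic
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    @Override
    public void close() {
        // nothing to clean up
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no custom configuration needed
    }
}

The producer is then pointed at it via the partitioner.class property:

Properties props = new Properties();
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RandomPartitioner.class.getName());
// ... plus the usual bootstrap.servers and serializer settings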

Further reading

Cheers!


