December 4, 2014

Day 4 - Introduction to Kafka

Written by: Brandon Burton (@solarce)
Edited by: Scott Murphy (@ovsage)

There are many new distributed systems being released these days, and it can be hard to keep up. Apache Kafka is one that has recently seen a huge uptick in adoption as a message/event bus in many organizations, including my own workplace. Since we began to use it, I’ve been learning how to run it in production over the last few months, so I wanted to provide an introduction to the operational side of Kafka. I won’t really get into using Kafka as part of your application, but I will provide some jumping-off points near the end of this post for learning more about that.

Kafka Mission Statement

So what is Apache Kafka?

According to the project site, Apache Kafka is “publish-subscribe messaging rethought as a distributed commit log.” It is written in Scala and runs on the JVM.


Fast

A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.


Scalable

Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.


Durable

Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

Distributed by Design

Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

Breaking it down

That all sounds pretty good, right? But what does it all mean?

First let’s review some basic messaging terminology:

  • Kafka maintains feeds of messages in categories called topics.
  • Topics are divided into partitions for scaling and redundancy.
  • We’ll call processes that publish messages to a Kafka topic producers.
  • We’ll call processes that subscribe to topics and process the feed of published messages consumers.
  • Kafka is run as a cluster composed of one or more servers, each of which is called a broker.

So, at a high level, producers send messages over the network to the Kafka cluster, the messages are stored in a topic, and the cluster serves them up to consumers.

Let’s dive into a little more detail on each of these facets of Kafka.


Brokers

A broker is a node in a Kafka cluster. Each broker acts as the leader for one or more topic partitions, depending on the topic partition count settings. Each broker will also act as an in-sync replica (ISR) for additional topic partitions, depending on the replication factor of each topic. A broker receives messages from one or more producers, on a per-topic basis, and delivers messages to consumers of a topic’s partitions.

[Image: Kafka cluster example]


Topics

A topic is the abstraction that acts as a bucket for messages published to a Kafka cluster. A topic is divided into one or more partitions, which are ordered log files on disk.


Partitions

Each partition is an ordered, immutable sequence of messages that is continually appended to: a commit log. The messages in a partition are each assigned a sequential id number called the offset, which uniquely identifies each message within the partition. Each partition within a topic can have one or more in-sync replicas, to provide failover, based on a per-topic replication factor setting. All writes and reads go through the elected leader of a partition, and a message is considered “committed” when all in-sync replicas for that partition have applied it to their log. Only committed messages (those applied by all ISRs) are ever given out to consumers.
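The offset mechanics above can be sketched with a tiny in-memory log. This is a conceptual model only, not Kafka’s actual implementation (which is a set of segment files on disk):

```python
class Partition:
    """A sketch of a partition: an ordered, append-only sequence of messages."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Append a message and return its offset, i.e. its position in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset):
        """Read the message stored at a given offset."""
        return self._log[offset]


part = Partition()
assert part.append(b"first") == 0   # offsets are assigned sequentially
assert part.append(b"second") == 1
assert part.read(0) == b"first"     # earlier messages stay readable by offset
```

Because messages are only ever appended and addressed by offset, consumers can re-read from any point in the log, which is what makes the commit-log model so flexible.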

[Image: partition example]


Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which message to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load, or it can be done according to some semantic partition function (say, based on some key in the message).
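The two assignment strategies can be sketched as follows. The function names and the CRC32-based hash here are illustrative assumptions, not Kafka’s producer API:

```python
import itertools
import zlib

NUM_PARTITIONS = 3

# Round-robin: cycle through partitions regardless of message content.
_rr = itertools.cycle(range(NUM_PARTITIONS))

def round_robin_partition(message):
    return next(_rr)

# Semantic: hash a key from the message, so messages sharing a key
# always land in the same partition (and so stay ordered relative
# to each other).
def keyed_partition(key):
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


assert [round_robin_partition(m) for m in "abcd"] == [0, 1, 2, 0]
assert keyed_partition("user-42") == keyed_partition("user-42")
```

Round-robin spreads load evenly; keyed partitioning trades perfectly even balance for per-key ordering guarantees.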


Consumers

Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
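These two delivery modes can be simulated in a few lines; the `deliver` function here is purely illustrative and has nothing to do with Kafka’s consumer API:

```python
from collections import defaultdict
import itertools

def deliver(messages, consumers):
    """Simulate group delivery. consumers is a list of (group, name) pairs;
    returns a dict mapping consumer name -> list of messages received."""
    groups = defaultdict(list)
    for group, name in consumers:
        groups[group].append(name)
    # Round-robin within each group, balancing load across its members.
    cursors = {g: itertools.cycle(members) for g, members in groups.items()}
    received = defaultdict(list)
    for msg in messages:
        for g in groups:                            # every group sees every message...
            received[next(cursors[g])].append(msg)  # ...but only one member gets it
    return dict(received)


msgs = ["m1", "m2", "m3", "m4"]
# Same group: traditional queue semantics, load balanced across consumers.
assert deliver(msgs, [("g1", "c1"), ("g1", "c2")]) == \
    {"c1": ["m1", "m3"], "c2": ["m2", "m4"]}
# Different groups: publish-subscribe, every consumer sees every message.
assert deliver(msgs, [("g1", "c1"), ("g2", "c2")]) == \
    {"c1": msgs, "c2": msgs}
```

In practice, anything in between also works: several groups, each with several instances, gives you broadcast across groups and load balancing within them.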


Zookeeper

One thing that catches many people by surprise, especially when coming from another system like RabbitMQ or Redis as your message/event bus, is that Kafka has a hard dependency on the Apache Zookeeper project.

What is it?

ZooKeeper is a distributed system that provides a strongly consistent interface to a hierarchical, file-system-like structure. It provides an event-driven model in which clients can watch for changes to specific nodes, called znodes; an example of such a change is a new child being added to an existing znode. ZooKeeper achieves high availability by running multiple ZooKeeper servers, called an ensemble, with each server holding an in-memory copy of the distributed file system to service client read requests. Each server also holds a persistent copy on disk. A ZooKeeper ensemble follows strict quorum rules in order to provide its strong consistency guarantees.
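For reference, the configuration for a small three-server ensemble looks roughly like the following (hostnames and paths are placeholders):

```
# zoo.cfg - identical on every server in the ensemble
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.N=host:peerPort:leaderElectionPort
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each server additionally needs a `myid` file in its `dataDir` containing just its own server number, so it knows which `server.N` entry it is.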

What does Kafka use it for?

Zookeeper is used in a Kafka cluster to coordinate which brokers are members of the cluster and to coordinate leader election for in-sync replication of topic partitions. Since Zookeeper was already built as a strongly consistent distributed system for storing this kind of state, it made sense for the Kafka project to use Zookeeper rather than rebuild these features inside the Kafka codebase. Zookeeper’s design, however, is not meant for write-heavy workloads. This means that while Kafka relies on Zookeeper for broker discovery and ISR leader election, it handles some things entirely within the Kafka cluster, such as ensuring that a write is fully committed to all ISRs before acknowledging it to a producer.

Deploying Kafka and Zookeeper

So now that we’ve learned a little about Kafka and Zookeeper, let’s review some key points on deploying them.

  • Zookeeper ensembles should be deployed in odd-numbered sizes so that a quorum can be maintained. The minimum recommended Zookeeper ensemble size is 3 nodes for staging environments and 5 nodes for production. This provides fault tolerance and the ability to perform maintenance while still maintaining N+1. Due to the nature of Zookeeper’s design, going beyond 5 nodes can start to increase latency.
  • The Kafka cluster size will vary depending on your workload, your desired partition count and replication factor settings. We’re using a default replication factor of 3, which means we use 3 node clusters for staging and 5 node clusters for production.
  • The recommended approach is to deploy a local Kafka cluster in each datacenter and have machines in each location interact only with their local cluster. For applications that need a global view of all data, you should use the MirrorMaker tool to provide clusters that have aggregate data mirrored from all datacenters. These aggregator clusters are used for reads by applications that require the global view.
  • Monitoring of Kafka is done primarily through JMX, and the Kafka documentation has a great overview of the key MBeans to monitor and what they mean.
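Tying the deployment points above together, the broker-side settings involved might look like the following `server.properties` sketch; the values, paths, and hostnames are illustrative, not recommendations:

```
# server.properties - per broker; broker.id must be unique in the cluster
broker.id=1
port=9092
log.dirs=/var/lib/kafka
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
# Topics created without an explicit replication factor get this default
default.replication.factor=3
num.partitions=8
```

The `zookeeper.connect` list points each broker at the ensemble, and `default.replication.factor=3` matches the replication factor discussed above.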

Beyond that, you should go straight to the Kafka Operations documentation as it’s very well written and you’ll want to read it all in-depth before tackling a Kafka deployment.

Further Reading

So where to go from here?

To learn more about Kafka in general, a few things I’d recommend, in no specific order, are:

Operating Kafka and Zookeeper

Using Kafka in Your Application
