Inside Kafka Topics and Partitions

Christian Hubinger


In our previous blog post, "Apache Kafka — what is it?", we looked at the bigger picture and the basics of Apache Kafka. Now I'd like to dig a little deeper into Kafka topics and partitions: what they really are and how you can work with them.

An ordered log

As discussed last time, we can see a topic as a queue, where you post messages on one side and pick them up on the other side. But now we're interested in how Kafka actually stores your data in a topic and passes it on to the consumers.

Apache Kafka stores the data inside a topic in one or more partitions. Those partitions contain the actual data. We'll discuss the reason for those partitions a bit later. To keep it simple, let's assume we use a single partition per topic.

When you add a message to a topic, Apache Kafka appends the data to the end of the partition.

Kafka Topics & Partitions

As the order of the messages stays unchanged, each message in Kafka can be uniquely identified by its topic, partition, and offset.
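To make the append-and-address mechanics concrete, here is a minimal sketch of an append-only partition log in Python. This is an illustration of the concept only, not Kafka's actual storage format: each appended message simply receives the next free offset, and any message remains addressable by that offset afterwards.

```python
# Toy append-only log, illustrating how a Kafka partition assigns offsets.
# Not Kafka's real storage engine - just the addressing idea.

class Partition:
    def __init__(self):
        self._log = []                # ordered, append-only message list

    def append(self, message) -> int:
        offset = len(self._log)       # the next free offset
        self._log.append(message)     # new data always goes to the end
        return offset

    def read(self, offset):
        return self._log[offset]      # any message is addressable by offset

p = Partition()
assert p.append("order-1") == 0       # first message lands at offset 0
assert p.append("order-2") == 1       # appends never reorder earlier data
assert p.read(0) == "order-1"
```

Because offsets are assigned strictly in append order, "topic + partition + offset" is a stable address for a message for as long as it is retained.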

This way of storing the data creates continuously growing files that can be read sequentially from disk, which is the most efficient access pattern for hard disks and one of the key design decisions behind Kafka's performance.

So now that we know what happens when we add a message, let's look at the other side.

A consumer that subscribes to a topic pulls messages from the partition in the same order in which they are stored. This means it only needs to keep the offset of the last processed message to track its progress.

Kafka: A topic's partition

Because Kafka, in contrast to most traditional queuing systems, does not automatically delete messages once they have been processed, a second consumer (or even the same consumer) can decide to process the same data again. It simply needs to reset its offset and restart to receive the data again.
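The offset-tracking and reset behavior can be sketched in a few lines. Again, this is a simplified illustration of the consumption model, not the real Kafka consumer API: the consumer's only state is an offset, and rewinding that offset makes already-processed messages visible again.

```python
# Toy consumer over a plain list standing in for a partition.
# Its only state is an offset; seek() rewinds it to replay data.

class Consumer:
    def __init__(self, partition):
        self._partition = partition
        self._offset = 0                      # next offset to read

    def poll(self):
        # deliver everything from the stored offset onward
        messages = self._partition[self._offset:]
        self._offset = len(self._partition)
        return messages

    def seek(self, offset: int):
        # resetting the offset re-exposes older messages
        self._offset = offset

log = ["a", "b", "c"]
c = Consumer(log)
assert c.poll() == ["a", "b", "c"]   # first poll reads everything
assert c.poll() == []                # nothing new to read
c.seek(0)                            # rewind to the beginning
assert c.poll() == ["a", "b", "c"]   # the same data is delivered again
```

In real Kafka, consumers commit their offsets back to the cluster (or store them elsewhere), but the principle is the same: the broker keeps the data, and the consumer's position is just a number.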

A reliable infrastructure component

When you build your application with Apache Kafka as a central component, reliability is definitely one of the most important aspects. In such an architecture, Kafka is responsible for all communication between the different applications, which makes it a single point of failure, just like the database in a traditional monolithic application.

To mitigate this risk, Apache Kafka runs as a cluster and allows you to replicate your data to several nodes, so that, with proper configuration, the infrastructure stays available as long as at least one replica of each partition is still online.

Of course, this replication multiplies the storage requirements. Therefore, Apache Kafka lets us configure replication individually for each topic in the cluster. So we can give our mission-critical topics a redundant setup with multiple replicas, and run other less essential, but perhaps high-volume, topics without any redundancy to save disk space.

Kafka cluster distributing data to its nodes

This image shows four distinct topics inside a Kafka cluster with three nodes. With replication enabled, one copy, called the leader, receives all new data, while the remaining copies act as followers.

In this example, Topic 1 and Topic 2 are configured with a replica count of three, Topic 3 with two replicas, and Topic 4 with a single one.

When the producer writes new data to the topics, it's written to the cluster node containing the topic's leader and then replicated to all followers.
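The write path described above can be sketched as a toy model: all writes land on the leader first and are then fanned out to the followers. This is an illustration of the flow, not Kafka's actual replication protocol (which involves in-sync replica sets and acknowledgement settings).

```python
# Toy leader/follower replication for a single partition.
# Illustrates the write flow only, not Kafka's real replication protocol.

class Replica:
    def __init__(self):
        self.log = []

class ReplicatedPartition:
    def __init__(self, replication_factor: int):
        self.leader = Replica()
        # replication factor counts the leader itself
        self.followers = [Replica() for _ in range(replication_factor - 1)]

    def write(self, message):
        self.leader.log.append(message)       # all writes go to the leader
        for follower in self.followers:       # then fan out to the followers
            follower.log.append(message)

p = ReplicatedPartition(replication_factor=3)
p.write("payment-received")
assert p.leader.log == ["payment-received"]
assert len(p.followers) == 2                              # two extra copies
assert all(f.log == ["payment-received"] for f in p.followers)
```

If the node hosting the leader fails, one of the followers can be promoted to leader, which is what keeps a properly replicated topic available through node failures.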

Be fast!

The second big question for such a sensitive, central piece of software is: can it handle our load? If you've ever built a larger system with ten or more distinct services, you surely know how fast data volume and throughput can grow. Especially for a message broker like Apache Kafka, performance is key.

This is where partitions come into play. As discussed above, a topic holds its data in one or more partitions. When you configure more than a single partition, Apache Kafka distributes the messages evenly across them. To visualize this, let's zoom into the image above and look at how the partitions of a single topic are distributed across the server nodes.
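How does a message end up in a particular partition? For keyed messages, Kafka's default partitioner hashes the key (using murmur2) and takes the result modulo the partition count; messages without a key are spread across partitions (round-robin or sticky, depending on the client version). The sketch below illustrates the idea with Python's `zlib.crc32` in place of murmur2, purely for simplicity.

```python
# Sketch of key-based partition assignment. Kafka's default partitioner
# uses murmur2; we use zlib.crc32 here purely for illustration.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    # the same key always maps to the same partition,
    # so per-key ordering is preserved within that partition
    return zlib.crc32(key) % NUM_PARTITIONS

p = partition_for(b"customer-42")
assert 0 <= p < NUM_PARTITIONS
assert partition_for(b"customer-42") == p   # deterministic for a given key
```

Note the trade-off this implies: Kafka guarantees ordering only within a single partition, so data that must stay ordered (for example, all events for one customer) should share a key.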

Kafka topic partition distribution

The graphic shows leaders in green and followers in blue. By placing the leaders on different physical cluster nodes, Apache Kafka allows you to scale read and write performance horizontally for each topic.

With a setup like this, maximum write performance is reached when the number of partitions matches the number of nodes in the cluster: each cluster node then manages one partition, and the underlying hardware is utilized as fully as possible.

Apache Kafka: What have we learned so far?

We now have an understanding of how Apache Kafka stores and manages the data it handles. With this clearer view of the architecture, we can discuss the most important consequences and how to configure the system for various use cases.

Get in touch with us

Did you find this post helpful, or maybe even have some constructive feedback? Get in touch and let's continue the discussion.

