Inside Kafka Topics and Partitions

Christian Hubinger

kafka

api

In our latest blog post — Apache Kafka — what is it? — we talked about the bigger picture and the basics of Apache Kafka. Now I'd like to dig a little deeper into Apache Kafka topics & partitions, what they really are and how you can go about implementing them.

An ordered log

As discussed last time, we can see a topic as a queue, where you post messages on one side and pick them up on the other side. But now we're interested in how Kafka actually stores your data in a topic and passes it on to the consumers.

Apache Kafka stores the data inside a topic in one or more partitions. Those partitions contain the actual data. We'll discuss the reason for those partitions a bit later. To keep it simple, let's assume we use a single partition per topic.

When you add a message to a topic, Apache Kafka appends the data to the end of the partition.

Kafka Topics & Partitions

As the order of the messages stays unchanged, each message in Kafka can be identified by the topic, the partition, and the offset index.
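To make the append-and-offset idea concrete, here is a minimal sketch of a partition as an append-only list. It is a toy in-memory model, not Kafka's API: the `Partition` class, its `append` and `read` methods, and the `orders` topic name are all made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy in-memory stand-in for one partition: an append-only list."""
    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        """Append a message to the end and return the offset it was stored at."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset: int) -> str:
        """Return the message stored at the given offset."""
        return self.messages[offset]


# A hypothetical topic "orders" with a single partition, number 0.
log = {("orders", 0): Partition()}

first = log[("orders", 0)].append("order created")
second = log[("orders", 0)].append("order paid")

# (topic, partition, offset) is enough to find one specific message again.
message = log[("orders", 0)].read(second)
```

Because nothing is ever inserted in the middle, an offset that was handed out once keeps pointing at the same message.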

This way of storing the data creates continuously growing files that can be read sequentially from the disk, which is the most efficient way to work with hard disks and is one of the key design decisions made to provide superior performance.

Now that we know what happens when we add a message, let's look at the other side.

A consumer that subscribes to a topic pulls messages from the partition in the same order as they are stored. This means it only needs to remember the offset index of the last processed message to track which messages have been processed.

Kafka: A topic's partition

In contrast to most traditional queuing systems, Kafka does not automatically delete messages once they have been processed, so a second consumer (or even the same consumer) can decide to process the same data again. It simply needs to reset its offset index and start receiving the data again.
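The offset tracking and replay described above can be sketched in a few lines. This is again a toy model under stated assumptions: the `Consumer` class and its `poll` and `seek` methods are written for this example and only imitate the idea, they are not a Kafka client.

```python
class Consumer:
    """Toy consumer that remembers only the next offset to read."""

    def __init__(self, partition: list):
        self.partition = partition  # shared, append-only list of messages
        self.offset = 0             # next offset to read

    def poll(self) -> list:
        """Return all messages not yet seen and advance the offset."""
        batch = self.partition[self.offset:]
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the read position to the given offset."""
        self.offset = offset


partition = ["a", "b", "c"]
consumer = Consumer(partition)

first_pass = consumer.poll()   # reads everything once
nothing_new = consumer.poll()  # empty, nothing was appended in between
consumer.seek(0)               # reset the offset index
replay = consumer.poll()       # the same data again, in the same order
```

Reading never modifies the partition, which is why resetting a single integer is all it takes to process the data a second time.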

A reliable infrastructure component

When you build your application with Apache Kafka as a central component, reliability is one of the most important aspects. Kafka is then responsible for all communication between the different applications, so it becomes a single point of failure, just like the database in a traditional monolithic application.

To mitigate this risk, Apache Kafka runs as a cluster and allows you to replicate your data to several nodes, so that, with proper configuration, the infrastructure stays available until the last server in the cluster has crashed.

Of course, this replication multiplies the storage requirements. Therefore, Apache Kafka lets us configure replication for each topic individually. We can give our mission-critical topics a doubly redundant setup and leave other, less essential but perhaps high-volume, topics without any redundancy to save disk space.

Kafka cluster distributing data to its nodes

This image shows four distinct topics inside a Kafka cluster with three nodes. With replication enabled, one copy of each topic is the active copy, which we call the "leader", and all new data is written to it; the other copies are called "followers".

In this example, Topic 1 and Topic 2 are configured with a replica count of three, Topic 3 with two replicas, and Topic 4 with a single one.
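To see how replication multiplies storage, here is a small back-of-the-envelope calculation using the replica counts from the example. The topic sizes (100 GB each) are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
# Replica counts from the example above.
replicas = {"Topic 1": 3, "Topic 2": 3, "Topic 3": 2, "Topic 4": 1}

# Assumed data volume per topic in GB (hypothetical values).
size_gb = {"Topic 1": 100, "Topic 2": 100, "Topic 3": 100, "Topic 4": 100}

# Disk space needed without any replication.
raw_total = sum(size_gb.values())

# Disk space needed across the cluster once every replica is stored.
replicated_total = sum(size_gb[topic] * replicas[topic] for topic in replicas)
```

With these assumed sizes the cluster stores 900 GB for 400 GB of actual data, which is why it can pay off to keep high-volume, non-critical topics at a single replica.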

When the producer writes new data to the topics, it's written to the cluster node containing the topic's leader and then replicated to all followers.
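The write path can be sketched as a toy model with one leader log and several follower logs. The `ReplicatedPartition` class and the node names are assumptions made for this illustration. In Kafka itself the followers fetch new data from the leader; the loop below only stands in for that step.

```python
class ReplicatedPartition:
    """Toy model: one leader log plus follower logs on other nodes."""

    def __init__(self, nodes: list):
        # The first node holds the leader; the rest hold followers.
        self.leader = nodes[0]
        self.logs = {node: [] for node in nodes}

    def write(self, message: str) -> int:
        """Append to the leader first, then copy to every follower."""
        leader_log = self.logs[self.leader]
        leader_log.append(message)
        for node, log in self.logs.items():
            if node != self.leader:
                log.append(message)
        return len(leader_log) - 1


# A topic with a replica count of three on a three-node cluster.
topic = ReplicatedPartition(["node-1", "node-2", "node-3"])
offset = topic.write("order created")

# Every node now holds the same message at the same offset.
copies = [log[offset] for log in topic.logs.values()]
```

If the node holding the leader fails, any follower already holds the same log, which is what allows another node to take over.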

Be fast!

The second big question for such a sensitive, central piece of software is: can it handle our load? If you've ever built a more extensive system with ten or more distinct services, you surely know how fast data volume and throughput can increase. Especially for a message broker like Apache Kafka, performance is key.

Here, partitions come into play. As mentioned earlier, a topic holds its data in one or more partitions. When a topic is configured with more than a single partition, its messages are distributed across those partitions. To visualize this, let's zoom into our image above and show how the partitions of a single topic are distributed across the server nodes.

Kafka topic partition distribution

In the graphic, leaders are shown in green and followers in blue. By placing the leaders on different physical cluster nodes, Apache Kafka allows you to scale read and write performance horizontally for each topic.

With a setup like this, maximum write performance is available when the topic has the same number of partitions as there are nodes in the cluster. Each cluster node then manages one partition, and the underlying hardware is utilized as fully as possible.
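The following sketch shows both halves of this setup: partition leaders placed on different nodes, and messages spread across the partitions. It is a simplified model under stated assumptions. The node names, the round-robin placement, and the `partition_for` helper are invented for the example, and CRC32 is only a stand-in for a stable hash, not the hash Kafka uses.

```python
import zlib

NODES = ["node-1", "node-2", "node-3"]
NUM_PARTITIONS = 3  # same number of partitions as nodes

# Place each partition's leader on a different node (round-robin).
leaders = {p: NODES[p % len(NODES)] for p in range(NUM_PARTITIONS)}


def partition_for(key: str) -> int:
    """Pick a partition from the message key with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def node_for(key: str) -> str:
    """Return the node whose leader partition receives this key's writes."""
    return leaders[partition_for(key)]


# The same key always lands on the same partition, and so on the same node.
target = node_for("customer-42")
```

With three partitions on three nodes, each node leads exactly one partition, so writes for different keys can be handled by different machines in parallel.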

Apache Kafka: What have we learned so far?

We now have an understanding of how Apache Kafka stores and manages the data it handles. With this clear view of the architecture, we'll discuss the most important consequences and how we configure the system for various use-cases.
