In our latest blog post — Apache Kafka — what is it? — we talked about the bigger picture and the basics of Apache Kafka. Now I'd like to dig a little deeper into Apache Kafka topics & partitions, what they really are and how you can go about implementing them.
As discussed last time, we can see a topic as a queue, where you post messages on one side and pick them up on the other side. But now we're interested in how Kafka actually stores your data in a topic and passes it on to the consumers.
Apache Kafka stores the data inside a topic in one or more partitions. Those partitions contain the actual data. We'll discuss the reason for those partitions a bit later. To keep it simple, let's assume we use a single partition per topic.
When you add a message to a topic, Apache Kafka appends the data to the end of the partition.
As the order of the messages stays unchanged, each message in Kafka can be identified by the topic, the partition, and the offset index.
This way of storing the data creates continuously growing files that can be read sequentially from the disk, which is the most efficient way to work with hard disks and is one of the key design decisions made to provide superior performance.
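To make the append-only idea concrete, here is a minimal in-memory sketch. It is a toy model rather than Kafka's actual storage code or client API; the `Partition` class and the topic name `orders` are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one append-only partition (hypothetical, not Kafka's API)."""
    topic: str
    index: int
    records: list = field(default_factory=list)

    def append(self, value):
        # New data always goes to the end of the log.
        self.records.append(value)
        # The offset is simply the position of the record in the log.
        offset = len(self.records) - 1
        return (self.topic, self.index, offset)

    def read(self, offset):
        # Existing records never move, so an offset always points at the same record.
        return self.records[offset]


partition = Partition(topic="orders", index=0)
first = partition.append("order created")
second = partition.append("order paid")
```

Each call to `append` returns the triple of topic, partition, and offset that identifies the message, and reading by offset never requires reshuffling earlier records.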
So now that we know what's happening when we add a message, let's look at the other side.
A consumer, who subscribes to a topic, pulls messages from the partition in the same order as they are stored. This means it simply needs to keep the last processed offset index to track which messages have been processed.
In contrast to most traditional queuing systems, Kafka does not automatically delete messages once they have been processed, so a second consumer (or even the same consumer) can decide to process the same data again. It simply needs to reset its offset index and start receiving the data again.
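The consumer side can be sketched in the same spirit. This is a simplified model, assuming the partition is a plain Python list; the `Consumer` class and its `poll` and `seek` methods are illustrative names for this sketch, not a real client library.

```python
class Consumer:
    """Toy consumer that tracks its own position in a partition (hypothetical model)."""

    def __init__(self, log):
        self.log = log    # shared, append-only list of messages
        self.offset = 0   # offset of the next message to read

    def poll(self):
        # Return everything that has not been processed yet, in stored order,
        # then advance the offset past it.
        batch = self.log[self.offset:]
        self.offset = len(self.log)
        return batch

    def seek(self, offset):
        # Reading does not remove anything, so rewinding is enough to replay.
        self.offset = offset


log = ["a", "b", "c"]
consumer = Consumer(log)
first_pass = consumer.poll()
consumer.seek(0)
replay = consumer.poll()
```

After the first `poll` the log is unchanged, which is why resetting the offset to 0 delivers the same three messages a second time.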
When you build your application with Apache Kafka as a central component, reliability is one of the most important aspects. In such an architecture, Kafka is responsible for all communication between the different applications, so it becomes a single point of failure, just like the database in a traditional monolithic application.
To mitigate this risk, Apache Kafka runs as a cluster and allows you to replicate your data to several nodes, so that, with proper configuration, the infrastructure stays available until the last server in the cluster has crashed.
Of course, this replication multiplies the storage requirements. Therefore, Apache Kafka lets us configure replication for each topic individually. We can set up our mission-critical topics with double redundancy, and other, less essential (but maybe high-volume) topics without any redundancy to save disk space.
This image shows four distinct topics inside a Kafka cluster with three nodes. With replication enabled, one copy of each topic, the active copy that all new data is written to, is called the "leader", and the others are called "followers".
In this example, Topic 1 and Topic 2 are configured with a replica count of three, Topic 3 with two replicas, and Topic 4 with a single one.
When the producer writes new data to the topics, it's written to the cluster node containing the topic's leader and then replicated to all followers.
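The write path described above can be modeled roughly as follows. This is a deliberately simplified sketch: in real Kafka the followers fetch new data from the leader themselves, whereas here that step is collapsed into a direct copy. The class name, the node names, and the rule that the first node in the list is the leader are all assumptions made for this example.

```python
class ReplicatedPartition:
    """Toy model of leader/follower replication (hypothetical, heavily simplified)."""

    def __init__(self, nodes):
        # Assumption for this sketch: the first node holds the leader copy.
        self.leader = nodes[0]
        self.followers = nodes[1:]
        self.logs = {node: [] for node in nodes}

    def write(self, value):
        # The producer's data lands on the leader first...
        self.logs[self.leader].append(value)
        # ...and is then copied to every follower.
        for follower in self.followers:
            self.logs[follower].append(value)


# Replica count of three, like Topic 1 in the example above.
topic_1 = ReplicatedPartition(["node-1", "node-2", "node-3"])
# A single copy and no followers, like Topic 4.
topic_4 = ReplicatedPartition(["node-3"])

topic_1.write("event")
topic_4.write("event")
```

With three replicas the same record ends up stored three times, which is exactly the storage cost that makes it worth choosing the replica count per topic.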
The second big question for such a sensitive central piece of software is: can it handle our load? If you've ever built a more extensive system with ten or more distinct services, you surely know how fast data volume and throughput can increase. Especially for a message broker like Apache Kafka, performance is key.
Here, partitions come into play. As we already discussed, a topic holds its data in one or more partitions. Once a topic is configured with more than a single partition, Apache Kafka starts to distribute the messages evenly across them. To visualize, let's zoom into our image above and show the distribution of the partitions inside a single topic on the server nodes.
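How a message ends up in a particular partition can be sketched like this. The function below is an illustration only: it rotates through the partitions for messages without a key and hashes the key otherwise, using `zlib.crc32` as a stand-in hash. Kafka's own partitioner uses a different hash function and a more refined strategy for messages without a key, so treat the names and the details as assumptions.

```python
import zlib
from itertools import count

NUM_PARTITIONS = 3          # assumed partition count for this sketch
_round_robin = count()      # running counter for messages without a key


def choose_partition(key=None, num_partitions=NUM_PARTITIONS):
    """Pick a partition index for a message (illustrative, not Kafka's partitioner)."""
    if key is None:
        # No key: rotate through the partitions so the load spreads evenly.
        return next(_round_robin) % num_partitions
    # With a key: the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


unkeyed = [choose_partition() for _ in range(6)]
keyed = [choose_partition("customer-42") for _ in range(3)]
```

Here `unkeyed` cycles through partitions 0, 1, and 2 twice, while every entry in `keyed` is the same partition, so messages for one key keep their relative order.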
In the graphic, leaders are shown in green and followers in blue. By placing the leaders on different physical cluster nodes, Apache Kafka allows you to scale read and write performance horizontally for each topic.
With a setup like this, maximum write performance is available when the topic has the same number of partitions as there are nodes in the cluster. Each cluster node then manages one partition, and the underlying hardware is utilized as fully as possible.
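The placement idea can be illustrated with a few lines. This is a minimal sketch that spreads partition leaders over nodes in simple rotation; the function name and node names are hypothetical, and Kafka's real assignment logic takes more factors into account.

```python
def place_leaders(num_partitions, nodes):
    """Assign each partition's leader to a node in rotation (illustrative only)."""
    return {partition: nodes[partition % len(nodes)]
            for partition in range(num_partitions)}


nodes = ["node-1", "node-2", "node-3"]
placement = place_leaders(3, nodes)
```

With three partitions on three nodes, every node leads exactly one partition, so writes to the topic are shared by all machines in the cluster.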
We now have an understanding of how Apache Kafka stores and manages the data it handles. With this clear view of the architecture, we'll discuss the most important consequences and how we configure the system for various use-cases.
Did you find this post helpful, or maybe even have some constructive feedback? Get in touch and let's continue the discussion.