
Apache Kafka

Every modern business revolves around data. Precisely for this reason, several technologies, platforms and frameworks have emerged over the years to support advanced data management. One such solution is Apache Kafka, a distributed streaming platform designed for high-speed, real-time data processing.

To date, Kafka has been adopted by thousands of companies around the world. Let's take a closer look at what makes it special and how different companies can use it.

What is Apache Kafka?

Kafka is an open-source, distributed streaming platform that allows you to store, read, and analyze data in real time. It is a powerful tool capable of handling billions of events a day while still performing quickly, largely thanks to its distributed nature.

Kafka was originally created at LinkedIn to track the behavior of its users and build connections between them, before being handed over to the community and the Apache Foundation in 2011. At the time, it aimed to solve a real problem LinkedIn's developers struggled with: low-latency ingestion of large volumes of event data. There was a clear need for real-time processing, but no existing solution could truly accommodate it.

This is how Kafka was born. Since then, it has come a long way and evolved into a complete publish-subscribe distributed messaging system, providing the backbone for building robust applications.

How does Kafka work?

A streaming platform like Kafka has three core capabilities: publishing and subscribing to streams of records, storing those streams in a fault-tolerant manner, and processing them as they occur.

With Kafka specifically, applications acting as producers publish messages (records) to Kafka nodes called brokers. Each record, consisting of a key, a value, and a timestamp, is stored in a category called a topic; applications acting as consumers then subscribe to topics and process the records. Several brokers working together form a Kafka cluster.
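To make this concrete, here is a minimal producer sketch using the official Java client. The broker address (localhost:9092), topic name, and key are placeholders chosen purely for illustration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record carries a key, a value and (implicitly) a timestamp.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("user-events", "user-42", "page_view");
            producer.send(record);   // asynchronous; returns a Future<RecordMetadata>
            producer.flush();        // make sure the record leaves the client before exiting
        }
    }
}
```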

However, as topics can grow in size, they are divided into smaller partitions to improve performance and scalability. All messages within each partition are ordered in the immutable sequence in which they arrived and are continually added to a structured commit log.

Additionally, partition records are assigned a sequential identification number (offset) that uniquely identifies each record within the partition. Data from partitions is also replicated across multiple brokers to preserve all information in case one of them dies.

What's also interesting is that Kafka doesn't keep track of which records have already been consumed. The cluster durably persists all published records for a configurable retention period, and it is the consumers themselves who poll Kafka for new messages and decide which records they want to read. Once the retention period expires, records are discarded to free up disk space.
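A matching consumer sketch might look like this; again, the broker address, group id, and topic name are illustrative placeholders. Note how the consumer actively polls for new records and sees each record's offset within its partition.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "example-group");             // consumers in one group share partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                // The consumer pulls new records from the broker; nothing is pushed to it.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```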

As for APIs, there are five key components in Kafka:

  • The Producer API, which allows an application to publish a stream of records to one or more topics,
  • The Consumer API, which allows an application to subscribe to topics and process the corresponding stream of records,
  • The Streams API, which makes it possible to transform input streams into output streams (a minimal example appears after this list),
  • The Connector API, which allows you to build and run reusable producers or consumers that connect Kafka topics to existing applications or data systems,
  • The Admin API, which allows you to manage and inspect topics, brokers, and other Kafka objects.

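As a rough illustration of the Streams API, the sketch below reads records from one topic, transforms each value, and writes the results to another topic. The application id, broker address, and topic names are assumptions made up for this example.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");     // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read an input stream, transform each value, and write it to an output topic.
        KStream<String, String> source = builder.stream("raw-events");
        source.mapValues(value -> value.toUpperCase()).to("uppercased-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```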

As a result, Kafka is capable of quickly ingesting and moving large amounts of data, facilitating communication between various, even loosely connected, elements of computer systems.

Advantages and disadvantages of Apache Kafka

There are a few reasons why Kafka seems to be growing in popularity. First, given the enormous amount of data that is produced and consumed across different services, applications, and devices, many businesses can benefit from event-driven architecture today. A distributed streaming platform with durable storage like Kafka is said to be the cleanest way to achieve such an architecture.

Kafka also turns out to be reliable and fast, especially thanks to low-latency message delivery, sequential I/O, the zero-copy principle, as well as efficient data clustering and compression. All of this makes Kafka a suitable alternative to traditional messaging brokers.

On the other hand, Kafka is simply complicated. To start with, you have to plan and calculate an adequate number of brokers, topics, and partitions. Rebalancing the Kafka cluster can also impact the performance of both producers and consumers (and therefore pause data processing), and it's easy for old records to be deleted earlier than you'd like (with high throughput and limited disk space, for example). Finally, Kafka can easily be overkill if you don't really need its features.

What is Apache Kafka for?

According to the Kafka development team, there are a few key use cases it is designed for, including:

  • messaging
  • website activity tracking
  • log aggregation
  • operational metrics
  • stream processing

Whenever there is a need to build real-time streaming applications that need to process or react to “chunks” of data, or reliably transfer data between systems or applications, Kafka comes to the rescue.

It's one of the reasons why Kafka works well with banking and finance applications, where transactions need to be processed in a specific order. The same goes for transportation and logistics, as well as retail, especially when IoT sensors are involved. In these sectors, constant monitoring, asynchronous and real-time applications (e.g., inventory controls), advanced analytics, and systems integration, to name a few, are often needed.

In fact, any company that wants to take advantage of data analytics and integrate complex tools (for example, connecting CRM, POS, and e-commerce applications) can benefit from Kafka. This is precisely where it fits into the equation.

Who uses Apache Kafka?

It should come as no surprise that Kafka remains a core part of LinkedIn's infrastructure. It is primarily used for activity tracking, message sharing, and metrics collection, but the list of use cases doesn't end here. Most of the data communication between the different services in the LinkedIn environment uses Kafka to some extent.

At the moment, LinkedIn reports maintaining more than 100 Kafka clusters with more than 4,000 brokers, serving 100,000 topics and millions of partitions, and the total number of messages handled by its Kafka deployments has already exceeded 7 billion per day.

Although no other service uses Kafka at the scale of LinkedIn, many other applications, companies, and projects take advantage of it. At Uber, for example, many processes are modeled with Kafka Streams – including customer and driver matching and ETA calculations. Netflix also adopted the multi-cluster Kafka architecture for stream processing and now seamlessly handles billions of messages each day.

At Future Mind, we also had the opportunity to implement Kafka. For Fleet Connect, a vehicle tracking and fleet management system, we monitor the location of each vehicle (along with some other parameters) thanks to dedicated devices inside it. The data collected by these devices reaches the IoT Gateway, which decodes the messages and sends them to Kafka. From there, they end up in IoT Collector, where data processing and analysis takes place.

Notably, the trackers inside the vehicles send their messages one by one, in order of occurrence, and only after the previous message has been accepted and stored correctly, so that no relevant data is lost. For this reason, however, we have to accept all these messages quickly, so as not to overload the trackers, while still processing the data in real time.

In this case, horizontal scaling with classic message brokers and "traditional" queues would not be enough, given the specifics of the project and the need to analyze data in real time. With Kafka, however, we can partition the input stream by vehicle ID and use multiple brokers, which gives us nearly unlimited horizontal scalability, data replication in case of failure, high operational speed, continuity, and real-time processing of data in a specific order.

Should I use Kafka for my projects?

1. Consider whether you really need Kafka at all; it may be overkill if some of the following are true:

  • You don't need to process thousands of messages per second
  • You don't need to maintain a specific order in which data is processed
  • Losing some messages in the event of a failure would not be a serious problem

2. Think about the different components of the system.

  • How many brokers do you need (i.e. how distributed should the system be), how many ZooKeeper instances, how many servers and in what locations?
  • What topics and partitions you plan to have; these can be changed later, but the change itself is quite problematic (see the topic-creation sketch after this list)
  • How many different producers and consumers will use the system, and in how many instances (within a group of consumers) will each of them be parallelized?
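For illustration, the sketch below creates a topic programmatically through the Admin client. The topic name, partition count, and replication factor are placeholder values; they should follow from your own throughput estimates and the number of brokers you run.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions and a replication factor of 3 are illustrative values only.
            NewTopic topic = new NewTopic("vehicle-positions", 12, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```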

3. Think about the entire setup, especially:

  • What is the best choice for a partition key, and what is the partition strategy behind it?
  • How to choose the best strategy for grouping messages (for both producers and consumers) to maintain a balance between processing/delivery speed and delays?
  • What should the retry policy and timeouts be for messages delivered by producers to Kafka brokers? (A producer configuration sketch follows this list.)
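One possible starting point for a producer configuration that balances durability, retries, and batching is sketched below. The specific values are illustrative assumptions rather than recommendations and should be tuned for your own workload.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ReliableProducerConfig {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Durability: wait for all in-sync replicas to acknowledge each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transient failures, but give up after an overall delivery timeout.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        // Batching: trade a little latency (linger) for better throughput.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
        // Keep retries from reordering or duplicating records within a partition.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        return new KafkaProducer<>(props);
    }
}
```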

4. Test your configuration and check if:

  • If one of the brokers goes down, do the remaining brokers take over the partitions it was managing (in other words, does partition rebalancing kick in)?
  • If one of the consumer instances is shut down, do the remaining instances within the same consumer group start processing messages from the inactive consumer's partitions?
  • Can you handle the retry policy and offset commits correctly, without losing any messages or processing duplicates along the way? (See the manual-commit sketch after this list.)
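A common way to avoid both losing messages and processing duplicates is to disable auto-commit and commit offsets only after the records have actually been processed, which gives at-least-once semantics. The sketch below shows this pattern; the group id and topic name are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fleet-processors");           // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Commit offsets explicitly, only after records have been processed.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("vehicle-positions"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);   // your business logic goes here
                }
                // Committing after processing gives at-least-once semantics:
                // a crash before this line means records are re-read, not lost.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d key=%s value=%s%n",
                record.offset(), record.key(), record.value());
    }
}
```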

5. Monitor Kafka using the available tools and verify that:

  • No partitions are left unused
  • No consumer is lagging too far behind the latest offsets of its partitions
  • All brokers are up and healthy
  • Load is well balanced across brokers and consumers (the Admin API sketch below shows one way to inspect brokers and committed offsets)
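There are dedicated monitoring tools for this, but the Admin API alone already allows some basic checks, such as listing the live brokers and reading a consumer group's committed offsets (comparing them with the latest offsets gives the consumer lag). The sketch below is a minimal example; the broker address and group name are placeholders.

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartition;

public class ClusterHealthCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Which brokers are currently part of the cluster?
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.printf("broker %d at %s:%d%n", node.id(), node.host(), node.port());
            }

            // Committed offsets per partition for one consumer group ("fleet-processors"
            // is a placeholder); comparing them with the latest offsets gives the lag.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("fleet-processors")
                         .partitionsToOffsetAndMetadata()
                         .get();
            committed.forEach((tp, offset) ->
                    System.out.printf("%s committed offset=%d%n", tp, offset.offset()));
        }
    }
}
```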