BEGINNER’S GUIDE
How Apache Kafka Works
Understand the components and architecture of Apache Kafka, an event streaming platform capable of handling trillions of requests a day.
In the last decade, the industry has moved from monolithic architectures to microservices for agile development. But microservices bring their own problems, and many new technologies have emerged to overcome them. For example, message queues are becoming more and more popular as a communication channel between decoupled microservices and in various modern architecture patterns. Apache Kafka is a leading open source event streaming platform used as a message queue and for the pub-sub pattern. Kafka is popular for its unique features and for performing better than other available message queue technologies. In addition, Kafka provides a feature-rich streams library for data streaming and the Kafka Connect framework for working with external resources like databases, filesystems and other services available on the network.
More than 80% of all Fortune 100 companies trust, and use Kafka.
- Apache Kafka official Site
What is Apache Kafka?
As mentioned in its official documentation, Apache Kafka is a distributed streaming platform. Technically, streaming means capturing and processing events from different systems in real time. Kafka implements a publish-subscribe mechanism for working with streams, stores streams of data durably for as long as you want, and provides the capability to process streams of events as they occur. Kafka is a distributed system with servers and clients communicating over a high-performance, fault-tolerant TCP protocol. Kafka runs as a cluster on bare metal, virtual machines or containers. It implements message queue functionality in a highly scalable, elastic, fault-tolerant and secure manner.
Key Concepts
Like any other message queue technology, Kafka has brokers, producers and consumers as its core components. Producers are the clients that publish events into Kafka, and consumers poll for those events and process them as they are received. Kafka provides client libraries in almost all popular programming languages to make it easy to implement both producers and consumers. In Kafka, producers and consumers are completely independent of each other; this is an important design decision for providing high scalability.
Records
Records are the events published by producers into the Kafka cluster. Each record in Kafka is made of a key, a value, a timestamp and optional headers. Kafka allows null keys in records, and even null values are permitted (they serve as tombstones in compacted topics). Kafka also lets you configure which timestamp to use for a record: the time the record was created by the producer (CreateTime) or the time the record was appended to the broker's log (LogAppendTime).
Example of a record in Kafka:
Key: "1234421"
Value: "{"product": "xyz", "cost": 12.7}"
Timestamp: "Oct. 19, 2020 at 8:20 p.m."
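As a minimal sketch of how such a record could be built with Kafka's Java client (the topic name orders and the header are illustrative assumptions, not part of the example above):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordExample {
    public static void main(String[] args) {
        // Key, value, an explicit timestamp and an optional header, mirroring the example above.
        long timestamp = 1603138800000L; // Oct. 19, 2020, 8:20 p.m. UTC as epoch milliseconds
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "orders", null, timestamp, "1234421", "{\"product\": \"xyz\", \"cost\": 12.7}");
        record.headers().add("source", "checkout-service".getBytes(StandardCharsets.UTF_8));
        System.out.println(record);
    }
}
```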
Topic and Partitions
In Kafka, topics are used to categorise the records produced in the system. Topics are like directories: we push events into a topic the same way we add files to a directory. Topics within the same Kafka cluster must have unique names, for example: orders, payments, etc. A producer specifies the topic to push each record to, and similarly consumers define a list of topics to poll for new events. Multiple producers can publish events to the same topic, and likewise zero, one or more consumers can subscribe to a topic. All events in a topic are stored durably in Kafka, and you can configure the retention period for those messages. This design gives consumers the benefit that they can consume the same message as many times as they want, unlike with other traditional queues. We need not worry about performance as stored data grows, because Kafka's performance is effectively constant with respect to the size of the data stored (you can read more on this in Kafka's documentation).
Further, topics are divided into multiple partitions distributed among the available brokers in the cluster; that is, each topic in Kafka is divided into one or more buckets. Partitions are ordered commit logs with incremental offsets for all new events added to them, which means order is maintained for events pushed to a partition and each record gets a unique, incremental offset. Partitions are append-only logs: once a message is appended to a partition, it cannot be modified.
You can also configure the replication factor for each topic, and Kafka will keep that many copies of your events for fault tolerance. Kafka distributes the partitions and their replicas across the brokers available in the cluster, which allows it to provide better fault tolerance and elasticity. For example, if we run three brokers and divide our topic into two partitions with a replication factor of three, Kafka will place the leader of each partition on a different broker, with replicas on the remaining brokers.
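As a hedged sketch using Kafka's Java AdminClient (the broker address is an assumption), a topic matching the example above, with two partitions and replication factor three, could be created like this:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Two partitions, each replicated to three brokers, as in the example above.
            NewTopic orders = new NewTopic("orders", 2, (short) 3);
            admin.createTopics(Collections.singleton(orders)).all().get();
        }
    }
}
```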
Kafka Brokers
Kafka runs as a cluster of one or more broker nodes. Broker nodes store all events and coordinate with producers and consumers. For fault tolerance, Kafka replicates data (partitions), and the replicas are distributed among the available brokers to achieve the best possible fault tolerance. One broker is elected as the leader for each partition, and all requests for that partition are served by this leader, while the other brokers hosting replicas passively copy data from the leader. All the partitions of a topic are distributed among the available brokers in the cluster for better efficiency and load balancing. Kafka also elects one broker as the controller, whose responsibility is to administer the cluster and elect leader brokers for partitions, in addition to serving requests from producers and consumers for the topics and partitions stored on it.
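To see this structure from a client's point of view, here is a small sketch (broker address assumed) that uses the AdminClient to list the brokers in a cluster and its current controller:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class ClusterInfoExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            System.out.println("Controller: " + cluster.controller().get());
            for (Node node : cluster.nodes().get()) {
                System.out.println("Broker " + node.id() + " at " + node.host() + ":" + node.port());
            }
        }
    }
}
```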
Producers
Producers publish events into the Kafka ecosystem. For each record, a producer specifies the target topic and, directly or indirectly, the partition to store the event in. Producers fetch the details of the leader broker for the target partition and send records directly to that leader, without any intermediate routing overhead. Producers can implement round-robin partition assignment for load balancing. In addition, Kafka supports semantic partitioning: we can configure the producer to use a specific key in the record to choose the partition. For example, in an eCommerce application, producers can use the order id as the key to partition all invoice data, so that all invoices for the same order id land in the same partition of the topic. Semantic partitioning allows consumers to make better decisions based on data locality.
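Below is a minimal producer sketch in Java, assuming a broker at localhost:9092 and the orders topic from earlier; using the order id as the record key makes all records for that order hash to the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key (here, the order id) always go to the same partition.
            producer.send(new ProducerRecord<>("orders", "1234421",
                    "{\"product\": \"xyz\", \"cost\": 12.7}"));
        }
    }
}
```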
To improve performance, producers batch events on the client side before actually sending them to the broker, avoiding the latency of a round trip to the broker for each event. Batching also allows Kafka to store data contiguously on the filesystem, taking advantage of OS page caching and read-ahead for better performance. Because communication between producers and brokers goes over the network, bandwidth can bottleneck Kafka's end-to-end performance. Kafka supports various compression formats to send more data over the same bandwidth by compressing the entire batch on the producer side. We could compress each record individually ourselves, but anyone with a little knowledge of compression techniques knows that more duplication means better compression, and a batch of records contains plenty of duplication across its headers, keys and some values.
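As an illustration, the following settings (real producer configuration keys, with arbitrary example values) could be added to the Properties object in the producer sketch above to control batching and compression:

```java
// Added to the producer Properties from the sketch above; values are examples, not recommendations.
props.put("linger.ms", "10");          // wait up to 10 ms so more records can join a batch
props.put("batch.size", "32768");      // accumulate up to 32 KB per partition before sending
props.put("compression.type", "lz4");  // compress each batch as a whole on the producer side
```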
Consumers
Kafka consumers periodically send fetch requests to brokers to pull the data they need. In Kafka, every consumer belongs to a consumer group. The concept of consumer groups allows Kafka to provide parallel message consumption while maintaining ordering of data processing at some level. A partition of a topic can be consumed by only one consumer from each consumer group at a time. Since the order of events pushed to a partition is maintained, a consumer is guaranteed to see events in the same order they were published to that partition (be aware that ordering is maintained per partition, not across an entire topic with multiple partitions). To achieve parallelism, we divide the topic into multiple partitions and create multiple consumers under the same consumer group; the broker distributes the partitions among the consumers of that group. If a consumer group has fewer consumers than partitions, multiple partitions are assigned to the same consumer. If there are more consumers in a group than partitions, the extra consumers remain idle and are automatically assigned a partition if an active consumer crashes. If multiple applications want to process the same data differently, each should use its own consumer group; because their groups differ, they can simultaneously consume and process messages from the same partition.
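Here is a minimal consumer sketch under the same assumptions (broker at localhost:9092, topic orders; the group id invoice-service is made up). Consumers started with the same group.id split the topic's partitions among themselves:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "invoice-service"); // consumers sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```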
In a traditional message queue system, brokers manage metadata for each record in the queue to track which ones have been consumed. This adds performance overhead on the brokers. To overcome this, Kafka adopted a design where only one consumer from each consumer group consumes data from a given partition. Thanks to this decision, Kafka only needs to track a single number per consumer group per partition: the offset of the next record to consume. This enables lightweight acknowledgment and improves performance significantly. This design also allows consumers to go back and reprocess records: since Kafka stores just an offset, a consumer can request any offset and move forward or backward through the log. This departs from how traditional message queues work, but it helps in specific cases where reprocessing of data is required. Similar to producers, consumers can fetch multiple records in response to a single request, avoiding multiple round trips to brokers and enabling better compression.
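Here is a hedged sketch of that rewind capability using the Java client's seek API (the topic, partition number and offset are all illustrative): the consumer manually assigns itself a partition and jumps back to an earlier offset to reprocess records.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition)); // manual assignment, no group rebalancing
            consumer.seek(partition, 42L);                         // jump back to offset 42 and re-read from there
            consumer.poll(Duration.ofMillis(500))
                    .forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        }
    }
}
```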
Producers and consumers are agnostic of each other: producers are not required to know anything about consumers, and the same is true in reverse. As all components of the Kafka ecosystem communicate over a language-agnostic TCP protocol, developers can implement producers and consumers in different programming languages as requirements dictate.
Pros and Cons of Apache Kafka
Pros:
- Exactly-once message processing guarantees
- Robust platform for stream processing
- Exceptional performance and stability
- Reliable data durability
- Ordering preserved at the partition level (ordering is guaranteed for data within the same partition)
Cons:
- No support for multi-tenancy
- Does not support robust multi-datacenter replication out of the box (there are open source projects that help achieve this to some extent)
- Lack of built-in monitoring and management tools
- Too many configuration options, which can overwhelm newcomers and professionals alike
- Does not support retry and failover topics the way RabbitMQ does
Apache Kafka is a perfect fit for stream processing. Beyond that, Kafka is used industry-wide for tracking website activity, metrics collection and monitoring, log aggregation, real-time analytics, complex event processing, ingesting data into Spark or Hadoop, CQRS, message replay, error recovery, and as a guaranteed distributed commit log for in-memory computing.
Summary
In this article, I aimed to describe the basic and important concepts of Apache Kafka. Adoption of Apache Kafka as a message queue technology is increasing in this age of microservices. Kafka provides message queue features and patterns in a highly available, fault-tolerant, scalable and distributed way. Kafka's ability to store data durably makes it stand out from other similar message queue technologies. Kafka is designed for high-throughput, low-latency requirements. Every message queue technology has its own use cases, pros and cons. Before choosing one, I would suggest understanding the pros and cons of each and picking the one that fulfils your requirements, considering other factors like maintenance cost, hardware requirements, learning curve and many more.