Overview of Kafka and ZooKeeper Architecture

Kafka and Zookeeper Architecture

Subscription

Table of content

Introduction to Kafka

The initial goal was to solve the low-latency ingestion of large amounts of event data from the LinkedIn website and infrastructure into a lambda architecture that leveraged Hadoop and real-time event processing systems. The "real-time" processing proved crucial. There were no real-time solutions for this type of ingress at the time.

Although there were solid methods for ingesting data into offline batch systems, they exposed implementation details to downstream users and used a push model that might rapidly overload a user. They were also not created with the real-time use case in mind. It was extremely difficult to get data from source systems and consistently move it around, and previous batch-based solutions and corporate messaging solutions failed to tackle the problem.

Kafka was designed to provide the ingestion backbone for this type of use case. Kafka was ingesting over 1 billion events per day in 2011. LinkedIn recently recorded daily consumption rates of 1 trillion messages.


Read more about KAFKA vs. PULSAR: Pick the Right one for your Business


What is the Architecture of Kafka?

A Kafka topic is a collection of messages that belong to a given category or feed name. Data is saved in Kafka in the form of topics. Data is written to subjects by producers, and consumers read data from these topics.elixir-data-kafka-architecture

What is the role of broker in Kafka?

A Kafka cluster is made up of one or more brokers or servers. A broker in Kafka acts as a container for numerous topics with various partitions. A unique integer ID identifies the Kafka cluster's brokers. A connection to any of the cluster's Kafka brokers implies connecting to the entire cluster. If a cluster has more than one broker, the brokers do not have all of the data connected with a single topic.

Consumers and consumer groups

Consumers read data from the Kafka cluster. When the consumer is prepared to receive the message, the data to be read by the consumer must be fetched from the broker. In Kafka, a consumer group collects consumers who all pull data from the same topic or range of topics.

Producer

Kafka producers send messages to one or more topics. Data is sent to the Kafka cluster. When a Kafka producer sends a message to Kafka, the broker receives it and appends it to one partitions. Producers have the option of publishing messages to a certain division.

Significance of Partitioning and keys

In Kafka, topics are separated into a customizable number of sections known as partitions. Partitions enable several consumers to read data from the same subject simultaneously. Partitions are split in chronological sequence. The number of partitions is set when configuring a topic, although this number can be adjusted afterward. A topic's partitions are spread among servers in the Kafka cluster. Each server in the cluster manages the data and requests partitions independently. Messages, coupled with a key, are forwarded to the broker. The key can be used to identify which partition a message will be routed to. All messages with the same key are routed to the same partition. The partition will be determined via a round-robin method if no key is supplied.

Offsets and Replicas

In Kafka, messages or records are allocated to a partition. Each record is given an offset to describe the location of the records inside the partition. The offset value associated with a record can be used to identify it inside its partition uniquely. A partition offset has no relevance outside of that partition. Because records are appended to the ends of partitions, older records will have smaller offset values.

In Kafka, replicas are similar to backups for partitions. A topic's partitions are distributed across numerous servers in a Kafka cluster. Replicas are copies of the partition. They are used to guarantee that no data is lost in the case of a breakdown or a scheduled shutdown.

Leaders and followers

Each partition in Kafka will always have one server acting as the partition's leader. The leader is in charge of all read and writes operations for the partition. There might be zero or more followers for each division. The role of the follower is to repeat the leader's data. If the leader for a particular partition fails, one of the follower nodes can take over as the leader.


Know here to Adopt or not to Adopt Data Mesh?


The right number of Topics/Partitions in a Kafka cluster

In Kafka, a topic partition is the unit of parallelism. Writes to distinct partitions can be done in full parallel on both the producer and broker sides. As a result, costlier procedures like compression can take advantage of more hardware resources. On the consumer side, Kafka always sends all data from a single partition to a single consumer thread. As a result, the consumer's degree of parallelism (within a consumer group) is limited by the number of partitions consumed. As a result, the higher the throughput achieved by a Kafka cluster with more partitions, the better, but having a huge number of partitions may also have several negative impacts, such as

  1. A more significant number of partitions necessitates a more significant number of open file handles.
  2. There's a chance that adding more partitions would increase unavailability.
  3. End-to-end latency may increase when more partitions are added.
  4. More partitions may necessitate additional memory on the client's part.

Throughput is used as a crude calculation for determining the number of partitions. You calculate the total amount of production (for example, x) and consumption (for example, y) that can be achieved on a single partition (for example, y). Let's suppose your throughput goal is t. Then you'll need a minimum of max(t/x, t/y) partitions. The amount of per-partition throughput that may be achieved on the producer is determined by batching size, compression codec, acknowledgment type, replication factor, and so on.

In general, a Kafka cluster with more partitions has greater throughput. However, the impact of having too many partitions in total or per broker on factors like availability and latency must be considered.

What is the Architecture of ZooKeeper? 

Apache ZooKeeper is a free and open-source distributed coordination service that aids in the management of a large number of servers. It isn't easy to manage and coordinate in a distributed setting. ZooKeeper automates this process, allowing developers to develop software features rather than worrying about how it is distributed.

Client-Server architecture is used by Apache ZooKeeper, in which clients are “machine nodes” and servers are nodes.

The relationship between the servers and their clients is depicted below. We can see that each client uses the client library and communicates with any of the ZooKeeper nodes in this way.elixir-data-zookeeper-architecture

Server

When a client connects, server sends an acknowledgment. If the connected server does not respond, client will immediately transfer the message to another server.

Client

The client is a node in a distributed application cluster. It enables you to obtain information from a server. Every client sends a message to the server at regular intervals, which allows the server to determine whether or not the client is alive.

Leader

A server is designated as the Leader. It provides all of the information to the clients and an acknowledgment that the server is still operational. If any of the connected nodes fails, it will undertake automated recovery.

Follower

A follower is a server node that obeys the instructions of the leader. Client read requests are handled by the ZooKeeper server connected to the client. The ZooKeeper leader is in charge of handling client write requests.

Cluster/Ensemble

Ensemble or Cluster is a collection of ZooKeeper servers. When using Apache, you can use ZooKeeper infrastructure in cluster mode to keep the system working at its best.


Read more about Data Quality and its Challenges


Role of ZooKeeper in Kafka Topic and Backup recovery

ZooKeeper accomplishes its high availability and consistency by distributing data over multiple collections of nodes. ZooKeeper can execute instant failover migration if a node fails; for example, a leader node fails, a new one is selected in real-time via polling within an ensemble. If the initial node fails to answer, a client connected to the server might query a different node.
ZooKeeper is used in a minor but crucial coordinating function in newer versions of Kafka (>= 0.10.x). When new Kafka Brokers join the cluster, ZooKeeper is used to discover and connect to the other brokers. The cluster also uses ZooKeeper to pick the controller and track the controller epoch. Finally, and arguably most significantly, ZooKeeper saves the Kafka Broker topic partition mappings, which keep track of the data held on each broker. The data will still exist in the absence of these mappings, but it will not be accessible or copied.

How the heartbeat is managed for Kafka Brokers?

The mechanism by which Kafka can know that the business application with the consumer is still up and running or is not running is known as heartbeat. This helps Kafka expel a non active consumer from the group to distribute partitions to the remaining consumers. Still, this way of handling heartbeat had a problem, i.e., If your processing takes a lengthy time, you'll need to set a long timeout, which will result in a system that takes minutes to notice and recover from failure.

The Kafka broker controls this process by delivering network status updates stating that we're still alive and evicting each other based on timeouts.

In older versions of Kafka, we had to call the poll method to trigger the heartbeat mechanism.

Kafka's newer versions use KIP-62 (a background thread) to handle this mechanism. So, instead of waiting for us to call a poll, this thread informs Kafka that everything is good. So, suppose your entire process dies along with the heartbeat thread. In that case, Kafka will soon notice. It will no longer be influenced by the amount of time you spend processing data because the heartbeat is transmitted asynchronously and more frequently.

  1. Discover more about DevOps Best Practices for Data Engineers
  2. Explore more about Big Data Discovery using Blockchain

Fresh news directly to your mailbox