Introduction to Apache Kafka Tutorial

Welcome to the third chapter of the Apache Kafka tutorial (part of the Apache Kafka Course). This lesson provides an introduction to Kafka.

In the next section of this Apache Kafka tutorial, we will discuss the objectives of this lesson.

Objectives

After completing this lesson, you will be able to:

  • Define Kafka

  • Describe some use cases for Kafka

  • Describe the Kafka data model

  • Describe Kafka architecture

  • List the types of messaging systems

  • Explain the importance of brokers

In the next section of this Apache Kafka tutorial, we will introduce Apache Kafka.

Apache Kafka - Introduction

Kafka is a high-performance, real-time messaging system. It is an open source tool and a project of the Apache Software Foundation.

The characteristics of Kafka are:

  • Kafka is a distributed and partitioned messaging system that is highly fault-tolerant and scalable.

  • It has been tested to process and send millions of messages per second to several receivers.

In the next section of this Apache Kafka tutorial, we will discuss the history of Apache Kafka.

Kafka History

Apache Kafka was originally developed at LinkedIn to handle its log files and was open sourced in early 2011. It graduated from the Apache Incubator and became a top-level Apache project in October 2012.

The stable Apache Kafka version 0.8.2.0 was released in February 2015, followed by the stable version 0.8.2.1 in March 2015, which was the latest version when this lesson was written.

In the next section of this Apache Kafka tutorial, we will discuss use cases of Apache Kafka.

Kafka Use Cases

Kafka can be used for various purposes in an organization, such as:

Messaging service: Millions of messages can be sent and received in real-time.

Real-time stream processing: Kafka can be used to process a continuous stream of information in real-time and pass data to stream processing systems such as Storm.

Log aggregation: Kafka can be used to collect physical log files from multiple systems and store them in a central location such as HDFS.

Commit log service: Kafka can be used as an external commit log for distributed systems.

Event sourcing: A time ordered sequence of events can be maintained through Kafka.

Aggregating User Activity Using Kafka - Example

Kafka can be used to aggregate user activity data such as clicks, navigation, and searches from the different websites of an organization. This activity data can be sent to a real-time monitoring system and to a Hadoop system for offline processing.

An example is illustrated in the image below.

The information from customer-facing portals is sent in real-time to the Kafka cluster.

The Kafka cluster consists of one or more servers that process the messages in parallel. The information is sent to a real-time monitoring system to monitor the user clicks, navigation, and searches. The information is also saved in a Hadoop system for offline processing.

In the next section of this Apache Kafka tutorial, we will discuss the Apache Kafka data model.

Kafka Data Model

The Kafka data model consists of messages and topics. Messages represent information such as lines in a log file, a row of stock market data, or an error message from a system.

Messages are grouped into categories called topics. For example, log messages could go to a topic named LogMessage and stock market data to a topic named StockMessage.

The processes that publish messages into a topic in Kafka are known as producers. The processes that receive the messages from a topic in Kafka are known as consumers. The processes or servers within Kafka that process the messages are known as brokers.

A Kafka cluster consists of a set of brokers that process the messages. The image illustrates the Kafka data model. It shows a Kafka cluster that consists of three brokers. There are two producers sending messages to the Kafka cluster, and two consumers receiving the messages from the cluster.

Producer 1 creates messages for topic 1, whereas producer 2 sends messages for topic 2. These messages are processed by the three brokers in parallel and sent to the consumers.

Consumer 1 is interested in topic 2; so, it receives the messages for topic 2. Similarly, consumer 2 is interested in topic 1; so, it receives the messages for topic 1.

The brokers in the Kafka cluster handle the process of receiving, storing, and forwarding the messages to the interested consumers.
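The data model described above can be sketched in a few lines of Python. This is only an in-memory illustration, not the Kafka client API: the class ToyCluster and its publish and fetch methods are hypothetical names chosen for this example.

```python
from collections import defaultdict


class ToyCluster:
    """In-memory stand-in for a Kafka cluster: it receives, stores, and serves messages."""

    def __init__(self):
        # topic name -> list of stored messages, in arrival order
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        """Called by a producer to add a message to a topic."""
        self.topics[topic].append(message)

    def fetch(self, topic):
        """Called by a consumer to read every message stored for a topic."""
        return list(self.topics[topic])


cluster = ToyCluster()
cluster.publish("topic 1", "message from producer 1")
cluster.publish("topic 2", "message from producer 2")

consumer_1_messages = cluster.fetch("topic 2")  # consumer 1 is interested in topic 2
consumer_2_messages = cluster.fetch("topic 1")  # consumer 2 is interested in topic 1

print(consumer_1_messages)  # ['message from producer 2']
print(consumer_2_messages)  # ['message from producer 1']
```

The sketch leaves out partitions and multiple brokers, which the following sections cover.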

In the next few sections of this Apache Kafka tutorial, we’ll discuss topics, partitions, partition distribution, producers and consumers in Apache Kafka.

Topics in Apache Kafka

A topic is a category of messages in Kafka. The producers publish the messages into topics and the consumers read the messages from topics. A topic is divided into one or more partitions.

A partition is also known as a commit log. Each partition contains an ordered set of messages. Each message is identified by its offset in the partition. Messages are added at one end of the partition and consumed at the other.

The image below illustrates a topic ‘simple’ that is divided into two partitions.

The writes are completed at one end and the reads are completed at the other. It shows six messages in partition 0 and five messages in partition 1.

The offset of message one in partition 0 is zero as it is the first message. The offset of message six in partition 0 is five.

The messages are written in the order 1, 2, 3, 4, 5, and 6, and they are read in the same order. The next message in partition 0 will be message 7, which will be written at offset 6. The next message in partition 1 will be message 6, which will be written at offset 5.
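The offset arithmetic above can be checked with a small append-only log. This is a minimal sketch that assumes a partition behaves like a Python list. The Partition class and its method names are hypothetical and are not part of Kafka.

```python
class Partition:
    """A toy append-only log: messages are appended at the end and read by offset."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Append a message and return the offset it was written at."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset):
        """Return the message stored at the given offset."""
        return self._log[offset]

    def next_offset(self):
        """Return the offset the next appended message will receive."""
        return len(self._log)


# Topic 'simple' with two partitions: six messages in partition 0, five in partition 1.
simple = {0: Partition(), 1: Partition()}
offsets_p0 = [simple[0].append(f"message {i}") for i in range(1, 7)]
offsets_p1 = [simple[1].append(f"message {i}") for i in range(1, 6)]

print(offsets_p0)               # [0, 1, 2, 3, 4, 5]
print(simple[0].read(5))        # message 6
print(simple[0].next_offset())  # 6
print(simple[1].next_offset())  # 5
```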

Partitions in Apache Kafka

Topics are divided into partitions, which are the unit of parallelism in Kafka. Partitions allow messages in a topic to be distributed to multiple servers or brokers so that the messages in a topic can be processed in parallel.

A topic can have any number of partitions. Each partition must fit on a single Kafka server. The number of partitions in a topic determines the parallelism of the topic.

The image below illustrates two partitions of a topic ‘simple.’

Partition 0 consists of six messages, whereas partition 1 consists of five messages.

Partition Distribution in Apache Kafka

Partitions can be distributed across the Kafka cluster. Each Kafka server or broker may handle one or more partitions.

A partition can be replicated across several servers for fault-tolerance.

One server is marked as the leader for the partition and the others are marked as followers. The leader controls the reads and writes for the partition, whereas the followers replicate the data. If the leader fails, one of the followers automatically becomes the leader.

ZooKeeper is used for the leader selection as explained in the previous lesson.

The image below illustrates the partitions of a topic ‘simple’.

Here, partition 0 is assigned to server 1 and partition 1 is assigned to server 2. These servers process the messages in parallel to increase throughput.
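One way to picture partition distribution is a round-robin placement of partitions, and their replicas, on brokers. The sketch below is an illustration under that assumption and does not reproduce Kafka's actual assignment algorithm. The function name assign_partitions is hypothetical.

```python
def assign_partitions(num_partitions, brokers, replication_factor=1):
    """Place each partition on `replication_factor` distinct brokers, round-robin.

    The first broker in each list is treated as the leader and the rest as followers.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            brokers[(partition + replica) % len(brokers)]
            for replica in range(replication_factor)
        ]
    return assignment


# Two partitions of topic 'simple' on two servers, without replication.
layout = assign_partitions(2, ["server 1", "server 2"])
print(layout)  # {0: ['server 1'], 1: ['server 2']}

# The same topic on three servers with one follower per partition.
replicated = assign_partitions(2, ["server 1", "server 2", "server 3"], replication_factor=2)
print(replicated)  # {0: ['server 1', 'server 2'], 1: ['server 2', 'server 3']}
```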

Producers in Apache Kafka

The producer is the creator of the message in Kafka. Producers place the message on a particular topic and decide which partition to place the message into.

For example, a producer may place a message into partition 0 of topic simple.

Another producer may place a message into partition 1 of topic simple. Topics should already exist before a message is placed by the producer. Messages are added at one end of the partition by Kafka.

The image below illustrates a producer that creates three messages and sends them to different topics and partitions in Kafka.

Message 1 is sent to partition 0 of topic test 1, message 2 is sent to partition 1 of topic test 1, and message 3 is sent to partition 0 of topic test 2.
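How a producer might decide on a partition can be sketched as follows. The strategy here is an assumption made for illustration: messages with a key are hashed so that the same key always lands in the same partition, and messages without a key are spread round-robin. The ToyPartitioner class is hypothetical, and CRC32 is only a stand-in hash, since real Kafka clients use their own hash functions.

```python
import itertools
import zlib


class ToyPartitioner:
    """Chooses a partition for each message (illustrative only)."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key=None):
        """Return the partition number for a message with the given key."""
        if key is None:
            # No key: spread messages evenly across partitions.
            return next(self._round_robin)
        # Keyed message: a stable hash keeps each key in one partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=2)

keyless = [partitioner.partition_for() for _ in range(4)]
print(keyless)  # [0, 1, 0, 1]

first = partitioner.partition_for("user-42")
second = partitioner.partition_for("user-42")
print(first == second)  # True
```

Keeping all messages for one key in one partition matters because, as described earlier, ordering is maintained within a partition.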

Consumers in Apache Kafka

The consumer is the receiver of the message in Kafka. Each consumer belongs to a consumer group. A consumer group may have one or more consumers. The consumers specify what topics they want to listen to.

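A message is delivered to every consumer group that listens to its topic, but to only one consumer within each group. Consumer groups therefore control whether the messaging system behaves as a queue or as a publish-subscribe system.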

The image below illustrates the three consumer groups.

They are:

  • Consumer group 1

  • Consumer group 2

  • Consumer group 3

Consumer group 1 consists of three consumers called consumer 1, consumer 2, and consumer 3.

Consumer group 2 consists of two consumers called consumer 4 and consumer 5.

Consumer group 3 consists of a single consumer called consumer 6.

In the next section of this Apache Kafka tutorial, we will discuss Kafka architecture.

Kafka Architecture

Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic.

Brokers provide the messages to the consumers from the partitions. The producers create the messages and send them to a particular topic and a partition of a Kafka cluster.

A topic is divided into multiple partitions.

The messages are added to the partitions at one end and consumed in the same order. Each partition acts as a message queue.

Consumers are divided into consumer groups. Each message is delivered to one consumer in each consumer group. ZooKeeper is used for coordination among the Kafka brokers.

The image below illustrates the Kafka Architecture that consists of two partitions called partition 1 and partition 2.

The two producers are sending messages to the two brokers in the cluster.

The brokers add the messages to the partitions and the messages are taken from the partitions in the same order as insertion. The messages are sent to two consumer groups.

The image also illustrates that the Kafka cluster or brokers interact with ZooKeeper for distributed coordination.

Types of Messaging Systems in Apache Kafka

Kafka architecture supports two types of messaging systems: the publish-subscribe system and the queue system. The publish-subscribe system is also called pub-sub.

In this system, one system broadcasts the messages and the consumers subscribe to receive the messages.

Each message is received by all the subscribers. So, if there are 100 messages published, each subscriber receives all the 100 messages in the same order that they are produced.

In the queue system, each message is consumed by only one consumer. If there are multiple consumers, each message goes to one of the available consumers, and the messages are handed out in the same order in which they were received.

In the next section of this Apache Kafka tutorial, we will discuss a queue system with an example.

Queue System - Example

The image below illustrates the implementation of a queue system.

Consumer 1, consumer 2, and consumer 3 belong to the same consumer group. So out of the six messages, two messages are received by consumer 1, two messages by consumer 2, and two messages by consumer 3.

Note that the messages are received in the same order that they are produced.

So, consumer 1 receives message 1, consumer 2 receives message 2, and consumer 3 receives message 3. After this, consumer 1 receives message 4, consumer 2 receives message 5, and consumer 3 receives message 6.
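The delivery pattern in this example can be reproduced with a short simulation. It assumes a simple round-robin hand-out of individual messages, which matches the description above. A real Kafka cluster instead assigns whole partitions to the consumers in a group. The function name deliver_as_queue is hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def deliver_as_queue(messages, consumers):
    """Give each message to exactly one consumer, taking turns in order."""
    received = defaultdict(list)
    turn = cycle(consumers)
    for message in messages:
        received[next(turn)].append(message)
    return dict(received)


# Six messages, three consumers in the same consumer group.
queue_result = deliver_as_queue(
    [1, 2, 3, 4, 5, 6],
    ["consumer 1", "consumer 2", "consumer 3"],
)
print(queue_result)
# {'consumer 1': [1, 4], 'consumer 2': [2, 5], 'consumer 3': [3, 6]}
```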

In the next section of this Apache Kafka tutorial, we will discuss a publish-subscribe system with an example.

Publish-Subscribe System - Example

The image below illustrates the implementation of a publish-subscribe system.

Consumer 1, Consumer 2, and Consumer 3 belong to three separate consumer groups. So, all the six messages are sent to all the three consumer groups called consumer group 1, consumer group 2, and consumer group 3.

Since there is only one consumer in consumer group 1, it receives all the six messages in the order 1, 2, 3, 4, 5, and 6. Similarly, consumer 2 and consumer 3 also receive all the six messages in the same order.
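The publish-subscribe behavior follows from one rule: every consumer group receives every message, and within a group each message goes to one consumer. The sketch below simulates that rule in memory. The function name deliver_to_groups and the round-robin choice inside a group are assumptions made for illustration.

```python
def deliver_to_groups(messages, groups):
    """Deliver each message to exactly one consumer in every consumer group.

    `groups` maps a group name to the list of consumers in that group.
    """
    received = {consumer: [] for members in groups.values() for consumer in members}
    for index, message in enumerate(messages):
        for members in groups.values():
            # Within a group, consumers take turns (illustrative choice).
            received[members[index % len(members)]].append(message)
    return received


# Three consumers, each alone in its own consumer group.
pubsub_groups = {
    "consumer group 1": ["consumer 1"],
    "consumer group 2": ["consumer 2"],
    "consumer group 3": ["consumer 3"],
}
pubsub_result = deliver_to_groups([1, 2, 3, 4, 5, 6], pubsub_groups)
print(pubsub_result["consumer 1"])  # [1, 2, 3, 4, 5, 6]
print(pubsub_result["consumer 2"])  # [1, 2, 3, 4, 5, 6]
print(pubsub_result["consumer 3"])  # [1, 2, 3, 4, 5, 6]
```

Putting all three consumers into a single group in the same function would turn the result back into the queue behavior of the previous example.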

Brokers in Apache Kafka

Brokers are the Kafka processes that process the messages in Kafka.

Each machine in the cluster can run one broker. The brokers coordinate with each other using ZooKeeper. One broker acts as the leader for a partition and handles the delivery and persistence, whereas the others act as followers. Brokers receive the messages from the producers and send them to the consumer groups.

Kafka Guarantees

Kafka guarantees the following:

Guarantee 1: Messages sent by a producer to a topic and a partition are appended in the same order. This ensures that the messages produced earlier do not get ahead of the messages produced later. The time order is maintained very strictly.

Guarantee 2: A consumer instance receives the messages of a partition in the same order in which they were appended, which means that the messages within a partition are never delivered out of order.

If the messages are produced in the order 1, 2, 3, 4, 5, 6, they will be received in the order 1, 2, 3, 4, 5, and 6. This is important in messaging systems, as the dependency is on the time order of messages.

Guarantee 3: A topic with replication factor N tolerates up to N-1 server failures.

For example, when the replication factor is specified as 3, there will be no loss of messages even if two machines fail.

Kafka at LinkedIn

LinkedIn (www.linkedin.com) is the largest network of professionals and the originator of Kafka. LinkedIn uses Kafka to manage streams of information.

Some of the uses of Kafka at LinkedIn are as follows:

Monitoring: Kafka is used to collect metrics from various systems and to create monitoring dashboards.

Messaging: Kafka is used as message queues for content feeds and as a publish-subscribe system for searches.

Analytics: Kafka is used to collect page views and clicks from customer-facing websites and to store the information in a central Hadoop-based analytics system.

A building block for distributed applications: Kafka is used as a building block for distributed applications and for building distributed databases and distributed log systems.

In the next section of this Apache Kafka tutorial, we will discuss replication in Apache Kafka.

Replication in Kafka

Kafka uses the primary-backup method of replication. In the primary-backup method, one machine or one replica is called a leader and is chosen as the primary. The remaining machines or replicas are chosen as the followers and act as backups.

The leader propagates the writes to the followers and waits until the writes are completed on all the replicas. If a replica is down, it is skipped for the write.

However, Kafka will write a copy to the machine once it is back. If the leader fails, one of the followers will be chosen as the new leader. This mechanism can tolerate n-1 failures if the replication factor is ‘n’, which can be specified at the topic level.
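The primary-backup behavior described above can be simulated in memory. This is a simplified sketch, not Kafka's implementation. The ReplicatedPartition class is hypothetical, and the new leader is picked by name order purely for determinism, whereas Kafka coordinates leader selection through ZooKeeper.

```python
class ReplicatedPartition:
    """Toy primary-backup replication for a single partition."""

    def __init__(self, replicas):
        self.logs = {replica: [] for replica in replicas}
        self.alive = set(replicas)
        self.leader = replicas[0]

    def write(self, message):
        """Write to the leader and propagate to every live follower."""
        if self.leader is None:
            raise RuntimeError("no live replica left to accept writes")
        for replica in self.alive:  # replicas that are down are skipped
            self.logs[replica].append(message)

    def fail(self, replica):
        """Mark a replica as down and pick a new leader if the leader failed."""
        self.alive.discard(replica)
        if replica == self.leader:
            # Illustrative choice: lowest name among the live replicas.
            self.leader = min(self.alive) if self.alive else None

    def recover(self, replica):
        """Bring a replica back and copy over the writes it missed."""
        if self.leader is None:
            self.leader = replica
        else:
            self.logs[replica] = list(self.logs[self.leader])
        self.alive.add(replica)

    def read(self):
        """Read the full log from the current leader."""
        return list(self.logs[self.leader])


# Replication factor 3: the partition survives two failures.
partition = ReplicatedPartition(["broker 1", "broker 2", "broker 3"])
partition.write("m1")
partition.fail("broker 1")   # the leader fails, a follower takes over
partition.write("m2")
partition.fail("broker 2")
partition.write("m3")

print(partition.leader)  # broker 3
print(partition.read())  # ['m1', 'm2', 'm3']

partition.recover("broker 1")
print(partition.logs["broker 1"])  # ['m1', 'm2', 'm3']
```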

In the next section of this Apache Kafka tutorial, we will discuss persistence in Apache Kafka.

Persistence in Kafka

Persistence means a message can be delivered even if the machine that handles the message fails.

Kafka uses the Linux file system for the persistence of messages. Persistence ensures no messages are lost. Kafka relies on the file system page cache for fast reads and writes.

All the data is immediately written to a file in the file system so that it can be recovered even if the machine fails. Messages are grouped as message sets for more efficient writes.

Message sets can be compressed to reduce network bandwidth. A standardized binary message format is used among producers, brokers, and consumers to minimize data modification.
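The benefit of grouping messages into a set and compressing the set as a whole can be seen in a small experiment. This is an illustration only: JSON and zlib are stand-ins chosen for the example and are not Kafka's actual binary message format or compression codecs. The function names are hypothetical.

```python
import json
import zlib


def encode_message_set(messages):
    """Group messages into one set and compress the whole set at once."""
    raw = json.dumps(messages).encode("utf-8")
    return zlib.compress(raw)


def decode_message_set(blob):
    """Reverse encode_message_set and return the original list of messages."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# 200 similar log lines, as a log aggregation workload might produce.
messages = [f"GET /index.html 200 user={i % 5}" for i in range(200)]

blob = encode_message_set(messages)
raw_size = sum(len(m.encode("utf-8")) for m in messages)

print("uncompressed bytes:", raw_size)
print("compressed set bytes:", len(blob))
print(decode_message_set(blob) == messages)  # True
```

Compressing a whole set works well for log-like data because similar messages share many repeated byte sequences.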

Summary

Here is a quick recap of what we have learned in this lesson:

  • Kafka is a high-performance, real-time messaging system.

  • Kafka can be used as an external commit log for distributed systems.

  • Kafka data model consists of messages and topics.

  • Kafka architecture consists of brokers that take messages from the producers and add them to a partition of a topic.

  • Kafka architecture supports two types of messaging systems: the publish-subscribe system and the queue system.

  • Brokers are the Kafka processes that process the messages in Kafka.

Conclusion

This concludes ‘Introduction to Kafka.’ The next lesson is ‘Installation and Configuration,’ which has already been covered as a part of the Apache Storm tutorial.
