Apache Storm Tutorial - Introduction

Welcome to the first chapter of the Apache Storm tutorial, which is part of the Apache Storm Certification Training. This introductory chapter provides an overview of Storm, its data model, its architecture, and its components.

Let us explore the objectives of this lesson in the next section.

Objectives

After completing this lesson, you will be able to:

  • Describe the concept of Storm

  • Explain streaming

  • Describe the features and use cases for Storm

  • Discuss the Storm data model

  • Describe Storm architecture and its components

  • Explain the different types of topologies.

We will begin with Apache Storm in the next section.

Apache Storm

Following are the features of Apache Storm.

  • Apache Storm is a real-time stream processing system.

  • It is open source and a part of the Apache project ecosystem.

  • It helps to process big data.

  • It is a fast and reliable processing system.

  • It can ingest high volume and high-velocity data.

  • It is highly parallelizable, scalable, and fault-tolerant.

In the next section of the Apache Storm tutorial, we will discuss the uses of Storm.


Uses of Storm

Storm provides a computation system that can be used for real-time analytics, machine learning, and unbounded stream processing. It can consume continuously produced messages and output results to multiple systems.

In the next section of the Apache Storm tutorial, let us understand what a stream is.

What is a Stream

In computing, a stream represents a continuous sequence of bytes of data. It is produced by one program and consumed by another in 'First In First Out' (FIFO) order: if one program produces 1, 2, 3, 4, 5, the consuming program reads them in exactly that order. A stream can be bounded or unbounded.

Bounded means that the data is limited. Unbounded means that there is no limit: the producer keeps producing data as long as it runs, and the consumer keeps consuming it. A Linux pipe is an example of a stream.

For example, take the Linux command cat logfile | wc -l.

In this command, cat logfile produces a stream that is consumed by wc -l, which displays the number of lines in the file.
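To make the producer/consumer idea concrete, here is a minimal Java sketch (independent of Storm) that plays the role of wc -l: it consumes a stream of lines from standard input, for example piped from cat logfile, and counts them. The class name LineCounter is illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Consumes a FIFO stream of lines from standard input, e.g.:
//   cat logfile | java LineCounter
public class LineCounter {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        long count = 0;
        while (reader.readLine() != null) { // blocks until the producer writes more or closes the pipe
            count++;
        }
        System.out.println(count);
    }
}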

Next, we will explain industry use cases for Storm.

Industry Use Cases for Storm

Many industries can use Storm for real-time big data processing such as:

  1. Credit card companies can use it for fraud detection on card swipes.

  2. Investment banks can use it for trade pattern analysis in real time.

  3. Retail stores can use it for dynamic pricing.

  4. Transportation providers can use it for route suggestion based on traffic data.

  5. Healthcare providers can use it for the monitoring of ICU sensors.

  6. Telecom organizations can use it for processing switch data.

In the next section of the Apache Storm tutorial, let us look at the Storm data model.

Storm Data Model

The Storm data model consists of tuples and streams.

Tuple

A tuple is an ordered list of named values, similar to a database row. Each field in the tuple has a data type, which can be dynamic. A field can be of any data type, such as string, integer, float, double, boolean, or byte array. User-defined data types are also allowed in tuples.

For example, for stock market data with the schema (ticker, year, value, status), some tuples could be (ABC, 2011, 20, GOOD), (ABC, 2012, 30, GOOD), (ABC, 2012, 32, BAD), and (XYZ, 2011, 25, GOOD).
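As a hedged illustration of how such a schema maps to Storm's Java API (package names assume Storm 1.x, org.apache.storm; earlier releases used backtype.storm), a component declares its field names with Fields and emits matching Values in the same order. The class name and values below are illustrative.

import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class StockSchemaExample {
    // The schema (field names) a spout or bolt would declare for its tuples.
    static final Fields STOCK_FIELDS = new Fields("ticker", "year", "value", "status");

    public static void main(String[] args) {
        // One tuple's values, given in the same order as the declared fields.
        Values tuple = new Values("ABC", 2011, 20, "GOOD");
        System.out.println(STOCK_FIELDS.toList() + " -> " + tuple);
    }
}

In a real spout or bolt, the Fields object is passed to declareOutputFields and the Values object to the collector's emit method, as shown later in the sample program.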

Stream

A stream in Storm is an unbounded sequence of tuples.

For example, if the above tuples are stored in a file named stocks.txt, then the command cat stocks.txt produces a stream of tuples. If a process is continuously appending data to stocks.txt, then it becomes an unbounded stream.

In the next section of the Apache Storm tutorial, let us look at the Storm architecture.

Storm Architecture

Storm has a master-slave architecture. A master service called Nimbus runs on a single node called the master node. Slave services called supervisors run on each worker node. Supervisors start one or more worker processes, called workers, that run in parallel to process the input.

Worker processes store the output to a file system or database. Storm uses Zookeeper for distributed process coordination.

The diagram shows the Storm architecture with one master node and five worker nodes. The Nimbus process is running on the master node. There is one supervisor process running on each worker node. There are multiple worker processes running on each worker node. The workers get the input from the file system or database and store the output also to a file system or database.

Storm Architecture

In the next section of the Apache Storm tutorial, we will discuss the Storm processes.

Storm Processes

A Zookeeper cluster is used for coordinating the master, supervisor and worker processes. Moving on, let us explore the Nimbus process.

Nimbus Process

Nimbus is the master daemon of the Storm cluster.

  • Runs on only one node called the master node

  • Assigns and distributes the tasks to the worker nodes

  • Monitors the tasks

  • Reassigns tasks on node failure.

Supervisor Process

The supervisor is the worker daemon of the Storm cluster.

  • Runs on each worker node of the cluster

  • Runs each task as a separate process called worker process

  • Communicates with the Nimbus daemon using Zookeeper

  • Number of worker processes for each task can be configured

Worker Process

The worker process does the actual work in Storm.

  • Runs on any worker node of the cluster

  • Started and monitored by the supervisor process

  • Runs either spout or bolt tasks

  • Number of worker processes for each task can be configured

Sample Program

A log processing program takes each line from a log file, filters the messages based on the log type, and outputs the log type.

Input: A log file containing error, warning, and informational messages. This is a growing file that continuously receives new lines of log messages.

Output: The type of each message (ERROR, WARNING, or INFO). Let us continue with the sample program.

The program given below contains a single spout and a single bolt.

The spout does the following:

Opens the file, reads each line and outputs the entire line as a tuple.

The bolt does the following: reads each tuple from the spout, checks whether it contains the string ERROR, WARNING, or INFO, and outputs only that log type.

LineSpout {
    foreach line = readLine(logfile) {
        emit(line)
    }
}

LogTypeBolt(tuple) {
    if (tuple contains "ERROR") emit("ERROR");
    if (tuple contains "WARNING") emit("WARNING");
    if (tuple contains "INFO") emit("INFO");
}

The spout is named LineSpout. It has a loop to read each line of input and outputs the entire line. The emit function is used to output the line as a stream of tuples. The bolt is named LogTypeBolt. It takes the tuple as input. If the line contains the string ERROR, then it outputs the string ERROR. If the line contains the string WARNING, then it outputs the string WARNING.

Similarly, if the line contains the string INFO, then it outputs the string INFO.
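For readers who want to see how this pseudocode might translate into Storm's actual Java API, below is a hedged sketch using BaseRichSpout and BaseRichBolt (assuming the Storm 1.x API, org.apache.storm packages). The log file path and field names are illustrative, error handling is minimal, and the two classes are shown together only for brevity; each would normally live in its own file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Spout: reads the log file line by line and emits each line as a one-field tuple.
public class LineSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private BufferedReader reader;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            this.reader = new BufferedReader(new FileReader("/var/log/app.log")); // illustrative path
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                collector.emit(new Values(line)); // one tuple per log line
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}

// Bolt: checks each input tuple and emits only the log type it finds.
public class LogTypeBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple tuple) {
        String line = tuple.getStringByField("line");
        if (line.contains("ERROR")) collector.emit(new Values("ERROR"));
        else if (line.contains("WARNING")) collector.emit(new Values("WARNING"));
        else if (line.contains("INFO")) collector.emit(new Values("INFO"));
        collector.ack(tuple); // acknowledge that the input tuple has been processed
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("logType"));
    }
}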

Next, let us explore the Storm Components.

Storm Components

Storm provides two types of components that process the input stream: spouts and bolts. Spouts process external data to produce streams of tuples. Spouts produce tuples and send them to bolts. Bolts process the tuples from input streams and produce some output tuples. Input streams to a bolt may come from spouts or from other bolts.

The diagram shows a Storm cluster consisting of one spout and two bolts. The spout gets the data from an external data source and produces a stream of tuples. The first bolt takes the output tuples from spout and processes them to produce another set of tuples. The second bolt takes the output tuples from bolt 1 and stores them into an output stream.

Apache Storm Components

Now, we will understand the functioning of Storm spout.

Storm Spout

A spout is a component of Storm that ingests data and creates the stream of tuples for processing by the bolts. A spout creates a stream of tuples from the input, and it automatically serializes the output data. It can get data from queuing systems such as Kafka and RabbitMQ, or from feeds such as Twitter. Spout implementations are available for popular message producers such as Kafka and Twitter. A single spout can produce multiple output streams, and each output stream can be consumed by one or more bolts.

The diagram shows a Twitter spout that gets Twitter posts from a Twitter feed and converts them into a stream of tuples. It also shows a Kafka spout that gets messages from a Kafka server and produces a stream of message tuples.

Apache Storm Spout
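To illustrate the point that a single spout can produce multiple output streams, here is a hedged sketch (Storm 1.x API assumed) of a spout that declares two named streams; the stream names, field names, and sample values are illustrative.

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// A spout that declares two named output streams; downstream bolts can
// subscribe to either stream independently.
public class TwoStreamSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        // Emit to a specific stream by naming it in the emit call (sample values only).
        collector.emit("tweets", new Values("someUser", "some tweet text"));
        collector.emit("deletes", new Values(12345L));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("tweets", new Fields("user", "text"));
        declarer.declareStream("deletes", new Fields("tweetId"));
    }
}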

Next, let us look at how Storm bolt functions.

Storm Bolt

Storm Bolt processes the tuples from spouts and outputs to external systems or other bolts. The processing logic in a bolt can include filters, joins, and aggregation.

Filter data examples include processing only records with STATUS = GOOD, processing only records with volume > 100, etc.

Aggregation examples include calculating the sum of sale amounts or the maximum stock value. A bolt can process any number of input streams. Input data is deserialized, and output data is serialized. Streams are processed as tuples; bolts run in parallel and can be distributed across machines in the Storm cluster.

The diagram shows four bolts running in parallel. Bolt 1 produces the output that is received by both bolt 3 and bolt 4. Bolt 2 produces the output that is received by both bolt 3 and bolt 4. Bolt 3 stores the output to a Cassandra database whereas bolt 4 stores the output to Hadoop storage.

Apache Storm Bolt
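As a hedged sketch of the filter and aggregation ideas (Storm 1.x API assumed), the bolt below keeps only tuples with GOOD status and tracks a running maximum of the value field from the stock schema used earlier; the class and field names are illustrative.

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Filters out records whose status is not GOOD and emits a new tuple
// whenever a new maximum stock value is seen.
public class MaxValueBolt extends BaseRichBolt {
    private OutputCollector collector;
    private double max = Double.NEGATIVE_INFINITY;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple tuple) {
        // Filter: only consider records with GOOD status.
        if ("GOOD".equals(tuple.getStringByField("status"))) {
            double value = ((Number) tuple.getValueByField("value")).doubleValue();
            if (value > max) {
                max = value;
                collector.emit(new Values(tuple.getStringByField("ticker"), max));
            }
        }
        collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("ticker", "maxValue"));
    }
}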

Next, we will explore the functioning of Storm topology.

Storm Topology

A group of spouts and bolts running in a Storm cluster forms the Storm topology. Spouts and bolts run in parallel, and there can be multiple spouts and bolts.

The topology determines how the output of a spout is connected to the inputs of bolts and how the output of a bolt is connected to the inputs of other bolts.

The diagram shows a Storm topology with one spout and five bolts. The output of spout 1 is processed by three bolts: bolt 1, bolt 2, and bolt 3. Bolt 4 gets its input from bolt 2 and bolt 3, and bolt 5 gets its input from bolt 1. This represents the Storm topology. The input to spout 1 comes from an external data source. The output from bolt 4 goes to output 1, and the output from bolt 5 goes to output 2.

Storm topology

Let us understand this better with the help of an example in the next section of the Apache Storm tutorial.

Storm Example

Let us illustrate Storm with an example.

Problem: Stock market data is continuously sent by an external system and must be processed so that data with GOOD status is inserted into a database, whereas data with BAD status is written to an error log.

Storm Solution: The topology has one spout and two bolts. The spout gets the data from the external system and converts it into a stream of tuples. These tuples are processed by two bolts: bolt 1 processes those with status GOOD and saves them to a Cassandra database, while bolt 2 processes those with status BAD and saves them to an error log file.

The diagram shows the Storm topology for the above solution. There is one spout that gets the input from an external data source. There is bolt 1 that processes the tuples from the spout and stores the tuples with GOOD status to Cassandra. There is bolt 2 that processes the tuples from the spout and stores the tuples with BAD status to an error log.

Apache Storm Example
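A hedged sketch of how this topology might be wired up in Java (Storm 1.x API assumed); the component IDs and the StockSpout, GoodStatusBolt, and BadStatusBolt classes are illustrative placeholders, not classes shipped with Storm.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class StockTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // One spout converts the external stock feed into a stream of tuples.
        builder.setSpout("stock-spout", new StockSpout());

        // Bolt 1 keeps GOOD records and writes them to Cassandra.
        builder.setBolt("good-bolt", new GoodStatusBolt()).shuffleGrouping("stock-spout");

        // Bolt 2 keeps BAD records and writes them to an error log.
        builder.setBolt("bad-bolt", new BadStatusBolt()).shuffleGrouping("stock-spout");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("stock-topology", conf, builder.createTopology());
    }
}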

Next, let us understand serialization and deserialization.

Serialization-Deserialization

Serialization is the process of converting data structures or objects into a platform-independent stream of bytes. It is used to store data on disk or in memory and to transmit data over the network. The purpose of serialization is to make data readable by other programs.

For example, the object (name: char(10), id: integer) may have the data {'John', 101}. This can be serialized as "John\0" followed by 0x65: the string John, then a null character, and then the hexadecimal representation of 101.

To read the data, programs have to reverse the serialization process; this is called deserialization. Serialization-deserialization is also known as SerDe.
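To make the round trip concrete, here is a small Java sketch using the standard DataOutputStream and DataInputStream classes; its byte layout (a length-prefixed UTF-8 string followed by a 4-byte integer) differs from the char(10) layout in the example above and is only meant to show serialization and deserialization of {'John', 101}.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class SerDeExample {
    public static void main(String[] args) throws Exception {
        // Serialize: turn {name="John", id=101} into a platform-independent byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF("John");   // length-prefixed UTF-8 string
        out.writeInt(101);      // 4-byte big-endian integer (0x00000065)
        out.close();

        // Deserialize: another program reverses the process to get the values back.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        String name = in.readUTF();
        int id = in.readInt();
        System.out.println(name + ", " + id);  // prints: John, 101
    }
}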

Now we will go through the steps involved in submitting a job to Storm.

Submitting a Job to Storm

A job is submitted to the Nimbus process.

To submit a job, you need to:

  1. Create spout and bolt functions

  2. Build topology using spout and bolts

  3. Configure parallelism

  4. Submit topology to Nimbus

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new LineSpout());
builder.setBolt("bolt", new LogTypeBolt()).shuffleGrouping("spout");

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

StormSubmitter.submitTopology("log-topology", conf, builder.createTopology()); // "log-topology" is an illustrative topology name

The code above is a Java program fragment for submitting a job to Storm. It first creates a Storm topology builder object. Next, it sets the spout and bolt for the topology with the setSpout and setBolt methods. It also sets the connection from the output of the spout to the bolt using the shuffleGrouping method.

shuffleGrouping is a type of input grouping that we will discuss in a subsequent lesson. After setting the spout and bolt, the program sets the number of workers for the job to 2, which is the number of worker processes that will run in parallel. Finally, the submitTopology method submits the topology to the Nimbus process.
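As a side note, for development and testing Storm also offers a LocalCluster that runs the whole topology inside a single JVM instead of submitting it to Nimbus. Below is a hedged sketch reusing the LineSpout and LogTypeBolt classes from the sample program; the topology name is illustrative.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class LocalRun {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new LineSpout());
        builder.setBolt("bolt", new LogTypeBolt()).shuffleGrouping("spout");

        Config conf = new Config();
        // LocalCluster runs Nimbus, a supervisor, and workers inside this JVM.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("log-topology-local", conf, builder.createTopology());
    }
}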

Now, finally, we will look at the types of topologies.

Types of Topologies

Storm supports different types of topologies:

Simple topology: It consists of simple spouts and bolts. The output of spouts is sent to bolts.

Transactional topology: It guarantees that each tuple is processed exactly once. This is an abstraction over the simple topology above: the Storm libraries ensure that a tuple is processed exactly once even if the system goes down in the middle of processing. This topology has been deprecated and replaced by the Trident topology in current versions of Storm.

Distributed RPC: Distributed RPC is also called DRPC. This topology provides libraries to parallelize any computation using Storm.

Trident topology: It is a higher-level abstraction over Storm; that is, it has libraries that run on top of the Storm libraries to support additional functionality. Trident provides spouts and bolts with higher-level functionality such as transaction processing and batch processing.

Summary

Let us summarize the topics covered in this lesson.

  • Storm is used for processing streams of data.

  • Storm data model consists of tuples and streams.

  • Storm consists of spouts and bolts.

  • Spouts create tuples that are processed by bolts.

  • Spouts and bolts together form the Storm topology.

  • Storm follows a master-slave architecture.

  • The master process is called Nimbus, and the slave processes are called supervisors.

  • Data processing is done by workers that are monitored by supervisors.

Benefits of Apache Storm to Professionals

Following are the benefits of Apache Storm to professionals.

  • It equips you with the skill set needed to become an Apache Storm specialist.

  • It gives you hands-on experience with Apache Storm's big data stream processing technology.

  • It gives you experience in real-time event processing with big data.

  • It enables you to take up the responsibilities of a Big Data Hadoop developer.

Apache Storm Tutorial Prerequisites

The prerequisites for the Apache Storm tutorial are:

  • Working knowledge of Java

  • Basic knowledge of Linux or Unix based systems

  • Basic knowledge of data processing

It is recommended to complete a Big Data Hadoop Developer Certification Training as a prerequisite, as it provides an excellent foundation for Apache Storm certification.

Let us explore the target audience of Apache Storm Tutorial in the next section.

Target Audience of Apache Storm

This Apache Storm tutorial is aimed at:

  • Analytics professionals

  • Research professionals

  • IT developers and testers

  • Project Managers

  • Data Scientists

  • Professionals and students aspiring to a real-time analytics career in Big Data Hadoop

In the next section, we will discuss the objectives of the Apache Storm tutorial.


Objectives

After completing this tutorial, you will be able to:

  • Enhance your knowledge of Apache Storm architecture and the basic concepts of Apache Storm.

  • Install and configure Apache Storm.

  • Have command over concepts like ingesting and processing real-time events with Storm.

  • Understand the basic principles of the Trident extension in Apache Storm.

  • Gain knowledge in the areas of grouping and data insertion in Apache Storm.

  • Understand the basic principles of Storm interfaces with Kafka, Cassandra, and Java.

Let us explore the lessons covered in Apache Storm Tutorial in the next section.

Lessons Covered in this Apache Storm Tutorial

There are seven lessons covered in this tutorial. Take a look at the lesson names that are listed below.

Lesson 1: Big Data Overview Tutorial

In this chapter, you’ll be able to:

  • Describe the concept of big data and its 3 Vs

  • List the different sizes of data

  • Describe some use cases for big data

  • Explain the concept of Apache Hadoop and real-time big data processing

  • Describe some tools for real-time processing of big data

Lesson 2: Introduction to Storm Tutorial

In this chapter, you’ll be able to:

  • Describe the concept of Storm

  • Explain streaming

  • Describe the features and use cases for Storm

  • Discuss the Storm data model

  • Describe Storm architecture and its components

  • Explain the different types of topologies.

Lesson 3: Apache Storm Installation and Configuration Tutorial

In this chapter, you’ll be able to:

  • Choose proper hardware for Storm installation

  • Install Storm on an Ubuntu system

  • Configure Storm

  • Run Storm on an Ubuntu system

  • Describe the steps to set up a multi-node Storm cluster.

Lesson 4: Apache Storm Advanced Concepts Tutorial

In this chapter, you’ll be able to:

  • Describe the types of spouts and the structure of spouts and bolts.

  • Explain eight different types of grouping in Storm.

  • Explain reliable processing in Storm.

  • Explain Storm topology life cycle.

  • Discuss the real-time event processing in Storm.

Lesson 5: Apache Storm Interfaces Tutorial

In this chapter, you’ll be able to:

  • Identify the different interfaces to Storm

  • Explain Java interface to Storm

  • Describe creating spouts, bolts and their connections

  • Explore Real-time data processing platform

  • Describe Kafka interface to Storm

  • Describe writing a Kafka spout

  • Explore Storm interface to Cassandra

Lesson 6: Apache Storm Trident Tutorial

In this chapter, you’ll be able to:

  • Explain what is Trident

  • Explain the Trident data model

  • Identify various operations in Trident

  • Explain stateful processing using Trident

  • Explain pipelining in Trident

  • Explain Trident advantages

Conclusion

With this, we come to the end of the overview of what this Apache Storm tutorial covers. In the next chapter, we will discuss the Big Data Overview Tutorial.
