Chapter 4. Trident Introduction

In the previous chapters, we covered the architecture of Storm, its topology, bolts, spouts, tuples, and so on. In this chapter, we cover Trident, a high-level abstraction over Storm.

We will cover the following topics in this chapter:

  • Introducing Trident
  • Understanding Trident's data model
  • Writing Trident functions, filters, and projections
  • Trident repartitioning operations
  • Trident aggregators
  • When to use Trident

Trident introduction


Trident is a high-level abstraction built on top of Storm. Trident supports stateful stream processing, while pure Storm is a stateless processing framework. The main advantage of using Trident is that it guarantees that every message entering the topology is processed exactly once, which would be difficult to achieve with vanilla Storm. The concept of Trident is similar to that of high-level batch processing tools, such as Cascading and Pig, developed on top of Hadoop. To achieve exactly-once processing, Trident processes the input stream in small batches. We will cover this in more detail in the Trident state section of Chapter 5, Trident Topology and Uses.

In the first three chapters, we learned that, in a Storm topology, the spout is the source of tuples, a tuple is the unit of data that a Storm application processes, and the bolt is the processing powerhouse where we write the transformation logic. In a Trident topology, however, the bolt is replaced with higher-level semantics...

Understanding Trident's data model


The Trident tuple is the data model of a Trident topology. The Trident tuple is the basic unit of data that can be processed by a Trident topology. Each tuple consists of a predefined list of fields. The value of each field can be a byte, char, integer, long, float, double, Boolean, or byte array. During the construction of a topology, operations are performed on a tuple, which will either add new fields to the tuple or replace the tuple with a new set of fields.

Each field in a tuple can be accessed by name, using getValueByField(String), or by its positional index, using getValue(int). The Trident tuple also provides convenience methods, such as getIntegerByField(String), which save you from typecasting the objects.
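
As a minimal sketch of these accessors (the name and age fields and the AdultFilter class are hypothetical, and the import paths follow the storm.trident.* layout used in this chapter; in Storm 1.x they move under org.apache.storm.trident), a simple filter might read its input tuple like this:

import storm.trident.operation.BaseFilter;
import storm.trident.tuple.TridentTuple;

// Keeps only tuples whose "age" field is at least 18, demonstrating
// access by field name, by position, and via a typed convenience method.
public class AdultFilter extends BaseFilter {
    @Override
    public boolean isKeep(TridentTuple tuple) {
        String name = (String) tuple.getValueByField("name"); // by name; returns Object, so cast
        Object byIndex = tuple.getValue(0);                   // by 0-based position: same field as above
        int age = tuple.getIntegerByField("age");             // typed accessor; no cast needed
        return name != null && byIndex != null && age >= 18;
    }
}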

Writing Trident functions, filters, and projections


Trident functions, filters, and projections are used to modify or filter the input tuples based on certain criteria. This section defines each of them and shows how to write them.

Trident function

Trident functions contain the logic to modify the original tuples. A Trident function receives a set of tuple fields as input and emits one or more tuples as output. The fields of the output tuples are merged with the fields of the input tuple to form the complete tuple, which is passed to the next operation in the topology. If a Trident function emits no tuples for an input tuple, that tuple is removed from the stream.

We can write a custom Trident function by extending the storm.trident.operation.BaseFunction class and implementing the execute(TridentTuple tuple, TridentCollector collector) method.

Let's write the sample Trident...
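
As an illustrative sketch, not the chapter's own sample (the sentence and word field names and the SplitFunction class are hypothetical; the import paths again follow the pre-1.0 package layout), a function that splits a sentence into words could look like this:

import backtype.storm.tuple.Values;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;

// Emits one "word" tuple for every word in the input "sentence" field.
public class SplitFunction extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        String sentence = tuple.getStringByField("sentence");
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word)); // one output tuple per word
        }
        // Emitting nothing here would drop the input tuple from the stream.
    }
}

Attaching it to a stream appends the emitted word field to the input tuple's fields:

mystream.each(new Fields("sentence"), new SplitFunction(), new Fields("word"))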

Trident repartitioning operations


Repartitioning operations allow a user to redistribute tuples across multiple tasks. A repartitioning operation doesn't make any changes to the content of the tuples, and repartitioning is the only case in which tuples pass over the network. Here are the different types of repartitioning operation.

Utilizing shuffle operation

This repartitioning operation partitions the tuples in a uniform, random way across multiple tasks. This repartitioning operation is generally used when we want to distribute the processing load uniformly across the tasks. The following diagram shows how the input tuples are repartitioned using the shuffle operation:

Here is a piece of code that shows how we can use the shuffle operation:

mystream.shuffle().each(new Fields("a","b"), new myFilter()).parallelismHint(2) 

Utilizing partitionBy operation

This repartitioning operation enables you to partition the stream on the basis of the fields in the tuples. For example, if...
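
As a hedged sketch (the username field and MyFilter class are hypothetical), partitioning a stream on a field so that all tuples carrying the same value are routed to the same task might look like this:

mystream.partitionBy(new Fields("username")).each(new Fields("username"), new MyFilter()).parallelismHint(2)

Because partitionBy hashes the chosen fields to pick a target partition, tuples with equal username values always end up in the same partition.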

Trident aggregator


Trident aggregators are used to perform aggregation operations on an input batch, partition, or stream. For example, if we want to count the number of tuples present in each batch, we can use the Count aggregator. The output of the aggregator completely replaces the input tuples. There are three types of aggregator available in Trident:

  • partitionAggregate
  • aggregate
  • persistentAggregate

Let's understand each type of aggregator in detail.

partitionAggregate

As the name suggests, partitionAggregate works on each partition rather than on the whole batch. The output of partitionAggregate completely replaces the input tuples and consists of a single-field tuple. Here is a piece of code that shows how we can use partitionAggregate:

mystream.partitionAggregate(new Fields("x"), new Count(), new Fields("count"))

For example, we get an input stream containing the fields x and...
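
For reference, the built-in Count aggregator used above can be written by hand as a Trident CombinerAggregator; this sketch (the MyCount name is hypothetical) shows the three methods the interface requires:

import storm.trident.operation.CombinerAggregator;
import storm.trident.tuple.TridentTuple;

// A hand-rolled equivalent of the built-in Count aggregator: each input
// tuple contributes 1, and partial counts are merged by addition.
public class MyCount implements CombinerAggregator<Long> {
    @Override
    public Long init(TridentTuple tuple) {
        return 1L; // every tuple counts as one
    }

    @Override
    public Long combine(Long val1, Long val2) {
        return val1 + val2; // merge two partial counts
    }

    @Override
    public Long zero() {
        return 0L; // identity value, used for empty partitions
    }
}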

Utilizing the groupBy operation


The groupBy operation doesn't involve any repartitioning by itself. It converts the input stream into a grouped stream, and its main function is to modify the behavior of the subsequent aggregate function. The following diagram shows how the groupBy operation groups the tuples of a single partition:

The behavior of groupBy depends on where it is used. The following behaviors are possible:

  • If the groupBy operation is used before a partitionAggregate, the partitionAggregate runs its aggregate on each group created within the partition.
  • If the groupBy operation is used before an aggregate, the tuples of the batch are first repartitioned into a single partition, groupBy then forms groups within that partition, and finally the aggregate operation runs on each group (see the sketch after this list).
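
As a short sketch of the second case (the x field is hypothetical), grouping a stream by a field and counting the tuples in each group of every batch could be written as:

mystream.groupBy(new Fields("x")).aggregate(new Fields("x"), new Count(), new Fields("count"))

Here, groupBy repartitions each batch so that equal values of x end up together, and Count then runs once per group, emitting a count field for every group.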

When to use Trident


Trident was designed to make exactly-once processing easy to achieve, something that would be difficult with vanilla Storm, so Trident is the natural choice when we need exactly-once semantics.

However, Trident is not a fit for every use case, especially high-performance ones, because it adds complexity on top of Storm by processing the stream in small batches and managing state.

Summary


In this chapter, we concentrated on Trident, a high-level abstraction over Storm, and learned about Trident filters, functions, aggregators, and repartitioning operations.

In the next chapter, we will cover non-transactional topologies, Trident topologies, and Trident topologies that use distributed RPC.
