You're reading from Mastering Apache Storm

Product type: Book
Published in: Aug 2017
Reading level: Expert
ISBN-13: 9781787125636
Edition: 1st Edition

Author (1): Ankit Jain

Ankit Jain holds a bachelor's degree in computer science and engineering. He has six years' experience in designing and architecting solutions for the big data domain and has been involved with several complex engagements. His technical strengths include Hadoop, Storm, S4, HBase, Hive, Sqoop, Flume, Elasticsearch, machine learning, Kafka, Spring, Java, and J2EE. He also shares his thoughts on his personal blog. You can follow him on Twitter at @mynameisanky. He spends most of his time reading books and playing with different technologies. When not at work, he spends time with his family and friends watching movies and playing games.

Chapter 5. Trident Topology and Uses

In the previous chapter, we covered an overview of Trident. In this chapter, we are going to cover the development of a Trident topology. Here are the important points we are going to cover in this chapter:

  • The Trident groupBy operation
  • Non-transactional topology
  • Trident hello world topology
  • Trident state
  • Distributed RPC
  • When to use Trident

Trident groupBy operation


The groupBy operation by itself doesn't involve any repartitioning; it converts the input stream into a grouped stream. Its main purpose is to modify the behavior of subsequent aggregate functions.

groupBy before partitionAggregate

If the groupBy operation is used before a partitionAggregate, then the partitionAggregate will run the aggregate on each group created within the partition.

groupBy before aggregate

If the groupBy operation is used before an aggregate, then the input tuples are first repartitioned, and the aggregate operation is then performed on each group.
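To make the effect concrete, here is a plain-Java sketch (not Trident code) of what a groupBy followed by a Count aggregate computes over a batch: tuples are grouped by a field, and the aggregate runs once per group. The class and method names here are made up for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupBySemantics {

    // Group the input values by key and count each group -- the result a
    // groupBy followed by a Count aggregate would produce for one batch.
    public static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum); // one running count per group
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByKey(List.of("India", "USA", "India")));
    }
}
```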

Non-transactional topology


In a non-transactional topology, a spout emits a batch of tuples and doesn't guarantee what's in each batch. Based on the processing mechanism, we can divide such pipelines into two categories:

  • At-most-once-processing: In this type of topology, failed tuples are not retried. Hence, the spout does not wait for an acknowledgment.
  • At-least-once-processing: Failed tuples are retried in the processing pipeline. Hence, this type of topology guarantees that every tuple that enters the processing pipeline must be processed at least once.

We can write a non-transactional spout by implementing the org.apache.storm.trident.spout.IBatchSpout interface.

This example shows how we can write a Trident spout:

public class FakeTweetSpout implements IBatchSpout { 

   private static final long serialVersionUID = 10L; 
   private int batchSize; 
   private HashMap<Long, List<List<Object>>> batchesMap = new HashMap<Long, List<List<Object>>>(); 
   public FakeTweetSpout...
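The class body is elided above. As a rough sketch of how the rest of such a spout could be filled in (only the method signatures come from the `org.apache.storm.trident.spout.IBatchSpout` interface; the sample data and field names are assumptions, not the book's exact code):

```java
public void open(Map conf, TopologyContext context) { }

public void emitBatch(long batchId, TridentCollector collector) {
    List<List<Object>> batch = this.batchesMap.get(batchId);
    if (batch == null) {
        batch = new ArrayList<List<Object>>();
        for (int i = 0; i < this.batchSize; i++) {
            // the tuple values are made up for illustration
            batch.add(new Values("a fake tweet", "India"));
        }
        // remember the batch so that a replay re-emits the same tuples
        this.batchesMap.put(batchId, batch);
    }
    for (List<Object> tuple : batch) {
        collector.emit(tuple);
    }
}

public void ack(long batchId) {
    // the batch was processed successfully; it can be forgotten
    this.batchesMap.remove(batchId);
}

public void close() { }

public Map<String, Object> getComponentConfiguration() {
    return null;
}

public Fields getOutputFields() {
    return new Fields("text", "country");
}
```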

Trident hello world topology


This section explains how we can write a Trident hello world topology. Perform the following steps to create the Trident hello world topology:

  1. Create a Maven project by using com.stormadvance as the groupId and storm_trident as the artifactId.
  2. Add the following dependencies and repositories to the pom.xml file:
         <dependencies> 
         <dependency> 
               <groupId>junit</groupId> 
               <artifactId>junit</artifactId> 
               <version>3.8.1</version> 
               <scope>test</scope> 
         </dependency> 
         <dependency> 
               <groupId>org.apache.storm</groupId> 
               <artifactId>storm-core</artifactId> 
               <version>1.0.2</version> 
               <scope>provided</scope> 
         </dependency> 
         </dependencies> 
  3. Create a TridentUtility class in a com.stormadvance...
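The remaining steps are elided above. As a rough sketch of how such a topology might be wired up (reusing the FakeTweetSpout from the previous section; the stream name, batch size, and the use of Count with MemoryMapState are assumptions, not the book's exact code):

```java
// A sketch, assuming FakeTweetSpout(batchSize) emits tuples with the
// fields ("text", "country"); all names here are illustrative.
TridentTopology topology = new TridentTopology();

topology.newStream("faketweetspout", new FakeTweetSpout(10))
        .groupBy(new Fields("country"))
        // count tuples per country and keep the running total as
        // in-memory Trident state
        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                new Fields("count"));

// run locally for a quick test
Config conf = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("trident-hello-world", conf, topology.build());
```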

Trident state


Trident provides an abstraction for reading from and writing results to stateful sources. We can maintain the state either internally to the topology (memory), or we can store it in external sources (Memcached or Cassandra).

Let's consider that we are maintaining the output of the preceding hello world Trident topology in a database. Every time a tuple is processed, the count for the country present in that tuple is incremented in the database. We can't achieve exactly-once processing by maintaining only a count in the database. The reason is that if any tuple fails during processing, the failed tuple is retried. This creates a problem while updating the state, because we are not sure whether the state for this tuple was updated previously or not. If the tuple failed before updating the state, then retrying the tuple will increase the count in the database and keep the state consistent. But if the tuple failed after updating the state, then retrying the same tuple will increase the count in the database again and leave the state inconsistent.
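One standard remedy, which is the idea behind Trident's transactional state, is to store the batch's transaction ID alongside the count, so a replayed batch can be detected and skipped. Here is a minimal plain-Java sketch of that idea, assuming for simplicity that each batch contributes at most one increment per key; the class and method names are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class TxIdState {

    // value stored per key: the count plus the txid of the
    // last batch that updated it
    static class ValueWithTxId {
        long count;
        long txId;
        ValueWithTxId(long count, long txId) {
            this.count = count;
            this.txId = txId;
        }
    }

    private final Map<String, ValueWithTxId> db = new HashMap<>();

    // Apply an increment for the given batch. If the stored txid already
    // equals this batch's txid, the batch is a replay that succeeded
    // before, so the increment is skipped -- keeping counts exactly-once.
    public void increment(String key, long txId) {
        ValueWithTxId v = db.get(key);
        if (v == null) {
            db.put(key, new ValueWithTxId(1, txId));
        } else if (v.txId != txId) {
            v.count += 1;
            v.txId = txId;
        }
        // else: duplicate replay of the same batch; do nothing
    }

    public long count(String key) {
        ValueWithTxId v = db.get(key);
        return v == null ? 0 : v.count;
    }

    public static void main(String[] args) {
        TxIdState state = new TxIdState();
        state.increment("India", 1L);
        state.increment("India", 1L); // replayed batch 1: skipped
        state.increment("India", 2L);
        System.out.println(state.count("India"));
    }
}
```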

Distributed RPC


Distributed RPC is used to query and retrieve results from a Trident topology on the fly. Storm has a built-in distributed RPC server. The distributed RPC server receives the RPC request from the client and passes it to the Storm topology. The topology processes the request and sends the result back to the distributed RPC server, which forwards it to the client.

We can configure the distributed RPC server by using the following properties in the storm.yaml file:

drpc.servers: 
     - "nimbus-node" 

Here, nimbus-node is the hostname or IP of the machine that runs the distributed RPC server.

Now, run this command on the nimbus-node machine to start the distributed RPC server:

> bin/storm drpc

Let's assume we are storing the count aggregation of hello world Trident topology in a database and want to retrieve the count for a given country on the fly. We would need to use the distributed RPC feature to achieve this. This example shows how we can incorporate the distributed RPC in the...
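The example itself is elided above. As a hedged sketch of what such wiring might look like (the function name country-count, the spout, the state, and the port are assumptions, not the book's exact code):

```java
// Server side: expose the country counts through a DRPC stream.
TridentTopology topology = new TridentTopology();

TridentState countState = topology
        .newStream("faketweetspout", new FakeTweetSpout(10))
        .groupBy(new Fields("country"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                new Fields("count"));

// "country-count" is the DRPC function name clients will invoke;
// the client's input arrives in the "args" field.
topology.newDRPCStream("country-count")
        .groupBy(new Fields("args"))
        .stateQuery(countState, new Fields("args"), new MapGet(),
                new Fields("count"));

// Client side: query the count for a given country on the fly.
DRPCClient client = new DRPCClient(new Config(), "nimbus-node", 3772);
System.out.println(client.execute("country-count", "India"));
```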

When to use Trident


It is very easy to achieve exactly-once processing using a Trident topology, and Trident was designed for exactly that purpose. On the other hand, it is difficult to achieve exactly-once processing with vanilla Storm. Hence, Trident is useful for use cases that require exactly-once processing.

Trident is not a fit for all use cases, especially high-performance ones, because Trident adds complexity on top of Storm and manages state.

Summary


In this chapter, we mainly concentrated on the Trident sample topology, the Trident groupBy operation, and the non-transactional topology. We also covered how we can query a Trident topology on the fly using distributed RPC.

In the next chapter, we will cover the different types of Storm scheduler.
