You're reading from Learning Apache Apex

Product type: Book
Published in: Nov 2017
Reading level: Intermediate
ISBN-13: 9781788296403
Edition: 1st Edition
Authors (5):

Thomas Weise

Thomas Weise is the Apache Apex PMC Chair and a cofounder of Atrato. Earlier, he worked at a number of other technology companies in the San Francisco Bay Area, including DataTorrent, where he was a cofounder of the Apex project. Thomas is also a committer to Apache Beam and has contributed to several other projects in the ecosystem. He has been working on distributed systems for 20 years and has been a speaker at international big data conferences. Thomas received the degree of Diplom-Informatiker (MSc in computer science) from TU Dresden, Germany. He can be reached on Twitter at @thweise.

Ananth Gundabattula

Ananth is a senior application architect on the Decisioning and Advanced Analytics architecture team at the Commonwealth Bank of Australia (CBA). He holds a PhD in computer security and is interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents granted by the USPTO and has one application pending. Prior to joining CBA, he was an architect at Threatmetrix and a member of the core team that scaled the Threatmetrix architecture to 100 million transactions per day at very low latencies using Cassandra, ZooKeeper, and Kafka. He also migrated the Threatmetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before Threatmetrix, he worked at IBM Software Labs and IBM CIO Labs, enabling some of the first IBM CIO projects to onboard the HBase, Hadoop, and Mahout stack. Ananth is a committer for Apache Apex and is currently working on the next-generation architectures for CBA's fraud platform and Advanced Analytics Omnia platform.

Munagala V. Ramanath

Dr. Munagala V. Ramanath received his PhD in Computer Science from the University of Wisconsin, USA, and an MSc in Mathematics from Carleton University, Ottawa, Canada. He then taught computer science as an Assistant/Associate Professor at the University of Western Ontario in Canada for a few years before transitioning to the corporate sphere. Since then, he has worked as a senior software engineer at a number of technology companies in California, including SeeBeyond, EMC, Sun Microsystems, DataTorrent, and Cloudera. He has published papers in peer-reviewed journals in several areas, including code optimization, graph theory, and image processing.

David Yan

David Yan is based in Silicon Valley, California. He is a senior software engineer at Google. Prior to Google, he worked at DataTorrent, Yahoo!, and the Jet Propulsion Laboratory. David holds a Master of Science in Computer Science from Stanford University and a Bachelor of Science in Electrical Engineering and Computer Science from the University of California, Berkeley.

Kenneth Knowles

Kenneth Knowles is a founding PMC member of Apache Beam. Kenn has been working on Google Cloud Dataflow—Google's Beam backend—since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in

Chapter 4. Scalability, Low Latency, and Performance

In traditional (non-distributed) applications, performance optimization is a substantial and ongoing effort; there are often individuals or even small teams dedicated to it, and a vast array of techniques is employed to achieve the desired effect. These techniques include the use of better algorithms, data structures with better performance characteristics, threads to parallelize computation, caches, hoisting loop-invariant computations out of loops, and appropriate compiler options.

For distributed applications in general and Apex applications in particular, in addition to all of these techniques, a whole slew of new methods are applicable and will be covered in this chapter. Specifically, we'll cover the following topics:

  • Partitioning and how it works
  • Elasticity, operator state, dynamic scaling, and resource allocation
  • Partitioning toolkit
  • Performance optimizations (chaining of operators, stream locality, operator...

Partitioning and how it works


As the volume of incoming data increases, it can overwhelm the processing capabilities of the application, resulting in increased latencies and reduced throughput. It is rarely the case that the resources of the entire application are inadequate; instead, a careful analysis often reveals one or more bottlenecks, and addressing these bottlenecks will often resolve the problem. If the input data rate continues to increase, it may again cross the threshold of what the application can process, at which point the analysis must be repeated to find and resolve the new bottlenecks.

The modus operandi for addressing a bottleneck can take several forms, depending on the nature of the application, its configuration, the cluster environment, and other factors, for example:

  • Use a faster algorithm if available and compute resources are the constraint
  • Use more space-efficient algorithms and increase the memory allocation if excessive garbage collection (GC) calls are observed
  • Use additional cluster nodes...
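Partitioning relies on a deterministic rule for routing each tuple to a partition: by default, Apex hashes the tuple and applies a bitmask, matching the result against the set of keys owned by each partition. The following self-contained sketch illustrates that routing rule; the class and method names here are illustrative only, not the actual Apex API:

```java
import java.util.List;

public class PartitionRouting
{
  // Illustrative stand-in for Apex's partition keys: a shared mask plus,
  // per partition, the key value(s) that select it. Here each partition
  // owns exactly one key, held at its index in the list.
  public static int select(Object tuple, int mask, List<Integer> keysPerPartition)
  {
    int key = tuple.hashCode() & mask;     // low bits of the hash choose the key
    return keysPerPartition.indexOf(key);  // index of the partition owning that key
  }
}
```

With two partitions and mask 0x1, tuples with even hash codes go to the partition owning key 0 and tuples with odd hash codes to the one owning key 1, so the load splits evenly for well-distributed hashes.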

Elasticity


As described in the preceding section, the desired number of partitions for each operator that is likely to be a bottleneck can be specified as part of the application configuration, and the platform will ensure that those partitions are created at application start time. However, this is not possible when data volumes fluctuate unpredictably, since we cannot forecast the number of required partitions.

The platform has the required elasticity to support such scenarios via dynamic scaling: the application writer can implement the Partitioner interface along with the related StatsListener interface, either directly in the operator or in a separate object that is set on the operator as an attribute. These interfaces allow the operator to periodically examine current metrics such as throughput, latency, or even custom metrics, and, based on those values, create new partitions or remove existing partitions, or both. All the resource allocation and deallocation is...
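The scaling decision itself is ordinary arithmetic over the observed metrics. Below is a minimal sketch of the kind of policy a StatsListener implementation might encode; the class name and thresholds are hypothetical, chosen only for illustration:

```java
public class ScalingPolicy
{
  // Return the desired partition count given the current count and the
  // observed per-partition throughput, aiming at a target rate per partition.
  public static int desiredPartitions(int current, long tuplesPerSecPerPartition,
                                      long targetPerPartition)
  {
    if (tuplesPerSecPerPartition > 2 * targetPerPartition) {
      return current * 2;   // overloaded: double the partitions
    }
    if (current > 1 && tuplesPerSecPerPartition < targetPerPartition / 2) {
      return current / 2;   // underutilized: halve them
    }
    return current;         // within bounds: leave the topology alone
  }
}
```

The hysteresis band (scale up only above 2x the target, down only below half of it) prevents the listener from oscillating between partition counts on every stats window.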

Partitioning toolkit


Partitioning is appropriate, as described in the previous section, when an operator is likely to become a bottleneck. An operator is a bottleneck if it is unable to process the input stream at the required speed, causing tuples to back up in upstream buffers. Often, this also means lowered throughput and increased latencies between the time an input tuple enters the input port of the operator and the corresponding computed output tuple(s) leave the output port(s). If such an increase in latency or reduction in throughput is transient—lasts no more than a few seconds—then partitioning may not be needed (and may even be detrimental since it causes interruption of processing while existing operators are brought down and new operators are started) since OS and platform buffering will allow the operator to catch up once the spike in input has passed.

However, if the input data rates are likely to remain high for an extended period, partitioning may be needed. In any case,...

Custom dynamic partitioning


Now that we've discussed most of the concepts related to partitioning (distribution of tuples to the partitions, unifying the output of partitions, and use of the built-in stateless partitioners for both static and dynamic partitioning) let us consider an advanced topic: custom dynamic partitioning for potentially stateful operators.

As noted earlier, construction of the new set of partitions is done in the definePartitions() method of the Partitioner interface:

Collection<Partition<T>> definePartitions(Collection<Partition<T>> partitions, PartitioningContext context); 

It is important to remember that this function is invoked in the Application Master and not in any of the containers running the partitions.

The first argument is the list of currently existing partitions. We've already seen an example implementation of this method earlier in the context of defining partition masks and keys. That example creates a list of new partitions using...
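To make the shape of such an implementation concrete, here is a simplified, self-contained sketch of a doubling repartition. The Partition class below is a local stand-in for the platform's Partition interface (which additionally carries partition keys, checkpointed state, and port mappings), so this illustrates only the control flow, not the real API:

```java
import java.util.ArrayList;
import java.util.Collection;

public class Repartition
{
  // Local stand-in for an operator partition.
  public static class Partition<T>
  {
    public final T operator;
    public Partition(T operator) { this.operator = operator; }
  }

  // Double the partition count by creating one new partition per existing
  // one. A real implementation would clone each operator and split its
  // state between the old and new partitions.
  public static <T> Collection<Partition<T>> definePartitions(Collection<Partition<T>> existing)
  {
    Collection<Partition<T>> result = new ArrayList<>(existing);
    for (Partition<T> p : existing) {
      result.add(new Partition<>(p.operator));  // would be a clone in reality
    }
    return result;
  }
}
```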

Performance optimizations


In addition to partitioning, there are other aspects of how an Apex application is deployed on the cluster that are under the control of the application writer which can be used to further improve performance.

For example, consider consecutive operators opA and opB in a DAG. If the former generates a large volume of data on its output port and the latter performs some sort of filtering or aggregation operation so that the volume of data leaving its output port is considerably diminished, it may make sense to co-locate them in the same node to conserve network bandwidth; this is called NODE_LOCAL locality.

Additionally, tuple serialization and deserialization overhead (which can be considerable in some cases) can be eliminated if they could be co-located within the same container; this is called CONTAINER_LOCAL locality. The following figure shows different options of co-location of two operators:

These co-locations can be achieved by setting the locality of the appropriate...
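Locality can typically also be configured in the application's properties file rather than in code. The fragment below is a sketch; the application name (MyApp) and stream name (data) are placeholders, and it assumes the usual dt.* property naming convention:

```xml
<property>
  <!-- Deploy both ends of stream "data" in the same container -->
  <name>dt.application.MyApp.stream.data.locality</name>
  <value>CONTAINER_LOCAL</value>
</property>
```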

Low-latency versus throughput


As mentioned in Chapter 1, Introduction to Apex, Apex is capable of very low-latency processing while also delivering high throughput and fault-tolerance. Processing latency evaluation ultimately depends on the specific use case and end-to-end pipeline functionality. It is nevertheless important to know the platform limits and what is theoretically possible. Resource consumption translates to cost. It is essential that a platform can scale to meet current and future needs at reasonable cost. Complex streaming analytics applications can run on hundreds of processes and consume TBs of memory. At such a scale, it is essential to understand how future growth of business and data volume will translate into cost for the solution.

The Apex user does not have to trade latency for throughput as would be the case in batch systems or a micro-batch architecture. Performance benchmark results and production deployments have shown that Apex can process millions of events per...

Sample application for dynamic partitioning


In this section, we will take a detailed look at an example application that illustrates the use of dynamic partitioning of an operator. It uses an input operator that generates random numbers and outputs them to a DevNull library operator (which, as the name suggests, simply discards them). The input operator starts out with two partitions; after some tuples have been processed, a dynamic repartition is triggered via the StatsListener interface discussed above to increase the number of partitions to four. The source code is available at the following link: https://github.com/apache/apex-malhar/tree/master/examples/dynamic-partition.

The populateDAG() method is, as expected, very simple:

@Override 
public void populateDAG(DAG dag, Configuration conf) 
{ 
  Gen gen         = dag.addOperator("gen",     Gen.class); 
  DevNull devNull = dag.addOperator("devNull", DevNull.class); 
  dag.addStream("data", gen.out, devNull.data); 
} 

The interesting code...

Performance – other aspects for custom operators


When tuning an application with custom operators, some common Java coding practices can act as hidden performance drains, so developers should avoid them as far as possible. We've already mentioned one earlier, in passing: inadvertently including a large number of fields (or fields whose values are large) of an operator in the state by not adding the transient modifier. As the size of the state increases, serializing it for every checkpoint can become a hidden bottleneck. Generally speaking, if a field is cleared for every streaming or application window, it does not need to be part of the state.
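The following self-contained example illustrates the effect of the transient modifier. Note that Apex checkpoints operator state with Kryo rather than Java serialization; plain Java serialization is used here only to keep the example runnable on its own, and the transient semantics carry over:

```java
import java.io.*;

public class WindowedCounter implements Serializable
{
  private static final long serialVersionUID = 1L;

  private long totalCount;                          // part of checkpointed state
  private transient StringBuilder windowLog         // cleared every window, so it is
      = new StringBuilder();                        // kept out of the checkpoint

  public void process(String tuple)
  {
    totalCount++;
    windowLog.append(tuple).append('\n');
  }

  public void endWindow() { windowLog = new StringBuilder(); }

  public long getTotalCount() { return totalCount; }

  // Round-trip through serialization to show what survives a checkpoint.
  // Transient fields come back uninitialized; a real operator would
  // recreate them in setup() after recovery.
  public static WindowedCounter checkpointRoundTrip(WindowedCounter op)
  {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      new ObjectOutputStream(bos).writeObject(op);
      ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
      return (WindowedCounter)in.readObject();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

After the round trip, totalCount is preserved while the transient buffer is not, which is exactly the division we want between durable state and per-window scratch data.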

A second practice is the per-tuple use of reflection (or the use of Maps and other Java collections) which is an expensive operation; this is often done when the type of the tuple is not known at compile time, so just Object is used. For such cases, Apex provides a utility class called PojoUtils which can be used to create custom getter and setter methods...
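As a self-contained illustration of the underlying idea (this is not the PojoUtils API itself), compare a reflective field access performed on every tuple with a typed getter that is resolved once and then invoked with no reflection at all:

```java
import java.util.function.ToLongFunction;

public class GetterExample
{
  public static class Pojo
  {
    public long value;
    public Pojo(long value) { this.value = value; }
  }

  // Slow path: a reflective lookup repeated for every tuple.
  public static long viaReflection(Object tuple)
  {
    try {
      return tuple.getClass().getField("value").getLong(tuple);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  // Fast path: resolve the accessor once, then call it per tuple with
  // no reflection. PojoUtils generates equivalent typed getters at
  // runtime for classes not known at compile time.
  public static ToLongFunction<Object> makeGetter()
  {
    return tuple -> ((Pojo)tuple).value;
  }
}
```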

Summary


In this chapter, we discussed the partitioning of operators and how it is the key platform feature that provides both static scalability and adaptive dynamic scalability to Apex applications with minimal required effort on the part of the application writer. We discussed related concepts such as shuffled partitioning, parallel partitioning, unifiers, and stream codecs, how these features can be configured either in code or in properties files, and how the interplay of these features results in unparalleled flexibility and outstanding run-time performance. Finally, we concluded by discussing a sample application that illustrates some of these features.

In the next chapter, we will cover another key strength of Apex: fault tolerance and reliability with checkpointing and processing guarantees.

