Value proposition of Apex


The case studies presented earlier showcase how Apex is used in critical production deployments that solve important business problems. This section will highlight key capabilities of Apex and how they relate to its value proposition. To understand the challenges in finding the right technology and building successful solutions, it is helpful to look at the evolution of the big data technology space over the last few years, which essentially started with Apache Hadoop.

Hadoop was originally built as a Java-based platform for search indexing at Yahoo!, inspired by Google's MapReduce paper. Its promise was to process big data on commodity hardware, significantly reducing the infrastructure cost of such systems. Hadoop became an Apache Software Foundation (ASF) top-level project in 2008, consisting of HDFS for storage and MapReduce for processing. This marked the beginning of an entire ecosystem of other Apache projects beyond MapReduce, including HBase, Hive, Oozie, and so on. More recently, we have seen a shift away from MapReduce towards projects such as Apache Spark and Apache Kafka, leading to a transformation within the ecosystem that reflects the need for a different architecture and processing paradigm.

A further indication is that even leading Hadoop vendors have started to rebrand products and conferences to expand beyond the original Hadoop roots. Over the last 10 years, there has been a lot of hype around Hadoop, but the success rate of projects has not kept up. Challenges include:

  • A very large number of tools and vendors with often confusing positioning, making it difficult to evaluate and identify the right options
  • Complexity in development and integration, a steep learning curve, and long time to production
  • Scarcity of skill set: experts in the technology are difficult to hire
  • Production readiness: the primary focus is often on features and functionality while operational aspects are sidelined, which is a problem for business-critical systems

Matt Turck of FirstMark Capital summed it up with the following declaration:

Big Data success is not about implementing one piece of technology (like Hadoop or anything else), but instead requires putting together an assembly line of technologies, people and processes.

So, how does Apex help to succeed with stream data processing use cases?

Since its inception, the Apex project was focused on enterprise-readiness as a key architectural requirement, including aspects such as:

  • Fault tolerance and high availability of all components, automatic recovery from failures, and the ability to resume applications from their previous state
  • A stateful processing architecture with strong processing guarantees (end-to-end exactly-once) to enable mission-critical use cases that depend on correctness
  • Scalability and superior performance: high throughput and low latency, with the ability to process millions of events per second without compromising fault tolerance, correctness, or latency
  • Security, multi-tenancy, and operability, including a REST API with metrics for monitoring, and so on
  • A comprehensive library of connectors for integration with the external systems typically found in enterprise architectures; the library is an integral part of the project, maintained by the community and guaranteed to be compatible with the engine
  • The ability to reuse code in the JVM environment, with Java as the primary development language, which has a very rich ecosystem and a large developer base that is accessible to the kinds of customers who require big data solutions

With several large-scale, mission-critical deployments in production, some of which we discussed earlier, Apex has proven that it can deliver.

Apex requires a cluster to run on and, as of now, this means a Hadoop cluster with YARN and HDFS. Apex will likely support other cluster managers such as Mesos, Kubernetes, or Docker Enterprise in the future, as they gain adoption in the target enterprise space. Running on top of a cluster allows Apex to provide features such as dynamic scaling and resource allocation, automatic recovery and support for multi-tenancy.

For users who already have Hadoop clusters as well as the operational skills and processes to run the infrastructure, it is easy to deploy an Apex application, as it does not require installation of any additional components on cluster nodes. If no existing Hadoop cluster is available, there are several options to get started with varying degrees of upfront investment, including cloud deployment such as Amazon EMR, installation of any of the Hadoop distributions (Cloudera, Hortonworks, MapR) or just a Docker image on a local laptop for experimentation.

Big data applications in general are not trivial, especially not the pipelines that solve complex use cases and have to run in production 24/7 without downtime. When working with Apex, the development process, APIs, library, and examples are tailored to enable a Java developer to become productive and obtain results quickly. By using readily available connectors for sources and sinks, it is possible to quickly build an initial proof of concept (PoC) application that consumes real data, does some of the required processing, and stores results. The more involved custom development of use case-specific business logic can then occur in iterations. The process of building an Apex application will be covered in detail in the next chapter.
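As a rough illustration of such a PoC, the following sketch wires a Kafka source from the Malhar library to a console sink. This is not an example from the book; the operator names, setter calls, topic, and broker address are illustrative and would need to be checked against the actual Malhar version in use.

```java
import org.apache.hadoop.conf.Configuration;

import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.lib.io.ConsoleOutputOperator;

// Minimal PoC sketch: consume messages from Kafka and print them to the console.
@ApplicationAnnotation(name = "KafkaPocApplication")
public class KafkaPocApplication implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Source: Malhar Kafka input operator; topic and broker list are placeholders.
    KafkaSinglePortInputOperator input = dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
    input.setTopics("events");
    input.setClusters("localhost:9092");

    // Sink: write each tuple to the container's stdout, good enough for a first PoC.
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

    // Connect the source to the sink.
    dag.addStream("messages", input.outputPort, console.input);
  }
}
```

The custom business logic would then be added iteratively as additional operators between the source and the sink.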

Apex separates the application functionality (or business logic) and the behavior of the engine. Aspects such as parallelism, operator chaining/locality, checkpointing and resource allocations for individual operators can all be controlled through configuration and modified without affecting the application code or triggering a full build/test cycle. This allows benchmarking and tuning to take place independently. For example, it is possible to run the same packaged application with different configurations to test trade-offs such as lower parallelism/longer time to completion (batch use case), and so on.
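As a rough illustration, a configuration file deployed alongside the packaged application can adjust such settings without any change to the Java code. The operator and stream names below are hypothetical, and the property keys follow the conventions Apex used at the time of writing; treat this as a sketch rather than a definitive configuration:

```xml
<configuration>
  <!-- Run the (hypothetical) kafkaInput operator with 4 static partitions. -->
  <property>
    <name>dt.operator.kafkaInput.attr.PARTITIONER</name>
    <value>com.datatorrent.common.partitioner.StatelessPartitioner:4</value>
  </property>
  <!-- Allocate more memory to the (hypothetical) parser operator. -->
  <property>
    <name>dt.operator.parser.attr.MEMORY_MB</name>
    <value>2048</value>
  </property>
  <!-- Keep the (hypothetical) parsed stream within one container to avoid serialization overhead. -->
  <property>
    <name>dt.stream.parsed.locality</name>
    <value>CONTAINER_LOCAL</value>
  </property>
</configuration>
```

Because only the configuration changes, the same application package can be benchmarked under different parallelism and locality settings without a rebuild.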

Low latency and stateful processing

Apex is a native streaming architecture. As previously discussed, this allows events to be processed as soon as they arrive, without artificial delay, which enables real-time use cases with very low latency. Another important capability is stateful processing. Windowing may require a potentially very large amount of computational state. In addition, state needs to be tracked in connectors for correct interaction with external systems. For example, the Apex Kafka connector keeps track of partition offsets as part of its checkpointed state so that it can correctly resume consumption after recovery from a failure. Similarly, state is required for reading from files and other sources. For sources that don't allow replay, it is even necessary to retain all consumed data in the connector until it has been fully processed in the DAG.

Stateful stream processors follow what is also referred to as the continuous operator model. Operators are initialized once, at launch time. Subsequently, as events are processed one by one, state can be accumulated and held in memory for as long as it is needed for the computation. Access to this in-memory state is fast, which allows for very low latency.
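To make the continuous operator model concrete, here is a minimal sketch of a stateful counting operator (an illustrative example, not taken from the book). The non-transient counts field is accumulated in memory between events and belongs to the operator's checkpointed state, while the ports are transient because they are reconstructed when the operator is deployed or recovered:

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Sketch of a continuous, stateful operator: initialized once, it then
// accumulates per-word counts in memory as tuples stream through it.
public class WordCountOperator extends BaseOperator
{
  // Non-transient field: included in the state the platform checkpoints.
  private Map<String, Long> counts = new HashMap<>();

  // Ports are transient; they are wired up again on (re)deployment.
  public final transient DefaultOutputPort<Map.Entry<String, Long>> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String word)
    {
      Long count = counts.get(word);
      long newCount = (count == null) ? 1 : count + 1;
      counts.put(word, newCount);
      output.emit(new AbstractMap.SimpleEntry<String, Long>(word, newCount));
    }
  };
}
```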

So, what about fault tolerance? The platform is responsible for checkpointing the state. It can do so efficiently and provides everything needed to guarantee that state can be restored and remains consistent in the event of a failure. Unlike the early days of Apache Storm, with its per-tuple acknowledgement overhead and user responsibility for state handling, the next generation of streaming architectures provides fault tolerance mechanisms that do not compromise performance and latency. How Apex solves this will be covered in detail in Chapter 5, Fault Tolerance and Reliability.

Native streaming versus micro-batch

Let's examine how stateful stream processing (as found in Apex and Flink) compares to the micro-batch-based approach of Apache Spark Streaming.

Let's look at the following diagram:

The top of the diagram shows an example of processing in Spark Streaming and the bottom shows an example in Apex. Based on its underlying "stateless" batch architecture, Spark Streaming processes a stream by dividing it into small batches (micro-batches) that typically last from 500 ms to a few seconds. A new task is scheduled for every micro-batch. Once scheduled, the new task needs to be initialized. Such initialization could include opening connections to external resources, loading data that is needed for processing, and so on. Overall, this implies a per-task overhead that limits the micro-batch frequency and leads to a latency trade-off.

In classical batch processing, tasks may last for the entire bounded input data set. Any computational state remains internal to the task and there is typically no special consideration for fault tolerance required, since whenever there is a failure, the task can restart from the beginning.

However, with unbounded data and streaming, a stateful operation like counting would need to maintain the current count, and that count would need to be transferred across task boundaries. As long as the state is small, this may be manageable. However, when transformations are applied over a large key cardinality, the state can easily grow to a size that makes it impractical to swap in and out (cost of serialization, I/O, and so on). Correct state management is not easy to achieve without underlying platform support, especially when accuracy, consistency, and fault tolerance are important.

Performance

Even with big data scale-out architectures on commodity hardware, efficiency matters. Better efficiency of the platform lowers cost: if the architecture can handle a given workload with a fraction of the hardware, it results in a reduced Total Cost of Ownership (TCO). Apex provides several advanced mechanisms to optimize efficiency, such as stream locality and parallel partitioning, which will be covered in Chapter 4, Scalability, Low Latency, and Performance.

Apex is capable of very low latency processing (< 10 ms) and is well suited for use cases such as the real-time threat detection discussed earlier. Apex can be used to deliver a processing-latency Service Level Agreement (SLA) in conjunction with speculative execution (processing the same event multiple times in parallel to prevent delay), thanks to a unique feature: the ability to recover a path or subset of operators without resetting the entire DAG.

Only a fraction of real-time use cases may have such low latency and SLA requirements. However, it is generally desirable to avoid unnecessary trade-offs. If a platform can deliver high throughput (millions of events per second) with low latency and everything else is equal, why not choose such a platform over one that forces a throughput/latency trade-off? Various benchmarking studies have shown Apex to be highly performant in providing high throughput while maintaining very low latency.

Where Apex excels

Overall, Apex has characteristics that positively impact time to production, quality, and cost. It is a particularly good fit for use cases that require:

  • High performance and low latency, possibly with SLA
  • Large scale, fault tolerant state management and end-to-end exactly-once processing guarantees
  • Computationally complex production pipelines where accuracy, functional stability, security, and certification are critical and ad hoc changes are not desirable

The following figure provides a high-level overview of the business value Apex is capable of delivering:

Where Apex is not suitable

On the other hand, there are a few related areas of interest that Apex does not target or is less suited for (as of this writing):

  • Data exploration in ad hoc, experimental environments such as Spark's interactive shell.
  • Machine learning. Apex currently does not have its own library of machine learning algorithms, although it does support iterative processing and can be used as an execution engine, as seen in Apache SAMOA.
  • Interactive SQL. Apex has basic support for streaming SQL transformations, but is not comparable to Hive or similar tools.
  • At the time of writing, Apex does not have support for Python, although it is being discussed within the community and likely to happen in the future. (The Apex library has a Jython operator, but users typically want to run native Python code and also specify the pipeline in Python.)