Chapter 8: Understanding How Runners Execute Pipelines

So far in this book, we have focused on Apache Beam from the user's perspective. We have seen how to code pipelines in the Java Software Development Kit (SDK), how to use Domain-Specific Languages (DSLs) such as SQL, and how to use portability with the Python SDK. In this chapter, we will focus on how the runner executes the pipeline. This will help us if we want to develop a runner for a new technology, debug our code, or resolve performance issues.

We will not try to implement our own runner in this chapter. Instead, we will focus on the theoretical concepts that underpin runners. We will explore the building blocks of a typical runner, and this will help us understand how a runner executes our user code.

After describing how runners implement the Beam model, we will conclude this chapter with an in-depth description of windowing semantics and of how to use metrics for observability. Improving observability is key when attempting...

Describing the anatomy of an Apache Beam runner

Let's first take a look at the typical life cycle of a pipeline, from construction time to pipeline teardown. The complete life cycle is illustrated in the following figure:

Figure 8.1 – The execution of a pipeline by a runner

The pipeline construction is already well known – we spent most of this book showing how to construct and test pipelines. The next step is submitting the pipeline to a runner. This is the point where the pipeline crosses the SDK-runner boundary, typically by a call to Pipeline.run().
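To make that boundary concrete, here is a minimal sketch using the Java SDK: everything before the run() call happens purely at construction time on the client, and the call itself hands the pipeline over to whichever runner the supplied options select.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SubmitPipeline {
  public static void main(String[] args) {
    // Parse command-line options; --runner=... selects which runner executes the pipeline.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Pipeline construction - no runner is involved yet.
    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply(Create.of("a", "b", "c"))
        .apply(MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));

    // Crossing the SDK-runner boundary: the pipeline is submitted to the runner.
    pipeline.run().waitUntilFinish();
  }
}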

After the pipeline is submitted to the runner, the runner proceeds as follows:

  1. Once a runner receives a pipeline, it first performs pipeline validation. This consists of various runner-independent validations – for instance, validating that an appropriate window function and/or trigger is set, depending on the boundedness of the pipeline's inputs. These...
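As an illustration of this kind of check, the Java SDK already rejects, at construction time, a GroupByKey applied to an unbounded input that sits in the global window with the default trigger – such a grouping could never emit a result. The sketch below shows the shape of the problem (the exact exception and message may differ between Beam versions):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class InvalidGroupByKey {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // An unbounded source left in the (default) global window...
    PCollection<KV<Integer, Long>> keyed = pipeline
        .apply(GenerateSequence.from(0))   // unbounded PCollection<Long>
        .apply(WithKeys.of(1));            // attach a constant key

    // ...cannot be grouped without a window function or trigger that bounds
    // the amount of buffered data; this line fails with an IllegalStateException
    // before any runner starts executing the pipeline.
    keyed.apply(GroupByKey.create());
  }
}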

Explaining the differences between classic and portable runners

The description in the previous section – Describing the anatomy of an Apache Beam runner – applies to both classic and portable runners. However, there are some important differences between the two.

A classic runner is a runner implemented in the same programming language as the Beam SDK it executes, and it is able to run pipelines written in that specific SDK only. An example of a classic runner is the classic FlinkRunner, which builds on Apache Flink – a technology whose native API is implemented in Java – and can therefore execute Beam pipelines written in the Java SDK. We used this runner throughout the first five chapters of this book.
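Which classic runner executes a pipeline is typically selected through PipelineOptions. A minimal sketch, assuming the Flink runner artifact is on the classpath (the same effect can be achieved with the --runner=FlinkRunner command-line flag):

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunOnFlink {
  public static void main(String[] args) {
    // Select the classic FlinkRunner programmatically.
    FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline as usual, then hand it over to the classic runner:
    pipeline.run();
  }
}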

A portable runner is implemented using the portability layer and as a result, it can be used to execute pipelines written in any SDK that is supported by Beam. However, this flexibility comes at a price – a portable runner implemented in Java and running a pipeline implemented...
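For comparison, submitting through the portability layer goes via a job server rather than directly to the runner. A hedged sketch, assuming a Beam job server (for example, the Flink job server) is already running on localhost:8099 and the portability runner artifact is on the classpath:

import org.apache.beam.runners.portability.PortableRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

public class RunPortably {
  public static void main(String[] args) {
    PortablePipelineOptions options =
        PipelineOptionsFactory.as(PortablePipelineOptions.class);
    options.setRunner(PortableRunner.class);
    options.setJobEndpoint("localhost:8099");  // address of the job server

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline; run() sends it to the job server, which
    // translates it for the underlying engine and manages the SDK harnesses.
    pipeline.run();
  }
}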

Understanding how a runner handles state

As we already know, any complex computation needs to group multiple data elements together. Because stream processing cannot rely on sources being able to replay data (as opposed to pure batch processing, where this property is essential), any updates to the local state during the computation have to be fault-tolerant, and it is the responsibility of the runner to ensure this. The Beam state API is designed precisely to enable this. Any state access is handled by a runner-provided implementation of StateInternals (and TimerInternals for timers – in this discussion, we will treat timers as a special case of state, so we will not describe them separately). The StateInternals instances are responsible for creating the accessors for the state – for example, ValueState, BagState, MapState, and so on. The runner must create and manage these instances to ensure both fault tolerance and consistency...
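From the user's perspective, such state is declared on a DoFn, and the runner backs each declared state cell with its StateInternals implementation. A minimal sketch of a stateful DoFn that counts elements per key (standard Java SDK constructs; the cell name "count" is arbitrary):

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Counts how many elements have been seen per key; persisting the "count"
// cell fault-tolerantly is the runner's job.
class CountPerKey extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(
      ProcessContext context, @StateId("count") ValueState<Long> count) {
    Long stored = count.read();                 // served by the runner's StateInternals
    long updated = (stored == null ? 0L : stored) + 1;
    count.write(updated);
    context.output(KV.of(context.element().getKey(), updated));
  }
}

Such a DoFn has to be applied to a keyed PCollection, because the runner partitions and recovers the state per key (and per window).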

Exploring the Apache Beam capability matrix

Beam verifies that runners adhere to the Beam model so that pipelines remain portable between different runners. There are tests specifically designed to validate this compatibility – these tests are called ValidatesRunner suites and are annotated using the @ValidatesRunner annotation. Any runner author can verify the compatibility of their runner against these tests. However, because runners are developed at different paces, it is possible that a specific runner, at a certain moment in time, does not support all of the features of the Beam model. Therefore, the complete set of ValidatesRunner tests can be narrowed down by toggling specific features of the model. The tests in the complete suite are annotated with the features they exercise, such as UsesSetState or UsesSchema, and a runner is then allowed to specify features that should be excluded from the tests. These excluded features should then match what is documented in the...
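In practice, these suites are ordinary JUnit tests grouped by category. The following hypothetical test sketches the pattern: the ValidatesRunner category marks it as part of the compatibility suite, and tests that exercise optional model features carry additional categories so a runner can exclude them.

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.ValidatesRunner;
import org.apache.beam.sdk.transforms.Create;
import org.junit.Rule;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class SimpleValidatesRunnerTest {

  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  // Tests that exercise optional model features carry additional categories
  // (for example, UsesSetState or UsesSchema) so a runner can exclude them.
  @Test
  @Category(ValidatesRunner.class)
  public void testCreateProducesAllElements() {
    PAssert.that(pipeline.apply(Create.of(1, 2, 3))).containsInAnyOrder(1, 2, 3);
    pipeline.run().waitUntilFinish();
  }
}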

Understanding windowing semantics in depth

In Chapter 1, Introducing Data Processing with Apache Beam, we introduced the basic types of window functions. To recap, we defined the following:

  • Fixed windows
  • Sliding windows
  • Global window
  • Session windows

We also defined two basic types of windows: key-aligned and key-unaligned. The first three types (fixed, sliding, and global) are key-aligned, while session windows are key-unaligned (with session windows, each window can start and end at a different time for each key). However, what we skipped in Chapter 1, Introducing Data Processing with Apache Beam, was the fact that we can define completely custom windowing logic.

The Window.into transform accepts a generic WindowFn instance, which defines the following main methods:

  1. The assignWindows method, which assigns each element to a set of window labels.
  2. The isNonMerging method, which tells the runner whether the WindowFn instance defines merging...
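To sketch what custom windowing logic can look like, the following hypothetical WindowFn assigns every element to a ten-minute window derived from its timestamp. It extends PartitioningWindowFn, a convenience base class for non-merging window functions in the Java SDK, so only a handful of methods need to be implemented (treat this as a sketch – exact signatures may vary slightly between Beam versions):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.PartitioningWindowFn;
import org.apache.beam.sdk.transforms.windowing.WindowFn;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Hypothetical custom WindowFn: ten-minute windows aligned to the epoch.
// Because PartitioningWindowFn is non-merging, isNonMerging() returns true.
public class TenMinuteWindows extends PartitioningWindowFn<Object, IntervalWindow> {

  private static final Duration SIZE = Duration.standardMinutes(10);

  @Override
  public IntervalWindow assignWindow(Instant timestamp) {
    long startMillis = timestamp.getMillis() - (timestamp.getMillis() % SIZE.getMillis());
    return new IntervalWindow(new Instant(startMillis), new Instant(startMillis + SIZE.getMillis()));
  }

  @Override
  public Coder<IntervalWindow> windowCoder() {
    return IntervalWindow.getCoder();
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return other instanceof TenMinuteWindows;
  }
}

Such a function would then be applied like any built-in one, via Window.into(new TenMinuteWindows()).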

Debugging pipelines and using Apache Beam metrics for observability

Observability is a key part of spotting potential issues with a running pipeline. It can be used to measure various performance characteristics, including the number of elements processed, the number of RPC calls to backend services, and the distribution of the event-time lags of elements flowing through the pipeline.

Although it should be possible to create a side output for each metric and handle the resulting stream like any other data in the pipeline, the requirement for quick and simple feedback from running pipelines led Beam to create a simple API dedicated to metrics. Currently, Beam supports the following metric types:

  • Counters
  • Gauges
  • Distributions

A Counter instance is a metric represented by a single long value that can only be incremented or decremented (by 1 or by an arbitrary amount).
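A minimal sketch of incrementing a counter from inside a DoFn (the metric namespace and name below are arbitrary choices):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

class CountingFn extends DoFn<String, String> {

  // A counter is identified by a namespace (here, the class) and a name.
  private final Counter processed = Metrics.counter(CountingFn.class, "processed-elements");

  @ProcessElement
  public void processElement(ProcessContext context) {
    processed.inc();            // increment by 1; inc(n) increments by n
    context.output(context.element());
  }
}

The accumulated values can later be queried from the PipelineResult returned by Pipeline.run(), or collected by whatever monitoring integration the runner provides.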

A Gauge instance is a metric that also holds a single long value; however, this value...

Summary

In this chapter, we walked through how runners execute pipelines, both in classic mode and using the portability layer. We have seen that classic runners are suitable only for cases where a particular underlying technology – for instance, Apache Flink – has an API in the same language as the pipeline SDK. In practice, this mostly means pipelines written in the Java SDK running on runners that are themselves implemented in Java.

In cases where the language of the runner and the pipeline SDK differ, we have to use portability (Fn API), which brings some overhead. We have seen how pipeline fusion is used to reduce this overhead as much as possible. We have also discussed situations where we want to prevent fusion and how to do this by inserting a shuffle boundary.

Next, we discussed the responsibilities of a runner with regard to state management. We saw how the runner ensures fault tolerance and correctness upon failures. We outlined two basic types of fault-tolerant states: local state...
