
You're reading from  Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Explaining PTransform expansion

PTransform is short for parallel transform, an Apache Beam primitive that transforms a PInput into a POutput. PInput is a labeling interface that marks objects as suitable inputs to a PTransform, while POutput marks objects as suitable outputs. We already know these objects quite well: the typical one, used for both input and output, is PCollection. There are others as well, most notably PCollectionTuple and PCollectionList, plus two special objects, PBegin and PDone. As we already know, an Apache Beam program (a pipeline) is a DAG whose edges represent PCollections and whose nodes represent PTransforms. PTransforms that take PBegin as input are the roots of the DAG, while PTransforms that produce PDone are its leaves.
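To make the root and leaf terminology concrete, here is a minimal sketch in plain Java. It is a toy model and does not use the Apache Beam API: the `PipelineDagSketch` class, its `Node` type, and the transform names (`ReadLines`, `SplitWords`, and so on) are all hypothetical and chosen only for illustration. Each node stands for a transform, the strings naming its input and output stand for the edges, and a node counts as a root if it consumes `PBegin` and as a leaf if it produces `PDone`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical stand-ins, not Apache Beam classes.
final class PipelineDagSketch {

    // Labels for the two special endpoints described in the text.
    static final String PBEGIN = "PBegin";
    static final String PDONE = "PDone";

    /** A node of the DAG: a named transform plus the labels of its input and output edges. */
    static final class Node {
        final String name;
        final String input;
        final String output;

        Node(String name, String input, String output) {
            this.name = name;
            this.input = input;
            this.output = output;
        }
    }

    /** Transforms that take PBegin as input are the roots of the DAG. */
    static List<String> roots(List<Node> nodes) {
        List<String> result = new ArrayList<>();
        for (Node node : nodes) {
            if (node.input.equals(PBEGIN)) {
                result.add(node.name);
            }
        }
        return result;
    }

    /** Transforms that produce PDone are the leaves of the DAG. */
    static List<String> leaves(List<Node> nodes) {
        List<String> result = new ArrayList<>();
        for (Node node : nodes) {
            if (node.output.equals(PDONE)) {
                result.add(node.name);
            }
        }
        return result;
    }

    /** A hypothetical linear pipeline: read, split, count, write. */
    static List<Node> examplePipeline() {
        return List.of(
            new Node("ReadLines", PBEGIN, "lines"),
            new Node("SplitWords", "lines", "words"),
            new Node("CountWords", "words", "counts"),
            new Node("WriteCounts", "counts", PDONE));
    }

    public static void main(String[] args) {
        List<Node> pipeline = examplePipeline();
        System.out.println("roots:  " + roots(pipeline));
        System.out.println("leaves: " + leaves(pipeline));
    }
}
```

In this sketch only the first transform qualifies as a root and only the last as a leaf; the two in the middle both consume and produce ordinary intermediate collections. A real pipeline may have several roots and several leaves, which the same two checks would pick up.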

This can be seen in the following diagram:

Figure 4.1 – DAG of PTransforms and PCollections

A PTransform is a recursive...
