You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Introducing the primitive PTransform object – Partition

The GroupByKey transform creates a set of sub-streams based on a dynamic property of the data: the set of keys in a particular window can change while the pipeline is executing, and new keys can be created and processed at any time. This is the source of the complexity mentioned in the previous section, where we need to store our data in keyed state and flush it on triggers. A natural question is whether the task would be easier if we knew the exact set of keys upfront, at pipeline construction time.

The answer is yes, and that is why we have a PTransform object called Partition.
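To make the idea concrete, here is a minimal plain-Java sketch of what partitioning into a fixed, known-upfront number of outputs looks like. It deliberately does not use Beam: in Beam the corresponding transform is applied as Partition.of(numPartitions, partitionFn) and yields one PCollection per partition, whereas the PartitionSketch class, its partition helper, and the word-length rule below are hypothetical names chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntBiFunction;

public class PartitionSketch {

  // Splits the input into exactly numPartitions lists. The number of outputs
  // is fixed before any element is seen, which mirrors knowing the set of
  // "keys" at construction time. The partition function receives the element
  // and the partition count, and must return an index in [0, numPartitions).
  public static <T> List<List<T>> partition(
      List<T> input, int numPartitions, ToIntBiFunction<T, Integer> partitionFn) {
    List<List<T>> outputs = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      outputs.add(new ArrayList<>());
    }
    for (T element : input) {
      int index = partitionFn.applyAsInt(element, numPartitions);
      if (index < 0 || index >= numPartitions) {
        throw new IllegalArgumentException("Partition index out of range: " + index);
      }
      outputs.get(index).add(element);
    }
    return outputs;
  }

  public static void main(String[] args) {
    List<String> words = List.of("a", "bb", "ccc", "dddd", "ee");
    // Hypothetical rule: route each word by its length modulo the partition count.
    List<List<String>> parts = partition(words, 3, (w, n) -> w.length() % n);
    System.out.println(parts);
  }
}
```

Because the number of outputs is fixed, each one can be handled by its own downstream logic without keyed state or triggers, which is the simplification the question above hints at.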

Important note

A pipeline is generally divided into three phases during its life cycle: pipeline compile time, pipeline construction time, and pipeline execution time. Compile time refers (as usual) to the time we compile the source to bytecode. Construction time is the time when the pipeline's DAG of transformations...
