You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Introducing composite transform – CoGroupByKey

In Task 11, we solved the tracker motivation problem by using side inputs. The operation involved is really a join: we want to join two streams – a 5-minute average and a 1-minute average – compare them, and then output a notification. Side inputs are handy and efficient as long as they fit in memory, but with enough users, this approach will run into trouble. What other options do we have to solve our problem? Fortunately, Apache Beam has a composite transform for exactly this purpose: CoGroupByKey. The transform is composite because it wraps GroupByKey and KeyedPCollectionTuple: each element of two or more input PCollections is tagged using a TupleTag and then processed using GroupByKey to produce a CoGbkResult – a wrapper object that holds all the values from each of the input PCollections that share the same key and window. This can be seen...
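To make the semantics concrete, here is a plain-Java sketch (outside Beam, using only standard collections) of what CoGroupByKey does: values from each tagged input that share a key are gathered together, the way a CoGbkResult exposes them per TupleTag. The class and method names (CoGroupSketch, coGroup) are illustrative, not part of the Beam API, and the per-user lists of averages are assumed inputs for this example:

```java
import java.util.*;

// A minimal sketch of CoGroupByKey semantics using plain collections:
// two keyed inputs are (conceptually) tagged, then all values sharing a
// key are gathered into one result per key -- what Beam wraps in a
// CoGbkResult. Names here are illustrative, not Beam API.
public class CoGroupSketch {

    // Joins two keyed inputs; each result entry holds the values from
    // both inputs for that key (an empty list where a key is absent
    // from one input).
    public static Map<String, List<List<Double>>> coGroup(
            Map<String, List<Double>> fiveMinAvg,
            Map<String, List<Double>> oneMinAvg) {
        Map<String, List<List<Double>>> result = new TreeMap<>();
        Set<String> keys = new TreeSet<>();
        keys.addAll(fiveMinAvg.keySet());
        keys.addAll(oneMinAvg.keySet());
        for (String key : keys) {
            result.put(key, List.of(
                fiveMinAvg.getOrDefault(key, List.of()),
                oneMinAvg.getOrDefault(key, List.of())));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> fiveMin = Map.of("user1", List.of(5.0));
        Map<String, List<Double>> oneMin =
            Map.of("user1", List.of(7.0), "user2", List.of(3.0));
        // For user2 the first slot is empty: no 5-minute average yet.
        System.out.println(coGroup(fiveMin, oneMin));
    }
}
```

In an actual pipeline, the same grouping is expressed by building a KeyedPCollectionTuple – KeyedPCollectionTuple.of(tag1, input1).and(tag2, input2) – and applying CoGroupByKey.create(), which yields a PCollection of KV<K, CoGbkResult>; the values for each input are then retrieved from the CoGbkResult via its TupleTag.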
