You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Task 9 – Separating droppable data from the rest of the data processing

Under normal circumstances, an element flowing through a pipeline keeps its status as on time, late, or droppable. There are two exceptions:

  • Data can change its status if we change the WindowFn object and re-window the stream, because the new windows end at different points in time and therefore have different window GC times.
  • Data can change its status if we apply logic with a more sensitive definition of droppable data. This applies specifically to @RequiresTimeSortedInput, under which an element is droppable if, at any point in time, it is behind the watermark by more than the defined allowed lateness.

In other words, as long as we do not change the window function and do not apply logic with such specific requirements, the droppable status of an element should not change between transforms. We will use this property to...
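The two exceptions can be illustrated with a minimal sketch in plain Java. It does not use the Beam SDK and is not the book's implementation: the class `DroppableSketch` and all of its methods are hypothetical names, fixed windows are assumed, and the window GC time is modeled as the last timestamp in the window plus the allowed lateness, with boundary conditions simplified.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical plain-Java model (no Beam dependency) of when an element is droppable. */
final class DroppableSketch {

  /** Exclusive end of the fixed window of the given size that contains the timestamp. */
  static Instant fixedWindowEnd(Instant timestamp, Duration size) {
    long sizeMs = size.toMillis();
    long startMs = Math.floorDiv(timestamp.toEpochMilli(), sizeMs) * sizeMs;
    return Instant.ofEpochMilli(startMs + sizeMs);
  }

  /** Assumed GC time: the last timestamp inside the window plus the allowed lateness. */
  static Instant gcTime(Instant windowEnd, Duration allowedLateness) {
    return windowEnd.minusMillis(1).plus(allowedLateness);
  }

  /** Window-based rule: droppable once the watermark has passed the window GC time. */
  static boolean isDroppableInWindow(
      Instant timestamp, Duration windowSize, Duration allowedLateness, Instant watermark) {
    Instant windowEnd = fixedWindowEnd(timestamp, windowSize);
    return watermark.isAfter(gcTime(windowEnd, allowedLateness));
  }

  /** Stricter per-element rule: droppable when more than the allowed lateness behind the watermark. */
  static boolean isDroppablePerElement(
      Instant timestamp, Duration allowedLateness, Instant watermark) {
    return timestamp.plus(allowedLateness).isBefore(watermark);
  }

  public static void main(String[] args) {
    Instant timestamp = Instant.ofEpochSecond(30);   // element timestamp
    Instant watermark = Instant.ofEpochSecond(120);  // current watermark
    Duration lateness = Duration.ofSeconds(30);      // allowed lateness

    // Same element and watermark, but a different window size changes the GC time.
    System.out.println(
        isDroppableInWindow(timestamp, Duration.ofMinutes(1), lateness, watermark));
    System.out.println(
        isDroppableInWindow(timestamp, Duration.ofMinutes(5), lateness, watermark));
    // The per-element rule ignores windows entirely.
    System.out.println(isDroppablePerElement(timestamp, lateness, watermark));
  }
}
```

With these assumed values, the element is droppable in one-minute windows, not droppable in five-minute windows, and droppable again under the per-element rule. As long as neither the window size nor the rule changes, the answer for a given element and watermark stays the same from one transform to the next.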
