You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading Level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Defining droppable data in Beam

This section briefly returns to the material covered in Chapter 2, Implementing, Testing, and Deploying Basic Pipelines, where we defined what late data means. To recap: late data is any data element whose timestamp is behind the watermark. That is, the watermark tells us that we should no longer receive a data element with a timestamp lower than the watermark, yet we receive such an element anyway. This is perfectly fine: as described in Chapter 1, Introduction to Data Processing with Apache Beam, a perfect watermark would introduce unnecessary, or even impractical, latency. What we have left unanswered is this: what happens to data elements that arrive too late? We know that we can define allowed lateness, but what if data arrives even later than that? As always, the answer is that it depends. Luckily, some of the concepts relating to streaming...
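The recap above can be illustrated with a small, simplified model. The sketch below uses plain Python with no Beam dependency, and everything in it is an assumption made for illustration: timestamps are integer milliseconds, windows are fixed-size, and an element is treated as droppable once the watermark has passed the end of its window plus the allowed lateness. The names `Lateness`, `window_end`, and `classify` are hypothetical and are not part of any Beam API.

```python
from enum import Enum


class Lateness(Enum):
    ON_TIME = "on-time"
    LATE = "late"
    DROPPABLE = "droppable"


def window_end(timestamp: int, window_size: int) -> int:
    """Exclusive end of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % window_size) + window_size


def classify(timestamp: int, watermark: int,
             window_size: int, allowed_lateness: int) -> Lateness:
    """Classify one element under the simplified model described above."""
    # An element at or ahead of the watermark is not late at all.
    if timestamp >= watermark:
        return Lateness.ON_TIME
    # Assumption: the window keeps accepting data until the watermark
    # passes its last included instant plus the allowed lateness.
    expiry = window_end(timestamp, window_size) - 1 + allowed_lateness
    if expiry < watermark:
        return Lateness.DROPPABLE
    # Behind the watermark, but still within the allowed lateness.
    return Lateness.LATE
```

For example, with one-minute windows and 30 seconds of allowed lateness, `classify(110_000, 120_000, 60_000, 30_000)` yields `Lateness.LATE`, while the same element seen when the watermark has reached `150_000` is classified as `Lateness.DROPPABLE`.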
