You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Using side outputs

As the name suggests, side inputs are data added to the main input from the side, while side outputs are data that a DoFn object emits outside of its main output PCollection. Let's start with side outputs, as they are more straightforward.

As an example, let's imagine we are processing data that arrives as JSON values, and we need to parse these messages into an internal object. But what should we do with values that cannot be parsed because they contain a syntax error? If we do not validate the messages before we store them in the stream (topic), it is entirely possible that we will encounter such a situation. We could silently drop those records, but that is not a great idea, because it can cause hard-to-debug problems. A much better option is to store these values on the side so that we can investigate and fix them. Therefore, we should aim to do the following:

Figure 3.8 – Main...
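
The routing idea described above can be sketched in plain Java, without the Beam dependency. This is only an illustration under stated assumptions, not the book's implementation: the class `SideOutputSketch`, the `Event` type, and the toy `parse` method (which accepts only a single-field object such as `{"clicks": 3}`) are hypothetical stand-ins, because the JDK ships no JSON parser. In an actual Beam pipeline, the two destinations would typically be declared as `TupleTag` instances and wired up with `ParDo.of(...).withOutputTags(...)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SideOutputSketch {

    // Hypothetical internal object that a valid message is parsed into.
    static final class Event {
        final String name;
        final long value;

        Event(String name, long value) {
            this.name = name;
            this.value = value;
        }
    }

    // Holds both destinations: parsed events (main output) and the raw
    // text of messages that could not be parsed (side output).
    static final class Outputs {
        final List<Event> parsed = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    // Toy stand-in for a real JSON parser (an assumption for this sketch):
    // it accepts only an object with one integer field, e.g. {"clicks": 3}.
    private static final Pattern SINGLE_FIELD =
        Pattern.compile("\\s*\\{\\s*\"(\\w+)\"\\s*:\\s*(-?\\d+)\\s*\\}\\s*");

    static Event parse(String json) {
        Matcher m = SINGLE_FIELD.matcher(json);
        if (!m.matches()) {
            throw new IllegalArgumentException("Cannot parse: " + json);
        }
        return new Event(m.group(1), Long.parseLong(m.group(2)));
    }

    // Routes every message to exactly one output instead of dropping failures.
    static Outputs process(List<String> messages) {
        Outputs out = new Outputs();
        for (String message : messages) {
            try {
                out.parsed.add(parse(message));
            } catch (IllegalArgumentException e) {
                // Keep the original text so it can be investigated and fixed.
                out.failed.add(message);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Outputs out = process(
            List.of("{\"clicks\": 3}", "{\"clicks\": ", "{\"views\": 10}"));
        System.out.println("parsed: " + out.parsed.size());
        System.out.println("failed: " + out.failed);
    }
}
```

Storing the raw, unparsed text in the side output (rather than only an error message) is a deliberate choice in this sketch: it preserves exactly what arrived, which is what makes later investigation and repair possible.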
