
You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author: Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Summary

In this chapter, we walked through the last fundamental transform of Apache Beam: the splittable DoFn transform. It acts as a unifying bridge between batch and streaming sources and lets us build reusable bounded and unbounded transforms that can be composed to deliver new functionality. As an example, we implemented a StreamingFileRead transform that composes two splittable DoFn transforms: one that watches a directory for new files, and another that reads the contents of those files and produces a PCollection of text lines from them. Note that we might reuse these transforms in different ways. The FileRead transform can be used to read filenames from Apache Kafka, thereby converting a Kafka stream of new filenames into a stream of the text lines contained in those files. The DirectoryWatch transform could be used as input to a transform that synchronizes files between two distinct locations. It is...
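
The composition described in this summary can be pictured with a minimal sketch in plain Python. This is not the Apache Beam API and it is not splittable; it only mirrors the shape of the pipeline, with one stage that reports newly seen files and another that turns file paths into text lines. The function names `watch_directory`, `read_lines`, and `streaming_file_read`, and the explicit `seen` set, are hypothetical and chosen for illustration only.

```python
import os
import tempfile
from typing import Iterable, Iterator, List, Set


def watch_directory(directory: str, seen: Set[str]) -> Iterator[str]:
    """One polling round: yield paths of files that were not reported before."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            yield path


def read_lines(paths: Iterable[str]) -> Iterator[str]:
    """Turn a stream of file paths into a stream of text lines."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")


def streaming_file_read(directory: str, seen: Set[str]) -> Iterator[str]:
    """Compose the two stages: new files in, text lines out."""
    return read_lines(watch_directory(directory, seen))


def demo() -> List[List[str]]:
    """Run two polling rounds against a temporary directory."""
    with tempfile.TemporaryDirectory() as d:
        seen: Set[str] = set()
        with open(os.path.join(d, "a.txt"), "w", encoding="utf-8") as f:
            f.write("first\nsecond\n")
        round_one = list(streaming_file_read(d, seen))
        # A file that appears later is picked up by the next round only.
        with open(os.path.join(d, "b.txt"), "w", encoding="utf-8") as f:
            f.write("third\n")
        round_two = list(streaming_file_read(d, seen))
        return [round_one, round_two]
```

Because `read_lines` accepts any iterable of paths, it could just as well be fed filenames arriving from another source, which is the kind of reuse the summary describes for the FileRead transform.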
