
You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Task 2 – Calculating the maximal length of a word in a stream

This task is similar to the previous one, in which we calculated the K most frequent words in a stream for a fixed time window. How would our solution change if the result had to be calculated from the beginning of the stream instead? Let's define the problem.

Defining the problem

Given an input data stream of lines of text, calculate the longest word ever seen in this stream. Start with an empty word value; once a longer word is seen, immediately output the new longest word.
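The required behavior can be sketched in plain Python, independent of Beam. This is only an illustration of the semantics, not the book's pipeline: the function name `longest_word_updates` is hypothetical, and splitting lines on whitespace is an assumption about how words are tokenized.

```python
from typing import Iterable, Iterator


def longest_word_updates(lines: Iterable[str]) -> Iterator[str]:
    """Yield a word each time it is strictly longer than every word seen so far."""
    longest = ""  # start with an empty word value, as the problem statement says
    for line in lines:
        for word in line.split():  # assumption: words are separated by whitespace
            if len(word) > len(longest):
                longest = word
                yield longest  # emit immediately when a longer word appears


stream = ["a bb", "ccc bb", "dddd e"]
print(list(longest_word_updates(stream)))  # prints ['a', 'bb', 'ccc', 'dddd']
```

Because the comparison is strict, a word that merely ties the current longest word produces no new output; the problem statement only asks for output when a longer word is seen.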

Discussing the problem decomposition

Although the logic seems to be similar to the previous task, it can be simplified as follows:

Figure 2.3 – The problem decomposition

Note that there are two main differences from the previous task:

  • We must compute the word with the longest length; although this could be viewed as a Top transform, with K equal to one, Beam has a specific transform for that...
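The observation that the longest word is a top-K selection with K equal to one can be illustrated in plain Python. This sketch uses only the standard library and does not show the Beam transform the text refers to; the sample words are made up for the example.

```python
import heapq

# Hypothetical sample data, chosen so that there is a single longest word.
words = ["beam", "pipeline", "io", "stream"]

# Top-K by length with K = 1 returns a one-element list...
top_one = heapq.nlargest(1, words, key=len)

# ...which holds the same word that a direct maximum by length returns.
longest = max(words, key=len)

print(top_one, longest)  # prints ['pipeline'] pipeline
```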