
You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.


Task 7 – Batching queries to an external RPC service

Let's imagine that the RPC service we used in Task 6 supports the batching of RPC queries. Batching is a technique for reducing network overhead by grouping multiple queries into a single one, thus increasing throughput. So, instead of querying our RPC service with each element, we would like to send multiple input elements in a single query.
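As a rough illustration of the idea, the following plain-Python sketch groups input elements into fixed-size batches and sends each batch in one call. It does not use the Beam API; `FakeRpcService`, `resolve_batch`, `batched`, and `augment` are hypothetical names, and the "service" simply returns the length of each query while counting how many calls it receives.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def batched(elements: Iterable[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most max_batch_size elements."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[str] = []
    for element in elements:
        batch.append(element)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


class FakeRpcService:
    """Hypothetical stand-in for a remote service; it only counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def resolve_batch(self, queries: Sequence[str]) -> List[int]:
        # One call here models one network round trip.
        self.calls += 1
        return [len(query) for query in queries]


def augment(
    elements: Iterable[str], service: FakeRpcService, max_batch_size: int
) -> List[Tuple[str, int]]:
    """Pair each input element with the service's response for it."""
    results: List[Tuple[str, int]] = []
    for batch in batched(elements, max_batch_size):
        responses = service.resolve_batch(batch)
        results.extend(zip(batch, responses))
    return results
```

With five inputs and a batch size of two, this sketch makes three calls instead of five. In a real pipeline the batch size would have to be balanced against the extra latency of waiting for a batch to fill.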

Defining the problem

Given an RPC service that supports the batching of requests for increasing throughput, use this service to augment the input data of a PCollection object. Be sure to preserve both the timestamp and the window assigned to each input element.
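To make the requirement concrete, here is a minimal plain-Python sketch, not Beam code and not necessarily the approach the book takes. It assumes each buffered element carries its own timestamp and window, sends only the values to a hypothetical batch endpoint (`fake_batch_rpc`, which is assumed to return one response per query, in order), and then re-emits every result with the metadata of the element it came from. The `Stamped` type and its fields are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Stamped:
    """An element together with the metadata that must survive batching."""
    value: str
    timestamp: int           # event time, e.g. epoch milliseconds
    window: Tuple[int, int]  # [start, end) of the assigned window


def fake_batch_rpc(queries: Sequence[str]) -> List[str]:
    """Hypothetical batch endpoint: one response per query, in order."""
    return [query.upper() for query in queries]


def augment_batch(
    batch: Sequence[Stamped],
) -> List[Tuple[str, str, int, Tuple[int, int]]]:
    """Query once for the whole batch, then re-emit each element
    with its own timestamp and window rather than the batch's."""
    responses = fake_batch_rpc([item.value for item in batch])
    return [
        (item.value, response, item.timestamp, item.window)
        for item, response in zip(batch, responses)
    ]
```

The point of the sketch is that a batch may mix elements from different windows, so the metadata has to travel with each element instead of being taken from whichever element happened to trigger the call.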

Discussing the problem decomposition

The first thing to notice is that unlike in Task 6, where we queried our RPC service with each element separately (and therefore, simply kept the timestamp and the window of the element untouched), in this case, we can have multiple elements with multiple timestamps...

