Reader small image

You're reading from  Building Big Data Pipelines with Apache Beam

Product typeBook
Published inJan 2022
Reading LevelBeginner
PublisherPackt
ISBN-139781800564930
Edition1st Edition
Languages
Right arrow
Author (1)
Jan Lukavský
Jan Lukavský
author image
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.
Read more about Jan Lukavský

Right arrow

Writing a custom data sink

As opposed to a data source, a data sink has much less work to do. Actually – in trivial cases – a data sink can be implemented using a plain ParDo object. In fact, we have already implemented one of these, which was PrintElements, located in the util module. The PrintElements transform can be considered a sink to stderr, as we can see from this implementation:

@Override
public PDone expand(PCollection<T> input) {
  input.apply(ParDo.of(new LogResultsFn<>()));
  return PDone.in(input.getPipeline());
}
private static class LogResultsFn<T> extends DoFn<T, Void> {
  @ProcessElement
  public void process(@Element T elem) {
    System.err.println(elem);
  }
}

This sink is very simplistic – a real-life solution would need some of the tools we already know. For example, batching RPCs using bundle life cycles via @StartBundle and @FinishBundle...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Building Big Data Pipelines with Apache Beam
Published in: Jan 2022Publisher: PacktISBN-13: 9781800564930

Author (1)

author image
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.
Read more about Jan Lukavský