Chapter 5: Using SQL for Pipeline Implementation

In the previous chapter, we explored how to view a stream as a changing table and vice versa. We also recalled that a table has a fancy name – a relation – and that a table that changes over time is called a Time-Varying Relation (TVR). In this chapter, we will use this knowledge to make our lives easier when implementing real-life problems. Instead of writing a full-blown pipeline in the Java SDK – which can sometimes be a little lengthy – we will use a well-known language to express our data transforms. As the name of this chapter suggests, this language will be Structured Query Language (SQL). The language itself needs some extensions to be able to manipulate the TVRs since the original version was not time-sensitive – the data was presumed to be static at the time of querying it.

Because SQL is a strongly typed language (a Domain Specific Language (DSL), actually), we will need to have strong...

Technical requirements

As always, please make sure that you have the current main branch cloned from this book's GitHub repository and that you have set up minikube based on the instructions provided in Chapter 1, Introducing Data Processing with Apache Beam. You will also need working knowledge of SQL, as we will use it extensively in this chapter. Once all of this is ready, we can dive directly into schemas!

Understanding schemas

A schema is a way of describing the (possibly nested) structure of a Java object. A typical Java object contains fields that have string names and data types.

Consider the following Java class:

public class Position {
  double latitude; 
  double longitude;
}

This class has two fields, latitude and longitude, both of which are of the double type. The matching Schema for this class would be as follows:

Position: Row
  latitude: double
  longitude: double

This notation declares a Position type with a schema of the Row type containing two fields, latitude and longitude, both of the double type. A Row is one of Apache Beam's built-in schema types with a nested structure – the others are Array, Iterable, and Map, with their usual definitions in computer science. The difference between Array and Iterable is that Iterable does not have a known size until it's iterated over. This...
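
As an aside, the same schema – and a Row conforming to it – can also be constructed programmatically. The following is a minimal sketch using the Schema builder and Row APIs from the Beam SDK (the coordinate values are purely illustrative):

Schema positionSchema = Schema.builder()
    .addDoubleField("latitude")
    .addDoubleField("longitude")
    .build();

// A Row conforming to the schema above; the values are illustrative only.
Row position = Row.withSchema(positionSchema)
    .addValues(50.0755, 14.4378)
    .build();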

Implementing our first streaming pipeline using SQL

We will follow the same path that we walked when we started playing with the Java SDK of Apache Beam. The very first pipeline we implemented, which was in Chapter 1, Introducing Data Processing with Apache Beam, was a pipeline that read input from a resource file named lorem.txt. Our goal was to process this file and output the number of occurrences of words within that file. So, let's see how our solution would differ if we used SQL to solve it!

We have implemented the equivalent of com.packtpub.beam.chapter1.FirstPipeline in com.packtpub.beam.chapter5.FirstSQLPipeline. The main differences are summarized here:

  1. First, we need to create a Schema that will represent our input. The input is raw lines of text as String objects, so a possible Schema representing it is a single-field Schema defined as follows:
    Schema lineSchema = Schema.of(
        Field.of("s", FieldType.STRING));
  2. We then attach...
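
To give a rough idea of where these steps lead, here is a minimal sketch – not the book's exact FirstSQLPipeline – that splits the input lines into words, wraps each word in a Row carrying the single-field schema, and counts occurrences using SqlTransform from the beam-sdks-java-extensions-sql module. Here, lines is assumed to be the PCollection<String> read from lorem.txt:

// lines is assumed to be a PCollection<String> read from lorem.txt.
Schema lineSchema = Schema.of(Field.of("s", FieldType.STRING));

// Split lines into words and reuse the single-field schema for individual words.
PCollection<Row> words = lines
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\s+"))))
    .apply(MapElements.into(TypeDescriptors.rows())
        .via((String word) -> Row.withSchema(lineSchema).addValue(word).build()))
    .setRowSchema(lineSchema);

// A single anonymous input is exposed to the query as the PCOLLECTION table.
PCollection<Row> counted = words.apply(
    SqlTransform.query("SELECT s, COUNT(*) AS c FROM PCOLLECTION GROUP BY s"));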

Task 14 – Implementing SQLMaxWordLength

In Chapter 2, Implementing, Testing, and Deploying Basic Pipelines, we implemented a pipeline called MaxWordLength. In this task, we will reimplement it using SQL and schemas. Note that although we already know how to structure code better and use PTransforms rather than static methods to transform one PCollection into another, we will keep the approach from the original chapter so that we can spot the differences and compare the two versions more easily.

For clarity, let's restate the problem.

Problem definition

Given a stream of text lines in Apache Kafka, create a stream consisting of the longest word seen in the stream from the beginning to the present. Use triggering to output the result as frequently as possible. Use Apache Beam SQL to implement the task whenever possible.
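
One note before decomposing the problem: triggering is not expressed in SQL itself, so one possible way to satisfy the "as frequently as possible" requirement is to configure a trigger on the input PCollection before handing it to SqlTransform. The following sketch is illustrative only and may differ from the book's actual solution:

// words is assumed to be the PCollection<Row> of single-word Rows.
PCollection<Row> triggered = words.apply(
    Window.<Row>configure()
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO));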

Problem decomposition discussion

The interesting parts will be centered around several problems that we need to solve:

    ...

Task 15 – Implementing SchemaSportTracker

In this section, we will reimplement a task from Chapter 2, Implementing, Testing, and Deploying Basic Pipelines. We have included this task to show how to overcome some limitations of SQL when using schemas – notably, the (current) inability to perform a user-defined aggregation (UDAF) over multiple fields. In our computation, we need to aggregate a composite (a Row) that has three fields – latitude, longitude, and timestamp.
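
To illustrate the composite just mentioned, the three values can be packed into a nested Row field so that an aggregation sees a single argument. This is only a sketch – the workoutId field name below is hypothetical:

Schema positionSchema = Schema.builder()
    .addDoubleField("latitude")
    .addDoubleField("longitude")
    .addInt64Field("timestamp")
    .build();

// The outer schema keys the stream by workout and carries the composite position.
Schema trackerSchema = Schema.builder()
    .addStringField("workoutId")
    .addRowField("position", positionSchema)
    .build();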

Again, for clarity, let's recap the definition of our problem.

Problem definition

Given a stream of GPS locations and timestamps for a workout of a specific user (a workout has an ID that is guaranteed to be unique among all users), compute the performance metrics for each workout. These metrics should contain the total duration and the distance covered from the start of the workout to the present.

Problem decomposition discussion

The actual business logic of computing the distance from GPS location...
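
For reference, a common way to approximate the distance between two GPS fixes is the haversine formula. The helper below is a self-contained sketch and not necessarily the book's exact implementation:

// Great-circle distance between two latitude/longitude pairs, in meters,
// using the haversine formula and a mean Earth radius of 6,371 km.
static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
  double earthRadiusMeters = 6_371_000.0;
  double dLat = Math.toRadians(lat2 - lat1);
  double dLon = Math.toRadians(lon2 - lon1);
  double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
      + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
          * Math.sin(dLon / 2) * Math.sin(dLon / 2);
  return 2 * earthRadiusMeters * Math.asin(Math.sqrt(a));
}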

Task 16 – Implementing SQLSportTrackerMotivation

In this task, we will explore the benefits that the SQL DSL brings when it comes to more complex pipelines composed of several aggregations, joins, and so on. Again, as a recap, let's restate the problem definition.

Problem definition

Given a GPS location stream per workout (the same as in the previous task), create another stream that indicates whether the runner increased or decreased their pace in the past minute by more than 10% compared to their average pace over the last 5 minutes. Again, use the SQL DSL as much as possible.

The test and deployment are the same as in the corresponding SportTracker task, so we will skip them here. Instead, we will demonstrate how SQL (and schemas) can help us when we are dealing with joins – which is what we did when we were implementing our SportTrackerMotivation example. So, let's reimplement that as well!
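
To sketch what this can look like, SqlTransform can also be applied to a PCollectionTuple, in which case each tag name becomes a table name that the query can join on. The snippet below is illustrative only – the tag names, field names, and query differ from the actual SQLSportTrackerMotivation code:

// shortPace and longPace are assumed to be PCollection<Row> aggregates with
// compatible windowing, each containing workoutId and pace fields.
PCollection<Row> ratios = PCollectionTuple
    .of("short_avg", shortPace)
    .and("long_avg", longPace)
    .apply(SqlTransform.query(
        "SELECT s.workoutId, s.pace / l.pace AS ratio "
            + "FROM short_avg s JOIN long_avg l ON s.workoutId = l.workoutId"));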

Problem decomposition discussion

In the original...

Further development of Apache Beam SQL

In this section, we will sum up the possible further development of Apache Beam SQL and which parts are currently missing or somewhat incomplete.

At the end of the previous chapter, we described retract and upsert streams and defined time-varying relations on top of these streams. Although Apache Beam does contain generic retractions as part of its model, they are not implemented at the moment. The same is true for SQL. Among other things, this implies that Apache Beam SQL currently does not support full stream-to-stream joins, only windowed joins.

A windowed join, by itself, does not guarantee that retractions will not be needed, but when using a default trigger without allowed lateness – or a trigger that fires only after the end of the window plus the allowed lateness – no retractions are needed. The reason for this is that all the data is projected onto the timestamp at the end of the window, and the window ends...

Summary

In this chapter, we learned how unbounded streams of data can be viewed as time-varying relations and, as such, are suitable to be queried using SQL. We saw how standard SQL needs to be adjusted to fit streaming needs – we introduced three special functions called TUMBLE, HOP, and SESSION to be used in the GROUP BY clauses of SQL to apply a windowing strategy within SQL statements.
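
For example, a tumbling one-minute window in Beam SQL can be expressed directly in the GROUP BY clause. The table and column names in this snippet are hypothetical:

// Illustrative only: count rows per workout in one-minute tumbling windows.
PCollection<Row> perMinute = input.apply(
    SqlTransform.query(
        "SELECT workoutId, COUNT(*) AS positions "
            + "FROM PCOLLECTION "
            + "GROUP BY workoutId, TUMBLE(ts, INTERVAL '1' MINUTE)"));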

We explored that the prerequisite for applying Apache Beam SQL to a PCollection is to create a PCollection<Row>, where Row represents the relational view of a stream, broken down into a structure with a given Schema, which represents the individual (possibly nested) fields of the data elements inside the PCollection. We also learned how to automatically infer a schema from a given type using the @DefaultSchema annotation with a SchemaProvider such as JavaFieldSchema or JavaBeanSchema. When we cannot (or do not want to) use @DefaultSchema, we can set the schema on a PCollection manually...
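
As a small illustration of the inference route mentioned above, a class can opt in to schema inference like this (a minimal sketch using the annotation and provider named in this chapter):

// Fields must be visible to JavaFieldSchema; here they are simply public.
@DefaultSchema(JavaFieldSchema.class)
public class Position {
  public double latitude;
  public double longitude;
}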
