Chapter 5: Using SQL for Pipeline Implementation

In the previous chapter, we explored how to view a stream as a changing table and vice versa. We also recalled that a table has a fancy name – a relation – and that a table that changes over time is called a Time-Varying Relation (TVR). In this chapter, we will use this knowledge to make our lives easier when implementing real-life problems. Instead of writing a full-blown pipeline in the Java SDK – which can sometimes be a little lengthy – we will use a well-known language to express our data transforms. As the name of this chapter suggests, this language will be Structured Query Language (SQL). The language itself needs some extensions to be able to manipulate the TVRs since the original version was not time-sensitive – the data was presumed to be static at the time of querying it.

Because SQL is a strongly typed language (a Domain Specific Language (DSL), actually), we will need to have strong...

Technical requirements

As always, please make sure that you have the current main branch cloned from this book's GitHub repository and that you have set up minikube based on the instructions provided in Chapter 1, Introducing Data Processing with Apache Beam. You will also need working knowledge of SQL, as we will use it extensively in this chapter. Once all of this is ready, we can dive directly into schemas!

Understanding schemas

A schema is a way of describing the (possibly nested) structure of a Java object. A typical Java object contains fields that have string names and data types.

Consider the following Java class:

public class Position {
  double latitude; 
  double longitude;
}

This class has two fields, latitude and longitude, both of which are of the double type. The matching Schema for this class would be as follows:

Position: Row
  latitude: double
  longitude: double

This notation declares a Position type with a schema of the Row type containing two fields, latitude and longitude, both of the double type. A Row is one of Apache Beam's built-in schema types with a nested structure – the others are Array, Iterable, and Map, with their usual definitions in computer science. The difference between Array and Iterable is that Iterable does not have a known size until it's iterated over. This...
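
As an aside, the same schema – and a Row conforming to it – can also be constructed programmatically. The following is a minimal sketch using the Schema builder and Row APIs from the Beam SDK (the coordinate values are purely illustrative):

Schema positionSchema = Schema.builder()
    .addDoubleField("latitude")
    .addDoubleField("longitude")
    .build();

// A Row conforming to the schema above; the values are illustrative only.
Row position = Row.withSchema(positionSchema)
    .addValues(50.0755, 14.4378)
    .build();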

Implementing our first streaming pipeline using SQL

We will follow the same path that we walked when we started playing with the Java SDK of Apache Beam. The very first pipeline we implemented, which was in Chapter 1, Introducing Data Processing with Apache Beam, was a pipeline that read input from a resource file named lorem.txt. Our goal was to process this file and output the number of occurrences of words within that file. So, let's see how our solution would differ if we used SQL to solve it!

We have implemented the equivalent of com.packtpub.beam.chapter1.FirstPipeline in com.packtpub.beam.chapter5.FirstSQLPipeline. The main differences are summarized here:

  1. First, we need to create a Schema that will represent our input. The input is raw lines of text as String objects, so a possible Schema representing it is a single-field Schema defined as follows:
    Schema lineSchema = Schema.of(
        Field.of("s", FieldType.STRING));
  2. We then attach...
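
To give a rough idea of where these steps lead, here is a minimal sketch – not the book's exact FirstSQLPipeline – that splits the input lines into words, wraps each word in a Row carrying the single-field schema, and counts occurrences using SqlTransform from the beam-sdks-java-extensions-sql module. Here, lines is assumed to be the PCollection<String> read from lorem.txt:

// lines is assumed to be a PCollection<String> read from lorem.txt.
Schema lineSchema = Schema.of(Field.of("s", FieldType.STRING));

// Split lines into words and reuse the single-field schema for individual words.
PCollection<Row> words = lines
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split("\\s+"))))
    .apply(MapElements.into(TypeDescriptors.rows())
        .via((String word) -> Row.withSchema(lineSchema).addValue(word).build()))
    .setRowSchema(lineSchema);

// A single anonymous input is exposed to the query as the PCOLLECTION table.
PCollection<Row> counted = words.apply(
    SqlTransform.query("SELECT s, COUNT(*) AS c FROM PCOLLECTION GROUP BY s"));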

Task 14 – Implementing SQLMaxWordLength

In Chapter 2, Implementing, Testing, and Deploying Basic Pipelines, we implemented a pipeline called MaxWordLength. In this task, we will reimplement it using SQL and schemas. Note that although we already know how to structure code better and use PTransforms rather than static methods to transform one PCollection into another, we will keep the approach from the original chapter so that we can spot the differences and compare the two versions more easily.

For clarity, let's restate the problem.

Problem definition

Given a stream of text lines in Apache Kafka, create a stream consisting of the longest word seen in the stream from the beginning to the present. Use triggering to output the result as frequently as possible. Use Apache Beam SQL to implement the task whenever possible.
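
One note before decomposing the problem: triggering is not expressed in SQL itself, so one possible way to satisfy the "as frequently as possible" requirement is to configure a trigger on the input PCollection before handing it to SqlTransform. The following sketch is illustrative only and may differ from the book's actual solution:

// words is assumed to be the PCollection<Row> of single-word Rows.
PCollection<Row> triggered = words.apply(
    Window.<Row>configure()
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO));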

Problem decomposition discussion

The interesting parts will be centered around several problems that we need to solve:

    ...

Task 15 – Implementing SchemaSportTracker

In this section, we will reimplement a task from Chapter 2, Implementing, Testing, and Deploying Basic Pipelines. We have included this task to show how to overcome some limitations of SQL when using schemas – notably, the (current) inability to perform a user-defined aggregation (UDAF) over multiple fields. In our computation, we need to aggregate a composite (a Row) that has three fields – latitude, longitude, and timestamp.
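
To illustrate the composite just mentioned, the three values can be packed into a nested Row field so that an aggregation sees a single argument. This is only a sketch – the workoutId field name below is hypothetical:

Schema positionSchema = Schema.builder()
    .addDoubleField("latitude")
    .addDoubleField("longitude")
    .addInt64Field("timestamp")
    .build();

// The outer schema keys the stream by workout and carries the composite position.
Schema trackerSchema = Schema.builder()
    .addStringField("workoutId")
    .addRowField("position", positionSchema)
    .build();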

Again, for clarity, let's recap the definition of our problem.

Problem definition

Given a stream of GPS locations and timestamps for a workout of a specific user (a workout has an ID that is guaranteed to be unique among all users), compute the performance metrics for each workout. These metrics should contain the total duration and the distance covered from the start of the workout to the present.

Problem decomposition discussion

The actual business logic of computing the distance from GPS location...
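
For reference, a common way to approximate the distance between two GPS fixes is the haversine formula. The helper below is a self-contained sketch and not necessarily the book's exact implementation:

// Great-circle distance between two latitude/longitude pairs, in meters,
// using the haversine formula and a mean Earth radius of 6,371 km.
static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
  double earthRadiusMeters = 6_371_000.0;
  double dLat = Math.toRadians(lat2 - lat1);
  double dLon = Math.toRadians(lon2 - lon1);
  double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
      + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
          * Math.sin(dLon / 2) * Math.sin(dLon / 2);
  return 2 * earthRadiusMeters * Math.asin(Math.sqrt(a));
}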

Task 16 – Implementing SQLSportTrackerMotivation

In this task, we will explore the benefits that the SQL DSL brings when it comes to more complex pipelines composed of several aggregations, joins, and so on. Again, as a recap, let's restate the problem definition.

Problem definition

Given a GPS location stream per workout (the same as in the previous task), create another stream that indicates whether the runner increased or decreased their pace in the past minute by more than 10% compared to their average pace over the last 5 minutes. Again, use the SQL DSL as much as possible.

The test and deployment are the same as in the corresponding SportTracker task, so we will skip them here. Instead, we will demonstrate how SQL (and schemas) can help us when we are dealing with joins – which is what we did when we were implementing our SportTrackerMotivation example. So, let's reimplement that as well!
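
To sketch what this can look like, SqlTransform can also be applied to a PCollectionTuple, in which case each tag name becomes a table name that the query can join on. The snippet below is illustrative only – the tag names, field names, and query differ from the actual SQLSportTrackerMotivation code:

// shortPace and longPace are assumed to be PCollection<Row> aggregates with
// compatible windowing, each containing workoutId and pace fields.
PCollection<Row> ratios = PCollectionTuple
    .of("short_avg", shortPace)
    .and("long_avg", longPace)
    .apply(SqlTransform.query(
        "SELECT s.workoutId, s.pace / l.pace AS ratio "
            + "FROM short_avg s JOIN long_avg l ON s.workoutId = l.workoutId"));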

Problem decomposition discussion

In the original...

Further development of Apache Beam SQL

In this section, we will sum up the possible further development of Apache Beam SQL and which parts are currently missing or somewhat incomplete.

At the end of the previous chapter, we described retract and upsert streams and defined time-varying relations on top of these streams. Although Apache Beam does contain generic retractions as part of its model, they are not implemented at the moment. The same is true for SQL. Among other things, this implies that Apache Beam SQL currently does not support full stream-to-stream joins, only windowed joins.

A windowed join, by itself, does not guarantee that retractions will not be needed, but when using a default trigger without allowed lateness – or a trigger that fires only after the end of the window plus the allowed lateness – no retractions are needed. The reason for this is that all the data is projected onto the timestamp at the end of the window, and the window ends...

Summary

In this chapter, we learned how unbounded streams of data can be viewed as time-varying relations and, as such, are suitable to be queried using SQL. We saw how standard SQL needs to be adjusted to fit streaming needs – we introduced three special functions called TUMBLE, HOP, and SESSION to be used in the GROUP BY clauses of SQL to apply a windowing strategy within SQL statements.
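
For example, a tumbling one-minute window in Beam SQL can be expressed directly in the GROUP BY clause. The table and column names in this snippet are hypothetical:

// Illustrative only: count rows per workout in one-minute tumbling windows.
PCollection<Row> perMinute = input.apply(
    SqlTransform.query(
        "SELECT workoutId, COUNT(*) AS positions "
            + "FROM PCOLLECTION "
            + "GROUP BY workoutId, TUMBLE(ts, INTERVAL '1' MINUTE)"));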

We explored that the prerequisite for applying Apache Beam SQL to a PCollection is to create a PCollection<Row>, where Row represents the relational view of a stream, broken down into a structure with a given Schema, which represents the individual (possibly nested) fields of the data elements inside the PCollection. We also learned how to automatically infer a schema from a given type using the @DefaultSchema annotation with a SchemaProvider such as JavaFieldSchema or JavaBeanSchema. When we cannot (or do not want to) use @DefaultSchema, we can set the schema on a PCollection manually...
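
As a small illustration of the inference route mentioned above, a class can opt in to schema inference like this (a minimal sketch using the annotation and provider named in this chapter):

// Fields must be visible to JavaFieldSchema; here they are simply public.
@DefaultSchema(JavaFieldSchema.class)
public class Position {
  public double latitude;
  public double longitude;
}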
