You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st Edition
Author (1)
Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Task 15 – Implementing SchemaSportTracker

In this section, we will reimplement a task from Chapter 2, Implementing, Testing, and Deploying Basic Pipelines. The task is included to show how to work around some limitations of SQL when using schemas, notably the (current) inability of a user-defined aggregate function (UDAF) to aggregate over multiple fields. Our computation needs to aggregate a composite (a Row) that has three fields: latitude, longitude, and timestamp.

Again, for clarity, let's recap the definition of our problem.

Problem definition

Given a stream of GPS locations and timestamps for a workout of a specific user (a workout has an ID that is guaranteed to be unique among all users), compute the performance metrics for each workout. These metrics should contain the total duration elapsed and the total distance covered from the start of the workout to the present.
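
To make the problem definition concrete, here is a minimal plain-Java sketch of the per-workout aggregation, written without any Beam APIs. It is not the book's implementation, which this excerpt does not show. The names WorkoutMetrics, Position, and Metrics are hypothetical, and the sketch assumes the haversine great-circle formula with a mean Earth radius of 6,371 km as the distance measure. It illustrates why the three fields have to be aggregated together: the distance depends on consecutive (latitude, longitude) pairs ordered by timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class; the names are illustrative and not taken from the book.
class WorkoutMetrics {

  /** One GPS fix: the three fields that must be aggregated as a single composite. */
  static final class Position {
    final double latitude;
    final double longitude;
    final long timestampMillis;

    Position(double latitude, double longitude, long timestampMillis) {
      this.latitude = latitude;
      this.longitude = longitude;
      this.timestampMillis = timestampMillis;
    }
  }

  /** Aggregated result for a single workout. */
  static final class Metrics {
    final double distanceMeters;
    final long durationMillis;

    Metrics(double distanceMeters, long durationMillis) {
      this.distanceMeters = distanceMeters;
      this.durationMillis = durationMillis;
    }
  }

  // Assumption: mean Earth radius, which is good enough for a sketch.
  static final double EARTH_RADIUS_METERS = 6_371_000.0;

  /** Great-circle distance between two fixes, using the haversine formula. */
  static double haversineMeters(Position a, Position b) {
    double lat1 = Math.toRadians(a.latitude);
    double lat2 = Math.toRadians(b.latitude);
    double dLat = lat2 - lat1;
    double dLon = Math.toRadians(b.longitude - a.longitude);
    double h = Math.pow(Math.sin(dLat / 2), 2)
        + Math.cos(lat1) * Math.cos(lat2) * Math.pow(Math.sin(dLon / 2), 2);
    return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(h));
  }

  /** Sorts the fixes of one workout by timestamp, then sums distance and duration. */
  static Metrics compute(List<Position> positions) {
    List<Position> sorted = new ArrayList<>(positions);
    sorted.sort(Comparator.comparingLong((Position p) -> p.timestampMillis));
    double distance = 0.0;
    for (int i = 1; i < sorted.size(); i++) {
      distance += haversineMeters(sorted.get(i - 1), sorted.get(i));
    }
    long duration = sorted.isEmpty()
        ? 0L
        : sorted.get(sorted.size() - 1).timestampMillis - sorted.get(0).timestampMillis;
    return new Metrics(distance, duration);
  }

  public static void main(String[] args) {
    // Two fixes one minute apart, deliberately given out of order.
    Metrics m = compute(Arrays.asList(
        new Position(1.0, 0.0, 60_000L),
        new Position(0.0, 0.0, 0L)));
    System.out.println(m.distanceMeters + " m in " + m.durationMillis + " ms");
  }
}
```

The sketch sorts by timestamp before summing because elements of a stream may arrive out of order. In a pipeline, the same logic would have to run per workout ID over the whole composite, which is the multi-field aggregation that the section sets out to address.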

Problem decomposition discussion

The actual business logic of computing the distance from GPS location...
