You're reading from Building Big Data Pipelines with Apache Beam

Product type: Book
Published in: Jan 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800564930
Edition: 1st
Author: Jan Lukavský

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Task 18 – Implementing SportTracker in the Python SDK

This task will be a reimplementation of Task 5 from Chapter 2, Implementing, Testing, and Deploying Basic Pipelines. Again, for clarity, let's restate the problem definition.

Problem definition

Given an input data stream of quadruples (workoutId, gpsLatitude, gpsLongitude, timestamp), calculate the current speed and the total tracked distance. The data comes from a GPS tracker that sends data only when the user starts a sports activity. We can assume that workoutId is unique and contains userId in it.
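To make the geometry concrete (this is an illustrative sketch, not the book's code, and the helper names are mine): the distance between two consecutive GPS fixes can be approximated with the haversine formula, and the current speed is that distance divided by the timestamp delta, while the total distance is a running sum over consecutive fixes:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in meters between two GPS fixes."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def track_metrics(fixes):
    """Fold timestamp-ordered (lat, lon, ts_seconds) fixes of one workout
    into a list of (total_distance_m, current_speed_m_s), one per new fix."""
    total, out, prev = 0.0, [], None
    for fix in fixes:
        if prev is not None:
            d = haversine_m(prev[0], prev[1], fix[0], fix[1])
            dt = fix[2] - prev[2]
            total += d
            out.append((total, d / dt if dt > 0 else 0.0))
        prev = fix
    return out
```

This assumes fixes arrive ordered by timestamp; handling out-of-order data is exactly the caveat the original Task 5 deals with.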

The caveats of the implementation are the same as what we discussed in the original Task 5, so we'll skip to its Python SDK implementation right away.

Solution implementation

The complete implementation can be found in the source code of this chapter, in chapter6/src/main/python/sport_tracker.py. The logic is concentrated in two functions – SportTrackerCalc and computeMetrics:

  1. The...