Processing Streaming Data with Pub/Sub and Dataflow

Processing streaming data is becoming increasingly popular because it enables businesses to get real-time metrics on their operations. In this chapter, we will understand which paradigm should be used for streaming data, and when. We will also learn how to apply transformations to streaming data using Cloud Dataflow, as well as how to store the processed records in BigQuery for analysis.

Learning about streaming data is easier by doing, so we will complete some exercises in which we create a streaming data pipeline on Google Cloud Platform (GCP). We will use two GCP services, Pub/Sub and Dataflow, both of which are essential for creating a streaming data pipeline. At the end of this chapter, we will compare the streaming approach with the batch approach that we learned about in Chapter 5, Building a Data Lake Using Dataproc.

Here are the topics that we will discuss in this chapter:

    ...

Technical requirements

Before we begin this chapter, you must have a few prerequisites ready.

In this chapter’s exercises, we will use Dataflow, Pub/Sub, Google Cloud Storage (GCS), and BigQuery. If you’ve never opened any of these services in your GCP console, you can open them now and enable their application programming interfaces (APIs).

Also, make sure you have your GCP console, Cloud Shell, and Cloud Shell Editor ready.

You can download the example code and the dataset from here: https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-6.

Be aware of the cost that might arise from Dataflow streaming. Make sure you delete all the environments after executing the exercises in this chapter to prevent unexpected charges.

This chapter will use the same data from Chapter 5, Building a Data Lake Using Dataproc. You can choose to use the same data or prepare new data from the preceding GitHub repository...

Processing streaming data

In the big data era, people like to correlate big data with real-time data. Some people even say that if the data is not real time, then it's not big data. This statement is only partially true. In reality, the majority of data pipelines in the world use the batch approach, which is why it is still very important for data engineers to understand batch data pipelines. From Chapter 3, Building a Data Warehouse in BigQuery, to Chapter 5, Building a Data Lake Using Dataproc, we focused on handling batch data pipelines.

However, real-time capabilities in the big data era are something that many data engineers need to rethink in terms of data architecture. To discuss that architecture, we first need a clear definition of what real-time data is.

From the end user perspective, real-time data can mean anything – from faster access to data, more frequent data refresh, and detecting events as soon as they happen. From a...

Introduction to Pub/Sub

Pub/Sub is a messaging system. A messaging system receives messages from multiple systems and distributes them to multiple other systems. The key here is multiple systems: a messaging system needs to be able to act as a bridge, or middleware, between many different systems.

The following diagram provides a high-level picture of Pub/Sub:

Figure 6.4 – Pub/Sub terminologies and flows

To understand how to use Pub/Sub, we need to understand the four main terms used in Pub/Sub, as follows:

  • Publisher

    The publisher is the entry point of Pub/Sub; it is how messages enter the system. Users can write code to publish messages from their applications using programming languages such as Java, Python, Go, C++, C#, Hypertext Preprocessor (PHP), and Ruby. Pub/Sub stores the published messages in topics. A minimal Python publisher sketch follows this list.

  • Topic

    The central point of Pub/Sub is the topic. Pub/Sub stores messages in its internal storage. The sets of...
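
To make the publisher concrete, here is a minimal sketch that publishes a single message using the google-cloud-pubsub Python client library. It is only an illustration; the project ID, topic name, and record fields are placeholders, not values from the book's exercises.

import json

from google.cloud import pubsub_v1

project_id = "your-project-id"   # placeholder: your GCP project ID
topic_id = "bike-sharing-trips"  # placeholder: your topic name

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# Messages are always published as bytes, so we serialize a small JSON
# record before publishing. The field names here are illustrative only.
record = {"trip_id": "t-001", "start_station_id": 12, "duration_sec": 642}
data = json.dumps(record).encode("utf-8")

# publish() returns a future; result() blocks until Pub/Sub acknowledges
# the message and returns the server-assigned message ID.
future = publisher.publish(topic_path, data=data)
print(f"Published message ID: {future.result()}")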

Introduction to Dataflow

Dataflow is a data processing engine that can handle both batch and streaming data pipelines. Compared with the technologies that we have already learned about in this book, Dataflow is closest to Spark in terms of positioning: both can process big data, both process data in parallel, and both can handle almost any kind of data or file.

In terms of underlying technology, however, they are different. From the user's perspective, the main difference is the serverless nature of Dataflow. With Dataflow, we don't need to set up any cluster; we just submit a job, and the data pipeline runs automatically in the cloud. We write the data pipeline itself using Apache Beam.
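
To illustrate this pattern, here is a minimal, hedged sketch of a Beam pipeline in Python. It is not the book's exercise code; it only shows that the same pipeline can run locally with the Direct Runner or on Dataflow simply by changing the pipeline options (the project, region, and bucket values are placeholder assumptions).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code runs locally (DirectRunner) or on Dataflow
# (DataflowRunner); only the options change. All values are placeholders.
options = PipelineOptions(
    runner="DirectRunner",         # switch to "DataflowRunner" to run on Dataflow
    # project="your-project-id",   # required when using DataflowRunner
    # region="us-central1",
    # temp_location="gs://your-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create input" >> beam.Create(["hello", "world", "hello"])
        | "Pair with one" >> beam.Map(lambda word: (word, 1))
        | "Count per word" >> beam.CombinePerKey(sum)
        | "Print result" >> beam.Map(print)
    )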

If you have finished reading Chapter 5, Building a Data Lake Using Dataproc, you will know that Dataproc also offers a serverless option for Spark. At the time of writing, this feature is relatively new compared to Dataflow. There are still...

Exercise – publishing event streams to Pub/Sub

In this exercise, we will try to stream data from Pub/Sub publishers. The goal is to create a data pipeline that streams the data to a BigQuery table, but instead of using a scheduler (as we did in Chapter 4, Building Workflows for Batch Data Loading Using Cloud Composer), we will submit a Dataflow job that runs continuously as an application, flowing data from Pub/Sub to a BigQuery table. In this exercise, we will use the bike-sharing dataset from Chapter 3, Building a Data Warehouse in BigQuery. Here are the overall steps we will cover:

  1. Creating a Pub/Sub topic.
  2. Creating and running a Pub/Sub publisher using Python.
  3. Creating a Pub/Sub subscription.

We’ll start by creating a Pub/Sub topic.

Creating a Pub/Sub topic

We can create Pub/Sub topics in several ways, for example, through the GCP console, with the gcloud command, or through code. As a starter, let's use the GCP console. Proceed...
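
If you later prefer code over the console, the following sketch creates a topic with the Python client library. This is optional and purely illustrative; the project ID and topic name are placeholders.

from google.cloud import pubsub_v1

project_id = "your-project-id"   # placeholder
topic_id = "bike-sharing-trips"  # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# create_topic() raises google.api_core.exceptions.AlreadyExists
# if a topic with this name already exists.
topic = publisher.create_topic(request={"name": topic_path})
print(f"Created topic: {topic.name}")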

Exercise – using Dataflow to stream data from Pub/Sub to GCS

In this exercise, we will learn how to develop Beam code in Python to create data pipelines. Learning Beam can be challenging at first because you need to get used to its specific coding pattern, so we will start with some HelloWorld code. The benefit of Beam is that it is a general framework: you can create a batch or a streaming pipeline with very similar code, and you can run it with different runners. In this exercise, we will use the Direct Runner and Dataflow. As a summary, here are the steps:

  1. Creating a HelloWorld application using Apache Beam.
  2. Creating a Dataflow streaming job without aggregation.
  3. Creating a Dataflow streaming job with aggregation.

To get started, check out the code for this exercise:

https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/blob/main/chapter-6/code/beam_helloworld.py.
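
Before diving into the steps, the following rough sketch shows the general shape of a streaming Beam pipeline: read from a Pub/Sub subscription, apply a fixed window, aggregate, and write results to BigQuery. This is not the exercise code from the repository; the subscription, dataset, table, and field names are placeholder assumptions.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

# All identifiers below are placeholders, not the book's exercise values.
options = PipelineOptions(
    runner="DataflowRunner",            # or "DirectRunner" for local testing
    project="your-project-id",
    region="us-central1",
    temp_location="gs://your-bucket/temp",
)
options.view_as(StandardOptions).streaming = True  # mark the job as streaming

subscription = "projects/your-project-id/subscriptions/bike-sharing-subscription"
table = "your-project-id:your_dataset.station_trip_counts"

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(subscription=subscription)
        | "Parse JSON" >> beam.Map(json.loads)
        | "Key by station" >> beam.Map(lambda r: (r["start_station_id"], 1))
        | "Fixed 60s window" >> beam.WindowInto(FixedWindows(60))
        | "Count per station" >> beam.CombinePerKey(sum)
        | "To table row" >> beam.Map(
            lambda kv: {"start_station_id": kv[0], "trip_count": kv[1]})
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            table,
            schema="start_station_id:INTEGER,trip_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )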

Creating a...

Introduction to CDC and Datastream

Now that we've learned about streaming with Pub/Sub and Dataflow, let's get a better idea of how data starts being pushed from the source system to BigQuery. Unfortunately, in the real world, there are many cases in which you can't change the source system's code at all, which means you can't add a Pub/Sub publisher to publish records for streaming.

This can happen for many reasons. For example, in an industry such as banking, the core application is usually a monolithic product developed by third-party vendors. Even if it is developed internally, the complexity of the core banking system makes it difficult to change the code to add a Pub/Sub publisher at every data point. How can we solve this?

Think back to our batch pipelines: we extract data from tables in databases by exporting the tables into files and loading those files into BigQuery. Can we do the same for streaming?

Unfortunately...

Exercise – Datastream ETL streaming to BigQuery

In this exercise, we will walk through setting up a streaming pipeline using Datastream. The pipeline will involve creating and configuring various GCP components to move and transform data from a CloudSQL MySQL table to a BigQuery dataset.

This exercise will be heavy on configurations and will use many different GCP components. I suggest that you open six browser tabs, one for each of the GCP components. To guide you, please use the following diagram as your checklist to make sure all the components are configured correctly:

Figure 6.24 – Datastream end-to-end steps

Most of the steps other than Datastream and Dataflow were covered in this chapter or previous ones. I will not go through every step in too much detail. This is a good chance for you to review your understanding of what we’ve learned so far. Let’s start with the first step.

Step 1 – create...

Summary

In this chapter, we learned about streaming data and how to handle incoming data as soon as it is created. In our exercises, data was created using the Pub/Sub publisher client; in practice, you can use this approach by asking the application developers to send messages to Pub/Sub as the data source. A second option is to use a CDC tool. In GCP, you can use the Google-provided CDC tool called Datastream. CDC tools can be attached to a backend database such as CloudSQL to publish data changes such as insert, update, and delete operations.

The second part of streaming data is how to process it. In this chapter, we learned how to use Dataflow to handle continuously incoming data from Pub/Sub, aggregate it on the fly, and store it in BigQuery tables. Keep in mind that you can also process data from Pub/Sub with Dataflow in a batch manner.

With experience in creating streaming data pipelines on GCP, you will realize how easy it is to start creating one from an infrastructure...
