Processing Streaming Data with Pub/Sub and Dataflow

Processing streaming data is becoming increasingly popular because it enables businesses to get real-time metrics on their operations. In this chapter, we will understand which paradigm should be used for streaming data, and when. We will also learn how to apply transformations to streaming data using Cloud Dataflow, as well as how to store the processed records in BigQuery for analysis.

Learning about streaming data is easier by doing, so we will complete some exercises in which we create a streaming data pipeline on Google Cloud Platform (GCP). We will use two GCP services, Pub/Sub and Dataflow, both of which are essential for creating a streaming data pipeline. At the end of this chapter, we will compare the streaming approach with the batch approach that we learned about in Chapter 5, Building a Data Lake Using Dataproc.

Here are the topics that we will discuss in this chapter:

    ...

Technical requirements

Before we begin this chapter, you must have a few prerequisites ready.

In this chapter’s exercises, we will use Dataflow, Pub/Sub, Google Cloud Storage (GCS), and BigQuery. If you’ve never opened any of these services in your GCP console, you can open them now and enable their application programming interfaces (APIs).

Also, make sure you have your GCP console, Cloud Shell, and Cloud Shell Editor ready.

You can download the example code and the dataset from here: https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-6.

Be aware of the cost that might arise from Dataflow streaming. Make sure you delete all the environments after executing the exercises in this chapter to prevent unexpected charges.

This chapter will use the same data from Chapter 5, Building a Data Lake Using Dataproc. You can choose to use the same data or prepare new data from the preceding GitHub repository...

Processing streaming data

In the big data era, people like to correlate big data with real-time data. Some people even say that if the data is not real time, then it's not big data. This statement is only partially true. In reality, the majority of data pipelines in the world use the batch approach, which is why it is still very important for data engineers to understand batch data pipelines. From Chapter 3, Building a Data Warehouse in BigQuery, to Chapter 5, Building a Data Lake Using Dataproc, we focused on handling batch data pipelines.

However, real-time capabilities in the big data era are something that many data engineers need to rethink in terms of data architecture. To discuss that architecture, we first need a clear definition of what real-time data is.

From the end user perspective, real-time data can mean anything – from faster access to data, more frequent data refresh, and detecting events as soon as they happen. From a...

Introduction to Pub/Sub

Pub/Sub is a messaging system. A messaging system receives messages from multiple systems and distributes them to multiple other systems. The key here is multiple systems: a messaging system needs to be able to act as a bridge, or middleware, between many different systems.

The following diagram provides a high-level picture of Pub/Sub:

Figure 6.4 – Pub/Sub terminologies and flows

To understand how to use Pub/Sub, we need to understand the four main terms used in Pub/Sub, as follows:

  • Publisher

    The publisher is the entry point of Pub/Sub; it is how messages enter the system. Users can write code to publish messages from their applications using programming languages such as Java, Python, Go, C++, C#, Hypertext Preprocessor (PHP), and Ruby. Pub/Sub stores the published messages in topics. A minimal Python publisher sketch follows this list.

  • Topic

    The central point of Pub/Sub is the topic. Pub/Sub stores messages in its internal storage. The sets of...
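
To make the publisher concrete, here is a minimal sketch that publishes a single message using the google-cloud-pubsub Python client library. It is only an illustration; the project ID, topic name, and record fields are placeholders, not values from the book's exercises.

import json

from google.cloud import pubsub_v1

project_id = "your-project-id"   # placeholder: your GCP project ID
topic_id = "bike-sharing-trips"  # placeholder: your topic name

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# Messages are always published as bytes, so we serialize a small JSON
# record before publishing. The field names here are illustrative only.
record = {"trip_id": "t-001", "start_station_id": 12, "duration_sec": 642}
data = json.dumps(record).encode("utf-8")

# publish() returns a future; result() blocks until Pub/Sub acknowledges
# the message and returns the server-assigned message ID.
future = publisher.publish(topic_path, data=data)
print(f"Published message ID: {future.result()}")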

Introduction to Dataflow

Dataflow is a data processing engine that can handle both batch and streaming data pipelines. Compared with the technologies that we have already learned about in this book, Dataflow is closest to Spark in terms of positioning: both can process big data, both process data in parallel, and both can handle almost any kind of data or file.

In terms of underlying technology, however, they are different. From the user's perspective, the main difference is the serverless nature of Dataflow. With Dataflow, we don't need to set up any cluster; we just submit a job, and the data pipeline runs automatically in the cloud. We write the data pipeline itself using Apache Beam.
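
To illustrate this pattern, here is a minimal, hedged sketch of a Beam pipeline in Python. It is not the book's exercise code; it only shows that the same pipeline can run locally with the Direct Runner or on Dataflow simply by changing the pipeline options (the project, region, and bucket values are placeholder assumptions).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code runs locally (DirectRunner) or on Dataflow
# (DataflowRunner); only the options change. All values are placeholders.
options = PipelineOptions(
    runner="DirectRunner",         # switch to "DataflowRunner" to run on Dataflow
    # project="your-project-id",   # required when using DataflowRunner
    # region="us-central1",
    # temp_location="gs://your-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create input" >> beam.Create(["hello", "world", "hello"])
        | "Pair with one" >> beam.Map(lambda word: (word, 1))
        | "Count per word" >> beam.CombinePerKey(sum)
        | "Print result" >> beam.Map(print)
    )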

If you have finished reading Chapter 5, Building a Data Lake Using Dataproc, you will know that Dataproc also offers a serverless option for Spark. At the time of writing, this feature is relatively new compared to Dataflow. There are still...

Exercise – publishing event streams to Pub/Sub

In this exercise, we will try to stream data from Pub/Sub publishers. The goal is to create a data pipeline that streams the data to a BigQuery table, but instead of using a scheduler (as we did in Chapter 4, Building Workflows for Batch Data Loading Using Cloud Composer), we will submit a Dataflow job that runs continuously as an application, flowing data from Pub/Sub to a BigQuery table. In this exercise, we will use the bike-sharing dataset from Chapter 3, Building a Data Warehouse in BigQuery. Here are the overall steps we will cover:

  1. Creating a Pub/Sub topic.
  2. Creating and running a Pub/Sub publisher using Python.
  3. Creating a Pub/Sub subscription.

We’ll start by creating a Pub/Sub topic.

Creating a Pub/Sub topic

We can create Pub/Sub topics in several ways, for example, through the GCP console, with the gcloud command, or through code. As a starter, let's use the GCP console. Proceed...
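
If you later prefer code over the console, the following sketch creates a topic with the Python client library. This is optional and purely illustrative; the project ID and topic name are placeholders.

from google.cloud import pubsub_v1

project_id = "your-project-id"   # placeholder
topic_id = "bike-sharing-trips"  # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# create_topic() raises google.api_core.exceptions.AlreadyExists
# if a topic with this name already exists.
topic = publisher.create_topic(request={"name": topic_path})
print(f"Created topic: {topic.name}")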

Exercise – using Dataflow to stream data from Pub/Sub to GCS

In this exercise, we will learn how to develop Beam code in Python to create data pipelines. Learning Beam can be challenging at first because you need to get used to its specific coding pattern, so we will start with some HelloWorld code. The benefit of Beam is that it is a general framework: you can create a batch or a streaming pipeline with very similar code, and you can run it with different runners. In this exercise, we will use the Direct Runner and Dataflow. As a summary, here are the steps:

  1. Creating a HelloWorld application using Apache Beam.
  2. Creating a Dataflow streaming job without aggregation.
  3. Creating a Dataflow streaming job with aggregation.

To get started, check out the code for this exercise:

https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/blob/main/chapter-6/code/beam_helloworld.py.
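
Before diving into the steps, the following rough sketch shows the general shape of a streaming Beam pipeline: read from a Pub/Sub subscription, apply a fixed window, aggregate, and write results to BigQuery. This is not the exercise code from the repository; the subscription, dataset, table, and field names are placeholder assumptions.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

# All identifiers below are placeholders, not the book's exercise values.
options = PipelineOptions(
    runner="DataflowRunner",            # or "DirectRunner" for local testing
    project="your-project-id",
    region="us-central1",
    temp_location="gs://your-bucket/temp",
)
options.view_as(StandardOptions).streaming = True  # mark the job as streaming

subscription = "projects/your-project-id/subscriptions/bike-sharing-subscription"
table = "your-project-id:your_dataset.station_trip_counts"

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(subscription=subscription)
        | "Parse JSON" >> beam.Map(json.loads)
        | "Key by station" >> beam.Map(lambda r: (r["start_station_id"], 1))
        | "Fixed 60s window" >> beam.WindowInto(FixedWindows(60))
        | "Count per station" >> beam.CombinePerKey(sum)
        | "To table row" >> beam.Map(
            lambda kv: {"start_station_id": kv[0], "trip_count": kv[1]})
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            table,
            schema="start_station_id:INTEGER,trip_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )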

Creating a...

Introduction to CDC and Datastream

Now that we've learned about streaming with Pub/Sub and Dataflow, let's get a better idea of how data starts being pushed from the source system to BigQuery. Unfortunately, in the real world, there are many cases in which you can't change the source system's code at all, which means you can't add a Pub/Sub publisher to publish records for streaming.

This can happen for many reasons. For example, in an industry such as banking, the core application is usually a monolithic product developed by third-party vendors. Even if it is developed internally, the complexity of the core banking system makes it difficult to change the code to add a Pub/Sub publisher at every data point. How can we solve this?

Think back to our batch pipelines: we extract data from tables in databases by exporting the tables into files and loading those files into BigQuery. Can we do the same for streaming?

Unfortunately...

Exercise – Datastream ETL streaming to BigQuery

In this exercise, we will walk through setting up a streaming pipeline using Datastream. The pipeline will involve creating and configuring various GCP components to move and transform data from a CloudSQL MySQL table to a BigQuery dataset.

This exercise will be heavy on configurations and will use many different GCP components. I suggest that you open six browser tabs, one for each of the GCP components. To guide you, please use the following diagram as your checklist to make sure all the components are configured correctly:

Figure 6.24 – Datastream end-to-end steps

Most of the steps other than Datastream and Dataflow were covered in this chapter or previous ones. I will not go through every step in too much detail. This is a good chance for you to review your understanding of what we’ve learned so far. Let’s start with the first step.

Step 1 – create...

Summary

In this chapter, we learned about streaming data and how to handle incoming data as soon as it is created. In our exercises, data was created using the Pub/Sub publisher client; in practice, you can use this approach by asking the application developers to send messages to Pub/Sub as the data source. A second option is to use a CDC tool. In GCP, you can use the Google-provided CDC tool called Datastream. CDC tools can be attached to a backend database such as CloudSQL to publish data changes such as insert, update, and delete operations.

The second part of streaming data is how to process it. In this chapter, we learned how to use Dataflow to handle continuously incoming data from Pub/Sub, aggregate it on the fly, and store it in BigQuery tables. Keep in mind that you can also process data from Pub/Sub with Dataflow in a batch manner.

With experience in creating streaming data pipelines on GCP, you will realize how easy it is to start creating one from an infrastructure...
