Building Workflows for Batch Data Loading Using Cloud Composer

In data engineering, a workflow is a set of configurations that automates tasks, jobs, and their dependencies. In the context of a database workflow, this means automating the table creation process.

The main objects in any database system are tables, and one of the main differences between an application database and a data warehouse lies in how tables are created. Whereas tables in application databases are mostly static and created to support frontend applications, tables in data warehouses are dynamic. The table is the main product, as it is a collection of business logic and data flows.

In this chapter, we will learn how to create the workflow for our data warehouse tables from Chapter 3, Building a Data Warehouse in BigQuery. We will learn how to automate table creation using a Google Cloud Platform (GCP) service called Cloud Composer. This will include how to create...

Technical requirements

For this chapter’s exercises, we will need the following:

Introduction to Cloud Composer

Cloud Composer is a managed Airflow service in GCP. Using Cloud Composer, we don’t need to think about the infrastructure, installation, and software management for the Airflow environment. This lets us focus solely on the development and deployment of our data pipeline.

From the perspective of a data engineer, there is almost no difference between Cloud Composer and Airflow. When we learn how to use Cloud Composer, we are essentially learning how Airflow works.

Now, what is Airflow? Airflow is an open source workflow management tool. It comes with many features to support workflows, including monitoring, logging, a user interface, backfilling, and metadata management. What is unique about Airflow is that workflows are managed with Python scripts, which is very friendly for data engineers in general.
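To make this concrete, the following is a minimal sketch of what a workflow defined in Python looks like. This assumes Airflow 2.x; the DAG ID, schedule, and task are illustrative placeholders, not code from the book’s exercises:

```python
# A minimal Airflow DAG: one file, one schedule, one task (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="hello_composer",          # unique name shown in the Airflow UI
    start_date=datetime(2024, 1, 1),  # first logical date the DAG can run for
    schedule="@daily",                # run once per day
    catchup=False,                    # don't backfill past dates automatically
) as dag:
    # A do-nothing task, just to make the DAG valid and visible in the UI.
    start = EmptyOperator(task_id="start")
```

Dropping a file like this into the DAGs folder is all it takes for Airflow to pick it up, parse it, and schedule it.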

There are several elements in that statement that we need to look at. Let’s break it down in more detail. When talking...

Understanding the working of Airflow

Airflow handles all three of the preceding elements using Python scripts. As data engineers, what we need to do is write Python code for handling task dependencies, scheduling our jobs, and integrating with other systems. This is different from traditional extract, transform, load (ETL) tools. If you have ever heard of or used tools such as Control-M, Informatica, Talend, or many other ETL tools, Airflow has the same positioning as these tools. The difference is that Airflow is not a user interface (UI)-based drag-and-drop tool; Airflow is designed for you to write the workflow as code.
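To illustrate those three elements in one place, here is a hedged sketch of a DAG that declares a schedule and chains three tasks; the task IDs and echo commands are placeholders standing in for real work:

```python
# Task dependencies and scheduling expressed as ordinary Python (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dependency_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # the schedule lives in the same file as the tasks
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")

    # The >> operator declares ordering: extract, then load, then transform.
    extract >> load >> transform
```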

There are a couple of good reasons why managing the workflow using code is a better idea than using drag-and-drop tools:

  • Using code, you can automate a lot of development and deployment processes
  • Using code, it’s easier for you to enable good testing practices (see the sketch after this list)
  • All the configurations can be managed...
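Picking up the second point, here is a minimal sketch of a DAG integrity test that could run in continuous integration before DAGs are deployed; the dags/ folder path is an assumption about the project layout:

```python
# A DAG integrity test (pytest style): fail fast on broken DAG files.
from airflow.models import DagBag


def test_dags_load_without_errors():
    # Parse every DAG file the same way the Airflow scheduler would.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)

    # Any syntax error or broken import ends up in import_errors.
    assert dag_bag.import_errors == {}
```

Running a test like this with pytest catches broken imports before the files ever reach the Cloud Composer environment.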

Cloud Composer 1 vs Cloud Composer 2

Currently, there are two main Cloud Composer versions: Cloud Composer 1 and Cloud Composer 2. Don’t confuse these with the Airflow version; Airflow has its own versioning from the open source project. At the time of writing, for example, the latest Airflow version is 2.5.3.

What is the main difference between the two major Cloud Composer versions? Several aspects set Cloud Composer 1 and Cloud Composer 2 apart, but one of the most significant is how the managed Airflow service works. In Cloud Composer 1, the number of Airflow workers is pre-defined for each environment. Cloud Composer 2 introduces auto-scaling for the number of Airflow workers.

For example, in Cloud Composer 1, you need to choose the number of workers when you create the environment; let’s say three workers. This environment will always occupy three virtual machines, with or without workloads running on top of them. In other cases, if there are...

Provisioning Cloud Composer in a GCP project

In order to develop our Airflow code, we will need a Cloud Composer environment. In this section, we will create our first Cloud Composer environment using the GCP console.

Follow these steps to create our environment:

  1. Go to the GCP Console navigation bar.
  2. Find and click on Composer in the pinned services or, if you haven’t pinned it, find it in the ANALYTICS section, as illustrated in the following screenshot:

Figure 4.1 – Composer button in the navigation bar

  3. After clicking on Composer in the navigation bar, you may be asked to enable the API (if you have already enabled it, you can skip this step). After enabling the API, you will land on a new web page, which we will call the Cloud Composer Console. Let’s continue our steps.
  4. Click CREATE ENVIRONMENT – Composer 2 on the Cloud Composer Console web page.
  5. Choose your Composer environment name.
  6. Choose...

Exercise – build data pipeline orchestration using Cloud Composer

We will continue our bike-sharing scenario from Chapter 3, Building a Data Warehouse in BigQuery. Please finish Chapter 3 before going through this exercise.

This Cloud Composer exercise will be divided into five different DAG levels. Each DAG level will have specific learning objectives, as follows:

  • Level 1: Learn how to create a DAG and submit it to Cloud Composer
  • Level 2: Learn how to use operators
  • Level 3: Learn how to use variables
  • Level 4: Learn how to apply task idempotency (see the sketch after this list)
  • Level 5: Learn how to handle DAG dependencies using an Airflow dataset
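As a small preview of the Level 4 idea, here is a hedged sketch of an idempotent load task. The project, dataset, table, and query are illustrative assumptions rather than the book’s exact exercise code, and it assumes the Google provider package (apache-airflow-providers-google) is installed:

```python
# An idempotent daily load: rerunning a date replaces that date's partition.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="idempotent_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # {{ ds }} / {{ ds_nodash }} are the logical run date. Writing to the
    # matching partition with WRITE_TRUNCATE means a rerun for the same date
    # overwrites that day's data instead of duplicating it.
    load_trips = BigQueryInsertJobOperator(
        task_id="load_trips",
        configuration={
            "query": {
                "query": (
                    "SELECT * FROM `my-project.raw.trips` "
                    "WHERE DATE(start_time) = '{{ ds }}'"
                ),
                "destinationTable": {
                    "projectId": "my-project",   # hypothetical project
                    "datasetId": "dwh",          # hypothetical dataset
                    "tableId": "trips${{ ds_nodash }}",  # per-date partition
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )
```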

It’s important for you to understand that learning Airflow starts out as easy as the Level 1 DAG. But as we go through each of the levels, you will see the challenges and opportunities we may encounter in practice.

In reality, you can choose to follow all of the best practices or none at all—Airflow won’t forbid you from doing that...

Summary

In this chapter, we learned about Cloud Composer and, through it, how to work with Airflow. We saw that, as an open source tool, Airflow has a wide range of features. We focused on how to use Airflow to help us build a data pipeline for our BigQuery data warehouse. There are many more features and capabilities in Airflow that are not covered in this book. You can always expand your skills in this area, but you will already have a good foundation after finishing this chapter.

As a tool, Airflow is fairly simple: you just need to know how to write a Python script to define DAGs. We learned in the Level 1 DAG exercise that simple code is enough to build your first DAG. Complications arise when it comes to best practices, since there are many best practices you can follow, and just as many potential bad practices that Airflow developers can fall into.

By learning the examples...
