You're reading from Data Engineering with Google Cloud Platform

Product type: Book
Published in: Mar 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800561328
Edition: 1st

Author (1)
Adi Wijaya

Adi Wijaya is a strategic cloud data engineer at Google. He holds a bachelor's degree in computer science from Binus University and co-founded DataLabs in Indonesia. Currently, he dedicates himself to big data and analytics and has spent a good chunk of his career helping global companies in different industries.
Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer

Orchestration is a set of configurations that automates tasks, jobs, and their dependencies. In the context of a database, orchestration means automating the table creation process.

The main objects in any database system are tables, and one of the key differences between an application database and a data warehouse lies in how tables are created. Tables in application databases are mostly static, created once to support the application; tables in a data warehouse are dynamic. They are products: collections of business logic and data flows.

In this chapter, we will learn how to orchestrate our data warehouse tables from Chapter 3, Building a Data Warehouse in BigQuery. We will learn how to automate table creation using a Google Cloud Platform (GCP) service called Cloud Composer. This will include how to create a new Cloud Composer environment...

Technical requirements

For this chapter's exercises, we will need the following:

Steps on how to access, create, or configure the technical requirements will be provided later in each exercise.

Introduction to Cloud Composer

Cloud Composer is a managed Airflow service in GCP. With Cloud Composer, we don't need to worry about the infrastructure, installation, and software management for the Airflow environment. This lets us focus solely on developing and deploying our data pipelines.

From the perspective of a data engineer, there is almost no difference between Cloud Composer and Airflow: when we learn how to use Cloud Composer, we are in practice learning how Airflow works.

Now, what is Airflow? Airflow is an open source workflow management tool. What is unique about Airflow is that we use Python scripts to manage our workflows.

Let's break that statement down into more detail. A workflow management tool has three main components, as follows:

  • Task dependency handling
  • Scheduling
  • System integration
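
To make the first component concrete: a workflow is a directed acyclic graph (DAG) of tasks, and the tool's job is to run each task only after its upstream tasks have finished. The following stand-alone sketch (the task names are hypothetical, not code from this book) shows the idea behind resolving a run order from declared dependencies:

```python
# A tiny dependency-resolution sketch: each task lists its upstream tasks.
deps = {
    "extract": [],            # no upstream tasks
    "transform": ["extract"],
    "load": ["transform"],
}

def run_order(deps):
    """Return tasks in an order where every task follows its upstreams."""
    order, done = [], set()
    while len(done) < len(deps):
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)
                done.add(task)
    return order

print(run_order(deps))  # ['extract', 'transform', 'load']
```

A real workflow tool layers scheduling and system integration on top of exactly this kind of dependency resolution.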

As a workflow management tool...

Understanding the working of Airflow

Airflow handles all three of the preceding components using Python scripts. As data engineers, what we need to do is write Python code to handle task dependencies, schedule our jobs, and integrate with other systems. This is different from traditional extract, transform, load (ETL) tools. If you have ever heard of or used tools such as Control-M, Informatica, or Talend, Airflow occupies the same position as these tools. The difference is that Airflow is not a user interface (UI)-based drag-and-drop tool; Airflow is designed for you to write the workflow as code.
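
As an illustration, here is what such a Python script looks like: a minimal DAG file touching all three components. This is a sketch, assuming Airflow 2.x with `BashOperator`; the DAG and task names are made up for this example and are not from the book's exercises.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG file is Python code, but it reads like configuration:
# it declares tasks, a schedule, and dependencies.
with DAG(
    dag_id="hello_bike_sharing",     # name shown in the Airflow UI
    schedule_interval="@daily",      # the scheduling component
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    # System integration: BashOperator runs a shell command; other
    # operators integrate with BigQuery, Cloud Storage, and so on.
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    # Task dependency handling: run extract before load.
    extract >> load
```

Dropping a file like this into the Cloud Composer environment's DAG folder is all it takes for the scheduler to pick it up.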

There are a couple of good reasons why managing workflows as code is a better idea than using drag-and-drop tools. Here's why we should do this:

  • Using code, you can automate a lot of development and deployment processes.
  • Using code, it's easier for you to enable good testing practices.
  • All the configurations can be managed in...

Exercise: Build data pipeline orchestration using Cloud Composer

We will continue our bike-sharing scenario from Chapter 3, Building a Data Warehouse in BigQuery.

This exercise will be divided into five different DAG (directed acyclic graph) levels. Each DAG level has specific learning objectives, as follows:

  • Level 1: Learn how to create a DAG and submit it to Cloud Composer.
  • Level 2: Learn how to create a BigQuery DAG.
  • Level 3: Learn how to use variables.
  • Level 4: Learn how to apply task idempotency.
  • Level 5: Learn how to handle late data.

It's important for you to understand that learning Airflow is as easy as the Level 1 DAG, but as we go through each level, you will see the challenges that arise in practice.

In reality, you can choose to follow all of the best practices or none of them; Airflow won't stop you. With this leveling approach, you can learn step by step, from the simplest to the most complicated...
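
To preview the Level 4 idea: a task is idempotent when rerunning it for the same date produces the same result, which for a load job typically means overwriting that date's partition rather than appending to the table. This pure-Python sketch (no Airflow required; the table and row values are made up) illustrates the pattern:

```python
# Idempotent load: rerunning for the same date replaces, never duplicates.
warehouse = {}  # partition date -> rows, standing in for a partitioned table

def load_partition(table, date, rows):
    """Overwrite the partition for `date` (analogous to a truncate-and-write
    load into a date-partitioned table), so reruns are safe."""
    table[date] = list(rows)

load_partition(warehouse, "2022-03-01", ["trip-1", "trip-2"])
load_partition(warehouse, "2022-03-01", ["trip-1", "trip-2"])  # rerun

print(warehouse["2022-03-01"])  # still two rows, not four
```

An append-style load would have left four rows after the rerun; the overwrite makes retries and backfills harmless, which is what the Level 4 exercise is about.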

Summary

In this chapter, we learned about Cloud Composer. Because Cloud Composer is a managed Airflow service, learning it meant learning how to work with Airflow. We saw that, as an open source tool, Airflow has a wide range of features; we focused on using Airflow to build a data pipeline for our BigQuery data warehouse. Many more Airflow features and capabilities are not covered in this book. You can always expand your skills in this area, but you will already have a good foundation after finishing this chapter.

As a tool, Airflow is fairly simple: you just need to know how to write a Python script to define DAGs. We learned in the Level 1 DAG exercise that simple code is enough to build your first DAG, but complications arise when it comes to best practices, since there are many best practices you can follow, and just as many bad practices that Airflow developers can fall into.

By learning the examples...

