
You're reading from Data Engineering with Google Cloud Platform - Second Edition

Product type: Book
Published in: Apr 2024
Publisher: Packt
ISBN-13: 9781835080115
Edition: 2nd Edition
Author (1)
Adi Wijaya

Adi Wijaya is a strategic cloud data engineer at Google. He holds a bachelor's degree in computer science from Binus University and co-founded DataLabs in Indonesia. He currently focuses on big data and analytics and has spent much of his career helping global companies across different industries.

CI/CD on GCP for Data Engineers

Continuous integration/continuous deployment (CI/CD) is a common concept for DevOps engineers, but data engineers increasingly need to understand it and apply it to their own development work. This chapter will cover the necessary concepts of CI/CD and provide examples of how to apply CI/CD to our Cloud Composer example from Chapter 4, Building Workflows for Batch Data Loading Using Cloud Composer. Upon completing this chapter, you will understand what CI/CD is, why and when it’s needed in data engineering, and what GCP services are needed for it.

In this chapter, we will cover the following topics:

  • An introduction to CI/CD
  • Understanding CI/CD components with GCP services
  • Exercise – implementing CI using Cloud Build
  • Exercise – deploying Cloud Composer jobs using Cloud Build

Let’s look at the technical requirements for the exercises in this chapter:

Technical requirements

For this chapter’s exercises, we will use the following GCP services:

  • Cloud Build
  • Cloud Source Repositories
  • Google Container Registry
  • Cloud Composer (optional)

If you have never opened any of these services in your GCP console, open them and enable the necessary APIs. Make sure that you have your GCP console, Cloud Shell, and Cloud Shell Editor ready.

You can download the example code and the dataset for this chapter here: https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-12/code.

An introduction to CI/CD

CI/CD is an engineering practice that combines two methods – continuous integration and continuous deployment:

  • CI is a method that’s used to automatically integrate code from multiple engineers who collaborate in a repository. Automatic integration can take the form of unit testing the code, building Docker images, integration testing, or any other steps that are required to integrate the code.

    The main benefit of CI is that you can integrate code changes from many developers as quickly as possible. With this practice, you can detect errors quickly and locate the issues more easily.

  • CD is a method that’s used to automatically deploy the final code to the production environment. Automatic deployment can take the form of pushing the built software to the production server, moving the necessary files to their destination, or any other steps that are required to deploy the final application.

The main benefit of CD is that you...
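To make the CI idea concrete, here is a minimal sketch of the kind of automated check a CI step might run on every code change. The function normalize_station_name and its test cases are hypothetical and are not taken from the book's repository; the only assumption is that the team keeps unit tests next to its pipeline code.

```python
import unittest


def normalize_station_name(raw_name: str) -> str:
    """Hypothetical transformation: collapse whitespace and title-case a name."""
    return " ".join(raw_name.split()).title()


class NormalizeStationNameTest(unittest.TestCase):
    """Checks that a CI step would run on every code change."""

    def test_collapses_whitespace(self):
        self.assertEqual(normalize_station_name("  market   st "), "Market St")

    def test_empty_input_stays_empty(self):
        self.assertEqual(normalize_station_name(""), "")


def run_ci_checks() -> bool:
    """Run the test case and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeStationNameTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A CI step would typically turn the boolean returned by run_ci_checks() into a process exit code, so that a failing test stops the pipeline before anything is deployed.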

Understanding CI/CD components with GCP services

A CI/CD pipeline consists of several steps, and each step may involve different tools or GCP services. To understand this concept better, let’s take a look at the following diagram:

Figure 12.1 – The CI/CD steps and the GCP services involved

The diagram shows the high-level steps of a complete CI/CD pipeline. At each step, there is a corresponding GCP service that can handle that step. For example, the first step is Source Code. In GCP, you can create a Git repository using a service called Cloud Source Repositories. Later, in the Exercise – implementing CI using Cloud Build section, we will learn how to create one. For now, let’s understand the steps and what GCP services are involved:

  1. The CI process starts with the source code. This source code should always be managed in a Git repository. It can be hosted on GitHub, GitLab, or any other Git provider. As we mentioned previously, GCP...

Exercise – implementing CI using Cloud Build

In this exercise, we will create a CI pipeline using Cloud Build. There will be four main steps, as follows:

  1. Creating a Git repository using a Cloud Source Repository.
  2. Developing the code and Cloud Build scripts.
  3. Creating a Cloud Build trigger.
  4. Pushing the code to the Git repository.
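As a rough preview of step 2, the following sketch generates a Cloud Build configuration from Python. Cloud Build accepts its build config in YAML or JSON, so the dictionary below can be written out as a cloudbuild.json file. This is not the book's script: the image name, the step IDs, and the assumption that the image contains Python and a discoverable test suite are all hypothetical. $PROJECT_ID and $SHORT_SHA are Cloud Build substitutions, and $SHORT_SHA is only populated for builds started by a trigger.

```python
import json


def build_ci_config(image_name: str) -> dict:
    """Return a Cloud Build config that builds an image and runs its unit tests.

    image_name is a hypothetical name chosen for illustration.
    """
    image = f"gcr.io/$PROJECT_ID/{image_name}:$SHORT_SHA"
    return {
        "steps": [
            {
                # Build the Docker image from the Dockerfile in the repo root.
                "id": "build-image",
                "name": "gcr.io/cloud-builders/docker",
                "args": ["build", "-t", image, "."],
            },
            {
                # Run the tests inside the freshly built image (assumes the
                # image ships Python and the project's test modules).
                "id": "run-unit-tests",
                "name": "gcr.io/cloud-builders/docker",
                "args": ["run", "--rm", image, "python", "-m", "unittest", "discover"],
            },
        ],
        # Images listed here are pushed only if every step succeeds.
        "images": [image],
    }


def render_config(config: dict) -> str:
    """Serialize the config in the JSON form that Cloud Build can read."""
    return json.dumps(config, indent=2)
```

Writing render_config(build_ci_config("my-data-app")) to a file named cloudbuild.json in the repository root would give the trigger created in step 3 something to execute. Keeping the test step before the push means a broken commit never produces a published image.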

Let’s get started!

Creating a Git repository using a Cloud Source Repository

First, let’s prepare the Git repository. Follow these steps:

  1. Go to your GCP console and find Source Repository in the navigation menu. It’s located under the CI/CD section:

Figure 12.2 – The Source Repository option in the navigation menu

  2. After clicking the menu, a new browser tab will open with the Cloud Source Repository console. Click the Get started button on this screen to create a repository:

Figure 12.3 – The Get started button

Exercise – deploying Cloud Composer jobs using Cloud Build

In this section, we will continue creating a Cloud Build pipeline. This time, I will show how this practice can be applied in data engineering. To do that, we will create a CI/CD pipeline that deploys a Cloud Composer DAG.

In this exercise, we will use the DAG from Chapter 4, Building Workflows for Batch Data Loading Using Cloud Composer. Let’s briefly recap the exercises from that chapter.

In Chapter 4, Building Workflows for Batch Data Loading Using Cloud Composer, we learned how Cloud Composer works. We learned that in Cloud Composer, you can develop DAGs to create data pipelines. These data pipelines can use Airflow operators to manage BigQuery, Cloud SQL, GCS, or simple Bash scripts. In those exercises, we practiced five levels of DAGs, with the level-one DAG being the simplest and the level-five DAG being the most complex. To deploy a DAG,...
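Because a DAG is an ordinary Python file, one inexpensive check that a pipeline could run before deployment is to confirm that every DAG file at least parses. The sketch below uses only the standard library and is an illustration, not the book's exercise code: the function name and the idea of a local dags directory are assumptions, and a syntax check is much weaker than loading the DAGs with Airflow itself.

```python
import ast
from pathlib import Path


def find_broken_dag_files(dags_dir: str) -> dict[str, str]:
    """Map each unparsable .py file under dags_dir to a short error message."""
    broken = {}
    for path in sorted(Path(dags_dir).rglob("*.py")):
        try:
            # Parse only; the DAG code is never executed by this check.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as err:
            broken[str(path)] = f"line {err.lineno}: {err.msg}"
    return broken
```

A build step could call find_broken_dag_files("dags") and exit with a non-zero status when the returned dictionary is not empty, so that a file with a syntax error never reaches the Cloud Composer environment.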

CI/CD best practices in data engineering

After learning how to use the GCP tools to implement CI/CD, you now hopefully have an idea of the possibilities available to a data engineering team when implementing CI/CD on a data platform.

You may start to think of some ideas, such as the following:

  • Checking how clean your code is
  • Checking your data quality before going into production
  • Planning automatic testing in different environments
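The data quality idea in the list above can be sketched as a small gate that a pipeline runs against a sample of rows before promoting a change. Everything here is hypothetical: the QualityReport class, the check_rows function, and the rule set (non-empty input, no missing required values, no duplicate keys) are illustrative choices rather than a standard.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    """Summary of the checks for one batch of rows."""

    row_count: int
    null_violations: int
    duplicate_keys: int

    @property
    def passed(self) -> bool:
        # The batch passes only if it is non-empty and has no violations.
        return (
            self.row_count > 0
            and self.null_violations == 0
            and self.duplicate_keys == 0
        )


def check_rows(rows: list[dict], key: str, required: list[str]) -> QualityReport:
    """Count rows with missing required values and rows that repeat a key."""
    null_violations = sum(
        1 for row in rows if any(row.get(col) is None for col in required)
    )
    keys = [row.get(key) for row in rows]
    duplicate_keys = len(keys) - len(set(keys))
    return QualityReport(len(rows), null_violations, duplicate_keys)
```

For example, check_rows(rows, key="trip_id", required=["trip_id", "start_date"]) returns a report whose passed property a build step can use to decide whether to continue. The column names are placeholders for whatever the team's tables actually contain.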

In data engineering specifically, after working with dozens of companies from many industries, I haven’t seen any gold standard for implementing CI/CD. It depends on the skill set of the team, the size of the team, the complexity of the systems, and the budget.

What makes this topic exciting is that it’s still evolving. Even though there is no golden rule and the possibilities are almost endless, I have seen some patterns. In this section, I will share my thoughts on the considerations and best practices I&...

Summary

In this chapter, we learned how CI/CD works with GCP services. More specifically, we learned about this from the perspective of a data engineer. CI/CD is a big topic by itself and is more mature in software development. However, it has recently become more and more common for data engineers in big organizations to follow this practice.

We started this chapter by talking about high-level concepts and ended it with an exercise that showed how data engineers can use CI/CD in a data project. We used Cloud Build, a Cloud Source Repository, and Google Container Registry in the exercises. Understanding these concepts and what kind of technologies are involved were the two main goals of this chapter. If you want to learn more about DevOps practices, containers, and unit testing, check out the links in the Further reading section.

This was the final technical chapter in this book. If you have read all the chapters in this book, then you’ve learned about...

Further reading

To learn more about the topics that were covered in this chapter, take a look at the following resources:
