Building Machine Learning Solutions on GCP

The first machine learning (ML) solutions date back to the 1950s, and as most of you know, ML has become immensely popular in recent years. It’s undeniable that the discussion of artificial intelligence (AI) and ML is one of the hottest topics of the 21st century. There are two main drivers of this: one is the advancement of infrastructure, while the second is data. This second driver brings us, as data engineers, into the ML area.

In my experience of discussing ML with data engineers, there are two very different reactions – extreme excitement or outright resistance. Before you lose interest in finishing this chapter, I want to be clear about what we are going to cover.

We are not going to approach ML through historical stories or its mathematical aspects. Instead, I am going to prepare you, as data engineers, for potential ML involvement in your GCP environment.

As we learn about the...

Technical requirements

For this chapter’s exercises, we will use the following GCP services:

  • BigQuery
  • GCS
  • Vertex AI Pipelines
  • Vertex AI AutoML
  • Google Cloud Vision AI
  • Google Cloud Translate

If you have never opened any of these services in your GCP console, open them now and enable the respective APIs if required.

Also, make sure you have your GCP console, Cloud Shell, and Cloud Editor ready.

Finally, download the example code and the dataset from https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-8.

Now, let’s get started!

A quick look at ML

First, let’s understand what ML is from a data engineering perspective. ML is a data process that takes data as input. The output of the process is a generalized formula for one specific objective; this formula is called the ML model.

As an illustration, let’s imagine some real-world use cases that use ML. The first example is a recommendation system on an eCommerce platform. Such a platform may use its customers’ purchase history as input data, which can be processed to calculate how likely each customer is to purchase other items in the future. Another example, from the healthcare industry, is a cancer predictor that uses X-ray images. A collection of X-ray images, with and without cancer, can be used as input data to train a model that predicts cancer in unseen X-ray images.
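To make that definition concrete, here is a minimal, hypothetical sketch of the input-data-to-model flow using scikit-learn; the numbers are stand-ins, not a real dataset:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical input data: each row is a customer's purchase history features,
# and each label is whether they later bought a given item (1) or not (0).
features = [[5, 2, 0], [1, 0, 3], [4, 1, 1], [0, 0, 2]]
labels = [1, 0, 1, 0]

# Training generalizes the input data into a reusable formula: the ML model.
model = RandomForestClassifier(random_state=42)
model.fit(features, labels)

# The model can now predict the objective for unseen data.
print(model.predict([[3, 1, 0]]))
```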

I believe you’ve heard of those and many other real-world ML use cases. Even in the latest hype surrounding generative AI,...

Exercise – practicing ML code using Python

In this section, we will use some of the terminology introduced in the previous section and practice creating a fairly simple ML solution using Python. The focus is on understanding the steps and starting to use the correct terminology.

For this exercise, we will be using Cloud Editor and Cloud Shell. You have probably heard that the most common tool data scientists use for creating ML models is the Jupyter Notebook. There are two reasons I chose the editor style instead. First, not many data engineers are used to the notebook coding style. Second, using the editor will make it easier to port the files to pipelines.

For our example use case, we will predict whether a credit card customer will fail to pay their credit card bill next month. I will name the use case credit card default. The dataset is available as a BigQuery public dataset. Let’s get started.

Here are the steps that you will complete in this exercise...
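While the step-by-step details follow in the book, here is a minimal end-to-end sketch of the use case. The public table `bigquery-public-data.ml_datasets.credit_card_default` and the `default_payment_next_month` target column are assumptions on my part; verify them against the exercise code:

```python
from google.cloud import bigquery
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a sample of the public credit card default table into a DataFrame.
client = bigquery.Client()
df = client.query(
    "SELECT * FROM `bigquery-public-data.ml_datasets.credit_card_default` LIMIT 10000"
).to_dataframe()

# Assumed target column name; check it against the actual table schema.
target = "default_payment_next_month"
labels = df[target].astype(int)
features = df.select_dtypes(include="number").drop(columns=[target], errors="ignore")

# Split the data, train a model, and evaluate it on the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```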

The MLOps landscape in GCP

In this section, we’ll learn what GCP services are related to MLOps. But before that, let’s understand what MLOps is.

Understanding the basic principles of MLOps

When we created the ML model in the previous section, we wrote some ML code, which included creating features, models, and predictions. I’ve found that much of the ML content and discussion on the public internet is about creating and improving ML models. Some examples of typical topics include how to create a random forest model, ML regression versus classification, boosting ML accuracy with hyperparameter tuning, and many more.

All of the example topics mentioned previously are part of creating ML code. In reality, ML in a real production system needs a lot more than that. Take a look at the following diagram for the other aspects:

Figure 8.4 – The various aspects of ML, of which ML code is only a small part

As you can see, it’s logical to have the...

Exercise – leveraging pre-built GCP models as a service

In this exercise, we will use a GCP service called Google Cloud Vision, which is one of many pre-built models in GCP. With pre-built models, we only need to call the API from our application. This means that we don’t need to create an ML model ourselves.

In this exercise, we will create a Python application that can read an image with handwritten text and convert it into a Python string.

The following are the steps for this exercise:

  1. Upload the image to a GCS bucket.
  2. Install the required Python packages.
  3. Create a detect text function in Python (a sketch of this function follows the list).
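As a preview of step 3, here is a minimal sketch of such a function, assuming the image is already in GCS; the bucket path is a placeholder:

```python
from google.cloud import vision

def detect_text(gcs_uri: str) -> str:
    """Reads an image from GCS and returns the detected text as a string."""
    client = vision.ImageAnnotatorClient()
    image = vision.Image()
    image.source.image_uri = gcs_uri
    # document_text_detection works well for dense and handwritten text.
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text

# Placeholder path; point this at your own bucket and image.
print(detect_text("gs://your-bucket/chapter-8/handwritten.png"))
```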

Let’s start by uploading the image.

Uploading the image to a GCS bucket

In the GCS console, go to the bucket that you created in the previous chapters. For example, my bucket is wired-apex-392509-data-bucket.

Inside the bucket, create a new folder called chapter-8. This is an example from my console:

Figure 8.6 – Example GCS bucket folder for storing the image file
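If you prefer to upload the file programmatically rather than through the console, a minimal sketch with the google-cloud-storage client might look like this; the bucket and file names are placeholders:

```python
from google.cloud import storage

# Upload a local image into the chapter-8 folder of your bucket.
client = storage.Client()
bucket = client.bucket("your-project-data-bucket")  # replace with your bucket name
blob = bucket.blob("chapter-8/handwritten.png")     # placeholder object name
blob.upload_from_filename("handwritten.png")        # placeholder local file
```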
...

Exercise – using AutoML in GCP to train an ML model

As we learned earlier in this chapter, AutoML is an automated way for you to build an ML model. It will handle model selection, hyperparameter tuning, and various data preparation steps.

Note that, for the data preparation part, it will not be smart enough to transform data from very raw tables, aggregate it based on business context, and automatically clean all the data to create features. Those activities are still the responsibility of data engineers and data scientists.

What AutoML will do, however, is perform simple data preparation tasks, such as detecting numeric, binary, categorical, and text features, and then apply the required transformations for the ML training process. Let’s learn how to do this. Here are the steps that you will complete in this exercise (a programmatic sketch follows the list):

  1. Create a Vertex AI dataset.
  2. Train the ML model using AutoML.
  3. Choose the compute and budget for AutoML.
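The exercise walks through these steps in the Vertex AI console, but as a rough programmatic equivalent, here is a hedged sketch using the google-cloud-aiplatform SDK; the project ID, region, BigQuery source, and target column are assumptions:

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Step 1: create a Vertex AI tabular dataset from a BigQuery table.
dataset = aiplatform.TabularDataset.create(
    display_name="credit-card-default",
    bq_source="bq://your-project-id.your_dataset.credit_card_default",
)

# Step 2: define an AutoML training job for a classification objective.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="credit-card-default-automl",
    optimization_prediction_type="classification",
)

# Step 3: the budget caps compute; 1,000 milli node hours equals 1 node hour.
model = job.run(
    dataset=dataset,
    target_column="default_payment_next_month",  # assumed column name
    budget_milli_node_hours=1000,
)
```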

For the use case...

Exercise – deploying a dummy workflow with Vertex AI Pipelines

Before we continue with the hands-on exercise, let’s understand what Vertex AI Pipelines is. Vertex AI Pipelines is a tool for orchestrating ML workflows. Under the hood, it uses an open source tool called Kubeflow Pipelines. As with the relationship between Airflow and Cloud Composer, or Hadoop and Dataproc, to understand Vertex AI Pipelines, we need to be familiar with Kubeflow Pipelines.

Kubeflow Pipelines is a platform for building and deploying portable, scalable ML workflows based on Docker containers. Using containers is particularly important for ML workflows compared to data workflows. In data workflows, for example, it’s typical to load the BigQuery, GCS, and pandas libraries for all the steps; those libraries are used from the upstream to the downstream steps. In ML, the upstream process is data loading, while the other steps build models that need specific libraries, such as TensorFlow or scikit...
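To make the idea of a containerized workflow concrete, here is a minimal sketch of a dummy pipeline, assuming the Kubeflow Pipelines (kfp) v2 SDK; each decorated component runs in its own container:

```python
from kfp import compiler, dsl

@dsl.component
def say_hello(name: str) -> str:
    # This function body is what runs inside the component's container.
    return f"Hello, {name}!"

@dsl.pipeline(name="dummy-pipeline")
def dummy_pipeline(name: str = "data engineer"):
    say_hello(name=name)

# Compile the pipeline into a job spec that Vertex AI Pipelines can execute.
compiler.Compiler().compile(dummy_pipeline, "dummy_pipeline.json")
```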

Exercise – deploying a scikit-learn model pipeline with Vertex AI

In this exercise, we will simulate creating a pipeline for an ML model. There will be two pipelines – one to train the ML model and another to predict new data using the model from the first pipeline. We will continue using the credit card default dataset. The two pipelines will look like this:

Figure 8.21 – The steps in the two pipelines

Later in this section, we will load data from BigQuery. But instead of keeping the data in pandas, we will write the output to a GCS bucket. We do this because we don’t want to return an in-memory Python object from the function. What I mean by an in-memory Python object, in this case, is a pandas DataFrame; this also applies to other data structures, such as arrays or lists. Remember that every step in Vertex AI Pipelines is executed in a different container – that is, on a different machine. You can’...
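As a sketch of that idea, the data-loading component below writes its output to GCS instead of returning a DataFrame between steps; the table name and installed packages are assumptions:

```python
from kfp import dsl

@dsl.component(
    packages_to_install=["google-cloud-bigquery", "pandas", "db-dtypes", "gcsfs"]
)
def load_data_from_bq(project_id: str, output_gcs_path: str):
    """Loads the credit card default data and writes it to GCS as a CSV file,
    so no in-memory DataFrame has to be passed between containers."""
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    df = client.query(
        "SELECT * FROM `bigquery-public-data.ml_datasets.credit_card_default`"
    ).to_dataframe()
    # gcsfs lets pandas write directly to a gs:// path.
    df.to_csv(output_gcs_path, index=False)
```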

Summary

In this chapter, we learned how to create an ML model. We learned that creating the ML code itself is not that difficult; it’s the surrounding aspects that make ML complex. On top of that, we learned some basic terminology, such as AutoML, pre-built models, and MLOps.

As I mentioned in the introduction, ML is not a core skill that a data engineer needs to have. However, understanding this topic will give a data engineer a bigger picture of the whole data architecture. This way, you can make better decisions when designing your core data pipelines.

This chapter is the end of our big section on Building Data Solutions with GCP Components. Starting from Chapter 3, Building a Data Warehouse in BigQuery, to Chapter 8, Building Machine Learning Solutions on GCP, we’ve learned about all the fundamental principles of data engineering and how to use GCP services. At this point, you are more than ready to build a data solution in GCP.

Starting from the...
