Building Machine Learning Solutions on GCP

The first machine learning (ML) solutions date back to the 1950s, and as most of you know, ML has become immensely popular in recent years. It’s undeniable that the discussion of artificial intelligence (AI) and ML is one of the hottest topics of the 21st century. There are two main drivers of this: one is the advancement of infrastructure, while the second is data. This second driver brings us, as data engineers, into the ML area.

In my experience of discussing ML with data engineers, there are two very different reactions – extreme excitement or outright resistance. Before you lose interest in finishing this chapter, I want to be clear about what we are going to cover.

We are not going to approach ML through historical stories or its mathematical aspects. Instead, I am going to prepare you, as data engineers, for potential ML involvement in your GCP environment.

As we learn about the...

Technical requirements

For this chapter’s exercises, we will use the following GCP services:

  • BigQuery
  • GCS
  • Vertex AI Pipelines
  • Vertex AI AutoML
  • Google Cloud Vision AI
  • Google Cloud Translate

If you have never opened any of these services in your GCP console, open them now and enable the respective APIs if required.

Also, make sure you have your GCP console, Cloud Shell, and Cloud Editor ready.

Finally, download the example code and the dataset from https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform-Second-Edition/tree/main/chapter-8.

Now, let’s get started!

A quick look at ML

First, let’s understand what ML is from a data engineering perspective. ML is a data process that takes data as input. The output of the process is a generalized formula for one specific objective; this formula is called the ML model.

As an illustration, let’s imagine some real-world use cases that use ML. The first example is a recommendation system on an eCommerce platform. Such a platform may use its customers’ purchase history as input data, which can be processed to calculate how likely each customer is to purchase other items in the future. Another example, from the healthcare industry, is a cancer predictor that uses X-ray images. A collection of X-ray images, with and without cancer, can be used as input data to train a model that predicts cancer in unseen X-ray images.
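To make that definition concrete, here is a minimal, hypothetical sketch of the input-data-to-model flow using scikit-learn; the numbers are stand-ins, not a real dataset:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical input data: each row is a customer's purchase history features,
# and each label is whether they later bought a given item (1) or not (0).
features = [[5, 2, 0], [1, 0, 3], [4, 1, 1], [0, 0, 2]]
labels = [1, 0, 1, 0]

# Training generalizes the input data into a reusable formula: the ML model.
model = RandomForestClassifier(random_state=42)
model.fit(features, labels)

# The model can now predict the objective for unseen data.
print(model.predict([[3, 1, 0]]))
```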

I believe you’ve heard of those and many other real-world ML use cases. Even in the latest hype surrounding generative AI,...

Exercise – practicing ML code using Python

In this section, we will use some of the terminology introduced in the previous section and practice creating a fairly simple ML solution using Python. The focus is on understanding the steps and starting to use the correct terminology.

For this exercise, we will be using Cloud Editor and Cloud Shell. You have probably heard that the most common tool data scientists use for creating ML models is the Jupyter Notebook. There are two reasons I chose the editor style instead. First, not many data engineers are used to the notebook coding style. Second, using the editor will make it easier to port the files to pipelines.

For our example use case, we will predict whether a credit card customer will fail to pay their credit card bill next month. I will name the use case credit card default. The dataset is available as a BigQuery public dataset. Let’s get started.

Here are the steps that you will complete in this exercise...
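While the step-by-step details follow in the book, here is a minimal end-to-end sketch of the use case. The public table `bigquery-public-data.ml_datasets.credit_card_default` and the `default_payment_next_month` target column are assumptions on my part; verify them against the exercise code:

```python
from google.cloud import bigquery
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a sample of the public credit card default table into a DataFrame.
client = bigquery.Client()
df = client.query(
    "SELECT * FROM `bigquery-public-data.ml_datasets.credit_card_default` LIMIT 10000"
).to_dataframe()

# Assumed target column name; check it against the actual table schema.
target = "default_payment_next_month"
labels = df[target].astype(int)
features = df.select_dtypes(include="number").drop(columns=[target], errors="ignore")

# Split the data, train a model, and evaluate it on the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```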

The MLOps landscape in GCP

In this section, we’ll learn what GCP services are related to MLOps. But before that, let’s understand what MLOps is.

Understanding the basic principles of MLOps

When we created the ML model in the previous section, we wrote some ML code, which included creating features, models, and predictions. I’ve found that much of the ML content and discussion on the public internet is about creating and improving ML models. Some examples of typical topics include how to create a random forest model, ML regression versus classification, boosting ML accuracy with hyperparameter tuning, and many more.

All of the example topics mentioned previously are part of creating ML code. In reality, ML in a real production system needs a lot more than that. Take a look at the following diagram for the other aspects:

Figure 8.4 – The various aspects of ML, of which ML code is only a small part

As you can see, it’s logical to have the...

Exercise – leveraging pre-built GCP models as a service

In this exercise, we will use a GCP service called Google Cloud Vision, which is one of many pre-built models in GCP. With pre-built models, we only need to call the API from our application. This means that we don’t need to create an ML model ourselves.

In this exercise, we will create a Python application that can read an image with handwritten text and convert it into a Python string.

The following are the steps for this exercise:

  1. Upload the image to a GCS bucket.
  2. Install the required Python packages.
  3. Create a detect text function in Python (a sketch of this function follows the list).
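As a preview of step 3, here is a minimal sketch of such a function, assuming the image is already in GCS; the bucket path is a placeholder:

```python
from google.cloud import vision

def detect_text(gcs_uri: str) -> str:
    """Reads an image from GCS and returns the detected text as a string."""
    client = vision.ImageAnnotatorClient()
    image = vision.Image()
    image.source.image_uri = gcs_uri
    # document_text_detection works well for dense and handwritten text.
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text

# Placeholder path; point this at your own bucket and image.
print(detect_text("gs://your-bucket/chapter-8/handwritten.png"))
```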

Let’s start by uploading the image.

Uploading the image to a GCS bucket

In the GCS console, go to the bucket that you created in the previous chapters. For example, my bucket is wired-apex-392509-data-bucket.

Inside the bucket, create a new folder called chapter-8. This is an example from my console:

Figure 8.6 – Example GCS bucket folder for storing the image file
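If you prefer to upload the file programmatically rather than through the console, a minimal sketch with the google-cloud-storage client might look like this; the bucket and file names are placeholders:

```python
from google.cloud import storage

# Upload a local image into the chapter-8 folder of your bucket.
client = storage.Client()
bucket = client.bucket("your-project-data-bucket")  # replace with your bucket name
blob = bucket.blob("chapter-8/handwritten.png")     # placeholder object name
blob.upload_from_filename("handwritten.png")        # placeholder local file
```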
...

Exercise – using AutoML in GCP to train an ML model

As we learned earlier in this chapter, AutoML is an automated way for you to build an ML model. It will handle model selection, hyperparameter tuning, and various data preparation steps.

Note that, for the data preparation part, it will not be smart enough to transform data from very raw tables, aggregate it based on business context, and automatically clean all the data to create features. Those activities are still the responsibility of data engineers and data scientists.

What AutoML will do, however, is perform simple data preparation tasks, such as detecting numeric, binary, categorical, and text features, and then apply the required transformations for the ML training process. Let’s learn how to do this. Here are the steps that you will complete in this exercise (a programmatic sketch follows the list):

  1. Create a Vertex AI dataset.
  2. Train the ML model using AutoML.
  3. Choose the compute and budget for AutoML.
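The exercise walks through these steps in the Vertex AI console, but as a rough programmatic equivalent, here is a hedged sketch using the google-cloud-aiplatform SDK; the project ID, region, BigQuery source, and target column are assumptions:

```python
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Step 1: create a Vertex AI tabular dataset from a BigQuery table.
dataset = aiplatform.TabularDataset.create(
    display_name="credit-card-default",
    bq_source="bq://your-project-id.your_dataset.credit_card_default",
)

# Step 2: define an AutoML training job for a classification objective.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="credit-card-default-automl",
    optimization_prediction_type="classification",
)

# Step 3: the budget caps compute; 1,000 milli node hours equals 1 node hour.
model = job.run(
    dataset=dataset,
    target_column="default_payment_next_month",  # assumed column name
    budget_milli_node_hours=1000,
)
```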

For the use case...

Exercise – deploying a dummy workflow with Vertex AI Pipelines

Before we continue with the hands-on exercise, let’s understand what Vertex AI Pipelines is. Vertex AI Pipelines is a tool for orchestrating ML workflows. Under the hood, it uses an open source tool called Kubeflow Pipelines. As with the relationship between Airflow and Cloud Composer, or Hadoop and Dataproc, to understand Vertex AI Pipelines, we need to be familiar with Kubeflow Pipelines.

Kubeflow Pipelines is a platform for building and deploying portable, scalable ML workflows based on Docker containers. Using containers is particularly important for ML workflows compared to data workflows. In data workflows, for example, it’s typical to load the BigQuery, GCS, and pandas libraries for all the steps; those libraries are used from the upstream to the downstream steps. In ML, the upstream process is data loading, while the other steps build models that need specific libraries, such as TensorFlow or scikit...
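To make the idea of a containerized workflow concrete, here is a minimal sketch of a dummy pipeline, assuming the Kubeflow Pipelines (kfp) v2 SDK; each decorated component runs in its own container:

```python
from kfp import compiler, dsl

@dsl.component
def say_hello(name: str) -> str:
    # This function body is what runs inside the component's container.
    return f"Hello, {name}!"

@dsl.pipeline(name="dummy-pipeline")
def dummy_pipeline(name: str = "data engineer"):
    say_hello(name=name)

# Compile the pipeline into a job spec that Vertex AI Pipelines can execute.
compiler.Compiler().compile(dummy_pipeline, "dummy_pipeline.json")
```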

Exercise – deploying a scikit-learn model pipeline with Vertex AI

In this exercise, we will simulate creating a pipeline for an ML model. There will be two pipelines – one to train the ML model and another to predict new data using the model from the first pipeline. We will continue using the credit card default dataset. The two pipelines will look like this:

Figure 8.21 – The steps in the two pipelines

Later in this section, we will load data from BigQuery. But instead of keeping the data in pandas, we will write the output to a GCS bucket. We do this because we don’t want to return an in-memory Python object from the function. What I mean by an in-memory Python object, in this case, is a pandas DataFrame; this also applies to other data structures, such as arrays or lists. Remember that every step in Vertex AI Pipelines is executed in a different container – that is, on a different machine. You can’...
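As a sketch of that idea, the data-loading component below writes its output to GCS instead of returning a DataFrame between steps; the table name and installed packages are assumptions:

```python
from kfp import dsl

@dsl.component(
    packages_to_install=["google-cloud-bigquery", "pandas", "db-dtypes", "gcsfs"]
)
def load_data_from_bq(project_id: str, output_gcs_path: str):
    """Loads the credit card default data and writes it to GCS as a CSV file,
    so no in-memory DataFrame has to be passed between containers."""
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    df = client.query(
        "SELECT * FROM `bigquery-public-data.ml_datasets.credit_card_default`"
    ).to_dataframe()
    # gcsfs lets pandas write directly to a gs:// path.
    df.to_csv(output_gcs_path, index=False)
```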

Summary

In this chapter, we learned how to create an ML model. We learned that creating the ML code itself is not that difficult; it’s the surrounding aspects that make ML complex. On top of that, we learned some basic terminology, such as AutoML, pre-built models, and MLOps.

As I mentioned in the introduction, ML is not a core skill that a data engineer needs to have. However, understanding this topic will give a data engineer a bigger picture of the whole data architecture. This way, you can make better decisions when designing your core data pipelines.

This chapter is the end of our big section on Building Data Solutions with GCP Components. Starting from Chapter 3, Building a Data Warehouse in BigQuery, to Chapter 8, Building Machine Learning Solutions on GCP, we’ve learned about all the fundamental principles of data engineering and how to use GCP services. At this point, you are more than ready to build a data solution in GCP.

Starting from the...
