You're reading from The Definitive Guide to Google Vertex AI

Product type: Book

Published in: Dec 2023

Publisher: Packt

ISBN-13: 9781801815260

Edition: 1st Edition

Concepts

Data Science

Authors (2):

Jasmeet Bhatia

Kartik Chaudhary

It’s All About Data – Options to Store and Transform ML Datasets

The real work on a machine learning project starts only once the required data is available in the project development environment. When the data changes frequently or the use case requires real-time data, we may need to set up data pipelines so that the required data is always available for analysis and modeling. The best way to transfer, store, or transform data depends on its size, type, and nature. Raw data, as collected in the real world, is often massive and may span multiple types, such as text, audio, images, and video. Because real-world data varies so much in nature, size, and type, it is important to set up the right infrastructure for storing, transferring, transforming, and analyzing it at scale.

In this chapter, we will learn about the different options for moving data to the Google Cloud...

Moving data to Google Cloud

When we start a machine learning project on Google Cloud Platform (GCP), the first step is to move all our project-related data to the Google Cloud environment. While transferring data to the cloud, the key things to focus on are reliability, security, scalability, and how easy the transfer process is to manage. With these points in mind, Google Cloud provides four major data transfer utilities that cover a variety of use cases. These utilities are useful for any kind of data transfer, including data center migration, data backup, content storage, and machine learning. As our current focus is on making data available for machine learning use cases, we can use any of the following transfer solutions:

Google Cloud Storage Transfer tools
BigQuery Data Transfer Service
Storage Transfer Service
Transfer Appliance

Let’s understand each of these transfer solutions.

Google...

Where to store data

Google Cloud Storage (GCS) and BigQuery (BQ) are the two recommended options for storing machine learning datasets securely and efficiently. If the underlying data is structured or semi-structured, BQ is the recommended option because of its off-the-shelf features for manipulating and processing structured datasets. If the data consists of images, videos, audio, or other unstructured data, then GCS is the suitable option. Let’s learn about these two data storage systems in more detail.
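
The rule of thumb above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `recommend_storage` and the set of file extensions are hypothetical, are not part of the book or of any Google Cloud API, and simply map a file's extension to the store this section recommends.

```python
from pathlib import PurePosixPath

# Hypothetical list of structured / semi-structured formats, for illustration only.
STRUCTURED_EXTENSIONS = {".csv", ".json", ".jsonl", ".avro", ".parquet"}


def recommend_storage(filename: str) -> str:
    """Return "BigQuery" for structured or semi-structured files, "GCS" otherwise."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in STRUCTURED_EXTENSIONS:
        return "BigQuery"
    # Images, video, audio, and anything unrecognized default to object storage.
    return "GCS"


for name in ["sales.csv", "cat.JPG", "clip.mp4"]:
    print(name, "->", recommend_storage(name))
```

In practice the choice also depends on how the data will be queried, so a helper like this is a starting point rather than a policy.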

GCS – object storage

A significant amount of the data that we collect from real-world applications is unstructured. Some examples are images, videos, emails, audio files, web pages, and sensor data. Managing and storing such large amounts of unstructured data affordably and efficiently is challenging. Object storage has become a preferred solution for storing large amounts of static data and backups. Object storage is a computer data architecture...

Transforming data

Raw data in real-world applications is often unstructured and noisy, so it cannot be fed directly to machine learning algorithms. We often need to apply several transformations to the raw data to convert it into a format that machine learning algorithms support well. In this section, we will learn about multiple options for transforming data in a scalable and efficient way on Google Cloud.

Here are three common options for data transformation in the GCP environment:

Ad hoc transformations within Jupyter Notebook
Cloud Data Fusion
Dataflow pipelines for scalable data transformations

Let’s learn about these three methods in more detail.

Ad hoc transformations within Jupyter Notebook

Machine learning algorithms are mathematical and can only understand numeric data. For example, in computer vision problems, images are converted into numerical pixel values before they’re fed into a model. Similarly, in the...
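
As a minimal sketch of the image case, the snippet below turns the raw bytes of a tiny grayscale image into rows of numeric pixel values scaled to the range [0, 1]. The function name `bytes_to_pixels` and the 2x2 sample image are made up for illustration; a real project would typically rely on an image library rather than hand-written code like this.

```python
from typing import List


def bytes_to_pixels(raw: bytes, width: int, height: int) -> List[List[float]]:
    """Convert 8-bit grayscale bytes into rows of floats in the range [0, 1]."""
    if len(raw) != width * height:
        raise ValueError("raw must contain exactly width * height bytes")
    # Each byte is one pixel intensity from 0 (black) to 255 (white).
    return [
        [raw[row * width + col] / 255 for col in range(width)]
        for row in range(height)
    ]


# Hypothetical 2x2 image: black, white on the first row; white, black on the second.
sample = bytes([0, 255, 255, 0])
pixels = bytes_to_pixels(sample, width=2, height=2)
print(pixels)
```

Scaling to [0, 1] is one common convention; other ranges or per-channel normalization may suit a given model better.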

Summary

Managing data effectively helps every organization save time, reduce cost, and limit complexity. A machine learning practitioner should be aware of the best options for transferring, storing, and transforming data in order to build machine learning solutions more efficiently. In this chapter, we learned about multiple ways of bringing data into the Google Cloud environment. We discussed the best options for storing it based on the characteristics of the data. Finally, we discussed several tools and methods for transforming and processing data in a scalable manner.

After reading this chapter, you should feel confident about choosing the best option for moving data into your Google Cloud environment based on the requirements of the use case. Choosing the best place to store data and the best strategy to analyze and transform it should also be easier now that we know the pros and cons of the different options. In the next chapter, we will take a deep dive into Vertex...

Authors (2)

Jasmeet Bhatia

Jasmeet is a Machine Learning Architect with over 8 years of experience in Data Science and Machine Learning Engineering at Google and Microsoft, and 17 years of overall experience in Product Engineering and Technology consulting at Deloitte, Disney, and Motorola. He has been involved in building technology solutions that focus on solving complex business problems by utilizing information and data assets. He has built high-performing engineering teams and has designed and built global-scale AI/Machine Learning, Data Science, and Advanced Analytics solutions for image recognition, natural language processing, sentiment analysis, and personalization.

Kartik Chaudhary

Kartik is an Artificial Intelligence and Machine Learning professional with 6+ years of industry experience in developing and architecting large-scale AI/ML solutions using advancements in Machine Learning, Deep Learning, Computer Vision, and Natural Language Processing. Kartik has filed 9 patents at the intersection of Machine Learning, Healthcare, and Operations. Kartik loves sharing knowledge, blogging, travel, and photography.