You're reading from Machine Learning Engineering on AWS

Product type: Book
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781803247595
Edition: 1st Edition
Tools: AWS
Concepts: Machine Learning
Author: Joshua Arvin Lat

Serverless Data Management on AWS

Businesses generally utilize systems that collect and store user information, along with transaction data, inside databases. One good example of this would be an e-commerce startup that has a web application where customers can create an account and use their credit card to make online purchases. The user profiles, transaction data, and purchase history stored in several production databases can be used to build a product recommendation engine, which can help suggest products that customers would probably want to purchase as well. However, before this stored data is analyzed and used to train machine learning (ML) models, it must be merged and joined into a centralized data store so that it can be transformed and processed using a variety of tools and services. Several options are frequently used for these types of use cases, but we will focus on two of these in this chapter – data warehouses and data lakes.

Data warehouses and data lakes...

Technical requirements

Before we start, we must have the following ready:

A web browser (preferably Chrome or Firefox)
Access to the AWS account that was used in the first few chapters of this book

The Jupyter notebooks, source code, and other files for each chapter are available in this book’s GitHub repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS.

Getting started with serverless data management

Years ago, developers, data scientists, and ML engineers had to spend hours or even days setting up the infrastructure needed for data management and data engineering. If a large dataset stored in S3 needed to be analyzed, a team of data scientists and ML engineers performed the following sequence of steps:

1. Launch and configure a cluster of EC2 instances.
2. Copy the data from S3 to the volumes attached to the EC2 instances.
3. Perform queries on the data using one or more of the applications installed in the EC2 instances.

One of the known challenges with this approach is that the provisioned resources may end up being underutilized. If the schedule of the data query operations is unpredictable, it would be tricky to manage the uptime, cost, and compute specifications of the setup as well. In addition to these, system administrators and DevOps engineers need to spend time managing the security, stability, performance...

Preparing the essential prerequisites

In this section, we will ensure that the following prerequisites are ready before proceeding with setting up our data warehouse and data lake in this chapter:

A text editor (for example, VS Code) on your local machine
An IAM user with the permissions to create and manage the resources we will use in this chapter
A VPC where we will launch the Redshift Serverless endpoint
A new S3 bucket where our data will be uploaded using AWS CloudShell

In this chapter, we will create and manage our resources in the Oregon (us-west-2) region. Make sure that you have set the correct region before proceeding with the next steps.
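
The chapter sets the region in the AWS console. If we later script against the same account, the region can also be pinned in code. The following is a minimal sketch, assuming the standard `AWS_DEFAULT_REGION` environment variable that the AWS CLI and SDKs read; the helper name `pin_region` is hypothetical.

```python
import os

# Region used throughout this chapter (Oregon).
CHAPTER_REGION = "us-west-2"


def pin_region(region: str = CHAPTER_REGION) -> str:
    """Set AWS_DEFAULT_REGION for the current process and return its value."""
    os.environ["AWS_DEFAULT_REGION"] = region
    return os.environ["AWS_DEFAULT_REGION"]


active_region = pin_region()
print(f"Using region: {active_region}")
```

This only affects the current process and anything it launches; it does not change the region selected in the console.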

Opening a text editor on your local machine

Make sure you have an open text editor (for example, VS Code) on your local machine. We will copy some string values into the text editor for later use in this chapter. Here are the values we will have to copy later in this chapter:

IAM sign-in link, username...

Running analytics at scale with Amazon Redshift Serverless

Data warehouses play a crucial role in data management, data analysis, and data engineering. Data engineers and ML engineers spend time building data warehouses to work on projects involving batch reporting and business intelligence.

Figure 4.11 – Data warehouse

As shown in the preceding diagram, a data warehouse contains combined data from different relational data sources such as PostgreSQL and MySQL databases. It generally serves as the single source of truth when querying data for reporting and business intelligence requirements. In ML experiments, a data warehouse can serve as the source of clean data from which we extract the datasets used to build and train ML models.
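
To make the idea of combining sources concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in. It is not how Redshift Serverless is loaded. Two in-memory databases play the role of separate production sources, a third plays the warehouse, and all table names and rows are made up for illustration.

```python
import sqlite3

# Two in-memory databases stand in for separate production sources
# (for example, a users database and an orders database).
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])

orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0)],
)

# A third database plays the role of the warehouse: one place to query everything.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)

# Copy rows from each source into the warehouse.
warehouse.executemany(
    "INSERT INTO users VALUES (?, ?)",
    users_db.execute("SELECT user_id, name FROM users").fetchall(),
)
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    orders_db.execute("SELECT order_id, user_id, amount FROM orders").fetchall(),
)

# A reporting query that needs data from both sources at once.
totals = warehouse.execute(
    """
    SELECT u.name, SUM(o.amount) AS total_spent
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.user_id
    GROUP BY u.name
    ORDER BY u.name
    """
).fetchall()
print(totals)
```

The join at the end is the point: once the data sits in one store, a single SQL statement can answer a question that spans both sources, without touching the production databases.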

Note

When generating reports, businesses and start-ups may end up performing queries directly on the production databases used by running web applications. It is important to note that these queries may cause unplanned...

Setting up Lake Formation

Now, it’s time to take a closer look at setting up our serverless data lake on AWS! Before we begin, let’s define what a data lake is and what type of data is stored in it. A data lake is a centralized data store that contains a variety of structured, semi-structured, and unstructured data from different data sources. As shown in the following diagram, data can be stored in a data lake without us having to worry about the structure and format. We can use a variety of file types such as JSON, CSV, and Apache Parquet when storing data in a data lake. In addition to these, data lakes may include both raw and processed (clean) data:
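
The following sketch illustrates the "store first, interpret later" idea using only the Python standard library. A temporary local directory stands in for the data lake's storage (in this chapter that role is played by S3), and the file names and records are hypothetical. Parquet is left out because reading it requires a third-party library.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_records(path: Path) -> list[dict]:
    """Read a file as a list of records, choosing a parser from its extension."""
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            return list(csv.DictReader(f))
    if path.suffix == ".json":
        # One JSON object per line (JSON Lines layout).
        with path.open() as f:
            return [json.loads(line) for line in f if line.strip()]
    raise ValueError(f"Unsupported file type: {path.suffix}")


with tempfile.TemporaryDirectory() as lake_root:
    raw = Path(lake_root) / "raw"
    raw.mkdir()

    # Files of different formats are stored side by side, as-is.
    (raw / "bookings.csv").write_text("booking_id,total\n1,120\n2,80\n")
    (raw / "events.json").write_text(
        '{"event": "login", "user": 7}\n{"event": "purchase", "user": 7}\n'
    )

    # Structure is applied only when the files are read.
    records_by_file = {p.name: read_records(p) for p in sorted(raw.iterdir())}

for name, records in records_by_file.items():
    print(name, records)
```

Note that the CSV values come back as strings while the JSON values keep their types, which is one reason raw data in a lake usually needs further cleaning before it is used for training.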

Figure 4.26 – Getting started with data lakes

ML engineers and data scientists can use data lakes as the source of the data used for building and training ML models. Since the data stored in data lakes may be a mixture of both raw and clean data, additional data processing, data cleaning...

Using Amazon Athena to query data in Amazon S3

Amazon Athena is a serverless query service that allows us to use SQL statements to query data from files stored in S3. With Amazon Athena, we don’t have to worry about infrastructure management and it scales automatically to handle our queries:

Figure 4.35 – How Amazon Athena works

If you were to set this up yourself, you would need to set up a cluster of EC2 instances running an application such as Presto. You would also need to manage the overall cost, security, performance, and stability of this cluster yourself.
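
For contrast with the self-managed setup described above, here is a hedged sketch of submitting a query to Athena from Python with `boto3`. The chapter itself works in the console. The table, database, and bucket names are placeholders, and `run_query` is not called here because it needs `boto3` installed and valid AWS credentials.

```python
def build_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict, region: str = "us-west-2") -> str:
    """Submit the query and return its execution ID.

    Requires boto3 and AWS credentials; imported lazily so the builder
    above can be used (and tested) without the SDK installed.
    """
    import boto3

    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder names for illustration only.
request = build_query_request(
    sql="SELECT * FROM bookings LIMIT 10",
    database="example_database",
    output_location="s3://example-bucket/athena-results/",
)
print(request)
```

There is no cluster to size or keep running in this flow: the request names the SQL, the database, and where the results should be written.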

Setting up the query result location

If the "Before you run your first query, you need to set up a query result location in Amazon S3" notification appears on the Editor page, this means that you must make a quick configuration change on the Amazon Athena Settings page so that Athena can store the query results in a specified S3 bucket location every time there’...
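
The console Settings page is the route this chapter takes. One possible programmatic counterpart is to update the workgroup's result location with `boto3`, sketched below on the assumption that queries run in the default workgroup named `primary`. The bucket and prefix are placeholders, and `apply_update` is not called here because it needs `boto3` and AWS credentials.

```python
def build_result_location_update(
    bucket: str, prefix: str = "", workgroup: str = "primary"
) -> dict:
    """Build the keyword arguments for Athena's UpdateWorkGroup call."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket must be a bare bucket name, without slashes")

    # Normalize the prefix so the location always ends with a single slash.
    cleaned = prefix.strip("/")
    location = f"s3://{bucket}/{cleaned}/" if cleaned else f"s3://{bucket}/"

    return {
        "WorkGroup": workgroup,
        "ConfigurationUpdates": {
            "ResultConfigurationUpdates": {"OutputLocation": location}
        },
    }


def apply_update(update: dict, region: str = "us-west-2") -> None:
    """Apply the update (requires boto3 and AWS credentials)."""
    import boto3

    boto3.client("athena", region_name=region).update_work_group(**update)


# Placeholder bucket and prefix for illustration only.
update = build_result_location_update("example-bucket", "athena-results")
print(update)
```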

Summary

In this chapter, we were able to take a closer look at several AWS services that help enable serverless data management in organizations. When using serverless services, we no longer need to worry about infrastructure management, which helps us focus on what we need to do.

We were able to utilize Amazon Redshift Serverless to prepare a serverless data warehouse. We were also able to use AWS Lake Formation, AWS Glue, and Amazon Athena to create and query data from a serverless data lake. With these serverless services, we were able to load and query data in just a few minutes.

Further reading

Security best practices for your VPC (https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-best-practices.html)
Introducing Amazon Redshift Serverless (https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/)
Security in AWS Lake Formation (https://docs.aws.amazon.com/lake-formation/latest/dg/security.html)


Author (1)

Joshua Arvin Lat

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO for three Australian-owned companies and as director of software development and engineering for multiple e-commerce start-ups. Years ago, he and his team won first place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management.
