You're reading from  Machine Learning Engineering on AWS

Product type: Book
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781803247595
Edition: 1st Edition
Author (1)
Joshua Arvin Lat

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO for three Australian-owned companies and as director of software development and engineering for multiple e-commerce start-ups in the past. Years ago, he and his team won first place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management.

Serverless Data Management on AWS

Businesses generally utilize systems that collect and store user information, along with transaction data, inside databases. One good example of this would be an e-commerce startup that has a web application where customers can create an account and use their credit card to make online purchases. The user profiles, transaction data, and purchase history stored in several production databases can be used to build a product recommendation engine, which can help suggest products that customers would probably want to purchase as well. However, before this stored data is analyzed and used to train machine learning (ML) models, it must be merged and joined into a centralized data store so that it can be transformed and processed using a variety of tools and services. Several options are frequently used for these types of use cases, but we will focus on two of these in this chapter – data warehouses and data lakes.

Data warehouses and data lakes...

Technical requirements

Before we start, we must have the following ready:

  • A web browser (preferably Chrome or Firefox)
  • Access to the AWS account that was used in the first few chapters of this book

The Jupyter notebooks, source code, and other files for each chapter are available in this book’s GitHub repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS.

Getting started with serverless data management

Years ago, developers, data scientists, and ML engineers had to spend hours or even days setting up the infrastructure needed for data management and data engineering. If a large dataset stored in S3 needed to be analyzed, a team of data scientists and ML engineers performed the following sequence of steps:

  1. Launch and configure a cluster of EC2 instances.
  2. Copy the data from S3 to the volumes attached to the EC2 instances.
  3. Perform queries on the data using one or more of the applications installed in the EC2 instances.

One of the known challenges with this approach is that the provisioned resources may end up being underutilized. If the schedule of the data query operations is unpredictable, it would be tricky to manage the uptime, cost, and compute specifications of the setup as well. In addition to these, system administrators and DevOps engineers need to spend time managing the security, stability, performance...

Preparing the essential prerequisites

In this section, we will ensure that the following prerequisites are ready before proceeding with setting up our data warehouse and data lake in this chapter:

  • A text editor (for example, VS Code) on your local machine
  • An IAM user with the permissions to create and manage the resources we will use in this chapter
  • A VPC where we will launch the Redshift Serverless endpoint
  • A new S3 bucket where our data will be uploaded using AWS CloudShell

In this chapter, we will create and manage our resources in the Oregon (us-west-2) region. Make sure that you have set the correct region before proceeding with the next steps.
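For reference, here is a minimal sketch of how the bucket prerequisite could be scripted with boto3 instead of being created by hand. This is not the procedure used in the chapter, which relies on the console and AWS CloudShell. The bucket name and the helper function names are hypothetical, and the final call needs boto3 plus valid AWS credentials, so it is wrapped in a function that is not executed here:

```python
def build_create_bucket_kwargs(bucket_name, region="us-west-2"):
    """Return the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region="us-west-2"):
    """Create the bucket. Requires boto3 and AWS credentials (assumption)."""
    import boto3  # imported lazily so the helper above runs without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**build_create_bucket_kwargs(bucket_name, region))


# Hypothetical bucket name; S3 bucket names must be globally unique.
print(build_create_bucket_kwargs("example-mle-on-aws-bucket"))
```

Keeping the region in one place makes it harder to accidentally create resources outside Oregon (us-west-2).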

Opening a text editor on your local machine

Make sure you have an open text editor (for example, VS Code) on your local machine. We will copy some string values into the text editor for later use in this chapter. Here are the values we will have to copy later in this chapter:

  • IAM sign-in link, username...

Running analytics at scale with Amazon Redshift Serverless

Data warehouses play a crucial role in data management, data analysis, and data engineering. Data engineers and ML engineers spend time building data warehouses to work on projects involving batch reporting and business intelligence.

Figure 4.11 – Data warehouse

As shown in the preceding diagram, a data warehouse contains combined data from different relational data sources such as PostgreSQL and MySQL databases. It generally serves as the single source of truth when querying data for reporting and business intelligence requirements. In ML experiments, a data warehouse can serve as the source of clean data where we can extract the dataset used to build and train ML models.
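To make the idea of combining data from several relational sources more concrete, here is a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for both the source databases and the warehouse, so it only illustrates the concept and is not how Redshift Serverless is loaded. All table names and rows are made up for illustration:

```python
import sqlite3

# Two in-memory databases stand in for separate production data sources.
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])

orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0)],
)

# A third database plays the role of the centralized warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE users (id INTEGER, name TEXT)")
warehouse.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER, amount REAL)")

# Copy the rows from each source into the warehouse.
warehouse.executemany(
    "INSERT INTO users VALUES (?, ?)",
    users_db.execute("SELECT id, name FROM users").fetchall(),
)
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    orders_db.execute("SELECT id, user_id, amount FROM orders").fetchall(),
)

# Reporting queries now run against the warehouse, not the source databases.
report = warehouse.execute(
    """
    SELECT u.name, COUNT(o.id), SUM(o.amount)
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY u.id
    """
).fetchall()
print(report)
```

Once the data is in one place, a single join answers a question that would otherwise require querying two separate systems.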

Note

When generating reports, businesses and start-ups may end up performing queries directly on the production databases used by running web applications. It is important to note that these queries may cause unplanned...

Setting up Lake Formation

Now, it’s time to take a closer look at setting up our serverless data lake on AWS! Before we begin, let’s define what a data lake is and what type of data is stored in it. A data lake is a centralized data store that contains a variety of structured, semi-structured, and unstructured data from different data sources. As shown in the following diagram, data can be stored in a data lake without us having to worry about the structure and format. We can use a variety of file types such as JSON, CSV, and Apache Parquet when storing data in a data lake. In addition to these, data lakes may include both raw and processed (clean) data:

Figure 4.26 – Getting started with data lakes
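The point that a data lake accepts files regardless of structure and format can be sketched locally. In the following example, a temporary directory stands in for the S3 location of the lake, with a raw prefix holding JSON and a processed prefix holding CSV. The prefix names, file names, and records are assumptions made for illustration, and Apache Parquet is omitted because writing it requires a third-party library:

```python
import csv
import json
import tempfile
from pathlib import Path

# Made-up purchase records used only for this illustration.
records = [
    {"user_id": 1, "product": "keyboard", "amount": 25.0},
    {"user_id": 2, "product": "monitor", "amount": 140.0},
]

# A temporary directory stands in for the data lake's storage location.
lake_root = Path(tempfile.mkdtemp())
raw_dir = lake_root / "raw"
processed_dir = lake_root / "processed"
raw_dir.mkdir()
processed_dir.mkdir()

# Raw data stored as JSON, one record per line.
with open(raw_dir / "purchases.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Processed data stored as CSV with a header row.
with open(processed_dir / "purchases.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "product", "amount"])
    writer.writeheader()
    writer.writerows(records)

# List every file in the lake, relative to its root.
lake_files = sorted(
    p.relative_to(lake_root).as_posix() for p in lake_root.rglob("*") if p.is_file()
)
print(lake_files)
```

Both files describe the same purchases, yet neither had to conform to a predefined schema before being stored.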

ML engineers and data scientists can use data lakes as the source of the data used for building and training ML models. Since the data stored in data lakes may be a mixture of both raw and clean data, additional data processing, data cleaning...

Using Amazon Athena to query data in Amazon S3

Amazon Athena is a serverless query service that allows us to use SQL statements to query data from files stored in S3. With Amazon Athena, we don’t have to worry about infrastructure management and it scales automatically to handle our queries:

Figure 4.35 – How Amazon Athena works

If you were to set this up yourself, you would need to launch a cluster of EC2 instances running an application such as Presto. In addition to this, you would need to manage the overall cost, security, performance, and stability of this EC2 cluster yourself.
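Although this chapter works with the Athena console, a query can also be submitted programmatically. Here is a minimal sketch using the boto3 Athena client's start_query_execution call. The database name, table name, and S3 output location are hypothetical placeholders, and the actual call requires boto3 and AWS credentials, so it is kept inside a function that is not executed here:

```python
def build_athena_query_kwargs(sql, database, output_location):
    """Return the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes the query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_athena_query(sql, database, output_location, region="us-west-2"):
    """Submit the query. Requires boto3 and AWS credentials (assumption)."""
    import boto3  # imported lazily so the helper above runs without boto3

    athena = boto3.client("athena", region_name=region)
    response = athena.start_query_execution(
        **build_athena_query_kwargs(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical database, table, and bucket names.
query_kwargs = build_athena_query_kwargs(
    "SELECT * FROM example_table LIMIT 10",
    "example_db",
    "s3://example-bucket/athena-results/",
)
print(query_kwargs)
```

Note that the call returns an execution ID rather than the rows themselves; the results are written to the S3 output location once the query finishes.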

Setting up the query result location

If the "Before you run your first query, you need to set up a query result location in Amazon S3" notification appears on the Editor page, this means that you must make a quick configuration change on the Amazon Athena Settings page so that Athena can store the query results in a specified S3 bucket location every time there’...

Summary

In this chapter, we were able to take a closer look at several AWS services that help enable serverless data management in organizations. When using serverless services, we no longer need to worry about infrastructure management, which helps us focus on what we need to do.

We were able to utilize Amazon Redshift Serverless to prepare a serverless data warehouse. We were also able to use AWS Lake Formation, AWS Glue, and Amazon Athena to create and query data from a serverless data lake. With these serverless services, we were able to load and query data in just a few minutes.

Further reading

For more information on the topics that were covered in this chapter, feel free to check out the following resources:
