What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Buy Now

Serverless Machine Learning with Amazon Redshift ML

Introduction to Amazon Redshift Serverless

“Hey, what’s a data warehouse?” John Doe, CEO and co-founder of Red.wines, a fictional specialty wine e-commerce company, asked Tathya Vishleshak*, the company’s CTO. John, who owned a boutique winery, had teamed up with Tathya for the project. The company’s success surged during the pandemic, driven by social media and the stay-at-home trend. John wanted detailed data analysis to align inventory and customer outreach. However, there was a problem – producing this analysis was slowing down their online transaction processing (OLTP) database.

“A data warehouse is like a big database where we store different data for a long time to find insights and make decisions,” Tathya explained.

John had a concern, “Sounds expensive; we’re already paying for unused warehouse space. Can we afford it?”

Tathya reassured him, “You’re right, but there are cloud data warehouses such as Amazon Redshift Serverless that let you pay as you use.”

Expanding on this, this chapter introduces data warehousing and Amazon Redshift. We’ll cover Amazon Redshift Serverless basics, such as namespaces and workgroups, and guide you in creating a data warehouse. Amazon Redshift can gather data from various sources, mainly Amazon Simple Storage Service (S3).

As we go through this chapter, you’ll learn about a crucial aspect of this, the AWS Identity and Access Management (IAM) role, needed for loading data from S3. This role connects to your Serverless namespace for smooth data transfer. You’ll also learn how to load sample data and run queries using Amazon Redshift query editor. Our goal is to make it simple and actionable, so you’re confident in navigating this journey.

Tathya Vishleshak

The phrase 'Tathya Vishleshak' can be loosely interpreted to reflect the concept of a data analyst in Sanskrit/Hindi. However, it's important to note that this is not a precise or established translation, but rather an attempt to convey a similar meaning based on the individual meanings of the words 'Tathya' and 'Vishleshak' in Sanskrit.

Additionally, Amazon Redshift is used to analyze structured and unstructured data in data warehouses, operational databases, and data lakes. It’s employed for traditional data warehousing, business intelligence, real-time analytics, and machine learning/predictive analytics. Data analysts and developers use Redshift data with machine learning (ML) models for tasks such as predicting customer behavior. Amazon Redshift ML streamlines this process using familiar SQL commands.

The book delves into ML, explaining supervised and unsupervised training. You’ll learn about problem-solving with binary classification, multi-class classification, and regression using real-world examples. You’ll also discover how to create deep learning models and custom models with XGBoost, as well as use time series forecasting. The book also covers in-database and remote inferences using existing models, applying ML for predictive analytics, and operationalizing machine learning models.

The following topics will be covered in this chapter:

What is Amazon Redshift?
Getting started with Amazon Redshift Serverless
Connecting to your data warehouse

This chapter requires a web browser and access to an AWS account.

What is Amazon Redshift?

Organizations churn out vast troves of customer data along with insights into these customers’ interactions with the business. This data gets funneled into various applications and stashed away in disconnected systems. A conundrum arises when attempting to decipher these data silos – a formidable challenge that hampers the derivation of meaningful insights essential for organizational clarity. Adding to this complexity, security and performance considerations typically muzzle business analysts from accessing data within OLTP systems.

The hiccup is that intricate analytical queries weigh down OLTP databases, casting a shadow over their core operations. Here, the solution is the data warehouse, which is a central hub of curated data, used by business analysts and data scientists to make informed decisions by employing the business intelligence and machine learning tools at their disposal. These users make use of Structured Query Language (SQL) to derive insights from this data trove. From operational systems, application logs, and social media streams to the influx of IoT device-generated data, customers channel structured and semi-structured data into organizations’ data warehouses, as depicted in Figure 1.1, showcasing the classic architecture of a conventional data warehouse.

Figure 1.1 – Data warehouse

Here’s where Amazon Redshift Serverless comes in. It’s a key option within Amazon Redshift, a well-managed cloud data warehouse offered by Amazon Web Services (AWS). With cloud-based ease, Amazon Redshift Serverless lets you set up your data storage without infrastructure hassles or cost worries. You pay based on what you use for compute and storage.

Amazon Redshift Serverless goes beyond convenience, propelling modern data applications that seamlessly connect to the data lake. Enter the data lake – a structure that gathers all data strands under one roof, providing limitless space to store data at any scale, cost-effectively. Alongside other data repositories such as data warehouses, data lakes redefine how organizations handle data. And this is where it all comes together – the following diagram shows how Amazon Redshift Serverless injects SQL-powered queries into the data lake, driving a dynamic data flow:

Figure 1.2 – Data lake and data warehouse

So, let’s get started on creating our first data warehouse in the cloud!

Getting started with Amazon Redshift Serverless

You can create your data warehouse with Amazon Redshift Serverless using the AWS Command-Line Interface (CLI), the API, AWS CloudFormation templates, or the AWS console. We are going to use the AWS console to create a Redshift Serverless data warehouse. Log in to your AWS console and search for Redshift in the top bar, as shown in Figure 1.3:

Figure 1.3 – AWS console page showing services filtered by our search for Redshift

Click on Amazon Redshift, which will take you to the home page for the Amazon Redshift console, as shown in Figure 1.4. To help get you started, Amazon provides free credit for first-time Redshift Serverless customers. So, let’s start creating your trial data warehouse by clicking on Try Amazon Redshift Serverless. If you or your organization has tried Amazon Redshift Serverless before, you will have to pay for the service based on your usage:

Figure 1.4 – Amazon Redshift service page in the AWS console

If you have free credit available, it will be indicated at the top of your screen, as in Figure 1.5:

Figure 1.5 – AWS console showing the Redshift Serverless Get started page

You can either choose the defaults or use the customized settings to create your data warehouse. The customized settings give you more control, allowing you to specify many additional parameters for your compute configuration including the workgroup, data-related settings such as the namespace, and advanced security settings. We will use the customized settings, which will help us customize the namespace settings for our Serverless data warehouse. A namespace combined with a workgroup is what makes a data warehouse with Redshift Serverless, as we will now see in more detail.

What is a namespace?

Amazon Redshift Serverless provides a separation of storage and compute for a data warehouse. A namespace is a collection of all your data stored in the database such as your tables, views, database users, and their privileges. You are separately charged for storage based on the size of the data stored in your data warehouse. For compute, you are charged for the capacity used over a given duration in Redshift processing hours (RPU) on a per second-basis. The storage capacity is billed as Redshift managed storage (RMS) and is billed by GB/month. You can view https://aws.amazon.com/redshift/pricing/ for detailed pricing for your AWS Region.

As a data warehouse admin, you can change the name of your data warehouse namespace while creating the namespace. You can also change your encryption settings, audit logging, and AWS IAM permissions, as shown in Figure 1.6. The primary reason we are going to use customized settings is to associate an IAM role with the namespace:

Figure 1.6 – Namespace configuration

AWS IAM allows you to specify which users or services can access other services and resources in AWS. We will use that role for loading data from S3 and training a machine learning model with Redshift ML that accesses Amazon SageMaker.

If you have already created an IAM role earlier, you can associate with that IAM role. If you have not created an IAM role, do so now by selecting the Manage IAM roles option, as shown in Figure 1.7:

Figure 1.7 – Creating an IAM role and associating it via the AWS console

Then, select the Create IAM role option, as shown in Figure 1.8:

Figure 1.8 – Selecting the “Create IAM role” option

You can then create a default IAM role and provide appropriate permissions to the IAM role to allow it to access S3 buckets, as shown in Figure 1.9:

Figure 1.9 – Granting S3 permissions to the IAM role

As shown in the preceding figure, select Any S3 bucket to enable Redshift to read data from and write data to all S3 buckets you have created. Then, select Create IAM role as default to create the role and set it as the default IAM role.

Figure 1.10 – An IAM role was created but is not yet applied

As shown in Figure 1.10, we created the IAM role and associated it with the namespace as a default role. Let’s next proceed to create a workgroup, wherein we will set up the compute configuration for the data warehouse.

What is a workgroup?

As we discussed earlier, a namespace combined with a workgroup is what makes a Redshift Serverless data warehouse. A workgroup provides the compute resources required to process your data. It also provides the endpoint for you to connect to the warehouse. As an admin, you need to configure the compute settings such as the network and security configuration for the workgroup.

We will not do any customization at this time and simply select the default settings instead, including the VPC and associated subnets for the workgroup, as shown in the following screenshot:

Figure 1.11 – Default settings and associated subnets for the workgroup

Click on the Save configuration button to create your Redshift Serverless instance, and your first data warehouse will be ready in a few minutes:

Figure 1.12 – Redshift Serverless creation progress

Once your data warehouse is ready, you will be redirected to your Serverless dashboard, as shown in Figure 1.13:

Figure 1.13 – Serverless dashboard showing your namespace and workgroup

Now that we have created our data warehouse, we will connect to the data warehouse, load some sample data, and run some queries.

Connecting to your data warehouse

Your data warehouse with Redshift Serverless is now ready. You can connect to your data warehouse using third-party tools via JDBC/ODBC/Python drivers. Other options include the Data API or the embedded Redshift query editor v2.

Using Amazon Redshift query editor v2

Now that your data warehouse is ready; let’s navigate to the query editor to load some sample data and run some queries. Select the Query data option from your dashboard, as shown in Figure 1.13, and you will be navigated to the query editor, as shown in Figure 1.14.

Figure 1.14 – Query editor

In the Redshift query editor v2 console, on the left pane, you will see the data warehouses, such as the Serverless:default workgroup, that you have access to. Click on the workgroup (Serverless:default) to connect to the data warehouse.

Figure 1.15 – Creating a connection to your workgroup

As shown in the preceding screenshot, select Federated user if you did not specify any database credentials while creating the namespace, and then click Create connection. You can leave the database name as dev. You will be prompted to create a connection only when connecting to the data warehouse for the first time. If you have created the connection, you will be connected automatically when you click on the workgroup. Once you are connected, you will see the databases in the navigator, as shown in Figure 1.16:

Figure 1.16 – List of databases

Since we just created our data warehouse for the first time, there is no data present in it, so let’s load some sample data into the data warehouse now.

Loading sample data

On the left pane, click on the sample_data_dev database to expand the available database:

Figure 1.17 – The Redshift query editor v2 navigator that shows the sample data available

As you can see from the preceding screenshot, three sample datasets are available for you to load into your data warehouse. Click on the icon showing the folder with an arrow located to the right of your chosen sample data notebook to load and open it, as shown in Figure 1.18:

Figure 1.18 – List of sample databases

You will be prompted to create your sample database. Click on Create to get started, as shown in Figure 1.19:

Figure 1.19 – Creating a sample database

The sample data will be loaded in a few seconds and presented in a notebook with SQL queries for the dataset that you can explore, as shown in Figure 1.20:

Figure 1.20 – Notebook with sample queries for the tickit database

You can expand the navigation tree on the left side of the query editor to view schemas and database objects, such as tables and views in your schema, as shown in Figure 1.21.

Figure 1.21 – Expanding the navigation tree to view schemas and database objects

You can click on a table to view the table definitions, as shown in Figure 1.22:

Figure 1.22 – Table definitions

Right-clicking on a table provides additional Select table, Show table definition, and Delete options, as shown in Figure 1.23:

Figure 1.23 – Right-clicking on a table to view more options

You can click Run all, as shown in Figure 1.24, to run all the queries in the sample notebook. The query editor provides a notebook interface to add annotation, and SQL cells organize your queries in a single document. You can use annotations for documentation purposes.

Figure 1.24 – The “Run all” option

You will see the results of your queries for each cell. You can download the results as JSON or CSV files to your desktop, as shown in Figure 1.25:

Figure 1.25 – Options to download query results

Let’s author our first query.

Running your first query

We want to find out the top 10 events by sales in the tickit database. We will run the following SQL statement in the data warehouse:

SELECT eventname, total_price
FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile
       FROM (SELECT eventid, sum(pricepaid) total_price
             FROM   tickit.sales
             GROUP BY eventid)) Q, tickit.event E
       WHERE Q.eventid = E.eventid
       AND percentile = 1
ORDER BY total_price desc
limit 10;

In the query editor, add a new query by clicking on the + sign and selecting Editor from the menu that appears. If you wanted to create a new notebook, you could click on Notebook instead, as shown in Figure 1.26:

Figure 1.26 – Creating a new query

Now, type the preceding SQL query in the editor and then click on Run. You will get the results as shown in the following screenshot:

Figure 1.27 – Query with results

As the saying goes, “A picture is worth a thousand words,” and query editor allows you to visualize the results to gain faster insight. You can create a chart easily by clicking on the Chart option and then selecting the chart you want. Let’s select a scatter plot, as shown in Figure 1.28:

Figure 1.28 – Using charts in Redshift query editor v2

You can add a chart name and notations for the X and Y axes and export the chart as PNG or JPG to put in your presentation or to share with your business partners:

Figure 1.29 – Charting options in query editor v2

As you have now seen, you can use Redshift query editor v2 to create your own database, create tables, load data, and run and author queries and notebooks. You can share your queries and notebooks with your team members.

Key benefits

Leverage supervised learning to build binary classification, multi-class classification, and regression models

Learn to use unsupervised learning using the K-means clustering method

Master the art of time series forecasting using Redshift ML

Purchase of the print or Kindle book includes a free PDF eBook

Description

Amazon Redshift Serverless enables organizations to run petabyte-scale cloud data warehouses quickly and in a cost-effective way, enabling data science professionals to efficiently deploy cloud data warehouses and leverage easy-to-use tools to train models and run predictions. This practical guide will help developers and data professionals working with Amazon Redshift data warehouses to put their SQL knowledge to work for training and deploying machine learning models. The book begins by helping you to explore the inner workings of Redshift Serverless as well as the foundations of data analytics and types of data machine learning. With the help of step-by-step explanations of essential concepts and practical examples, you’ll then learn to build your own classification and regression models. As you advance, you’ll find out how to deploy various types of machine learning projects using familiar SQL code, before delving into Redshift ML. In the concluding chapters, you’ll discover best practices for implementing serverless architecture with Redshift. By the end of this book, you’ll be able to configure and deploy Amazon Redshift Serverless, train and deploy machine learning models using Amazon Redshift ML, and run inference queries at scale.

What you will learn

Utilize Redshift Serverless for data ingestion, data analysis, and machine learning Create supervised and unsupervised models and learn how to supply your own custom parameters Discover how to use time series forecasting in your data warehouse Create a SageMaker endpoint and use that to build a Redshift ML model for remote inference Find out how to operationalize machine learning in your data warehouse Use model explainability and calculate probabilities with Amazon Redshift ML

What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Buy Now

Serverless Machine Learning with Amazon Redshift ML: Create, train, and deploy machine learning models using familiar SQL commands

What do you get with eBook?

Product Details

Serverless Machine Learning with Amazon Redshift ML

Introduction to Amazon Redshift Serverless

What is Amazon Redshift?

Getting started with Amazon Redshift Serverless

What is a namespace?

What is a workgroup?

Connecting to your data warehouse

Using Amazon Redshift query editor v2

Loading sample data

Running your first query

Summary

Page 1 of 5

Key benefits

Description

What you will learn

What do you get with eBook?

Product Details

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

Authors (5)

FAQs

Serverless Machine Learning with Amazon Redshift ML: Create, train, and deploy machine learning models using familiar SQL commands

What do you get with eBook?

Product Details

Key benefits

Description

What you will learn

What do you get with eBook?

Product Details

Packt Subscriptions

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

Authors (5)

FAQs