Home Data Serverless Machine Learning with Amazon Redshift ML

Serverless Machine Learning with Amazon Redshift ML

By Debu Panda , Phil Bates , Bhanu Pittampally and 1 more
books-svg-icon Book
eBook $39.99 $27.98
Print $49.99
Subscription $15.99 $10 p/m for three months
$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
eBook $39.99 $27.98
Print $49.99
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
  1. Free Chapter
    Chapter 1: Introduction to Amazon Redshift Serverless
About this book
Amazon Redshift Serverless enables organizations to run petabyte-scale cloud data warehouses quickly and in a cost-effective way, enabling data science professionals to efficiently deploy cloud data warehouses and leverage easy-to-use tools to train models and run predictions. This practical guide will help developers and data professionals working with Amazon Redshift data warehouses to put their SQL knowledge to work for training and deploying machine learning models. The book begins by helping you to explore the inner workings of Redshift Serverless as well as the foundations of data analytics and types of data machine learning. With the help of step-by-step explanations of essential concepts and practical examples, you’ll then learn to build your own classification and regression models. As you advance, you’ll find out how to deploy various types of machine learning projects using familiar SQL code, before delving into Redshift ML. In the concluding chapters, you’ll discover best practices for implementing serverless architecture with Redshift. By the end of this book, you’ll be able to configure and deploy Amazon Redshift Serverless, train and deploy machine learning models using Amazon Redshift ML, and run inference queries at scale.
Publication date:
August 2023
Publisher
Packt
Pages
290
ISBN
9781804619285

 

Introduction to Amazon Redshift Serverless

“Hey, what’s a data warehouse?” John Doe, CEO and co-founder of Red.wines, a fictional specialty wine e-commerce company, asked Tathya Vishleshak*, the company’s CTO. John, who owned a boutique winery, had teamed up with Tathya for the project. The company’s success surged during the pandemic, driven by social media and the stay-at-home trend. John wanted detailed data analysis to align inventory and customer outreach. However, there was a problem – producing this analysis was slowing down their online transaction processing (OLTP) database.

“A data warehouse is like a big database where we store different data for a long time to find insights and make decisions,” Tathya explained.

John had a concern, “Sounds expensive; we’re already paying for unused warehouse space. Can we afford it?”

Tathya reassured him, “You’re right, but there are cloud data warehouses such as Amazon Redshift Serverless that let you pay as you use.”

Expanding on this, this chapter introduces data warehousing and Amazon Redshift. We’ll cover Amazon Redshift Serverless basics, such as namespaces and workgroups, and guide you in creating a data warehouse. Amazon Redshift can gather data from various sources, mainly Amazon Simple Storage Service (S3).

As we go through this chapter, you’ll learn about a crucial aspect of this, the AWS Identity and Access Management (IAM) role, needed for loading data from S3. This role connects to your Serverless namespace for smooth data transfer. You’ll also learn how to load sample data and run queries using Amazon Redshift query editor. Our goal is to make it simple and actionable, so you’re confident in navigating this journey.

Tathya Vishleshak

The phrase 'Tathya Vishleshak' can be loosely interpreted to reflect the concept of a data analyst in Sanskrit/Hindi. However, it's important to note that this is not a precise or established translation, but rather an attempt to convey a similar meaning based on the individual meanings of the words 'Tathya' and 'Vishleshak' in Sanskrit.

Additionally, Amazon Redshift is used to analyze structured and unstructured data in data warehouses, operational databases, and data lakes. It’s employed for traditional data warehousing, business intelligence, real-time analytics, and machine learning/predictive analytics. Data analysts and developers use Redshift data with machine learning (ML) models for tasks such as predicting customer behavior. Amazon Redshift ML streamlines this process using familiar SQL commands.

The book delves into ML, explaining supervised and unsupervised training. You’ll learn about problem-solving with binary classification, multi-class classification, and regression using real-world examples. You’ll also discover how to create deep learning models and custom models with XGBoost, as well as use time series forecasting. The book also covers in-database and remote inferences using existing models, applying ML for predictive analytics, and operationalizing machine learning models.

The following topics will be covered in this chapter:

  • What is Amazon Redshift?
  • Getting started with Amazon Redshift Serverless
  • Connecting to your data warehouse

This chapter requires a web browser and access to an AWS account.

 

What is Amazon Redshift?

Organizations churn out vast troves of customer data along with insights into these customers’ interactions with the business. This data gets funneled into various applications and stashed away in disconnected systems. A conundrum arises when attempting to decipher these data silos – a formidable challenge that hampers the derivation of meaningful insights essential for organizational clarity. Adding to this complexity, security and performance considerations typically muzzle business analysts from accessing data within OLTP systems.

The hiccup is that intricate analytical queries weigh down OLTP databases, casting a shadow over their core operations. Here, the solution is the data warehouse, which is a central hub of curated data, used by business analysts and data scientists to make informed decisions by employing the business intelligence and machine learning tools at their disposal. These users make use of Structured Query Language (SQL) to derive insights from this data trove. From operational systems, application logs, and social media streams to the influx of IoT device-generated data, customers channel structured and semi-structured data into organizations’ data warehouses, as depicted in Figure 1.1, showcasing the classic architecture of a conventional data warehouse.

Figure 1.1 – Data warehouse

Figure 1.1 – Data warehouse

Here’s where Amazon Redshift Serverless comes in. It’s a key option within Amazon Redshift, a well-managed cloud data warehouse offered by Amazon Web Services (AWS). With cloud-based ease, Amazon Redshift Serverless lets you set up your data storage without infrastructure hassles or cost worries. You pay based on what you use for compute and storage.

Amazon Redshift Serverless goes beyond convenience, propelling modern data applications that seamlessly connect to the data lake. Enter the data lake – a structure that gathers all data strands under one roof, providing limitless space to store data at any scale, cost-effectively. Alongside other data repositories such as data warehouses, data lakes redefine how organizations handle data. And this is where it all comes together – the following diagram shows how Amazon Redshift Serverless injects SQL-powered queries into the data lake, driving a dynamic data flow:

Figure 1.2 – Data lake and data warehouse

Figure 1.2 – Data lake and data warehouse

So, let’s get started on creating our first data warehouse in the cloud!

 

Getting started with Amazon Redshift Serverless

You can create your data warehouse with Amazon Redshift Serverless using the AWS Command-Line Interface (CLI), the API, AWS CloudFormation templates, or the AWS console. We are going to use the AWS console to create a Redshift Serverless data warehouse. Log in to your AWS console and search for Redshift in the top bar, as shown in Figure 1.3:

Figure 1.3 – AWS console page showing services filtered by our search for Redshift

Figure 1.3 – AWS console page showing services filtered by our search for Redshift

Click on Amazon Redshift, which will take you to the home page for the Amazon Redshift console, as shown in Figure 1.4. To help get you started, Amazon provides free credit for first-time Redshift Serverless customers. So, let’s start creating your trial data warehouse by clicking on Try Amazon Redshift Serverless. If you or your organization has tried Amazon Redshift Serverless before, you will have to pay for the service based on your usage:

Figure 1.4 – Amazon Redshift service page in the AWS console

Figure 1.4 – Amazon Redshift service page in the AWS console

If you have free credit available, it will be indicated at the top of your screen, as in Figure 1.5:

Figure 1.5 – AWS console showing the Redshift Serverless Get started page

Figure 1.5 – AWS console showing the Redshift Serverless Get started page

You can either choose the defaults or use the customized settings to create your data warehouse. The customized settings give you more control, allowing you to specify many additional parameters for your compute configuration including the workgroup, data-related settings such as the namespace, and advanced security settings. We will use the customized settings, which will help us customize the namespace settings for our Serverless data warehouse. A namespace combined with a workgroup is what makes a data warehouse with Redshift Serverless, as we will now see in more detail.

What is a namespace?

Amazon Redshift Serverless provides a separation of storage and compute for a data warehouse. A namespace is a collection of all your data stored in the database such as your tables, views, database users, and their privileges. You are separately charged for storage based on the size of the data stored in your data warehouse. For compute, you are charged for the capacity used over a given duration in Redshift processing hours (RPU) on a per second-basis. The storage capacity is billed as Redshift managed storage (RMS) and is billed by GB/month. You can view https://aws.amazon.com/redshift/pricing/ for detailed pricing for your AWS Region.

As a data warehouse admin, you can change the name of your data warehouse namespace while creating the namespace. You can also change your encryption settings, audit logging, and AWS IAM permissions, as shown in Figure 1.6. The primary reason we are going to use customized settings is to associate an IAM role with the namespace:

Figure 1.6 – Namespace configuration

Figure 1.6 – Namespace configuration

AWS IAM allows you to specify which users or services can access other services and resources in AWS. We will use that role for loading data from S3 and training a machine learning model with Redshift ML that accesses Amazon SageMaker.

If you have already created an IAM role earlier, you can associate with that IAM role. If you have not created an IAM role, do so now by selecting the Manage IAM roles option, as shown in Figure 1.7:

Figure 1.7 – Creating an IAM role and associating it via the AWS console

Figure 1.7 – Creating an IAM role and associating it via the AWS console

Then, select the Create IAM role option, as shown in Figure 1.8:

Figure 1.8 – Selecting the “Create IAM role” option

Figure 1.8 – Selecting the “Create IAM role” option

You can then create a default IAM role and provide appropriate permissions to the IAM role to allow it to access S3 buckets, as shown in Figure 1.9:

Figure 1.9 – Granting S3 permissions to the IAM role

Figure 1.9 – Granting S3 permissions to the IAM role

As shown in the preceding figure, select Any S3 bucket to enable Redshift to read data from and write data to all S3 buckets you have created. Then, select Create IAM role as default to create the role and set it as the default IAM role.

Figure 1.10 – An IAM role was created but is not yet applied

Figure 1.10 – An IAM role was created but is not yet applied

As shown in Figure 1.10, we created the IAM role and associated it with the namespace as a default role. Let’s next proceed to create a workgroup, wherein we will set up the compute configuration for the data warehouse.

What is a workgroup?

As we discussed earlier, a namespace combined with a workgroup is what makes a Redshift Serverless data warehouse. A workgroup provides the compute resources required to process your data. It also provides the endpoint for you to connect to the warehouse. As an admin, you need to configure the compute settings such as the network and security configuration for the workgroup.

We will not do any customization at this time and simply select the default settings instead, including the VPC and associated subnets for the workgroup, as shown in the following screenshot:

Figure 1.11 – Default settings and associated subnets for the workgroup

Figure 1.11 – Default settings and associated subnets for the workgroup

Click on the Save configuration button to create your Redshift Serverless instance, and your first data warehouse will be ready in a few minutes:

Figure 1.12 – Redshift Serverless creation progress

Figure 1.12 – Redshift Serverless creation progress

Once your data warehouse is ready, you will be redirected to your Serverless dashboard, as shown in Figure 1.13:

Figure 1.13 – Serverless dashboard showing your namespace and workgroup

Figure 1.13 – Serverless dashboard showing your namespace and workgroup

Now that we have created our data warehouse, we will connect to the data warehouse, load some sample data, and run some queries.

 

Connecting to your data warehouse

Your data warehouse with Redshift Serverless is now ready. You can connect to your data warehouse using third-party tools via JDBC/ODBC/Python drivers. Other options include the Data API or the embedded Redshift query editor v2.

Using Amazon Redshift query editor v2

Now that your data warehouse is ready; let’s navigate to the query editor to load some sample data and run some queries. Select the Query data option from your dashboard, as shown in Figure 1.13, and you will be navigated to the query editor, as shown in Figure 1.14.

Figure 1.14 – Query editor

Figure 1.14 – Query editor

In the Redshift query editor v2 console, on the left pane, you will see the data warehouses, such as the Serverless:default workgroup, that you have access to. Click on the workgroup (Serverless:default) to connect to the data warehouse.

Figure 1.15 – Creating a connection to your workgroup

Figure 1.15 – Creating a connection to your workgroup

As shown in the preceding screenshot, select Federated user if you did not specify any database credentials while creating the namespace, and then click Create connection. You can leave the database name as dev. You will be prompted to create a connection only when connecting to the data warehouse for the first time. If you have created the connection, you will be connected automatically when you click on the workgroup. Once you are connected, you will see the databases in the navigator, as shown in Figure 1.16:

Figure 1.16 – List of databases

Figure 1.16 – List of databases

Since we just created our data warehouse for the first time, there is no data present in it, so let’s load some sample data into the data warehouse now.

Loading sample data

On the left pane, click on the sample_data_dev database to expand the available database:

Figure 1.17 – The Redshift query editor v2 navigator that shows the sample data available

Figure 1.17 – The Redshift query editor v2 navigator that shows the sample data available

As you can see from the preceding screenshot, three sample datasets are available for you to load into your data warehouse. Click on the icon showing the folder with an arrow located to the right of your chosen sample data notebook to load and open it, as shown in Figure 1.18:

Figure 1.18 – List of sample databases

Figure 1.18 – List of sample databases

You will be prompted to create your sample database. Click on Create to get started, as shown in Figure 1.19:

Figure 1.19 – Creating a sample database

Figure 1.19 – Creating a sample database

The sample data will be loaded in a few seconds and presented in a notebook with SQL queries for the dataset that you can explore, as shown in Figure 1.20:

Figure 1.20 – Notebook with sample queries for the tickit database

Figure 1.20 – Notebook with sample queries for the tickit database

You can expand the navigation tree on the left side of the query editor to view schemas and database objects, such as tables and views in your schema, as shown in Figure 1.21.

Figure 1.21 – Expanding the navigation tree to view schemas and database objects

Figure 1.21 – Expanding the navigation tree to view schemas and database objects

You can click on a table to view the table definitions, as shown in Figure 1.22:

Figure 1.22 – Table definitions

Figure 1.22 – Table definitions

Right-clicking on a table provides additional Select table, Show table definition, and Delete options, as shown in Figure 1.23:

Figure 1.23 – Right-clicking on a table to view more options

Figure 1.23 – Right-clicking on a table to view more options

You can click Run all, as shown in Figure 1.24, to run all the queries in the sample notebook. The query editor provides a notebook interface to add annotation, and SQL cells organize your queries in a single document. You can use annotations for documentation purposes.

Figure 1.24 – The “Run all” option

Figure 1.24 – The “Run all” option

You will see the results of your queries for each cell. You can download the results as JSON or CSV files to your desktop, as shown in Figure 1.25:

Figure 1.25 – Options to download query results

Figure 1.25 – Options to download query results

Let’s author our first query.

Running your first query

We want to find out the top 10 events by sales in the tickit database. We will run the following SQL statement in the data warehouse:

SELECT eventname, total_price
FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile
       FROM (SELECT eventid, sum(pricepaid) total_price
             FROM   tickit.sales
             GROUP BY eventid)) Q, tickit.event E
       WHERE Q.eventid = E.eventid
       AND percentile = 1
ORDER BY total_price desc
limit 10;

In the query editor, add a new query by clicking on the + sign and selecting Editor from the menu that appears. If you wanted to create a new notebook, you could click on Notebook instead, as shown in Figure 1.26:

Figure 1.26 – Creating a new query

Figure 1.26 – Creating a new query

Now, type the preceding SQL query in the editor and then click on Run. You will get the results as shown in the following screenshot:

Figure 1.27 – Query with results

Figure 1.27 – Query with results

As the saying goes, “A picture is worth a thousand words,” and query editor allows you to visualize the results to gain faster insight. You can create a chart easily by clicking on the Chart option and then selecting the chart you want. Let’s select a scatter plot, as shown in Figure 1.28:

Figure 1.28 – Using charts in Redshift query editor v2

Figure 1.28 – Using charts in Redshift query editor v2

You can add a chart name and notations for the X and Y axes and export the chart as PNG or JPG to put in your presentation or to share with your business partners:

Figure 1.29 – Charting options in query editor v2

Figure 1.29 – Charting options in query editor v2

As you have now seen, you can use Redshift query editor v2 to create your own database, create tables, load data, and run and author queries and notebooks. You can share your queries and notebooks with your team members.

 

Summary

In this chapter, you learned about cloud data warehouses and Amazon Redshift Serverless. You created your first data warehouse powered by Redshift Serverless and loaded some sample data using the query editor. You also learned how to use the query editor to run queries and visualize data to produce insights.

In Chapter 2, you will learn the best techniques for loading data and performing analytics in your Amazon Redshift Serverless data warehouse.

About the Authors
  • Debu Panda

    Debu Panda, a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt, 2009).

    Browse publications by this author
  • Phil Bates

    Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and leveraging the power of ML within their data warehouse.

    Browse publications by this author
  • Bhanu Pittampally

    Bhanu Pittampally is Analytics Specialist Solutions Architect at Amazon Web Services. His background is in data and analytics and is in the field for over 16 years. He currently lives in Frisco, TX with his wife Kavitha and daughters Vibha and Medha.

    Browse publications by this author
  • Sumeet Joshi

    Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 17 years of experience in the data warehousing and analytical space.

    Browse publications by this author
Serverless Machine Learning with Amazon Redshift ML
Unlock this book and the full library FREE for 7 days
Start now