A Deeper Dive into Data Marts and Amazon Redshift

While a data lake enables a significant amount of analytics to happen inside it, there are several use cases where a data engineer may need to load data into an external data warehouse, or data mart, to better support a specific set of data consumers.

As we reviewed in Chapter 2, Data Management Architectures for Analytics, a data lake is a single source of truth across multiple lines of business, while a data mart generally contains a subset of data of interest to a particular group of users. A data mart could be a relational database, a data warehouse, or a different kind of datastore.

Data marts serve two primary purposes. First, they provide a database with a subset of the data in the data lake, optimized for specific types of queries (such as for a specific business function). Second, they provide a higher-performing, lower-latency query engine, which is often required for specific analytic use cases (such as for powering Business...

Technical requirements

For the hands-on exercises in this chapter, you will need permission to create a new IAM role, as well as permission to create a Redshift cluster.

You can find the code files of this chapter in the GitHub repository using the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter09

Extending analytics with data warehouses/data marts

Tools such as Amazon Athena (which we will do a deeper dive into in Chapter 11, Ad Hoc Queries with Amazon Athena) allow us to run SQL queries directly on data in the data lake. While this enables us to query very large datasets that exist in an Amazon S3 data lake, the performance of these queries is generally lower than the performance you get when running queries against data on a high-performance disk that is local to the compute engine.
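
To make this concrete, the following is a minimal sketch of the kind of SQL you might run directly against data lake files through Athena (the datalake_db.orders table and its columns are hypothetical examples, not part of this chapter's dataset):

-- Aggregate a hypothetical orders table defined over files in Amazon S3
SELECT customer_region,
       COUNT(*) AS order_count
FROM datalake_db.orders
WHERE order_date >= DATE '2023-01-01'
GROUP BY customer_region;

Because the engine scans the underlying S3 objects at query time rather than reading from local, high-performance storage, queries like this typically have higher latency than the same query run inside a data warehouse.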

However, not all queries require this kind of high performance, and we can categorize our queries and data into cold, warm, and hot tiers. Before diving into the topic of data marts and data warehouses, let’s first take a look at the different tiers of queries/data storage that are common in data lake projects.

Cold and warm data

We’ve grouped the cold and warm data tiers into one section, as when building in AWS, both of these tiers generally use Amazon S3 storage...

What not to do – anti-patterns for a data warehouse

While there are many good ways to use a data warehouse for analytics, there are some approaches that at first may seem to be a good fit for a data warehouse but are generally not recommended.

Let’s take a look at some of the ways of using a data warehouse that should be avoided.

Using a data warehouse as a transactional datastore

Data warehouses are designed to be optimized for Online Analytical Processing (OLAP) queries, so they should not be used for Online Transaction Processing (OLTP) queries and use cases.

While there are mechanisms to update or delete data from a data warehouse (such as the MERGE statement in Redshift), a data warehouse is primarily designed for append-only, or insert, workloads. There are also other features of transactional databases (such as MySQL or PostgreSQL) that are available in Redshift – such as the concept of primary and foreign keys – but these are...
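
For reference, a MERGE-based upsert looks roughly like the following sketch (the sales_fact and sales_staging tables are hypothetical); while this works, workloads built around frequent small updates like this are better served by an OLTP database:

-- Upsert changed rows from a staging table into a target table
-- (hypothetical table names; Redshift remains optimized for bulk, append-heavy loads)
MERGE INTO sales_fact
USING sales_staging
ON sales_fact.sale_id = sales_staging.sale_id
WHEN MATCHED THEN
    UPDATE SET amount = sales_staging.amount, updated_at = sales_staging.updated_at
WHEN NOT MATCHED THEN
    INSERT (sale_id, amount, updated_at)
    VALUES (sales_staging.sale_id, sales_staging.amount, sales_staging.updated_at);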

Redshift architecture review and storage deep dive

In this section, we will take a deeper dive into the architecture of Redshift clusters, as well as into how data in tables is stored across Redshift nodes. This in-depth look will help you understand and fine-tune Redshift’s performance, though we will also cover how many of the design decisions affecting table layout can be automated by Redshift.

In Chapter 2, Data Management Architectures for Analytics, we briefly discussed how the Redshift architecture uses leader and compute nodes. Each compute node contains a certain amount of compute power (CPUs and memory), as well as a certain amount of local storage. When configuring your Redshift cluster, you can add multiple compute nodes, depending on your compute and storage requirements. Note that to provide fault tolerance and improved durability, the compute nodes have 2.5–3x the stated node storage capacity (for example, if the addressable storage capacity is listed...

Designing a high-performance data warehouse

When you’re looking to design a high-performing data warehouse, multiple factors need to be considered. These include items such as cluster type and sizing, compression types, distribution keys, sort keys, data types, and table constraints.

As part of the design process, you will need to consider several trade-offs, such as cost versus performance. Business requirements and the available budget will often drive these decisions.

Beyond decisions about infrastructure and storage, the logical schema design also plays a big part in optimizing the performance of the data warehouse. Often, this will be an iterative process, where you start with an initial schema design that you refine over time to optimize for increased performance.
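
To illustrate how some of these choices appear in a table definition, here is a minimal sketch (with hypothetical table and column names) that sets an explicit distribution style, distribution key, and sort key; if you omit these clauses, Redshift applies automatic (AUTO) distribution and sort key selection instead:

-- Hypothetical fact table with explicit distribution and sort settings
CREATE TABLE sales_fact (
    sale_id     BIGINT,
    customer_id INTEGER,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);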

Provisioned versus Redshift Serverless clusters

When creating an Amazon Redshift cluster, you can choose to either use a serverless Redshift configuration or provision specific resources. With...

Moving data between a data lake and Redshift

Moving data between a data lake and a data warehouse, such as Amazon Redshift, is a common requirement for many use cases. Data may be cleansed and processed with Glue ETL jobs in the data lake, for example, and then hot data can be loaded into Redshift so that it can be queried via BI tools with optimal performance.

In the same way, there are certain use cases where data may be further processed in the data warehouse, and this newly processed data then needs to be exported back to the data lake so that other users and processes can consume this data.
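
The usual mechanism for this export direction is the UNLOAD command. The following is a minimal sketch (the bucket, IAM role, and table names are hypothetical) that writes query results back to the data lake as partitioned Parquet files:

-- Export aggregated results from Redshift back to Amazon S3 as Parquet
UNLOAD ('SELECT sale_date, customer_region, SUM(amount) AS total_amount
         FROM sales_fact
         GROUP BY sale_date, customer_region')
TO 's3://my-example-datalake/exports/sales_summary/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET
PARTITION BY (sale_date);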

In this section, we will examine some best practices and recommendations for both ingesting data from the data lake and exporting data back to the data lake.

Optimizing data ingestion in Redshift

While there are various ways that you can insert data into Redshift, the recommended way is to bulk ingest data using the Redshift COPY command. The COPY command enables optimized...
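
As a minimal sketch of what a bulk load might look like (the bucket, table, and IAM role below are hypothetical), COPY reads files from S3 in parallel across the compute nodes:

-- Bulk load Parquet files from the data lake into a Redshift table
COPY sales_fact
FROM 's3://my-example-datalake/curated/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;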

Exploring advanced Redshift features

While Redshift was first launched back in 2013, AWS continually adds new functionality and features to the service. In this section, we are going to look at some of the advanced capabilities launched in the past few years that can help you get the most from your Redshift cluster. You should also regularly review the AWS What’s New page for Redshift (https://aws.amazon.com/redshift/whats-new/), as well as AWS blog posts tagged with Redshift (https://aws.amazon.com/blogs/big-data/tag/amazon-redshift/), to ensure you keep up to date with new features.

Data sharing between Redshift clusters

There are a number of use cases where you may want to share data from one Redshift cluster with another (or multiple other) Redshift clusters. For example, if you implement a data mesh architecture, you may want to make data from one part of the business easily accessible to other parts of the business without needing...
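
As a rough sketch of how this looks in SQL (the datashare name, schema, and namespace identifiers below are hypothetical placeholders), the producer cluster creates and grants a datashare, and the consumer cluster then creates a database from it:

-- On the producer cluster: create a datashare and grant access to a consumer namespace
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '11111111-2222-3333-4444-555555555555';

-- On the consumer cluster: expose the shared data as a local database
CREATE DATABASE sales_shared FROM DATASHARE sales_share OF NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';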

Hands-on – deploying a Redshift Serverless cluster and running Redshift Spectrum queries

In our Redshift hands-on exercise, we’re going to create a new Redshift Serverless cluster and configure Redshift Spectrum so that we can query data in external tables on Amazon S3. We’ll then use both Redshift Spectrum and Amazon Athena to query data in S3.
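
As a hedged preview of the key step, Redshift Spectrum queries S3 data through an external schema that maps to a database in the AWS Glue Data Catalog, along the lines of the following sketch (the schema, catalog database, role ARN, and table names here are hypothetical, not the exact names used in the exercise steps that follow):

-- Map a Glue Data Catalog database into Redshift as an external schema
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'datalake_users_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External tables in that schema can then be queried like local tables
SELECT COUNT(*) FROM spectrum_demo.users;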

Uploading our sample data to Amazon S3

For this exercise, we are going to use some data generated with a service called Mockaroo (https://www.mockaroo.com/). This service enables us to generate fake data with a wide variety of field types and is useful for demos and testing.

We will upload this dataset, containing a list of users, to Amazon S3 and then query it using Redshift Spectrum. Note that all data in this file is fake data, generated with the tool mentioned above. Therefore, the names, email addresses, street addresses, phone numbers, etc. in this dataset are not real.

Let’s get started...

Summary

In this chapter, we learned how a cloud data warehouse can be used to store hot data to optimize performance and manage costs (such as for dashboarding or other BI use cases). We reviewed some common “anti-patterns” for data warehouse usage before diving deep into the Redshift architecture to learn more about how Redshift optimizes data storage across nodes.

We then reviewed some of the important design decisions that need to be made when creating a Redshift cluster optimized for performance, before reviewing how to ingest data into Redshift and unload data from Redshift.

Finally, we reviewed some of the advanced features of Redshift (such as data sharing, dynamic data masking (DDM), and cluster resizing) before moving on to the hands-on exercises.

In the hands-on exercise portion of this chapter, we created a new Redshift Serverless cluster, explored some sample data, and configured Redshift Spectrum to query data from Amazon S3.

In the next chapter, we will...
