You're reading from Modern Data Architecture on AWS

Product type: Book

Published in: Aug 2023

Publisher: Packt

ISBN-13: 9781801813396

Edition: 1st Edition

Concepts

Data Science

Author (1)

Behram Irani

Data Warehousing

In this chapter, we will look at the following key topics:

The need for a data warehouse
Data warehousing using Amazon Redshift
Data warehouse modernization using Redshift
Data ingestion patterns
Data transformation using ELT patterns
Data security and governance patterns
Data consumption patterns

The concept of data warehouses has existed for a long time, and organizations have used data warehouse systems for online analytical processing (OLAP). Deriving analytical insights from the data in these systems is the main goal of every organization. However, as we discussed in Chapter 1, the traditional data warehouse setup became challenging in the age of cloud computing. With the ever-growing volume, velocity, and variety of data in recent times, traditional on-premises data warehouses are not able to handle all the new use cases business users wish to solve.

The need for a data warehouse

Before we dive deeper into data warehouses, let’s once again distinguish between a data lake and a data warehouse. The two systems address many overlapping use cases and can be used interchangeably for the most common ones. However, there are major differences between them. Essentially, a data lake is a schema-on-read centralized repository that’s flexible enough to store all kinds of structured, semi-structured, and unstructured data at any scale and allows all personas in an organization to derive value from this data easily and cost-effectively. A data warehouse, on the other hand, is a schema-on-write structured repository that stores structured and semi-structured data that’s used for analytics and business intelligence (BI). It excels in data aggregations, slice-and-dice operations, roll-up and drill-down operations, data cubes, and all other OLAP use cases. Both systems co-exist...
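To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the standard library's sqlite3 module stands in for a warehouse-style table, a list of JSON strings stands in for raw data lake files, and the trades table and its fields are hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the table structure is declared before any data is loaded,
# so a malformed record is rejected at write time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id INTEGER NOT NULL, amount REAL NOT NULL)"
)
conn.execute("INSERT INTO trades VALUES (?, ?)", (1, 250.0))


def try_insert(row):
    """Return True if the warehouse-style table accepts the row."""
    try:
        conn.execute("INSERT INTO trades VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


rejected_on_write = not try_insert((2, None))  # violates NOT NULL on amount

# Schema-on-read: raw records are stored as-is. Structure is imposed only
# when a consumer reads them, and each reader decides how to handle gaps.
raw_lake_records = [
    '{"trade_id": 1, "amount": 250.0}',
    '{"trade_id": 2}',  # the missing field is stored anyway
    '{"trade_id": 3, "amount": 75.5, "note": "extra field"}',
]


def read_amounts(lines):
    """Apply a schema at read time, defaulting missing amounts to 0.0."""
    return [float(json.loads(line).get("amount", 0.0)) for line in lines]


lake_total = sum(read_amounts(raw_lake_records))
warehouse_total = conn.execute("SELECT SUM(amount) FROM trades").fetchone()[0]
```

The warehouse-style table refuses the incomplete record up front, while the lake-style store accepts everything and leaves the interpretation to whoever reads it.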

Data warehousing using Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service. It is designed on the principles of massively parallel processing (MPP) architecture, which allows users to analyze large volumes of data efficiently. Redshift addresses a whole range of analytical use cases, but more importantly, it addresses the top three areas of what businesses are looking for:

Analyzing data by breaking down data silos.
Providing the best price performance at scale.
Providing easy, secure, and reliable insights from the data.
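The MPP principle mentioned above can be sketched in a few lines of Python. This is a conceptual illustration, not Redshift's implementation: the number of slices, the hash function, and the account_id, region, and amount fields are all assumptions made for the example. Rows are spread across workers by hashing a distribution key, each worker aggregates only the rows it owns, and a coordinator merges the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_SLICES = 4  # hypothetical number of parallel workers


def distribute(rows, key):
    """Assign each row to a slice by hashing its distribution key."""
    slices = defaultdict(list)
    for row in rows:
        slices[crc32(str(row[key]).encode()) % NUM_SLICES].append(row)
    return slices


def local_sum(rows):
    """Each slice aggregates only the rows it owns (shared-nothing)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return totals


def parallel_sum_by_region(rows):
    """Run the per-slice aggregation in parallel, then merge the partials."""
    slices = distribute(rows, key="account_id")
    with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
        partials = list(pool.map(local_sum, slices.values()))
    # A coordinator merges the partial results from every slice.
    merged = defaultdict(float)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)


rows = [
    {"account_id": i, "region": "east" if i % 2 else "west", "amount": 10.0}
    for i in range(100)
]
result = parallel_sum_by_region(rows)
```

Because no slice needs another slice's rows to compute its partial sums, adding more slices lets the same query cover more data in roughly the same time.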

Before we look at some use cases, let’s quickly understand the basics of Redshift.

Amazon Redshift basics

Redshift uses a massively parallel, shared-nothing architecture. It uses columnar storage, which means data is stored in columns instead of rows.

This columnar storage approach has several advantages in terms of data compression, query performance, and analytics:

Compression...
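To see why a columnar layout helps with compression and analytical queries, consider the following minimal Python sketch. The sample rows are hypothetical, and run-length encoding is used here only as a simple illustration of how repeated values in a column compress well; it is not a description of how Redshift chooses encodings.

```python
from itertools import groupby

rows = [
    ("2023-08-01", "east", 100),
    ("2023-08-01", "east", 250),
    ("2023-08-01", "west", 75),
    ("2023-08-02", "west", 300),
]

# A row layout keeps each record together; a column layout keeps each
# attribute together.
columns = {
    "trade_date": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_dates = run_length_encode(columns["trade_date"])
encoded_regions = run_length_encode(columns["region"])

# An aggregate over one attribute touches only that column.
total_amount = sum(columns["amount"])
```

Values in a single column tend to be similar, so they compress far better than mixed-type rows, and a query that sums one column never has to read the others.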

Data warehouse modernization using Redshift

We will start with the most obvious high-level use case: organizations that want to modernize their data warehouses. The primary reason to modernize is that traditional data warehouses are unable to keep up with emerging use cases. Due to their architectural limitations, traditional data warehouses cannot handle the exponential growth in data volume along with the new variety of data that’s being produced. Long story short, traditional data warehouses have become slow, complex, and expensive. Let’s bring up the use case from GreatFin again.

Use case for data warehouse modernization

GreatFin has an on-premises data warehouse that is nearing its end of life. The continuous requests from the business to support newer types of data analytics use cases have made this platform difficult to operate and expand. Its performance is degrading, and the infrastructure and operating expenses are growing steadily. They...

Data ingestion patterns

One of the most complex and time-consuming parts of data warehouse modernization is data onboarding. Data can be onboarded in many different ways, using many different services. It all boils down to the requirement and the need for onboarding data in a particular manner. Let’s explore some typical data onboarding patterns for Amazon Redshift.

Data ingestion using AWS DMS

Let’s start with a use case first, so that the importance of DMS can be better understood when it comes to loading data into Redshift.

Use case for batch loading data into Amazon Redshift

GreatFin uses multiple databases and traditional data warehouses for their enterprise analytics reporting needs. They want to modernize their data warehouse using Amazon Redshift and would like to bulk load all the historic data from these existing systems into Redshift. They are looking for a fast, easy, and cost-effective way to do this in Redshift.
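A DMS task selects the source tables to migrate through JSON table-mapping rules. The following Python sketch builds such a document for a bulk load like this one. The schema names are hypothetical, and the helper function is only an illustration; in practice, the resulting JSON would be supplied to a full-load DMS task whose source and target endpoints are configured separately.

```python
import json


def build_table_mappings(schema_names):
    """Build DMS selection rules that include every table in the given schemas."""
    rules = []
    for index, schema in enumerate(schema_names, start=1):
        rules.append(
            {
                "rule-type": "selection",
                "rule-id": str(index),
                "rule-name": f"include-{schema}",
                # "%" is the wildcard that matches every table in the schema.
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        )
    return json.dumps({"rules": rules}, indent=2)


# Hypothetical source schemas for the historical bulk load.
table_mappings = build_table_mappings(["sales", "finance"])
```

Generating the mappings from a list keeps the task definition reviewable and makes it easy to add or drop source schemas as the migration scope changes.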

As you may recall from our...

Data transformation using ELT patterns

There are several reasons why ELT patterns may be more appealing for certain data projects. Sometimes, you need the data available in raw format as soon as possible; sometimes, it’s the comfort level of personas using a particular programming language or tool; and other times, it’s just about cost efficiency. Amazon Redshift also provides a platform where data engineering teams can create their ELT pipelines. Let’s introduce a use case to understand this pattern.

Use case for ELT inside Amazon Redshift

GreatFin uses DMS to create a continuous data ingestion pipeline from many source data stores into Redshift. Once the data has landed in Redshift, a set of technical and business rules needs to be applied to this data before it’s ready for consumption. Different teams are well versed in the SQL programming language and prefer to write ANSI SQL logic to transform the data. The teams also want to save costs by not...
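The shape of an ELT pipeline can be sketched as follows. This is a minimal illustration, assuming Python's built-in sqlite3 module as a stand-in for the warehouse engine; the stg_transactions and account_balances tables and the business rule are hypothetical. The point is the ordering: raw data is loaded first, and the transformation is expressed in SQL and executed where the data already lives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Extract + Load: raw rows land untouched in a staging table.
    CREATE TABLE stg_transactions (account_id TEXT, amount TEXT, status TEXT);
    INSERT INTO stg_transactions VALUES
        ('A1', '100.50', 'posted'),
        ('A1', '20.00',  'POSTED'),
        ('A2', '75.25',  'posted'),
        ('A2', '999.00', 'cancelled');

    -- Transform: the rules are applied in SQL, inside the engine.
    CREATE TABLE account_balances AS
    SELECT account_id,
           SUM(CAST(amount AS REAL)) AS balance
    FROM stg_transactions
    WHERE LOWER(status) = 'posted'
    GROUP BY account_id;
    """
)

# Each result row is an (account_id, balance) pair, so dict() builds a lookup.
balances = dict(
    conn.execute(
        "SELECT account_id, balance FROM account_balances ORDER BY account_id"
    )
)
```

Since the cleansing and aggregation are plain SQL, teams that already know SQL can own the transformation logic without a separate processing engine.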

Data security and governance patterns

Redshift has a broad and robust set of security and governance mechanisms that allow tight control of the data and the infrastructure around it. We cannot cover every security and access control pattern for Redshift here, but let’s list some key aspects so that you understand how robust these features are and how they can cover a wide range of governance patterns:

Encryption: Redshift supports encryption of data both at rest and in transit
Auditing and compliance: Redshift provides detailed logs and audit trails for security and compliance purposes
Data masking: Redshift provides masking capabilities to protect sensitive information
User management: Redshift provides a comprehensive user management system that allows administrators to control who has access to which data, and at what level
Access Control Lists (ACLs): Redshift allows you to assign specific access rights to users and...
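Two of the ideas in the list above, data masking and controlling which data each user can see, can be illustrated with a short conceptual sketch in Python. It does not use any Redshift API; the roles, columns, and masking rule are hypothetical, and in a real warehouse these controls would be declared on the database objects themselves rather than in application code.

```python
# Hypothetical policy: which columns each role may see, and which are masked.
POLICIES = {
    "analyst": {"visible": {"customer_id", "region", "ssn"}, "masked": {"ssn"}},
    "auditor": {"visible": {"customer_id", "region", "ssn"}, "masked": set()},
    "marketing": {"visible": {"region"}, "masked": set()},
}


def mask(value):
    """Hide all but the last four characters of a sensitive value."""
    text = str(value)
    return "*" * (len(text) - 4) + text[-4:]


def apply_policy(row, role):
    """Return the view of a row that the given role is allowed to see."""
    policy = POLICIES[role]
    return {
        column: mask(value) if column in policy["masked"] else value
        for column, value in row.items()
        if column in policy["visible"]
    }


row = {"customer_id": 42, "region": "east", "ssn": "123-45-6789"}
analyst_view = apply_policy(row, "analyst")
marketing_view = apply_policy(row, "marketing")
```

The same underlying row produces a different view per role, which is the essence of fine-grained access control: the data is stored once, and the policy decides what each persona gets back.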

Data consumption patterns

All the effort of ingesting, curating, and securing data in Redshift is so that it can be consumed by different personas inside the organization, as well as outside by the customers of the company. The following figure highlights some of the main ways in which data is consumed from Redshift:

Figure 7.11 – Amazon Redshift consumption patterns

Let’s dive into the details of some of the consumption patterns with Redshift and also understand the use cases better.

Redshift Spectrum

Before we look at use cases that consume data stored in Redshift, we have to address the elephant in the room first – Redshift Spectrum. Redshift Spectrum provides a unique ability inside Redshift to transparently query the data stored in the S3 data lake. The data lake tables that are registered in the Glue Data Catalog can be queried and joined with regular Redshift tables. This is truly what a modern data warehouse looks like and...
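The idea of joining local warehouse tables with external data lake tables in a single query can be shown with a small analogy. This sketch is not Redshift Spectrum: it uses Python's sqlite3 module, with an attached second database playing the role of the external data lake, and the customers and clickstream tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The attached database stands in for data lake tables outside the warehouse.
conn.execute("ATTACH DATABASE ':memory:' AS lake")
conn.executescript(
    """
    CREATE TABLE main.customers (customer_id INTEGER, segment TEXT);
    INSERT INTO main.customers VALUES (1, 'retail'), (2, 'corporate');

    CREATE TABLE lake.clickstream (customer_id INTEGER, page TEXT);
    INSERT INTO lake.clickstream VALUES
        (1, '/loans'), (1, '/cards'), (2, '/treasury');
    """
)

# One query joins the local table with the external one.
page_views = conn.execute(
    """
    SELECT c.segment, COUNT(*) AS views
    FROM main.customers AS c
    JOIN lake.clickstream AS e ON e.customer_id = c.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
    """
).fetchall()
```

The consumer writes one SQL statement and does not need to know that the two tables live in different places, which is the experience the text describes for data lake tables queried alongside regular Redshift tables.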

Summary

In this chapter, we looked at how Amazon Redshift helps modernize data warehouses. We covered the basics of what Amazon Redshift looks like and how some of its features help meet next-gen business use cases. We went through each type of use case, starting from an overarching use case around modernizing legacy on-premises data warehouses by migrating the data to Amazon Redshift. We then looked at some of the data ingestion use cases that most organizations use to get the data inside Redshift. Once the data was ingested, we looked at how to leverage the compute power of Redshift to transform data using the ELT pattern. Stored procedures, materialized views (MVs), and Apache Spark connectors are all supported by Redshift to help process the data so that it can be ready for consumption.

Before the data can be consumed, we had to learn how to control and set security measures for the data that resides in Redshift. We applied some fine-grained access control patterns such as RBAC, row-level and column...

References

Amazon Redshift workshops: https://catalog.us-east-1.prod.workshops.aws/workshops/9f29cdba-66c0-445e-8cbb-28a092cb5ba7/en-US
https://catalog.us-east-1.prod.workshops.aws/workshops/380e0b8a-5d4c-46e3-95a8-82d68cf5789a/en-US
Modernization workshops: https://awsworkshop.io/tags/redshift/


Author (1)

Behram Irani

Behram Irani is currently a technology leader with Amazon Web Services (AWS) specializing in data, analytics and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. In the last 6 years working at AWS, Behram has been a thought leader in the data, analytics and AI/ML space; publishing multiple papers and leading the digital transformation efforts for many organizations across the globe. Behram has completed his Bachelor of Engineering in Computer Science from the University of Pune and has an MBA degree from the University of Florida.