You're reading from Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Edition: 1st Edition
Author: Behram Irani

Behram Irani is currently a technology leader with Amazon Web Services (AWS) specializing in data, analytics, and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. In his last 6 years at AWS, Behram has been a thought leader in the data, analytics, and AI/ML space, publishing multiple papers and leading the digital transformation efforts of many organizations across the globe. Behram holds a Bachelor of Engineering in Computer Science from the University of Pune and an MBA from the University of Florida.

Data Warehousing

In this chapter, we will look at the following key topics:

  • The need for a data warehouse
  • Data warehousing using Amazon Redshift
  • Data warehouse modernization using Redshift
  • Data ingestion patterns
  • Data transformation using ELT patterns
  • Data security and governance patterns
  • Data consumption patterns

The concept of the data warehouse has existed for a long time, and organizations have used data warehouse systems for online analytical processing (OLAP). Deriving analytical insights from the data in these systems is the main goal of every organization. However, as we discussed in Chapter 1, the traditional data warehouse setup became challenging in the age of cloud computing. With the ever-growing volume, velocity, and variety of data in recent times, traditional on-premises data warehouses cannot handle all the new use cases that business users wish to solve.

The need for a data warehouse

Before we dive deeper into data warehouses, let’s once again distinguish between a data lake and a data warehouse. Both systems solve many overlapping use cases and can be used interchangeably for the most common ones. However, there are major differences between them. Essentially, a data lake is a schema-on-read centralized repository that’s flexible enough to store all kinds of structured, semi-structured, and unstructured data at any scale, and it allows all personas in an organization to derive value from this data easily and cost-effectively. A data warehouse, on the other hand, is a schema-on-write structured repository that stores structured and semi-structured data used for analytics and business intelligence (BI). It excels at data aggregations, slice-and-dice operations, roll-up and drill-down operations, data cubes, and all other OLAP kinds of use cases. Both systems co-exist...
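
To make the OLAP vocabulary above concrete, here is a minimal sketch of a roll-up and a slice. It uses Python's built-in sqlite3 module purely as a stand-in for a warehouse engine; the sales table, its columns, and the sample rows are hypothetical and do not come from the chapter.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for a data warehouse.
conn = sqlite3.connect(":memory:")

# Schema-on-write: the table structure is declared before any row is loaded.
conn.execute(
    "CREATE TABLE sales ("
    "region TEXT NOT NULL, product TEXT NOT NULL, amount INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "Loans", 100),
        ("East", "Cards", 50),
        ("West", "Loans", 70),
        ("West", "Cards", 30),
    ],
)

# Detail level: one total per (region, product) pair.
detail = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "GROUP BY region, product ORDER BY region, product"
).fetchall()

# Roll-up: drop the product dimension and aggregate to the region level.
rollup = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Slice: fix one dimension (product = 'Loans') and keep the rest.
loans_slice = conn.execute(
    "SELECT region, amount FROM sales WHERE product = 'Loans' ORDER BY region"
).fetchall()
```

Going from rollup back to detail is the reverse operation: the same data viewed at a finer grain.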

Data warehousing using Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service. It is built on a massively parallel processing (MPP) architecture, which allows users to analyze large volumes of data efficiently. Redshift addresses a whole range of analytical use cases, but more importantly, it addresses the top three things businesses are looking for:

  1. Analyzing data by breaking down data silos.
  2. Providing the best price performance at scale.
  3. Providing easy, secure, and reliable insights from the data.

Before we look at some use cases, let’s quickly understand the basics of Redshift.

Amazon Redshift basics

Redshift uses a massively parallel, shared-nothing architecture. It uses columnar storage, which means data is stored in columns instead of rows.

This columnar storage approach has several advantages in terms of data compression, query performance, and analytics:

  • Compression...
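
As a rough illustration of why a column layout helps with compression and analytics, the following sketch pivots a few rows into columns, compresses a repetitive column, and aggregates a single column. The rows are hypothetical, and run-length encoding is chosen here only because it is easy to read; the sketch does not claim to reproduce Redshift's actual storage format.

```python
from itertools import groupby

# Hypothetical rows: (account_id, country, balance).
rows = [
    (1, "US", 250),
    (2, "US", 900),
    (3, "US", 120),
    (4, "IN", 430),
    (5, "IN", 610),
]

# A row layout keeps each record together; a column layout keeps
# each attribute together. zip(*rows) pivots rows into columns.
account_ids, countries, balances = (list(column) for column in zip(*rows))


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


# Values of one type sitting next to each other tend to repeat,
# so a single column often compresses well.
encoded_countries = run_length_encode(countries)

# An aggregate over one attribute reads only that column's values.
total_balance = sum(balances)
```

In this sketch the five country values collapse to two pairs, and the sum never touches the account_ids or countries lists.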

Data warehouse modernization using Redshift

We will start with the most obvious high-level use case: organizations that want to modernize their data warehouses. The primary reason to modernize is that traditional data warehouses are unable to keep up with emerging use cases. Due to their architectural limitations, they cannot handle the exponential growth in data volume or the new variety of data being produced. Long story short, traditional data warehouses have become slow, complex, and expensive. Let’s bring up the use case from GreatFin again.

Use case for data warehouse modernization

GreatFin has an on-premises data warehouse that is nearing its end of life. Continuous requests from the business to support newer types of data analytics use cases have made this platform difficult to operate and expand. Its performance is degrading, and its infrastructure and operating expenses are growing steadily. They...

Data ingestion patterns

One of the most complex and time-consuming parts of data warehouse modernization is data onboarding. Data can be onboarded in many different ways, using many different services. It all boils down to the requirement and the need for onboarding data in a particular manner. Let’s explore some typical data onboarding patterns for Amazon Redshift.

Data ingestion using AWS DMS

Let’s start with a use case first, so that the importance of DMS can be better understood when it comes to loading data into Redshift.

Use case for batch loading data into Amazon Redshift

GreatFin uses multiple databases and traditional data warehouses for its enterprise analytics reporting needs. The company wants to modernize its data warehouse using Amazon Redshift and would like to bulk load all the historical data from these existing systems into Redshift. It is looking for a fast, easy, and cost-effective way to do this.

As you may recall from our...

Data transformation using ELT patterns

There are several reasons why ELT patterns may be more appealing for certain data projects. Sometimes, you need the data available in raw format as soon as possible; sometimes, it’s the comfort level of the personas with a particular programming language or tool; and other times, it’s just about cost efficiency. Amazon Redshift also provides a platform where data engineering teams can create their ELT pipelines. Let’s introduce a use case to understand this pattern.

Use case for ELT inside Amazon Redshift

GreatFin uses DMS to create a continuous data ingestion pipeline from many source data stores into Redshift. Once the data has landed in Redshift, a number of technical and business rules need to be applied to it before it’s ready for consumption. Different teams are well versed in SQL and prefer to write ANSI SQL logic to transform the data. The teams also want to save costs by not...
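
The shape of this pattern, load first and transform afterwards with SQL inside the engine, can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for the warehouse; the raw_payments and curated_payments tables, their columns, and the cleansing rules are all hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse engine.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land as-is, including inconsistent values.
conn.execute(
    "CREATE TABLE raw_payments (customer TEXT, amount_text TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_payments VALUES (?, ?, ?)",
    [
        (" alice ", "10.50", "OK"),
        ("BOB", "4.25", "ok"),
        ("alice", "2.00", "FAILED"),
        ("bob", "1.75", "OK"),
    ],
)

# Transform: plain SQL, executed where the data already lives, applies the
# rules (normalize names, keep successful payments, aggregate per customer).
conn.execute(
    """
    CREATE TABLE curated_payments AS
    SELECT LOWER(TRIM(customer)) AS customer,
           SUM(CAST(amount_text AS REAL)) AS total_amount
    FROM raw_payments
    WHERE UPPER(status) = 'OK'
    GROUP BY LOWER(TRIM(customer))
    """
)

curated = conn.execute(
    "SELECT customer, total_amount FROM curated_payments ORDER BY customer"
).fetchall()
```

The raw table stays untouched, so the same landed data can feed other transformations later without being re-ingested.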

Data security and governance patterns

Redshift has a broad and robust set of security and governance mechanisms that allow tight control of the data and the infrastructure around it. We cannot cover every security and access control use case for Redshift here, but let’s list some key aspects so that you understand how robust these features are and how they can cover a wide range of governance patterns:

  • Encryption: Redshift supports encryption of data both at rest and in transit
  • Auditing and compliance: Redshift provides detailed logs and audit trails for security and compliance purposes
  • Data masking: Redshift provides masking capabilities to protect sensitive information
  • User management: Redshift provides a comprehensive user management system that allows administrators to control who has access to which data, and at what level
  • Access Control Lists (ACLs): Redshift allows you to assign specific access rights to users and...
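
To illustrate the idea behind data masking combined with role-based access, here is a small conceptual sketch. It is not Redshift's implementation or syntax; the role names, the PRIVILEGED_ROLES set, and the keep-last-four rule are assumptions made up for this example.

```python
# Hypothetical set of roles that are allowed to see unmasked values.
PRIVILEGED_ROLES = {"auditor", "security_admin"}


def mask_account_number(value: str, role: str) -> str:
    """Return the full value for privileged roles; otherwise keep only
    the last four characters and replace the rest with asterisks."""
    if role in PRIVILEGED_ROLES:
        return value
    # Values of four characters or fewer are masked completely.
    visible = value[-4:] if len(value) > 4 else ""
    return "*" * (len(value) - len(visible)) + visible


masked_for_analyst = mask_account_number("1234567890", "analyst")
full_for_auditor = mask_account_number("1234567890", "auditor")
```

The same stored value yields different results depending on who asks, which is the property that lets one table serve several personas safely.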

Data consumption patterns

The whole point of ingesting, curating, and securing data in Redshift is to let different personas inside the organization, as well as the company’s customers outside it, consume that data. The following figure highlights some of the main ways in which data is consumed from Redshift:

Figure 7.11 – Amazon Redshift consumption patterns

Let’s dive into the details of some of the consumption patterns with Redshift and also understand the use cases better.

Redshift Spectrum

Before we look at use cases that consume data stored in Redshift, we have to address the elephant in the room first: Redshift Spectrum. Redshift Spectrum provides a unique ability inside Redshift to transparently query the data stored in the S3 data lake. The data lake tables that are registered in the Glue Data Catalog can be queried and joined with regular Redshift tables. This is truly what a modern data warehouse looks like and...
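
The general shape of such a query, one statement joining a local table with a table that lives elsewhere, can be sketched with an analogy. The example below uses Python's built-in sqlite3 module and attaches a second in-memory database named lake to play the role of the data lake; this is only an analogy and not Redshift Spectrum syntax, and every table, column, and value in it is hypothetical.

```python
import sqlite3

# The main database plays the role of local warehouse storage.
conn = sqlite3.connect(":memory:")

# A second, separate in-memory database plays the role of the data lake.
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE main.customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE lake.clickstream (customer_id INTEGER, page TEXT)")

conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ben")],
)
conn.executemany(
    "INSERT INTO lake.clickstream VALUES (?, ?)",
    [(1, "/home"), (1, "/loans"), (2, "/home")],
)

# One SQL statement joins the local table with the "external" one; the
# caller only sees schema-qualified table names, not where the data lives.
page_views = conn.execute(
    "SELECT c.name, COUNT(*) "
    "FROM main.customers AS c "
    "JOIN lake.clickstream AS k ON k.customer_id = c.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
```

The point of the analogy is the query shape: the join reads like any other join, with the schema prefix being the only hint that the second table is stored somewhere else.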

Summary

In this chapter, we looked at how Amazon Redshift helps modernize data warehouses. We covered the basics of Amazon Redshift and how some of its features help meet next-gen business use cases. We went through each type of use case, starting with the overarching one of modernizing legacy on-premises data warehouses by migrating the data to Amazon Redshift. We then looked at some of the data ingestion use cases that most organizations rely on to get data into Redshift. Once the data was ingested, we looked at how to leverage the compute power of Redshift to transform data using the ELT pattern. Stored procedures, materialized views, and Apache Spark connectors are all supported by Redshift to help process the data so that it is ready for consumption.

Before the data can be consumed, we had to learn how to control and set security measures for the data that resides in Redshift. We applied some fine-grained access control patterns such as RBAC, row-level and column...
