Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Pages: 420
Edition: 1st
Author: Behram Irani

Table of Contents

Preface
Part 1: Foundational Data Lake
  Prologue: The Data and Analytics Journey So Far
  Chapter 1: Modern Data Architecture on AWS
  Chapter 2: Scalable Data Lakes
Part 2: Purpose-Built Services And Unified Data Access
  Chapter 3: Batch Data Ingestion
  Chapter 4: Streaming Data Ingestion
  Chapter 5: Data Processing
  Chapter 6: Interactive Analytics
  Chapter 7: Data Warehousing
  Chapter 8: Data Sharing
  Chapter 9: Data Federation
  Chapter 10: Predictive Analytics
  Chapter 11: Generative AI
  Chapter 12: Operational Analytics
  Chapter 13: Business Intelligence
Part 3: Govern, Scale, Optimize And Operationalize
  Chapter 14: Data Governance
  Chapter 15: Data Mesh
  Chapter 16: Performant and Cost-Effective Data Platform
  Chapter 17: Automate, Operationalize, and Monetize
Index
Other Books You May Enjoy

Scalable Data Lakes

In this chapter, we will look at how organizations can build a data platform foundation by creating data lakes on AWS.

We will cover the following main topics:

  • Why choose Amazon S3 as a data lake store?
  • Business scenario setup
  • Data lake layers
  • Data lake patterns
  • Data catalogs
  • Transactional data lakes
  • Putting it all together

Why choose Amazon S3 as a data lake store?

Before we dive deep into the actual data and analytics use cases and explore how to design data lakes on AWS, it is important to first understand why Amazon Simple Storage Service (Amazon S3) is the preferred choice for building a data lake and why it serves as the storage layer for all kinds of data in a centralized location.

If you recall from the discussions we had in Chapter 1, the ideal store for building a data lake should inherently be scalable, durable, highly performant, easy to use, secure, cost-effective, and integrated with the other building blocks of the data lake ecosystem. So, we ask a very important question: why choose Amazon S3 as a data lake store?

S3 checks all the boxes for what we look for in a store for building data lakes. Here are some of the key features of S3:

  • Scalable: S3 is a petabyte-scale object store with virtually unlimited storage
  • Durable: S3 is designed for 99.999999999% (11 9s) of data durability...
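Ease of use, in particular, is simple to demonstrate: S3 exposes a plain PUT/GET object API. Here is a minimal sketch using boto3; the bucket name is hypothetical and is assumed to already exist:

import boto3

s3 = boto3.client("s3")

# Store an object: S3 scales to virtually unlimited objects per bucket,
# and each object is stored redundantly for 11 9s of durability.
s3.put_object(
    Bucket="greatfin-data-lake",          # hypothetical bucket name
    Key="raw/sample/hello.json",
    Body=b'{"greeting": "hello"}',
)

# Retrieve the object back by key.
obj = s3.get_object(Bucket="greatfin-data-lake", Key="raw/sample/hello.json")
print(obj["Body"].read())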

Business scenario setup

The flow of this book is designed to help you get to the end state of building a modern data platform using AWS, with the ultimate goal of solving business use cases. To demonstrate all the building blocks of the data platform, it is important that I assume a fictitious entity and build a story around it. Concepts are easier to understand when there is steady progression and continuity in the storyline.

For this book, I will consider a financial organization and all its use cases. You can apply most of the design and architecture techniques to use cases in other sectors too. The bottom line is that organizations may have different business models, but the concepts that go into building a modern data platform on AWS remain, to a large extent, the same irrespective of the business domain. In other words, the same AWS services and functionalities can be leveraged to build any kind of data platform.

Let’s consider a fictitious...

Data lake layers

Now that we have a broader business use case for setting up a data lake, let’s look at a use case that will help us define what the different layers of a typical data lake are and why they are required.

Use case for creating data lake layers

GreatFin has different lines of business (LOBs), and within each of these LOBs, multiple personas perform different tasks on the data. Each persona may need specific access to different sets of data, formatted and stored in a way that makes their day-to-day operations easy. For example, data engineers may need access to the raw source data so that they can profile it and understand its quality. Data scientists may need access to a standardized form of the datasets so that they can do feature engineering for creating machine learning (ML) models. Data analysts may need access to business-friendly datasets so that they can derive insights from the data.
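These persona-specific needs are typically served by separate layers in the lake, which on S3 usually map to distinct prefixes (or buckets). The following is a minimal sketch with hypothetical bucket and key names; layer naming conventions vary by organization:

import boto3

s3 = boto3.client("s3")

BUCKET = "greatfin-data-lake"  # hypothetical single-bucket layout

# Raw layer: source data landed as-is, for data engineers to profile.
s3.upload_file(
    "customers_20230801.csv",  # assumes this local file exists
    BUCKET,
    "raw/crm/customers/year=2023/month=08/day=01/customers_20230801.csv",
)

# Downstream layers would hold cleaned, typed data for data scientists
# (e.g., keys under "standardized/customers/") and business-friendly
# datasets for analysts (e.g., keys under "curated/marketing/").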

Before we get...

Data lake patterns

There are two types of data lake patterns, as follows:

  • Centralized pattern
  • Distributed pattern

Let’s discuss each of them. Note that you can use a hybrid pattern too, depending on your use case.

Centralized pattern

In a centralized pattern, business data is stored in, and accessed from, a central location and used throughout the enterprise. Entity information, such as a person's name, address, gender, age, and profession, is a good example: managing such datasets centrally is easier from a governance point of view and also avoids data duplication.
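On AWS, one common way to realize this (a sketch under assumptions, not the only approach) is to keep the shared entity datasets in a central account's S3 bucket and grant the LOB accounts read access via a bucket policy. The bucket name and account ID below are placeholders, and real setups often layer AWS Lake Formation on top:

import json
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and LOB account ID; grants read-only access to the
# centrally managed entity data.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowLOBReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::greatfin-central-entities",
                "arn:aws:s3:::greatfin-central-entities/*",
            ],
        }
    ],
}

s3.put_bucket_policy(
    Bucket="greatfin-central-entities",
    Policy=json.dumps(policy),
)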

Certain LOBs may have additional properties of the data that are relevant only to their use cases. For example, the marketing department may also want to see customer lifetime value (CLV), net promoter score (NPS), marketing preferences, and so on for a person. These additional attributes can then...

Data catalogs

We talked about a data lake in AWS being a combination of the data in S3 buckets and the metadata of that data stored in a catalog. We will solve the mystery of creating a technical catalog in AWS by introducing another critical service for building a modern data platform: AWS Glue, a serverless data integration service. Glue is actually an umbrella service consisting of multiple components. It has the Glue ETL component, which is used for data integration work, and we have multiple chapters on data ingestion and integration ahead. The component of Glue that is relevant to our data catalog discussion is the Glue Data Catalog. Let's explore how the catalog in Glue helps with our data lake in S3.

Glue Data Catalog

As the data passes through the layers of the data lake in S3, the metadata of that data is captured and stored in the Glue Data Catalog. The catalog creates and stores the technical metadata in the form of data definition language (DDL) statements...
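One common way to populate the catalog is a Glue crawler, which scans a layer's S3 path and infers table definitions. Here is a hedged sketch with boto3; the crawler name, IAM role, database, and path are all hypothetical:

import boto3

glue = boto3.client("glue")

# Hypothetical names; the role must have read access to the S3 path.
glue.create_crawler(
    Name="raw-customers-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="greatfin_raw",
    Targets={"S3Targets": [{"Path": "s3://greatfin-data-lake/raw/crm/customers/"}]},
)
glue.start_crawler(Name="raw-customers-crawler")

# After the crawler run completes, the inferred schema (columns, types,
# partitions) can be read back from the catalog:
table = glue.get_table(DatabaseName="greatfin_raw", Name="customers")
print(table["Table"]["StorageDescriptor"]["Columns"])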

Transactional data lakes

Let’s introduce this topic with a use case from GreatFin.

Use case for a transactional data lake

GreatFin wants to comply with the "right to be forgotten" requirement of the General Data Protection Regulation (GDPR) in Europe. It wants all its systems, including its analytics environments, to be able to easily locate, update, or delete records as and when required.
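Open table formats make such record-level operations practical on a data lake. As a hedged sketch, assuming the customer table is stored in the Apache Iceberg format and registered in the Glue Data Catalog (Athena supports row-level DML on Iceberg tables; all names below are hypothetical):

import boto3

athena = boto3.client("athena")

# A single DELETE statement removes the customer's records; with Iceberg,
# Athena rewrites only the affected data files rather than the whole dataset.
athena.start_query_execution(
    QueryString="DELETE FROM customers WHERE customer_id = 'C-1042'",
    QueryExecutionContext={"Database": "greatfin_curated"},  # hypothetical
    ResultConfiguration={"OutputLocation": "s3://greatfin-athena-results/"},
)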

The need to create transactional data lakes came about due to many business use cases and the challenges associated with them, such as the following:

  • Compliance requirements: compliance and privacy laws; for example, the GDPR requires the deletion of certain data within a specific timeframe and/or across all datasets
  • Change data capture (CDC): CDC from the source databases and incremental...

Putting it all together

So far, we have discussed the different storage layers in a typical data lake in S3 and defined the purpose of each of the layers. We also introduced the concept of creating metadata using a Glue crawler and storing it in Glue Data Catalog. Finally, we looked at use cases for building transactional data lakes. This is a good time to pivot back to the GreatFin business requirements we introduced earlier and apply these data lake foundational concepts to our use case.

Marketing use case

Suppose the marketing department at GreatFin wants to identify top leads for offering a new type of certificate of deposit (CD), with a higher interest rate, to a select few high-net-worth customers only. In this case, the customer data will be stored in multiple systems, across different LOBs.

Let’s walk through what each layer in the data lake might look like.

Raw layer example

The following diagram is a depiction of data stored in a raw layer bucket in...

Summary

In this chapter, we went through why so many organizations prefer to build their data lakes on Amazon S3. We then explored the different layers of a data lake in S3 and the purpose of each of them. Along with the layers of data, we also looked at how the Glue Data Catalog helps capture the metadata about the data in the form of tables. We also touched upon a new trend of building transactional data lakes, which involves selecting a table format that aligns closely with the specific use case being solved. Finally, we put it all together to solve a specific use case and saw it all come together, at least from the data storage and catalog side of things.

We have the data in S3, and we have the catalog of this data in the Glue Data Catalog in the form of tables. The real value of this setup is that businesses can easily consume this data to derive insights from it. This leads us to the next part of this book, which covers different purpose-built services and how each of them...

