Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Pages: 420
Edition: 1st
Author: Behram Irani

Table of Contents

Preface
Part 1: Foundational Data Lake
  Prologue: The Data and Analytics Journey So Far
  Chapter 1: Modern Data Architecture on AWS
  Chapter 2: Scalable Data Lakes
Part 2: Purpose-Built Services And Unified Data Access
  Chapter 3: Batch Data Ingestion
  Chapter 4: Streaming Data Ingestion
  Chapter 5: Data Processing
  Chapter 6: Interactive Analytics
  Chapter 7: Data Warehousing
  Chapter 8: Data Sharing
  Chapter 9: Data Federation
  Chapter 10: Predictive Analytics
  Chapter 11: Generative AI
  Chapter 12: Operational Analytics
  Chapter 13: Business Intelligence
Part 3: Govern, Scale, Optimize And Operationalize
  Chapter 14: Data Governance
  Chapter 15: Data Mesh
  Chapter 16: Performant and Cost-Effective Data Platform
  Chapter 17: Automate, Operationalize, and Monetize
Index
Other Books You May Enjoy

Scalable Data Lakes

In this chapter, we will look at how organizations can build a data platform foundation by creating data lakes on AWS.

We will cover the following main topics:

  • Why choose Amazon S3 as a data lake store?
  • Business scenario setup
  • Data lake layers
  • Data lake patterns
  • Data catalogs
  • Transactional data lakes
  • Putting it all together

Why choose Amazon S3 as a data lake store?

Before we dive deep into the actual data and analytics use cases and explore how to design data lakes on AWS, it is important to first understand why Amazon Simple Storage Service (Amazon S3) is the preferred choice for building a data lake and why it serves as the storage layer for all kinds of data in a centralized location.

If you recall from the discussions we had in Chapter 1, the ideal store for building a data lake should inherently be scalable, durable, highly performant, easy to use, secure, cost-effective, and integrated with the other building blocks of the data lake ecosystem. So, we ask a very important question: why choose Amazon S3 as a data lake store?

S3 checks all the boxes for what we look for in a store for building data lakes. Here are some of the key features of S3:

  • Scalable: S3 is a petabyte-scale object store with virtually unlimited storage
  • Durable: S3 is designed for 99.999999999% (11 9s) of data durability...
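Ease of use, in particular, is simple to demonstrate: S3 exposes a plain PUT/GET object API. Here is a minimal sketch using boto3; the bucket name is hypothetical and is assumed to already exist:

import boto3

s3 = boto3.client("s3")

# Store an object: S3 scales to virtually unlimited objects per bucket,
# and each object is stored redundantly for 11 9s of durability.
s3.put_object(
    Bucket="greatfin-data-lake",          # hypothetical bucket name
    Key="raw/sample/hello.json",
    Body=b'{"greeting": "hello"}',
)

# Retrieve the object back by key.
obj = s3.get_object(Bucket="greatfin-data-lake", Key="raw/sample/hello.json")
print(obj["Body"].read())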

Business scenario setup

The flow of this book is designed to help you get to the end state of building a modern data platform using AWS, with the ultimate goal of solving business use cases. To demonstrate all the building blocks of the data platform, it is important that I assume a fictitious entity and build a story around it. Concepts are easier to understand when there is steady progression and continuity in the storyline.

For this book, I will consider a financial organization and all its use cases. You can apply most of the design and architecture techniques to use cases in other sectors too. The bottom line is that organizations may have different business models, but the concepts that go into building a modern data platform on AWS remain, to a large extent, the same irrespective of the business domain. In other words, the same AWS services and functionalities can be leveraged to build any kind of data platform.

Let’s consider a fictitious...

Data lake layers

Now that we have a broader business use case for setting up a data lake, let’s look at a use case that will help us define what the different layers of a typical data lake are and why they are required.

Use case for creating data lake layers

GreatFin has different lines of business (LOBs), and within each of these LOBs, multiple personas perform different tasks on the data. Each persona may need specific access to different sets of data, formatted and stored in a way that makes their day-to-day operations easy. For example, data engineers may need access to the raw source data so that they can profile it and understand its quality. Data scientists may need access to a standardized form of the datasets so that they can do feature engineering for creating machine learning (ML) models. Data analysts may need access to business-friendly datasets so that they can derive insights from the data.
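These persona-specific needs are typically served by separate layers in the lake, which on S3 usually map to distinct prefixes (or buckets). The following is a minimal sketch with hypothetical bucket and key names; layer naming conventions vary by organization:

import boto3

s3 = boto3.client("s3")

BUCKET = "greatfin-data-lake"  # hypothetical single-bucket layout

# Raw layer: source data landed as-is, for data engineers to profile.
s3.upload_file(
    "customers_20230801.csv",  # assumes this local file exists
    BUCKET,
    "raw/crm/customers/year=2023/month=08/day=01/customers_20230801.csv",
)

# Downstream layers would hold cleaned, typed data for data scientists
# (e.g., keys under "standardized/customers/") and business-friendly
# datasets for analysts (e.g., keys under "curated/marketing/").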

Before we get...

Data lake patterns

There are two types of data lake patterns, as follows:

  • Centralized pattern
  • Distributed pattern

Let’s discuss each of them. Note that you can use a hybrid pattern too, depending on your use case.

Centralized pattern

In a centralized pattern, business data is stored in, and accessed from, a central location and used throughout the enterprise. Entity information, such as a person's name, address, gender, age, and profession, is a good example: managing such datasets centrally is easier from a governance point of view and also avoids data duplication.
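On AWS, one common way to realize this (a sketch under assumptions, not the only approach) is to keep the shared entity datasets in a central account's S3 bucket and grant the LOB accounts read access via a bucket policy. The bucket name and account ID below are placeholders, and real setups often layer AWS Lake Formation on top:

import json
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and LOB account ID; grants read-only access to the
# centrally managed entity data.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowLOBReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::greatfin-central-entities",
                "arn:aws:s3:::greatfin-central-entities/*",
            ],
        }
    ],
}

s3.put_bucket_policy(
    Bucket="greatfin-central-entities",
    Policy=json.dumps(policy),
)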

Certain LOBs may have additional properties of the data that are relevant only to their use cases. For example, the marketing department may also want to see customer lifetime value (CLV), net promoter score (NPS), marketing preferences, and so on for a person. These additional attributes can then...

Data catalogs

We talked about a data lake in AWS being a combination of the data in S3 buckets and the metadata of that data stored in a catalog. We will solve the mystery of creating a technical catalog in AWS by introducing another critical service for building a modern data platform: AWS Glue, a serverless data integration service. Glue is actually an umbrella service consisting of multiple components. It has the Glue ETL component, which is used for data integration work, and we have multiple chapters on data ingestion and integration ahead. The component of Glue that is relevant to our data catalog discussion is the Glue Data Catalog. Let's explore how the catalog in Glue helps with our data lake in S3.

Glue Data Catalog

As the data passes through the layers of the data lake in S3, the metadata of that data is captured and stored in the Glue Data Catalog. The catalog creates and stores the technical metadata in the form of data definition language (DDL) statements...
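One common way to populate the catalog is a Glue crawler, which scans a layer's S3 path and infers table definitions. Here is a hedged sketch with boto3; the crawler name, IAM role, database, and path are all hypothetical:

import boto3

glue = boto3.client("glue")

# Hypothetical names; the role must have read access to the S3 path.
glue.create_crawler(
    Name="raw-customers-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="greatfin_raw",
    Targets={"S3Targets": [{"Path": "s3://greatfin-data-lake/raw/crm/customers/"}]},
)
glue.start_crawler(Name="raw-customers-crawler")

# After the crawler run completes, the inferred schema (columns, types,
# partitions) can be read back from the catalog:
table = glue.get_table(DatabaseName="greatfin_raw", Name="customers")
print(table["Table"]["StorageDescriptor"]["Columns"])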

Transactional data lakes

Let’s introduce this topic with a use case from GreatFin.

Use case for a transactional data lake

GreatFin wants to comply with the "right to be forgotten" requirement of the General Data Protection Regulation (GDPR) in Europe. It wants all its systems, including its analytics environments, to be able to easily locate, update, or delete records as and when required.
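Open table formats make such record-level operations practical on a data lake. As a hedged sketch, assuming the customer table is stored in the Apache Iceberg format and registered in the Glue Data Catalog (Athena supports row-level DML on Iceberg tables; all names below are hypothetical):

import boto3

athena = boto3.client("athena")

# A single DELETE statement removes the customer's records; with Iceberg,
# Athena rewrites only the affected data files rather than the whole dataset.
athena.start_query_execution(
    QueryString="DELETE FROM customers WHERE customer_id = 'C-1042'",
    QueryExecutionContext={"Database": "greatfin_curated"},  # hypothetical
    ResultConfiguration={"OutputLocation": "s3://greatfin-athena-results/"},
)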

The need to create transactional data lakes came about due to many business use cases and the challenges associated with them, such as the following:

  • Compliance requirements: compliance and privacy laws; for example, the GDPR requires the deletion of certain data within a specific timeframe and/or across all datasets
  • Change data capture (CDC): CDC from the source databases and incremental...

Putting it all together

So far, we have discussed the different storage layers in a typical data lake in S3 and defined the purpose of each of the layers. We also introduced the concept of creating metadata using a Glue crawler and storing it in Glue Data Catalog. Finally, we looked at use cases for building transactional data lakes. This is a good time to pivot back to the GreatFin business requirements we introduced earlier and apply these data lake foundational concepts to our use case.

Marketing use case

Suppose the marketing department at GreatFin wants to identify top leads for offering a new type of certificate of deposit (CD), with a higher interest rate, to a select few high-net-worth customers only. In this case, the customer data will be stored in multiple systems, across different LOBs.

Let’s walk through what each layer in the data lake might look like.

Raw layer example

The following diagram is a depiction of data stored in a raw layer bucket in...

Summary

In this chapter, we went through why so many organizations prefer to build their data lakes on Amazon S3. We then explored the different layers of a data lake in S3 and the purpose of each of them. Along with the layers of data, we also looked at how the Glue Data Catalog helps capture the metadata about the data in the form of tables. We also touched upon a new trend of building transactional data lakes, which involves selecting a table format that aligns closely with the specific use case being solved. Finally, we put it all together to solve a specific use case and saw it all come together, at least from the data storage and catalog side of things.

We have the data in S3, and we have the catalog of this data in the Glue Data Catalog in the form of tables. The real value of this setup is that businesses can easily consume this data to derive insights from it. This leads us to the next part of this book, which covers different purpose-built services and how each of them...

