You're reading from Modern Data Architecture on AWS

Product type Book

Published in Aug 2023

Publisher Packt

ISBN-13 9781801813396

Pages 420 pages

Edition 1st Edition

Concepts

Data Science

Author (1):

Behram Irani

Table of Contents (24) Chapters

Preface

1. Part 1: Foundational Data Lake

2. Prologue: The Data and Analytics Journey So Far

3. Chapter 1: Modern Data Architecture on AWS

4. Chapter 2: Scalable Data Lakes

5. Part 2: Purpose-Built Services And Unified Data Access

6. Chapter 3: Batch Data Ingestion

7. Chapter 4: Streaming Data Ingestion

8. Chapter 5: Data Processing

9. Chapter 6: Interactive Analytics

10. Chapter 7: Data Warehousing

11. Chapter 8: Data Sharing

12. Chapter 9: Data Federation

13. Chapter 10: Predictive Analytics

14. Chapter 11: Generative AI

15. Chapter 12: Operational Analytics

16. Chapter 13: Business Intelligence

17. Part 3: Govern, Scale, Optimize And Operationalize

18. Chapter 14: Data Governance

19. Chapter 15: Data Mesh

20. Chapter 16: Performant and Cost-Effective Data Platform

21. Chapter 17: Automate, Operationalize, and Monetize

22. Index

23. Other Books You May Enjoy

Data Federation

In the previous chapter, we explored different use cases for sharing data, both within and outside the organization. Data sharing is a critical aspect of any data platform: data stored in an Amazon S3-based data lake and in an Amazon Redshift data warehouse can be shared seamlessly, without the need to create duplicate copies. Every data platform has distinct components for data storage and for data computation. In the data sharing model, we focused on sharing data between similar systems – for example, using Amazon Athena to share data stored in an S3 data lake and using Amazon Redshift to share data with other Redshift clusters.

Data isn’t always stored, processed, and shared within homogeneous systems. Often, data is captured in heterogeneous systems, and those systems may not even reside inside the AWS ecosystem. This brings us to the question, how do we seamlessly and transparently query datasets from a...

Data federation using Amazon Athena

Amazon Athena is primarily used to query data in S3 data lakes. To query data across heterogeneous sources, however, Athena provides a feature called Federated Query. This feature enables different personas, such as data analysts, data engineers, and data scientists, to execute queries across disparate data sources from Athena itself. The single biggest differentiator of Federated Query is that such queries are executed inside the systems that store the data.

Athena executes these federated queries using connectors, and it provides connectors for a variety of source systems. Using these connectors, Athena can pass the portions of the query that need to be executed in the source system down to that system. This execution is assisted by AWS Lambda functions, which optimize the query’s execution and gather the data received from the underlying systems. Since Lambda functions are serverless and scalable, this allows Athena to query larger datasets...
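As a rough illustration of what a federated query can look like from the caller's side, the following Python sketch joins a data lake table with a table exposed through a connector. It assumes a connector has already been registered in Athena as a data source; the catalog, database, table, and column names (mysql_crm, sales_lake, orders, and so on) are hypothetical and do not come from the book. The helper takes an Athena client object as a parameter and relies only on the start_query_execution and get_query_execution calls.

```python
import time

# Hypothetical names throughout: "awsdatacatalog" is the data lake catalog and
# "mysql_crm" stands for a catalog registered for a MySQL connector.
FEDERATED_SQL = """
SELECT o.order_id, o.amount, c.customer_name
FROM "awsdatacatalog"."sales_lake"."orders" AS o
JOIN "mysql_crm"."crm"."customers" AS c
  ON o.customer_id = c.customer_id
"""

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_federated_query(athena, sql, output_location, poll_seconds=1.0):
    """Submit SQL to Athena and poll until the query reaches a terminal state.

    `athena` is expected to behave like a boto3 Athena client.
    Returns a (query_execution_id, final_state) tuple.
    """
    response = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    execution_id = response["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return execution_id, state
        time.sleep(poll_seconds)
```

With boto3 installed and credentials configured, a call might look like run_federated_query(boto3.client("athena"), FEDERATED_SQL, "s3://my-results-bucket/athena/"), where the bucket name is again a placeholder. From the caller's perspective the federated table is addressed like any other table; the connector handles the interaction with the source system.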

Data federation using Amazon Redshift

Federated queries can also be executed from inside Redshift, allowing Redshift data to be joined with data from relational data sources such as PostgreSQL and MySQL, either on Amazon RDS or on Amazon Aurora. For certain use cases, it does not make sense to spend time creating an ETL pipeline to load data into Redshift. Redshift can connect to these sources and push the execution of such queries down to the data source itself to improve performance.
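To make the setup more concrete, here is a minimal Python sketch that builds the CREATE EXTERNAL SCHEMA statement Redshift uses to point at a PostgreSQL or MySQL source. The function name and all argument values are hypothetical, the values are interpolated without escaping, and the generated SQL would still need to be run against a Redshift cluster by a SQL client of your choice.

```python
def external_schema_sql(local_schema, engine, database, uri,
                        iam_role, secret_arn, remote_schema=None, port=None):
    """Build a CREATE EXTERNAL SCHEMA statement for a federated source.

    Illustrative only: values are not escaped, so pass trusted input.
    """
    engine = engine.upper()
    if engine not in ("POSTGRES", "MYSQL"):
        raise ValueError(f"unsupported engine: {engine}")

    clauses = [
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {local_schema}",
        f"FROM {engine}",
        f"DATABASE '{database}'",
    ]
    # The remote SCHEMA clause is only emitted for PostgreSQL sources here.
    if engine == "POSTGRES" and remote_schema:
        clauses.append(f"SCHEMA '{remote_schema}'")
    clauses.append(f"URI '{uri}'")
    if port is not None:
        clauses.append(f"PORT {int(port)}")
    clauses.append(f"IAM_ROLE '{iam_role}'")
    clauses.append(f"SECRET_ARN '{secret_arn}'")
    return "\n".join(clauses) + ";"


# Hypothetical query joining a local Redshift table with the federated schema.
JOIN_SQL = """
SELECT s.region, SUM(o.amount) AS live_order_total
FROM public.sales_history AS s
JOIN ods_pg.orders AS o ON o.region = s.region
GROUP BY s.region;
"""
```

Once an external schema such as the hypothetical ods_pg exists, its tables can be referenced in ordinary Redshift SQL, as in JOIN_SQL, without first loading them through an ETL pipeline.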

The following figure highlights the data sources that Redshift federated queries currently work with. With the federated architecture in place inside Redshift, more source connectors may be added in the future to expand the ecosystem and broaden the use cases that this architecture pattern can solve:

Figure 9.7 – Redshift federated queries

Amazon Redshift federated queries use case

To understand this better, let’s consider a use case...

Summary

In this chapter, we looked at how data federation helps organizations quickly fetch data from multiple heterogeneous source systems through a single pane of glass.

We looked at how the different connectors in Amazon Athena provide a quick and easy way to join datasets from other sources. Athena’s connectors make for a seamless and transparent user experience, where reports can be created just by writing SQL statements inside Athena that join datasets from the underlying data stores.

We also looked at how Amazon Redshift supports federated queries by fetching data stored in operational data stores (ODSs) such as MySQL and PostgreSQL. A use case typically solved by this mechanism is querying live operational data that’s constantly being updated in the ODS.

The next chapter is critical in our modern data platform journey, as we will discuss predictive analytics and how it helps organizations think big with their data.