
You're reading from Data Engineering with AWS - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781804614426
Edition: 2nd Edition
Author: Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services and deep expertise in building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data-related topics.


Implementing a Data Mesh Strategy

The original definition of a data lake, which first appeared in a blog post by James Dixon in 2010 (see https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/), was as follows:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

In his vision, Dixon imagined a data lake fed by a single source of data, containing the raw data from a system (not pre-aggregated, as you would have with a traditional data warehouse). He imagined that you might then have multiple data lakes for different source systems, but that these would be somewhat isolated from each other.

Of course, new terms and ideas often...

Technical requirements

In the last section of this chapter, we will go through a hands-on exercise that uses Amazon DataZone to implement a basic data mesh approach.

As with the other hands-on activities in this book, if you have access to an administrator user in your AWS account, you should have the permissions needed to complete these activities.

You can access more information about running the exercises in this chapter using the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter15

What is a data mesh?

The concept of a data mesh was introduced around 2019 by Zhamak Dehghani, who at the time was a consultant for a company called ThoughtWorks. The data mesh architecture was built around four principles:

  • Domain-oriented, decentralized data ownership
  • Data as a product
  • Self-service data infrastructure as a platform
  • Federated computational governance
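
To make the "data as a product" principle more concrete, here is a minimal, purely illustrative sketch of what a data product descriptor might capture. All names and fields are hypothetical (they are not part of any AWS API); the point is that a data product pairs the data's location with explicit ownership, a schema contract, and discoverability metadata.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a "data product" in a mesh bundles the data itself
# with ownership, a schema contract, and metadata that makes it findable.
@dataclass
class DataProduct:
    name: str                   # e.g. "orders-curated"
    domain: str                 # owning business domain, e.g. "sales"
    owner: str                  # accountable team
    location: str               # where consumers read it, e.g. an S3 prefix
    schema: dict = field(default_factory=dict)  # column name -> type
    tags: list = field(default_factory=list)    # search/discovery metadata

    def catalog_entry(self) -> dict:
        """Flatten the product into a catalog-friendly record."""
        return {
            "name": self.name,
            "domain": self.domain,
            "owner": self.owner,
            "location": self.location,
            "columns": sorted(self.schema),
            "tags": self.tags,
        }

# Example product owned by a (hypothetical) sales domain team
orders = DataProduct(
    name="orders-curated",
    domain="sales",
    owner="sales-data-team",
    location="s3://sales-domain-bucket/products/orders-curated/",
    schema={"order_id": "string", "order_date": "date", "total": "decimal"},
    tags=["sales", "orders", "curated"],
)
```

In a real mesh, a record like `catalog_entry()` would be published to a central catalog (such as Amazon DataZone) so consumers in other domains can discover and subscribe to the product.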

Over time, as with data lakes, the term began to mean different things to different people. Some organizations would claim they had implemented a data mesh because they had enabled data sharing between multiple data lakes, while others would go all in with organizational change, in addition to building technology stacks to support a data mesh.

I believe that it is okay for a term to evolve and change, but that does mean that when someone uses a term such as data mesh, you need to ask them exactly what that means to them. If someone defines a data mesh as the ability to share data...

Challenges that a data mesh approach attempts to resolve

Traditional data lake approaches served many organizations well for a long time, but as with everything in technology, new developments and approaches continually drive improvements.

In the previous chapter, we looked at how new table formats (such as Apache Iceberg) introduced new functionality that improved querying and processing data in data lakes. In a similar way, the concepts and approaches introduced by a data mesh help solve some different challenges of traditional data lakes and how data teams are structured.

Let’s look at a few of the traditional challenges that a data mesh helps solve.

Bottlenecks with a centralized data team

While not the case for every data lake, it was common for large enterprises to create a centralized team that would ingest data from transactional systems across the organization and then perform ETL tasks on that data (cleaning the data, joining data from across...

The organizational and technical challenges of building a data mesh

As we discussed at the start of this chapter, a data mesh may mean different things to different people. Some people approach a data mesh implementation as though it were just a technical challenge about improving the sharing and creation of analytical data. But as we have seen, the way that Dehghani proposed a data mesh approach is not about technical solutions to data sharing, but much more about the overall way that an organization approaches analytical data.

In this section, we look at some of the challenges (both organizational and technical) of implementing a data mesh.

Changing the way that an organization approaches analytical data

While there are technical challenges to building a data mesh, the more difficult part is changing the way that an organization views analytical data, and changing who is responsible for creating analytical data.

While there is no “traditional” way to...

AWS services that help enable a data mesh approach

Most analytics vendors have been adding functionality to their solutions to support a data mesh approach over the past few years. And while there is currently no single solution that enables a data mesh across a complex selection of analytical tools and hybrid environments, many companies have made good progress in supporting the data mesh approach, at least within their own “ecosystem” of tools.

AWS has supported sharing both S3- and Redshift-based datasets across AWS accounts for some time. Easy sharing of data across different AWS accounts is a key component of a data mesh architecture (different teams within an organization generally have their own AWS accounts), but it is only one piece of building a data mesh. Another key component is the ability to centrally catalog data and add rich business metadata for each dataset, which can be done with the Amazon DataZone service. In this section, we explore the...
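
As a hedged sketch of the cross-account sharing piece, the snippet below builds a Lake Formation grant that would let a consumer account query a producer's Glue table. The account ID, database, and table names are placeholders; the request shape follows the boto3 `lakeformation` client's `grant_permissions` operation. Building the request as a plain dict first keeps the sharing intent easy to review before anything is applied.

```python
# Illustrative only: account ID, database, and table names are made up.
def build_cross_account_grant(consumer_account_id: str,
                              database: str, table: str) -> dict:
    """Build a Lake Formation grant_permissions request that shares a
    Glue table with another AWS account (read-only)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT", "DESCRIBE"],
    }

request = build_cross_account_grant("222222222222", "sales_db", "orders_curated")

# Uncomment to apply the grant (requires credentials and Lake Formation
# data lake administrator permissions in the producer account):
# import boto3
# boto3.client("lakeformation").grant_permissions(**request)
```

On the consumer side, the shared table surfaces via AWS Resource Access Manager and can then be linked into the consumer account's own Glue Data Catalog for querying.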

A sample architecture for a data mesh on AWS

We have looked at what a data mesh is, some of the organizational and technical challenges of building a data mesh, and finally some of the AWS services that can be used to build a data mesh. Now, in this section, let’s bring it all together with a sample architecture for a data mesh on AWS.

Architecture for a data mesh using AWS-native services

Earlier in this chapter, we discussed how a data mesh is easier to build when starting fresh, or when using only AWS-native services. In this section, we will look at a sample architecture that uses only AWS-native services, and in the section that follows, we will review an architecture for environments that also use analytics tools from other vendors.

The following architecture diagram shows a data mesh that is built using Amazon DataZone, with data in an Amazon S3-based data lake (a similar architecture would support data in Amazon Redshift) and using AWS Lake Formation for data sharing...

Hands-on – Setting up Amazon DataZone

In the hands-on section of this chapter, we are going to set up and configure the Amazon DataZone service. We will then create a DataZone project, import a data source, add business metadata, and publish a data product. Finally, we will access the DataZone data portal as a data consumer, to search for and subscribe to a data product.
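
The walkthrough that follows uses the DataZone console, but the first steps can also be approximated with the boto3 `datazone` client, as in the hedged sketch below. The role ARN and resource names are placeholders, and since DataZone was newly released at the time of writing, parameter names should be checked against the current API reference before use.

```python
# Illustrative only: the role ARN and names below are placeholders.
# A DataZone domain is the top-level container; projects group the
# producers and consumers who work within it.
domain_kwargs = {
    "name": "corporate-data-mesh",
    "domainExecutionRole": "arn:aws:iam::111111111111:role/DataZoneExecRole",
}

def project_kwargs(domain_id: str) -> dict:
    """Build the request for a producer project inside the domain."""
    return {"domainIdentifier": domain_id, "name": "sales-producer-project"}

# Uncomment to run against a real account (requires credentials and the
# appropriate DataZone permissions):
# import boto3
# dz = boto3.client("datazone")
# domain = dz.create_domain(**domain_kwargs)
# project = dz.create_project(**project_kwargs(domain["id"]))
```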

At the time of publication of this book, the Amazon DataZone service had only recently been released. Services often see a number of changes shortly after becoming Generally Available (GA), so make sure to check the GitHub page for this chapter for any updates related to these hands-on exercises. The GitHub page is available at https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter15

Let’s get started.

Setting up AWS IAM Identity Center

To log in to the Amazon DataZone data portal, you can either use your AWS IAM credentials...

Building a Modern Data Platform on AWS

As we near the end of this book, we will review high-level concepts around building a modern data platform on AWS. We could easily devote another whole book to this topic alone, but in this chapter, we will provide at least an overview to give you a strong foundation on how to approach the build-out of a modern data platform.

There are many different pieces to the puzzle of building a modern data platform, and this chapter will build on many of the topics we have covered in this book (such as data meshes and modern table formats) while also introducing topics we have not yet covered (such as Agile development and CI/CD pipelines).

The goal of this chapter is to help you think through how to bring together many of the different concepts you have learned in this book to create a data platform that supports both the data producers and data consumers in your organization. This chapter is not a complete guide to building a data platform...
