
You're reading from Data Engineering with AWS - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781804614426
Edition: 2nd Edition
Author: Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services and deep expertise in building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data-related topics.

Building a Modern Data Platform on AWS

As we near the end of this book, we will review high-level concepts for building a modern data platform on AWS. We could easily devote an entire book to this topic alone, but this chapter provides an overview to give you a strong foundation for approaching the build-out of a modern data platform.

There are many pieces to the puzzle of building a modern data platform, and this chapter builds on many of the topics we have covered in this book (such as data meshes and modern table formats) while also introducing topics we have not yet covered (such as Agile development and CI/CD pipelines).

The goal of this chapter is to help you think through how to bring together many of the different concepts you have learned in this book to create a data platform that supports both the data producers and data consumers in your organization. This chapter is not a complete guide to building a data platform...

Technical requirements

In the last section of this chapter, we will go through a hands-on exercise that automates the deployment of components that could be used in a data platform, as well as the code for a data engineering pipeline. This will require permissions for services such as AWS CloudFormation, AWS CodeCommit, and AWS CodeDeploy, as well as AWS Glue and various other services. As with the other hands-on activities in this book, having access to an administrator user in your AWS account should give you the permissions needed to complete these activities.
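If you prefer not to use an administrator user, the sketch below shows the general shape of an IAM policy granting access to the services mentioned above. This is an illustrative fragment only, not the exact policy used in the exercise: the service actions are intentionally broad, and a production setup would scope them down to specific resources and actions.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataPlatformExerciseAccess",
      "Effect": "Allow",
      "Action": [
        "cloudformation:*",
        "codecommit:*",
        "codedeploy:*",
        "codepipeline:*",
        "glue:*",
        "s3:*",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
```

Attaching a policy like this to a dedicated exercise user or role keeps the blast radius smaller than full administrator access, though it still grants wide permissions within each listed service.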

You can access more information about running the exercises in this chapter using the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter16

Goals of a modern data platform

In Chapter 15, Implementing a Data Mesh Strategy, we discussed how a central data platform team is responsible for building a platform that makes it easy for both data producers and data consumers to work with organizational data.

A data platform is intended to provide a system where multiple teams from across an organization can easily ingest data (including both structured and semi-structured, via batch and streaming), process the ingested data, and create new data products by joining datasets. It should also provide data governance controls, a catalog for making data discoverable across the organization, and the ability to easily share datasets across different teams/data domains.

Let’s review some of the top goals for a modern data platform, after which we will explore approaches to building these data platforms.

A flexible and agile platform

As we all know, the only constant is change. We have seen this throughout this...

Deciding whether to build or buy a data platform

The question of whether to build or buy applies to many different purchasing decisions that an organization needs to make. Some of these decisions are pretty obvious – for example, not many organizations will choose to build their own power plant and generate their own power, rather than just purchasing power from their local utility company.

Within the IT realm, there is likely to be a mix of building and buying, depending on the size of the organization. For example, most organizations that need systems for HR, Customer Relationship Management (CRM), or Enterprise Resource Planning (ERP) will purchase these from one of the many vendors that have built these products for many years. However, many organizations will choose to build and manage their own website and mobile app, including the microservices that power those systems.

Organizations also have a choice when it comes to their approach to implementing a modern...

DataOps as an approach to building data platforms

DataOps is a term that has been around since at least 2015 and refers to an agile approach to building data platforms and data products that borrows concepts from DevOps. Just as DevOps transformed how software is engineered, DataOps transforms how data products are built.

Just as it is difficult to give an exact definition or outline an exact approach for other concepts we have discussed (such as data lakes and data meshes), it is similarly difficult to pin down one clear-cut definition of DataOps. The original author of the term may have had a precise definition in mind, but over time a term can come to mean different things to different people, and its meaning as a whole may evolve.

In this section, we will attempt to focus on some of the core concepts of DataOps, and specifically, how they apply to building a data platform and data...

Hands-on – automated deployment of data platform components and data transformation code

While we do not have space to cover all aspects of building a modern data platform, in this section we will cover how to use various AWS services to deploy some components of a data platform. We start by setting up an AWS CodeCommit repository that will contain all the resources for our data platform (such as Glue ETL scripts and CloudFormation templates). We then use AWS CodePipeline to configure pipeline jobs that push any code or infrastructure changes into our target account.
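To illustrate the first of these steps, the following CloudFormation sketch creates a CodeCommit repository that could hold the Glue ETL scripts and templates described above. This is a minimal, assumed example (the repository name is hypothetical, and the full exercise in the book's GitHub repository configures additional resources such as the CodePipeline itself):

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Sketch - CodeCommit repository for data platform resources

Resources:
  DataPlatformRepo:
    Type: AWS::CodeCommit::Repository
    Properties:
      # Hypothetical name for illustration
      RepositoryName: data-platform-repo
      RepositoryDescription: Glue ETL scripts and CloudFormation templates

Outputs:
  CloneUrlHttp:
    Description: HTTPS clone URL for the new repository
    Value: !GetAtt DataPlatformRepo.CloneUrlHttp
```

Deploying this template (for example, with `aws cloudformation deploy --template-file repo.yaml --stack-name data-platform-repo`) outputs the clone URL you would use when committing code from your development environment.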

Setting up a Cloud9 IDE environment

Our first step is to create a Cloud9 IDE environment, which we can use to write code and commit it to a CodeCommit repository. Cloud9 is an AWS service that provisions a managed EC2 instance to provide a browser-based Integrated Development Environment (IDE) that we can use to write, run, and debug code from within our web browser...
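While the book's exercise walks through creating the environment in the AWS console, a Cloud9 environment can also be declared in CloudFormation. The fragment below is an illustrative sketch, not the exercise's exact configuration; the environment name and instance type are assumptions, and the set of valid `ImageId` values may change over time, so check the current CloudFormation documentation for `AWS::Cloud9::EnvironmentEC2`:

```yaml
Resources:
  DevIde:
    Type: AWS::Cloud9::EnvironmentEC2
    Properties:
      # Hypothetical name and sizing for illustration
      Name: data-platform-ide
      InstanceType: t3.small
      ImageId: amazonlinux-2023-x86_64
      # Stop the underlying EC2 instance after 30 idle minutes to save cost
      AutomaticStopTimeMinutes: 30
```

Once the stack is created, the environment appears in the Cloud9 console, and opening it launches the browser-based IDE backed by the managed EC2 instance.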

Wrapping Up the First Part of Your Learning Journey

In this book, we have explored many different aspects of the data engineering role by learning more about common architecture patterns, understanding how to approach designing a data engineering pipeline, and getting hands-on with many different AWS services commonly used by data engineers (for data ingestion, data transformation, orchestrating pipelines, and consuming data).

We examined some of the important issues surrounding data security and governance and discussed the importance of a data catalog to avoid a data lake turning into a data swamp. We also reviewed data marts and data warehouses and introduced the concept of a data lake house.

We learned about data consumers – the end users of the product that’s produced by data engineering pipelines – and looked into some of the tools that they use to consume data (including Amazon Athena for ad hoc SQL queries and Amazon QuickSight for data visualization...

