
You're reading from Data Engineering with AWS - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781804614426
Edition: 2nd Edition
Author: Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services and deep expertise in building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data-related topics.


Building Transactional Data Lakes

In the last few years, new technologies have emerged that have significantly enhanced the capabilities of traditional data lakes, enabling them to operate similarly to a data warehouse. These new technologies provide all the benefits of data lakes (such as low-cost object storage and the ability to use serverless data processing services) while also making it much easier to update data in the data lake, amongst other benefits.

Traditional data lakes were built on the Apache Hive technology stack, which enables you to store data in various file formats (such as CSV, JSON, Parquet, and Avro). Hive enabled many tens of thousands of data lakes to be built on object storage, but over the years the limitations of Hive became clearer, as we will discuss in this chapter.

To overcome these limitations, several new table formats have been created by different companies and open-source organizations. Keep reading to learn more...

Technical requirements

In the last section of this chapter, we will go through a hands-on exercise that uses AWS Glue to read data and write it out using the Apache Iceberg table format.

As with the other hands-on activities in this book, if you have access to an administrator user in your AWS account, you should have the permissions needed to complete these activities. If not, you will need to ensure that your user is granted access to create and run AWS Glue jobs, and to read and write data in Amazon S3.

You can find the SQL statements that we run in the hands-on activity section of this chapter in the GitHub repository for this book, using the following link: https://github.com/PacktPublishing/Data-Engineering-with-AWS-2nd-edition/tree/main/Chapter14.

What does it mean for a data lake to be transactional?

Transactional data lake is a common way to refer to the capabilities enabled by these new table formats, but what does that mean?

Let’s start by looking at the definition of a database transaction in general, from Wikipedia (https://en.wikipedia.org/wiki/Database_transaction):

"A database transaction symbolizes a unit of work, performed within a database management system (or similar system) against a database, that is treated in a coherent and reliable way independent of other transactions."

What this means is that a transaction can make multiple individual updates to a database, with the guarantee that either all of those updates succeed and are applied consistently, or the whole transaction fails. That means that if there are five updates as part of the transaction, and the third update fails, then the two previous...
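To make this concrete, here is a minimal sketch of a multi-statement transaction in generic relational-database SQL (the accounts and transfer_log tables, and the values used, are hypothetical and not taken from this book):

    -- Generic SQL sketch of an atomic, multi-statement transaction.
    -- Table names and values are illustrative only.
    BEGIN TRANSACTION;

    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
    INSERT INTO transfer_log (from_account, to_account, amount)
    VALUES (1, 2, 100);

    -- If any statement fails, the database rolls the transaction back
    -- (or you issue ROLLBACK;) so that none of the changes are applied.
    COMMIT;

Either all three changes become visible together, or none of them do; no other query ever sees a state where only some of them have been applied.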

An overview of Delta Lake, Apache Hudi, and Apache Iceberg

The three table formats that we review in this book all provide similar functionality, as outlined above, but each also has its own unique features and a slightly different implementation. In this section, we are going to do a deep dive into each of the three open table formats.

Deep dive into Delta Lake

Let's start by looking at Delta Lake; however, we will not be covering the enhanced capabilities available as part of the paid Databricks offering. For example, Delta Live Tables provides ETL pipeline functionality, but it is not open source, so it is not covered here.

Delta Lake has become a very popular table format, in large part because Databricks has a very popular lakehouse offering that incorporates Delta Lake. Databricks has made all Delta Lake APIs open source, including a number of performance optimization features that it initially built for its paying customers...
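As a minimal sketch of what working with a Delta Lake table looks like, the following Spark SQL statements create and modify a Delta table (this assumes a Spark session configured with the open-source Delta Lake package; the table name, columns, and S3 location are hypothetical):

    -- Spark SQL sketch; table name, columns, and S3 location are illustrative only.
    CREATE TABLE sales_delta (
      order_id   BIGINT,
      customer   STRING,
      amount     DECIMAL(10, 2),
      order_date DATE
    )
    USING DELTA
    LOCATION 's3://my-example-bucket/delta/sales_delta/';

    -- Row-level changes are applied as atomic transactions by Delta Lake.
    UPDATE sales_delta SET amount = amount * 1.1 WHERE order_date = DATE '2023-01-15';
    DELETE FROM sales_delta WHERE customer = 'test-account';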

AWS service integrations for building transactional data lakes

AWS services constantly evolve as new services are introduced and existing services have new functionality added. This applies to the AWS analytics services as well, with many of these services introducing support for these new transactional open table formats over the last few years. In this section, we will look at the support for open table formats in various services, as of the time of publishing.

However, make sure to review the latest AWS documentation to understand the latest status of support across the services.

Open table format support in AWS Glue

AWS Glue has broad support for open table formats across the different components of the Glue service. In this section, we examine open table format support in two of the key Glue components.

AWS Glue crawler support

As covered earlier in this book, the AWS Glue crawler is a component of the Glue service that can scan a data source (such as Amazon S3...

Hands-on – Working with Apache Iceberg tables in AWS

As discussed in the previous section, Amazon Athena has strong support for the Apache Iceberg format, and as a serverless service, it is the quickest and simplest way to work with Apache Iceberg tables.

For the hands-on section of this chapter, we are going to use the Amazon Athena service to create an Apache Iceberg table, and then explore some of the features of Iceberg as we query and modify the table. To do this, we will create an Iceberg version of one of the tables we created earlier in this book.

Creating an Apache Iceberg table using Amazon Athena

To create our Apache Iceberg table, we will access the Athena console and then run DDL statements to specify the details of the table we want to create. At the time of writing, Amazon Athena supports the creation of Iceberg v2 tables. Remember to refer to the GitHub site for this book for a copy of the SQL statements used in this section (as mentioned at the...
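As a rough sketch of what such a DDL statement looks like (the database, table, columns, and S3 location below are placeholders, not the table used in this chapter's exercise; refer to the GitHub repository for the actual statements), an Athena CREATE TABLE statement produces an Iceberg table when the table_type property is set to ICEBERG:

    -- Amazon Athena SQL sketch; names and the S3 location are placeholders.
    CREATE TABLE curated_db.sales_iceberg (
      order_id   bigint,
      customer   string,
      amount     double,
      order_date date
    )
    PARTITIONED BY (month(order_date))
    LOCATION 's3://my-example-bucket/iceberg/sales_iceberg/'
    TBLPROPERTIES ('table_type' = 'ICEBERG');

    -- Iceberg tables in Athena support row-level DML, for example:
    UPDATE curated_db.sales_iceberg SET amount = 0 WHERE order_id = 1001;

    -- They also support time travel queries against an earlier snapshot:
    SELECT * FROM curated_db.sales_iceberg
    FOR TIMESTAMP AS OF TIMESTAMP '2023-10-01 00:00:00 UTC';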

Implementing a Data Mesh Strategy

The original definition of a data lake, which first appeared in a blog post by James Dixon in 2010 (see https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/), was as follows:

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

In his vision of what a data lake would be, Dixon imagined that a data lake would be fed by a single source of data, containing the raw data from a system (so not pre-aggregated as you would have with a traditional data warehouse). He imagined that you might then have multiple data lakes for different source systems, but that these would be somewhat isolated.

Of course, new terms and ideas often...
