Engineering Lakehouses with Open Table Formats: Build scalable and efficient lakehouses with Apache Iceberg, Apache Hudi, and Delta Lake

Dipankar Mazumdar, Vinoth Govindarajan
Early Access: publishing in Dec 2025
eBook, Dec 2025, 1st Edition
eBook: $31.99 $35.99
Paperback: $44.99
Subscription: Free Trial, renews at $19.99 p/m

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want

Engineering Lakehouses with Open Table Formats

Open Data Lakehouse – a New Architectural Paradigm

In today’s data-driven business landscape, processing vast amounts of data is essential for understanding operations and making informed decisions. Over the past few decades, the growing demand for both real-time transactional processing and large-scale analytics has shaped the evolution of data management systems. Initially, organizations relied on two primary systems: Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP). As data volumes expanded, data lakes emerged to handle the scale, but their limitations (also discussed in this chapter) led to the development of a more advanced solution: the lakehouse architecture.

In this chapter, we’re going to cover the following main topics:

  • The emergence of OLTP and OLAP systems
  • Data lakes – centralized data storage for a new era
  • The emergence of lakehouse architecture
  • An introduction to data lakehouse
  • Lakehouse – architectural...

The emergence of OLTP and OLAP systems

OLTP and OLAP systems have long been regarded as foundational for enterprise data management. OLTP systems handle real-time transactions, while OLAP systems focus on long-term data analysis, creating a clear separation between operational and analytical workloads. However, this division brought challenges, including the need for constant ETL processes to transfer data between the two systems, which resulted in latency and inconsistencies between operational and analytical data.

In the following subsections, we will dive deeper into the roles and limitations of OLTP and OLAP systems, setting the stage for understanding how modern data architectures such as data lakes and lakehouses address these issues.
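The latency described above can be illustrated with a small sketch. It uses two in-memory SQLite databases as stand-ins for an OLTP store and an OLAP store; the table names, columns, and the full-refresh `run_etl` function are hypothetical and chosen only for illustration, not taken from the book.

```python
import sqlite3


def make_oltp():
    """Create a normalized, write-optimized store (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
        INSERT INTO customers VALUES (1, 'DE'), (2, 'US');
        INSERT INTO orders VALUES (1, 1, 20), (2, 2, 15);
    """)
    return db


def make_olap():
    """Create a denormalized, read-optimized store (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales_by_country (country TEXT PRIMARY KEY, total INTEGER)")
    return db


def run_etl(oltp, olap):
    """Extract from OLTP, transform (join + aggregate), load into OLAP as a full refresh."""
    rows = oltp.execute("""
        SELECT c.country, SUM(o.amount)
        FROM orders o JOIN customers c ON c.id = o.customer_id
        GROUP BY c.country
    """).fetchall()
    with olap:  # replace the previous aggregate in one transaction
        olap.execute("DELETE FROM sales_by_country")
        olap.executemany("INSERT INTO sales_by_country VALUES (?, ?)", rows)


def olap_totals(olap):
    """Return the analytical view as a {country: total} dict."""
    return dict(olap.execute("SELECT country, total FROM sales_by_country"))
```

An order inserted into the OLTP store after `run_etl` has finished stays invisible to `olap_totals` until the next ETL run, which is the kind of gap between operational and analytical data that the text refers to.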

OLTP – the transactional backbone

OLTP systems were designed to handle day-to-day transactional data. These systems excel at high-speed insert, update, and delete operations and are optimized for real-time transaction processing. These were systems...
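As a minimal sketch of the transactional behavior described here, the following example uses Python's built-in `sqlite3` module as a stand-in for an OLTP database. The `accounts` table and the `transfer` helper are assumptions made for illustration; they show how a group of updates either commits together or is rolled back together.

```python
import sqlite3

# In-memory database standing in for an OLTP store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if any statement raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # This debit violates the CHECK constraint when funds are insufficient,
            # which also undoes the credit applied just above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return current balances as an {id: balance} dict."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

Calling `transfer(conn, 1, 2, 30)` moves the funds, while `transfer(conn, 1, 2, 500)` fails and leaves both balances untouched, since the partial credit is rolled back with the failed debit.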

Data lakes – centralized data storage for a new era

Data lakes are centralized repositories for raw data in its native format. They are designed to handle large volumes from sources such as social media, IoT devices, and sensors, and often run on top of distributed file systems such as the Hadoop Distributed File System (HDFS) or cloud-based object storage such as Amazon S3.

As demand for big data solutions grew, enterprises needed a system to capture, process, and analyze unstructured and semi-structured data coming from sources such as logs, sensors, and social media. Over time, data lakes became systems that store raw data in its native format with relaxed schema enforcement.

Data lakes leverage distributed storage systems designed to scale with volume. This flexibility in scaling lets organizations store many data formats in a single location, potentially unlocking advanced analytics, machine learning, and AI capabilities not afforded by the rigidity of traditional...
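The relaxed schema enforcement mentioned above is often described as schema-on-read. The sketch below is one simple way to picture it, assuming a local directory in place of HDFS or object storage and a JSON Lines file in place of a real lake layout; `land_raw_events` and `read_with_schema` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path


def land_raw_events(lake_dir, events):
    """Append events as-is (JSON Lines); no schema is enforced at write time."""
    path = Path(lake_dir) / "events.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")
    return path


def read_with_schema(path, schema):
    """Apply a schema at read time: keep known fields, cast them, default to None."""
    rows = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            row = {}
            for field, cast in schema.items():
                value = record.get(field)
                row[field] = cast(value) if value is not None else None
            rows.append(row)
    return rows


with tempfile.TemporaryDirectory() as lake_dir:
    raw_path = land_raw_events(
        lake_dir,
        [
            {"device": "a1", "temp": "21.5"},               # temperature sent as a string
            {"device": "b2", "temp": 19, "extra": {"x": 1}},  # unexpected nested field
            {"user": "u1"},                                   # unrelated record shape
        ],
    )
    parsed_rows = read_with_schema(raw_path, {"device": str, "temp": float})
```

All three records land without complaint, and the mismatch only surfaces when a reader imposes a schema. That flexibility is convenient at ingestion time, but it also hints at why governance and reliability become harder without a stricter layer on top.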

The emergence of lakehouse architecture

Lakehouse architecture combines the advantages of data lakes and data warehouses. It introduces a unified platform for both transactional and analytical workloads, addressing the need for scalable, flexible, and high-performance data processing.

Apache Iceberg, Apache Hudi, and Delta Lake are the three main open source projects driving the evolution of lakehouse architecture. Together with their respective communities, they have added capabilities such as data governance, ACID transactions, and query optimization to data lakes, making the lakehouse a practical way to unify an organization's data infrastructure.
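To give a rough intuition for how a metadata layer can add transactional behavior on top of plain files, here is a toy sketch. It is not the on-disk layout or API of Apache Iceberg, Apache Hudi, or Delta Lake; the `ToyTable` class, the `_log` directory, and the JSON commit files are all assumptions invented for this illustration.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """A toy table: immutable data files plus an ordered log of commits."""

    def __init__(self, root):
        self.root = Path(root)
        self.log_dir = self.root / "_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty table."""
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def write_rows(self, name, rows):
        """Write a data file; readers cannot see it until it is committed."""
        (self.root / name).write_text(json.dumps(rows), encoding="utf-8")
        return name

    def commit(self, add=(), remove=(), expected_version=None):
        """Publish a new version; fail if another writer already claimed it."""
        if expected_version is None:
            expected_version = self.latest_version()
        new_version = expected_version + 1
        entry = {"add": list(add), "remove": list(remove)}
        log_file = self.log_dir / f"{new_version:08d}.json"
        # Mode "x" raises FileExistsError if the version is taken. A real system
        # would need a truly atomic put-if-absent or rename instead of this.
        with open(log_file, "x", encoding="utf-8") as f:
            json.dump(entry, f)
        return new_version

    def snapshot_files(self, version=None):
        """Replay the log up to `version` to find the live data files."""
        if version is None:
            version = self.latest_version()
        live = []
        for v in range(version + 1):
            text = (self.log_dir / f"{v:08d}.json").read_text(encoding="utf-8")
            entry = json.loads(text)
            live = [f for f in live if f not in entry["remove"]] + entry["add"]
        return live

    def read(self, version=None):
        """Read all rows visible in the given (or latest) version."""
        rows = []
        for name in self.snapshot_files(version):
            rows.extend(json.loads((self.root / name).read_text(encoding="utf-8")))
        return rows
```

In this sketch, a reader only ever sees files named in committed log entries, an older version can be read by replaying fewer entries, and a writer working from a stale version gets a `FileExistsError` rather than silently overwriting someone else's commit.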

An introduction to data lakehouse

The data lakehouse architecture has changed how organizations handle and derive insights from their data. Its growing popularity stems from the need for storage systems that not only manage massive, diverse datasets but also offer an open architecture and a unified platform for both batch and streaming workloads. As we have noted, data warehouses (OLAP databases) and data lakes have each served specific needs: data warehouses excel at structured workloads and provide robust data management, while data lakes support diverse data types and are very cost-effective. The lakehouse architecture targets the gap between the two, combining their strengths while addressing their limitations.

Lakehouse architecture is built on two foundational pillars: openness and reliability. By storing data in open storage formats, the lakehouse treats storage as an independent and open...

Lakehouse – architectural breakdown

In this section, we’ll break down the key components of a lakehouse architecture and understand how they interface with each other to provide an open foundation for the lakehouse architecture. To begin, let’s first take a look at the lakehouse architecture breakdown:

Figure 1.1 – An architectural breakdown of the lakehouse architecture

We will discuss the components of lakehouse architecture in the following sections. The idea is to establish a standard definition for each of these technical components and go over the functionalities and their role in the lakehouse architecture.

Lake storage

The very first component of a lakehouse architecture is the storage layer. It serves as the destination where files from various operational systems are housed after the ingestion process. These systems serve as a repository where all data – whether structured, semi-structured, or unstructured – is stored and organized...

Attributes of an open data lakehouse

In this section, we will explore the defining attributes that give open data lakehouse architecture its edge for today’s analytical workloads. These attributes underpin how an open data lakehouse delivers openness, reliability, and cost-efficiency, allowing multiple analytical workloads to run on the same architecture. As we explore each attribute, you'll see how they come together to optimize data management, reduce costs, and integrate across various workloads and compute engines.

Open data architecture

One of the most important characteristics of a lakehouse architecture is its emphasis on an open data architecture. In today’s context, "openness" refers to open standards and the open way of storing data. This is facilitated by the metadata layer (table formats such as Apache Iceberg, Apache Hudi, and Delta Lake) in combination with the data layer (open file formats such...

Summary

In this chapter, we explored the evolution of data architectures, from OLTP and OLAP systems to data lakes, providing a foundation for understanding the data lakehouse paradigm. This historical context helps explain how the challenges of older systems led to the development of more flexible and scalable solutions, even as foundational components remained the same. You also gained a deep understanding of the core components and principles of open data lakehouse architecture, including storage, file formats, table formats, storage engines, catalogs, and query engines. These building blocks will help you design scalable and flexible systems capable of handling both batch and streaming workloads efficiently. Additionally, you learned about some of the key attributes of lakehouse architecture, such as open data architecture, modularity, flexibility, and cost-efficiency.

In the next chapter, you will dive deep into the transactional layer of the lakehouse to understand how critical technical...

Questions

  1. What are the main differences between OLTP and OLAP systems in terms of their primary focus and technical components?
  2. Explain the role and characteristics of data lakes in modern data architectures. What limitations do they have that led to the development of the lakehouse architecture?
  3. What are the key attributes of a lakehouse architecture that make it distinct from traditional data lakes and warehouses?
  4. How do table formats such as Apache Iceberg, Apache Hudi, and Delta Lake enhance the functionality of a lakehouse architecture?
  5. What are the benefits of using an open data architecture in a lakehouse setup? Provide examples.
  6. Why is unification of batch and streaming workloads an important feature of the lakehouse architecture? How is this achieved?
  7. Describe the importance of file formats (row-based and columnar) in a lakehouse. How do they impact query performance and storage efficiency?
  8. What role does the storage engine play in a lakehouse architecture, and how does it ensure...

Answers

  1. OLTP systems focus on real-time transactional processing, optimized for rapid data insertion, updates, and deletions with ACID compliance. OLAP systems target large-scale data analysis with high query performance, often using column-oriented file formats and denormalized schemas.
  2. Data lakes provide centralized storage for raw data in its native format, enabling scalability and cost-effectiveness. However, they face challenges such as lack of governance, slow query performance, and no native ACID support, which the lakehouse architecture addresses by combining the flexibility of data lakes with the reliability of data warehouses.
  3. Key attributes of a lakehouse architecture include open data architecture, unification of batch and streaming workloads, cost efficiency, reliable transactions, and multi-compute engine support.
  4. Table formats such as Apache Iceberg, Apache Hudi, and Delta Lake provide schema evolution, time travel, and ACID compliance, enabling better data management...
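The first answer mentions column-oriented formats, and question 7 asks how row-based and columnar layouts affect query performance and storage efficiency. The sketch below illustrates the idea with plain Python lists rather than a real file format; the sample `ROWS` data and all helper names are made up for this example.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row carries the same keys.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


# Hypothetical sample data in a row-oriented layout.
ROWS = [
    {"order_id": 1, "country": "DE", "amount": 20},
    {"order_id": 2, "country": "DE", "amount": 35},
    {"order_id": 3, "country": "US", "amount": 15},
    {"order_id": 4, "country": "US", "amount": 30},
]


def total_amount_row_layout(rows):
    """Row layout: every full record is visited even though one field is needed."""
    return sum(row["amount"] for row in rows)


def total_amount_column_layout(columns):
    """Columnar layout: only the 'amount' column is touched."""
    return sum(columns["amount"])
```

Both functions return the same total, but the columnar version reads a single list, and `run_length_encode(to_columnar(ROWS)["country"])` shows how similar values stored next to each other lend themselves to compact encodings.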

Key benefits

  • Build open lakehouses with open table formats using popular compute engines such as Apache Spark, Apache Flink, Trino, and Python
  • Optimize Lakehouse performance with advanced techniques such as pruning, partitioning, compaction, indexing, and clustering
  • Learn how to enable seamless integration, data management, and interoperability using Apache XTable
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Engineering Lakehouses with Open Table Formats provides detailed insights into lakehouse concepts, and dives deep into the practical implementation of open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake. If you are a data engineer or architect looking to understand the intricacies of open lakehouse architectures, this book is for you. You'll start by exploring the internals of a table format and learn in detail about the transactional capabilities of lakehouses. You’ll also work with each table format with hands-on exercises using popular computing engines such as Apache Spark, Flink, Trino, dbt, and Python-based tools. The book addresses advanced topics, including performance optimization techniques and interoperability among different formats, equipping you to build production-ready lakehouses. With step-by-step explanations, you’ll get to grips with the key components of Lakehouse architecture and learn how to build, maintain, and optimize them. By the end, you'll be proficient in evaluating and implementing open table formats, optimizing lakehouse performance, and applying these concepts to real-world scenarios, ensuring you make informed decisions in selecting the right architecture for your organization’s data needs.

Who is this book for?

This book is for data engineers, software engineers, and data architects who want to deepen their understanding of open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake, and learn how they are used to build lakehouses. It is also a good fit for professionals working with traditional data warehouses, relational databases, and data lakes who wish to transition to an open data architectural pattern. Basic knowledge of databases, Python, Apache Spark, Java, and SQL is recommended for a smooth learning experience.

What you will learn

  • Explore Lakehouse fundamentals such as table formats, file formats, compute engines, and catalogs
  • Gain a complete understanding of data lifecycle management in lakehouses
  • Integrate lakehouses with Apache Airflow, dbt, and Apache Beam
  • Optimize performance with sorting, clustering, and indexing techniques
  • Use data stored in open table formats with ML frameworks such as Spark MLlib, TensorFlow, and MLflow
  • Interoperate across different table formats with Apache XTable and UniForm
  • Secure your lakehouse with access controls and ensure regulatory compliance

Product Details

Publication date : Dec 19, 2025
Edition : 1st
Language : English
ISBN-13 : 9781836207221
Vendor : Apache



Table of Contents

3 Chapters
Engineering Lakehouses with Open Table Formats: Build scalable and efficient lakehouses with Apache Iceberg, Apache Hudi, and Delta Lake
1. Open Data Lakehouse – a New Architectural Paradigm
2. Transactional Capabilities in Lakehouse

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect our rights as publishers and those of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else

How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment can be made using Credit Card, Debit Card, or PayPal).

Where can I access support around an eBook?

  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us

What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not in Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?

  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are priced lower than print
  • They save resources and space

What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.
