DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning. Make sure to subscribe here so you never miss a key update in the data world.

Introduction

Modern data platforms must support massive scale, high concurrency, and multiple compute engines without sacrificing correctness. This is where the lakehouse architecture stands out by bringing ACID-compliant transactions to distributed, file-based data lakes. Through mechanisms such as optimistic concurrency control (OCC) and multi-version concurrency control (MVCC), lakehouses ensure reliable reads and writes even under heavy parallel workloads. This article explores how ACID properties, conflict resolution strategies, and storage engines work together to deliver consistency, durability, and performance in modern lakehouse systems.

Understanding transactions and ACID properties

Transactions are fundamental units of work in any database system. They represent a series of operations that are executed as a single, indivisible unit. The concept of a transaction ensures that either all the operations within it are executed successfully, or none at all, preserving the integrity of the system. In a transactional context, we focus on the atomicity, consistency, isolation, and durability (ACID) properties, which govern the behavior of these operations.

ACID properties serve as the backbone of robust data management systems. They are crucial in ensuring that transactional operations are reliable and consistent, even in the presence of multiple concurrent operations or system failures. Without ACID guarantees, data systems could become unreliable, leading to data corruption, inconsistencies, and operational inefficiencies.

Deep dive into ACID properties

To understand the foundation of transactional systems, let's break down the four core ACID properties that ensure data consistency and reliability:

Atomicity: Atomicity ensures that a transaction is "all or nothing." If any part of a transaction fails, the entire transaction is rolled back, ensuring that partial updates do not leave the system in an inconsistent state. This is especially important in distributed systems, where partial updates can lead to a variety of issues, including data loss and corruption.

Consistency: Consistency ensures that a transaction brings the database from one valid state to another. In other words, after the transaction is completed, the data should still satisfy all the integrity constraints (such as foreign keys, unique constraints, and so on). For example, if a table requires a column to have unique values, no transaction should violate this rule.

Isolation: Isolation guarantees that the intermediate states of a transaction are invisible to other transactions. Even though multiple transactions might be running concurrently, they are isolated from each other. Different isolation levels (such as read committed and serializable) can be used based on the system's requirements and the balance between performance and consistency.
Durability: Durability ensures that once a transaction is committed, it is permanently stored in the system, even in the event of a crash or failure. Durability is typically achieved through techniques such as logging and write-ahead logs (WALs) in storage engines to ensure that all changes are safely written to persistent storage.

ACID properties in traditional databases

In traditional databases such as MySQL, PostgreSQL, and Oracle, the ACID properties have been well established. These systems use a tightly coupled architecture where both the compute and storage components are integrated, ensuring that transactions can be processed efficiently.

However, the introduction of distributed and cloud-native architectures has brought new challenges. In such environments, ensuring ACID properties across distributed systems requires specialized techniques such as two-phase commit (2PC) or optimistic concurrency control to handle distributed transactions while maintaining consistency across nodes.

ACID properties in lakehouse architectures

The lakehouse architecture addresses some of the inherent limitations of traditional data lakes by introducing ACID-compliant transactional capabilities. In data lakes, operations such as updates and deletes were difficult to handle without complex workarounds. However, with open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake, these challenges are addressed, bringing ACID properties to the forefront in a distributed, file-based architecture. We will dive deep into each table format's transactional capabilities later in this chapter.

In a lakehouse, the storage engine manages transactions, ensuring that any operation, whether it's a write, update, or delete, is executed in a manner that conforms to ACID guarantees. This ensures that users can trust the consistency and durability of the data, even in large-scale, multi-petabyte environments where high concurrency and distributed operations are common.

Now that we've explored ACID properties in traditional databases and lakehouse architectures, let's dive deeper into conflict resolution mechanisms in the next section.

Discovering conflict resolution mechanisms

In any multi-user or concurrent environment, conflicts can arise when multiple transactions attempt to update or modify the same data simultaneously. Conflict resolution is crucial for maintaining data integrity, particularly in systems where high levels of concurrency are expected, like those in distributed databases or lakehouse environments.

Traditional database systems employ various mechanisms to manage conflicts, ensuring that simultaneous operations do not lead to data corruption or inconsistencies. Two key concepts in this space are locking mechanisms (for example, pessimistic and optimistic locking) and concurrency control methods (for example, two-phase locking or timestamp ordering).

Types of conflict resolution

To ensure data consistency and integrity, transactional systems employ various concurrency control mechanisms. Three primary strategies address potential conflicts:

Pessimistic concurrency control: Locks data to prevent conflicts
Optimistic concurrency control: Assumes that conflicts are rare and resolves them when they occur
Multi-version concurrency control: Uses multiple versions of data to resolve conflicts

Let's discuss them in detail.

Pessimistic concurrency control

In pessimistic concurrency control (PCC), the system assumes that conflicts are likely to occur.
Therefore, it locks the data before a transaction starts, preventing other transactions from accessing it until the lock is released, as shown:

Figure 2.1 – Pessimistic strategy

This method is effective in preventing conflicts but can lead to performance bottlenecks, particularly in high-traffic systems where multiple users are accessing the same data concurrently.

Optimistic concurrency control

Optimistic concurrency control (OCC) operates on the assumption that conflicts are rare. Instead of locking data preemptively, the system allows multiple transactions to proceed concurrently and checks for conflicts only at the time of the transaction's commit. If a conflict is detected, such as two transactions trying to modify the same data simultaneously, the system takes corrective action, such as aborting or retrying the conflicting transaction.

Figure 2.2 – Optimistic strategy

OCC is particularly useful in environments where read operations are more frequent than writes, as it enables higher performance and reduced contention among transactions.

Multi-version concurrency control

Multi-version concurrency control (MVCC) offers a more nuanced solution by allowing multiple versions of data to exist simultaneously. Rather than locking data or assuming that conflicts are rare, MVCC creates a new version of the data each time it is modified. When a transaction reads the data, it retrieves the most recent committed version that was available at the time the transaction began.

Figure 2.3 – Multi-version strategy

This ensures that readers are never blocked by writers, and writers can update data without waiting for readers to finish. MVCC allows for high levels of concurrency while minimizing contention and ensuring consistency:

How it works: When a user reads data, they see a snapshot of the database at a particular point in time, ensuring consistency without the need to lock resources. If multiple versions of the same record exist, each transaction sees the version that was valid at the start of the transaction. Writers create new versions of data rather than overwriting the existing version, which allows for better handling of concurrent operations.

Conflict resolution: In MVCC, all transactions are allowed to complete, with each creating a new version of the data. Readers see a consistent snapshot of the database as of the start of their transaction, while subsequent reads or queries can use the latest committed version. If multiple versions of the same record exist, the system resolves conflicts at read time, typically by exposing the latest version in the metadata or by merging updates as per the system's conflict resolution strategy.

Locking granularity

Locking granularity refers to the level of data specificity when applying locks. Choosing the right granularity is crucial, as it affects concurrency, performance, and conflict resolution. Here are three common locking granularities:

Row-level locking: Specific rows are locked, allowing more granular control but also increasing the risk of deadlocks
Table-level locking: The entire table is locked, reducing concurrency but simplifying conflict prevention
Partition-level locking: A specific partition of data is locked, often used in distributed databases to allow other partitions to be accessed concurrently
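Before moving on to distributed systems, here is a minimal, self-contained Python sketch of the optimistic strategy described above, independent of any particular table format: the writer records the version it started from, does its work, and re-checks that version at commit time, retrying if another writer got there first. The Table class and its fields are hypothetical and exist only for illustration.

import random
import time

class VersionConflict(Exception):
    """Raised when another writer committed first."""

class Table:
    """A toy, in-memory stand-in for a versioned table (hypothetical)."""
    def __init__(self):
        self.version = 0
        self.rows = {}

    def commit(self, base_version, changes):
        # Commit succeeds only if the table is still at the version
        # the writer started from (compare-and-swap semantics).
        if self.version != base_version:
            raise VersionConflict(f"expected v{base_version}, found v{self.version}")
        self.rows.update(changes)
        self.version += 1
        return self.version

def occ_write(table, changes, max_retries=5):
    """Optimistically apply `changes`, retrying on conflict."""
    for _ in range(max_retries):
        base = table.version                    # snapshot the starting version
        # ... arbitrary work against the snapshot happens here ...
        try:
            return table.commit(base, changes)
        except VersionConflict:
            time.sleep(random.uniform(0, 0.1))  # back off, then retry on the new state
    raise RuntimeError("gave up after repeated conflicts")

t = Table()
print(occ_write(t, {"order_1": "shipped"}))     # -> 1

The same check-then-commit shape appears again later in this article when we look at how the table formats commit metadata.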
Conflict resolution in distributed systems

Distributed systems introduce added complexity, as conflicts must be managed across different nodes that might hold different versions of the data. Techniques such as consensus protocols (Paxos and Raft) and MVCC are employed to ensure that conflicts are resolved efficiently and that the data remains consistent across all nodes. In some cases, systems also leverage conflict-free replicated data types (CRDTs), which enable concurrent updates to be merged automatically, ensuring eventual consistency without requiring complex coordination.

In distributed systems, MVCC ensures that transactions operate on a consistent snapshot of the data, with any modifications written as a new version. This approach allows multiple transactions to coexist without blocking each other, enabling higher concurrency and efficient resource utilization. While MVCC provides snapshot isolation and version management, replication of these versions across nodes is handled separately by the underlying storage system or coordination service (for example, HDFS, S3, or consensus protocols such as Raft).

Conflict resolution in lakehouse architectures

In lakehouse architectures, conflict resolution is a key challenge due to the decoupling of compute and storage, which allows multiple engines to operate on the same data. To address this, two concurrency control mechanisms are most widely used: OCC and MVCC.

With OCC, lakehouses allow transactions to proceed concurrently without locking data preemptively. At the time of committing the transaction, the system checks whether a conflict occurred. If a conflict is detected, the system can roll back or retry the conflicting transaction. This method aligns well with the open storage layer in lakehouses, where multiple compute engines may simultaneously access the data.

Lakehouses implement MVCC to manage concurrent reads and writes without introducing locks. When data is modified, a new version of the data is created, allowing readers to continue accessing the previous version. This ensures that readers are never blocked by ongoing write operations, and writers can proceed without waiting for readers.

Apache Iceberg and Delta Lake both implement MVCC to provide snapshot isolation and support concurrent reads and writes. Apache Hudi provides a similar guarantee through its commit timeline and supports snapshot isolation in Copy-on-Write mode, while Merge-on-Read enables near-real-time incremental views.

By leveraging OCC and MVCC, lakehouse architectures maintain the integrity and consistency of data even in environments with high levels of concurrency. This ensures that users can operate on data without running into conflicts, enabling smooth collaboration and efficient processing of large-scale datasets.

An essential aspect of conflict resolution in lakehouse architecture is version control. Lakehouses maintain versions of data as part of their transactional system, allowing users to view the state of the data at any given point in time. This feature is not only useful for conflict resolution but also for auditing and rollback purposes, allowing users to revert to previous versions of the data if necessary.

This version control ensures that no data is lost and any conflicts can be traced and resolved, ensuring the integrity of the system over time. This is particularly valuable for enterprises with compliance requirements that demand data lineage and historical tracking.
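As an illustration of this version control in practice, most table formats expose the commit history for auditing and time travel. The following sketch uses Delta Lake's Python API on Spark (it assumes an active spark session with Delta enabled and an existing table at the placeholder path /tmp/orders); Hudi and Iceberg offer comparable time-travel options.

from delta.tables import DeltaTable

path = "/tmp/orders"  # hypothetical table location

# Inspect the commit history recorded in the transaction log
DeltaTable.forPath(spark, path).history().select(
    "version", "timestamp", "operation"
).show()

# Time travel: read the table as it looked at an earlier version
old_snapshot = spark.read.format("delta").option("versionAsOf", 0).load(path)
old_snapshot.show()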
In the next section, we'll explore how these concepts apply to the lakehouse architecture, specifically the storage engine component. We'll delve into table management services and transactional integrity, discussing how lakehouse storage engines provide ACID properties, concurrency control, and more.

Understanding the storage engine

In Chapter 1, Open Data Lakehouse: A New Architectural Paradigm, we introduced the storage engine as the critical component responsible for managing how data is stored, retrieved, and maintained in a data lakehouse architecture. In this chapter, we will deepen that understanding by exploring how the storage engine integrates with the open table format to provide transactional capabilities. However, before we dive into the specific capabilities of a storage engine, let's first examine the various sub-components that make up the storage engine within a traditional database system. This will help clarify why the storage engine is so essential in an open lakehouse architecture.

Components of a traditional database storage engine

Storage engines enable transactional capabilities in a lakehouse, so it's helpful to first look at what makes up a traditional database storage engine. Understanding its sub-components highlights why this layer is essential for efficient data management. The following diagram illustrates the architecture of a database system, with the storage engine highlighted at the bottom:

Figure 2.4 – Architecture of a traditional database management system

In traditional database management systems (DBMSs), the storage engine is the core layer responsible for managing how data is physically stored, retrieved, and maintained. It handles the low-level operations of the database, such as reading and writing data to storage, managing transactions, and enforcing concurrency. As shown in Figure 2.4, the storage engine includes components such as the lock manager, access methods, buffer manager, and recovery manager. Each of these components plays a vital role in ensuring the efficiency, accuracy, and durability of database operations. The lakehouse architecture draws inspiration from traditional database systems by incorporating a storage engine to provide transactional capabilities across massive datasets stored in systems such as Amazon S3, Google Cloud Storage, and Azure Blob Storage.

Let's take a look at the four sub-components of a storage engine and understand their need in today's lakehouse architecture:

Lock manager: The lock manager is a critical component in database systems, ensuring that multiple operations can access shared resources without interfering with one another. In a lakehouse architecture, this role becomes crucial due to the distributed nature of data and because different compute engines, such as Spark, Flink, or Presto, operate on the same dataset simultaneously, running diverse sets of workloads. A lock manager is therefore needed to prevent conflicts and maintain data correctness under concurrent operations. By bringing effective concurrency control mechanisms to coordinate multiple readers and writers, the lock manager upholds data integrity, ensuring that workloads run smoothly without conflicts or data corruption.

Access methods: In traditional database systems, the access methods component determines how data is accessed and retrieved, often relying on storage structures such as B-trees, hashing, or log-structured merge-trees (LSM-trees) for efficient data retrieval. Similarly, in a lakehouse architecture, there is a need to effectively organize data and metadata so that compute engines can access data in storage efficiently.
Lakehouse table formats employ a directory-based access method for efficient data retrieval from distributed cloud object stores. For example, the Apache Hudi table format organizes data into file groups and file slices (discussed in Chapter 4, Apache Hudi Deep Dive) within its timeline – a type of event log. Indexes operate on top of this structure to help compute engines quickly locate the right file groups for faster reads and writes. Apache Iceberg uses a tree-based structure to organize metadata files (manifests), which enables efficient pruning of irrelevant data files. Similarly, Delta Lake employs a directory-based access method, combined with its transaction log, to track all transactions and help prune data files.

Buffer manager: The buffer manager in a database system is essential for improving system performance by caching frequently accessed data, reducing the need to repeatedly access slower storage. In lakehouse architectures, where cloud object stores serve as the storage layer, there is a fundamental trade-off between faster data ingestion and optimal query performance. Writing smaller files or logging deltas can speed up ingestion, but for optimal query performance, it is critical to minimize the number of files accessed and pre-materialize data merges. This is where a buffer manager becomes essential in a lakehouse context. It addresses the issue by caching frequently accessed or modified data, allowing query engines to avoid repeated access to slower lake storage. For example, in Apache Hudi, there are proposals to tightly integrate a columnar caching service that can sit between the lake storage and query engines, using the Hudi timeline to ensure cache consistency across transactions and mutations. This approach would not only improve read performance by serving data directly from the cache but also amortize the cost of merge operations by compacting data in memory.

Recovery manager: The recovery manager is responsible for ensuring data durability and consistency, especially in the event of system failures. In a lakehouse architecture, the recovery manager takes on the role of maintaining data reliability across distributed storage systems. This involves mechanisms that allow the system to recover data to a consistent state, even after unexpected disruptions. Open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake use metadata snapshots and WALs to guarantee durability and facilitate recovery. For example, Iceberg uses immutable snapshots to enable data rollback and point-in-time recovery, while Hudi uses a timeline of commits to maintain consistent versions of the dataset and to revert changes in unexpected scenarios. Delta Lake employs transaction logs that track all changes, enabling the system to revert to a stable state if needed. These capabilities ensure that lakehouses are resilient and can maintain data integrity, even at a massive scale.

While these four interconnected sub-components form the foundation for executing ACID-based transactions in a traditional database system, the storage engine's role extends beyond just ensuring transactional integrity. Another key responsibility of the storage engine is to continuously optimize the data layout within the storage layer.
In traditional databases, the storage engine works closely with the administration, monitoring, and utilities components (as illustrated in Figure 2.4) to manage the organization and structure of files, ensuring efficient data access and performance. This concept carries over to modern lakehouse architecture, where the storage engine is not only responsible for maintaining transactional consistency but also for delivering a comprehensive suite of table management services. These services include cleaning, compaction, clustering, archival, and indexing: operations that are essential for ensuring that the storage layout remains optimized so that different compute engines can efficiently query the data. By organizing the underlying files and metadata, the storage engine enables query engines to leverage advanced optimization techniques, such as predicate pushdown, data skipping, and pruning, ultimately improving query performance across various workloads.

With this understanding, we can dive deeper into the two fundamental responsibilities of the lakehouse storage engine: transactional capabilities and table management services.

How a lakehouse handles transactions

One of the defining attributes of the lakehouse architecture is its ability to provide transactional guarantees when running various Data Definition Language (DDL) and Data Manipulation Language (DML) operations, something that distinguishes it from traditional data lakes, which often lack such capabilities. These transactional features align the lakehouse architecture with the reliability and consistency typically associated with data warehouses (OLAP databases), while still retaining the flexibility of data lakes. In this section, we will explore how transactional integrity is achieved in the three table formats through ACID guarantees and concurrency control mechanisms, which allow multiple users and applications to safely interact with the same dataset without compromising data integrity.

ACID guarantees

ACID properties are essential for ensuring that data operations are reliable, consistent, and isolated from each other. In lakehouses, the need for these guarantees arises from the complexities of handling concurrent reads and writes, managing multiple compute engines, and supporting large-scale applications accessing the same data. Each of the table formats, Apache Hudi, Apache Iceberg, and Delta Lake, implements ACID properties slightly differently, given the distinct structure of the formats. Let's understand how the design decisions of each table format ensure ACID properties in a lakehouse.

Apache Hudi ACID

Hudi achieves ACID guarantees via its timeline – a WAL that keeps track of all the transactions occurring in the table. Atomic writes in Hudi are achieved by publishing commits atomically to the timeline, where each commit is stamped with an instant time. This instant time denotes the point at which the action is deemed to have occurred, ensuring that changes are either fully committed or rolled back, preventing partial updates. This guarantees that readers never see partially written data, maintaining atomicity.
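As a simple illustration, an upsert into a Hudi table from PySpark looks roughly like the following; every such write publishes a new commit (instant) on the timeline. The table name, record key, and path are placeholders, and the options shown follow Hudi's commonly documented write configs, so verify them against your Hudi version.

# Illustrative PySpark upsert into a Hudi table; each write adds a commit to the timeline.
hudi_options = {
    "hoodie.table.name": "trips",                          # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "trip_id",  # primary key field
    "hoodie.datasource.write.precombine.field": "ts",      # latest record wins when keys collide
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")        # df: an existing Spark DataFrame of trip records
      .options(**hudi_options)
      .mode("append")
      .save("s3://my-bucket/lake/trips")                   # hypothetical base path
)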
Consistency in Hudi is primarily ensured through its timeline mechanism, which acts as a structured log of all actions (insert, update, and delete) represented as a sequence of "instants." Only committed instants (those that complete the requested → inflight → completed cycle) are visible to readers and writers, ensuring the consistent ordering of events and preventing partial or incomplete writes from affecting the dataset. In concurrent scenarios, each transaction is assigned a unique, increasing timestamp, guaranteeing that operations are applied in the correct order and avoiding conflicts between overlapping modifications.

In addition to this, Hudi enforces primary key uniqueness as a data integrity rule. Each record is assigned a primary key, which helps map data to specific partitions and file groups, allowing efficient updates and ensuring duplicate records are not introduced into the dataset.

Apache Hudi uses a combination of MVCC and OCC to ensure isolation between different types of processes. Hudi distinguishes between writer processes (which handle data modifications such as upserts and deletes), table services (which perform tasks such as compaction and cleaning), and readers (which execute queries). These processes operate on consistent snapshots of the dataset, thanks to MVCC, ensuring that readers always access the last successfully committed data, even when writes are ongoing. MVCC enables non-blocking, lock-free concurrency control between writers and table services, and between different table services, allowing for smooth parallel execution. At the same time, Hudi uses OCC to manage concurrency between writers, ensuring that conflicting operations are detected and handled without compromising data integrity. This dual approach provides snapshot isolation across all processes, maintaining isolation while efficiently supporting concurrent operations.

Hudi provides durability via its WAL (the timeline), which comprises Apache Avro-serialized files containing individual actions (such as commit, compaction, or rollback), ensuring that all changes made to the dataset are permanent and recoverable, even in the event of system failures. Any changes made during a transaction are first written to the WAL, which acts as a temporary log of changes before they are applied to the actual dataset. This mechanism ensures that, in case of a crash or failure, the system can recover from the WAL and complete any pending operations or roll back uncommitted changes, preventing data loss. Additionally, both metadata and data files in Hudi are stored in systems such as cloud object stores or HDFS, which allows the format to take advantage of their durable nature.

Apache Iceberg ACID

Apache Iceberg guarantees ACID properties by organizing data into a persistent tree structure of metadata and data files, with the entire table state represented by the root of this tree. Each commit to the dataset, such as an insert, update, or delete, creates a new tree with a new root metadata file. The key design element for atomicity is Iceberg's use of an atomic compare-and-swap (CAS) operation for committing these new metadata files to the catalog.

The catalog stores the location of the current metadata file, which holds the complete state of all active snapshots of the table.
When a writer performs a change, it generates a new metadata file and attempts to commit the update by swapping the current metadata location with the new one. This CAS operation ensures that either the entire change is applied successfully or none at all, preventing any partial updates. In the case of concurrent writes, if two writers attempt to commit changes based on the same original metadata file, only the first writer's commit will succeed. The second writer's commit will be rejected, ensuring that no conflicting or incomplete changes are written. Technically, any kind of database or data store can be used as a catalog in Apache Iceberg; the only requirement is that it provides a locking or atomic-swap mechanism to prevent conflicting commits. The atomic swap mechanism is fundamental to maintaining the integrity of the table and forms the backbone of Iceberg's transactional model.

Iceberg's metadata tree structure and commit process are designed to ensure consistency guarantees. At the core of Iceberg's consistency model is the guarantee that every commit must replace an expected table state with a new, valid state. Each writer proposes changes based on the current table state, but if another writer has modified the table in the meantime, the original writer's commit will fail. In such cases, the failed writer must retry the operation with the new table state, preventing the system from introducing any inconsistencies due to concurrent writes. This mechanism enforces a linear history of changes, where each commit builds on the most recent table version (snapshot), avoiding scenarios where the table could be left in a partially updated state. Iceberg ensures that as much of the previous work as possible is reused in the retry attempt. This efficient handling of retries reduces the overhead associated with repeated write operations and ensures that concurrent writers can work independently without causing inconsistencies. Readers only access fully committed data, as changes are atomically applied, maintaining a consistent view of the table without partial updates.

Apache Iceberg supports both snapshot isolation and serializable isolation to ensure that readers and concurrent writers remain isolated from each other. Iceberg uses OCC, where writers operate on the assumption that the table's current version will not change during their transaction. A writer prepares its updates and attempts to commit them by swapping the metadata file pointer from the current version to a new one. If another writer commits a change before this operation completes, the first writer's update is rejected, and it must retry based on the new table state. This mechanism guarantees that uncommitted changes from multiple writers never overlap or conflict, ensuring isolation between concurrent operations. Readers also benefit from isolation, as they access a consistent snapshot of the table, unaffected by ongoing writes, until they refresh the table to see newly committed changes. The atomic swap of metadata (in the catalog) ensures that all operations are isolated and fully applied, preventing partial updates from being exposed to readers or other writers.
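Conceptually, this commit protocol can be pictured as a compare-and-swap on the catalog's metadata pointer. The following Python sketch illustrates the idea only; it is not Iceberg's actual implementation or API, and the toy Catalog class here is just a dictionary guarded by a lock.

import threading

class Catalog:
    """Toy catalog: maps a table name to the location of its current metadata file."""
    def __init__(self):
        self._lock = threading.Lock()
        self.pointer = {"db.orders": "metadata/v1.json"}

    def swap(self, table, expected, new):
        # Atomic compare-and-swap: succeed only if nobody else committed first.
        with self._lock:
            if self.pointer[table] != expected:
                return False
            self.pointer[table] = new
            return True

def commit(catalog, table, write_new_metadata, max_retries=3):
    """The retry loop a writer follows: read the pointer, write new metadata, attempt the swap."""
    for _ in range(max_retries):
        current = catalog.pointer[table]      # metadata file the writer based its work on
        new = write_new_metadata(current)     # writes a new metadata file referencing new data files
        if catalog.swap(table, expected=current, new=new):
            return new                        # commit succeeded
        # else: another writer won the race; loop and rebase on the new table state
    raise RuntimeError("commit failed after retries")

cat = Catalog()
print(commit(cat, "db.orders", lambda cur: cur.replace("v1", "v2")))  # -> metadata/v2.json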
Apache Iceberg ensures durability by committing both data and metadata updates to fault-tolerant storage systems, such as HDFS, Amazon S3, or Google Cloud Storage. This is reinforced by Iceberg's tree-structured table format, where each committed snapshot represents a consistent, durable state of the table. Once a transaction is committed, both metadata and underlying data files are safely persisted to distributed storage. The commit process is atomic: the new metadata version is updated only after all changes are fully written and conflict-free. This guarantees that partially committed or failed transactions are excluded from valid snapshots, leaving the table in its last consistent state and preventing any risk of data corruption. Since Iceberg's metadata is stored alongside data files in cloud object storage, tables benefit from the durability guarantees of the underlying storage. In the event of a system crash or failure, Iceberg can recover to the latest stable state by referencing the most recent snapshot.

Delta Lake ACID

Delta Lake ensures atomicity by committing all changes, whether adding new data or marking old files for removal, as a single, indivisible transaction. This is achieved through Delta Lake's delta log, a WAL that tracks every operation on the table, ensuring that changes are logged in an all-or-nothing manner. Each write operation generates new data files (typically in Parquet format) and appends a corresponding commit entry to the delta log, recording both the files added and the files logically removed. More importantly, a commit is only considered successful when the entire operation, including all file updates, is fully recorded in the log. If the commit fails at any point, due to a system failure, conflict, or other issue, none of the new files will be visible to the system, guaranteeing that the transaction either fully completes or has no impact at all. The atomic commit process ensures that the table remains in a consistent state, avoiding scenarios where partial writes could lead to data inconsistencies.

Consistency in Delta Lake is achieved by ensuring that every transaction adheres to the rules and constraints defined for the dataset. This is enforced through Delta Lake's transaction log, which serves as the system's source of truth for all changes. Every write operation, whether it involves adding new data, updating existing records, or removing files, is validated against the current state of the table, ensuring that schema constraints are maintained and that no conflicting or invalid changes are applied. The delta log keeps a detailed record of each transaction, allowing Delta Lake to enforce data integrity checks, such as ensuring that schema changes are compatible and that records remain unique where required.

Because Delta Lake follows an append-only model, data files are never modified in place. Instead, new data files are created, and older files are logically marked for removal through entries in the delta log. The system uses this log to assemble a consistent, up-to-date snapshot of the dataset for both readers and writers. This design prevents inconsistencies by ensuring that every version of the table is constructed from valid, committed transactions. If a transaction violates a constraint (such as a schema mismatch or an attempt to modify data that has been concurrently updated by another writer), the transaction is aborted and must be retried. This guarantees that no invalid data is committed to the table, preserving the consistency of the dataset.
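To see this log in action, you can peek at the _delta_log directory that Delta Lake keeps next to the data files. The sketch below (assuming a local table at the placeholder path /tmp/orders) prints the action types, such as commitInfo, add, and remove, recorded in each JSON commit file.

import json
from pathlib import Path

log_dir = Path("/tmp/orders/_delta_log")   # hypothetical local Delta table

# Each committed transaction is a monotonically numbered JSON file in this directory.
for commit_file in sorted(log_dir.glob("*.json")):
    actions = [
        list(json.loads(line).keys())[0]   # each line holds one action object
        for line in commit_file.read_text().splitlines() if line.strip()
    ]
    print(commit_file.name, "->", actions)  # e.g. 00000000000000000001.json -> ['commitInfo', 'remove', 'add']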
Delta Lake provides snapshot isolation for reads and serializable isolation for writes, ensuring that transactions operate on a consistent view of the data. For reads, snapshot isolation guarantees that readers always see a stable version of the dataset, unaffected by ongoing writes, as they are based on previously committed snapshots of the data. Readers only view changes once they are fully committed, which prevents any "dirty reads" of uncommitted data.

For writes, Delta Lake uses OCC to enforce serializable isolation. Writers operate on potentially different snapshots of the data, assuming that no other transactions have modified the snapshot they are working on. However, before committing changes, Delta Lake verifies whether the snapshot is still valid. If another writer has committed changes to the same portion of the dataset, the transaction fails, and the writer must retry with the updated snapshot. This prevents conflicting updates from being committed.

The core mechanism that enables this separation of read and write operations is MVCC, which allows Delta Lake to track multiple versions of the dataset over time, ensuring that readers can work on consistent snapshots while writers operate on isolated versions of the data. If a conflict arises between multiple writers, only one can successfully commit, preserving isolation and ensuring that no overlapping changes are applied to the dataset.

Similar to Apache Hudi and Apache Iceberg, Delta Lake guarantees durability by storing all transactions in fault-tolerant, persistent storage systems such as Amazon S3, Azure Blob Storage, or HDFS. Once a transaction is successfully committed to the delta log, both the data files and the metadata changes are durably stored, ensuring that they are protected even in the event of a system failure. Delta Lake's use of cloud object storage means that every write is replicated across multiple locations, providing resilience and strong consistency. Delta Lake can recover from any failed state by relying on the last successful transaction recorded in the delta log, making sure that no committed data is ever lost.

Table management services

In a typical database system, table management services are part of the administration, monitoring, and utilities component. The storage engine interfaces with these services to keep the underlying data files optimized, enabling query engines to interact efficiently with the data. This not only ensures that transactions are ACID-compliant but also that the data is accessible in a way that allows for faster reads and writes, better query performance, and optimal storage use. In an open lakehouse architecture, services such as compaction, clustering, indexing, and cleaning focus on organizing and maintaining data in a way that benefits both write-side transactions and read-side analytical queries.

Let's understand how these services can be utilized in all three table formats.

Compaction/file sizing

Compaction refers to the process of merging smaller files into larger, more efficient ones. This is critical in lakehouse environments where frequent updates or inserts can lead to the creation of many small files, known as the "small file problem," which can degrade query performance and inflate storage costs. File sizing through compaction optimizes the storage layout by ensuring that file sizes remain optimal for query engines to read efficiently.
All three table formats allow compacting the underlying data files using a bin-packing algorithm.

Apache Hudi is designed to avoid the creation of small files by dynamically adjusting file sizes during the ingestion process, using several configuration parameters such as Max File Size, Small File Size, and Insert Split Size. Hudi manages file sizes in two main ways: auto-sizing during writes and clustering after writes. During ingestion, Hudi ensures that data files are written to meet the defined Max File Size (for example, 120 MB). If small files, those below the Small File Size threshold, are created, Hudi merges them during compaction cycles to maintain optimal file sizes. This automatic merging not only improves query performance but also reduces storage overhead, helping keep file sizes balanced without requiring manual intervention.

In addition to auto-sizing during writes and clustering after writes, Hudi also manages file merging for its Merge-on-Read (MoR) table type (discussed further in Chapter 4, Apache Hudi Deep Dive). In a MoR table, each file group contains a base file (columnar) and one or more log files (row-based). During writes, inserts are stored in the base file while updates are appended to log files, avoiding synchronous merges and reducing write amplification, which improves write latency. During the compaction process, the log file updates are merged into the base file to create a new version of the base file. This ensures that users can read the latest snapshot of the table, helping to meet data freshness SLAs. Hudi offers flexibility by allowing users to control both the frequency and strategy of compaction, providing greater control over file sizing and data performance.
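For reference, these sizing knobs map to Hudi write configs roughly as follows. This is a hedged sketch; the property names follow Hudi's documentation, but defaults and availability vary by version, and the values shown are examples only.

# Illustrative Hudi file-sizing options for a Spark write (values are examples only)
file_sizing_options = {
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),     # target max base file size (~120 MB)
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # files below this are bin-packing candidates
    "hoodie.copyonwrite.insert.split.size": "500000",           # records per insert bucket
}

(
    df.write.format("hudi")
      .options(**hudi_options)          # base options from the earlier upsert sketch
      .options(**file_sizing_options)
      .mode("append")
      .save("s3://my-bucket/lake/trips")
)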
Compaction in Iceberg is a maintenance operation that is achieved using the rewriteDataFiles action, which can run in parallel on Spark. This action allows users to specify a target file size, ensuring that small files are merged into larger files, typically over 100 MB in size. For example, a user can set the target file size to 500 MB to ensure that data is written optimally, reducing both the number of files and the associated file-open costs:

SparkActions.get().rewriteDataFiles(table)
    .filter(Expressions.equal("date", "2020-08-18"))
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();
Additionally, Iceberg supports merging delete files with data files, which is particularly needed for MoR tables that generate delete files. This merging process reduces the number of separate files that query engines need to scan, further enhancing performance.

In Delta Lake, compaction is done using the OPTIMIZE command. The bin-packing algorithm used by OPTIMIZE aims to create files that are more efficient for read queries, focusing on ensuring optimal file sizes rather than the number of tuples per file. This results in fewer but larger files that are easier and faster to read. For example, users can execute the following command to compact all files in a table:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, pathToTable)
deltaTable.optimize().executeCompaction()

For cases where only a subset of the data needs to be compacted, users can add a partition predicate to target specific partitions, enhancing flexibility:

deltaTable.optimize().where("date='2021-11-18'").executeCompaction()

The compaction process in Delta is idempotent, meaning that running it multiple times on the same dataset will not result in further changes once the compaction has been completed. Delta Lake 3.1.0 also introduces auto compaction, which automates the process of compacting small files immediately after a write operation. Auto compaction is triggered synchronously on the cluster that performs the write, combining small files within the table partitions that meet the file count threshold. Auto compaction settings can be controlled at both the session and table levels.

By reducing the number of small files, compaction improves both storage efficiency and query performance. Auto compaction abilities in Apache Hudi and Delta Lake allow for more proactive maintenance of tables, especially in write-heavy environments. We will learn more about how to execute compaction in the three table formats in Chapter 8, Performance Optimization and Tuning in a Lakehouse.

Clustering

Clustering refers to the process of organizing data in a way that groups similar records together, typically based on certain columns. This enables faster data retrieval by allowing query engines to scan fewer files or data blocks, improving the speed of analytical queries that filter based on the clustered columns. The core idea behind clustering is to rearrange data files based on query predicates rather than arrival time.

As data is ingested, it is co-located by arrival time (for example, by the same ingestion job). However, most queries are likely to filter based on event time or other business-specific attributes (for example, city ID and order status). The clustering service reorganizes these files so that frequently queried data is grouped together, allowing for data skipping and significantly improving the efficiency of scans. For example, in Figure 2.5, data ingested in three separate commits (not sorted by City_ID) is later sorted and clustered so that files are rewritten based on a City_ID range, allowing queries to target specific ranges more efficiently. Clustering techniques typically involve two main strategies: simple sorting and multi-dimensional clustering (including Z-ordering and Hilbert curves).
In all three table formats (Apache Hudi, Apache Iceberg, and Delta Lake), clustering is designed to organize data optimally in storage.

Figure 2.5 – Clustering based on the City_ID field

Hudi introduces a clustering service that is tightly coupled with Hudi's storage engine. It reorganizes data for more efficient access based on frequently used query patterns, while still supporting fast data ingestion. The clustering process is asynchronous, which allows new data writes to continue without interruption while clustering is performed in the background. Hudi's MVCC ensures snapshot isolation, so both the clustering process and new data writes can occur concurrently without interference, preserving transactional integrity for readers and writers.

Hudi's clustering workflow consists of two primary steps:

Scheduling clustering: A clustering plan is generated using a pluggable strategy to identify the optimal way to reorganize files based on query performance.
Executing clustering: The plan is executed by rewriting data into newly optimized files, replacing the older, less efficient ones, while minimizing query overhead.

Clustering also enhances the scalability of Hudi in handling high-ingestion workloads while maintaining optimized file layouts for fast reads. Beyond simple sorting, Hudi supports advanced techniques such as space-filling curves (for example, Z-ordering and Hilbert curves) to further optimize the storage layout. Space-filling curves enable Hudi to efficiently order rows based on a multi-column sort key, preserving the ordering of individual columns within the sort. For instance, a query engine can use Z-ordering to group related data together in a way that minimizes the amount of data scanned. Unlike traditional linear sorting, space-filling curves allow the query engine to reduce the search space significantly when filtering by any subset of the multi-column key, resulting in orders-of-magnitude speed-ups in query performance.

Clustering in Apache Iceberg is handled slightly differently compared to Apache Hudi, but the overall idea is the same. While Hudi integrates clustering as part of its storage engine, Iceberg offers clustering as an API that can be utilized by query engines such as Apache Spark or Apache Flink. Iceberg uses the rewrite_data_files API to optimize the layout of data files by sorting them either linearly or hierarchically. This method helps ensure that frequently queried data is organized more efficiently, reducing the need to scan unnecessary data. For example, by using a query engine such as Spark, users can trigger compaction and clustering of data files, ensuring that related data points are stored together in larger, contiguous blocks. Iceberg also allows sorting across multiple dimensions using Z-ordering. Z-ordering enables the system to sort data across multiple columns simultaneously, ensuring that related data, across various dimensions, is stored close together. This can be especially beneficial for queries that filter on multiple columns.

In Delta Lake, clustering is achieved via the executeZOrderBy API, which allows users to sort data across one or more columns that are commonly used in query predicates. For example, running the OPTIMIZE table ZORDER BY (eventType) command will reorganize the data to improve query efficiency for predicates filtering on eventType. Z-ordering is especially effective for high-cardinality columns but is not idempotent, meaning it creates a new clustering pattern each time it's executed. This approach balances tuples across files for more efficient access, although the resulting file sizes might vary based on the data itself.
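For illustration, the same Z-ordering can also be invoked through Delta Lake's Python API; the table path and column names below are placeholders. Iceberg exposes a comparable capability through the sort and Z-order strategies of the rewrite_data_files action mentioned above.

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/tmp/events")   # hypothetical events table

# Rewrite files so that rows with nearby eventType (and date) values end up co-located,
# letting queries that filter on these columns skip unrelated files.
deltaTable.optimize().executeZOrderBy("eventType", "date")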
In Delta Lake 3.1.0 and above, a new feature called liquid clustering was introduced, providing enhanced flexibility in managing data layout. Unlike traditional clustering approaches, liquid clustering allows users to dynamically redefine clustering columns without needing to rewrite existing data. This makes it easier to adapt the data layout as analytical needs evolve over time, ensuring that the system remains optimized for changing query patterns.

With clustering in these table formats, data skipping becomes highly efficient, as queries can avoid scanning unnecessary files, resulting in faster execution and lower compute costs. In Chapter 8, Performance Optimization and Tuning in a Lakehouse, we will explore the various clustering strategies and see how they can be executed using query engines such as Apache Spark.

Indexing

Indexes are data structures typically used in database systems to accelerate query performance by allowing quick lookups of data. They provide a mechanism to locate specific records without scanning the entire dataset, significantly speeding up read operations. Of the three table formats (Apache Hudi, Iceberg, and Delta Lake), Apache Hudi is the only one that provides explicit support for indexes to enhance both data retrieval and write operations (such as updates and deletes).

Hudi introduces a pluggable indexing subsystem, built on top of its metadata table (discussed in Chapter 4, Apache Hudi Deep Dive), which stores auxiliary data about the table that is essential for organizing and optimizing access to records. Unlike traditional indexing approaches that might lock the system during index creation, Hudi's indexing system operates asynchronously, allowing write operations to continue while indexes are built and maintained in the background. This asynchronous metadata indexing service is an integral part of the storage engine and ensures that write latency is not impacted even during index updates, making it ideal for large-scale tables where indexing might otherwise take hours.

The asynchronous nature of Hudi's indexing brings two key benefits:

Improved write latency: Indexing occurs in parallel with data ingestion, enabling high-throughput writes without delays
Failure isolation: Since indexing runs independently, any failures in the indexing process do not impact ongoing writes, ensuring operational stability

Hudi currently supports several types of indexes, including a files index, a column_stats index, a bloom_filter index, and a record-level index. As data continues to grow in volume and complexity, Hudi's pluggable indexing subsystem allows adding more types of indexes to further improve I/O efficiency and query performance. One of the challenges faced by traditional indexing approaches is the requirement to stop all writers when building a new index. In contrast, Hudi's asynchronous indexing allows for dynamic addition and removal of indexes while concurrent writers continue to function, providing a database-like ease of use. This design brings the reliability and performance typical of database systems into the lakehouse paradigm, ensuring that indexing can scale seamlessly as data grows.
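As a rough sketch of how these indexes are switched on, the metadata-table indexes are typically enabled through write configs along the following lines. Treat the property names as assumptions; they follow Hudi's documentation, but exact names and availability depend on the Hudi version.

# Illustrative options for enabling metadata-table indexes on a Hudi write
index_options = {
    "hoodie.metadata.enable": "true",                     # maintain the metadata table (files index)
    "hoodie.metadata.index.column.stats.enable": "true",  # column_stats index for data skipping
    "hoodie.metadata.index.bloom.filter.enable": "true",  # bloom_filter index for key lookups
    "hoodie.metadata.index.async": "true",                # build and maintain indexes asynchronously
}

(
    df.write.format("hudi")
      .options(**hudi_options)      # base options from the earlier upsert sketch
      .options(**index_options)
      .mode("append")
      .save("s3://my-bucket/lake/trips")
)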
Cleaning

Cleaning refers to the removal of obsolete or unreferenced files from the storage layer to free up space and improve performance. As data evolves over time, old versions, snapshots, or deleted records can accumulate, leading to unnecessary storage overhead. The cleaning services provided by Hudi, Iceberg, and Delta Lake ensure that only active data remains, maintaining both storage efficiency and performance.

Apache Hudi offers a cleaner service (as part of its storage engine) that plays a crucial role in managing space reclamation by removing old file slices while maintaining snapshot isolation through its MVCC system. This allows Hudi to balance the retention of data history for time travel and rollbacks while keeping storage costs in check. By default, Hudi automatically triggers cleaning after every commit to remove older, unreferenced files. This behavior can be configured based on user needs, for example, by adjusting the number of commits after which cleaning is invoked using the hoodie.clean.max.commits property.

Apache Iceberg provides several APIs for cleaning unused files, such as the expireSnapshots operation, which removes old snapshots and the data files they reference. Iceberg uses snapshots for time travel and rollback, but regularly expiring them prevents storage bloat and keeps table metadata small. Iceberg manages metadata cleanup by removing outdated JSON metadata files after each commit, configurable with the write.metadata.delete-after-commit.enabled property. Another maintenance capability is the ability to delete orphan files (that is, unreferenced files left behind by job failures) using the deleteOrphanFiles action, ensuring that even files missed by snapshot expiration are cleaned up. It is important to note that these services are not automatic and must be executed as part of a regular maintenance strategy.

In Delta Lake, the VACUUM command is used to remove files no longer referenced by the Delta table, with a default retention period of 7 days (configurable). Similar to Iceberg's APIs, VACUUM is not automatically triggered, and users must manually invoke it. The VACUUM process only deletes data files, while log files are deleted asynchronously after checkpoint operations. The retention period for log files is 30 days by default, but this can be adjusted with the delta.logRetentionDuration property.

By offering these cleaning mechanisms, each system ensures that storage costs are controlled, old or unused data is removed efficiently, and performance remains optimized for large-scale workloads. We will explore more about the cleaning services in Chapter 8, Performance Optimization and Tuning in a Lakehouse.
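As a quick illustration, the Delta Lake and Iceberg cleanup operations mentioned above can be invoked from PySpark roughly as follows; the paths, catalog and table names, and retention values are placeholders. Hudi's cleaner, by contrast, typically runs automatically as part of the write path.

from delta.tables import DeltaTable

# Delta Lake: physically remove files that are no longer referenced and are
# older than the retention window (here 168 hours = 7 days).
DeltaTable.forPath(spark, "/tmp/orders").vacuum(168)

# Iceberg (Spark SQL procedure): expire old snapshots and the data files only they reference.
spark.sql("""
  CALL my_catalog.system.expire_snapshots(
    table => 'db.orders',
    older_than => TIMESTAMP '2024-01-01 00:00:00'
  )
""")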
To recap, the storage engine offers two core functionalities in a lakehouse architecture:

Transactional integrity: Guarantees ACID compliance and high concurrency, essential for maintaining data integrity under massive workloads
Table management services: Ensures the efficient organization and access of data through operations such as clustering, compaction, indexing, and cleanup

By incorporating robust cleaning mechanisms, lakehouse storage engines not only manage the data lifecycle effectively but also optimize performance and maintain cost efficiency, ensuring that modern data workloads run smoothly and reliably.

Conclusion

ACID transactions are the backbone of reliable lakehouse architectures, enabling consistent, isolated, and durable operations even in highly concurrent, distributed environments. By leveraging OCC and MVCC, open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake successfully bridge the gap between traditional databases and modern data lakes. To go deeper into lakehouse design, transactional internals, and table formats in practice, read Engineering Lakehouses with Open Table Formats by Dipankar Mazumdar and Vinoth Govindarajan. The book will help you pick the right table format for your needs, blending theoretical understanding with practical examples to enable you to build, maintain, and optimize lakehouses in production.