ACID Transactions in Lakehouse Architectures: MVCC, OCC, and Conflict Resolution with Iceberg, Hudi, and Delta Lake

Dipankar Mazumdar, Vinoth Govindarajan
29 Jan 2026
10 min read
DataPro is a weekly, expert-curated newsletter trusted by 120k+ global data professionals. Built by data practitioners, it blends first-hand industry experience with practical insights and peer-driven learning. Make sure to subscribe here so you never miss a key update in the data world.

Introduction

Modern data platforms must support massive scale, high concurrency, and multiple compute engines without sacrificing correctness. This is where the lakehouse architecture stands out by bringing ACID-compliant transactions to distributed, file-based data lakes. Through mechanisms such as optimistic concurrency control (OCC) and multi-version concurrency control (MVCC), lakehouses ensure reliable reads and writes even under heavy parallel workloads. This article explores how ACID properties, conflict resolution strategies, and storage engines work together to deliver consistency, durability, and performance in modern lakehouse systems.

Understanding transactions and ACID properties

Transactions are fundamental units of work in any database system. They represent a series of operations that are executed as a single, indivisible unit. The concept of a transaction ensures that either all the operations within it are executed successfully, or none at all, preserving the integrity of the system. In a transactional context, we focus on the atomicity, consistency, isolation, and durability (ACID) properties, which govern the behavior of these operations.

ACID properties serve as the backbone of robust data management systems. They are crucial in ensuring that transactional operations are reliable and consistent, even in the presence of multiple concurrent operations or system failures. Without ACID guarantees, data systems could become unreliable, leading to data corruption, inconsistencies, and operational inefficiencies.

Deep dive into ACID properties

To understand the foundation of transactional systems, let's break down the four core ACID properties that ensure data consistency and reliability:

- Atomicity: Atomicity ensures that a transaction is "all or nothing." If any part of a transaction fails, the entire transaction is rolled back, ensuring that partial updates do not leave the system in an inconsistent state. This is especially important in distributed systems, where partial updates can lead to a variety of issues, including data loss and corruption.
- Consistency: Consistency ensures that a transaction brings the database from one valid state to another. In other words, after the transaction is completed, the data should still satisfy all the integrity constraints (such as foreign keys, unique constraints, and so on). For example, if a table requires a column to have unique values, no transaction should violate this rule.
- Isolation: Isolation guarantees that the intermediate states of a transaction are invisible to other transactions. Even though multiple transactions might be running concurrently, they are isolated from each other. Different isolation levels (such as read committed and serializable) can be used based on the system's requirements and the balance between performance and consistency.
- Durability: Durability ensures that once a transaction is committed, it is permanently stored in the system, even in the event of a crash or failure. Durability is typically achieved through techniques such as logging and write-ahead logs (WALs) in storage engines to ensure that all changes are safely written to persistent storage.
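To make atomicity and durability concrete before we move on, here is a minimal, self-contained sketch using SQLite, chosen purely for illustration; the table, account names, and amounts are hypothetical. A failed step rolls the whole transfer back, so readers never observe a half-applied update:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so we manage
# transactions explicitly with BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect("bank.db", isolation_level=None)
conn.execute(
    "CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100), ('bob', 50)")

def transfer(src, dst, amount):
    try:
        conn.execute("BEGIN")  # start an atomic unit of work
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("COMMIT")  # durable once committed (written to the journal/WAL)
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # all-or-nothing: no partial update survives
        raise

transfer("alice", "bob", 30)      # succeeds atomically
# transfer("bob", "alice", 999)   # would violate the CHECK constraint and roll back atomically
```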
ACID properties in traditional databases

In traditional databases such as MySQL, PostgreSQL, and Oracle, the ACID properties have been well established. These systems use a tightly coupled architecture where both the compute and storage components are integrated, ensuring that transactions can be processed efficiently.

However, the introduction of distributed and cloud-native architectures has brought new challenges. In such environments, ensuring ACID properties across distributed systems requires specialized techniques such as 2-phase commit (2PC) or optimistic concurrency control to handle distributed transactions while maintaining consistency across nodes.

ACID properties in lakehouse architectures

The lakehouse architecture addresses some of the inherent limitations of traditional data lakes by introducing ACID-compliant transactional capabilities. In data lakes, operations such as updates and deletes were difficult to handle without complex workarounds. However, with open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake, these challenges are addressed, bringing ACID properties to the forefront in a distributed, file-based architecture. We will dive deep into each table format's transactional capabilities later in this chapter.

In a lakehouse, the storage engine manages transactions, ensuring that any operation, whether it's a write, update, or delete, is executed in a manner that conforms to ACID guarantees. This ensures that users can trust the consistency and durability of the data, even in large-scale, multi-petabyte environments where high concurrency and distributed operations are common.

Now that we've explored ACID properties in traditional databases and lakehouse architectures, let's dive deeper into the conflict resolution mechanisms in the next section.

Discovering conflict resolution mechanisms

In any multi-user or concurrent environment, conflicts can arise when multiple transactions attempt to update or modify the same data simultaneously. Conflict resolution is crucial for maintaining data integrity, particularly in systems where high levels of concurrency are expected, like those in distributed databases or lakehouse environments.

Traditional database systems employ various mechanisms to manage conflicts, ensuring that simultaneous operations do not lead to data corruption or inconsistencies. Two key concepts in this space are locking mechanisms (for example, pessimistic and optimistic locking) and concurrency control methods (for example, two-phase locking or timestamp ordering).

Types of conflict resolution

To ensure data consistency and integrity, transactional systems employ various concurrency control mechanisms. Three primary strategies address potential conflicts:

- Pessimistic concurrency control: Locks data to prevent conflicts
- Optimistic concurrency control: Assumes that conflicts are rare and resolves them when they occur
- Multi-version concurrency control: Uses multiple versions of data to resolve conflicts

Let's discuss each of them in detail.

Pessimistic concurrency control
In pessimistic concurrency control (PCC), the system assumes that conflicts are likely to occur. Therefore, it locks the data before a transaction starts, preventing other transactions from accessing it until the lock is released, as shown:

Figure 2.1 – Pessimistic strategy

This method is effective in preventing conflicts but can lead to performance bottlenecks, particularly in high-traffic systems where multiple users are accessing the same data concurrently.

Optimistic concurrency control

Optimistic concurrency control (OCC) operates on the assumption that conflicts are rare. Instead of locking data preemptively, the system allows multiple transactions to proceed concurrently and checks for conflicts only at the time of the transaction's commit. If a conflict is detected, such as two transactions trying to modify the same data simultaneously, the system takes corrective action, such as aborting or retrying the conflicting transaction.

Figure 2.2 – Optimistic strategy

OCC is particularly useful in environments where read operations are more frequent than writes, as it enables higher performance and reduced contention among transactions.

Multi-version concurrency control

Multi-version concurrency control (MVCC) offers a more nuanced solution by allowing multiple versions of data to exist simultaneously. Rather than locking data or assuming that conflicts are rare, MVCC creates a new version of the data each time it is modified. When a transaction reads the data, it retrieves the most recent committed version that was available at the time the transaction began.

Figure 2.3 – Multi-version strategy

This ensures that readers are never blocked by writers, and writers can update data without waiting for readers to finish. MVCC allows for high levels of concurrency while minimizing contention and ensuring consistency:

- How it works: When a user reads data, they see a snapshot of the database at a particular point in time, ensuring consistency without the need to lock resources. If multiple versions of the same record exist, each transaction sees the version that was valid at the start of the transaction. Writers create new versions of data rather than overwriting the existing version, which allows for better handling of concurrent operations.
- Conflict resolution: In MVCC, all transactions are allowed to complete, with each creating a new version of the data. Readers see a consistent snapshot of the database as of the start of their transaction, while subsequent reads or queries can use the latest committed version. If multiple versions of the same record exist, the system resolves conflicts at read time, typically by exposing the latest version in the metadata or by merging updates as per the system's conflict resolution strategy.
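To tie the OCC and MVCC ideas together, the following is a minimal, engine-agnostic sketch of snapshot reads plus commit-time validation with retries. It is purely illustrative: the in-memory `VersionedStore` and its methods are hypothetical and do not mirror any particular system's implementation.

```python
import threading

class VersionedStore:
    """Toy key-value store that validates writes optimistically at commit time."""
    def __init__(self):
        self._lock = threading.Lock()   # only guards the commit step itself
        self._data, self._version = {}, 0

    def snapshot(self):
        # Readers get a consistent snapshot plus the version it was taken at (MVCC-style read).
        with self._lock:
            return dict(self._data), self._version

    def commit(self, base_version, changes):
        # OCC: the write succeeds only if nobody committed since our snapshot was taken.
        with self._lock:
            if self._version != base_version:
                return False            # conflict detected -> caller must retry
            self._data.update(changes)
            self._version += 1
            return True

def update_with_retry(store, key, fn, max_retries=5):
    for _ in range(max_retries):
        snap, version = store.snapshot()
        new_value = fn(snap.get(key))   # do the work against the snapshot
        if store.commit(version, {key: new_value}):
            return new_value
    raise RuntimeError("gave up after repeated write conflicts")

store = VersionedStore()
update_with_retry(store, "orders", lambda v: (v or 0) + 1)
```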
Locking granularity

Locking granularity refers to the level of data specificity when applying locks. Choosing the right granularity is crucial, as it affects concurrency, performance, and conflict resolution. Here are three common locking granularities:

- Row-level locking: Specific rows are locked, allowing more granular control but also increasing the risk of deadlocks
- Table-level locking: The entire table is locked, reducing concurrency but simplifying conflict prevention
- Partition-level locking: A specific partition of data is locked, often used in distributed databases to allow other partitions to be accessed concurrently

Conflict resolution in distributed systems

Distributed systems introduce added complexity, as conflicts must be managed across different nodes that might hold different versions of the data. Techniques such as consensus protocols (Paxos and Raft) and MVCC are employed to ensure that conflicts are resolved efficiently and that the data remains consistent across all nodes. In some cases, systems also leverage conflict-free replicated data types (CRDTs), which enable concurrent updates to be merged automatically, ensuring eventual consistency without requiring complex coordination.

In distributed systems, MVCC ensures that transactions operate on a consistent snapshot of the data, with any modifications written as a new version. This approach allows multiple transactions to coexist without blocking each other, enabling higher concurrency and efficient resource utilization. While MVCC provides snapshot isolation and version management, replication of these versions across nodes is handled separately by the underlying storage system or coordination service (for example, HDFS, S3, or consensus protocols such as Raft).

Conflict resolution in lakehouse architectures

In lakehouse architectures, conflict resolution is a key challenge due to the decoupling of compute and storage, which allows multiple engines to operate on the same data. To address this, two concurrency control mechanisms are most widely used: OCC and MVCC.

With OCC, lakehouses allow transactions to proceed concurrently without locking data preemptively. At the time of committing the transaction, the system checks whether a conflict occurred. If a conflict is detected, the system can roll back or retry the conflicting transaction. This method aligns well with the open storage layer in lakehouses, where multiple compute engines may simultaneously access the data.

Lakehouses implement MVCC to manage concurrent reads and writes without introducing locks. When data is modified, a new version of the data is created, allowing readers to continue accessing the previous version. This ensures that readers are never blocked by ongoing write operations, and writers can proceed without waiting for readers.

Apache Iceberg and Delta Lake both implement MVCC to provide snapshot isolation and support concurrent reads and writes. Apache Hudi provides a similar guarantee through its commit timeline and supports snapshot isolation in Copy-on-Write mode, while Merge-on-Read enables near-real-time incremental views.

By leveraging OCC and MVCC, lakehouse architectures maintain the integrity and consistency of data even in environments with high levels of concurrency. This ensures that users can operate on data without running into conflicts, enabling smooth collaboration and efficient processing of large-scale datasets.

An essential aspect of conflict resolution in lakehouse architectures is version control. Lakehouses maintain versions of data as part of their transactional system, allowing users to view the state of the data at any given point in time. This feature is not only useful for conflict resolution but also for auditing and rollback purposes, allowing users to revert to previous versions of the data if necessary.

This version control ensures that no data is lost and any conflicts can be traced and resolved, preserving the integrity of the system over time. This is particularly valuable for enterprises with compliance requirements that demand data lineage and historical tracking.
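As a concrete illustration of this version history, the major table formats expose time-travel reads. Below is a hedged sketch using Delta Lake's Spark reader; the table path and version number are placeholders, the session is assumed to be configured with delta-spark, and Iceberg and Hudi offer equivalent snapshot/as-of reads:

```python
# Assumes a SparkSession (`spark`) already configured for Delta Lake (delta-spark).
path = "s3://my-bucket/tables/orders"   # hypothetical table location

# Read the table as of an earlier committed version, e.g. for auditing or rollback checks.
orders_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)           # or .option("timestampAsOf", "2025-01-01")
    .load(path)
)
orders_v5.show()
```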
In the next section, we'll explore how these concepts apply to the lakehouse architecture, specifically the storage engine component. We'll delve into table management services and transactional integrity, discussing how lakehouse storage engines provide ACID properties, concurrency control, and more.

Understanding the storage engine

In Chapter 1, Open Data Lakehouse: A New Architectural Paradigm, we introduced the storage engine as the critical component responsible for managing how data is stored, retrieved, and maintained in a data lakehouse architecture. In this chapter, we will deepen that understanding by exploring how the storage engine integrates with the open table format to provide transactional capabilities. However, before we dive into the specific capabilities of a storage engine, let's first examine the various sub-components that make up the storage engine within a traditional database system. This will help clarify why the storage engine is so essential in an open lakehouse architecture.

Components of a traditional database storage engine

Storage engines enable transactional capabilities in a lakehouse, so it's helpful to first look at what makes up a traditional database storage engine. Understanding its sub-components highlights why this layer is essential for efficient data management. The following diagram illustrates the architecture of a database system, with the storage engine highlighted at the bottom:

Figure 2.4 – Architecture of a traditional database management system

In traditional database management systems (DBMSs), the storage engine is the core layer responsible for managing how data is physically stored, retrieved, and maintained. It handles the low-level operations of the database, such as reading and writing data to storage, managing transactions, and enforcing concurrency. As shown in Figure 2.4, the storage engine includes components such as the lock manager, access methods, buffer manager, and recovery manager. Each of these components plays a vital role in ensuring the efficiency, accuracy, and durability of database operations. The lakehouse architecture draws inspiration from traditional database systems by incorporating a storage engine to provide transactional capabilities across massive datasets stored in systems such as Amazon S3, Google Cloud Storage, and Azure Blob Storage.

Let's take a look at the four sub-components of a storage engine and understand why they are needed in today's lakehouse architecture:

- Lock manager: The lock manager is a critical component in database systems, ensuring that multiple operations can access shared resources without interfering with one another. In a lakehouse architecture, this role becomes crucial due to the distributed nature of data and because different compute engines, such as Spark, Flink, or Presto, operate on the same dataset simultaneously to run diverse sets of workloads. A lock manager is therefore needed to prevent conflicts and maintain data correctness under concurrent operations. By bringing effective concurrency control mechanisms to coordinate multiple readers and writers, the lock manager upholds data integrity, ensuring that workloads run smoothly without conflicts or data corruption.
- Access methods: In traditional database systems, the access methods component determines how data is accessed and retrieved, often relying on storage structures such as B-trees, hashing, or log-structured merge-trees (LSM-trees) for efficient data retrieval. Similarly, in a lakehouse architecture, there is a need to effectively organize data and metadata so that compute engines can access data in storage efficiently. Lakehouse table formats employ a directory-based access method for efficient data retrieval from distributed cloud object stores. For example, the Apache Hudi table format organizes data into file groups and file slices (discussed in Chapter 4, Apache Hudi Deep Dive) within its timeline – a type of event log. Indexes operate on top of this structure to help compute engines quickly locate the right file groups for faster reads and writes. Apache Iceberg uses a tree-based structure to organize metadata files (manifests), which enables efficient pruning of irrelevant data files. Similarly, Delta Lake employs a directory-based access method, combined with its transaction log, to track all transactions and help prune data files efficiently.
- Buffer manager: The buffer manager in a database system is essential for improving system performance by caching frequently accessed data, reducing the need to repeatedly access slower storage. In lakehouse architectures, where cloud object stores serve as the storage layer, there is a fundamental trade-off between faster data ingestion and optimal query performance. Writing smaller files or logging deltas can speed up ingestion, but for optimal query performance, it is critical to minimize the number of files accessed and pre-materialize data merges. This is where a buffer manager becomes essential in a lakehouse context. It addresses the issue by caching frequently accessed or modified data, allowing query engines to avoid repeated access to slower lake storage. For example, in Apache Hudi, there are proposals to tightly integrate a columnar caching service that can sit between the lake storage and query engines, using the Hudi timeline to ensure cache consistency across transactions and mutations. This approach will not only improve read performance by serving data directly from the cache but also amortize the cost of merge operations by compacting data in memory.
- Recovery manager: The recovery manager is responsible for ensuring data durability and consistency, especially in the event of system failures. In a lakehouse architecture, the recovery manager takes on the role of maintaining data reliability across distributed storage systems. This involves mechanisms that allow the system to recover data to a consistent state, even after unexpected disruptions. Open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake use metadata snapshots and WALs to guarantee durability and facilitate recovery. For example, Iceberg uses immutable snapshots to enable data rollback and point-in-time recovery, while Hudi uses a timeline of commits to maintain consistent versions of the dataset and to revert changes in any unexpected scenarios. Delta Lake employs transaction logs that track all changes, enabling the system to revert to a stable state if needed. These capabilities ensure that lakehouses are resilient and can maintain data integrity, even at massive scale.

While these four interconnected sub-components form the foundation for executing ACID-based transactions in a traditional database system, the storage engine's role extends beyond just ensuring transactional integrity. Another key responsibility of the storage engine is to continuously optimize the data layout within the storage layer.
In traditional databases, the storage engine works closely with the administration, monitoring, and utilities components (as illustrated in Figure 2.4) to manage the organization and structure of files, ensuring efficient data access and performance. This concept carries over to modern lakehouse architecture, where the storage engine is not only responsible for maintaining transactional consistency but also for delivering a comprehensive suite of table management services. These services include cleaning, compaction, clustering, archival, and indexing – operations that are essential for ensuring that the storage layout remains optimized for different compute engines to efficiently query the data. By organizing the underlying files and metadata, the storage engine enables query engines to leverage advanced optimization techniques, such as predicate pushdown, data skipping, and pruning, ultimately improving query performance across various workloads.

With this understanding, we can dive deeper into the two fundamental responsibilities of the lakehouse storage engine: transactional capabilities and table management services.

How a lakehouse handles transactions

One of the defining attributes of the lakehouse architecture is its ability to provide transactional guarantees when running various Data Definition Language (DDL) and Data Manipulation Language (DML) operations, something that distinguishes it from traditional data lakes, which often lack such capabilities. These transactional features align the lakehouse architecture with the reliability and consistency typically associated with data warehouses (OLAP databases), while still retaining the flexibility of data lakes. In this section, we will explore how transactional integrity is achieved in the three table formats through ACID guarantees and concurrency control mechanisms, which allow multiple users and applications to safely interact with the same dataset without compromising data integrity.

ACID guarantees

ACID properties are essential for ensuring that data operations are reliable, consistent, and isolated from each other. In lakehouses, the need for these guarantees arises from the complexities of handling concurrent reads and writes, managing multiple compute engines, and supporting large-scale applications accessing the same data. Each of the table formats, such as Apache Hudi, Apache Iceberg, and Delta Lake, implements ACID properties slightly differently, given the distinct structure of the formats. Let's understand how the design decisions of each table format ensure ACID properties in a lakehouse.

Apache Hudi ACID

Hudi achieves ACID guarantees via its timeline – a WAL that keeps track of all the transactions occurring in the table. Atomic writes in Hudi are achieved by publishing commits atomically to the timeline, where each commit is stamped with an instant time. This instant time denotes the point at which the action is deemed to have occurred, ensuring that changes are either fully committed or rolled back, preventing partial updates. This guarantees that readers never see partially written data, maintaining atomicity.
Consistency in Hudi is primarily ensured through its timeline mechanism, which acts as a structured log of all actions (insert, update, and delete) represented as a sequence of "instants." Only committed instants (those that complete the requested → inflight → completed cycle) are visible to readers and writers, ensuring the consistent ordering of events and preventing partial or incomplete writes from affecting the dataset. In concurrent scenarios, each transaction is assigned a unique, increasing timestamp, guaranteeing that operations are applied in the correct order and avoiding conflicts between overlapping modifications.

In addition to this, Hudi enforces primary key uniqueness as a data integrity rule. Each record is assigned a primary key, which helps map data to specific partitions and file groups, allowing efficient updates and ensuring duplicate records are not introduced into the dataset.

Apache Hudi uses a combination of MVCC and OCC to ensure isolation between different types of processes. Hudi distinguishes between writer processes (which handle data modifications such as upserts and deletes), table services (which perform tasks such as compaction and cleaning), and readers (which execute queries). These processes operate on consistent snapshots of the dataset, thanks to MVCC, ensuring that readers always access the last successfully committed data, even when writes are ongoing. MVCC enables non-blocking, lock-free concurrency control between writers and table services, and between different table services, allowing for smooth parallel execution. At the same time, Hudi uses OCC to manage concurrency between writers, ensuring that conflicting operations are detected and handled without compromising data integrity. This dual approach provides snapshot isolation across all processes, maintaining isolation while efficiently supporting concurrent operations.

Hudi provides durability via its WAL (the timeline), which comprises Apache Avro-serialized files containing individual actions (such as commit, compaction, or rollback), ensuring that all changes made to the dataset are permanent and recoverable, even in the event of system failures. Any changes made during a transaction are first written to the WAL, which acts as a temporary log of changes before they are applied to the actual dataset. This mechanism ensures that, in case of a crash or failure, the system can recover from the WAL and complete any pending operations or roll back uncommitted changes, preventing data loss. Additionally, both metadata and data files in Hudi are stored in systems such as cloud object stores or HDFS, which enables the format to take advantage of their durable nature.
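As a practical illustration, multi-writer OCC in Hudi is turned on through writer configuration. The sketch below is hedged: option names reflect recent Hudi releases and may differ in yours, the table path and DataFrame are placeholders, and the in-process lock provider is only suitable for writers sharing one JVM (distributed setups typically use a ZooKeeper- or DynamoDB-based provider).

```python
# Assumes a SparkSession with the Hudi bundle on the classpath and an existing DataFrame `df`.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",    # primary key used for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # Optimistic concurrency control between concurrent writers:
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",             # clean up failed writes lazily
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/tables/orders"))   # hypothetical base path
```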
Apache Iceberg ACID

Apache Iceberg guarantees ACID properties by organizing data into a persistent tree structure of metadata and data files, with the entire table state represented by the root of this tree. Each commit to the dataset, such as an insert, update, or delete, creates a new tree with a new root metadata file. The key design element for atomicity is Iceberg's use of an atomic compare-and-swap (CAS) operation for committing these new metadata files to the catalog. The catalog stores the location of the current metadata file, which holds the complete state of all active snapshots of the table.

When a writer performs a change, it generates a new metadata file and attempts to commit the update by swapping the current metadata location with the new one. This CAS operation ensures that either the entire change is applied successfully or none at all, preventing any partial updates. In the case of concurrent writes, if two writers attempt to commit changes based on the same original metadata file, only the first writer's commit will succeed. The second writer's commit will be rejected, ensuring that no conflicting or incomplete changes are written. Technically, any kind of database or data store can be used as a catalog in Apache Iceberg. The only requirement is that the database should have a locking mechanism to ensure there are no conflicts. The atomic swap mechanism is fundamental to maintaining the integrity of the table and forms the backbone of Iceberg's transactional model.

Iceberg's metadata tree structure and commit process are designed to ensure consistency guarantees. At the core of Iceberg's consistency model is the guarantee that every commit must replace an expected table state with a new, valid state. Each writer proposes changes based on the current table state, but if another writer has modified the table in the meantime, the original writer's commit will fail. In such cases, the failed writer must retry the operation with the new table state, preventing the system from introducing any inconsistencies due to concurrent writes. This mechanism enforces a linear history of changes, where each commit builds on the most recent table version (snapshot), avoiding scenarios where the table could be left in a partially updated state. Iceberg ensures that as much of the previous work as possible is reused in the retry attempt. This efficient handling of retries reduces the overhead associated with repeated write operations and ensures that concurrent writers can work independently without causing inconsistencies. Readers only access fully committed data, as changes are atomically applied, maintaining a consistent view of the table without partial updates.

Apache Iceberg supports both snapshot isolation and serializable isolation to ensure that readers and concurrent writers remain isolated from each other. Iceberg uses OCC, where writers operate on the assumption that the table's current version will not change during their transaction. A writer prepares its updates and attempts to commit them by swapping the metadata file pointer from the current version to a new one. If another writer commits a change before this operation completes, the first writer's update is rejected, and it must retry based on the new table state. This mechanism guarantees that uncommitted changes from multiple writers never overlap or conflict, ensuring isolation between concurrent operations. Readers also benefit from isolation, as they access a consistent snapshot of the table, unaffected by ongoing writes, until they refresh the table to see newly committed changes. The atomic swap of metadata (in the catalog) ensures that all operations are isolated and fully applied, preventing partial updates from being exposed to readers or other writers.
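To make the compare-and-swap commit loop concrete, here is a small, engine-agnostic sketch. It is purely illustrative: `FakeCatalog` and its methods are hypothetical and do not mirror Iceberg's actual API; they only model the "swap the pointer if it hasn't moved, otherwise retry" behavior described above.

```python
import uuid

class CommitConflict(Exception):
    """Raised when the catalog pointer moved since the writer's base snapshot."""

class FakeCatalog:
    def __init__(self):
        self.current_metadata = "metadata-0.json"

    def compare_and_swap(self, expected, new):
        # Atomic pointer swap: succeeds only if nobody committed in the meantime.
        if self.current_metadata != expected:
            raise CommitConflict(f"expected {expected}, found {self.current_metadata}")
        self.current_metadata = new

def commit_with_retries(catalog, build_new_metadata, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        base = catalog.current_metadata          # snapshot of the current table state
        new = build_new_metadata(base)           # write new data/metadata for this base state
        try:
            catalog.compare_and_swap(base, new)  # CAS: all-or-nothing pointer update
            return new
        except CommitConflict:
            if attempt == max_attempts:
                raise                            # give up after repeated conflicts
            # otherwise loop: rebuild the commit on top of the newer table state

catalog = FakeCatalog()
commit_with_retries(catalog, lambda base: f"metadata-{uuid.uuid4().hex[:8]}.json")
```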
Apache Iceberg ensures durability by committing both data and metadata updates to fault-tolerant storage systems, such as HDFS, Amazon S3, or Google Cloud Storage. This is reinforced by Iceberg's tree-structured table format, where each committed snapshot represents a consistent, durable state of the table. Once a transaction is committed, both metadata and underlying data files are safely persisted to distributed storage. The commit process is atomic: the new metadata version is updated only after all changes are fully written and conflict-free. This guarantees that partially committed or failed transactions are excluded from valid snapshots, leaving the table in its last consistent state and preventing any risk of data corruption. Since Iceberg's metadata is stored alongside data files in cloud object storage, tables benefit from the durability guarantees of the underlying storage. In the event of a system crash or failure, Iceberg can recover to the latest stable state by referencing the most recent snapshot.

Delta Lake ACID

Delta Lake ensures atomicity by committing all changes, whether adding new data or marking old files for removal, as a single, indivisible transaction. This is achieved through Delta Lake's delta log, a WAL that tracks every operation on the table, ensuring that changes are logged in an all-or-nothing manner. Each write operation generates new data files (typically in Parquet format) and appends a corresponding commit entry to the delta log, recording both the files added and the files logically removed. More importantly, a commit is only considered successful when the entire operation, including all file updates, is fully recorded in the log. If the commit fails at any point, due to a system failure, conflict, or other issue, none of the new files will be visible to the system, guaranteeing that the transaction either fully completes or has no impact at all. The atomic commit process ensures that the table remains in a consistent state, avoiding scenarios where partial writes could lead to data inconsistencies.

Consistency in Delta Lake is achieved by ensuring that every transaction adheres to the rules and constraints defined for the dataset. This is enforced through Delta Lake's transaction log, which serves as the system's source of truth for all changes. Every write operation, whether it involves adding new data, updating existing records, or removing files, is validated against the current state of the table, ensuring that schema constraints are maintained and that no conflicting or invalid changes are applied. The delta log keeps a detailed record of each transaction, allowing Delta Lake to enforce data integrity checks, such as ensuring that schema changes are compatible and that records remain unique where required.

Because Delta Lake follows an append-only model, data files are never modified in place. Instead, new data files are created, and older files are logically marked for removal through entries in the delta log. The system uses this log to assemble a consistent, up-to-date snapshot of the dataset for both readers and writers. This design prevents inconsistencies by ensuring that every version of the table is constructed from valid, committed transactions. If a transaction violates a constraint (such as a schema mismatch or an attempt to modify data that has been concurrently updated by another writer), the transaction is aborted and must be retried. This guarantees that no invalid data is committed to the table, preserving the consistency of the dataset.
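Because every commit is an entry in the delta log, the table's transaction history can be inspected directly. A hedged sketch using the Delta Lake Python API follows; the table path is a placeholder and the session is assumed to be configured with delta-spark:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Assumes an existing SparkSession (`spark`) configured for Delta Lake.
delta_table = DeltaTable.forPath(spark, "s3://my-bucket/tables/orders")  # hypothetical path

# Each row corresponds to one atomic commit recorded in the _delta_log directory.
history = delta_table.history()  # optionally history(20) for the latest 20 commits
(history
    .select("version", "timestamp", "operation", "operationMetrics")
    .orderBy(F.col("version").desc())
    .show(truncate=False))
```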
Delta Lake provides snapshot isolation for reads and serializable isolation for writes, ensuring that transactions operate on a consistent view of the data. For reads, snapshot isolation guarantees that readers always see a stable version of the dataset, unaffected by ongoing writes, as they are based on previously committed snapshots of the data. Readers only view the changes once they are fully committed, which prevents any "dirty reads" of uncommitted data.

For writes, Delta Lake uses OCC to enforce serializable isolation. Writers operate on potentially different snapshots of the data, assuming that no other transactions have modified the snapshot they are working on. However, before committing changes, Delta Lake verifies whether the snapshot is still valid. If another writer has committed changes to the same portion of the dataset, the transaction fails, and the writer must retry with the updated snapshot. This prevents conflicting updates from being committed.

The core mechanism that enables this separation of read and write operations is MVCC, which allows Delta Lake to track multiple versions of the dataset over time, ensuring that readers can work on consistent snapshots while writers operate on isolated versions of the data. If a conflict arises between multiple writers, only one can successfully commit, preserving isolation and ensuring that no overlapping changes are applied to the dataset.

Similar to Apache Hudi and Apache Iceberg, Delta Lake guarantees durability by storing all transactions in fault-tolerant, persistent storage systems such as Amazon S3, Azure Blob Storage, or HDFS. Once a transaction is successfully committed to the delta log, both the data files and the metadata changes are durably stored, ensuring that they are protected even in the event of a system failure. Delta Lake's use of cloud object storage means that every write is replicated across multiple locations, providing resilience and strong consistency. Delta Lake can recover from any failed state by relying on the last successful transaction recorded in the delta log, making sure that no committed data is ever lost.

Table management services

In a typical database system, table management services are part of the administration, monitoring, and utilities component. The storage engine interfaces with these services to keep the underlying data files optimized, enabling query engines to interact efficiently with the data. This not only ensures that transactions are ACID-compliant but also that the data is accessible in a way that allows for faster reads and writes, better query performance, and optimal storage use. In an open lakehouse architecture, services such as compaction, clustering, indexing, and cleaning focus on organizing and maintaining data in a way that benefits both write-side transactions and read-side analytical queries.

Let's understand how these services can be utilized in all three table formats.

Compaction/file sizing

Compaction refers to the process of merging smaller files into larger, more efficient files. This is critical in lakehouse environments where frequent updates or inserts can lead to the creation of many small files, known as the "small file problem," which can degrade query performance and inflate storage costs. File sizing through compaction optimizes the storage layout by ensuring that file sizes remain optimal for query engines to read efficiently.
All three table formats allow compacting the underlying data files using the bin-packing algorithm.

Apache Hudi is designed to avoid the creation of small files by dynamically adjusting file sizes during the ingestion process, using several configuration parameters such as Max File Size, Small File Size, and Insert Split Size. Hudi manages file sizes in two main ways: auto-sizing during writes and clustering after writes. During ingestion, Hudi ensures that data files are written to meet the defined Max File Size (for example, 120 MB). If small files, those below the Small File Size threshold, are created, Hudi merges them during compaction cycles to maintain optimal file sizes. This automatic merging not only improves query performance but also reduces storage overhead. Hudi helps keep file sizes balanced without requiring manual intervention.

In addition to auto-sizing during writes and clustering after writes, Hudi also manages file merging for its Merge-on-Read (MoR) table type (discussed further in Chapter 4, Apache Hudi Deep Dive). In a MoR table, each file group contains a base file (columnar) and one or more log files (row-based). During writes, inserts are stored in the base file while updates are appended to log files, avoiding synchronous merges and reducing write amplification, which improves write latency. During the compaction process, the log file updates are merged into the base file to create a new version of the base file. This ensures that users can read the latest snapshot of the table, helping to meet data freshness SLAs. Hudi offers flexibility by allowing users to control both the frequency and strategy of compaction, providing greater control over file sizing and data performance.

Compaction in Iceberg is a maintenance operation that is achieved using the rewriteDataFiles action, which can run in parallel on Spark. This action allows users to specify a target file size, ensuring that small files are merged into larger files, typically over 100 MB in size. For example, a user can set the target file size to 500 MB to ensure that data is written optimally, reducing both the number of files and the associated file open costs:

```java
SparkActions.get().rewriteDataFiles(table)
    .filter(Expressions.equal("date", "2020-08-18"))
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();
```

Additionally, Iceberg supports merging delete files with data files, which is particularly needed for MoR tables that generate delete files. This merging process reduces the number of separate files that query engines need to scan, further enhancing performance.

In Delta Lake, compaction is done using the OPTIMIZE command. The bin-packing algorithm used by OPTIMIZE aims to create files that are more efficient for read queries, focusing on ensuring optimal file sizes rather than the number of tuples per file. This results in fewer but larger files that are easier and faster to read.
For example, users can execute the following command to compact all files in a table:

```python
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, pathToTable)
deltaTable.optimize().executeCompaction()
```

For cases where only a subset of the data needs to be compacted, users can add a partition predicate to target specific partitions, enhancing flexibility:

```python
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()
```

The compaction process in Delta is idempotent, meaning that running it multiple times on the same dataset will not result in further changes once the compaction has been completed. Delta Lake 3.1.0 also introduces auto compaction, which automates the process of compacting small files immediately after a write operation. Auto compaction is triggered synchronously on the cluster that performs the write, combining small files within the table partitions that meet the file count threshold. Auto compaction settings can be controlled at both the session and table levels.

By reducing the number of small files, compaction improves both storage efficiency and query performance. Auto compaction abilities in Apache Hudi and Delta Lake allow for more proactive maintenance of tables, especially in write-heavy environments. We will learn more about how to execute compaction in the three table formats in Chapter 8, Performance Optimization and Tuning in a Lakehouse.
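As an illustration of the table-level control mentioned above, auto compaction can typically be toggled with a table property. The snippet below is a hedged sketch: the property name follows the Delta Lake documentation for recent releases, but names and defaults vary across Delta versions and platforms, and the table identifier is a placeholder.

```python
# Assumes a SparkSession (`spark`) configured for Delta Lake and an existing table.
spark.sql("""
    ALTER TABLE my_catalog.sales.orders
    SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
""")
```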
Clustering

Clustering refers to the process of organizing data in a way that groups similar records together, typically based on certain columns. This enables faster data retrieval by allowing query engines to scan fewer files or data blocks, improving the speed of analytical queries that filter based on the clustered columns. The core idea behind clustering is to rearrange data files based on query predicates rather than arrival time.

As data is ingested, it is co-located by arrival time (for example, the same ingestion job). However, most queries are likely to filter based on event time or other business-specific attributes (for example, city ID and order status). The clustering service reorganizes these files so that frequently queried data is grouped together, allowing for data skipping and significantly improving the efficiency of scans. For example, in Figure 2.5, data ingested in three separate commits (not sorted by City_ID) is later sorted and clustered so that files are rewritten based on a City_ID range, allowing queries to target specific ranges more efficiently. Clustering techniques typically involve two main strategies: simple sorting and multi-dimensional clustering (including Z-ordering and Hilbert curves). In all three table formats (Apache Hudi, Apache Iceberg, and Delta Lake), clustering is designed to organize data optimally in storage.

Figure 2.5 – Clustering based on the City_ID field

Hudi introduces a clustering service that is tightly coupled with Hudi's storage engine. It reorganizes data for more efficient access based on frequently used query patterns, while still supporting fast data ingestion. The clustering process is asynchronous, which allows new data writes to continue without interruption while clustering is performed in the background. Hudi's MVCC ensures snapshot isolation, so both the clustering process and new data writes can occur concurrently without interference, preserving transactional integrity for readers and writers.

Hudi's clustering workflow consists of two primary steps:

- Scheduling clustering: A clustering plan is generated using a pluggable strategy to identify the optimal way to reorganize files based on query performance.
- Executing clustering: The plan is executed by rewriting data into newly optimized files, replacing the older, less efficient ones, while minimizing query overhead.

Clustering also enhances the scalability of Hudi in handling high-ingestion workloads while maintaining optimized file layouts for fast reads. Beyond simple sorting, Hudi supports advanced techniques such as space-filling curves (for example, Z-ordering and Hilbert curves) to further optimize the storage layout. Space-filling curves enable Hudi to efficiently order rows based on a multi-column sort key, preserving the ordering of individual columns within the sort. For instance, a query engine can use Z-ordering to group related data together in a way that minimizes the amount of data scanned. Unlike traditional linear sorting, space-filling curves allow the query engine to reduce the search space significantly when filtering by any subset of the multi-column key, resulting in orders-of-magnitude speed-ups in query performance.

Clustering in Apache Iceberg is handled slightly differently compared to Apache Hudi, but the overall idea is the same. While Hudi integrates clustering as part of its storage engine, Iceberg offers clustering as an API that can be utilized by query engines such as Apache Spark or Apache Flink. Iceberg uses the rewrite_data_files API to optimize the layout of data files by sorting them either linearly or hierarchically. This method helps ensure that frequently queried data is organized more efficiently, reducing the need to scan unnecessary data. For example, by using a query engine such as Spark, users can trigger compaction and clustering of data files, ensuring that related data points are stored together in larger, contiguous blocks. Iceberg also allows sorting on multiple dimensions using Z-ordering. Z-ordering enables the system to sort data across multiple columns simultaneously, ensuring that related data, across various dimensions, is stored close together. This can be especially beneficial for queries that filter on multiple columns.

In Delta Lake, clustering is achieved via the executeZOrderBy API, which allows users to sort data across one or more columns commonly used in query predicates. For example, running the OPTIMIZE table ZORDER BY (eventType) command will reorganize the data to improve query efficiency for predicates filtering on eventType. Z-ordering is especially effective for high-cardinality columns but is not idempotent, meaning it creates a new clustering pattern each time it's executed. This approach balances tuples across files for more efficient access, although the resulting file sizes might vary based on the data itself. In Delta Lake 3.1.0 and above, a new feature called liquid clustering was introduced, providing enhanced flexibility in managing data layout. Unlike traditional clustering approaches, liquid clustering allows users to dynamically redefine clustering columns without needing to rewrite existing data. This makes it easier to adapt the data layout as analytical needs evolve over time, ensuring that the system remains optimized for changing query patterns.
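For reference, the Delta Lake Z-order operation described above can be invoked either through SQL or the Python API. The sketch below is hedged: the table name, path, and column are placeholders, and the syntax follows recent Delta Lake releases.

```python
from delta.tables import DeltaTable

# Assumes a SparkSession (`spark`) configured for Delta Lake.
# SQL form: rewrite the table's files clustered on the eventType column.
spark.sql("OPTIMIZE events ZORDER BY (eventType)")

# Equivalent Python API on a path-based table.
delta_table = DeltaTable.forPath(spark, "s3://my-bucket/tables/events")
delta_table.optimize().executeZOrderBy("eventType")
```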
With clustering in these table formats, data skipping becomes highly efficient, as queries can avoid scanning unnecessary files, resulting in faster execution and lower compute costs. In Chapter 8, Performance Optimization and Tuning in a Lakehouse, we will explore the various clustering strategies and see how they can be executed using query engines such as Apache Spark.

Indexing

Indexes are data structures typically used in database systems to accelerate query performance by allowing quick lookups of data. They provide a mechanism to locate specific records without scanning the entire dataset, significantly speeding up read operations. Of the three table formats (Apache Hudi, Iceberg, and Delta Lake), Apache Hudi is the only one that provides explicit support for indexes to enhance both data retrieval and write operations (such as updates and deletes).

Hudi introduces a pluggable indexing subsystem, built on top of its metadata table (discussed in Chapter 4, Apache Hudi Deep Dive), which stores auxiliary data about the table essential for organizing and optimizing access to records. Unlike traditional indexing approaches that might lock the system during index creation, Hudi's indexing system operates asynchronously, allowing write operations to continue while indexes are built and maintained in the background. This asynchronous metadata indexing service is an integral part of the storage engine and ensures that write latency is not impacted even during index updates, making it ideal for large-scale tables where indexing might otherwise take hours.

The asynchronous nature of Hudi's indexing brings two key benefits:

- Improved write latency: Indexing occurs in parallel with data ingestion, enabling high-throughput writes without delays
- Failure isolation: Since indexing runs independently, any failures in the indexing process do not impact ongoing writes, ensuring operational stability

Hudi currently supports several types of indexes, including a files index, a column_stats index, a bloom_filter index, and a record-level index. As data continues to grow in volume and complexity, Hudi's pluggable indexing subsystem allows adding more types of indexes to further improve I/O efficiency and query performance. One of the challenges faced by traditional indexing approaches is the requirement to stop all writers when building a new index. In contrast, Hudi's asynchronous indexing allows for the dynamic addition and removal of indexes while concurrent writers continue to function, providing a database-like ease of use. This design brings the reliability and performance typical of database systems into the lakehouse paradigm, ensuring that indexing can scale seamlessly as data grows.
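As a rough illustration, these metadata-table indexes are enabled through writer configuration. The options below are a hedged sketch based on recent Hudi releases: exact names and availability (particularly for the record-level index) depend on your Hudi version, and the DataFrame, table, and path are placeholders.

```python
# Assumes a SparkSession with the Hudi bundle and an existing DataFrame `df`.
index_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # usual key/precombine settings
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.metadata.enable": "true",                           # maintain the metadata table
    "hoodie.metadata.index.column.stats.enable": "true",        # column_stats index for data skipping
    "hoodie.metadata.index.bloom.filter.enable": "true",        # bloom_filter index for key lookups
    "hoodie.metadata.record.index.enable": "true",              # record-level index (newer releases)
}

(df.write.format("hudi")
   .options(**index_options)
   .mode("append")
   .save("s3://my-bucket/tables/orders"))   # hypothetical base path
```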
Cleaning

Cleaning refers to the removal of obsolete or unreferenced files from the storage layer to free up space and improve performance. As data evolves over time, old versions, snapshots, or deleted records can accumulate, leading to unnecessary storage overhead. The cleaning services provided by Hudi, Iceberg, and Delta Lake ensure that only active data remains, maintaining both storage efficiency and performance.

Apache Hudi offers a cleaner service (as part of its storage engine) that plays a crucial role in managing space reclamation by removing old file slices while maintaining snapshot isolation through its MVCC system. This allows Hudi to balance the retention of data history for time travel and rollbacks while keeping storage costs in check. By default, Hudi automatically triggers cleaning after every commit to remove older, unreferenced files. This behavior can be configured based on user needs, for example, by adjusting the number of commits after which cleaning is invoked using the hoodie.clean.max.commits property.

Apache Iceberg provides several APIs for cleaning unused files, such as the expireSnapshots operation, which removes old snapshots and the data files they reference. Iceberg uses snapshots for time travel and rollback, but regularly expiring them prevents storage bloat and keeps table metadata small. Iceberg manages metadata cleanup by removing outdated JSON metadata files after each commit, configurable with the write.metadata.delete-after-commit.enabled property. Another maintenance capability is the ability to delete orphan files (i.e., unreferenced files left behind by job failures) using the deleteOrphanFiles action, ensuring that even files missed by snapshot expiration are cleaned up. It is important to note that these services are not automatic and must be executed as part of a regular maintenance strategy.

In Delta Lake, the VACUUM command is used to remove files no longer referenced by the Delta table, with a default retention period of 7 days (configurable). Similar to Iceberg's APIs, VACUUM is not automatically triggered, and users must manually invoke it. The VACUUM process only deletes data files, while log files are deleted asynchronously after checkpoint operations. The retention period for log files is 30 days by default, but this can be adjusted with the delta.logRetentionDuration property.
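The snippets below sketch how these cleanup operations are commonly invoked from Spark. They are hedged examples: table identifiers, paths, and the catalog name `my_catalog` are placeholders, and retention values should follow your own compliance and time-travel requirements.

```python
from delta.tables import DeltaTable

# Assumes a SparkSession (`spark`) configured for Delta Lake and Iceberg.

# Delta Lake: remove data files no longer referenced and older than 7 days (168 hours).
delta_table = DeltaTable.forPath(spark, "s3://my-bucket/tables/orders")
delta_table.vacuum(168)

# Apache Iceberg (Spark SQL procedures): expire old snapshots and remove orphan files.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.orders',
        older_than => TIMESTAMP '2025-01-01 00:00:00'
    )
""")
spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'db.orders')")
```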
By offering these cleaning mechanisms, each system ensures that storage costs are controlled, old or unused data is removed efficiently, and performance remains optimized for large-scale workloads. We will explore more about the cleaning services in Chapter 8, Performance Optimization and Tuning in a Lakehouse.

To recap, the storage engine offers two core functionalities in a lakehouse architecture:

- Transactional integrity: Guarantees ACID compliance and high concurrency, essential for maintaining data integrity under massive workloads
- Table management services: Ensures the efficient organization and access of data through operations such as clustering, compaction, indexing, and cleanup

By incorporating robust cleaning mechanisms, lakehouse storage engines not only manage the data lifecycle effectively but also optimize performance and maintain cost efficiency, ensuring that modern data workloads run smoothly and reliably.

Conclusion

ACID transactions are the backbone of reliable lakehouse architectures, enabling consistent, isolated, and durable operations even in highly concurrent, distributed environments. By leveraging OCC and MVCC, open table formats like Apache Iceberg, Apache Hudi, and Delta Lake successfully bridge the gap between traditional databases and modern data lakes. To go deeper into lakehouse design, transactional internals, and table formats in practice, you can learn more by reading the book Engineering Lakehouses with Open Table Formats by Dipankar Mazumdar and Vinoth Govindarajan. This book will help you learn about open table formats and pick the right table format for your needs, blending theoretical understanding with practical examples to enable you to build, maintain, and optimize lakehouses in production.

Airflow Ops Best Practices: Observation and Monitoring

Dylan Intorf, Kendrick van Doorn, Dylan Storey
12 Nov 2024
15 min read
This article is an excerpt from the book "Apache Airflow Best Practices" by Dylan Intorf, Kendrick van Doorn, and Dylan Storey. With a practical approach and detailed examples, this book covers the newest features of Apache Airflow 2.x and its potential for workflow orchestration, operational best practices, and data engineering.

Introduction

In this article, we will continue to explore the application of modern "ops" practices within Apache Airflow, focusing on the observation and monitoring of your systems and DAGs after they've been deployed.

We'll divide this observation into two segments – the core Airflow system and individual DAGs. Each segment will cover specific metrics and measurements you should be monitoring for alerting and potential intervention.

When we discuss monitoring in this section, we will consider two types of monitoring – active and suppressive.

In an active monitoring scenario, a process will actively check a service's health state, recording its state and potentially taking action directly on the return value.

In a suppressive monitoring scenario, the absence of a state (or state change) is usually meaningful. In these scenarios, the monitored application sends an active signal to a process to inform it that it is OK, usually suppressing an action (such as an alert) from occurring.

This chapter covers the following topics:

- Monitoring core Airflow components
- Monitoring your DAGs

Technical requirements

By now, we expect you to have a good understanding of Airflow and its core components, along with functional knowledge in the deployment and operation of Airflow and Airflow DAGs.

We will not be covering specific observability aggregators or telemetry tools; instead, we will focus on the activities you should be keeping an eye on. We strongly recommend that you work closely with your ops teams to understand what tools exist in your stack and how to configure them for capturing metrics from and alerting on your deployments.

Monitoring core Airflow components

All of the components we will discuss here are critical to ensuring a functioning Airflow deployment. Generally, all of them should be monitored with a bare-minimum check of "Is it on?", and if a component is not, an alert should surface to your team for investigation. The easiest way to check this is to query the REST API on the web server at `/health/`; this will return a JSON object that can be parsed to determine whether components are healthy and, if not, when they were last seen.
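As a quick illustration, this health check can be wired into an active monitor with a few lines of Python. This is a hedged sketch: the base URL is a placeholder, the endpoint may require authentication depending on your deployment, and the response shape shown reflects Airflow 2.x.

```python
import requests

AIRFLOW_BASE_URL = "http://localhost:8080"   # hypothetical web server address

def check_airflow_health(base_url: str = AIRFLOW_BASE_URL) -> dict:
    """Actively poll the web server's /health endpoint and flag unhealthy components."""
    resp = requests.get(f"{base_url}/health/", timeout=10)
    resp.raise_for_status()
    health = resp.json()   # e.g. {"metadatabase": {"status": "healthy"}, "scheduler": {...}}

    unhealthy = {
        component: details
        for component, details in health.items()
        if details.get("status") != "healthy"
    }
    if unhealthy:
        # Hook this into your alerting tool of choice (PagerDuty, Slack, and so on).
        print(f"ALERT: unhealthy Airflow components: {unhealthy}")
    return health

if __name__ == "__main__":
    check_airflow_health()
```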
Important metrics that Airflow exposes for you include the following:

- scheduler.scheduler_loop_duration: This should be monitored to ensure that your scheduler is able to loop and schedule tasks for execution. As this metric increases, you will see tasks beginning to schedule more slowly, to the point where you may begin missing SLAs because tasks fail to reach a schedulable state.
- scheduler.tasks.starving: This indicates how many tasks cannot be scheduled because there are no pool slots available. Pools are a mechanism that Airflow uses to balance large numbers of submitted task executions against a finite amount of execution throughput. It is likely that this number will not be zero, but a high value for extended periods of time may point to an issue in how DAGs are being written to schedule work.
- scheduler.tasks.executable: This indicates how many tasks are ready for execution (that is, queued). This number will sometimes be non-zero, and that is OK, but if the number increases and stays high for extended periods of time, it indicates that you may need additional compute resources to handle the load. Look at your executor to increase the number of workers it can run.

Metadata database

The metadata database is used to store and track all of the metadata for your Airflow deployment's previous DAG/task executions, along with information about your environment's roles and permissions. Losing data from this database can interrupt normal operations and cause unintended consequences, such as DAG runs being repeated. While critical, the database is also the component least likely to encounter issues because it is architecturally ubiquitous; when it does fail, however, the consequences tend to be catastrophic. We generally suggest you use a managed service for provisioning and operating your backing database, and ensure that a disaster recovery plan for your metadata database is in place at all times. Some active areas to monitor on your database include the following:

- Connection pool size/usage: Monitor both the connection pool size and usage over time to ensure appropriate configuration, and identify potential bottlenecks or resource contention arising from Airflow components' concurrent connections.
- Query performance: Measure query latency to detect inefficient queries or performance issues, while monitoring query throughput to ensure effective workload handling by the database.
- Storage metrics: Monitor the disk space utilization of the metadata database to ensure that it has sufficient storage capacity. Set up alerts for low disk space conditions to prevent database outages due to storage constraints.
- Backup status: Monitor the status of database backups to ensure that they are performed regularly and successfully. Verify backup integrity and retention policies to mitigate the risk of data loss if there is a database failure.
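For the connection-related checks above, one quick way to see how many connections Airflow components are holding open is to query the backing database directly. A minimal sketch, assuming a PostgreSQL backend with a metadata database named `airflow` (both are assumptions about your deployment):

```sql
-- Current connections per client, broken down by state, for the Airflow database.
SELECT
    application_name,
    state,
    COUNT(*) AS connection_count
FROM pg_stat_activity
WHERE datname = 'airflow'
GROUP BY application_name, state
ORDER BY connection_count DESC;

-- Compare the total against the server-wide ceiling to spot pool exhaustion early.
SHOW max_connections;
```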
Triggerer

The triggerer instance manages all of the asynchronous operations of deferrable operators in a deferred state. As such, the major operational concerns generally relate to ensuring that individual deferred operators don't make blocking calls to the event loop. If this occurs, your deferrable tasks will not be able to check their state changes as frequently, and this will impact scheduling performance. Important metrics that Airflow exposes for you include the following:

- triggers.blocked_main_thread: The number of triggers that have blocked the main thread. This is a counter and should monotonically increase over time; pay attention to large jumps between recordings (or quick acceleration) in this count, as it is indicative of a larger problem.
- triggers.running: The number of triggers currently on a triggerer instance. This metric should be monitored to determine whether you need to increase the number of triggerer instances you are running. While the official documentation claims that up to tens of thousands of triggers can run on an instance, the common operational number is much lower. Tune at your discretion, but depending on the complexity of your triggers, you may need to add a new instance for every few hundred consistent triggers you run.

Executors/workers

Depending on the executor you use, you will need to monitor your executors and workers a bit differently.

The Kubernetes executor uses the Kubernetes API to schedule tasks for execution; as such, you should use the Kubernetes events and metrics servers to gather logs and metrics for your task instances. Common metrics to collect for an individual task are CPU and memory usage. This is crucial for tuning requests or mutating individual task resource requests to ensure that they execute safely.

The Celery worker has additional components and long-lived processes that you need to collect metrics on. You should monitor an individual Celery worker's memory and CPU utilization to ensure that it is not over- or under-provisioned, tuning allocated resources accordingly. You also need to monitor the message broker (usually Redis or RabbitMQ) to ensure that it is appropriately sized. Finally, it is critical to measure the queue length of your message broker and ensure that too much "back pressure" isn't being created in the system. If you find that your tasks are sitting in a queued state for a long period of time and the queue length is consistently growing, it's a sign that you should start an additional Celery worker to execute the scheduled tasks. You should also investigate the native Celery monitoring tool Flower (https://flower.readthedocs.io/en/latest/) for additional, more nuanced methods of monitoring.
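For a Redis-backed Celery setup, the queue depth can be sampled directly from the broker. This is a minimal sketch, assuming Redis on localhost and Celery's default queue name of `celery`; both values are assumptions and should be adjusted to your deployment:

```python
import redis

BROKER_HOST = "localhost"   # assumed broker location
CELERY_QUEUE = "celery"     # Celery's default queue name; yours may differ


def queued_task_count() -> int:
    """Return the number of messages currently waiting in the broker queue."""
    client = redis.Redis(host=BROKER_HOST, port=6379, db=0)
    return client.llen(CELERY_QUEUE)


if __name__ == "__main__":
    depth = queued_task_count()
    print(f"{depth} tasks waiting in queue '{CELERY_QUEUE}'")
    # Alert (or scale out workers) if this stays high for an extended period.
```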
Web server

The Airflow web server provides not just the UI for your Airflow deployment but also its RESTful interface. Especially if you happen to be controlling Airflow scheduling behavior with API calls, you should keep an eye on the following metrics:

- Response time: Measure the time taken for the API to respond to requests. This metric indicates the overall performance of the API and can help identify potential bottlenecks.
- Error rate: Monitor the rate of errors returned by the API, such as 4xx and 5xx HTTP status codes. High error rates may indicate issues with the API implementation or the underlying systems.
- Request rate: Track the rate of incoming requests to the API over time. Sudden spikes or drops in request rates can impact performance and indicate changes in usage patterns.
- System resource utilization: Monitor resource utilization metrics such as CPU, memory, disk I/O, and network bandwidth on the servers hosting the API. High resource utilization can indicate potential performance bottlenecks or capacity limits.
- Throughput: Measure the number of successful requests processed by the API per unit of time. Throughput metrics provide insights into the API's capacity to handle incoming traffic.

Now that you have some basic metrics to collect from your core architectural components and can monitor the overall health of the application, we need to monitor the actual DAGs themselves to ensure that they function as intended.

Monitoring your DAGs

There are multiple aspects to monitoring your DAGs, and while they're all valuable, they may not all be necessary. Take care to ensure that your monitoring and alerting stack matches your organizational needs with regard to operational parameters for resiliency and, if there is a failure, recovery times. No matter how much or how little you choose to implement, knowing that your DAGs work, and if and how they fail, is the first step in fixing problems that will arise.

Logging

Airflow writes logs for tasks in a hierarchical structure that allows you to see each task's logs in the Airflow UI. The community also provides a number of providers that let you use other services for backing log storage and retrieval. A complete list of supported providers is available at https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/logging.html. Airflow uses the standard Python logging framework to write logs. If you're writing custom operators or executing Python functions with a PythonOperator, just make sure that you instantiate a Python logger instance, and the associated methods will handle everything for you.

Alerting

Airflow provides mechanisms for alerting on operational aspects of your executing workloads that can be configured within your DAG:

- Email notifications: Emails can be sent when a task enters a failed or retry state by setting `email_on_failure` or `email_on_retry`, respectively. These arguments can be provided to all tasks in the DAG through the `default_args` keyword of the DAG, or to individual tasks by setting the keyword argument on each task.
- Callbacks: Callbacks are special actions that are executed if a specific state change occurs. Generally, these callbacks should be thoughtfully leveraged to send alerts that are critical operationally:
  - on_success_callback: This callback is executed at both the task and DAG levels when entering a successful state. Unless it is critical that you know whether something succeeds, we generally suggest not using this for alerting.
  - on_failure_callback: This callback is invoked when a task enters a failed state. Generally, this callback should always be set and, in critical scenarios, alert on failures that require intervention and support.
  - on_execute_callback: This is invoked right before a task executes and only exists at the task level. Use sparingly for alerting, as it can quickly become noisy when overused.
  - on_retry_callback: This is invoked when a task is placed in a retry state. This is another callback to be cautious about as an alert, as it can become noisy and cause false alarms.
  - sla_miss_callback: This is invoked when a DAG misses its defined SLA. This callback is only executed at the end of a DAG's execution cycle, so it tends to be a very reactive notification that something has gone wrong.

SLA monitoring

As awesome a tool as Airflow is, it is a well-known fact in the community that SLAs, while largely functional, have some unfortunate implementation details that can make them problematic at best; they are generally regarded as a broken feature in Airflow. If you require SLA monitoring on your workflows, we suggest deploying a cron job monitoring tool such as healthchecks (https://github.com/healthchecks/healthchecks), which allows you to create suppressive alerts for your services through its REST API to manage SLAs; a sketch of wiring a callback to such a service follows.
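The sketch below shows one way this might look: `email_on_failure` in `default_args` plus an `on_failure_callback` that pings a healthchecks-style endpoint. The DAG name, email address, and ping URL are placeholders for illustration, not values from the book:

```python
import requests
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Placeholder ping URL for a healthchecks-style service (not from the book).
HEALTHCHECK_URL = "https://hc-ping.com/your-check-uuid"


def notify_failure(context):
    """Ping the monitoring service's failure endpoint when any task fails."""
    # The '/fail' suffix signals an explicit failure on healthchecks-style
    # services (assumption: your tool supports it); adapt to your stack.
    requests.post(f"{HEALTHCHECK_URL}/fail", timeout=10)


with DAG(
    dag_id="critical_pipeline",           # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "email": ["oncall@example.com"],  # placeholder address
        "email_on_failure": True,
        "email_on_retry": False,
        "on_failure_callback": notify_failure,
    },
) as dag:
    EmptyOperator(task_id="placeholder_task")
```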
By pairing this third-party service with either HTTP operators or simple requests from callbacks, you can ensure that your most critical workflows achieve dynamic and resilient SLA alerting.

Performance profiling

The Airflow UI is a great tool for profiling the performance of individual DAGs:

- The Gantt chart view: This is a great visualization for understanding the amount of time spent on individual tasks and the relative order of execution. If you're worried about bottlenecks in your workflow, start here.
- Task duration: This allows you to profile the run characteristics of tasks within your DAG over a historical period. This tool is great at helping you understand temporal patterns in execution time and finding outliers in execution. Especially if you find that a DAG slows down over time, this view can help you understand whether it is a systemic issue and which tasks might need additional development.
- Landing times: This shows the delta between task completion and the start of the DAG run. This is an unintuitive but powerful metric, as increases in it, when paired with stable task durations in upstream tasks, can help identify whether the scheduler is under heavy load and may need tuning.

Additional metrics that have proven to be useful (but may need to be calculated) include the following:

- Task startup time: This is an especially useful metric when operating with a Kubernetes executor. To calculate this, you will need to compute the difference between `start_date` and `execution_date` on each task instance. This metric will especially help you identify bottlenecks outside of Airflow that may impact task run times.
- Task failure and retry counts: Monitoring the frequency of task failures and retries can reveal information about the stability and robustness of your environment. Especially if these types of failure can be linked back to patterns in time or execution, they can help debug interactions with other services.
- DAG parsing time: Monitoring the amount of time a DAG takes to parse is very important for understanding scheduler load and bottlenecks. If an individual DAG takes a long time to load (either due to heavy imports or long blocking calls being executed during parsing), it can have a material impact on the timeliness of scheduling tasks.

Conclusion

In this article, we covered some essential strategies to effectively monitor both the core Airflow system and individual DAGs post-deployment. We highlighted the importance of active and suppressive monitoring techniques and provided insights into the critical metrics to track for each component, including the scheduler, metadata database, triggerer, executors/workers, and web server. Additionally, we discussed logging, alerting mechanisms, SLA monitoring, and performance profiling techniques to ensure the reliability, scalability, and efficiency of Airflow workflows. By implementing these monitoring practices and leveraging the insights gained, operators can proactively manage and optimize their Airflow deployments for optimal performance and reliability.

Author Bio

Dylan Intorf is a solutions architect and data engineer with a BS in Computer Science from Arizona State University. He has 10+ years of experience in the software and data engineering space, delivering custom-tailored solutions to the tech, financial, and insurance industries.

Kendrick van Doorn is an engineering and business leader with a background in software development and over 10 years of developing tech and data strategies at Fortune 100 companies.
In his spare time, he enjoys taking classes at different universities and is currently an MBA candidate at Columbia University.

Dylan Storey has a B.Sc. and M.Sc. from California State University, Fresno, in Biology and a Ph.D. from the University of Tennessee, Knoxville, in Life Sciences, where he leveraged computational methods to study a variety of biological systems. He has over 15 years of experience in building, growing, and leading teams, and in solving problems while developing and operating data products at a variety of scales and industries.


Essential SQL for Data Engineers

Kedeisha Bryan, Taamir Ransome
31 Oct 2024
10 min read
This article is an excerpt from the book Cracking the Data Engineering Interview, by Kedeisha Bryan and Taamir Ransome. The book is a practical guide that'll help you prepare to successfully break into the data engineering role. The chapters cover technical concepts as well as tips for resume, portfolio, and brand building to catch an employer's attention, while also focusing on case studies and real-world interview questions.

Introduction

In the world of data engineering, SQL is the unsung hero that empowers us to store, manipulate, transform, and migrate data easily. It is the language that enables data engineers to communicate with databases, extract valuable insights, and shape data to meet their needs. Regardless of the nature of the organization or the data infrastructure in use, a data engineer will invariably need to use SQL for creating, querying, updating, and managing databases. As such, proficiency in SQL is often the difference between a good data engineer and a great one. Whether you are new to SQL or looking to brush up on your skills, this chapter will serve as a comprehensive guide. By the end of this chapter, you will have a solid understanding of SQL as a data engineer and be prepared to showcase your knowledge and skills in an interview setting.

In this article, we will cover the following topics:
- Must-know foundational SQL concepts
- Must-know advanced SQL concepts
- Technical interview questions

Must-know foundational SQL concepts

In this section, we will delve into the foundational SQL concepts that form the building blocks of data engineering. Mastering these fundamental concepts is crucial for acing SQL-related interviews and effectively working with databases. Let's explore the critical foundational SQL concepts every data engineer should be comfortable with, as follows:

- SQL syntax: SQL syntax is the set of rules governing how SQL statements should be written. As a data engineer, understanding SQL syntax is fundamental because you'll be writing and reviewing SQL queries regularly. These queries enable you to extract, manipulate, and analyze data stored in relational databases.
- SQL order of operations: The order of operations dictates the sequence in which the clauses of a query are evaluated: FROM and JOIN, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, and LIMIT/OFFSET.
- Data types: SQL supports a variety of data types, such as INT, VARCHAR, DATE, and so on. Understanding these types is crucial because they determine the kind of data that can be stored in a column, impacting storage considerations, query performance, and data integrity. As a data engineer, you might also need to convert data types or handle mismatches.
- SQL operators: SQL operators are used to perform operations on data. They include arithmetic operators (+, -, *, /), comparison operators (>, <, =, and so on), and logical operators (AND, OR, and NOT). Knowing these operators helps you construct complex queries to solve intricate data-related problems.
- Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL) commands: DML commands such as SELECT, INSERT, UPDATE, and DELETE allow you to manipulate data stored in the database. DDL commands such as CREATE, ALTER, and DROP enable you to manage database schemas. DCL commands such as GRANT and REVOKE are used for managing permissions. As a data engineer, you will frequently use these commands to interact with databases.
- Basic queries: Writing queries to select, filter, sort, and join data is an essential skill for any data engineer. These operations form the basis of data extraction and manipulation.
- Aggregation functions: Functions such as COUNT, SUM, AVG, MAX, and MIN, combined with GROUP BY, are used to perform calculations on multiple rows of data. They are essential for generating reports and deriving statistical insights, which are critical aspects of a data engineer's role. A short query tying these foundations together follows this list.
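To make the foundational pieces above concrete, here is a short, hedged example; the `orders` and `customers` tables and their columns are invented for illustration and are not from the book:

```sql
-- Hypothetical schema: total spend per customer on completed orders, largest first.
-- Logical evaluation follows the order of operations listed above:
-- FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT.
SELECT
    c.customer_id,
    COUNT(*)      AS order_count,
    SUM(o.amount) AS total_spend
FROM orders AS o
JOIN customers AS c
    ON c.customer_id = o.customer_id
WHERE o.status = 'COMPLETED'          -- row-level filter (WHERE)
GROUP BY c.customer_id
HAVING SUM(o.amount) > 1000           -- group-level filter (HAVING)
ORDER BY total_spend DESC
LIMIT 10;
```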
The following section will dive deeper into must-know advanced SQL concepts, exploring advanced techniques to elevate your SQL proficiency. Get ready to level up your SQL game and unlock new possibilities in data engineering!

Must-know advanced SQL concepts

This section will explore advanced SQL concepts that will elevate your data engineering skills to the next level. These concepts will empower you to tackle complex data analysis, perform advanced data transformations, and optimize your SQL queries. Let's delve into the must-know advanced SQL concepts, as follows:

- Window functions: These perform a calculation over a set of rows that are related to the current row. They are needed for more complex analyses, such as computing running totals or moving averages, which are common tasks in data engineering.
- Subqueries: Queries nested within other queries. They provide a powerful way to perform complex data extraction, transformation, and analysis, often making your code more efficient and readable.
- Common Table Expressions (CTEs): CTEs can simplify complex queries and make your code more maintainable. They are also essential for recursive queries, which are sometimes necessary for problems involving hierarchical data.
- Stored procedures and triggers: Stored procedures help encapsulate frequently performed tasks, improving efficiency and maintainability. Triggers can automate certain operations, improving data integrity. Both are important tools in a data engineer's toolkit.
- Indexes and optimization: Indexes speed up query performance by enabling the database to locate data more quickly. Understanding how and when to use indexes is key for a data engineer, as it affects the efficiency and speed of data retrieval.
- Views: Views simplify access to data by encapsulating complex queries. They can also enhance security by restricting access to certain columns. As a data engineer, you'll create and manage views to facilitate data access and manipulation.

By mastering these advanced SQL concepts, you will have the tools and knowledge to handle complex data scenarios, optimize your SQL queries, and derive meaningful insights from your datasets. The following section will prepare you for technical interview questions on SQL. We will equip you with example answers and strategies to excel in SQL-related interview discussions. Let's further enhance your SQL expertise and be well prepared for the next phase of your data engineering journey.

Technical interview questions

This section will address technical interview questions specifically focused on SQL for data engineers. These questions will help you demonstrate your SQL proficiency and problem-solving abilities. Let's explore a combination of primary and advanced SQL interview questions and the best methods to approach and answer them, as follows:

Question 1: What is the difference between the WHERE and HAVING clauses?
Answer: The WHERE clause filters data based on conditions applied to individual rows, while the HAVING clause filters data based on grouped results. Use WHERE for filtering before aggregating data and HAVING for filtering after aggregating data; the sketch below illustrates the contrast.
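A minimal illustration of Question 1, reusing the same hypothetical `orders` table introduced earlier (table and column names are invented):

```sql
-- WHERE filters individual rows before grouping...
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
WHERE order_date >= '2024-01-01'      -- row-level condition
GROUP BY customer_id
-- ...while HAVING filters the aggregated groups afterwards.
HAVING SUM(amount) > 500;
```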
Question 2: How do you eliminate duplicate records from a result set?
Answer: Use the DISTINCT keyword in the SELECT statement to eliminate duplicate records and retrieve unique values from a column or combination of columns.

Question 3: What are primary keys and foreign keys in SQL?
Answer: A primary key uniquely identifies each record in a table and ensures data integrity. A foreign key establishes a link between two tables, referencing the primary key of another table to enforce referential integrity and maintain relationships.

Question 4: How can you sort data in SQL?
Answer: Use the ORDER BY clause in a SELECT statement to sort data based on one or more columns. The ASC (ascending) keyword sorts data in ascending order, while the DESC (descending) keyword sorts it in descending order.

Question 5: Explain the difference between UNION and UNION ALL in SQL.
Answer: UNION combines result sets and removes duplicate records, while UNION ALL combines all records without eliminating duplicates. UNION ALL is faster than UNION because it does not involve the duplicate-elimination step.

Question 6: Can you explain what a self join is in SQL?
Answer: A self join is a regular join in which a table is joined to itself. This is often useful when the data is related within the same table. To perform a self join, we have to use table aliases to help SQL distinguish the left table from the right table.

Question 7: How do you optimize a slow-performing SQL query?
Answer: Analyze the query execution plan, identify bottlenecks, and consider strategies such as creating appropriate indexes, rewriting the query, or using query optimization techniques such as JOIN order optimization or subquery optimization.

Question 8: What are CTEs, and how do you use them?
Answer: CTEs are temporary, named result sets that can be referenced within a query. They enhance query readability, simplify complex queries, and enable recursive queries. Use the WITH keyword to define CTEs in SQL.

Question 9: Explain the ACID properties in the context of SQL databases.
Answer: ACID stands for Atomicity, Consistency, Isolation, and Durability. These are the fundamental properties that ensure database operations are reliable and transactional. Atomicity ensures that a transaction is handled as a single unit that either completes fully or not at all. Consistency ensures that a transaction moves the database from one valid state to another. Isolation ensures that concurrent transactions do not interfere with each other. Durability ensures that once a transaction is committed, its changes are permanent and can survive system failures.

Question 10: How can you handle NULL values in SQL?
Answer: Use the IS NULL or IS NOT NULL operator to check for NULL values. Additionally, you can use the COALESCE function to replace NULL values with alternative non-null values.

Question 11: What is the purpose of stored procedures and functions in SQL?
Answer: Stored procedures and functions are reusable pieces of SQL code that encapsulate a set of SQL statements. They promote code modularity, improve performance, enhance security, and simplify database maintenance.

Question 12: Explain the difference between a clustered and a non-clustered index.
Answer: A clustered index determines the physical order of the data in a table, which means that a table can have only one clustered index. The data rows of the table are stored in the leaf nodes of the clustered index. A non-clustered index, on the other hand, does not change the order of the data in the table. Instead, it maintains a separate structure of sorted pointers that reference the original table rows, and a table can have more than one non-clustered index. A hedged example of both follows.
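As a sketch of Question 12 in SQL Server syntax (the `orders` table and index names are invented; other databases express this differently):

```sql
-- A clustered index defines the physical row order, so each table can have at most one.
CREATE TABLE orders (
    order_id    INT  NOT NULL,
    customer_id INT  NOT NULL,
    order_date  DATE NOT NULL,
    CONSTRAINT pk_orders PRIMARY KEY CLUSTERED (order_id)
);

-- A non-clustered index is a separate structure of sorted pointers back to the rows;
-- a table can have many of these.
CREATE NONCLUSTERED INDEX ix_orders_customer
    ON orders (customer_id, order_date);
```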
Prepare for these interview questions by understanding the underlying concepts, practicing SQL queries, and being able to explain your answers.

Conclusion

This article explored the foundational and advanced principles of SQL that empower data engineers to store, manipulate, transform, and migrate data confidently. Understanding these concepts unlocks the door to seamless data operations, optimized query performance, and insightful data analysis. SQL is the language that bridges the gap between raw data and valuable insights. With a solid grasp of SQL, you possess the skills to navigate databases, write powerful queries, and design efficient data models. Whether preparing for interviews or tackling real-world data engineering challenges, the knowledge you have gained in this chapter will propel you toward success. Remember to continue exploring and honing your SQL skills. Stay updated with emerging SQL technologies, best practices, and optimization techniques to stay at the forefront of the ever-evolving data engineering landscape. Embrace the power of SQL as a critical tool in your data engineering arsenal, and let it empower you to unlock the full potential of your data.

Author Bio

Kedeisha Bryan is a data professional with experience in data analytics, science, and engineering. She has prior experience combining Six Sigma and analytics to provide data solutions that have impacted policy changes and leadership decisions. She is fluent in tools such as SQL, Python, and Tableau. She is the founder and leader of the Data in Motion Academy, providing personalized skill development, resources, and training at scale to aspiring data professionals across the globe. Her other work includes another Packt book in progress and a SQL course for LinkedIn Learning.

Taamir Ransome is a data scientist and software engineer. He has experience in building machine learning and artificial intelligence solutions for the US Army. He is also the founder of the Vet Dev Institute, where he currently provides cloud-based data solutions for clients. He holds a master's degree in Analytics from Western Governors University.


Connecting Cloud Object Storage with Databricks Unity Catalog

Pulkit Chadha
22 Oct 2024
10 min read
This article is an excerpt from the book Data Engineering with Databricks Cookbook, by Pulkit Chadha. This book shows you how to use Apache Spark, Delta Lake, and Databricks to build data pipelines, manage and transform data, optimize performance, and more. Additionally, you'll implement DataOps and DevOps practices and orchestrate data workflows.

Introduction

Databricks Unity Catalog allows you to manage and access data in cloud object storage using a unified namespace and a consistent set of APIs. With Unity Catalog, you can do the following:

- Create and manage storage credentials, external locations, storage locations, and volumes using SQL commands or the Unity Catalog UI
- Access data from various cloud platforms (AWS S3, Azure Blob Storage, or Google Cloud Storage) and storage formats (Parquet, Delta Lake, CSV, or JSON) using the same SQL syntax or Spark APIs
- Apply fine-grained access control and data governance policies to your data using Databricks SQL Analytics or Databricks Runtime

In this article, you will learn what Unity Catalog is and how it integrates with AWS S3.

Getting ready

Before you start setting up and configuring Unity Catalog, you need the following prerequisites:

- A Databricks workspace with administrator privileges
- A Databricks workspace with the Unity Catalog feature enabled
- A cloud storage account (such as AWS S3, Azure Blob Storage, or Google Cloud Storage) with the necessary permissions to read and write data

How to do it…

In this section, we will first create a storage credential using an IAM role with access to an S3 bucket. Then, we will create an external location in Databricks Unity Catalog that uses the storage credential to access the S3 bucket.

Creating a storage credential

You must create a storage credential to access data from an external location or a volume. In this example, you will create a storage credential that uses an IAM role to access the S3 bucket. The steps are as follows:

1. Go to Catalog Explorer: Click on Catalog in the left panel to go to Catalog Explorer.
2. Create the storage credential: Click on +Add and select Add a storage credential. (Figure 10.1 – Add a storage credential)
3. Enter the storage credential details: Give the credential a name, provide the ARN of the IAM role that allows Unity Catalog to access the storage location on your cloud tenant, add a comment if you want, and click on Create. (Figure 10.2 – Create a new storage credential)
Important note: To learn more about IAM roles in AWS, you can reference the user guide here: https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html.
4. Get the External ID: In the Storage credential created dialog, copy the External ID value and click on Done. (Figure 10.3 – External ID for the storage credential)
5. Update the trust policy with the External ID: Update the trust policy associated with the IAM role and add the External ID value for sts:ExternalId. (Figure 10.4 – Updated trust policy with External ID)

Creating an external location

An external location contains a reference to a storage credential and a cloud storage path. You need to create an external location to access data from a custom storage location that Unity Catalog uses to reference external tables. In this example, you will create an external location that points to the de-book-ext-loc folder in an S3 bucket. To create an external location, you can follow these steps:

1. Go to Catalog Explorer: Click on Catalog in the left panel to go to Catalog Explorer.
2. Create the external location: Click on +Add and select Add an external location. (Figure 10.5 – Add an external location)
3. Pick an external location creation method: Select Manual and then click on Next. (Figure 10.6 – Create a new external location)
4. Enter the external location details: Enter the external location name, select the storage credential, and enter the S3 URL; then, click on the Create button. (Figure 10.7 – Create a new external location manually)
5. Test the connection: Test the connection to make sure you have set up the credentials accurately and that Unity Catalog is able to access cloud storage. (Figure 10.8 – Test connection for external location) If everything is set up right, you should see the test results screen and can click on Done. (Figure 10.9 – Test connection results)
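The article notes that these objects can also be managed with SQL commands rather than the UI. As a hedged sketch of what that path might look like once the storage credential exists, where the credential name, location name, bucket path, and group are placeholders rather than values from the book:

```sql
-- Create an external location that pairs a cloud path with an existing storage credential.
CREATE EXTERNAL LOCATION IF NOT EXISTS de_book_ext_loc
URL 's3://my-bucket/de-book-ext-loc'
WITH (STORAGE CREDENTIAL de_book_storage_credential)
COMMENT 'External location for the data engineering examples';

-- Grant a group permission to read files through this location.
GRANT READ FILES ON EXTERNAL LOCATION de_book_ext_loc TO `data-engineers`;
```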
See also

- Databricks Unity Catalog: https://www.databricks.com/product/unity-catalog
- What is Unity Catalog: https://docs.databricks.com/en/data-governance/unity-catalog/index.html
- Databricks Unity Catalog documentation: https://docs.databricks.com/en/compute/access-mode-limitations.html
- Databricks SQL documentation: https://docs.databricks.com/en/data-governance/unity-catalog/create-tables.html
- Databricks Unity Catalog: A Comprehensive Guide to Features, Capabilities, and Architecture: https://atlan.com/databricks-unity-catalog/
- Step By Step Guide on Databricks Unity Catalog Setup and its Key Features: https://medium.com/@sauravkum780/step-by-step-guide-on-databricks-unity-catalog-setup-and-its-features-1d0366c282b7

Conclusion

In summary, connecting to cloud object storage using Databricks Unity Catalog provides a streamlined approach to managing and accessing data across various cloud platforms such as AWS S3, Azure Blob Storage, and Google Cloud Storage. By utilizing a unified namespace, consistent APIs, and powerful governance features, Unity Catalog simplifies the process of creating and managing storage credentials and external locations. With built-in fine-grained access controls, you can securely manage data stored in different formats and cloud environments, all while leveraging Databricks' powerful data analytics capabilities. This guide walked through setting up an IAM role and creating an external location in AWS S3, demonstrating how easy it is to connect cloud storage with Unity Catalog.

Author Bio

Pulkit Chadha is a seasoned technologist with over 15 years of experience in data engineering. His proficiency in crafting and refining data pipelines has been instrumental in driving success across diverse sectors such as healthcare, media and entertainment, hi-tech, and manufacturing. Pulkit's tailored data engineering solutions are designed to address the unique challenges and aspirations of each enterprise he collaborates with.


Mastering Semi-Structured Data in Snowflake

Serge Gershkovich
27 Jun 2024
7 min read
This article is an excerpt from the book Data Modeling with Snowflake, by Serge Gershkovich. Discover how Snowflake's unique objects and features can be used to leverage universal modeling techniques through real-world examples and SQL recipes.

Introduction

In the era of big data, the ability to efficiently manage and analyze semi-structured data is crucial for businesses. Snowflake, a leading cloud-based data platform, offers robust features to handle semi-structured data formats like JSON, Avro, and Parquet. This article explores the benefits of using the VARIANT data type in Snowflake and provides a hands-on guide to managing semi-structured data.

The Benefits of Semi-Structured Data in Snowflake

Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, where a precise entity structure must be predefined, semi-structured data can adapt to include or omit attributes as needed, as long as they are properly nested within the corresponding parent objects. For example, consider the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. Some contacts may have multiple phone numbers, while others have only one. Some entries might include an email address and street address, while others have just a number and a vague description.

To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational table. Snowflake optimizes how VARIANT data is stored internally, ensuring better compression and faster access. Semi-structured data can sit alongside relational data in the same table, and users can access it using basic extensions to standard SQL, achieving similar performance. Another compelling reason to use the VARIANT data type is its adaptability to change. If columns are added or removed from semi-structured data, there is no need to modify ELT (extract, load, and transform) pipelines. The VARIANT data type does not require schema changes, and read operations will not fail for an attribute that no longer exists.

Getting Hands-On with Semi-Structured Data

Let's delve into a practical example of working with semi-structured data in Snowflake. This example uses JSON data representing information about pirates, such as details about the crew, weapons, and their ship. All this information is stored in a single VARIANT column. In relational data, a row represents a single entity; in semi-structured data, a row can represent an entire file containing multiple entities.

Creating a Table for Semi-Structured Data

Here is a sample SQL script to create a table for semi-structured data:

```sql
CREATE TABLE pirates_data (
    id NUMBER AUTOINCREMENT PRIMARY KEY,
    load_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    data VARIANT
);
```

In this example, the `AUTOINCREMENT` keyword generates a unique ID for each record inserted, and the `VARIANT` column stores the semi-structured JSON data.

Loading Semi-Structured Data

To load semi-structured data into Snowflake, you can use the `COPY INTO` command. Here's an example of how to load JSON data from an external stage into the `data` column of the `pirates_data` table (a transformation is used so that the other columns keep their default values):

```sql
COPY INTO pirates_data (data)
FROM (SELECT $1 FROM @my_stage/pirates_data.json)
FILE_FORMAT = (TYPE = 'JSON');
```
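If you want to try the table out without configuring a stage, a small document can also be inserted directly with PARSE_JSON. This sketch and its sample values are illustrative and are not part of the book's dataset:

```sql
-- Insert one hypothetical JSON document into the VARIANT column;
-- id and load_time fall back to their column defaults.
INSERT INTO pirates_data (data)
SELECT PARSE_JSON($$
{
  "id": 1,
  "crew": [ { "id": 101, "name": "Anne", "rank": "Captain" } ],
  "weapons": ["cutlass", "flintlock"],
  "ship": { "name": "Queen Anne's Revenge", "type": "frigate" }
}
$$);
```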
Querying Semi-Structured Data

Once the data is loaded, you can query it using standard SQL. For instance, to extract specific attributes from the JSON data, you can use dot notation:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:crew AS crew,
    data:weapons AS weapons
FROM pirates_data;
```

This query extracts the `id`, `crew`, and `weapons` fields from the JSON data stored in the `data` column.

Converting Semi-Structured Data into Relational Data

Although semi-structured data offers flexibility, converting it into a relational format can provide better performance for certain queries. Snowflake allows you to transform VARIANT data into relational columns using the `FLATTEN` function. Here's an example of how to flatten a JSON array into a relational result:

```sql
SELECT
    value:id::NUMBER AS pirate_id,
    value:name::STRING AS name,
    value:rank::STRING AS rank
FROM pirates_data, LATERAL FLATTEN(input => data:crew);
```

This query converts the `crew` array from the JSON data into individual rows in a relational format, making it easier to query and analyze.

Schema-on-Read vs. Schema-on-Write

One of the main advantages of using the VARIANT data type in Snowflake is the flexibility of schema-on-read. This approach allows you to ingest data without a predefined schema and then define the schema at the time of reading the data. This contrasts with the traditional schema-on-write approach, where the schema must be defined before data ingestion.

Benefits of Schema-on-Read

- Flexibility: You can ingest data without worrying about its structure, which is particularly useful for unstructured or semi-structured data sources.
- Adaptability: Schema changes do not require re-ingestion of data, as the schema is applied at read time.
- Speed: Data can be loaded more quickly, as there is no need to enforce a schema during the ingestion process.

Example: Using Schema-on-Read with VARIANT Data

Here's an example demonstrating schema-on-read with semi-structured data in Snowflake:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    data:ship.name::STRING AS ship_name,
    data:ship.type::STRING AS ship_type
FROM pirates_data;
```

In this query, the schema is defined at read time, allowing you to extract specific attributes from the nested JSON data.

Handling Nested and Repeated Data

Snowflake's support for semi-structured data also extends to handling nested and repeated data structures. The FLATTEN function is particularly useful for working with such data, enabling you to transform nested arrays into a more manageable relational format.

Example: Flattening Nested Data

Consider a JSON structure where each pirate has a nested array of previous voyages. To flatten this nested data, you can use the following query:

```sql
SELECT
    data:id::NUMBER AS pirate_id,
    value:date::DATE AS voyage_date,
    value:destination::STRING AS voyage_destination
FROM pirates_data, LATERAL FLATTEN(input => data:previous_voyages);
```

This query extracts the nested `previous_voyages` array and converts it into individual rows in a relational format.
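When a flattened view of the data is queried repeatedly, it can be worth persisting the result as a relational table. This is a hedged sketch using the same sample columns as above; the target table name is invented rather than taken from the book:

```sql
-- Materialize the flattened crew array for faster, repeated relational querying.
CREATE OR REPLACE TABLE pirate_crew AS
SELECT
    p.data:id::NUMBER    AS pirate_id,
    c.value:id::NUMBER   AS crew_member_id,
    c.value:name::STRING AS crew_name,
    c.value:rank::STRING AS crew_rank
FROM pirates_data p,
     LATERAL FLATTEN(input => p.data:crew) c;
```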
Performance Considerations

When working with semi-structured data in Snowflake, it's important to consider the performance implications. While the VARIANT data type offers flexibility, it can also introduce overhead if not managed properly.

Tips for Optimizing Performance

- Use caching: Take advantage of Snowflake's caching mechanisms to reduce query times for frequently accessed data.
- Optimize queries: Write efficient SQL queries, avoiding unnecessary complexity and ensuring that only the required data is processed.
- Monitor usage: Regularly monitor your Snowflake usage and performance metrics to identify and address potential bottlenecks.

Conclusion

Handling semi-structured data in Snowflake using the VARIANT data type provides immense flexibility and performance benefits. Whether you are dealing with dynamically changing schemas or integrating semi-structured data with relational data, Snowflake's capabilities can significantly enhance your data management and analytics workflows. By leveraging the techniques outlined in this article, you can efficiently model and transform semi-structured data, unlocking new insights and value for your organization. For more detailed guidance and advanced techniques, refer to the book Data Modeling with Snowflake, which provides comprehensive insights into modern data modeling practices and Snowflake's powerful features.

Author Bio

Serge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development, from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.