Chapter 6: Understanding Delta Lake

In the previous chapter, we created the bronze layer of the lakehouse. The bronze layer stores raw data in the native form as collected from the data sources. The problem is that raw data is not in a shape that can be readily consumed for analytical operations.

As a data engineer, it is your responsibility to convert raw data into a shape and form that is ready for use by analytical workloads. In this chapter, we will advance our learning further and cleanse raw data. The process of cleansing data involves applying logic that cleans and standardizes the data, followed by writing it to the silver layer of the lakehouse.

But that is not all – the silver layer should store data in an open format that supports ACID (atomicity, consistency, isolation, and durability) transactions. This is done by using the Delta Lake engine. Before we start building the silver layer, we need to completely understand some critical features of Delta Lake and...

Understanding how Delta Lake enables the lakehouse

In Chapter 2, Discovering Storage and Compute Data Lake Architectures, we talked about the trade-offs between traditional data warehouse systems and data lakes. In the last few years, many organizations have modernized their data engineering and analytics platforms by moving away from traditional data warehouses to data lakes. The move to the data lake has undoubtedly given them the flexibility to store and process data of any format at a large scale. However, these advantages did not come without a few sacrifices along the way. The biggest one is the reliability of data: unlike data warehouses, data lakes offer no transactional guarantees. This need led to the launch of an open source storage layer known as Delta Lake.

After its launch, experts came up with the idea of combining the resilience of data warehouses with the flexibility of data lakes, and they called the result a lakehouse. The combined power of a...

Understanding Delta Lake

As mentioned before, modern data lakes lack some critical features such as ACID transactions, indexing, and versioning, which can negatively affect the reliability, quality, and performance of data.

Important Note

In data warehouse terms, ACID is the short form for atomicity, consistency, isolation, and durability of data. These properties are intended to make database transactions accurate, reliable, and permanent.

The following diagram depicts the properties:

Figure 6.2 – ACID properties in Delta Lake

Delta Lake functions as a layer on top of Apache Spark, the distributed computing framework. By design, Apache Spark does not provide the following key guarantees of transaction management:

  • It does not lock existing data during write transactions, which means data may become unavailable to readers for a very brief period during overwrites.
  • During data overwrites, there is a chance where the old data gets deleted yet the new data...
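
Delta Lake fills these gaps by recording every change as an atomic commit in a transaction log stored alongside the table data. As a quick way to see this, you can list the _delta_log directory of any delta table. The following is a minimal sketch that assumes a Databricks notebook (for dbutils) and an illustrative table path:

    # Peek at the transaction log of a delta table (the path is illustrative)
    delta_table_path = "wasbs://scratch@<storage_account>.blob.core.windows.net/store_orders"
    for entry in dbutils.fs.ls(delta_table_path + "/_delta_log"):
        print(entry.name)  # 00000000000000000000.json, 00000000000000000001.json, ...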

Creating a Delta Lake table

With the environment set up, we are ready to understand how Delta Lake works. In our Spark session, we have a Spark DataFrame that stores the data of the store_orders table that was ingested at the first iteration of the electroniz_batch_ingestion_pipeline run:

Important Note

A Spark DataFrame is an immutable distributed collection of data. It contains rows and columns like a table in a relational database.
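
For instance, a small DataFrame can be created directly from in-memory rows. This sketch uses illustrative columns and is not the book's ingested data:

    # Build a small DataFrame from in-memory rows (columns are illustrative)
    sample_df = spark.createDataFrame(
        [(1, 98.41), (2, 12.99)],
        ["order_number", "sale_price"]
    )
    sample_df.show()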

  1. At this point, you should be comfortable running instructions in notebook cells. New cells can be created using Ctrl + Alt + N. After each command, you need to press Shift + Enter to run the command.

    From here onwards, I will simply ask you to run the instructions with the assumption that you know how to create new cells and run commands. Invoke the following instructions to write the store_orders delta table:

    SCRATCH_LAYER_NAMESPACE="scratch"
    DELTA_TABLE_WRITE_PATH="wasbs://"+SCRATCH_LAYER_NAMESPACE+"@"+STORAGE_ACCOUNT...
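
    The command above is cut off. A minimal sketch of what the complete write typically looks like follows; the storage account name and the DataFrame variable name are assumptions, not the book's exact code:

    # Illustrative delta write (variable names are assumed for this sketch)
    SCRATCH_LAYER_NAMESPACE = "scratch"
    STORAGE_ACCOUNT_NAME = "traininglakehouse"  # assumed; matches the cleanup section
    DELTA_TABLE_WRITE_PATH = ("wasbs://" + SCRATCH_LAYER_NAMESPACE + "@"
                              + STORAGE_ACCOUNT_NAME + ".blob.core.windows.net/store_orders")

    (store_orders_df.write          # store_orders_df is the ingested DataFrame (assumed name)
        .format("delta")
        .mode("overwrite")
        .save(DELTA_TABLE_WRITE_PATH))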

Changing data in an existing Delta Lake table

In the previous section, we saw how new data can be written to a Delta Lake table. But transactions in Delta Lake are not only about new data; we may need to update and delete data as well. In this section, we will find out how Delta Lake reacts to changes in existing data:

  1. To highlight the effect of changes to Delta Lake tables, we will work with a sample row in the store_orders table as follows:
    %sql
    SELECT * FROM store_orders WHERE order_number=5; 

    This results in the following output:

    Figure 6.22 – Checking the data in the store_orders table for the sample row

  2. Now, we will update this row and change the sale_price value from 98.41 to 90.50:
    %sql
    UPDATE store_orders SET sale_price=90.50 WHERE order_number=5;
  3. Check if the sale_price value got updated in the store_orders table:
    %sql
    SELECT * FROM store_orders WHERE order_number=5;

    This results in the following output:

    Figure 6.23 – Checking the...
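
The same update can also be issued programmatically with the Delta Lake Python API instead of a %sql cell. This is a minimal sketch that assumes store_orders is registered as a table in the metastore:

    from delta.tables import DeltaTable

    # Programmatic equivalent of the UPDATE statement above
    store_orders_delta = DeltaTable.forName(spark, "store_orders")
    store_orders_delta.update(
        condition="order_number = 5",
        set={"sale_price": "90.50"}
    )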

Performing time travel

A unique feature of Delta Lake is its ability to perform time travel. By using this feature, you can query and restore previous snapshots of your table. Access to previous snapshots is granted by using the versionAsOf option.

Important Note

The time travel functionality in Delta Lake supports data lineage and auditing. Data lineage is an extremely critical tool for data audits and compliance purposes. The same feature comes in handy for data engineers who are trying to trace data anomalies.

  1. This is how you can query previous versions of the delta table. In this example, we are querying version 0 of the table – in other words, when it was created:
    %sql
    SELECT * FROM store_orders VERSION AS OF 0 WHERE order_number=5;

    This results in the following output:

    Figure 6.25 – Checking the data in the store_orders table for the sample row for version 0

    Notice how the previous version of the table shows the sale_price value as 98.41.

  2. This time we will delete...
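
Time travel is not limited to SQL. The same versionAsOf option can be used with the DataFrame reader; this is a minimal sketch that assumes the table path used in the earlier write sketch:

    # Read version 0 of the delta table with the DataFrame reader (path is assumed)
    v0_df = (spark.read
        .format("delta")
        .option("versionAsOf", 0)
        .load(DELTA_TABLE_WRITE_PATH))

    v0_df.filter("order_number = 5").show()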

Performing upserts of data

Delta Lake also supports the upsert operation using the MERGE clause. The idea of an upsert is simply to merge data in the existing table with new data originating from the source. It is very common in a lakehouse to ingest incremental data and merge it with the pre-existing table in Delta Lake. The MERGE process is repeated each time new data gets ingested and is instrumental in creating the single source of truth of data in the silver layer of the lakehouse.

Important Note

An UPSERT operation is a mix of updates and inserts in the same operation. Using the MERGE clause, you can insert a row into a table if it does not already exist, or simply update the existing data if it does.
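
Before walking through the notebook steps, here is a minimal sketch of what such an upsert looks like with the Delta Lake Python API; the incremental DataFrame name and the join key are assumptions for illustration, not the book's exact code:

    from delta.tables import DeltaTable

    # Upsert an incremental batch into store_orders (names are illustrative)
    store_orders_delta = DeltaTable.forName(spark, "store_orders")

    (store_orders_delta.alias("target")
        .merge(
            incremental_orders_df.alias("source"),
            "target.order_number = source.order_number")
        .whenMatchedUpdateAll()      # update rows that already exist in the table
        .whenNotMatchedInsertAll()   # insert rows that are new
        .execute())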

  1. Before we start the MERGE process, let's find out the current number of rows in the store_orders table:
    %sql
    SELECT count(*) FROM store_orders;

    This results in the following output:

    Figure 6.30 – Checking the count of rows in store_orders

  2. Now, we will read the incrementally...

Understanding isolation levels

In Delta Lake, the isolation level of a table defines the degree to which the transaction must be isolated from modifications that are being made by concurrent transactions. There are two isolation levels, as follows:

  • The Serializable isolation level is the stronger of the two, which means both the write and read operations exist in a serial sequence.
  • The WriteSerializable isolation level is the default. In this isolation level, only the write operations exist in a serial sequence.

We will check them out as follows:

  1. The isolation level can be queried for every transaction in history:
    %sql
    SELECT version, operation, isolationLevel
    FROM (DESCRIBE HISTORY store_orders);

    This results in the following output:

    Figure 6.40 – Transaction history isolationLevel

  2. Although the default isolation level for delta tables is WriteSerializable, it can be changed as desired. We can alter the isolation of the store_orders table to the Serializable isolation...
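
    The exact statement is cut off above. Delta Lake exposes this setting through the delta.isolationLevel table property, so the change can be made roughly as follows (shown here via spark.sql; a %sql cell works equally well):

    # Switch the table to the stronger Serializable isolation level
    spark.sql("""
        ALTER TABLE store_orders
        SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
    """)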

Understanding concurrency control

By now, hopefully, we have a good understanding of how ACID compliance is achieved in Delta Lake using the transaction log. Up until now, all operations during this exercise were performed by the same user. However, in a real-world scenario, you can have multiple users trying to read and write to the same delta table at the same time. The following diagram shows how concurrency controls are implemented in Delta Lake:

Figure 6.43 – Concurrency control in Delta Lake

This is better represented as follows:

  • Assume User 1 and User 2 perform a write operation to a delta table at the same time.
  • Delta Lake records the version of the table before any change has been made to the delta table.
  • The write from User 1 commits successfully, and a success message is sent back.
  • The write from User 2 does not fail. Instead, Delta Lake tries to silently resolve conflicts from User 2's transaction against the previously committed data. If no conflicts...
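
When a conflict cannot be resolved, Delta Lake raises a concurrency exception that the losing writer can catch and retry. This is a minimal sketch: append_orders is a hypothetical helper that writes to the table, and the exception import assumes a reasonably recent delta-spark release:

    from delta.exceptions import ConcurrentAppendException

    # Retry the write if a concurrent transaction conflicts with it (illustrative)
    for attempt in range(3):
        try:
            append_orders(incremental_orders_df)  # hypothetical write helper
            break
        except ConcurrentAppendException:
            print(f"Conflict on attempt {attempt + 1}, retrying...")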

Cleaning up Azure resources

For this training exercise, we created a scratch namespace in Azure Data Lake Storage. This namespace is not required for the rest of the book, so you may delete it:

  1. Invoke the following commands in the Cloud Shell to delete the scratch namespace:
    SCRATCH_NAMESPACE="scratch"
    STORAGEACCOUNTNAME="traininglakehouse"
    az storage fs delete -n $SCRATCH_NAMESPACE --account-name $STORAGEACCOUNTNAME --yes
  2. Invoke the following commands in the Cloud Shell to delete the Azure Databricks workspace:
    RESOURCEGROUPNAME="training_rg"
    WORKSPACE="trainingdatabricks"
    az config set extension.use_dynamic_install=yes_without_prompt
    az databricks workspace delete --resource-group $RESOURCEGROUPNAME --name $WORKSPACE

The resources have now been deleted and we should be good to go!

Summary

In this chapter, we learned about Delta Lake, the driving force behind the modern lakehouse architecture. We learned in detail about some of its key features, such as ACID compliance and open standards. After that, we performed common delta table operations and saw how Delta Lake controls isolation levels and concurrency.

In Chapter 7, Data Curation Stage – The Silver Layer, we will put the knowledge gained in this chapter to good use. Using Delta Lake, we will create the silver layer of the Electroniz lakehouse. You may recall from previous chapters that the silver layer in the lakehouse stores the curated, deduplicated, and standardized data. The silver layer represents the single source of truth of data, stored in a state that is readily consumable by diverse data analytics workloads.
