
You're reading from Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Product type: Book
Published in: Oct 2021
Publisher: Packt
ISBN-13: 9781801077743
Edition: 1st Edition
Author: Manoj Kukreja

Manoj Kukreja is a Principal Architect at Northbay Solutions who specializes in creating complex Data Lakes and Data Analytics Pipelines for large-scale organizations such as banks, insurance companies, universities, and US/Canadian government agencies. Previously, he worked for Pythian, a large managed service provider, where he led the MySQL and MongoDB DBA group and supported large-scale data infrastructure for enterprises across the globe. With over 25 years of IT experience, he has delivered Data Lake solutions using all major cloud providers, including AWS, Azure, GCP, and Alibaba Cloud. On weekends, he trains groups of aspiring Data Engineers and Data Scientists on Hadoop, Spark, Kafka, and Data Analytics on AWS and Azure Cloud.


Creating a Delta Lake table

With the environment set up, we are ready to understand how Delta Lake works. Our Spark session holds a Spark DataFrame containing the data of the store_orders table that was ingested during the first run of the electroniz_batch_ingestion_pipeline:

Important Note

A Spark DataFrame is an immutable distributed collection of data. It contains rows and columns like a table in a relational database.

  1. At this point, you should be comfortable running instructions in notebook cells. New cells can be created using Ctrl + Alt + N. After entering each command, press Shift + Enter to run it.

    From here onwards, I will simply ask you to run the instructions with the assumption that you know how to create new cells and run commands. Invoke the following instructions to write the store_orders delta table:

    SCRATCH_LAYER_NAMESPACE="scratch"
    DELTA_TABLE_WRITE_PATH="wasbs://"+SCRATCH_LAYER_NAMESPACE+"@"+STORAGE_ACCOUNT...
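The write path above is truncated, but it follows Azure Blob Storage's wasbs URI convention: container, then storage account, then a path inside the container. The sketch below shows the full pattern with a hypothetical storage account name and table folder (both are placeholders, not values from the book); the Delta write itself is shown as it would typically look with the DataFrameWriter API:

```python
# Hypothetical values -- substitute your own storage account and folder.
SCRATCH_LAYER_NAMESPACE = "scratch"
STORAGE_ACCOUNT = "electronizstorage"  # hypothetical account name

# wasbs URI format: wasbs://<container>@<account>.blob.core.windows.net/<path>
DELTA_TABLE_WRITE_PATH = (
    "wasbs://" + SCRATCH_LAYER_NAMESPACE + "@" + STORAGE_ACCOUNT
    + ".blob.core.windows.net/store_orders"
)
print(DELTA_TABLE_WRITE_PATH)

# With a DataFrame named df in an active Spark session configured for
# Delta Lake, the write would typically be:
# df.write.format("delta").mode("overwrite").save(DELTA_TABLE_WRITE_PATH)
```

Writing in the `delta` format produces Parquet data files plus a `_delta_log` transaction log at that path, which is what gives the table ACID semantics.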