Apache Spark Deep Dive

One of the most fundamental questions for an architect is how they should store their data and what methodology they should use. For example, should they use a relational database, or should they use object storage? This chapter attempts to explain which storage pattern is best for your scenario. Then, we will go through how to set up Delta Lake, a hybrid approach to data storage in an object store. In most cases, we will stick to the Python API, but in some cases, we will have to use SQL. Lastly, we will cover the most important Apache Spark theory you need to know to build a data platform effectively.

In this chapter, we’re going to cover the following main topics:

  • How Spark manages its cluster
  • How Spark processes data
  • How cloud storage varies and what options are available
  • How to create and manage Delta Lake tables and databases

Technical requirements

The tooling that will be used in this chapter is tied to the tech stack that’s been chosen for this book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • AWS or Azure

Setting up your environment

Before we begin this chapter, let’s take some time to set up our working environment.

Python, AWS, and Databricks

As in previous chapters, this chapter assumes you have a working version of Python 3.6 or higher installed in your development environment. It also assumes you have set up an AWS account and have set up Databricks with that AWS account.

If you do not have a working Databricks setup, please refer to the following guide to get started: https://docs.databricks.com/en/getting-started/index.html.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything has been installed correctly. If the following command produces the tool’s version, then everything is working correctly:

databricks -v

Now, let’s set up authentication. First, go into the Databricks UI and generate a personal access...
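Once the token has been generated, the CLI can be configured with it. As a rough sketch of the typical flow (the workspace URL shown here is a placeholder), the tool prompts for the host and the token:

    databricks configure --token
    # Databricks Host (should begin with https://): https://<your-workspace>.cloud.databricks.com
    # Token: <paste the personal access token generated in the UI>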

Cloud data storage

Modern data storage in the cloud comes in many flavors. The main flavors are in three general areas: object storage, NoSQL, and relational data storage. Each type of data storage has its pros and cons and should be thoroughly evaluated when you’re making decisions.

Object storage

Object storage has become one of the most widely used storage methods. It comes with plenty of benefits and some significant concerns. Its advantages are its filesystem-like nature, its support for common file types, its massive scalability, and its relatively low cost. Moreover, object stores can hold structured and semi-structured data as well as files such as audio and video. However, object storage has some characteristics that should always be considered. How do you govern access to your object storage? This can be a significant task. Do you limit which technologies have access? Which files can you store, and how do you store them? How are things maintained...

Spark architecture

The Apache Spark architecture is complex, to say the least, and requires in-depth knowledge. However, you only need some background knowledge to be reasonably productive with Spark. So, first, let’s go through the basics of Apache Spark.

Introduction to Apache Spark

Spark is a popular parallel data processing framework built on the lessons learned from the Apache Hadoop project. Spark is written in Scala, a JVM language, but it supports other languages, including Python, R, and Java, to name a few. Spark can be used as the central processing component in any data platform, but other tools may be a better fit for your problem. The key thing to understand is that Spark is separate from your storage layer, which allows you to connect Spark to any storage technology you need. Similar tools include Flink, AWS Glue, and Snowflake (Snowflake uses a decoupled storage and compute pattern behind the scenes).
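To make this separation of compute and storage concrete, here is a minimal PySpark sketch (the bucket path is a placeholder, and on Databricks a SparkSession is already provided as spark):

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession; on Databricks this object already exists as `spark`
    spark = SparkSession.builder.appName("chapter_3_example").getOrCreate()

    # Read Parquet files straight from object storage -- the storage layer is
    # completely independent of the cluster doing the processing
    df = spark.read.parquet("s3://my-example-bucket/raw/events/")
    df.printSchema()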

Key components

Spark is a cluster-based in-memory...

Delta Lake

Delta Lake is an evolution in file storage for data tables that merges techniques from data lakes and data warehouse technology. Delta Lake builds on the existing Parquet format but adds a transaction log that provides ACID transactions, similar to a database. The key detail to understand is that with transactions, you gain parallelism and reliability, among other features.
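As a minimal sketch of the idea (the path here is a placeholder), writing a Delta table uses the familiar DataFrame API with delta as the format, and the transaction log is created alongside the Parquet files:

    # On Databricks, `spark` is already defined
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Writing in the delta format produces Parquet data files plus a _delta_log directory
    df.write.format("delta").mode("overwrite").save("dbfs:/tmp/example/events_delta")

    # The table can be read back like any other source
    spark.read.format("delta").load("dbfs:/tmp/example/events_delta").show()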

Transaction log

One massive adaptation Delta Lake brings to the traditional data lake is the concept of transactions. A transaction is a small chunk of work that is either accomplished in full or not at all. In layman’s terms, when you send data to a Delta table, you can be sure that the data is written as intended, without another user doing anything that interferes with that process. This avoids dirty reads, dirty writes, and inconsistent data. The main component of this feature is the transaction log, which is an ordered log of every transaction made on the table. The transaction log has six actions: adding...
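A simple way to see the transaction log at work is to inspect a table’s history; here is a small sketch (the path is a placeholder):

    from delta.tables import DeltaTable

    # Every committed transaction (write, merge, optimize, and so on) appears as a row in the history
    deltaTable = DeltaTable.forPath(spark, "dbfs:/tmp/example/events_delta")
    deltaTable.history().select("version", "timestamp", "operation").show()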

Adding speed with Z-ordering

Z-ordering is the process of co-locating related data in a common set of files. This can allow for data skipping and a significant reduction in processing time. Z-ordering is applied per column and, much like partitioning, should be used on the columns you filter your table by.

Here, we are applying Z-order to the whole table for a given column:

deltaTable.optimize().executeZOrderBy("COLUMN_NAME")

We can also use the where method to apply Z-ordering to a table slice:

deltaTable.optimize().where("date='YYYY-MM-DD'").executeZOrderBy("COLUMN_NAME")
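Putting these together, a minimal sketch looks as follows (the table path, date value, and column name are placeholders):

    from delta.tables import DeltaTable

    deltaTable = DeltaTable.forPath(spark, "dbfs:/tmp/example/events_delta")

    # Z-order the whole table on a column that is frequently used in filters
    deltaTable.optimize().executeZOrderBy("account_id")

    # Or restrict the optimize job to a partition (a slice of the table) before Z-ordering
    deltaTable.optimize().where("date = '2023-01-01'").executeZOrderBy("account_id")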

With that, we have looked at one type of performance enhancement for Delta tables: Z-ordering. Next, we will look at another critical performance enhancement, known as bloom filtering. What makes bloom filtering attractive is that it uses a space-efficient data structure that allows for data skipping.

Bloom filters

One way to increase read speed is to use bloom filters. A bloom filter is an...
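On Databricks, a bloom filter index is typically defined through SQL; the following is a rough sketch only (the table name, column, and option values are placeholders, and the exact syntax may vary by runtime version):

    # Create a bloom filter index on a column that is frequently used in point lookups
    spark.sql("""
        CREATE BLOOMFILTER INDEX ON TABLE my_database.my_table
        FOR COLUMNS (account_id OPTIONS (fpp = 0.1, numItems = 1000000))
    """)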

Practical lab

Now, let’s implement everything we’ve learned by going through some practical exercises.

Problem 1

Our team has been asked to create a new database and table for the accounting department’s BI tooling. The database should be called accounting_alpha and its location should be set to dbfs:/tmp/accounting_alpha. The table should be called returns_bronze. The schema of the table should be Date: Date, Account: String, Debit: Float, and Credit: Float.

Problem 2

Our team is now receiving data to be populated into the table. Perform an append using the new data provided:

[(datetime.strptime("1981-11-21", '%Y-%m-%d'), "Banking", 1000.0, 0.1), (datetime.strptime("1776-08-02", '%Y-%m-%d'), "Cash", 0.1, 3000.2), (datetime.strptime("1948-05-14", '%Y-%m-%d'), "Land", 0.5, 10000.5)]

Problem 3

There has been a forced change to the table. Using Python...

Solution

The solution to Problem 1 is as follows:

  1. Here, we will drop any residual database, then create a DataFrame and write the DataFrame as a table. First, we define the database name (the problem asks for accounting_alpha) and remove any residual copy:
    database_name = "accounting_alpha"
    spark.sql(f"DROP DATABASE IF EXISTS {database_name} CASCADE;")
  2. Now, we must import our libraries:
    from pyspark.sql.types import StructField, DateType, StringType, FloatType, StructType
  3. Next, we will create our database. We are defining the location of the database; all tables will be created in that location:
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database_name} LOCATION 'dbfs:/tmp/accounting_alpha';")
  4. Now, we can write our table. First, we will define our table’s name and the schema of the table:
    table_name = "returns_bronze"
    schema = StructType([StructField("Date", DateType(), True),
                         StructField...
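For reference only, here is a sketch of how the remaining schema definition and table write might look, using the names from the problem statement (this is an assumption, not the book’s exact code):

    schema = StructType([StructField("Date", DateType(), True),
                         StructField("Account", StringType(), True),
                         StructField("Debit", FloatType(), True),
                         StructField("Credit", FloatType(), True)])

    # Write an empty DataFrame with this schema as a managed table in the new database
    df = spark.createDataFrame([], schema)
    df.write.format("delta").saveAsTable(f"{database_name}.{table_name}")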

Summary

As we come to the end of this chapter, let’s reflect on some of the topics we have covered. We went through some of the fundamentals of cloud storage and took a deep dive into Delta tables, a very important technology for handling data. Finally, we learned how to improve the performance of our tables. In the next chapter, we will become familiar with batch and stream processing using Apache Spark.
