Modern Data Architectures with Python

Product type Book
Published in Sep 2023
Publisher Packt
ISBN-13 9781801070492
Pages 318 pages
Edition 1st Edition
Author: Brian Lipp

Table of Contents (19 chapters)

Preface
Part 1: Fundamental Data Knowledge
Chapter 1: Modern Data Processing Architecture
Chapter 2: Understanding Data Analytics
Part 2: Data Engineering Toolset
Chapter 3: Apache Spark Deep Dive
Chapter 4: Batch and Stream Data Processing Using PySpark
Chapter 5: Streaming Data with Kafka
Part 3: Modernizing the Data Platform
Chapter 6: MLOps
Chapter 7: Data and Information Visualization
Chapter 8: Integrating Continuous Integration into Your Workflow
Chapter 9: Orchestrating Your Data Workflows
Part 4: Hands-on Project
Chapter 10: Data Governance
Chapter 11: Building out the Groundwork
Chapter 12: Completing Our Project
Index
Other Books You May Enjoy

Data Governance

Data governance is one of the most complex topics in the data field. It is the amalgamation of people, processes, and technology, and it lays the foundation for how data is created, modified, used, and deleted, and for who owns what data and in what capacity. My approach will be to cover some fundamental ideas and then walk through how to apply some of them. Why is data governance important? When joining a project, I have often found significant data governance issues, ranging from data quality to security or cataloging. Without data governance, you can see a wide variety of issues in your data. In this chapter, we’re going to cover the following main topics:

  • Databricks Unity Catalog
  • Data governance
  • Great Expectations

Technical requirements

The tooling used in this chapter is tied to the technology stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Python, AWS, and Databricks

As with many other chapters, this one assumes you have a working installation of Python 3.6 or above in your development environment. We will also assume you have an AWS account and have set up Databricks with it.

The Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything has been installed correctly. If this command produces the tool version, then everything is working correctly:

databricks --version

Now, let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will prompt you for your Databricks workspace host and the token you just created:

databricks configure --token
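If you prefer a non-interactive setup (for scripts or CI), the legacy databricks-cli can also read its credentials from environment variables instead of the configuration file. The host URL and token below are placeholders, and listing the workspace root is just one convenient way to confirm authentication works:

```shell
# Alternative to the interactive `databricks configure --token` flow:
# the CLI picks up credentials from these environment variables.
export DATABRICKS_HOST="https://<your-workspace>.cloud.databricks.com"  # your workspace URL
export DATABRICKS_TOKEN="<personal-access-token>"                       # token generated in the UI

# Sanity check: if authentication is configured correctly, this lists
# the top-level folders of your workspace instead of raising an error.
databricks workspace ls /
```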

We can quickly determine whether...

What is data governance?

Data governance is a buzzword that can have a very complex meaning. At its core, data governance is the processes and procedures that are in place around your data. Without data governance, you might see several data sources acting as a source of truth for the same key data items. This can cause data drift and inaccuracies in your data usage. To avoid this, the platform should identify the single source of truth for that specific data and define that source of truth for consumers to access. Another aspect of data governance is identifying the data owner, which ties back to some concepts in data mesh. Data owners are responsible for the quality of the data and resolving data issues. Data owners (or data stewards, as they are often called) are the point of contact for issues that users have with the data. Lastly, data governance can cover better metadata around your data, so that you can better understand, access, and secure your data. Metadata is data about...

Data catalogs

A data catalog is a central hub where users can find and understand your data. In the data catalog, you will find the connection between your business metadata and technical metadata. Business metadata describes the data in business terms, such as the owning line of business and business-friendly names; technical metadata might be the schema of the data, along with the server and database where it lives. Sometimes, data catalogs also include data lineage information and the data quality checks performed on the data. Many products on the market, such as Informatica, provide some of these capabilities, and in many cases, I have seen companies build their own custom data catalogs. At root, users need a place to find certified datasets and understand what they specifically mean. Why do we care so much about creating a central hub for metadata? When we open up this type of tooling to users, it changes how they interact with the data. A user can now explore and learn about...
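At its core, a catalog entry is just a record that ties business metadata to technical metadata for one dataset. A minimal sketch of that structure, with illustrative field names not taken from any particular catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset's entry in a toy data catalog."""
    # Business metadata: what the data means to the organization
    business_name: str
    line_of_business: str
    owner: str                                   # data owner / steward to contact
    # Technical metadata: where and how the data is stored
    server: str
    database: str
    schema: dict = field(default_factory=dict)   # column name -> type
    certified: bool = False                      # marked trustworthy by the owner

# A user searching the catalog finds the certified source of truth:
entry = CatalogEntry(
    business_name="Customer Orders",
    line_of_business="Sales",
    owner="chapter_10_user",
    server="warehouse.example.internal",
    database="sales",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    certified=True,
)
print(entry.owner, entry.certified)   # → chapter_10_user True
```

A real catalog adds search, lineage, and quality results on top, but the business-plus-technical pairing above is the essential shape.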

Practical lab

Problem 1: Create a table and give a user access to it through a group ACL.

Create a user and group called chapter_10_user and chapter_10_group, respectively.

Figure 10.5: Managing users


Let’s now create our user, chapter_10_user.

Figure 10.6: Configuring the user


Let’s now create our group, chapter_10_group; we can click the Groups tab and then click Add group.

Figure 10.7: Group management


Once you type in the name, click Save.

Figure 10.8: Adding a new group


Lastly, we need to add users to the group by clicking Add members.

Figure 10.9: Group membership


Search for chapter_10_user and add the user to the group.

Figure 10.10: Group new user


We can now see our user is added to our group.

Figure 10.11: Group user management


Now, let’s create...
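With the user and group in place, the remaining part of the lab is the table and the grant itself. As a sketch using Databricks SQL grant syntax (the table name here is illustrative; run this in a Databricks notebook or SQL editor as an admin or the table owner):

```sql
-- Create a simple table to grant access to (name is illustrative).
CREATE TABLE IF NOT EXISTS default.chapter_10_table (id INT, name STRING);

-- Grant read access to the group; members such as chapter_10_user inherit it.
GRANT SELECT ON TABLE default.chapter_10_table TO `chapter_10_group`;

-- Verify what the group is allowed to do on the table.
SHOW GRANTS `chapter_10_group` ON TABLE default.chapter_10_table;
```

Granting to the group rather than the user directly is the point of the exercise: access follows group membership, so adding or removing users from chapter_10_group updates their permissions without touching any grants.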

Summary

We have worked through many techniques for our data platform. Let’s take some time to review those ideas as we close this chapter. We discussed the ins and outs of data governance basics. We then transitioned to data catalogs and the importance of having a metadata catalog. Alongside data catalogs, we discussed data lineage, the evolutionary path of each column in our data. We next covered basic security on the Databricks platform using grants. We then tackled data quality, and testing for it, using the Great Expectations Python package. Data quality is a complex topic, and this approach addresses one direction; others include allowing users to report errors or using complex AI systems. Finally, we delved into Databricks Unity Catalog, an enhanced metastore product offering catalog capability across many workspaces, among many other growing features.
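Great Expectations’ API has shifted across versions, so rather than pin a particular call signature, here is a stdlib-only sketch of the pattern the chapter applies with that package: declare an expectation about a column, run it against the data, and get back a pass/fail result with enough detail to debug failures. The function and field names are illustrative, modeled loosely on the library’s style:

```python
def expect_column_values_to_not_be_null(rows, column):
    """Expectation-style check: every row must have a non-null value in `column`.

    Returns a result dict in the spirit of Great Expectations: whether the
    expectation passed, plus the indices of offending rows for debugging.
    """
    failures = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not failures, "failed_row_indices": failures}

orders = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},   # a data-quality problem
    {"order_id": 3, "amount": 4.50},
]

result = expect_column_values_to_not_be_null(orders, "amount")
print(result)   # → {'success': False, 'failed_row_indices': [1]}
```

The value of the pattern is that expectations are declarative and reusable: a suite of them can run on every batch of data, and a failing result points the data owner at exactly which records to fix.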

We have yet to cover all the theory chapters and will look at a comprehensive lab across two...
