Modern Data Architectures with Python

Product type Book
Published in Sep 2023
Publisher Packt
ISBN-13 9781801070492
Pages 318 pages
Edition 1st Edition
Author: Brian Lipp

Table of Contents (19 chapters)

Preface
Part 1: Fundamental Data Knowledge
Chapter 1: Modern Data Processing Architecture
Chapter 2: Understanding Data Analytics
Part 2: Data Engineering Toolset
Chapter 3: Apache Spark Deep Dive
Chapter 4: Batch and Stream Data Processing Using PySpark
Chapter 5: Streaming Data with Kafka
Part 3: Modernizing the Data Platform
Chapter 6: MLOps
Chapter 7: Data and Information Visualization
Chapter 8: Integrating Continuous Integration into Your Workflow
Chapter 9: Orchestrating Your Data Workflows
Part 4: Hands-on Project
Chapter 10: Data Governance
Chapter 11: Building out the Groundwork
Chapter 12: Completing Our Project
Index
Other Books You May Enjoy

Data Governance

Data governance is one of the most complex topics in the data field. It is the amalgamation of people, processes, and technology, and it lays the foundation for how data is created, modified, used, and deleted, and for who owns what data and in what capacity. My approach will be to cover some fundamental ideas and then walk through how to apply some of them. Why is data governance important? When joining a project, I have often found significant data governance issues, ranging from data quality to security or cataloging. Without data governance, you can see a wide variety of issues in your data. In this chapter, we’re going to cover the following main topics:

  • Databricks Unity Catalog
  • Data governance
  • Great Expectations

Technical requirements

The tooling used in this chapter is tied to the technology stack chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Python, AWS, and Databricks

As with many other chapters, this one assumes you have a working installation of Python 3.6 or above in your development environment. We will also assume you have an AWS account and have set up Databricks with it.

The Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything has been installed correctly. If this command produces the tool version, then everything is working correctly:

databricks --version

Now, let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will prompt you for your Databricks workspace host and the token you just created:

databricks configure --token
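If you prefer a non-interactive setup (for scripts or CI), the legacy databricks-cli can also read its credentials from environment variables instead of the configuration file. The host URL and token below are placeholders, and listing the workspace root is just one convenient way to confirm authentication works:

```shell
# Alternative to the interactive `databricks configure --token` flow:
# the CLI picks up credentials from these environment variables.
export DATABRICKS_HOST="https://<your-workspace>.cloud.databricks.com"  # your workspace URL
export DATABRICKS_TOKEN="<personal-access-token>"                       # token generated in the UI

# Sanity check: if authentication is configured correctly, this lists
# the top-level folders of your workspace instead of raising an error.
databricks workspace ls /
```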

We can quickly determine whether...

What is data governance?

Data governance is a buzzword that can have a very complex meaning. At its core, data governance is the processes and procedures that are in place around your data. Without data governance, you might see several data sources acting as a source of truth for the same key data items. This can cause data drift and inaccuracies in your data usage. To avoid this, the platform should identify the single source of truth for that specific data and define that source of truth for consumers to access. Another aspect of data governance is identifying the data owner, which ties back to some concepts in data mesh. Data owners are responsible for the quality of the data and resolving data issues. Data owners (or data stewards, as they are often called) are the point of contact for issues that users have with the data. Lastly, data governance can cover better metadata around your data, so that you can better understand, access, and secure your data. Metadata is data about...

Data catalogs

A data catalog is a central hub where users can find and understand your data. In the data catalog, you will find the connection between your business metadata and technical metadata. Business metadata describes the data in business terms, such as the owning line of business and business-friendly names; technical metadata might be the schema of the data, along with the server and database where it lives. Sometimes, data catalogs also include data lineage information and the data quality checks performed on the data. Many products on the market, such as Informatica, provide some of these capabilities, and in many cases, I have seen companies build their own custom data catalogs. At root, users need a place to find certified datasets and understand what they specifically mean. Why do we care so much about creating a central hub for metadata? When we open up this type of tooling to users, it changes how they interact with the data. A user can now explore and learn about...
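At its core, a catalog entry is just a record that ties business metadata to technical metadata for one dataset. A minimal sketch of that structure, with illustrative field names not taken from any particular catalog product:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset's entry in a toy data catalog."""
    # Business metadata: what the data means to the organization
    business_name: str
    line_of_business: str
    owner: str                                   # data owner / steward to contact
    # Technical metadata: where and how the data is stored
    server: str
    database: str
    schema: dict = field(default_factory=dict)   # column name -> type
    certified: bool = False                      # marked trustworthy by the owner

# A user searching the catalog finds the certified source of truth:
entry = CatalogEntry(
    business_name="Customer Orders",
    line_of_business="Sales",
    owner="chapter_10_user",
    server="warehouse.example.internal",
    database="sales",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    certified=True,
)
print(entry.owner, entry.certified)   # → chapter_10_user True
```

A real catalog adds search, lineage, and quality results on top, but the business-plus-technical pairing above is the essential shape.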

Practical lab

Problem 1: Create a table and give a user access to it through a group ACL.

Create a user and group called chapter_10_user and chapter_10_group, respectively.

Figure 10.5: Managing users


Let’s now create our user, chapter_10_user.

Figure 10.6: Configuring the user


Let’s now create our group, chapter_10_group; we can click the Groups tab and then click Add group.

Figure 10.7: Group management


Once you type in the name, click Save.

Figure 10.8: Adding a new group


Lastly, we need to add users to the group by clicking Add members.

Figure 10.9: Group membership


Search for chapter_10_user and add the user to the group.

Figure 10.10: Group new user


We can now see our user is added to our group.

Figure 10.11: Group user management


Now, let’s create...
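With the user and group in place, the remaining part of the lab is the table and the grant itself. As a sketch using Databricks SQL grant syntax (the table name here is illustrative; run this in a Databricks notebook or SQL editor as an admin or the table owner):

```sql
-- Create a simple table to grant access to (name is illustrative).
CREATE TABLE IF NOT EXISTS default.chapter_10_table (id INT, name STRING);

-- Grant read access to the group; members such as chapter_10_user inherit it.
GRANT SELECT ON TABLE default.chapter_10_table TO `chapter_10_group`;

-- Verify what the group is allowed to do on the table.
SHOW GRANTS `chapter_10_group` ON TABLE default.chapter_10_table;
```

Granting to the group rather than the user directly is the point of the exercise: access follows group membership, so adding or removing users from chapter_10_group updates their permissions without touching any grants.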

Summary

We have worked through many techniques for our data platform. Let’s take some time to review those ideas as we close this chapter. We discussed the ins and outs of data governance basics. We then transitioned to data catalogs and the importance of having a metadata catalog. Alongside data catalogs, we discussed data lineage, the evolutionary path of each column in our data. We next covered basic security on the Databricks platform using grants. We then tackled data quality, and testing for it, using the Great Expectations Python package. Data quality is a complex topic, and this approach addresses one direction; others include allowing users to report errors or using complex AI systems. Finally, we delved into Databricks Unity Catalog, an enhanced metastore product offering catalog capability across many workspaces, among many other growing features.
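Great Expectations’ API has shifted across versions, so rather than pin a particular call signature, here is a stdlib-only sketch of the pattern the chapter applies with that package: declare an expectation about a column, run it against the data, and get back a pass/fail result with enough detail to debug failures. The function and field names are illustrative, modeled loosely on the library’s style:

```python
def expect_column_values_to_not_be_null(rows, column):
    """Expectation-style check: every row must have a non-null value in `column`.

    Returns a result dict in the spirit of Great Expectations: whether the
    expectation passed, plus the indices of offending rows for debugging.
    """
    failures = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not failures, "failed_row_indices": failures}

orders = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},   # a data-quality problem
    {"order_id": 3, "amount": 4.50},
]

result = expect_column_values_to_not_be_null(orders, "amount")
print(result)   # → {'success': False, 'failed_row_indices': [1]}
```

The value of the pattern is that expectations are declarative and reusable: a suite of them can run on every batch of data, and a failing result points the data owner at exactly which records to fix.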

We have yet to cover all the theory chapters and will look at a comprehensive lab across two...
