You're reading from Cloud Scale Analytics with Azure Data Services

Product type: Book

Published in: Jul 2021

Publisher: Packt

ISBN-13: 9781800562936

Edition: 1st Edition

Author (1)

Patrik Borosch

Chapter 14: Establishing Data Governance

In the modern data warehouse architecture, with its various options to land and store data, you no longer have one database where your single version of the truth resides. This makes it harder to keep track of the content, the relationships between entities, and their sensitivity.

Current regulations such as the General Data Protection Regulation (GDPR) require you to be able to locate your customer data and all the related information in case the customer asks for a report or, more importantly, when the customer requests deletion.

But it's not only regulatory requirements that force you to gain insights into your data. As you add more and more datasets in an increasingly diverse business, you will need additional tools – tools that enable you to find your way through your data lake, your data warehouse, and, of course, the sources from which you extract the data to create all the analytics...

Technical requirements

To be able to give Azure Purview a try, you will need the following:

An Azure subscription with the rights to provision a Purview service.
A Synapse workspace where you own the System Administrator RBAC role (see Chapter 4, Understanding SQL Pools and SQL Options).
An Azure Data Lake Storage account, perhaps from previous chapters such as Chapter 3, Understanding the Data Lake Storage Layer. You will need at least reading permissions on the data lake.

Discovering Azure Purview

With the Azure Purview preview, Microsoft introduces the first wave of data governance tools. You will find modules for the following:

Data scanning, cataloging, and search
Data classification and glossary
Data lineage including Data Factory/Synapse pipelines and Power BI
Metadata insights

Purview integrates with Synapse so that you can use the Purview search functionality within the Synapse workspace. A Synapse developer can use the search results, for example, to start a SQL script or a Spark Notebook in the same way as from the Data Lake browser in the Synapse workspace. Remember what you saw in Chapter 10, Loading the Presentation Layer?

Microsoft is not only targeting data governance; it also aims to improve developers' productivity and experience.

Provisioning the service

But let's start with the service and provision it first:

Please go to the Azure portal and type...

Classifying data

Please examine the airdelayspredict.csv file that you created in Chapter 9, Integrating Azure Cognitive Services and Machine Learning. You will find a schema classification on the Overview tab or in the Schema view: U.S. State Name for DEST_STATE_ABR and ORIGIN_STATE_ABR. This is one of the predefined classification rules that are available in Purview.

Let's discover where you can find them and how you can add your own classifications and classification rules.

Please navigate to the Management hub of your Purview Studio and find the Metadata management section with its two nodes, Classification and Classification rules.

Please examine the two nodes. You will find the System and Custom categories in these lists.

The System list already shows more than 100 different classifications, such as Age of an individual or different patterns for passport numbers, and much more. Some of them are pattern-oriented and implement a regular expression to recognize the pattern...
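To make the idea of a pattern-oriented classification more concrete, here is a minimal sketch in Python. It is not how Purview implements its rules; the regular expression, the small subset of state abbreviations, the `min_match_ratio` threshold, and all function names are assumptions chosen for illustration only.

```python
import re

# Hypothetical rule: a two-letter uppercase code that also appears in a
# small, deliberately incomplete subset of U.S. state abbreviations.
STATE_PATTERN = re.compile(r"^[A-Z]{2}$")
KNOWN_STATES = {"CA", "NY", "TX", "WA", "FL"}


def matches_rule(value):
    """Return True if a single value looks like a state abbreviation."""
    return bool(STATE_PATTERN.match(value)) and value in KNOWN_STATES


def classify_column(values, min_match_ratio=0.6):
    """Classify a column when enough of its non-empty values match the rule.

    min_match_ratio is an assumed threshold for this sketch.
    """
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return False
    hits = sum(1 for v in non_empty if matches_rule(v))
    return hits / len(non_empty) >= min_match_ratio


# Made-up sample values for two columns
sample_dest_state = ["CA", "NY", "TX", "", "WA", "zz"]
sample_carrier = ["AA", "DL", "UA", "WN"]

print(classify_column(sample_dest_state))  # 4 of 5 non-empty values match
print(classify_column(sample_carrier))     # no value matches
```

Using a match ratio instead of requiring every value to match keeps a few dirty values from hiding an otherwise obvious classification.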

Integrating with Azure services

Purview already integrates with other services, such as Synapse, Azure Data Factory, and Power BI. This integration helps you to leverage the insights that Purview can deliver in different use cases.

Integrating with Synapse

With the Synapse integration, for example, you will be able to search the Purview catalog from within Synapse Studio. But this is not the only advantage! From the results of your search, you will be able to directly start a SQL script, a Spark Notebook, or a Pipeline data flow, just like you would from a Data Lake folder or a database table or view.

You set up the Synapse integration from your Synapse Studio. You will find the Azure Purview (Preview) entry in the Management hub of your Synapse workspace:

Figure 14.20 – Connect to Purview from Synapse

A selector will prompt you to choose whether you want to select from an Azure subscription...

Using data lineage

Once the data factory is connected, it will send lineage information into your Purview environment for every pipeline that is run. Give it a try and create a Data Factory pipeline that copies data from one folder to another in your data lake. Remember: you are quickest when you use the Copy Data Wizard (or just use the MyFirstPipeline pipeline that you created in Chapter 5, Integrating Data in Your Modern Data Warehouse, if you used the data factory there).

When you are finished in the data factory, switch back to Purview, repeat your scan (again, this might take a few minutes), and search for the newly created file or the pipeline name, and in the asset details, check the Lineage tab:

Figure 14.22 – First lineage overview for a Data Factory Copy pipeline

When you check the lineage closely, you will see that you can drill down to the column level and reveal even the column mappings.
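Column-level lineage can be thought of as a graph of column mappings. The following sketch assumes a handful of hypothetical mappings (the `raw/`, `curated/`, and `reports/` paths and the AIRPORT column are invented for illustration) and walks them upstream. It shows the concept only and does not call any Purview API.

```python
from collections import defaultdict

# Hypothetical column mappings, as a copy pipeline might report them:
# (source asset, source column) -> (sink asset, sink column)
COLUMN_MAPPINGS = [
    (("raw/airdelays.csv", "ORIGIN"), ("curated/airdelays.csv", "ORIGIN")),
    (("raw/airdelays.csv", "DEST"), ("curated/airdelays.csv", "DEST")),
    (("curated/airdelays.csv", "ORIGIN"), ("reports/airports.csv", "AIRPORT")),
    (("curated/airdelays.csv", "DEST"), ("reports/airports.csv", "AIRPORT")),
]


def build_upstream_index(mappings):
    """Index the mappings by sink column for fast upstream lookups."""
    index = defaultdict(set)
    for source, sink in mappings:
        index[sink].add(source)
    return index


def trace_upstream(column, index):
    """Return every column that directly or indirectly feeds `column`."""
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for parent in index.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


index = build_upstream_index(COLUMN_MAPPINGS)
for asset, column in sorted(trace_upstream(("reports/airports.csv", "AIRPORT"), index)):
    print(f"{asset}: {column}")
```

A traversal like this is what makes impact analysis possible: starting from one column, you can list every upstream column that contributes to it.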

Imagine the power of this feature, when you...

Discovering Insights

If you examine the Insights hub, you will find several sections that will help you to analyze the content of your catalog. You can discover the content of your Purview environment from another perspective. In Asset insights, for example, you'll find a tree view and a link to a more detailed grouped overview of all the assets that you have discovered:

Figure 14.23 – Asset insights

Give it a go and browse the Insights environment.

Discovering more Purview

There is more to discover in Purview. The product is, as mentioned, still in preview status, so expect changes and new features to come. You might want to examine, for example, a recently added feature: the usage of sensitivity labels. Purview will use sensitivity labels that are extracted from the Microsoft 365 Security & Compliance Center (SCC). These labels will be applied based on classifications and their combinations found in the scanned data. If you want to find out about sensitivity labels, please refer to the Further reading, Sensitivity labels section.

Other interesting concepts that were introduced recently are the Pattern rules feature and Resource sets. These are used to automatically group large amounts of data during your scans. You can find more information about these features in the Further reading, Pattern rules section.
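The idea of grouping many similar files into one logical unit can be sketched in a few lines of Python. The rules below (replacing date-like and numeric path segments with placeholders) and the sample paths are assumptions made for this illustration. They are not Purview's actual grouping logic.

```python
import re
from collections import defaultdict

# Assumed normalization rules for this sketch only: date-like and
# numeric parts of a path are replaced by placeholders, in this order.
RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "{Date}"),
    (re.compile(r"\d+"), "{N}"),
]


def normalize(path):
    """Replace variable parts of a path with placeholders."""
    for pattern, placeholder in RULES:
        path = pattern.sub(placeholder, path)
    return path


def group_paths(paths):
    """Group paths that share the same normalized pattern."""
    groups = defaultdict(list)
    for path in paths:
        groups[normalize(path)].append(path)
    return dict(groups)


# Made-up folder layout
paths = [
    "raw/delays/2021-01-01/part-0001.csv",
    "raw/delays/2021-01-01/part-0002.csv",
    "raw/delays/2021-01-02/part-0001.csv",
    "raw/airports/airports.csv",
]

for pattern, members in group_paths(paths).items():
    print(f"{pattern}: {len(members)} file(s)")
```

Here, the three partition files collapse into a single entry, while the standalone file keeps its own.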

Summary

This chapter took you through the Azure Purview preview. You have seen how to connect to data sources and how to set up scans to parse not only your data sources but also your whole modern data warehouse.

You have learned how to use classification rules in your scan rule sets to classify your data. You have seen how to create your own custom classifications and add them to your scans.

In the second part of the chapter, you saw how Purview integrates with other services such as Azure Synapse Analytics, where Purview can help increase productivity by integrating with the Synapse search and the Synapse compute components, such as the serverless SQL engine, the Spark engine, and Synapse pipelines.

You have examined how to integrate Power BI to be able to scan Power BI datasets, reports, and dashboards and how to add this information to the Purview data lineage part.

Finally, you have seen how Purview integrates with Azure Data Factory and can display data lineage...

Navigate to your Synapse Studio and there, in the Data hub, go to the Linked section where you have access to your data lake.
In your data lake, browse to the airdelays.csv file. (Do you have another idea of how to find this using Purview maybe?)
Right-click the airdelays.csv file and start a new SQL script using serverless SQL in Synapse.

Adjust the second line of the query text as follows:

SELECT
    distinct ORIGIN
FROM
    OPENROWSET(
...

Copy and paste the query text and enter a UNION command between the two.
In the second query, replace ORIGIN with DEST. Your query should look like this:

Figure 14.24 – UNION query to generate the airports dictionary

Below the query text, please click on Export results and download the results as a CSV...
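The effect of the UNION of the two DISTINCT queries can be illustrated outside of Synapse with a small Python sketch. It assumes a tiny in-memory sample; only the ORIGIN and DEST column names come from the steps above, while the sample rows and the DEP_DELAY column are made up.

```python
import csv
import io

# Made-up sample rows; only ORIGIN and DEST are taken from the exercise.
SAMPLE_CSV = """ORIGIN,DEST,DEP_DELAY
SEA,JFK,5
JFK,LAX,12
LAX,SEA,0
SEA,LAX,3
"""


def airport_codes(csv_text):
    """Return the distinct airport codes found in ORIGIN or DEST."""
    reader = csv.DictReader(io.StringIO(csv_text))
    codes = set()
    for row in reader:
        codes.add(row["ORIGIN"])
        codes.add(row["DEST"])
    return sorted(codes)


print(airport_codes(SAMPLE_CSV))  # ['JFK', 'LAX', 'SEA']
```

Like UNION (as opposed to UNION ALL), the set removes duplicates across both columns, so each airport code appears only once in the result.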


Author (1)

Patrik Borosch

Patrik Borosch is a cloud solution architect for data and AI at Microsoft Switzerland GmbH. He has more than 25 years of BI and analytics development, engineering, and architecture experience and is a Microsoft Certified Data Engineer and a Microsoft Certified AI Engineer. Patrik has worked on numerous significant international data warehouse, data integration, and big data projects. Through this, he has built and extended his experience in all facets, from requirements engineering to data modeling and ETL, all the way to reporting and dashboarding. At Microsoft Switzerland, he supports customers in their journey into the analytical world of the Azure Cloud.