You're reading from  Cloud Scale Analytics with Azure Data Services

Product type: Book
Published in: Jul 2021
Publisher: Packt
ISBN-13: 9781800562936
Edition: 1st Edition
Author (1)
Patrik Borosch

Patrik Borosch is a cloud solution architect for data and AI at Microsoft Switzerland GmbH. He has more than 25 years of BI and analytics development, engineering, and architecture experience and is a Microsoft Certified Data Engineer and a Microsoft Certified AI Engineer. Patrik has worked on numerous significant international data warehouse, data integration, and big data projects. Through this, he has built and extended his experience in all facets, from requirements engineering to data modeling and ETL, all the way to reporting and dashboarding. At Microsoft Switzerland, he supports customers in their journey into the analytical world of the Azure Cloud.

Chapter 14: Establishing Data Governance

In the modern data warehouse architecture with its various options to land and store data, you will no longer have the one database where your single version of the truth resides. This makes it harder to keep track of, for example, the content, the relationships between entities, and their sensitivity.

Current regulations such as the General Data Protection Regulation (GDPR) require you to be able to locate your customer data and all the related information in case the customer asks for a report or, more importantly, when the customer requests deletion.

But it's not only regulatory requirements that force you to gain insights into your data. As you add more and more datasets in an increasingly diverse business, you will need additional tools – tools that enable you to find your way through your data lake, your data warehouse, and certainly the sources from which you extract the data to create all the analytics...

Technical requirements

To be able to give Azure Purview a try, you will need the following:

  • An Azure subscription with the rights to provision a Purview service.
  • A Synapse workspace where you own the System Administrator RBAC role (see Chapter 4, Understanding SQL Pools and SQL Options).
  • An Azure Data Lake Storage account, perhaps from previous chapters such as Chapter 3, Understanding the Data Lake Storage Layer. You will need at least read permissions on the data lake.

Discovering Azure Purview

With the Azure Purview preview, Microsoft introduces the first wave of data governance tools. You will find modules for the following:

  • Data scanning, cataloging, and search
  • Data classification and glossary
  • Data lineage including Data Factory/Synapse pipelines and Power BI
  • Metadata insights

Purview integrates with Synapse so that you can use the Purview search functionality within the Synapse workspace. A Synapse developer can use the search results, for example, to start a SQL script or a Spark Notebook in the same way as from the Data Lake browser in the Synapse workspace. Remember what you saw in Chapter 10, Loading the Presentation Layer?

Microsoft is not only targeting data governance; it also aims to improve developers' productivity and experience.

Provisioning the service

But let's start with the service and provision it first:

  1. Please go to the Azure portal and type...

Classifying data

Please examine the airdelayspredict.csv file that you created in Chapter 9, Integrating Azure Cognitive Services and Machine Learning. You will find a schema classification on the Overview tab or in the Schema view: U.S. State Name for DEST_STATE_ABR and ORIGIN_STATE_ABR. This is one of the predefined classification rules that are available in Purview.

Let's discover where you can find them and how you can add your own classifications and classification rules.

In the Management hub of your Purview Studio, please navigate to Metadata management, which contains two nodes: Classification and Classification rules.

Please examine the two nodes. You will find the System and Custom categories in these lists.

The System list already shows more than 100 different classifications, such as Age of an individual or different patterns for passport numbers, and much more. Some of them are pattern-oriented and implement a regular expression to recognize the pattern...
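To make the idea of a pattern-oriented rule more tangible, here is a minimal sketch in Python. It is not how Purview implements its rules. The `PatternRule` class, the two rule names, their regular expressions, and the match threshold are all hypothetical and only illustrate the principle: a column receives a classification when a large enough share of its sampled values matches the expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternRule:
    """Hypothetical stand-in for a pattern-oriented classification rule."""
    classification: str
    pattern: re.Pattern
    min_match_ratio: float  # share of sampled values that must match


def classify_column(values, rules):
    """Return the classifications whose pattern matches enough sampled values."""
    # Ignore empty cells so they do not dilute the match ratio.
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    matched = []
    for rule in rules:
        hits = sum(1 for v in non_empty if rule.pattern.fullmatch(v))
        if hits / len(non_empty) >= rule.min_match_ratio:
            matched.append(rule.classification)
    return matched


# Illustrative rules only; names and expressions are made up for this sketch.
RULES = [
    PatternRule("Two-letter code", re.compile(r"[A-Z]{2}"), 0.75),
    PatternRule("Three-letter code", re.compile(r"[A-Z]{3}"), 0.75),
]

if __name__ == "__main__":
    print(classify_column(["CA", "NY", "TX", "WA", "n/a"], RULES))
```

The threshold is the design choice worth noting: real data is rarely clean, so a rule that demanded a match on every value would miss columns containing a few placeholders such as `n/a`.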

Integrating with Azure services

Purview already integrates with other services, such as Synapse, Azure Data Factory, and Power BI. This integration helps you to leverage the insights that Purview can deliver in different use cases.

Integrating with Synapse

When we examine the integration with Synapse, for example, you will be able to search the Purview catalog from within Synapse Studio. But this is not the only advantage! From the results of your search, you will be able to directly start a SQL script, a Spark Notebook, or a Pipeline data flow just like you would do from a Data Lake folder or a database table or view.

The Synapse integration will be done from your Synapse Studio. You will find the Azure Purview (Preview) entry in the Management hub of your Synapse workspace:

Figure 14.20 – Connect to Purview from Synapse

A selector will prompt you to choose whether you want to select from an Azure subscription...

Using data lineage

Once the data factory is connected, it will send lineage information into your Purview environment for every pipeline that is run. Give it a try and create a Data Factory pipeline that copies data from one folder to another in your data lake. Remember: you are quickest when you use the Copy Data Wizard (or just use the MyFirstPipeline pipeline that you created in Chapter 5, Integrating Data in Your Modern Data Warehouse, if you used the data factory there).

When you are finished in the data factory, switch back to Purview, repeat your scan (again, this might take a few minutes), and search for the newly created file or the pipeline name, and in the asset details, check the Lineage tab:

Figure 14.22 – First lineage overview for a Data Factory Copy pipeline

When you check the lineage closely, you will see that you can drill down to the column level and reveal even the column mappings.

Imagine the power of this feature, when you...
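Column-level lineage can be pictured as a set of edges between (asset, column) pairs. The following sketch assumes such a representation and walks it backwards to find every upstream column of a given target. The edge format, the function name, and the folder and column names in the sample are hypothetical and are not taken from Purview's data model.

```python
from collections import defaultdict


def upstream_columns(edges, asset, column):
    """Walk column mappings backwards and return every upstream (asset, column)."""
    # Index the edges by their sink so each step backwards is a dictionary lookup.
    parents = defaultdict(list)
    for source, sink in edges:
        parents[sink].append(source)

    found, stack = [], [(asset, column)]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in found:  # guard against revisiting shared ancestors
                found.append(parent)
                stack.append(parent)
    return found


# Hypothetical mappings, as two copy pipelines might produce them.
EDGES = [
    (("raw/airdelays.csv", "ORIGIN"), ("staging/airdelays.csv", "ORIGIN")),
    (("staging/airdelays.csv", "ORIGIN"), ("curated/airports.csv", "AIRPORT")),
]

if __name__ == "__main__":
    for step in upstream_columns(EDGES, "curated/airports.csv", "AIRPORT"):
        print(step)
```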

Discovering Insights

If you examine the Insights hub, you will find several sections that will help you to analyze the content of your catalog. You can discover the content of your Purview environment from another perspective. In Asset insights, for example, you'll find a tree view and a link to a more detailed grouped overview of all the assets that you have discovered:

Figure 14.23 – Asset insights

Give it a go and browse the Insights environment.

Discovering more Purview

There is more to discover in Purview. The product is, as mentioned, still in preview status, so expect changes and new features to come. You might want to examine, for example, a feature that was added recently: the usage of sensitivity labels. This means Purview will use sensitivity labels that are extracted from the Microsoft 365 Security & Compliance Center (SCC). These labels will be applied based on classifications and their combinations found in the scanned data. If you want to find out more about sensitivity labels, please refer to the Further reading, Sensitivity labels section.

Other interesting concepts that were introduced recently are the Pattern rules feature and Resource sets. These are used to automatically group large amounts of data during your scans. You can find more information about these features in the Further reading, Pattern rules section.
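The grouping idea can be illustrated with a rough sketch: files whose paths differ only in numeric parts (years, months, part numbers) collapse into one logical entry. This is only an assumption about the general principle, not Purview's actual algorithm; the function name, the `{N}` placeholder, and the sample paths are made up for illustration.

```python
import re
from collections import defaultdict


def group_into_resource_sets(paths):
    """Group file paths that differ only in their numeric segments."""
    groups = defaultdict(list)
    for path in paths:
        # Replace every run of digits with a placeholder to build the group key.
        key = re.sub(r"\d+", "{N}", path)
        groups[key].append(path)
    return dict(groups)


# Hypothetical data lake paths.
PATHS = [
    "sales/2021/01/part-0001.csv",
    "sales/2021/02/part-0002.csv",
    "reference/airports.csv",
]

if __name__ == "__main__":
    for key, members in group_into_resource_sets(PATHS).items():
        print(key, len(members))
```

Grouping this way keeps a catalog readable: thousands of partition files show up as a single asset instead of thousands of near-identical entries.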

Summary

This chapter took you through the Azure Purview preview. You have seen how to connect to data sources and how to set up scans to parse not only your data sources but also your whole modern data warehouse.

You have learned how to use classification rules in your scan rule sets to classify your data. You have seen how to create your own custom classifications and add them to your scans.

In the second part of the chapter, you saw how Purview integrates with other services such as Azure Synapse Analytics, where Purview can help increase productivity by integrating with the Synapse search and the Synapse compute components, such as the serverless SQL engine, the Spark engine, and Synapse pipelines.

You have examined how to integrate Power BI to be able to scan Power BI datasets, reports, and dashboards and how to add this information to the Purview data lineage part.

Finally, you have seen how Purview integrates with Azure Data Factory and can display data lineage...

Further reading

  1. Navigate to your Synapse Studio and there, in the Data hub, go to the Linked section where you have access to your data lake.
  2. In your data lake, browse to the airdelays.csv file. (Can you think of another way to find it, perhaps by using Purview?)
  3. Right-click the airdelays.csv file and start a new SQL script using serverless SQL in Synapse.
  4. Adjust the second line of the query text as follows:
    SELECT
        distinct ORIGIN
    FROM
        OPENROWSET(
    ...
  5. Copy and paste the query text and enter a UNION command between the two.
  6. In the second query, replace ORIGIN with DEST. Your query should look like this:

    Figure 14.24 – UNION query to generate the airports dictionary

  7. Below the query text, please click on Export results and download the results as a CSV...
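The UNION in steps 5 and 6 removes duplicates across both result sets, so each airport code appears once regardless of whether it occurs as an origin, a destination, or both. The same logic can be sketched in plain Python. The sample rows below are made up and stand in for airdelays.csv, and the function name is hypothetical; only the ORIGIN and DEST column names come from the steps above.

```python
import csv
import io

# Tiny made-up sample; the real file has many more columns and rows.
SAMPLE = """ORIGIN,DEST
JFK,LAX
LAX,SFO
SFO,JFK
ORD,JFK
"""


def airport_dictionary(csv_text):
    """Return the sorted, de-duplicated union of the ORIGIN and DEST columns."""
    origins, dests = set(), set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        origins.add(row["ORIGIN"])  # SELECT DISTINCT ORIGIN
        dests.add(row["DEST"])      # SELECT DISTINCT DEST
    # The set union mirrors UNION, which drops duplicates across both queries.
    return sorted(origins | dests)


if __name__ == "__main__":
    print(airport_dictionary(SAMPLE))
```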
