
You're reading from  Extending Power BI with Python and R - Second Edition

Product type: Book
Published in: Mar 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837639533
Edition: 2nd Edition
Author (1)
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.

Join our book community on Discord

https://packt.link/EarlyAccessCommunity


It often happens that those who develop a specific software product for one customer want to repackage it and sell it to another customer who is interested in similar features. However, if you want to show some screenshots of the software in a demo to the new customer, you should avoid showing any data that might be sensitive. In the past, manually masking the data in a copy of the original software database was one of the tasks that fell to the hapless developer, sometimes just a few days before the demo.

The scenario described does not require data to be shared with a third-party recipient; the goal is to demonstrate a product to a customer by displaying simulated data. Therefore, there is no concern about a brute-force attack by professional analysts trying to recover the original data as it was before the de-identification operation.

Things definitely change when...

Technical requirements

This chapter requires you to have a working internet connection and Power BI Desktop already installed on your machine (we used version 2.114.664.0 64-bit, February 2022). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R with Power BI, and Chapter 3, Configuring Python with Power BI.

De-identifying data

Personally identifiable information (PII), also known as personal information or personal data, is any information about an identifiable individual. There are two types of PII: direct and indirect. Examples of direct identifiers include your name, your address, a photograph of you, or a Radio Frequency Identification (RFID) tag associated with you. Indirect identifiers, on the other hand, are any pieces of information that don't explicitly refer to you as an individual, but somehow make it easier to identify you. Examples of indirect identifiers include your license plate number, your bank account number, the link to your profile on a social networking site, or your place of employment.

De-identifying data means manipulating PII so that it is no longer possible to identify the person who generated it.

There are two ways to deal with direct and indirect personal identifiers – either you decide to...

Anonymizing data in Power BI

Here is a scenario you may well encounter during your career as a Power BI report developer. Imagine that you are given an Excel dataset to import into Power BI to create a report for another department in your company. The Excel dataset contains sensitive personal information, such as the names and email addresses of people who have made multiple attempts to pay for an order with a credit card. The following is an example of the contents of the Excel file:

Figure 7.4 – Excel data to be anonymized

You are asked to create the report while anonymizing the sensitive data.

The first thing you will notice is that, not only do you need to anonymize the Name and Email columns, but some names or email addresses can be contained in the text of some Notes. While it is quite easy to find email addresses using regular expressions, it is not so easy to find names in free text. For this purpose, it is necessary...
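To make the "easy" half of that statement concrete, here is a minimal sketch of masking email addresses in a free-text note with Python's standard `re` module. The pattern, the `mask_emails` helper, the `<EMAIL>` placeholder, and the sample note are all assumptions made for illustration; they are not the chapter's own code, and the pattern is deliberately simple rather than a full implementation of the email address standards.

```python
import re

# Deliberately simple pattern (an assumption of this sketch): it catches
# common addresses but does not cover everything the email RFCs allow.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str, placeholder: str = "<EMAIL>") -> str:
    """Replace every email-like substring in text with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


# Hypothetical note, similar in spirit to a Notes cell in the Excel file.
note = "Customer wrote from jane.doe@example.com asking to retry the payment."
masked_note = mask_emails(note)
print(masked_note)
```

No comparable pattern exists for personal names: a name such as "Jane Doe" has no fixed shape that a regular expression can rely on, so this approach stops at emails.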

Pseudonymizing data in Power BI

Unlike anonymization, pseudonymization preserves the statistical properties of the dataset by transforming the same input string into the same output string, and keeps track of the replacements that have occurred, allowing those with access to this mapping information to recover the original dataset.

In addition, pseudonymization replaces sensitive data with fake strings (pseudonyms) that have the same form as the original, making the de-identified data more realistic.

Depending on the analytical language used, there are different solutions driven by the different packages available that lead to the same result. Let's see how to apply pseudonymization in Power BI to the contents of the same Excel file used in the previous sections with Python.
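Before turning to the actual implementation, the two properties described above (the same input always yields the same pseudonym, and the replacements are tracked so that they can be undone) can be sketched with the standard library alone. Everything here is hypothetical and chosen for illustration: the `NamePseudonymizer` class, the small pools of fake names, and the sample names are not the chapter's code, which relies on dedicated packages.

```python
import random

# Hypothetical pools of fake values (assumptions of this sketch).
FAKE_FIRST_NAMES = ["Alex", "Robin", "Sam", "Jamie", "Taylor"]
FAKE_LAST_NAMES = ["Rivers", "Stone", "Fields", "Brooks", "Woods"]


class NamePseudonymizer:
    """Replace names with fake ones and record every replacement."""

    def __init__(self, seed=42):
        rng = random.Random(seed)
        # Every first/last combination, shuffled once; each is used at most once.
        self._pool = [f"{first} {last}"
                      for first in FAKE_FIRST_NAMES
                      for last in FAKE_LAST_NAMES]
        rng.shuffle(self._pool)
        self.forward = {}  # original -> pseudonym
        self.reverse = {}  # pseudonym -> original

    def pseudonymize(self, name):
        # The same input always maps to the same output.
        if name in self.forward:
            return self.forward[name]
        if not self._pool:
            raise RuntimeError("pseudonym pool exhausted")
        pseudonym = self._pool.pop()
        self.forward[name] = pseudonym
        self.reverse[pseudonym] = name
        return pseudonym

    def restore(self, pseudonym):
        # Only someone holding the mapping can recover the original value.
        return self.reverse[pseudonym]


pseudonymizer = NamePseudonymizer()
names = ["Mario Rossi", "Anna Bianchi", "Mario Rossi"]  # hypothetical data
pseudonyms = [pseudonymizer.pseudonymize(n) for n in names]
restored = [pseudonymizer.restore(p) for p in pseudonyms]
```

Because repeated values keep a single pseudonym, counts and groupings computed on the pseudonymized column match those on the original one. The `forward` and `reverse` dictionaries are the mapping information mentioned above, so in practice they would need to be stored separately from the de-identified data and protected accordingly.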

Pseudonymizing data using Python

The modules and the code structure you will use are quite similar to those already used for anonymization. One difference is that, once...

Summary

In this chapter, you learned the main differences between anonymization and pseudonymization. You also learned which techniques are most commonly used to apply each de-identification process.

You also applied the anonymization process through tokenization and the pseudonymization process by creating similar pseudonyms in Power BI using both Python and R.

In the next chapter, you will learn how to log data derived from operations performed with Power Query in Power BI to external repositories.


Test your knowledge

Q01. What is the most obvious disadvantage of anonymization?
Q02. How does pseudonymization differ from anonymization?
Q03. How does the architecture shown for pseudonymization ensure compliance with GDPR deletion requirements?
Q04. Why is it necessary to use NLP techniques to identify PII instead of using the usual regexes?
Q05. What is one of the best Python packages for de-identifying PII? What NLP engines can be used behind the scenes?
Q06. Which R package was used to de-identify PII? What is special about this package as an engine for NLP?
Q07. What are pseudonyms?
Q08. Which Python and R packages were used to generate pseudonyms?

Answers

A01. The most obvious disadvantage of anonymization is that it removes significant value from the data involved. This is because once the anonymization process is complete, it becomes impossible to trace the identities that generated the data. This means that any information or insights that could be gained from analysing the data...
