Reader small image

You're reading from  Extending Power BI with Python and R - Second Edition

Product typeBook
Published inMar 2024
Reading LevelIntermediate
PublisherPackt
ISBN-139781837639533
Edition2nd Edition
Languages
Right arrow
Author (1)
Luca Zavarella
Luca Zavarella
author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella

Right arrow

A brief introduction to regexes

A regular expression (usually shortened to regex) is defined by a series of characters that identify an abstract search pattern. Essentially, it is a mathematical technique that was developed in 1951 by experts of formal language and theoretical computer science. It is used to validate input data or to search for and extract information from texts.If you don't know the syntax of a regex, at first glance, it might look really tricky:

Figure 6.1 – An example of a regex pattern

Fortunately, there are online regex visualization tools that make it easier to understand patterns (you can find one of them at https://regexper.com). For example, the regex highlighted in Figure 6.1 can be visualized as follows:

Figure 6.2 – A visualization of a regex

From Figure 6.2, we can see that the regex in Figure 5.1 will identify email addresses in a piece of text.Learning how to use regexes like a pro is certainly not easy, and it is not the purpose...

Validating data using regex in Power BI

To date, Power BI has no native feature in Power Query to perform operations via regexes. There are cases when you can't avoid using regexes to extract useful information from data in text form. The only way to be able to use regexes is through R scripts or Python scripts. The only cons you have in this case is that, if you need to publish the report on the Power BI service, to allow Power Query to use external R or Python engines, you must also install the on-premises data gateway in personal mode.However, let's get right into it with real-world examples.Let's say you work for a retail company that has a team dedicated to identifying fraudulent customers. As soon as a team member identifies a fraudster, they fill out an Excel spreadsheet, in which the Email and BannedDate columns are included along with others. Your task is to load the data from this Excel file into Power BI and, from other data sources, select only the fraudster...

Loading complex log files using regex in Power BI

Log files are a very useful tool for developers and administrators of computer systems. They record what happened to the system, when it happened, and which user actually caused the event. Thanks to these files, you can find information about any system failure, which allows you to diagnose the causes of these failures more quickly.Logs are often semi-structured data, information that cannot be persisted in a relational database in the format in which it was generated. In order to be analyzed with standard tools, this data must first be transformed into a more suitable format.Because it is not structured data, it is difficult to import into Power BI as is, unless someone has developed a custom connector to do so. In these scenarios, using a regex in languages like Python or R can help us get the results we want.

Apache access logs

Let's suppose your company has a website published through an Apache web server. Your manager asks you...

Extracting values from text using regex in Power BI

The last use case we want to present happens very often when it comes to shipping goods to customers. Sometimes a fraudster manages to steal the goods addressed to a customer and the company has to refund the customer. The defrauded customer then contacts Customer Care to request a refund. If the management system provided to the Customer Care operator who has to manage the case does not allow you to enter the refund information in a structured way, the operator must resort to the only possible method: the entry of a free text note associated with the order, which specifies the amount, the reason and the date of the refund.You already know that information entered in free text is every analyst's nightmare, especially when your boss asks you to analyze the very information entered in those infamous notes.In the repository that comes with this book, you can find the OrderNotes.xlsx Excel file inside the Ch06 - Using Regular Expressions...

Summary

In this chapter, you were introduced to the basics of how to use regexes. Using the bare minimum, you were able to effectively validate strings representing email addresses and dates in Power BI, using both Python and R.You also learned how to extract information from semi-structured log files using regexes and how to import the extracted information into Power BI in a structured way.Finally, you learned how to use regex in Python and R to extract information from seemingly unprocessable free text thanks to the real-world case of notes associated with sales orders.In the next chapter, you'll learn how to use some de-identification techniques in Power BI to anonymize or pseudonymize datasets that show sensitive data about individuals in plain text before they are imported into Power BI.

References

For additional reading, please refer to the following books and articles:Regular Expressions: The Complete Tutorial, Jan Goyvaerts (https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf)

  1. nc, an R package that simplifies the process of extracting and organizing information in a tabular format with regular expressions (https://github.com/tdhock/nc)

Test your knowledge

Q01. What are the main purposes of using regexes?Q02. What are the main Python packages that implement regex functionality?Q03. What are the main R packages that implement regex functionality?Q04. Can you briefly summarize the method used to extract useful information from a free-text note?

Answers

A01. The main purposes of using regexes are:

  1. Text validation: Regexes can be used to check if a piece of text matches a certain pattern. This is useful for tasks such as verifying email addresses, dates, or other types of structured data.
  2. Text searching and filtering: Regexes can be used to search for patterns in large amounts of text, such as finding all instances of a particular word or phrase in a document. This can also be used to filter out unwanted data, such as removing all instances of a certain word or phrase from a document.
  3. Text manipulation and formatting: Regexes can be used to manipulate text in various ways, such as finding and replacing certain words or...

Test your knowledge

  1. What is the most obvious disadvantage of anonymization?
  2. How does pseudonymization differ from anonymization?
  3. How does the architecture shown for pseudonymization ensure compliance with GDPR deletion requirements?
  4. Why is it necessary to use NLP techniques to identify PII instead of using the usual regexes?
  5. What is one of the best Python packages for de-identifying PII? What NLP engines can be used behind the scenes?
  6. Which R package was used to de-identify PII? What is special about this package as an engine for NLP?
  7. What are pseudonyms?
  8. Which Python and R packages were used to generate pseudonyms?

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://discord.gg/MKww5g45EB

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Extending Power BI with Python and R - Second Edition
Published in: Mar 2024Publisher: PacktISBN-13: 9781837639533
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella