Reader small image

You're reading from  Extending Power BI with Python and R - Second Edition

Product typeBook
Published inMar 2024
Reading LevelIntermediate
PublisherPackt
ISBN-139781837639533
Edition2nd Edition
Languages
Right arrow
Author (1)
Luca Zavarella
Luca Zavarella
author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella

Right arrow

Preface

Welcome to the second edition of our book, where the journey into advanced analytics for Power BI deepens and expands your expertise in Python and R. This edition isn’t just an update; it’s an expansion into new territory, meticulously crafted to enhance your analytical prowess, whether you’re using Power BI or diving directly into Python and R.

We’ve introduced new chapters that pave the way for optimizing your environment, harnessing the power of Intel’s Math Kernel Library for speed, and tackling integration challenges with finesse. Imagine managing massive data sets effortlessly, using advanced techniques and the Swift parquet format for fast data processing. Imagine mastering the art of probabilistic fuzzy matching and navigating the potential of SQL Server External Languages to overcome traditional limitations in Power BI. You’ll also be guided in creating stunning visualizations using not only R, but also Python, applying the Grammar of Graphics through a specialized package.

The remaining chapters have been carefully updated, expanded, and technically refined. They give you the skills to excel at validating data using regular expressions, leveraging data from a variety of uncommon external sources, and applying sophisticated data transformation algorithms. You’ll explore advanced methods for protecting personal data in Power BI, including strategies for pseudonymization, anonymization, and data masking. This book also covers integrating external APIs to enrich your datasets, improve I/O efficiency, and leverage the powerful analytical capabilities of Python and R. Perform comprehensive analysis and uncover deep insights with statistical and machine learning methods, all without requiring the premium capacity. The text encourages you to visualize key statistical features of your data through a variety of graphical outputs as you develop machine learning models. In addition, each chapter has challenging questions and answers that reinforce your understanding and test your skills.

Join us on this challenging adventure where your analytical skills will soar to new heights.

Who this book is for

This book is intended for data analysts and developers who already have some experience with Power BI. In addition, ideally, readers should also be familiar with Python or R (or both), but the book is designed to still be accessible to beginners. For those more complex topics that require programming skills, the book provides in-depth learning resources. Newcomers are guided through each concept with clear explanations and annotated code examples, ensuring they will end up with a comprehensive understanding of how and why the code works.

What this book covers

Chapter 1, Where and How to Use R and Python Scripts in Power BI, provides a comprehensive overview of integrating R and Python scripts with Power BI. It delves into the capabilities of Power BI, emphasizing its utility beyond simple data visualization by incorporating advanced analytics through R and Python. The chapter covers key aspects such as injecting R or Python scripts into Power BI, the specific Power BI tools for script integration, and the limitations of these scripts on different Power BI products.

It also addresses the technical requirements for this integration and provides insights into data loading, transformation, and visualization using R and Python scripts. This chapter is critical to understanding the role of these scripts in extending the functionality of Power BI and how they interact with data at various stages of report development.

Chapter 2, Configuring R with Power BI, focuses on setting up and integrating R with Power BI. It walks you through installing and configuring the necessary R engines and development environments for use within Power BI. The chapter discusses various R distributions, including CRAN R and Microsoft R Open, and provides insight into improving R performance using Intel’s Math Kernel Library (MKL). It also covers installing RTools and configuring Power BI Desktop and the Power BI service to work effectively with R. In addition, the chapter discusses the limitations of R script visualizations and the technical requirements for these configurations.

Chapter 3, Configuring Python with Power BI, delves into the integration of Python with Power BI. It walks through the installation and configuration of Python engines and IDEs, highlighting different Python distributions and how to choose the right engine. The chapter discusses setting up the Power BI desktop and service for Python, focusing on the limitations and technical requirements of the Python visualization. It focuses on using virtual environments for data transformations and Python script visualizations in Power BI. In addition, the chapter discusses improving Python performance in Power BI, including using Intel’s MKL for optimized computational efficiency.

Chapter 4, Solving Common Issues When Using Python and R in Power BI, systematically addresses common problems encountered when developing solutions in Power BI using Python and R. It provides solutions to issues such as the ADO.NET error when running a Python script, the Formula.Firewall error, using multiple datasets in a Python/R script step, and handling dates/times in these scripts.

Chapter 5, Importing Unhandled Data Objects, focuses on using R and Python for data ingestion in Power BI. It covers importing RDS and PKL files into Power BI, which is useful for data that comes from external processing and is not managed directly by Power BI. The chapter is designed to guide through serialized files in Power BI, providing practical examples and detailed instructions.

Chapter 6, Using Regular Expressions in Power BI, explores the use of regular expressions (regex) to perform complex searches and replacements on strings in Power BI, improving data cleansing tasks. Key topics include a brief introduction to regex, validating data, loading complex log files, and extracting values from text using regex in Power BI. It also discusses the technical requirements for these processes and how to configure R and Python environments in Power BI to perform regex operations. The goal of this chapter is to provide you with the skills to use regex for high-quality data projects in Power BI.

Chapter 7, Anonymizing and Pseudonymizing Your Data in Power BI, focuses on techniques for de-identifying data in Power BI using Python and R scripts. The chapter emphasizes the importance of data privacy and compliance with regulations such as GDPR. It covers various de-identification methods, including information removal, data masking, swapping, generalization, perturbation, tokenization, hashing, and encryption. The chapter also distinguishes between anonymization and pseudonymization processes, and demonstrates their implementation in Power BI with real-world examples. It also discusses preserving the statistical properties of datasets during pseudonymization and anonymization.

Chapter 8, Logging Data from Power BI to External Sources, covers how to extract and log data from Power BI to external files or systems using Python and R. It covers various methods for logging data to CSV and Excel files, and demonstrates how to interact with SQL servers, including Azure SQL Server. The chapter provides detailed instructions and examples for each of these processes, demonstrating how to effectively manage and transfer data outside of Power BI for various purposes.

Chapter 9, Loading Large Datasets beyond the Available RAM in Power BI, focuses on working with large files, especially those that exceed the RAM capacity of your machine, in Power BI using Python and R. It covers practical techniques for importing, processing, and exporting large datasets. Key topics include typical analytical scenarios involving large datasets, importing and exporting large datasets in both Python and R, and the technical requirements for these operations. The chapter emphasizes the use of specific packages and distributed computing systems to provide a comprehensive guide to working with large datasets in Power BI.

Chapter 10, Boosting Data Loading Speed in Power BI with Parquet Format, discusses improving the performance of Power BI, especially for large data sources. The chapter focuses on converting data from the traditional CSV format to the more efficient Parquet file format. The goal of this conversion is to optimize query execution times and overall report performance in Power BI. The chapter provides technical requirements and detailed instructions on how to implement this conversion and illustrates the significant performance benefits of using the Parquet format. It also explains how to use Parquet files in Power BI for both Python and R users, highlighting practical applications and benefits.

Chapter 11, Calling External APIs to Enrich Your Data, teaches you how to extend existing data using external application programming interfaces (APIs), which are often exposed through web service endpoints. It covers understanding web services, using Bing Maps web services to geocode addresses in Python and R, and accessing these services through Power BI. The chapter includes detailed instructions for registering for Bing Maps web services, handling geocoding, and integrating these methods into Power BI, with an emphasis on technical requirements and practical examples.

Chapter 12, Calculating Columns Using Complex Algorithms: Distances, explores the use of distance measurements in data analysis. The chapter begins with an exploration of the concept of distance in various contexts, including geographic and string distances. Key topics include calculating the distance between two geographic locations and between strings. The chapter emphasizes the use of non-trivial algorithms for data analysis in Power BI, using R and Python for complex computations. It includes practical examples and the technical requirements necessary to implement these concepts.

Chapter 13, Calculating Columns Using Complex Algorithms: Fuzzy Matching, explores advanced data analysis techniques with a focus on fuzzy matching in Power BI. It covers the use of Microsoft Research’s Jaccard distance-based fuzzy matching algorithm and delves into the specifics of probabilistic data association. The chapter includes practical examples and guidance on implementing these techniques in Power BI, highlighting key aspects such as standard fuzzy matching, probabilistic record association algorithms, and their application.

Chapter 14, Calculating Columns Using Complex Algorithms: Optimization Problems, explores how Power BI analysts can tackle mathematical optimization problems without extensive knowledge of advanced mathematics. It focuses on linear programming (LP) and its application to data analysis, particularly for real-world cases such as demand optimization in manufacturing. The chapter covers topics such as the basics of linear programming, how to solve optimization problems using Python and R, and the technical requirements for setting up Power BI with these programming languages. Practical examples are provided to demonstrate how to effectively apply LP techniques in Power BI.

Chapter 15, Adding Statistical Insights: Associations, focuses on statistical techniques used to extract insights from data, emphasizing the critical role of statistics in data analysis. This chapter covers exploring associations between variables, including correlations between numeric and categorical variables, and discusses the technical requirements for performing these analyses. Key concepts include understanding the behavior of variables, measuring the degree of association (correlation), and using mathematical concepts to define different types of correlations. The chapter also discusses the limitations of certain statistical methods and introduces alternatives such as Spearman’s and Kendall’s correlation coefficients. It also includes practical examples and instructions for implementing these techniques in Python, R, and Power BI.

Chapter 16, Adding Statistical Insights: Outliers and Missing Values, explores advanced statistical capabilities in Power BI, with a focus on detecting outliers and imputing missing values in datasets. It covers different methods for detecting and handling outliers, the impact of missing values on data analysis, and strategies for dealing with these issues. The chapter provides you with the skills to effectively use Power BI for these purposes, including a comprehensive explanation of outlier detection and missing value imputation algorithms. It also outlines the technical requirements for performing these operations in Power BI.

Chapter 17, Using Machine Learning without Premium or Embedded Capacity, focuses on integrating machine learning (ML) capabilities into Power BI workflows, particularly for users with Pro licenses. The chapter addresses the use of Python and R for machine learning within Power BI, despite the limitations of certain advanced AI tools. Key topics include interacting with ML in Power BI using data flows, AutoML solutions, embedding training code in Power Query, using trained models, and calling web services in Power Query. The chapter aims to provide practical insights into effectively applying ML techniques in Power BI environments.

Chapter 18, Using SQL Server External Languages for Advanced Analytics and ML Integration in Power BI, explores the integration of Python and R analytical engines within SQL Server (or Azure SQL Managed Instance) and their use in Power BI.

This approach is considered because of certain limitations in Power BI for handling Python and R directly. The chapter covers installing and configuring Python and R in SQL Server, using ML services, and importing preprocessed datasets into Power BI. It provides a detailed guide to managing and integrating these technologies to extend the capabilities of Power BI reports.

Chapter 19, Exploratory Data Analysis, focuses on the importance of thoroughly understanding the inherent characteristics of your data before applying ML models. It introduces exploratory data analysis (EDA) techniques that can help you make informed decisions about selecting appropriate ML models and feature engineering methods. The chapter covers topics such as the goals of EDA, techniques for performing EDA using Python and R, and EDA in Power BI. It emphasizes the critical steps of cleaning the dataset, understanding variable relationships, and deriving meaningful insights to build accurate models.

Chapter 20, Using the Grammar of Graphics in Python with plotnine, provides a comprehensive guide to the plotnine package in Python, drawing parallels to the popular ggplot2 tool in R. The chapter begins with an overview of plotnine, explaining its foundation in the grammar of graphics and its intuitive, powerful syntax. It then delves into practical applications, demonstrating how to analyze the Titanic dataset using various plotnine techniques, such as creating bar charts and histograms. In addition, the chapter covers the integration of plotnine with Power BI, providing detailed instructions on how to effectively use plotnine visualizations within Power BI environments. This includes methods for converting plotnine graphs for Power BI compatibility and ensuring their effective display in reports.

Chapter 21, Advanced Visualizations, focuses on creating advanced and visually appealing custom graphs, with an emphasis on circular bar plots. These bar plots are particularly useful for displaying periodic or cyclical data in a clear and space-efficient manner. The chapter covers topics such as selecting and implementing pie charts in R and Power BI. It also discusses the integration of R scripts into Power BI for rendering complex ggplot2 graphs and provides a detailed walkthrough for this process.

Chapter 22, Interactive R Custom Visuals, focuses on enhancing data visualizations with interactivity, building on the concepts introduced in previous chapters. It explores the transition from static charts to interactive visualizations using HTML widgets and Plotly in R, emphasizing their benefits for data interpretation. The chapter covers key topics such as adding interactivity with Plotly, using HTML widgets, and integrating these interactive visuals with Power BI. It also provides practical guidance on creating and importing custom visual packages into Power BI, giving readers the skills to create more dynamic and engaging data presentations.

Software used in this book

Power BI is currently undergoing frequent updates, many on which are making changes to the UI. Each chapter in this book includes a note which indicates the version of Power BI used for the screenshots and examples within. If you are using a different version, please be aware that some UI elements may have changed or moved.

To get the most out of this book

  • You will need a working PC with a stable Internet connection. This setup will allow you to not only download the necessary software, but also access online resources that can enhance your learning experience. In addition, it is critical that you have Power BI Desktop installed on your computer. This software is the backbone of the concepts and labs that we will explore in this book.
  • It is best to have a basic understanding of Power BI. Familiarity with the interface and basic concepts will help you navigate through the exercises and understand the more advanced topics more easily.
  • If you are reading a digital version of this book, we recommend that you type the code examples yourself or access the code from the book’s GitHub repository. A link to the repository is provided in the next section. This practice will help you avoid potential errors that can result from copying and pasting code directly.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Extending-Power-BI-with-Python-and-R-2nd-edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781837639533.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “Activate the environment that gives you the error you saw before using the conda activate <your-environment-name> command.”

A block of code is set as follows:

re.search('test', 'TeSt', re.IGNORECASE)
re.match('test', 'TeSt', re.IGNORECASE)
re.sub('test', 'xxxx', 'TesTing', flags=re.IGNORECASE)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

import pandas as pd
import numpy
df = pd.DataFrame(dir(numpy))

Any command-line input or output is written as follows:

successfully initialized (spaCy Version: 3.5.0, language model: en_core_web_lg)
(python options: type = "condaenv", value = "C:\ProgramData\Miniconda3\envs\presidio_env")

Bold: Indicates a new term, an important word, or words that you see on the screen. For instance, words in menus or dialog boxes appear in the text like this. For example: “Personally Identifiable Information (PII), also known as personal information or personal data, is any information about an identifiable individual.”

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email feedback@packtpub.com and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at questions@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packtpub.com/submit-errata, click Submit Errata, and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packtpub.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you’ve read Extending Power BI with Python and R, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily

Follow these simple steps to get the benefits:

  1. Scan the QR code or visit the link below

https://packt.link/free-ebook/9781837639533

  1. Submit your proof of purchase
  2. That’s it! We’ll send your free PDF and other benefits to your email directly
lock icon
The rest of the chapter is locked
You have been reading a chapter from
Extending Power BI with Python and R - Second Edition
Published in: Mar 2024Publisher: PacktISBN-13: 9781837639533
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime

Author (1)

author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella