Reader small image

You're reading from  Extending Power BI with Python and R - Second Edition

Product typeBook
Published inMar 2024
Reading LevelIntermediate
PublisherPackt
ISBN-139781837639533
Edition2nd Edition
Languages
Right arrow
Author (1)
Luca Zavarella
Luca Zavarella
author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella

Right arrow

Calling External APIs to Enrich Your Data

In the previous chapter, you saw an example of how to enrich existing data with external information. In that case, the data was exposed via CSV files, but this is not always the case. Very often, the data that is useful for enrichment is exposed through external application programming interfaces (APIs), most often in the form of web service endpoints. Power BI allows you to read data from a web service through a dedicated UI, but most of the time it is unusable. So, you have to resort to writing M code to get it done. Writing M code isn’t too difficult, but it’s not easy. You also have to be careful not to write code that causes refresh problems when you publish the report to the Power BI service. In addition, Power BI does not allow you to parallelize more than one call to the same web service to reduce latency when retrieving data. Using Python or R to retrieve data from a web service solves all of these issues very easily...

Technical requirements

This chapter requires you to have a working internet connection and Power BI Desktop already installed on your machine (we used version 2.116.966.0 64-bit, April 2023). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R with Power BI, and Chapter 3, Configuring Python with Power BI.

What is a web service?

In the course of your work as an analyst, you might sometimes need to use specialized interfaces for data retrieval, such as direct database queries or proprietary protocols. However, this is a rare case because today almost all external data sources are exposed as web services, even within an enterprise. Web services are the most common and popular way to communicate information between heterogeneous information systems, and they offer several advantages. They enable seamless integration by using standard internet protocols, ensuring broad compatibility across platforms and languages. Web services facilitate scalability and flexibility, allowing systems to adapt to changing needs without extensive reconfiguration. They also support secure data transmission through encryption and authentication mechanisms, improving data privacy and integrity. In addition, web services can significantly reduce development time and costs by leveraging reusable components and...

Join our book community on Discord

https://packt.link/EarlyAccessCommunity

Qr code Description automatically generated

Often, the analyses you are asked to perform in Power BI need to be based on fairly large data sources. In the previous chapter, you learned how to work with data that is larger than the RAM available on your machine, and you saw that query execution times are in the order of minutes.Now consider a report that needs to perform several calculations on the same corpus database. Obviously, the performance of the overall execution time of the report is strongly related to the expected number of operations to be performed on the database. Therefore, being able to reduce the execution time of individual queries will allow you to have a report whose dataset is updated as quickly as possible with each refresh.Queries implemented in a Power BI report most often need to retrieve aggregated measures that provide users with the information they need to take specific business actions. If the data is available in relational...

Technical Requirements

This chapter requires you to have a working internet connection and Power BI Desktop already installed on your machine (we used the version 2.114.664.0 64-bit, February 2022). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R with Power BI, and Chapter 3, Configuring Python with Power BI.

From CSV to the Parquet file format

The traditional approach of storing structured data in CSV files has long been the method of choice for many organizations. For example, the very dataset used in Chapter 9, Loading Large Datasets beyond the Available RAM in Power BI, which contains monthly U.S. flight data from 1987 to 2012, consists of many CSV files. However, this approach has several significant limitations that can negatively impact data processing and analysis:

  1. The CSV file format is not optimized for columnar storage and stores data in a row-based format. As a result, CSV files can have slower read and write times, especially for large datasets. This can result in slower query execution times and reduced overall performance, negatively impacting the efficiency of data processing and analysis.
  2. Although CSV files can handle basic data types such as integers and strings, they can struggle when it comes to dealing with more complex data structures such as arrays and nested data types...

Using the Parquet format to speed up a Power BI report

Now that you have learned how to take advantage of the Parquet format in both Python and R, you can apply this knowledge to real-world cases in Power BI. An example might be as follows.Suppose you need to integrate a report commissioned by your boss with data on the maximum flight delays in the U.S. on each day of the past two months for each departure airport. At the beginning of each month, a CSV can be retrieved with information on the above flights for the previous month. To be consistent with the data we have available, assume that this integration was commissioned for you in December 2012 and that you were able to find all data for flights from October 1987 through November 2012.In order to implement a data flow that optimizes both the time it takes to load new flight data and the time it takes to perform analysis on all the data in the history, the workflow to be followed will be as follows:

  1. At first, the Parquet dataset of...

Summary

In this chapter, we discussed the limitations of the native Parquet connector in Power BI. You learned that these limitations can be overcome by using specific packages in both Python and R. We are talking about Dask and Arrow, the latter being available in both APIs for Python and R. In the last part of the chapter, you saw how to use these techniques in a Power BI report to append new data to very large datasets and query them in a very short amount of time.In the next chapter, we'll finally start working with R and Python scripts in Power BI, doing data ingestion and data transformation.

References

Test your knowledge

Q01. What are the main features of the Parquet format?Q02. What are the benefits and limitations of the native Parquet connector in Power BI?Q03. What’s the difference between predicate pushdown and partition pruning?Q04. How do you explain the differences in performance between Dask and PyArrow?Q05. Is there a specific API to transform referenced data from an Arrow dataset into R?

Answers

A01. Parquet is an open-source columnar storage format designed to optimize file efficiency. It stores data in a columnar format, which allows for efficient compression and encoding of data, resulting in faster query execution times and reduced storage requirements. Parquet can handle complex data structures and uses binary encoding to enhance data security. It is compatible with interactive and serverless technologies like Azure Synapse Analytics and Azure Databricks and Fabric and can be used in Power BI with Python and R. Basically, Parquet is a high-performance storage...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Extending Power BI with Python and R - Second Edition
Published in: Mar 2024Publisher: PacktISBN-13: 9781837639533
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella