
You're reading from Extending Power BI with Python and R - Second Edition

Product type: Book
Published in: Mar 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837639533
Edition: 2nd Edition
Author (1)
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.

Exploratory Data Analysis

In Chapter 17, we discussed the challenges of using machine learning without premium or embedded capacity. One of the key pitfalls we highlighted was blindly applying automated machine learning (AutoML) solutions to a dataset, which often results in inaccurate models. To overcome this limitation, a critical step is to gain a deep understanding of the inherent characteristics of the dataset.

To accomplish this, this chapter introduces the concept of exploratory data analysis (EDA). This approach to analysis, pioneered by John Tukey, encourages statisticians to thoroughly explore the data and formulate hypotheses. By doing so, we can extract valuable information that ultimately enhances our understanding of the dataset and leads to the discovery of meaningful patterns among the variables.

By using EDA techniques, you can make informed decisions when selecting the most appropriate machine learning models and feature engineering methods. This chapter...

Technical requirements

This chapter requires a working internet connection and Power BI Desktop installed on your machine (version 2.121.644.0 64-bit, September 2023). You must also have properly configured the R and Python engines and IDEs, as outlined in Chapter 2, Configuring R with Power BI, and Chapter 3, Configuring Python with Power BI.

What is the goal of EDA?

The primary goal of EDA is to ensure that the dataset used for complex processes is clean and reliable. This involves addressing two critical aspects: eliminating missing values and outliers that have the potential to skew subsequent analyses, and selecting relevant variables that contribute substantive information while discarding those that are primarily noise.

By thoroughly cleaning the dataset, we eliminate potential sources of inaccuracy in the conclusions derived from subsequent processes. Missing values and outliers can disrupt the integrity of statistical analyses and lead to inaccurate results. Therefore, one of the first focuses of EDA is to identify and handle missing values appropriately, either by imputing appropriate estimates or by removing them altogether. Similarly, outliers, extreme observations that deviate significantly from the overall pattern, are identified and treated to avoid undue influence on subsequent analyses.
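The two cleaning steps described above can be sketched in a few lines of Python. This is a minimal, dependency-free illustration and not the code that accompanies the chapter: the `ages` column is made up, median imputation is just one possible estimate, and the 1.5 × IQR rule (Tukey's fences) is an assumed outlier criterion that the text does not prescribe.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical column with one missing value and one extreme observation.
ages = [23, 25, None, 27, 29, 31, 120]

n_missing = sum(v is None for v in ages)  # how many values are missing
filled = impute_median(ages)              # impute before looking for outliers
outliers = iqr_outliers(filled)           # observations flagged as extreme

print(n_missing, filled, outliers)
```

Whether a flagged observation should be removed, capped, or kept is a judgment call that depends on the business meaning of the variable; the sketch only identifies candidates.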

In addition...

Understanding your data

In this first phase, it is important to understand the meaning of each variable in the context of the problem that the dataset represents. Once the measurable business entities with which the variables are associated are clear, it is easier to infer how they interact with each other.

Knowing the size of the dataset, that is, the number of variables (columns) and the number of observations (rows), gives you a first sense of the volume of data you will be dealing with. Next, it is crucial to identify and define the type of each variable involved (numerical or categorical) right away, so that you can visualize it in the most appropriate way.

Then, knowing the descriptive statistics of the numerical variables in the dataset gives you a better feel for their values. In addition, when you want to figure out how those values are distributed, there are a few different types of graphs that can help...
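As a rough illustration of these first checks (dataset size, variable types, and descriptive statistics), here is a small sketch using only Python's standard library. The `rows` data, the column names, and the `variable_type` helper are hypothetical and chosen for the example; treating every non-numeric column as categorical is a simplifying assumption.

```python
from statistics import mean, median, stdev

# Hypothetical toy dataset: each dict is one observation (row).
rows = [
    {"age": 22, "fare": 7.25, "sex": "male"},
    {"age": 38, "fare": 71.28, "sex": "female"},
    {"age": 26, "fare": 7.93, "sex": "female"},
    {"age": 35, "fare": 53.10, "sex": "female"},
]


def variable_type(values):
    """Label a column numerical if every value is an int or float, else categorical."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if is_number else "categorical"


# Pivot the rows into columns so each variable can be inspected on its own.
columns = {name: [row[name] for row in rows] for name in rows[0]}

n_obs, n_vars = len(rows), len(columns)  # dataset size
types = {name: variable_type(vals) for name, vals in columns.items()}

# Descriptive statistics for the numerical variables only.
summary = {
    name: {
        "min": min(vals),
        "max": max(vals),
        "mean": mean(vals),
        "median": median(vals),
        "stdev": stdev(vals),
    }
    for name, vals in columns.items()
    if types[name] == "numerical"
}

print(n_obs, n_vars, types)
print(summary)
```

In a real project, a data frame library would usually provide the same information with less code; the point here is only to make explicit what "size", "type", and "descriptive statistics" mean for a concrete table.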

EDA with Python and R

If you need to do data exploration using only Python or R, there are tools that automatically generate a number of visualizations to make your life easier. We have two lists below, one for Python tools and the other for R tools, in case you need them. It is easier to find tools for interactive data analysis in Python than in R. Packages available in R often provide wrappers that greatly simplify EDA via coding.

The Python libraries for EDA are as follows:

EDA in Power BI

In this section, we will make extensive use of the ggplot2 package, an advanced R library designed for creating plots based on the Grammar of Graphics. It is not our intention to go into detail about every feature exposed by the package, even though it is used quite extensively in the code that accompanies the chapter. Our goal, as always, is to provide code that can be easily adapted for use in other projects and, above all, to provide a starting point for a more detailed look at the functions used. For more details, see the References section in this chapter.

In addition to the tools provided by Tidyverse in R (including ggplot2), we will also use the summarytools and DataExplorer packages to create EDA reports in Power BI. It is therefore necessary to install them:

  1. Open RStudio and make sure it is pointing to your latest CRAN R (version 4.2.2 in our case).
  2. Check that the Rcpp package is installed in the Packages tab on the bottom-right side...

Summary

In this chapter, you learned what EDA is used for and what goals it helps achieve. You also learned which tools are most commonly used to perform automated EDA using Python and R. Finally, you developed a complete and dynamic EDA report to analyze a dataset using R and its most popular graphing packages.

In the next chapter, you’ll see how to develop advanced visualizations.

References

For additional reading, check out the following books and articles:

Test your knowledge

  1. What is the primary goal of Exploratory Data Analysis (EDA)?
  2. What are the essential phases in the EDA process?
  3. What does the EDA process entail when it comes to understanding or getting to know your data?
  4. How can you visualize categorical variables in EDA?
  5. In EDA, what types of graphs can be used to understand the relationship between two different variables?
  6. What is the process of cleaning or tidying up your data in EDA?
  7. What is the primary objective when identifying connections between variables during EDA?
  8. Why is the definition of data types important in EDA and how does it relate to Power BI and R?
  9. How can I ensure the graphs utilizing R packages are displayed correctly in the Power BI service?
  10. What should I do if the report we developed can only be used on Power BI Desktop because some packages are not available on the R engine that is on the Power BI service?
  11. Given the large number of variables...