You're reading from Extending Power BI with Python and R - Second Edition

Product type: Book

Published in: Mar 2024

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781837639533

Edition: 2nd Edition

Languages: Python

Tools: Power BI

Concepts: Business Intelligence

Author: Luca Zavarella

Exploratory Data Analysis

In Chapter 17, we discussed the challenges of using machine learning without premium or embedded capacity. One of the key pitfalls we highlighted was blindly applying automated machine learning (AutoML) solutions to a dataset, which often results in inaccurate models. To overcome this limitation, a critical step is to gain a deep understanding of the inherent characteristics of the dataset.

To accomplish this, this chapter introduces the concept of exploratory data analysis (EDA). This approach to analysis, pioneered by John Tukey, encourages statisticians to thoroughly explore the data and formulate hypotheses. By doing so, we can extract valuable information that ultimately enhances our understanding of the dataset and leads to the discovery of meaningful patterns among the variables.

By using EDA techniques, you can make informed decisions when selecting the most appropriate machine learning models and feature engineering methods. This chapter...

Technical requirements

This chapter requires you to have a working internet connection and Power BI Desktop already installed on your machine (version: 2.121.644.0 64-bit, September 2023). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R with Power BI.

What is the goal of EDA?

The primary goal of EDA is to ensure that the dataset used for complex processes is clean and reliable. This involves addressing two critical aspects: handling missing values and outliers that have the potential to skew subsequent analyses, and selecting relevant variables that contribute substantive information while discarding those that are primarily noise.

By thoroughly cleaning the dataset, we eliminate potential sources of inaccuracy in the conclusions derived from subsequent processes. Missing values and outliers can undermine the integrity of statistical analyses and lead to inaccurate results. Therefore, one of the first focuses of EDA is to identify missing values and handle them appropriately, either by imputing suitable estimates or by removing the affected observations altogether. Similarly, outliers (extreme observations that deviate significantly from the overall pattern) are identified and treated so that they do not unduly influence subsequent analyses.
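As a rough illustration of the two cleaning steps just described, the following sketch uses only the Python standard library. It is not the chapter's own code: the `ages` column is made up, median imputation is just one of several possible strategies, and the 1.5 × IQR rule (Tukey's fences) is one common convention for flagging outliers, assumed here for the sake of the example.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical numeric column with two missing entries and one extreme value
ages = [23, 25, None, 27, 24, 26, 95, None, 22]

filled = impute_missing(ages)    # missing entries become the median
outliers = iqr_outliers(filled)  # values flagged for closer inspection
print(filled)
print(outliers)
```

Whether a flagged value should be removed, capped, or kept depends on the business meaning of the variable, so a sketch like this only identifies candidates for review.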

In addition...

Understanding your data

In this first phase, it is important to understand the meaning of each variable in the context of the problem that the dataset represents. Once the measurable business entities with which the variables are associated are clear, it is easier to infer how they interact with each other.

Knowing the size of the dataset, that is, the number of variables (columns) and the number of observations (rows), gives you a first idea of how much data you will be dealing with. Next, it is crucial to identify the type of each variable (numerical or categorical) right away so that you can visualize it in the most appropriate way.
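A minimal sketch of this first inspection, assuming the dataset is held as a list of dictionaries, could look as follows. The rows, the column names, and the helper functions `dataset_shape` and `variable_kinds` are all hypothetical and serve only to illustrate the idea.

```python
# Hypothetical dataset: one dictionary per observation (row)
rows = [
    {"age": 34, "income": 52000.0, "segment": "retail"},
    {"age": 45, "income": 61000.0, "segment": "corporate"},
    {"age": 29, "income": None, "segment": "retail"},
    {"age": 51, "income": 48000.0, "segment": "public"},
]


def dataset_shape(rows):
    """Return (number of observations, number of variables)."""
    n_obs = len(rows)
    n_vars = len(rows[0]) if rows else 0
    return n_obs, n_vars


def variable_kinds(rows):
    """Label each variable as 'numerical' or 'categorical'.

    A variable counts as numerical when every non-missing value is an
    int or a float (booleans are excluded on purpose).
    """
    kinds = {}
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        kinds[name] = "numerical" if numeric else "categorical"
    return kinds


print(dataset_shape(rows))
print(variable_kinds(rows))
```

Inferring the type from the stored values is only a starting point: a numeric code such as a postal code is better treated as categorical, which is why the meaning of each variable has to be understood first.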

Then, knowing the descriptive statistics of the numerical variables in the dataset gives you a better feel for their values. In addition, when you’re trying to figure out how they’re distributed, there are a few different types of graphs that can help...
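To make the idea of descriptive statistics concrete, here is a small standard-library sketch. The `income` values and the `describe` helper are hypothetical, and quartile conventions differ between tools, so the quartiles reported here may not match those of another package exactly.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Return common descriptive statistics for a numeric variable."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
    }


# Hypothetical numeric variable with one very large value
income = [52000, 48000, 61000, 45000, 250000]

summary = describe(income)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this made-up example, the mean is far larger than the median, which hints at a right-skewed distribution and is a cue to look at the variable graphically.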

EDA with Python and R

If you need to do data exploration using only Python or R, there are tools that automatically generate a number of visualizations to make your life easier. We have two lists below, one for Python tools and the other for R tools, in case you need them. It is easier to find tools for interactive data analysis in Python than in R. Packages available in R often provide wrappers that greatly simplify EDA via coding.

The Python libraries for EDA are as follows:

Sweetviz (https://pypi.org/project/sweetviz/): An open-source Python library that generates beautiful, high-density visualizations to kickstart EDA with just two lines of code
Lux (https://lux-api.readthedocs.io/): A Python library that facilitates fast and easy data exploration by automating the visualization and data analysis process
pandas Profiling (https://pandas-profiling.github.io/pandas-profiling/): Generates profile reports from a pandas DataFrame for data analysis
pandasGUI...

EDA in Power BI

In this section, we will make extensive use of the ggplot2 package, an advanced R library designed for creating plots based on the Grammar of Graphics. It is not our intention to go into detail about every feature exposed by the package, even though it is used quite extensively in the code that accompanies the chapter. Our goal, as always, is to provide code that can be easily adapted for use in other projects and, above all, to provide a starting point for a more detailed look at the functions used. For more details, see the References section in this chapter.

In addition to the tools provided by Tidyverse in R (including ggplot2), we will also use the summarytools and DataExplorer packages to create EDA reports in Power BI. It is therefore necessary to install them:

Open RStudio and make sure it is pointing to your latest CRAN R (version 4.2.2 in our case).
Check that the Rcpp package is installed in the Packages tab on the bottom-right side...

Summary

In this chapter, you learned what EDA is used for and what goals it helps achieve. You also learned which tools are most commonly used to perform automated EDA using Python and R. Finally, you developed a complete and dynamic EDA report to analyze a dataset using R and its most popular graphing packages.

In the next chapter, you’ll see how to develop advanced visualizations.

References

For additional reading, check out the following books and articles:

Powerful EDA (Exploratory Data Analysis) in just two lines of code using Sweetviz (https://towardsdatascience.com/powerful-eda-exploratory-data-analysis-in-just-two-lines-of-code-using-sweetviz-6c943d32f34)
Lux: A Python API for Intelligent Visual Data Discovery (https://www.youtube.com/watch?v=YANIids_Nkk)
pandas Profiling of the Titanic Dataset (https://docs.profiling.ydata.ai/latest/)
pandasGUI Demo (https://www.youtube.com/watch?v=NKXdolMxW2Y)
A Comprehensive Guide to the Grammar of Graphics for Effective Visualization of Multi-dimensional Data (https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149)
The Complete ggplot2 Tutorial (http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html)
Everything you need to know about interpreting correlations (https...

Test your knowledge

What is the primary goal of Exploratory Data Analysis (EDA)?
What are the essential phases in the EDA process?
What does the EDA process entail when it comes to understanding or getting to know your data?
How can you visualize categorical variables in EDA?
In EDA, what types of graphs can be used to understand the relationship between two different variables?
What is the process of cleaning or tidying up your data in EDA?
What is the primary objective when identifying connections between variables during EDA?
Why is the definition of data types important in EDA and how does it relate to Power BI and R?
How can I ensure the graphs utilizing R packages are displayed correctly in the Power BI service?
What should I do if the report we developed can only be used on Power BI Desktop because some packages are not available on the R engine that is on the Power BI service?
Given the large number of variables...


Author (1)

Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.