You're reading from Extending Power BI with Python and R - Second Edition

Product typeBook

Published inMar 2024

Reading LevelIntermediate

PublisherPackt

ISBN-139781837639533

Edition2nd Edition

Languages

Python

Tools

Power BI

Concepts

Business Intelligence

Author (1)

Luca Zavarella

Adding Statistical Insights: Outliers and Missing Values

In the previous chapter, we explored a range of statistical functions in Power BI, including calculating correlations between variables. Now, we will continue to extend the data enrichment possibilities in Power BI through statistical functions by focusing on methodologies to detect univariate and multivariate outliers in your dataset. Additionally, we will explore advanced techniques to impute possible missing values in datasets and time series.

In this chapter, we will delve into what outliers are, highlight their impact on analysis accuracy, and provide strategies for dealing with them in Power BI. Furthermore, missing values can cause significant issues during data analysis and reporting. Unfortunately, Power BI lacks native tools, but luckily we can turn to Python or R to fill this gap. Therefore, understanding how to diagnose missing values accurately and implement effective imputation algorithms becomes crucial for...

Technical requirements

This chapter requires you to have a working internet connection and Power BI Desktop installed on your machine (version: 2.118.828.0 64-bit, June 2023). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R With Power BI, and Chapter 3, Configuring Python with Power BI.

What outliers are

Generally, outliers are defined as observations that are at an unusual distance from other observations in a data sample. In other words, they are uncommon values in a dataset. The abnormal distance we’re talking about doesn’t have a fixed measure, of course, but is strictly dependent on the dataset you’re analyzing. Simply put, it will be the analyst who decides the distance beyond which others will be considered abnormal distances, based on their experience and functional knowledge of the business reality represented by the dataset.

IMPORTANT NOTE

It makes sense to talk about outliers for numeric variables, or for numeric variables grouped by elements of categorical variables. It does not make sense to talk about outliers only for categorical variables.

But why is there so much focus on outlier management? The answer is that they very often have undesirable macroscopic effects on some statistical operations. The most...

The causes of outliers

Before considering any action to be taken on the outliers of a variable, it is necessary to consider what may have caused them. Once the cause is identified, it may be possible to correct the outliers immediately. Here is a possible categorization of the causes of outliers:

Data entry errors: There may be an analyst collecting the data who made a mistake in compiling the data. For example, if the analyst is collecting the birth dates of a group of people, the analyst may write 177 instead of 1977. If the dates collected are in the 1900-2100 range, it is easy to correct the outlier created by the data entry error. Other times, it is not possible to recover the correct value.
Intentional outliers: Very often, the introduction of “errors” is intentional on the part of the individuals to whom the measurements apply. For example, adolescents typically do not accurately report the amount of alcohol they consume.
Data processing...

Identifying outliers

There are different methods for detecting outliers, depending on whether you are analyzing one variable at a time (univariate analysis) or several variables at once (multivariate analysis). In the univariate case, the analysis is fairly straightforward. The multivariate case, however, is more complex. Let’s examine it in detail.

Univariate outliers

One of the most direct and common ways to identify outliers for a single variable is to make use of boxplots, which you learned about in Chapter 15, Adding Statistical Insights: Associations. Some of the key points of a boxplot are the interquartile range (IQR), defined as the distance from the first quartile (Q1) to the third quartile (Q3), the lower whisker (Q1 - 1.5 x IQR), and the upper whisker (Q3 + 1.5 x IQR):

A diagram of a bar chart Description automatically generated

Figure 16.2: Boxplot’s main characteristics

Specifically, all observations that are before the lower whisker and after the upper whisker are identified as outliers. This...

Dealing with outliers

The most widely used approaches to deal with outliers are as follows:

Dropping them: The analyst concludes that eliminating the outliers altogether will guarantee better results in the final analysis.
Capping them: It is common to use the strategy of assigning a fixed extreme value (cap or winsorize) to all those observations that exceed it (in absolute value) when it is certain that all extreme observations behave in the same way as those with the cap value.
Assigning a new value: In this case, outliers are eliminated by replacing them with null values, and these null values are imputed using one of the simplest techniques: the replacement of null values with a fixed value that could be, for example, the mean or median of the variable in question. You’ll see more complex imputation strategies in the next sections.
Transforming the data: When the analyst is dealing with natural outliers, very often the histogram of the variable...

Implementing outlier detection algorithms

The first thing you’ll do is implement what you’ve just learned in Python.

Implementing outlier detection in Python

In this section, we will use the Wine Quality dataset created by Paulo Cortez et al. (https://archive.ics.uci.edu/ml/datasets/wine+quality) to show how to detect outliers in Python. The dataset contains as many observations as there are different types of red wine, each described by the organoleptic properties measured by the variables, except for the quality variable, which provides a measure of the quality of the product using a discrete grade scale from 1 to 10.

You’ll find the code used in this section in the Python\01-detect-outliers-in-python.py file in the Chapter 16 folder.

Once you have loaded the data from the winequality-red.csv file directly from the web into the df variable, let’s start by examining the sulphates variable. Let’s check if it contains any outliers...

What missing values are and how to deal with them

Data describing real-world phenomena often has a lot of missing data. A lack of data is a fact that cannot be overlooked, especially if the analyst wants to do an advanced study of the dataset to understand how much the variables in it are correlated.

The consequences of mishandling missing values can be many:

The statistical power of variables with missing values is reduced, especially if a significant number of values are missing for a single variable.
The representativeness of the dataset subject to missing values may also be diminished, and thus the dataset in question may not correctly represent the substantive characteristics of the set of all observations of a phenomenon.
Any statistical estimates may not converge to the values of the entire population, thus introducing bias.
The results of the analysis performed may not be correct.

But, first, let’s look at the possible causes...

Diagnosing missing values in R and Python

Before thinking about imputing missing values in a dataset, we first need to know the extent to which the missing values affect each individual variable.

The code used in this section can be found in the R\03-diagnose-missing-values-in-r.R and Python\03-diagnose-missing-values-in-python.py files in Chapter 16. To properly run this code and the code in the following sections, you must install the required R and Python packages as follows:

Open Anaconda Prompt.
Enter the conda activate pbi_powerquery_env command.
Enter the pip install missingno==0.5.2 command.
Enter the pip install upsetplot==0.8.0 command.
Then, open RStudio and make sure it is referencing your latest CRAN R (version 4.4.2 in our case).
Click on the Console window and enter install.packages(c("naniar", "imputeTS", "forecast", "ggpubr", "missForest", "mice", "miceadds...

Implementing missing value imputation algorithms

So far, we have often used Python and R indiscriminately to implement solutions to the problems addressed in this book. But when it comes to missing value analysis, we will focus on R over Python. There’s a compelling reason for this choice. R has traditionally been used by statisticians and data miners for statistical software development and data analysis, and it has an extensive collection of packages designed specifically for statistical analysis. Some of these packages, designed specifically for missing value analysis, are truly unrivaled when compared to Python’s ecosystem. In other words, R comes bundled with powerful, statistically specialized tools that are not only more sophisticated than their Python counterparts but also very easy to use.

So, suppose you need to compute the Pearson correlation coefficient between the two numeric variables Age and Fare of the Titanic disaster dataset. Let’s first...

Summary

In this chapter, you learned about outliers, their definition, common causes, and how to deal with them. You were also introduced to methods for identifying outliers using Python and R, taking into account the number of variables and their types.

Another focus of this chapter was on filling in missing values in tabular and time series datasets. You also learned how to diagnose missing values and implement imputation techniques using R. Next, you applied these imputation algorithms in Power BI.

In the next chapter, we will explore how you can use machine learning algorithms in Power BI without Premium or Embedded capabilities.

References

For additional reading, check out the following books and articles:

Add Marginal Plot to ggplot2 Scatterplot Using ggExtra Package in R (https://statisticsglobe.com/ggplot2-graphic-with-marginal-plot-in-r)
5 Things You Should Know About Covariance (https://towardsdatascience.com/5-things-you-should-know-about-covariance-26b12a0516f1)
Mahalanobis Distance and its Limitations (https://rpubs.com/jjsuarestra99/mahalanobis)
The Box-Cox transformation for a dependent variable in a regression (https://blogs.sas.com/content/iml/2022/08/17/box-cox-regression.html)
How to Use Power Transforms for Machine Learning (https://machinelearningmastery.com/power-transforms-with-scikit-learn/)
The Relationship between the Mahalanobis Distance and the Chi-Squared Distribution (https://markusthill.github.io/mahalanbis-chi-squared/)
Unsupervised anomaly detection algorithms on real-world data: how many do we need? (https://arxiv.org/abs/2305.00735...

Test your knowledge

What are outliers and why are they important to consider in data analysis?
What are the causes of outliers and how can they be categorized?
What are the different approaches to deal with outliers?
What are missing values and why do they pose challenges in data analysis?
How can missing values be diagnosed and visualized in a dataset?
What is the concept of multiple imputation and how does it overcome the limitations of single imputation?
How can missing values be imputed in a time series dataset?
Why is R often considered more suitable than Python for conducting missing value imputation?

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://discord.gg/MKww5g45EB

The rest of the chapter is locked

You have been reading a chapter from

Extending Power BI with Python and R - Second Edition

Published in: Mar 2024Publisher: PacktISBN-13: 9781837639533

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages