Reader small image

You're reading from  Extending Power BI with Python and R - Second Edition

Product typeBook
Published inMar 2024
Reading LevelIntermediate
PublisherPackt
ISBN-139781837639533
Edition2nd Edition
Languages
Right arrow
Author (1)
Luca Zavarella
Luca Zavarella
author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella

Right arrow

Adding Statistical Insights: Outliers and Missing Values

In the previous chapter, we explored a range of statistical functions in Power BI, including calculating correlations between variables. Now, we will continue to extend the data enrichment possibilities in Power BI through statistical functions by focusing on methodologies to detect univariate and multivariate outliers in your dataset. Additionally, we will explore advanced techniques to impute possible missing values in datasets and time series.

In this chapter, we will delve into what outliers are, highlight their impact on analysis accuracy, and provide strategies for dealing with them in Power BI. Furthermore, missing values can cause significant issues during data analysis and reporting. Unfortunately, Power BI lacks native tools, but luckily we can turn to Python or R to fill this gap. Therefore, understanding how to diagnose missing values accurately and implement effective imputation algorithms becomes crucial for...

Technical requirements

This chapter requires you to have a working internet connection and Power BI Desktop installed on your machine (version: 2.118.828.0 64-bit, June 2023). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R With Power BI, and Chapter 3, Configuring Python with Power BI.

What outliers are

Generally, outliers are defined as observations that are at an unusual distance from other observations in a data sample. In other words, they are uncommon values in a dataset. The abnormal distance we’re talking about doesn’t have a fixed measure, of course, but is strictly dependent on the dataset you’re analyzing. Simply put, it will be the analyst who decides the distance beyond which others will be considered abnormal distances, based on their experience and functional knowledge of the business reality represented by the dataset.

IMPORTANT NOTE

It makes sense to talk about outliers for numeric variables, or for numeric variables grouped by elements of categorical variables. It does not make sense to talk about outliers only for categorical variables.

But why is there so much focus on outlier management? The answer is that they very often have undesirable macroscopic effects on some statistical operations. The most...

The causes of outliers

Before considering any action to be taken on the outliers of a variable, it is necessary to consider what may have caused them. Once the cause is identified, it may be possible to correct the outliers immediately. Here is a possible categorization of the causes of outliers:

  • Data entry errors: There may be an analyst collecting the data who made a mistake in compiling the data. For example, if the analyst is collecting the birth dates of a group of people, the analyst may write 177 instead of 1977. If the dates collected are in the 1900-2100 range, it is easy to correct the outlier created by the data entry error. Other times, it is not possible to recover the correct value.
  • Intentional outliers: Very often, the introduction of “errors” is intentional on the part of the individuals to whom the measurements apply. For example, adolescents typically do not accurately report the amount of alcohol they consume.
  • Data processing...

Identifying outliers

There are different methods for detecting outliers, depending on whether you are analyzing one variable at a time (univariate analysis) or several variables at once (multivariate analysis). In the univariate case, the analysis is fairly straightforward. The multivariate case, however, is more complex. Let’s examine it in detail.

Univariate outliers

One of the most direct and common ways to identify outliers for a single variable is to make use of boxplots, which you learned about in Chapter 15, Adding Statistical Insights: Associations. Some of the key points of a boxplot are the interquartile range (IQR), defined as the distance from the first quartile (Q1) to the third quartile (Q3), the lower whisker (Q1 - 1.5 x IQR), and the upper whisker (Q3 + 1.5 x IQR):

A diagram of a bar chart  Description automatically generated

Figure 16.2: Boxplot’s main characteristics

Specifically, all observations that are before the lower whisker and after the upper whisker are identified as outliers. This...

Dealing with outliers

The most widely used approaches to deal with outliers are as follows:

  • Dropping them: The analyst concludes that eliminating the outliers altogether will guarantee better results in the final analysis.
  • Capping them: It is common to use the strategy of assigning a fixed extreme value (cap or winsorize) to all those observations that exceed it (in absolute value) when it is certain that all extreme observations behave in the same way as those with the cap value.
  • Assigning a new value: In this case, outliers are eliminated by replacing them with null values, and these null values are imputed using one of the simplest techniques: the replacement of null values with a fixed value that could be, for example, the mean or median of the variable in question. You’ll see more complex imputation strategies in the next sections.
  • Transforming the data: When the analyst is dealing with natural outliers, very often the histogram of the variable...

Implementing outlier detection algorithms

The first thing you’ll do is implement what you’ve just learned in Python.

Implementing outlier detection in Python

In this section, we will use the Wine Quality dataset created by Paulo Cortez et al. (https://archive.ics.uci.edu/ml/datasets/wine+quality) to show how to detect outliers in Python. The dataset contains as many observations as there are different types of red wine, each described by the organoleptic properties measured by the variables, except for the quality variable, which provides a measure of the quality of the product using a discrete grade scale from 1 to 10.

You’ll find the code used in this section in the Python\01-detect-outliers-in-python.py file in the Chapter 16 folder.

Once you have loaded the data from the winequality-red.csv file directly from the web into the df variable, let’s start by examining the sulphates variable. Let’s check if it contains any outliers...

What missing values are and how to deal with them

Data describing real-world phenomena often has a lot of missing data. A lack of data is a fact that cannot be overlooked, especially if the analyst wants to do an advanced study of the dataset to understand how much the variables in it are correlated.

The consequences of mishandling missing values can be many:

  • The statistical power of variables with missing values is reduced, especially if a significant number of values are missing for a single variable.
  • The representativeness of the dataset subject to missing values may also be diminished, and thus the dataset in question may not correctly represent the substantive characteristics of the set of all observations of a phenomenon.
  • Any statistical estimates may not converge to the values of the entire population, thus introducing bias.
  • The results of the analysis performed may not be correct.

But, first, let’s look at the possible causes...

Diagnosing missing values in R and Python

Before thinking about imputing missing values in a dataset, we first need to know the extent to which the missing values affect each individual variable.

The code used in this section can be found in the R\03-diagnose-missing-values-in-r.R and Python\03-diagnose-missing-values-in-python.py files in Chapter 16. To properly run this code and the code in the following sections, you must install the required R and Python packages as follows:

  1. Open Anaconda Prompt.
  2. Enter the conda activate pbi_powerquery_env command.
  3. Enter the pip install missingno==0.5.2 command.
  4. Enter the pip install upsetplot==0.8.0 command.
  5. Then, open RStudio and make sure it is referencing your latest CRAN R (version 4.4.2 in our case).
  6. Click on the Console window and enter install.packages(c("naniar", "imputeTS", "forecast", "ggpubr", "missForest", "mice", "miceadds...

Implementing missing value imputation algorithms

So far, we have often used Python and R indiscriminately to implement solutions to the problems addressed in this book. But when it comes to missing value analysis, we will focus on R over Python. There’s a compelling reason for this choice. R has traditionally been used by statisticians and data miners for statistical software development and data analysis, and it has an extensive collection of packages designed specifically for statistical analysis. Some of these packages, designed specifically for missing value analysis, are truly unrivaled when compared to Python’s ecosystem. In other words, R comes bundled with powerful, statistically specialized tools that are not only more sophisticated than their Python counterparts but also very easy to use.

So, suppose you need to compute the Pearson correlation coefficient between the two numeric variables Age and Fare of the Titanic disaster dataset. Let’s first...

Summary

In this chapter, you learned about outliers, their definition, common causes, and how to deal with them. You were also introduced to methods for identifying outliers using Python and R, taking into account the number of variables and their types.

Another focus of this chapter was on filling in missing values in tabular and time series datasets. You also learned how to diagnose missing values and implement imputation techniques using R. Next, you applied these imputation algorithms in Power BI.

In the next chapter, we will explore how you can use machine learning algorithms in Power BI without Premium or Embedded capabilities.

References

For additional reading, check out the following books and articles:

Test your knowledge

  1. What are outliers and why are they important to consider in data analysis?
  2. What are the causes of outliers and how can they be categorized?
  3. What are the different approaches to deal with outliers?
  4. What are missing values and why do they pose challenges in data analysis?
  5. How can missing values be diagnosed and visualized in a dataset?
  6. What is the concept of multiple imputation and how does it overcome the limitations of single imputation?
  7. How can missing values be imputed in a time series dataset?
  8. Why is R often considered more suitable than Python for conducting missing value imputation?

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://discord.gg/MKww5g45EB

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Extending Power BI with Python and R - Second Edition
Published in: Mar 2024Publisher: PacktISBN-13: 9781837639533
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.
Read more about Luca Zavarella