You're reading from  Extending Power BI with Python and R - Second Edition

Product type: Book
Published in: Mar 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837639533
Edition: 2nd Edition
Author: Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.

Adding Statistical Insights: Associations

In the previous chapter, we discussed enriching your data, that is, improving the quality and depth of information through the use of complex algorithms. There are other ways to extract valuable insights from data, and one effective approach is to apply statistical techniques. Statistics provides a framework for examining the relationships between the variables in your dataset and for drawing meaningful insights from them.

In this chapter, we will cover the basic concepts of some statistical procedures. Understanding these techniques will give you a deeper understanding of your data and help you make informed decisions based on the results of the analysis. You will learn about the following topics:

  • Exploring associations between variables
  • ...

Technical requirements

This chapter requires you to have a working internet connection and Power BI Desktop already installed on your machine (version 2.118.828.0, 64-bit, June 2023). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R with Power BI, and Chapter 3, Configuring Python with Power BI.

Exploring associations between variables

At first glance, you may wonder what the point of finding relationships between variables is. The ability to understand the behavior of a pair of variables and to identify a pattern in their behavior helps business owners identify key factors that can skew certain indicators of business health in their favor.

Knowing the pattern that binds the trend of two variables gives you the power to predict one of them with some certainty by knowing the other. So, knowing the tools to uncover these patterns gives you a kind of analytical superpower that is always attractive to business owners.

In general, two variables are associated if the values of one are somehow related to the values of the other. A measure of the extent of the association between two variables is called a correlation. The concept of correlation is directly applicable when both variables are numerical. Let’s see how.

Correlation between numeric variables

The first thing we generally do to understand whether there is an association between two numeric variables is to plot them on the two Cartesian axes to obtain a scatterplot:

Figure 15.1: A simple scatterplot

Using a scatterplot, it is possible to identify three important characteristics of a possible association:

  • Direction: This can be positive (increasing), negative (decreasing), or not defined (no association found – or both increasing and decreasing at the same time). If the increment of one variable is in accordance with the increment of the other, the direction is positive; if the increment of one variable is in accordance with the decrement of the other, it is negative; otherwise, it is not defined:

Figure 15.2: Direction types of the association

  • Form: This describes the general form that the association takes in its simplest sense. Obviously, there are many possible forms, but there...
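
The direction of an association can also be checked numerically. The following is a minimal sketch, not the chapter's own code: it computes the Pearson correlation coefficient (one of the methods named in the Summary) in plain Python and maps its sign to the three direction types described above. The function names, the `tol` threshold, and the toy data are assumptions made for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Joint deviation of the two variables and the spread of each one
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


def direction(r, tol=0.1):
    """Map a coefficient to a direction label.

    tol is an arbitrary illustrative threshold, not a statistical test.
    """
    if r > tol:
        return "positive"
    if r < -tol:
        return "negative"
    return "not defined"


# Hypothetical toy data, not taken from the book
x = [1, 2, 3, 4, 5]
increasing = [2, 4, 5, 4, 6]
decreasing = [10, 8, 7, 5, 2]
u_shaped = [4, 1, 0, 1, 4]  # decreases, then increases

for name, y in [("increasing", increasing),
                ("decreasing", decreasing),
                ("u_shaped", u_shaped)]:
    r = pearson_r(x, y)
    print(f"{name}: r = {r:.3f}, direction = {direction(r)}")
```

Note that the U-shaped series yields a coefficient of zero even though the two variables are clearly related. This is why the scatterplot should be inspected before relying on a single coefficient.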

Correlation between non-numeric variables

We have shown that, in the case of two numeric variables, you can get a sense of the association between them by looking at their scatterplot. Obviously, this strategy cannot be used when one or both variables are non-numeric. Note that a variable is categorical (or qualitative or nominal) when it takes on values that are names or labels, such as smartphone operating systems (iOS, Android, Linux, and so on). Let’s see how to analyze the case of two categorical variables.

The first question that comes to mind is the following: is there a graphical representation that helps us to understand whether there is a significant association between two categorical variables? The answer is yes, and it is called a mosaic plot. In short, the goal of the mosaic plot is to show, at a glance, the strength of the association between the individual elements of each variable by the color of the tiles representing the pairs of elements in question...
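
For a numerical counterpart to the mosaic plot, the Summary names Cramér’s V as one coefficient for two categorical variables. The sketch below is an illustrative, uncorrected version of Cramér’s V written with the standard library only; the chapter's own implementation may differ (for example, by applying a bias correction). The function name, the `plan` variable, and the toy data are hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V for two equal-length sequences of labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    observed = Counter(zip(xs, ys))   # cell counts of the contingency table
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        raise ValueError("each variable needs at least two distinct labels")
    # Pearson chi-square statistic: compare observed counts with the counts
    # expected if the two variables were independent
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical toy data: operating system vs. an invented "plan" label
os_labels = ["iOS", "iOS", "Android", "Android"]
plan_linked = ["A", "A", "B", "B"]     # plan fully determined by OS
plan_unlinked = ["A", "B", "A", "B"]   # plan independent of OS

v_linked = cramers_v(os_labels, plan_linked)
v_unlinked = cramers_v(os_labels, plan_unlinked)
print(v_linked, v_unlinked)
```

The coefficient ranges from 0 (no association) to 1 (one variable fully determines the other). It is symmetric, so it does not say which variable predicts which.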

Correlation between non-numeric and numeric variables

If you want to graphically represent an association between a numeric variable and a categorical (non-numeric) variable, a boxplot or a violin plot is the right choice. If you have ever had to represent the distribution of a variable by highlighting its key statistics, then you should be familiar with the boxplot:

Figure 15.31: Graphical explanation of a boxplot
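
The key statistics that a boxplot highlights can be computed directly. The sketch below assumes the common Tukey convention, in which the whiskers extend to the most extreme values within 1.5 times the interquartile range of the box; the figure in the book may use a different convention. The function name and the sample values are hypothetical.

```python
import statistics


def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    # Quartiles; the "inclusive" method treats the data as the full population
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = boxplot_stats(sample)
print(stats)
```

Different tools compute quartiles with slightly different interpolation rules, so the box edges drawn by a plotting library may not match these values exactly.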

A violin plot is nothing more than a combination of a histogram/distribution plot and a boxplot for the same variable:

Figure 15.32: Graphical explanation of a violin plot

See the References section for more details about boxplots and violin plots.

If you need to relate a numeric variable to a categorical variable, you can create a violin plot for each element of the categorical variable. Returning to the example of the Titanic disaster dataset, given the Pclass (categorical) and Age (numeric...
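
The Summary names the correlation ratio as the coefficient for a numeric variable paired with a categorical one. As a rough sketch under stated assumptions, it can be computed as the square root of the between-group sum of squares divided by the total sum of squares. The function name is hypothetical, and the `pclass` and `age` values below are invented toy data, not rows from the Titanic dataset.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric variable."""
    if len(categories) != len(values) or not values:
        raise ValueError("need two non-empty sequences of equal length")
    # Group the numeric values by category label
    groups = defaultdict(list)
    for label, value in zip(categories, values):
        groups[label].append(value)
    grand_mean = sum(values) / len(values)
    # Variation explained by the group means vs. total variation
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # a constant numeric variable carries no association
    return math.sqrt(ss_between / ss_total)


# Invented toy data: class label vs. age
pclass = [1, 1, 2, 2, 3, 3]
age = [40, 44, 30, 34, 20, 24]
eta_strong = correlation_ratio(pclass, age)

# Same ages in every class: the class tells nothing about age
eta_none = correlation_ratio([1, 1, 2, 2], [20, 40, 20, 40])
print(eta_strong, eta_none)
```

A value near 1 means the category explains most of the spread of the numeric variable, which corresponds to violins that sit at clearly different heights; a value near 0 corresponds to violins that largely overlap.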

Summary

In this chapter, you discovered several methods for calculating the correlation coefficient for different types of variables in your data analysis. First, you learned how to calculate the correlation coefficient using the Pearson, Spearman, and Kendall methods for two numeric variables. These methods help you understand the strength and direction of the relationship between two numeric variables. You also explored how to calculate the correlation coefficient for two categorical variables using Cramér’s V and Theil’s coefficient of uncertainty. Finally, you learned how to calculate the correlation coefficient between a numeric variable and a categorical variable using the correlation ratio.

In the next chapter, you will see how important statistics is for identifying outliers and imputing missing values in your dataset.

Test your knowledge

  1. Why is it important to explore associations between variables in data analysis?
  2. What are the different types of associations between variables?
  3. How can we measure the strength and direction of an association between numeric variables?
  4. What are Cramér’s V coefficient, Theil’s U uncertainty coefficient, and Pearson’s correlation ratio, and how are they used in analyzing associations between categorical and numeric variables?
  5. How can we use correlation coefficients to identify key predictors in a dataset?
  6. How can we visualize associations between variables?

Learn more on Discord

To join the Discord community for this book, where you can share feedback, ask the author questions, and learn about new releases, follow the link below:

https://discord.gg/MKww5g45EB
