
You're reading from Extending Power BI with Python and R - Second Edition

Product type: Book
Published in: Mar 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781837639533
Edition: 2nd Edition
Author: Luca Zavarella

Luca Zavarella has a rich background as an Azure Data Scientist Associate and Microsoft MVP, with a Computer Engineering degree from the University of L'Aquila. His decade-plus experience spans the Microsoft Data Platform, starting as a T-SQL developer on SQL Server 2000 and 2005, then mastering the full suite of Microsoft Business Intelligence tools (SSIS, SSAS, SSRS), and advancing into data warehousing. Recently, his focus has shifted to advanced analytics, data science, and AI, contributing to the community as a speaker and blogger, especially on Medium. Currently, he leads the Data & AI division at iCubed, and he also holds an honors degree in classical piano from the "Alfredo Casella" Conservatory in L'Aquila.

Calculating Columns Using Complex Algorithms: Distances

The data ingestion phase allows you to gather all the information you need for your analysis from any data source. Once the various datasets have been imported, it's not uncommon to find that some of the raw information doesn't directly contribute to analytical insights as is. It is therefore essential to refine and enrich the dataset with additional computations that can provide new perspectives and answers to our questions. This often involves creating calculated columns whose measures are more closely aligned with our analytical goals. For example, calculating the distance between two geographic points or the dissimilarity between two strings can transform seemingly abstract or unrelated data into powerful tools for...

Technical requirements

This chapter requires you to have a working internet connection and Power BI Desktop already installed on your machine (we used version 2.114.664.0 64-bit, February 2023). You must have properly configured the R and Python engines and IDEs as outlined in Chapter 2, Configuring R with Power BI, and Chapter 3, Configuring Python with Power BI.

What is a distance?

A distance, in the context of data analysis and pattern recognition, is a quantitative measure that captures the dissimilarity or similarity between objects or points in a given space. It provides a numerical representation of the extent to which two entities are separate or close to each other and allows us to objectively quantify the relationships and differences between data points so that we can systematically compare and analyze them.

The concept of distance is particularly valuable because it provides a common metric for comparing and evaluating different types of data. Whether dealing with numerical attributes, categorical variables, or even complex structures such as images or text, distances can be defined and calculated to quantify the dissimilarity between instances. By using the concept of distance, analysts and data scientists gain insight into the relationships, patterns, and structures inherent in their data.
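To make the idea concrete, here is a minimal Python sketch of three simple distances: two for numerical attributes and one for categorical values. The function names and sample points are illustrative assumptions, not taken from the book.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def mismatch(p, q):
    """Number of positions where two categorical records differ."""
    return sum(a != b for a, b in zip(p, q))


# Illustrative sample data (hypothetical)
a = (1.0, 2.0)
b = (4.0, 6.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(mismatch(("red", "small"), ("red", "large")))  # 1
```

The same pair of points yields different values under different distances, which is why the choice of distance should follow from the kind of data and the question being asked.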

The concept of distance finds...

The distance between two geographic locations

It is often the case that you have coordinates in your dataset, expressed in latitude and longitude, that identify points on the globe. Depending on the purpose of the analysis you want to perform, you can use these coordinates to calculate measures that best describe the scenario you want to address. For example, assuming you have the geographic coordinates of some hotels in a dataset, it might be useful to calculate the distance of each hotel to the nearest airport if you want to provide an additional value of interest to a visitor.
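The hotel-to-nearest-airport scenario can be sketched in plain Python as follows. This is a minimal illustration, assuming a spherical Earth and the haversine formula. The helper names, the Earth radius constant, and the approximate coordinates are assumptions made for the example. The chapter itself relies on dedicated packages (PyGeodesy in Python, geosphere in R) rather than hand-written formulas.

```python
import math

# Assumption: a spherical Earth with its mean radius, in kilometers
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest(point, candidates):
    """Return (name, distance_km) of the candidate closest to point."""
    return min(
        ((name, haversine_km(point[0], point[1], lat, lon))
         for name, (lat, lon) in candidates.items()),
        key=lambda item: item[1],
    )


# Approximate coordinates, used only for illustration
airports = {
    "JFK": (40.6413, -73.7781),
    "LGA": (40.7769, -73.8740),
    "EWR": (40.6895, -74.1745),
}
hotel = (40.7580, -73.9855)  # a hypothetical hotel in midtown Manhattan

name, dist = nearest(hotel, airports)
print(name, round(dist, 1))
```

In a Power BI scenario, the same function could be applied row by row to a hotels table to produce a calculated "distance to nearest airport" column.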

Some theory first

To understand a phenomenon fully, to know what it consists of and what techniques have been developed to deal with it, it is necessary to go deeper into the theory behind it. Since we are talking about measuring the distance between two points on the globe, the first thing that comes to mind is to simplify the phenomenon by using a model that approximates reality. So let's...

The distance between two strings

When considering the concept of distance, our first thoughts often focus on measuring the physical space between two points in a well-defined environment. Whether it’s solving problems in plane geometry or navigating the three-dimensional world we inhabit, distance plays a crucial role. However, it is important to recognize that the concept of distance extends beyond physical dimensions.

Some theory first

As you may recall from the introductory part of this chapter, there are numerous domains where distance is of immense importance in describing events and relationships. One such domain that may surprise you is the space defined by strings of text, that is, the set of all conceivable values or arrangements that a string of text can take. That's right: it is perfectly possible to construct...
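As one illustration of a distance defined on strings, the sketch below computes the Levenshtein (edit) distance, one of the metrics named in the chapter summary, using the standard dynamic-programming approach. The function name and test strings are illustrative. The chapter itself uses Python's TextDistance package and R's stringdist library rather than a hand-written implementation.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    # Keep the shorter string in t so each row stays as small as possible
    if len(s) < len(t):
        s, t = t, s
    # previous[j] holds the distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("flaw", "lawn"))       # 2
```

Each call fills a table of len(s) × len(t) cells, so comparing every string in one column against every string in another grows quickly with the number of rows.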

Summary

In this chapter, we ventured into the fascinating realm of distances and their many applications. We began by exploring the calculation of geographic distances, introducing the formulas of the spherical law of cosines, the law of haversines, and Vincenty's distance. Using the PyGeodesy package in Python and the geosphere library in R, we harnessed the power of computation to accurately measure distances between geographic locations.

Expanding our horizons, we delved into the realm of string distances. We encountered the metrics of Hamming, Levenshtein, Jaro-Winkler, and Jaccard distances, each offering unique insights into the dissimilarity or similarity between strings. Python’s TextDistance package and R’s stringdist library provided us with the essential tools to effortlessly compute these string distances.

In your study, you encountered a significant computational hurdle: the quadratic nature of the distance algorithms implemented. With the...

References

Test your knowledge

  1. Why is the concept of distance particularly valuable?
  2. What was one of the most practical benefits introduced by the definition of the Haversine function?
  3. What is the assumption that makes Vincenty’s formula for calculating the distance between two geographic locations so much more accurate than others?
  4. What libraries are used in Python and R to compute distances between geographic points?
  5. If Hamming distance is very powerful, why is it not often used in common string comparison problems?
  6. When is it recommended to use the Damerau-Levenshtein distance?
  7. When is it recommended to use the Jaro-Winkler distance?
  8. When is it recommended to use the Jaccard distance?
  9. What libraries are used in Python and R to compute distances between strings?

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases...
