You're reading from Hands-On Data Preprocessing in Python

Product type: Book
Published in: Jan 2022
Publisher: Packt
ISBN-13: 9781801072137
Edition: 1st Edition
Author (1)
Roy Jafari

Roy Jafari, Ph.D. is an assistant professor of business analytics at the University of Redlands. Roy has taught and developed college-level courses that cover data cleaning, decision making, data science, machine learning, and optimization. Roy's style of teaching is hands-on and he believes the best way to learn is to learn by doing. He uses active learning teaching philosophy and readers will get to experience active learning in this book. Roy believes that successful data preprocessing only happens when you are equipped with the most efficient tools, have an appropriate understanding of data analytic goals, are aware of data preprocessing steps, and can compare a variety of methods. This belief has shaped the structure of this book.

Chapter 12: Data Fusion and Data Integration

The popular understanding of data preprocessing goes hand in hand with data cleaning. Although data cleaning is a major part of data preprocessing, it is not the only one. In this chapter, we will learn about two other important areas: data fusion and data integration. In short, both are about combining two or more sources of data to serve an analytic goal.

First, we will learn about the similarities and differences between data fusion and data integration. After that, we will learn about six frequent challenges regarding data fusion and data integration. Then, by looking at three complete analytic examples, we will get to encounter these challenges and deal with them.

In this chapter, we are going to cover the following main topics:

  • What are data fusion and data integration?
  • Frequent challenges regarding data fusion and integration
  • Example 1...

Technical requirements

You can find the code and datasets for this chapter in this book's GitHub repository at https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. Find chapter12 in this repository and download the code and the data for a better learning experience.

What are data fusion and data integration?

In most cases, the terms data fusion and data integration are used interchangeably, but there are conceptual and technical distinctions between them. We will get to those shortly. Let's start with what both have in common and what they mean. Whenever the data we need for our analytic goals comes from different sources, we must integrate those sources into one dataset before we can perform the analysis. The following diagram summarizes this integration visually:

Figure 12.1 – Data integration from different sources
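
As a minimal sketch of the idea in Figure 12.1, assume two small made-up sources that describe the same products. The names inventory_df and price_df, and every column in them, are hypothetical and do not come from the book's datasets. One way pandas can combine such sources into a single dataset is an outer merge on a shared key:

```python
import pandas as pd

# Two hypothetical sources describing overlapping sets of products.
inventory_df = pd.DataFrame({
    "product_id": [1, 2, 3],
    "units_in_stock": [40, 0, 15],
})
price_df = pd.DataFrame({
    "product_id": [1, 2, 4],
    "unit_price": [9.99, 24.50, 5.00],
})

# An outer merge keeps every product from both sources. indicator=True
# adds a _merge column recording which source(s) each row came from.
integrated_df = inventory_df.merge(
    price_df, on="product_id", how="outer", indicator=True
)
print(integrated_df)
```

The _merge column is a quick way to see which data objects appear in only one source, which is often the first sign of the challenges discussed next.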

In the real world, data integration is much more difficult than what's shown in the preceding figure. There are many challenges that you need to overcome before integration is possible. Some of them stem from organizational privacy and security restrictions that limit our access to data. But even assuming that these challenges...

Frequent challenges regarding data fusion and integration

While every data integration task is unique, there are a few challenges that you will face frequently. First, let's learn about each of them. Then, through examples that feature one or more of these challenges, we will pick up valuable skills to handle them.

Challenge 1 – entity identification

The entity identification challenge – or as it is known in the literature, the entity identification problem – may occur when the data sources are being integrated by adding attributes. The challenge is that the data objects in all the data sources represent the same real-world entities, with the same data object definitions, but they are hard to connect because the sources do not share a common unique identifier. For instance, in the data integration example section, the sales department and the marketing department...
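
To make the problem concrete, here is a hedged sketch with invented data. It assumes that a sales source and a marketing source each use their own identifier but both record an email address. All column names, values, and the normalize_email helper are hypothetical. Normalizing the shared field gives a key on which the two sources can be connected:

```python
import pandas as pd

# Hypothetical sources: each department uses its own identifier.
sales_df = pd.DataFrame({
    "sales_customer_no": ["S-001", "S-002", "S-003"],
    "email": ["Ann@Example.com ", "bob@example.com", "cy@example.com"],
    "revenue": [1200, 450, 980],
})
marketing_df = pd.DataFrame({
    "lead_id": [77, 78, 79],
    "Email Address": ["ann@example.com", "BOB@example.com", "dee@example.com"],
    "campaign": ["spring", "spring", "fall"],
})


def normalize_email(series: pd.Series) -> pd.Series:
    """Strip whitespace and lowercase so equal addresses compare equal."""
    return series.str.strip().str.lower()


# Build the same derived key in both sources.
sales_df["entity_key"] = normalize_email(sales_df["email"])
marketing_df["entity_key"] = normalize_email(marketing_df["Email Address"])

# validate="one_to_one" raises an error if the key is duplicated in
# either source, which would mean it does not identify entities uniquely.
matched_df = sales_df.merge(
    marketing_df, on="entity_key", how="inner", validate="one_to_one"
)
print(matched_df[["sales_customer_no", "lead_id", "entity_key"]])
```

In practice a single field is rarely a perfect identifier, so a key like this should be checked for duplicates and unmatched rows before it is trusted.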

Example 1 (challenges 3 and 4)

In this example, we have two sources of data. The first was retrieved from the local electricity provider and holds electricity consumption data (Electricity Data 2016_2017.csv), while the other was retrieved from the local weather station and includes temperature data (Temperature 2016.csv). We want to see if we can come up with a visualization that shows whether and how electricity consumption is affected by the weather.

First, we will use pd.read_csv() to read these CSV files into two pandas DataFrames called electric_df and temp_df. After reading the datasets into these DataFrames, we will look at them to understand their data structure. You will notice the following issues:

  • The data object definition of electric_df is the electricity consumption over 15 minutes, but the data object definition of temp_df is the temperature every 1 hour. This shows that we have to face the aggregation mismatch challenge of data integration (Challenge...
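
The aggregation mismatch described above can be sketched on synthetic data. This is not the chapter's own solution: the column names consumption and temperature are assumptions, the values are invented, and the sketch uses a single DatetimeIndex with resample rather than the Date and Hour indexing the chapter works with. Summing is also an assumption about how 15-minute consumption should be rolled up to the hour.

```python
import pandas as pd

# Synthetic stand-ins for the two sources. The real files' column
# names may differ; "consumption" and "temperature" are assumptions.
quarter_hours = pd.date_range("2016-01-01 00:00", periods=8, freq="15min")
electric_df = pd.DataFrame(
    {"consumption": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]},
    index=quarter_hours,
)
# "60min" is used instead of an hour alias because newer pandas
# versions prefer a different spelling of that alias than older ones.
hours = pd.date_range("2016-01-01 00:00", periods=2, freq="60min")
temp_df = pd.DataFrame({"temperature": [30.0, 28.0]}, index=hours)

# Aggregate the four 15-minute readings in each hour into one total,
# so both sources describe the same kind of data object (one hour).
hourly_electric_df = electric_df.resample("60min").sum()

# With matching hourly indexes, the sources can be joined directly.
integrated_df = hourly_electric_df.join(temp_df, how="inner")
print(integrated_df)
```

Once both DataFrames share the same hourly index, a scatter plot of consumption against temperature is one possible visualization of the relationship.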

Example 2 (challenges 2 and 3)

In this example, we will be using the Taekwondo_Technique_Classification_Stats.csv and table1.csv datasets from https://www.kaggle.com/ali2020armor/taekwondo-techniques-classification. The datasets were collected by 2020 Armor (https://2020armor.com/), the first ever provider of e-scoring vests and applications. The data includes the sensor performance readings of six taekwondo athletes, who have varying levels of experience and expertise. We would like to see if the athlete's gender, age, weight, and experience influence the level of impact they can create when they perform the following techniques:

  • Roundhouse/Round Kick (R)
  • Back Kick (B)
  • Cut Kick (C)
  • Punch (P)

The data is stored in two separate files. We will use pd.read_csv() to read table1.csv into athlete_df and Taekwondo_Technique_Classification_Stats.csv into unknown_df. Before reading on, take a moment to study athlete_df and unknown_df and evaluate their state...
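
The following sketch illustrates only the general shape of the task, not the actual layout of these files. It assumes a hypothetical wide table of sensor readings with one column per athlete, plus a hypothetical athlete table. Every name and value below is invented. Reshaping the readings so that each row is one reading by one athlete makes it possible to attach the athlete attributes:

```python
import pandas as pd

# Hypothetical athlete attributes, one row per athlete.
athlete_df = pd.DataFrame({
    "athlete_id": ["P1", "P2"],
    "gender": ["F", "M"],
    "experience_years": [5, 2],
})

# Hypothetical wide layout: one column per athlete, one row per trial.
readings_wide_df = pd.DataFrame({
    "technique": ["R", "R", "B"],
    "P1": [820, 790, 910],
    "P2": [640, 700, 730],
})

# Reshape so that each row is one reading by one athlete.
readings_long_df = readings_wide_df.melt(
    id_vars="technique", var_name="athlete_id", value_name="impact"
)

# Attach the athlete attributes to every reading.
analysis_df = readings_long_df.merge(athlete_df, on="athlete_id", how="left")

# Compare the mean impact by technique and gender.
mean_impact = analysis_df.groupby(["technique", "gender"])["impact"].mean()
print(mean_impact)
```

The same pattern extends to age, weight, and experience once those attributes sit alongside each reading.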

Example 3 (challenges 1, 3, 5, and 6)

In this example, we would like to figure out what makes a song rise to the top 10 songs on Billboard (https://www.billboard.com/charts/hot-100) and stay there for at least 5 weeks. Billboard magazine publishes a weekly chart that ranks popular songs based on sales, radio play, and online streaming in the United States. We will integrate three CSV files – billboardHot100_1999-2019.csv, songAttributes_1999-2019.csv, and artistDf.csv from https://www.kaggle.com/danield2255/data-on-songs-from-billboard-19992019 to do this.

This is going to be a long example with many pieces that come together. How you organize your thoughts and work in such data integration challenges is very important. So, before reading on, spend some time getting to know these three data sources and form a plan. This will be a very valuable practice.

Now that you've had a chance to think about how you would go about this, let's do this together. These datasets...
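
One piece of such a plan is deciding which songs meet the target. The sketch below is a hypothetical illustration on invented data: the columns song, week, and rank are assumptions rather than the real files' column names, and "at least 5 weeks" is read here as five or more weeks in total, not necessarily consecutive.

```python
import pandas as pd

# Invented weekly chart data: one row per song per week on the chart.
chart_df = pd.DataFrame({
    "song": ["A"] * 6 + ["B"] * 6 + ["C"] * 6 + ["D"] * 2,
    "week": list(range(1, 7)) * 3 + [1, 2],
    "rank": [12, 8, 5, 3, 4, 9,
             40, 10, 9, 15, 20, 30,
             1, 1, 2, 2, 3, 11,
             50, 60],
})

# Count the distinct weeks each song spent at rank 10 or better.
weeks_in_top10 = (
    chart_df[chart_df["rank"] <= 10]
    .groupby("song")["week"]
    .nunique()
)

# Songs that never reached the top 10 are missing from the count,
# so reindex over all songs and fill their count with 0.
all_songs = chart_df["song"].unique()
is_hit = weeks_in_top10.reindex(all_songs, fill_value=0) >= 5
print(is_hit)
```

A Boolean label like is_hit can then be connected to song and artist attributes after the three sources are integrated.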

Summary

Congratulations on your excellent progress in this chapter. First, we learned the difference between data fusion and data integration before becoming familiar with six common data integration challenges. Then, through three comprehensive examples, we used the programming and analytic tools that we've picked up throughout this book to face these data integration challenges and preprocess the data sources so that we were able to meet the analytic goals.

In the next chapter, we will focus on another data preprocessing concept that is crucial, especially for algorithmic data analytics due to the limitations of computational resources: data reduction.

Before you start your journey on data reduction, take some time and try out the following exercises to solidify your learning.

Exercise

  1. In your own words, what is the difference between data fusion and data integration? Provide examples other than the ones given in this chapter.
  2. Answer the following question about Challenge 4 – aggregation mismatch. Is this challenge a data fusion one, a data integration one, or both? Explain why.
  3. Why is Challenge 2 – unwise data collection both a data cleaning step and a data integration step? Do you think it is essential that we categorize unwise data collection under either data cleaning or data integration?
  4. In Example 1 of this chapter, we used multi-level indexing using Date and Hour to overcome the index mismatched formatting challenge. For this exercise, repeat this example, but this time use single-level indexing with Python datetime objects instead.
  5. Recreate Figure 5.20 from Chapter 5, Data Visualization, but instead of using WH Report_preprocessed.csv, integrate the following three files yourself first: WH Report.csv...