Reader small image

You're reading from  Data Cleaning with Power BI

Product typeBook
Published inFeb 2024
PublisherPackt
ISBN-139781805126409
Edition1st Edition
Right arrow
Author (1)
Gus Frazer
Gus Frazer
author image
Gus Frazer

Gus Frazer is a seasoned analytics consultant who focuses on business intelligence solutions. With over eight years of experience working for the two market-leading platforms, Power BI (Microsoft) and Tableau, he has amassed a wealth of knowledge and expertise. He also has experience in helping hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and most important of all, clean data ready for analysis.
Read more about Gus Frazer

Right arrow

The Most Common Data Cleaning Operations

Now that you’ve built a strong knowledge of data quality and the importance of assessing and documenting your data cleaning process, it’s time to roll your sleeves up and get stuck into some data.

In this chapter, you will be learning how to deal with the most common data cleaning steps within Power BI, as listed next. For each of these example topics, you will find a step-by-step walk-through on how to carry out these transformations yourself.

We will cover the following specific topics:

  • Removing duplicates
  • Removing missing data
  • Splitting columns
  • Merging columns
  • Replacing outliers
  • Creating calculated columns versus measures

By the end of the chapter, you will have built a strong base of foundational knowledge on how to tackle some of the most common and simplest transformations that we often see needing to be done when connecting to data in Power BI.

Technical requirements

You will find the uncleaned dataset to be used during this chapter at the following link: https://github.com/PacktPublishing/Data-Cleaning-with-Power-BI.

Removing duplicates

In many cases, as we start working with data, there will often be duplicates within the data. As we discussed in Chapter 2, Understanding Data Quality and Why Data Cleaning is Important, there are a number of reasons why the values in your data may have been duplicated. For example, say we're a retailer and we accidentally entered two product items for the same product. We don’t want to have inaccurate numbers for that product by leaving the duplicate data in, so it’s key that we remove it before we get started with our analysis.

So, let’s get started. In the following example, we will find, select, and remove the duplicate in the data:

  1. Download the Products.xlsx dataset from the given GitHub repository.
  2. Connect to this CSV using Power BI Desktop by selecting Get data in the toolbar (as shown) and then selecting Excel workbook:
Figure 4.1 – The Get data menu within Power BI Desktop

Figure 4.1 – The Get data menu within Power BI Desktop

...

Removing missing data

Next, we have the very common issue of missing data or, as most people would recognize, null values. In Chapter 2, Understanding Data Quality and Why Data Cleaning is Important, we understood the reasons why this might happen – for example, due to the type of join between two tables, which might cause many null values to show.

These null values can often either ruin the look of your reporting or potentially skew the numbers being used or analyzed, so it’s often best we look to remove these.

In the example of our products table, we can see that we have a row with blank or 0 values shown in Figure 4.5. If you were viewing this from within the Power Query Editor, then the blank values would be showing as null. While this would otherwise be acceptable as we won’t necessarily see the null product within a visualization, there is a price and cost value against the null product with the 0 value. This could affect the analysis, particularly...

Splitting columns

When working with data, particularly data extracted directly from source systems, there often comes a time when we will need to split columns to gain the desired dimensions for our analysis. This might be because the software or database of that source system might store that data in a particular format/encoding/arrangement. The most common example of this could be splitting a Date field in order to extract dimensions for [Day], [Month], and [Year].

In this example, we will connect and open the calendar.xlsx file. This Excel table includes one column of dates (as shown in Figure 4.8) and is to be used as a date table within Power BI. In this example, we might need to extract the individual date components for our analysis. Once connected, select Transform data to enter Power Query once again.

Figure 4.8 – The Date column within the date table in Power BI

Figure 4.8 – The Date column within the date table in Power BI

In order to split the columns, we will use the prebuilt function for splitting...

Merging columns

Just like the previous scenario, there are often situations where you need to merge columns to achieve the desired format. For example, depending how date data is stored, you may need to merge the [Day], [Month], and [Year] columns to achieve a singular Date column. In the following figure, we can see an example description and configuration of how you might do this using the prebuilt function of Merge Columns in Power Query:

Figure 4.12 – The drop-down menu above the Merge Columns function (left) and a close-up of the UI for the Merge Columns function (right)

Figure 4.12 – The drop-down menu above the Merge Columns function (left) and a close-up of the UI for the Merge Columns function (right)

To use the Merge Columns function, follow these steps:

  1. Select the columns you would like to merge.
  2. Select the Merge Columns function, which can be located on the Home or Transform tab in the ribbon toolbar.
  3. In the Merge Columns configuration window, you can select a separator to sit between the merged values. As we are working with dates, we can select Custom...

Replacing values

When connecting to and analyzing data, there are often times when we might find outliers within the data. If we identify that there are values skewing the data or showing incorrectly, it’s important for us to be able to replace the data with the correct values.

There are many scenarios where you might need to do this in Power BI. Here are just some example scenarios:

  • Replacing variations of N/A or Not Applicable with a consistent value such as Unknown.
  • You may need to rename or reclassify certain values in your dataset to align with your reporting needs or to create meaningful categories. For instance, you can replace abbreviations or acronyms with their full names or group similar values together.
  • You can replace numeric code with descriptive labels or convert coded values into meaningful text representations.
  • In cases where your data contains missing or null values, you can use the Replace Values function to replace them with a specific...

Creating calculated columns versus measures

When preparing data for analysis in Power BI, we often need to add additional data to the model (often derived from existing data). This can often come in the form of using DAX to either create a measure or an additional column. This could be for a new value we need or a new dimension. Before understanding which option is best, it’s important to understand what the difference is between a calculated column and a measure within Power BI.

The best way to add a new measure or column is to first navigate to Table view on the far-left toolbar.

The view will automatically jump to the Table tools tab along the top toolbar, which will then present you with the following options for calculations: New measure, Quick measure, New column, and New table, as shown:

Figure 4.16 – The Calculations section within the Table view of Power BI

Figure 4.16 – The Calculations section within the Table view of Power BI

It’s important to note that if you are working from Model view...

Summary

In this chapter, you began your journey into the practical aspects of data cleaning within Power BI. You covered some of the most common data cleaning steps in Power BI, including removing duplicates, handling missing data, splitting columns, merging tables, dealing with date formats, replacing values, and creating calculated columns versus measures.

The chapter also highlighted the importance of replacing values in your data. Outliers, incorrect values, or inconsistent formats can hinder analysis. You learned about various scenarios where replacing values is necessary and used the Replace Values function in Power Query to fix errors and standardize data.

Lastly, the chapter explored the difference between calculated columns and measures in Power BI and explained when to use each option and their respective benefits. Calculated columns are best suited for row-level calculations, while measures are ideal for aggregations and calculations based on visual context. The chapter...

Questions

  1. Why is it important to remove duplicates from your data before building a model in Power BI?
    1. To increase file size
    2. To enhance data accuracy in the analysis
    3. To speed up data loading
    4. To add complexity to the data model
  2. In the provided example with the products table, which column is selected for removing duplicates, and why is it crucial to choose the right column?
    1. Product ID, for simplicity
    2. Cost, for accurate financial analysis
    3. Product Name, as the main identifier
    4. Date, for chronological precision
  3. How can missing data, represented as null values, impact the analysis of your dataset?
    1. Enhances visual appeal
    2. Distorts analysis results
    3. Speeds up data processing
    4. Reduces data complexity
  4. When might you need to split columns in Power BI, and what example is given in the chapter?
    1. To increase data complexity – for example, splitting product codes
    2. To enhance visual appeal – for example, splitting financial data
    3. To gain desired dimensions for analysis – for example...
lock icon
The rest of the chapter is locked
You have been reading a chapter from
Data Cleaning with Power BI
Published in: Feb 2024Publisher: PacktISBN-13: 9781805126409
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Gus Frazer

Gus Frazer is a seasoned analytics consultant who focuses on business intelligence solutions. With over eight years of experience working for the two market-leading platforms, Power BI (Microsoft) and Tableau, he has amassed a wealth of knowledge and expertise. He also has experience in helping hundreds of customers to drive their digital and data transformations, scope data requirements, drive actionable insights, and most important of all, clean data ready for analysis.
Read more about Gus Frazer