Reader small image

You're reading from  Expert Data Modeling with Power BI - Second Edition

Product typeBook
Published inApr 2023
PublisherPackt
ISBN-139781803246246
Edition2nd Edition
Right arrow
Author (1)
Soheil Bakhshi
Soheil Bakhshi
author image
Soheil Bakhshi

Soheil Bakhshi is the founder of Data Vizioner and is a sought after BI consultant. Working in data and analytics for more than 20 years, Soheil's experience lies in Microsoft BI, Data Warehousing, and Power BI platform. He possesses MSCE, MCSA certifications and is a Microsoft MVP (Most Valuable Professional). He has a passion for sharing knowledge via his website and speaking at conferences and Power BI community events locally and globally. In following his desire for simplicity and efficiency, he is behind Power BI community tools and commercial products such as Power BI Exporter and Power BI Documenter.
Read more about Soheil Bakhshi

Right arrow

Common Data Preparation Steps

In the previous chapter, we discussed some data sources that are frequently used in Power BI. We also covered data source certifications and the differences between various connection modes, storage modes, and dataset modes. This chapter looks at common data preparation steps such as table, text, Date, DateTime, and DateTimeZone manipulations.

We look at each of these by providing real-world scenarios that can help deal with real daily data preparation challenges. In Power Query Editor, the data preparation activities are categorized into three separate tabs, as shown in the following screenshot:

  1. Home: Contains more generic actions, such as creating a new query, creating or managing query parameters, and performing common data preparation steps such as split column, Group By, and more.
  2. Transform: Contains more transformation functionalities that can be performed through the UI.
  3. Add Column: Contains data preparation steps related...

Data type conversion

Data type conversion is one of the most common steps in Power Query, yet it is one of the most important ones that can become tricky if not managed well. One cool feature of Power BI, if enabled, is to detect data types automatically. While this is a handy feature in many cases, it can be the root cause of some issues down the road. The critical point to note is how Power BI automatically detects data types. Power BI automatically detects column data types based on the first few hundred rows; that is where things may go wrong, as the data types are not detected based on the whole dataset. Instead, the data types are detected based on the part loaded for preview. In most cases, we deal with data type conversion in table values. Either we use Power Query Editor UI or manually write the expressions; here, we use the following function:

Table.TransformColumnTypes(Table as table, TypeTransformations as list, optional Culture as nullable text)

In the Table...

Splitting a column by delimiter

One of the most common transformation steps in Power Query is Split Column by Delimiter. There are many use cases for this transformation. The following are some of the use cases:

  • Splitting First Name, Middle Name, and Last Name from a Full Name column
  • Splitting date, time, and time zone elements from DateTimeZone values into separate columns
  • Splitting comma-delimited values into separate columns

Let us move forward with a scenario. In the previous section’s scenario, we converted the OrderDateTime column’s type to Date. In this scenario, the business needs to analyze the Internet Sales data at both the Date and Time levels. There are many techniques we can use to satisfy this new requirement, such as the following:

  • Creating a new Time table, which can be done either using DAX (we discussed this in Chapter 2, Data Analysis eXpressions and Data Modeling, in the Creating a Time dimension with DAX...

Merging columns

A typical transformation under the Add Column category is Merge Columns. There are many use cases where we need to merge different columns, such as merging First Name, Middle Name, and Last Name to create a Full Name column or merging a multipart address like (AddressLine1 or AddressLine2) to get an Address column containing the full address. Another common use case is to merge multiple columns to create a unique ID column. Let’s continue with an example from the Chapter 5, Common Data Preparation Steps, Split Column By Delimiter.pbix sample file. You can download the file from here:

https://github.com/PacktPublishing/Expert-Data-Modeling-with-Power-BI-Second-Edition/blob/9d6388ffe6e83586a02de81ff959e04c31dbf1da/Samples/Chapter%2005/Chapter%205,%20Common%20Data%20Preparation%20Steps,%20Split%20Column%20By%20Delimiter.pbix

After opening the file, head to Power Query Editor and follow these steps:

  1. Select the Customer table from the Queries pane...

Adding a custom column

Adding a new column is one of the most common steps during data preparation. There are countless scenarios where we need to add a new column, such as adding some new analytical equations as a new column, creating data clusters in a new column, adding an index column as a new column, or using some machine learning (ML) and artificial intelligence (AI) algorithms. You may also have many other scenarios in mind. Whether we use Power Query Editor UI or manually write the Power Query expressions, we must add a custom column using the following function:

Table.AddColumn(Table as table, NewColumnName as text, ColumnGenerator as function, optional ColumnType as nullable type)

In the Table.AddColumn() function, we have the following:

  • Table: This is the input table value, the result of the previous step, or other queries that provide table output.
  • NewColumnName: The new column name.
  • ColumnGenerator: The expressions we use to create a new...

Adding a column from examples

Adding a column from examples is a brilliant feature of Power Query. It not only helps speed up the development process but also helps developers learn Power Query. The idea is that we can create a new column from sample data by entering the expected values in a sample column. Power Query then guesses what sort of transformation we are after and generates the expressions needed to achieve the results we entered manually. We can create new columns from selected columns or all columns. Let us have a quick look at this feature by example.

Using the Chapter 5, Common Data Preparation Steps, Added Custom Columns.pbix sample file from the previous section, we want to extract the usernames of the customers from their EmailAddress column, while the email structure is UserName@adventure-works.com, from the Customer table.

You can download the sample file from here:

https://github.com/PacktPublishing/Expert-Data-Modeling-with-Power-BI-Second-Edition...

Duplicating a column

Another common transformation step under the Add Column tab is duplicating a column. In many scenarios, we want to duplicate a column, such as keeping the original column available in our model while needing to transform it into a new column. Let us revisit the scenario that we looked at earlier in this chapter in the Splitting a column by delimiter section. In that scenario, we split the OrderDateTime column from the Internet Sales table into two columns, Order Date and Order Time. In this section, we do the same thing, but this time we duplicate the OrderDateTime column instead of splitting it. We use the Chapter 5, Common Data Preparation Steps, Fixed PostCode.pbix sample file that can be downloaded from here: https://github.com/PacktPublishing/Expert-Data-Modeling-with-Power-BI-Second-Edition/blob/main/Samples/Chapter%2005/Chapter%205,%20Common%20Data%20Preparation%20Steps,%20Fixed%20PostCode.pbix.

After opening the sample file, open Power Query Editor...

Filtering rows

The other common transformation is filtering rows. There are many use cases where we may want to restrict the results by specific values. For instance, we may want to filter the Product table to show the products with a Status of Current. Filtering the rows based on columns’ values is very simple. We have to select the desired column, click the arrow down button () from the column’s caption, and select the values we want to use to filter the rows. The following image shows the preceding use case:

Figure 5.29: Filtering rows

While this is a straightforward step, not all filtering use cases are simple, such as when we do not have specific values to filter the rows upon. Instead, the business provides a list specifying the values to use in the filters. Let us look at this with a scenario.

The business provides a list of Approved Product Subcategories every season in Excel format. We need to filter the Product table on the Product Subcategory...

Working with Group By

One of the most valuable and advanced techniques in data modeling is creating summary tables. In many scenarios, using this method is very beneficial. We can use this method to manage our Power BI file’s size and improve performance and memory consumption. Summarization is a known technique in data warehousing where we change the granularity of a fact table to a higher grain. But in Power Query, there are other cases where we can use the Group By functionality to cleanse the data. From a data modeling point of view, we summarize a table by grouping it into descriptive columns and aggregating the numeric values.

Let us go through a scenario and see how the Group By functionality works.

In this section, we use the Chapter 5, Common Data Preparation Steps, Fixed OrderDateTime Data Type.pbix sample file that we created before. You can download the file from here:

https://github.com/PacktPublishing/Expert-Data-Modeling-with-Power-BI-Second-Edition...

Appending queries

There are some scenarios where we get the data with the same structure from different sources, and we want to consolidate it into a single table. In those cases, we need to append the queries. We have two options to append the queries:

  • Append the queries to the first query
  • Append the queries as a new query

The latter is prevalent when we follow ETL best practices. We unload all the queries, append them as a new query, and load the new query into the data model. Therefore, the unloaded queries work as ETL pipelines. However, it does not mean that the first option is not applicable.

Suppose we have a simple business requirement that can be achieved by appending two or more queries to the first query. In that case, use the first option instead. The critical point to note when we append queries is that the Table.Combine(tables as list, optional columns as any) function accepts a list of tables. If the column names in the tables are the same...

Merging queries

Another common data transformation operation is merging queries. The merge queries functionality is useful when you want to denormalize snowflakes absorbing the data stored in different tables into one table. When using the UI in Power Query, depending on the selected matching type, Power Query uses one of the following functions behind the scenes:

  • Table.NestedJoin()
  • Table.FuzzyNestedJoin()

The following image shows the relevant UI:

Figure 5.43: Merging queries via the UI uses different Power Query functions, depending on the matching type

If we do not tick the Use fuzzy matching to perform the merge box, Power Query generates the following function:

Table.NestedJoin(
	FirstTable as table
	, KeyColumnofFirstTable as any
	, SecondTable as any
	, KeyColumnofSecondTable as any
	, NewColumnName as text
	, optional JoinKind as nullable JoinKind.Type
	)

Otherwise, Power Query generates the following:

Table.FuzzyNestedJoin...

Duplicating and referencing queries

Duplicating and referencing queries are somehow similar. We duplicate a query when we need to have all the transformation steps we already took on the original query. At the same time, we want to change those steps or add more transformation steps. In that case, we must change the original query’s nature, translating it so that it has a different meaning from a business point of view. But when we reference a query, we are referencing the final results of the query. Therefore, we do not get the transformation steps in the new query (the referencing query). Referencing a query is a common way to break down the transformation activities. This is the preferred way of preparing the data for most Extract, Transformation, and Load (ETL) experts and data warehousing professionals. In that sense, we can do the following:

  • We have base queries connected to the source system that resembles the Extract part of the ETL process.
  • We reference...

Replacing values

In data cleansing, replacing values is one of the most common transformation activities. A simple example is when we have a description column in the source system containing free text, and we want to replace some parts of the description with something else.

Power Query Editor UI uses the Table.ReplaceValue(table as table, OldValue as any, NewValue as any, Replacer as function, columnsToSearch as list) function behind the scenes to replace a value in a table. If we want to replace the value of a List, it uses the List.ReplaceValue(list as list, OldValue as any, NewValue as any, Replacer as function) function. Depending on the value’s data type, the Replacer function can be either Replacer.ReplaceText or Replacer.ReplaceValue. The difference between the two is that we can use Replacer.ReplaceText to replace text values, while we can use Replacer.ReplaceValue to replace any values. For instance, if we want to replace semicolons with colons in a text column...

Extracting numbers from text

Another common data preparation step is when we need to extract a number from text values. An excellent example is when we want to extract a flat number or a ZIP code from an address. Other examples include extracting the numeric part of a sales order number or cleaning full names of typos, such as when some names contain numbers. We will continue using the Chapter 5, Common Data Preparation Steps, Replaced Values.pbix sample file from the previous section. In our scenario, we want to add two new columns to the Customer table, as follows:

  • Extract Flat Number as a new column from AddressLine1.
  • Extract the rest of the address, Street Name, as a new column.

As the following image shows, the AddressLine1 column contains the flat number in different parts of the address; therefore, splitting by transitioning from digit to non-digit would not work:

Graphical user interface  Description automatically generated

Figure 5.49: Flat numbers appear in different places in AddressLine1

To...

Dealing with Date, DateTime, and DateTimeZone

Generating date, datetime, and datetimezone values in Power Query is simple. We just need to use one of the following three functions:

  1. To generate date values, we can use the following command:
    #date(year as number, month as number, day as number)
    
  2. To generate datetime values, we can use the following command:
    #datetime(year as number, month as number, day as number, hour as number, minute as number, second as number)
    
  3. To generate datetimezone values, we can use the following command:
    #datetimezone(year as number, month as number, day as number, hour as number, minute as number, second as number, offsetHours as number, offsetMinutes as number)
    

The following code generates a record of the Date, DateTime, and DateTimeZone values:

let
    Source = [
	Date = #date(2020, 8, 9)
	, DateTime = #datetime(2020, 8, 9, 17, 0, 0)
	, DateTimeZone = #datetimezone(2020,...

Pivoting tables

Working with Pivot tables comes naturally for Power BI developers with an Excel background. The pivot table is a compelling and common functionality across many data analysis tools, including Power Query in Power BI. The pivoting functionality accepts a table, rotates the values of the selected other column as column headers, groups the unique values of all other columns, and calculates the aggregation of values of the selected column by the other columns. In other words, the pivot function aggregates the intersection of the column that represents column headers and all other columns in the table. When we pivot a table in Power Query, we use the following function:

Table.Pivot(table as table, pivotValues as list, attributeColumn as text, valueColumn as text, optional aggregationFunction as nullable function)

In the preceding function:

  • table is the function’s input.
  • pivotValues is a list of values of the column that rotates as column...

Summary

This chapter explained some common data preparation steps, which means we now know how data type conversion works and what can go wrong during this. We learned how to split a column, merge columns, add a custom column, and filter rows. We also learned how to use the Group By functionality in queries to create summarized tables. We also learned how to append queries and merge queries; we dealt with scenarios related to Date, DateTime, and DateTimeZone. Lastly, we solved some challenges while pivoting a table.

These skills give us firm ground to move on to the next chapter. In the next chapter, we learn how to prepare a Star Schema in Power Query Editor.

Join us on Discord!

Join The Big Data and Analytics Community on the Packt Discord Server!

Hang out with 558 other members and enjoy free voice and text chat.

https://packt.link/ips2H

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Expert Data Modeling with Power BI - Second Edition
Published in: Apr 2023Publisher: PacktISBN-13: 9781803246246
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Soheil Bakhshi

Soheil Bakhshi is the founder of Data Vizioner and is a sought after BI consultant. Working in data and analytics for more than 20 years, Soheil's experience lies in Microsoft BI, Data Warehousing, and Power BI platform. He possesses MSCE, MCSA certifications and is a Microsoft MVP (Most Valuable Professional). He has a passion for sharing knowledge via his website and speaking at conferences and Power BI community events locally and globally. In following his desire for simplicity and efficiency, he is behind Power BI community tools and commercial products such as Power BI Exporter and Power BI Documenter.
Read more about Soheil Bakhshi