You're reading from Microsoft Power BI Cookbook. - Second Edition

Product type Book

Published in Sep 2021

Publisher Packt

ISBN-13 9781801813044

Pages 656 pages

Edition 2nd Edition

Languages

Concepts

Business Intelligence

Authors (2):

Gregory Deckler

Brett Powell

View More author details

Table of Contents (16) Chapters

Preface

Configuring Power BI Tools

Accessing and Retrieving Data

Building a Power BI Data Model

Authoring Power BI Reports

Working in the Service

Getting Serious with Date Intelligence

Parameterizing Power BI Solutions

Implementing Dynamic User-Based Visibility in Power BI

Applying Advanced Analytics and Custom Visuals

Administering and Monitoring Power BI

Enhancing and Optimizing Existing Power BI Solutions

Deploying and Distributing Power BI Content

Integrating Power BI with Other Applications

Other Book You May Enjoy

Index

Transforming and Cleansing Data

The transformations applied within Power BI's M queries serve to protect the integrity of the data model and to support enhanced analysis and visualization. The specific transformations to implement varies based on data quality, integration needs, and the goals of the overall solution. However, at a minimum, developers should look to protect the integrity of the model's relationships and to simplify the user experience via denormalization and standardization. Additionally, developers should check with owners of the data source to determine whether certain required transformations can be implemented in the source, or perhaps made available via SQL view objects such that Power Query (M) expressions are not necessary.

This recipe demonstrates how to protect a model from duplicate values within the source data that can prevent forming proper relationships within the data model, which may even result in query failures. While a simple scenario is used, this recipe demonstrates scenarios you may run into while attempting to merge multiple data sources and eliminating duplicates.

Getting ready

To prepare, start by importing the DimProduct and FactResellerSales tables from the AdventureWorksDW2019 database by doing the following:

Open Power BI Desktop and choose Transform data from the ribbon of the Home tab to open the Power Query Editor.
Create an Import mode data source query called AdWorksDW. This query should be similar to the following:
```
let
    Source = Sql.Database("localhost\MSSQLSERVERDEV", "AdventureWorksDW2019")
in
    Source
```
Isolate this query in a query group called Data Sources.
Right-click AdWorksDW and choose Reference, select the DimProduct table in the data preview area, and rename this query DimProduct. Right-click the EnglishProductName column and select Remove Other Columns.
Repeat the previous step, but this time choose FactResellerSales. Expand the DimProduct column and only choose EnglishProductName. Rename this column to EnglishProductName.
Drag the DimProduct and FactResellerSales queries into the Other Queries group and apply the queries to the data model.
In the Model view of Power BI Desktop, attempt to form a relationship between the tables using the EnglishProductName columns from both tables. Note the warning that is displayed.

Figure 2.31: Many-Many relationship cardinality warning

For additional details on performing these steps, see the Managing Queries and Data Sources recipe in this chapter.

How to Transform and Cleanse Data

We wish to remove duplicates from the EnglishProductName column in our DimProduct query. To implement this recipe, use the following steps:

Remove any leading and trailing empty spaces in the EnglishProductName column with a Text.Trim function.
Create a duplicate column of the EnglishProductName key column with the Table.DuplicateColumn function and name this new column Product Name.
Add an expression to force uppercase on the EnglishProductName column via the Table.TransformColumns function. This new expression must be applied before the duplicate removal expressions are applied.
Add an expression to the DimProduct query with the Table.Distinct function to remove duplicate rows.
Add another Table.Distinct expression to specifically remove duplicate values from the EnglishProductName column.

Drop the capitalized EnglishProductName column via Table.RemoveColumns.

The final query should resemble the following:

let
    Source = AdWorksDW,
    dbo_DimProduct = Source{[Schema="dbo",Item="DimProduct"]}[Data],
    RemoveColumns = Table.SelectColumns(dbo_DimProduct,{"EnglishProductName"}),
    TrimText = 
        Table.TransformColumns(
            RemoveColumns,{"EnglishProductName",Text.Trim}
        ),
    DuplicateKey = 
        Table.DuplicateColumn(
            TrimText,"EnglishProductName","Product Name"
        ),
    UpperCase = 
        Table.TransformColumns(
            DuplicateKey,{{"EnglishProductName", Text.Upper}}
        ),
    DistinctProductRows = Table.Distinct(UpperCase),
    DistinctProductNames = 
        Table.Distinct(
            DistinctProductRows, {"EnglishProductName"}
        ),
    RemoveEnglishProductName = 
        Table.RemoveColumns(
            DistinctProductNames,"EnglishProductName"
        )
in
    RemoveEnglishProductName

How it works

In the TrimText expression, the Trim.Text function removes white space from the beginning and end of a column. Different amounts of empty space make those rows distinct within the query engine, but not necessarily distinct within the model. Therefore, it is always a good idea to use Trim.Text first and then remove duplicate rows and values.

In the next expression, DuplicateKey, the Table.DuplicateColumn function is used to duplicate the column where we will be removing duplicate values. We give this new column the name that we desire for our final column. This is done because we will need to transform the values in the column we are removing duplicates from, in order to account for mixed cases such as "Fender Set" and "Fender set". Thus, we wish to preserve the original values and casing by using this duplicate column.

In order to eliminate mixed casing issues, the UpperCase expression changes all values in the EnglishProductName column to uppercase using the Table.TransformColumns function, and specifying Text.Upper. The M engine considers mixed casing values unique, but the data model engine does not.

The next two expressions, DistinctProductRows and DistinctProductNames, simply demonstrate two different methods of using the Table.Distinct function. The first, DistinctProductRows, eliminates rows where the entire row (all column values) are identical. The second version looks only at the values in a single column when determining whether or not the row is a duplicate.

At this point, the query is now resilient to duplicate values and rows, mixed cases, and spaces. However, the EnglishProductName column is now in the uppercase format. Since we preserved a copy of the original values and casing in our Product Name column, we can simply drop the EnglishProductName column using the Table.RemoveColumns function.

We can now form a one-to-many relationship between our DimProduct and FactResellerSales tables.

Figure 2.32: Simple one-to-many relationship model

There's more...

To support troubleshooting, create a query that accesses the same source table and retrieves the values from the EnglishProductName column with more than one row.

let
    Source = AdWorksDW,
    dbo_DimProduct = Source{[Schema="dbo",Item="DimProduct"]}[Data],
    RemoveColumns = Table.SelectColumns(dbo_DimProduct,{"EnglishProductName"}),
    TrimText = 
        Table.TransformColumns(
            RemoveColumns,{"EnglishProductName",Text.Trim}
        ),
    UpperCase = 
        Table.TransformColumns(
            TrimText,{{"EnglishProductName", Text.Upper}}
        ),
    GroupedRows = 
        Table.Group(
            UpperCase, {"EnglishProductName"},
            {{"Rows", each Table.RowCount(_), Int64.Type}}
        ),
    Duplicates = Table.SelectRows(GroupedRows, each [Rows] > 1)
in
    Duplicates

The EnglishProductName column is selected, trimmed, converted to uppercase, grouped, and then filtered to always retrieve any duplicate key values. Disable the loading of this query, as the query would only exist for troubleshooting purposes.