Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Microsoft Power BI Cookbook. - Second Edition

You're reading from  Microsoft Power BI Cookbook. - Second Edition

Product type Book
Published in Sep 2021
Publisher Packt
ISBN-13 9781801813044
Pages 656 pages
Edition 2nd Edition
Languages
Authors (2):
Gregory Deckler Gregory Deckler
Profile icon Gregory Deckler
Brett Powell Brett Powell
Profile icon Brett Powell
View More author details

Table of Contents (16) Chapters

Preface Configuring Power BI Tools Accessing and Retrieving Data Building a Power BI Data Model Authoring Power BI Reports Working in the Service Getting Serious with Date Intelligence Parameterizing Power BI Solutions Implementing Dynamic User-Based Visibility in Power BI Applying Advanced Analytics and Custom Visuals Administering and Monitoring Power BI Enhancing and Optimizing Existing Power BI Solutions Deploying and Distributing Power BI Content Integrating Power BI with Other Applications Other Book You May Enjoy
Index

Transforming and Cleansing Data

The transformations applied within Power BI's M queries serve to protect the integrity of the data model and to support enhanced analysis and visualization. The specific transformations to implement varies based on data quality, integration needs, and the goals of the overall solution. However, at a minimum, developers should look to protect the integrity of the model's relationships and to simplify the user experience via denormalization and standardization. Additionally, developers should check with owners of the data source to determine whether certain required transformations can be implemented in the source, or perhaps made available via SQL view objects such that Power Query (M) expressions are not necessary.

This recipe demonstrates how to protect a model from duplicate values within the source data that can prevent forming proper relationships within the data model, which may even result in query failures. While a simple scenario is used, this recipe demonstrates scenarios you may run into while attempting to merge multiple data sources and eliminating duplicates.

Getting ready

To prepare, start by importing the DimProduct and FactResellerSales tables from the AdventureWorksDW2019 database by doing the following:

  1. Open Power BI Desktop and choose Transform data from the ribbon of the Home tab to open the Power Query Editor.
  2. Create an Import mode data source query called AdWorksDW. This query should be similar to the following:
    let
        Source = Sql.Database("localhost\MSSQLSERVERDEV", "AdventureWorksDW2019")
    in
        Source
    
  3. Isolate this query in a query group called Data Sources.
  4. Right-click AdWorksDW and choose Reference, select the DimProduct table in the data preview area, and rename this query DimProduct. Right-click the EnglishProductName column and select Remove Other Columns.
  5. Repeat the previous step, but this time choose FactResellerSales. Expand the DimProduct column and only choose EnglishProductName. Rename this column to EnglishProductName.
  6. Drag the DimProduct and FactResellerSales queries into the Other Queries group and apply the queries to the data model.
  7. In the Model view of Power BI Desktop, attempt to form a relationship between the tables using the EnglishProductName columns from both tables. Note the warning that is displayed.

Figure 2.31: Many-Many relationship cardinality warning

For additional details on performing these steps, see the Managing Queries and Data Sources recipe in this chapter.

How to Transform and Cleanse Data

We wish to remove duplicates from the EnglishProductName column in our DimProduct query. To implement this recipe, use the following steps:

  1. Remove any leading and trailing empty spaces in the EnglishProductName column with a Text.Trim function.
  2. Create a duplicate column of the EnglishProductName key column with the Table.DuplicateColumn function and name this new column Product Name.
  3. Add an expression to force uppercase on the EnglishProductName column via the Table.TransformColumns function. This new expression must be applied before the duplicate removal expressions are applied.
  4. Add an expression to the DimProduct query with the Table.Distinct function to remove duplicate rows.
  5. Add another Table.Distinct expression to specifically remove duplicate values from the EnglishProductName column.
  6. Drop the capitalized EnglishProductName column via Table.RemoveColumns.

    The final query should resemble the following:

    let
        Source = AdWorksDW,
        dbo_DimProduct = Source{[Schema="dbo",Item="DimProduct"]}[Data],
        RemoveColumns = Table.SelectColumns(dbo_DimProduct,{"EnglishProductName"}),
        TrimText = 
            Table.TransformColumns(
                RemoveColumns,{"EnglishProductName",Text.Trim}
            ),
        DuplicateKey = 
            Table.DuplicateColumn(
                TrimText,"EnglishProductName","Product Name"
            ),
        UpperCase = 
            Table.TransformColumns(
                DuplicateKey,{{"EnglishProductName", Text.Upper}}
            ),
        DistinctProductRows = Table.Distinct(UpperCase),
        DistinctProductNames = 
            Table.Distinct(
                DistinctProductRows, {"EnglishProductName"}
            ),
        RemoveEnglishProductName = 
            Table.RemoveColumns(
                DistinctProductNames,"EnglishProductName"
            )
    in
        RemoveEnglishProductName
    

How it works

In the TrimText expression, the Trim.Text function removes white space from the beginning and end of a column. Different amounts of empty space make those rows distinct within the query engine, but not necessarily distinct within the model. Therefore, it is always a good idea to use Trim.Text first and then remove duplicate rows and values.

In the next expression, DuplicateKey, the Table.DuplicateColumn function is used to duplicate the column where we will be removing duplicate values. We give this new column the name that we desire for our final column. This is done because we will need to transform the values in the column we are removing duplicates from, in order to account for mixed cases such as "Fender Set" and "Fender set". Thus, we wish to preserve the original values and casing by using this duplicate column.

In order to eliminate mixed casing issues, the UpperCase expression changes all values in the EnglishProductName column to uppercase using the Table.TransformColumns function, and specifying Text.Upper. The M engine considers mixed casing values unique, but the data model engine does not.

The next two expressions, DistinctProductRows and DistinctProductNames, simply demonstrate two different methods of using the Table.Distinct function. The first, DistinctProductRows, eliminates rows where the entire row (all column values) are identical. The second version looks only at the values in a single column when determining whether or not the row is a duplicate.

At this point, the query is now resilient to duplicate values and rows, mixed cases, and spaces. However, the EnglishProductName column is now in the uppercase format. Since we preserved a copy of the original values and casing in our Product Name column, we can simply drop the EnglishProductName column using the Table.RemoveColumns function.

We can now form a one-to-many relationship between our DimProduct and FactResellerSales tables.

Figure 2.32: Simple one-to-many relationship model

There's more...

To support troubleshooting, create a query that accesses the same source table and retrieves the values from the EnglishProductName column with more than one row.

let
    Source = AdWorksDW,
    dbo_DimProduct = Source{[Schema="dbo",Item="DimProduct"]}[Data],
    RemoveColumns = Table.SelectColumns(dbo_DimProduct,{"EnglishProductName"}),
    TrimText = 
        Table.TransformColumns(
            RemoveColumns,{"EnglishProductName",Text.Trim}
        ),
    UpperCase = 
        Table.TransformColumns(
            TrimText,{{"EnglishProductName", Text.Upper}}
        ),
    GroupedRows = 
        Table.Group(
            UpperCase, {"EnglishProductName"},
            {{"Rows", each Table.RowCount(_), Int64.Type}}
        ),
    Duplicates = Table.SelectRows(GroupedRows, each [Rows] > 1)
in
    Duplicates

The EnglishProductName column is selected, trimmed, converted to uppercase, grouped, and then filtered to always retrieve any duplicate key values. Disable the loading of this query, as the query would only exist for troubleshooting purposes.

See also

You have been reading a chapter from
Microsoft Power BI Cookbook. - Second Edition
Published in: Sep 2021 Publisher: Packt ISBN-13: 9781801813044
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}