You're reading from Expert Data Modeling with Power BI - Second Edition

Product typeBook

Published inApr 2023

PublisherPackt

ISBN-139781803246246

Edition2nd Edition

Concepts

Business Intelligence

Author (1)

Soheil Bakhshi

What data modeling means in Power BI

Data modeling is undoubtedly one of the most important parts of Power BI development. It is crucial to understand the purpose of data modeling in Power BI from data models in transactional systems. In a transactional system, the goal is to have a model optimized for recording transactional data. Nevertheless, a well-designed data model in Power BI must be optimized for querying the data and reducing the dataset size by aggregating the data.

In reality, not everyone has the luxury of having a data warehouse, so it is a vital skill to know how to create a data model in Power BI. While it is very tempting to get all the data from various data sources and import it to Power BI, answering business questions can quickly translate to complex queries that take a long time to process, which is not ideal. The best practice is to avoid importing everything from the data sources into Power BI and solving the related problems later such as performance issues, data model complexities, and having unnecessarily large data models. Instead, it is wise to get the data model right to precisely answer business-driven questions in the most performant way. When modeling data in Power BI, we must build a data model based on the business logic. So, we may need to join different tables and aggregate the data to a level that answers all business-driven questions, which can be tricky when we have data from various data sources of different grains.

Therefore, we need to transform and reshape the data in Power Query before loading it into the data model. After cutting all the noise from the data, we have a clean, easy-to-understand, and easy-to-work-with data model.

Semantic model

Power BI inherits its characteristics from Power Pivot and SSAS Tabular. Both of them use the xVelocity engine, an updated version of the VertiPaq engine designed for in-memory data analysis. The xVelocity engine leverages column store indexing and consists of semantic model objects such as tables, relationships, hierarchies, and measures stored in memory. All of this means that we would expect tremendous performance gains over highly compressed data, right? Well, it depends. We can expect fast and responsive reports if we efficiently transform and model the data supporting the business logic. Conceptually, the data model in Power BI is a semantic model. Let us untangle this a bit.

A semantic model is a unified data model that provides business contexts to data. The semantic model can be accessed from various data visualization tools, such as Excel, without further transformation. When we publish a Power BI report file (PBIX) to the Power BI Service, the service stores the report in two separate objects. The transformation steps and the data model are stored as a Dataset object and the report as a Report object. A dataset in the Power BI Service is indeed our semantic model. We can connect to the datasets from Power BI Desktop, analyze the datasets in Excel, or use third-party tools such as Tableau. The latter requires an XMLA endpoint connection to a Power BI Premium dataset.

We will not cover the details of XMLA endpoints in this book. You can read more about XMLA endpoints here: https://docs.microsoft.com/en-us/power-bi/enterprise/service-premium-connect-tools?WT.mc_id=DP-MVP-5003466.

Building an efficient data model in Power BI

An efficient data model is easy to understand and easy to maintain. At the same time, it must answer all data-driven questions the business may ask. Let us analyze the preceding sentence. An efficient model must do the following:

Perform well (be fast)
Be business-driven
Decrease the complexity of the required DAX expressions (easy to understand)
Low maintenance (low cost)

Let us look at the preceding points with a scenario.

We are tasked to create a report on top of the following three different data sources:

An OData data source with 15 tables. The tables have between 50 and 250 columns.
An Excel file with 20 sheets with interdependencies and many formulas.
A data warehouse hosted in SQL Server. The data comes from five dimensions and two fact tables:
- Within those five dimensions is a Date and a Time dimension. The grain of the Time dimension is the hour and minute.
- Each of the fact tables has between 50 and 200 million rows. From a Date and Time perspective, the grain of both fact tables is the day, hour, and minute.
- The organization owns Power BI Pro licenses.

Before getting the data from the source systems, there are essential points in the preceding scenario, and many points are unclear at this stage.

Let us analyze the scenario, along with some related questions we might ask the business to optimize the performance of the report and avoid customer dissatisfaction:

OData: OData is an online data source, so it could be slow to load the data.
The tables are wide, which can potentially impact the performance.
1. Do we need to import all the columns from those 15 tables?
2. Do we also need to import all data, or is just a portion of the data enough? For example, if the data source contains 10 years of data, does the business need to analyze all the historical data, or does bringing 1 or 2 years of data fit the purpose?
The organization owns Power BI Pro licenses, so a 1 GB file size limit applies.
Excel: Excel files with many formulas can be tricky data sources.
1. Does the business require you to analyze all the data contained in the 20 sheets? We may be able to exclude some of those sheets.
2. How often are the formulas edited? This is critical as modifying the formulas can easily break the data processing in Power Query and generate errors. It is best to replicate the Excel formulas in Power BI and load the raw data from Excel before the formulas are applied.
Data warehouse in SQL Server: It is beneficial to have a data warehouse as a source system, as data warehouses typically have a much better structure from an analytical viewpoint. In our scenario, the finest grain of both fact tables is down to a minute, which can quickly become an issue. Remember, we have Power BI Pro licenses, so we are limited to a 1 GB file size only.
1. Does the business need to analyze all the metrics down to the minute, or is day-level enough?
2. Is the business required to analyze the whole history, or is bringing a portion of the data enough?

We now have a handful of questions to ask. One common question on the list is about the necessity of analyzing all the historical data. What if the business needs to analyze the whole history? In that case, we should consider using some advanced modeling techniques such as composite models and aggregations.

On the bright side, the fact that we already have five dimensions in the data warehouse might become handy. We might reuse those dimensions with minimal changes in our data model. So, it is wise to look at the other data sources and find commonalities in the data patterns.

We may come up with some more legitimate points and questions later. The takeaway is that we have to talk to the business and ask questions before starting the job. It is a big mistake to start getting data from the source systems before framing the questions around the business processes, requirements, and technology limitations. There are also other points we need to think about from a project management perspective, which are beyond the scope of this book.

The following are initial points to take into account for building an efficient data model:

We must ask questions to avoid confusion and potential future reworks.
We need to understand the technical limitations and come up with solutions.
We have to have a good understanding of data modeling to look for common data patterns to prevent overlaps.

At this point, you may think, “OK, but how can we get there?” This book aims to cover all the preceding points and more. The rest is about you and how you apply your learning to your daily Power BI challenges. The section explains the star schema and snowflaking.

Star schema (dimensional modeling) and snowflaking

First things first, the star schema and dimensional modeling are the same things. In Power BI data modeling, the term star schema is more commonly used. The following sections are generic reminders about some star schema concepts.

Transactional modeling versus star schema modeling

In transactional systems, the main goal is improving the solution’s performance in creating new records and updating/deleting existing ones. So, when designing transactional systems, it is essential to go through the normalization process to decrease data redundancy and increase data entry performance by breaking the tables down into master-detail tables.

But the goal of a business analysis system is very different. In a business analysis solution, we need a data model optimized for querying in the most performant way.

Let us continue with a scenario. Suppose we have a transactional retail system for an international retail shop. We have hundreds of transactions every second from different parts of the world. The company owners want to see the total sales in the past 6 months.

This calculation sounds easy. It is just a simple SUM of sales. But wait, we have hundreds of transactions every second, right? If we have 100 transactions per second, then we have 8,640,000 transactions a day. So, for 6 months of data, we have more than 1.5 billion rows. Therefore, a simple SUM of sales takes a reasonable amount of time to process.

Now, the business raises a new request. The company owners now want to see the total sales in the past 6 months by country and city. They simply want to know what the best-selling cities are.

We need to add another condition to our simple SUM calculation, which translates into a join to the geography table. For those coming from a relational database design background, it is trivial that joins are relatively expensive operations. This scenario can go on and on. So, you can imagine how quickly a simple scenario turns into a rather tricky situation.

In the star schema, however, we already joined all those tables based on business entities. We aggregated and loaded the data into denormalized tables. In the preceding scenario, the business is not interested in seeing every transaction at the second level. So, summarizing the data at the day level decreases the row count from 1.5 billion to approximately 18,000 rows for the 6 months. Now you can imagine how fast the summation would run over 18,000 instead of 1.5 billion rows.

The idea of the star schema is to keep all numeric values in separate tables called fact tables and put all descriptive information into other tables called dimension tables. Usually, the fact tables are surrounded by dimensions explaining the facts. A data model with a fact table in the middle surrounded by dimensions looks like a star, which is why this modeling approach is called a star schema.

This book generally uses Adventure Works DW data, a renowned Microsoft sample dataset, unless stated otherwise. Adventure Works is an imaginary international bike shop selling products online and in their retail shops.

The following figure shows Internet Sales in a star schema shape:

Diagram Description automatically generated

Figure 1.9: Internet Sales star schema

Snowflaking

Snowflaking is when we do not have a perfect star schema when dimension tables surround the fact tables. In some cases, we have some levels of descriptions stored in different tables. Therefore, some dimensions in the model are linked to other tables describing the dimensions in more detail. Snowflaking is normalizing the dimension tables. In some cases, snowflaking is inevitable; nevertheless, the general rule of thumb in data modeling in Power BI (when following a star schema) is to avoid snowflaking as much as possible to have a simpler and more performant model. The following figure shows snowflaking in Adventure Works Internet Sales:

Figure 1.10: Adventure Works, Internet Sales snowflakes

We could avoid the Product Category and Product Subcategory snowflakes in the preceding model by denormalizing them into the Product table. The following section explains denormalization.

Understanding denormalization

In real-world scenarios, not everyone has the luxury of having a pre-built data warehouse designed in a star schema. In reality, snowflakes in data warehouse designs are inevitable. The data models are usually connected to various data sources, including transactional database systems and non-transactional data sources such as Excel files and CSV files. So, we almost always need to denormalize the model to a certain degree. Depending on the business requirements, we may have some normalization along with some denormalization. The reality is that there is no specific rule for the level of normalization and denormalization. The general rule of thumb is denormalizing the model, so each dimension describes all the related details as much as possible.

In the preceding example from Adventure Works DW, we have snowflakes of Product Category and Product Subcategory that can be denormalized into the Product dimension.

Let us go through a hands-on exercise.

Go through the following steps to denormalize the Product Category and Product Subcategory into the Product dimension.

You need to download the Adventure Works, Internet Sales.pbix file from here:

https://github.com/PacktPublishing/Expert-Data-Modeling-with-Power-BI-Second-Edition/blob/28e2af1762336ab5236a3b3961c41e9020de8200/Samples/Chapter%2001/Adventure%20Works,%20Internet%20Sales.pbix

You also need to download the Excel file used as the data source in the PBIX file from here:

https://github.com/PacktPublishing/Expert-Data-Modeling-with-Power-BI-Second-Edition/blob/28e2af1762336ab5236a3b3961c41e9020de8200/Samples/AdventureWorksDW(xlsx)/AdventureWorksDW2017.xlsx

Before we start, let us set up the sample file first. Open the Adventure Works, Internet Sales.pbix file, then change the Adventure Works DW Excel Path parameter as follows:

Click the Transform data drop-down button.
Click Edit parameters.
Enter the path for the .save location of the AdventureWorksDW2017.xlsx file.
Click OK.

The following image illustrates the preceding steps:

Graphical user interface, application, Word Description automatically generated

Figure 1.11: Setting up the Adventure Works, Internet Sales.pbix sample file

After changing the values of the parameter, we have to Apply changes.

Figure 1.12: Applying changes after changing the parameter value

The preceding process reloads the data from the Excel file.

Now that we have correctly set up the sample file, follow these steps:

Click Transform data in the Home tab of the Queries section.
Click the Product query.
Click Merge Queries in the Home tab of the Combine section.
Select Product Subcategory from the drop-down list.
Click ProductSubcategoryKey in the Product table.
Click ProductSubcategoryKey in the Product Subcategory table.
Select Left Outer (all from first matching from the second) from the Join Kind drop-down.
Click OK.

Graphical user interface, application, table, Excel Description automatically generated

Figure 1.13: Merging Product and Product Subcategory

This adds a new step named Merged Queries. As you can see, the values of this column are all Table. This type of column is called a Structured Column. The merging step creates a new structured column named Product Subcategory:

You will learn more about structured columns in Chapter 3, Data Preparation in Power Query Editor.

Table Description automatically generated

Figure 1.14: Merging the Product and Product Subcategory tables

Now let us look at how to expand a structured column in the Query Editor:

Click the Expand button to expand the Product Subcategory column.
Select ProductCategoryKey.
Select the EnglishProductSubcategoryName columns and unselect the rest.
Unselect Use original column name as prefix.
Click OK.

Figure 1.15: Expanding Structured Column in the Query Editor

So far, we have added the EnglishProductSubcategoryName and ProductCategoryKey columns from the Product Subcategory query to the Product query. The next step is to add EnglishProductCategoryName from the Product Category query. To do so, we need to merge the Product query with Product Category:

Click Merge Queries again.
Select Product Category from the drop-down list.
Select ProductCategoryKey from the Product table.
Select ProductCategoryKey from the Product Category table.
Select Left Outer (all from first matching from second).
Click OK:

Graphical user interface, application, table Description automatically generated

Figure 1.16: Merging Product and Product Category

This adds a new structured column named Product Category. We now need to do the following:

Expand the new column.
Pick EnglishProductCategoryName from the list.
Unselect Use original column name as prefix.
Click OK:

Graphical user interface, text, application, chat or text message Description automatically generated

Figure 1.17: Merging Product and Product Category

So far, we moved the EnglishProductCategoryName into the Product table. Therefore, the ProductCategoryKey column is no longer needed. So, the next step is removing the ProductCategoryKey column as we no longer need it. To do so, follow these steps:

Click on the ProductCategoryKey column.
Click the Remove Columns button in the Managed Column section of the Home tab:

Figure 1.18: Removing a column in the Query Editor

We have merged the Product Category and Product Subcategory snowflakes with the Product query. So, we successfully denormalized the snowflakes.

The very last step is to unload both the Product Category and Product Subcategory queries:

Right-click on each query.
Untick Enable load from the menu.
Click Continue on the Possible Data Loss Warning pop-up message:

Graphical user interface, text, application Description automatically generated

Figure 1.19: Unloading queries in the Query Editor

Now we need to import the data into the data model by clicking Close & Apply:

Figure 1.18 – Importing data into the data model

Figure 1.20: Importing data into the data model

We have now achieved what we were after: we denormalized the Product Category and Product Subcategory tables, so instead of loading those two tables, we now have EnglishProductCategoryName and EnglishProductSubcategoryName as new columns in the Product table. So the data model is simpler now, and we load less data, leading to better performance.

Job done!

You have been reading a chapter from

Expert Data Modeling with Power BI - Second Edition

Published in: Apr 2023Publisher: PacktISBN-13: 9781803246246

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Soheil Bakhshi

Soheil Bakhshi is the founder of Data Vizioner and is a sought after BI consultant. Working in data and analytics for more than 20 years, Soheil's experience lies in Microsoft BI, Data Warehousing, and Power BI platform. He possesses MSCE, MCSA certifications and is a Microsoft MVP (Most Valuable Professional). He has a passion for sharing knowledge via his website and speaking at conferences and Power BI community events locally and globally. In following his desire for simplicity and efficiency, he is behind Power BI community tools and commercial products such as Power BI Exporter and Power BI Documenter.
Read more about Soheil Bakhshi

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages