You're reading from Mastering Microsoft Power BI – Second Edition - Second Edition

Product typeBook

Published inJun 2022

PublisherPackt

ISBN-139781801811484

Edition2nd Edition

Concepts

Business Intelligence

Authors (2):

Gregory Deckler

Brett Powell

View More author details

Dataset planning

After the source data is profiled and evaluated against the requirements identified in the four-step dataset design process, the BI team can further analyze the implementation options for the dataset. In almost all Power BI projects, even with significant investments in enterprise data warehouse architecture and ETL tools and processes, some level of additional logic, integration, or transformation is needed to enhance the quality and value of the source data or to effectively support a business requirement.

A priority of the dataset planning stage is to determine how the identified data transformation issues are addressed to support the dataset. Additionally, based on all available information and requirements, the project team must determine whether to develop an import mode dataset, DirectQuery dataset, or composite dataset. Import, DirectQuery, and composite datasets are explained in the section Import, DirectQuery, Live, and Composite Datasets.

The initial step in the dataset planning process is planning for data transformations, which we’ll review next.

Data transformations

To help clarify the dataset planning process, a diagram such as Figure 1.17 can be created that identifies the different layers of the data warehouse and Power BI dataset where transformation and business logic can be implemented:

Figure 1.17: Dataset planning architecture

In some projects, minimal transformation logic is needed and can be easily included in the Power BI dataset or the SQL views accessed by the dataset. For example, if only a few additional columns are needed for a dimension table and there’s straightforward guidance on how these columns should be computed, the IT organization may choose to implement these transformations within Power BI’s Power Query (M) queries rather than revise the data warehouse, at least in the short term.

If a substantial gap between BI needs and the corporate data warehouse is allowed to persist and grow due to various factors, such as cost, project expediency, and available data warehouse skills, then Power BI datasets become more complex to build and maintain. Dataset designers should regularly analyze and communicate the implications of datasets assuming greater levels of complexity.

However, if the required transformation logic is complex or extensive with multiple join operations, row filters, and data type changes, then the IT organization may choose to implement essential changes in the data warehouse to support the new dataset and future BI projects. For example, a staging table and a SQL stored procedure may be needed to support a revised nightly update process, or the creation of an index may be needed to deliver improved query performance for a DirectQuery dataset.

Ideally, all required data transformation and shaping logic could be implemented in the source data warehouse and its ETL processes so that Power BI is exclusively used for analytics and visualization. However, in the reality of scarce IT resources and project delivery timelines, typically at least a portion of these issues must be handled through other means, such as SQL view objects or Power BI’s M query functions.

A best practice is to implement data transformation operations within the data warehouse or source system. This minimizes the resources required to process an import mode dataset and, for DirectQuery datasets, can significantly improve query performance, as these operations would otherwise be executed during report queries.

For many common data sources, such as Oracle and Teradata, M query expressions are translated into equivalent SQL statements (if possible) and these statements are passed back to the source system via a process called query folding. See Chapter 2, Preparing Data Sources, for more details on query folding.

As per the dataset planning architecture diagram, a layer of SQL views should serve as the source objects to datasets created with Power BI Desktop. By creating a SQL view for each dimension and fact table of the dataset, the data source owner or administrator is able to identify the views as dependencies of the source tables and is therefore less likely to implement changes that would impact the dataset without first consulting the BI team.

Additionally, the SQL views improve the availability of the dataset, as modifications to the source tables are much less likely to cause the refresh process to fail.

As a general rule, the BI team and IT organization should avoid the use of DAX for data transformation and shaping logic, such as DAX calculated tables and calculated columns. The primary reason for this is that it weakens the link between the dataset and the data source, as these expressions are processed entirely by the Power BI dataset after source queries have been executed.

Additionally, the distribution of transformation logic across multiple layers of the solution (SQL, M, DAX) causes datasets to become less flexible and manageable. Moreover, tables and columns created via DAX do not benefit from the same compression algorithms applied to standard tables and columns and thus can represent both a waste of resources as well as a performance penalty for queries accessing these columns.

In the event that required data transformation logic cannot be implemented directly in the data warehouse or its ETL or Extract-Load-Transform (ELT) process, a second alternative is to build this logic into the layer of SQL views supporting the Power BI dataset. For example, a SQL view for the product dimension could be created that joins the Product, Product Subcategory, and Product Category dimension tables, and this view could be accessed by the Power BI dataset.

As a third option, M functions in the Power BI query expressions could be used to enhance or transform the data provided by the SQL views. See Chapter 2, Preparing Data Sources, for details on these functions and the Power BI data access layer generally.

Once data transformation planning is complete, the next step is to determine the mode of the dataset as explained in the next section.

Import, DirectQuery, Live, and Composite datasets

A subsequent but closely related step in dataset planning is choosing between the default import mode, DirectQuery mode, Live mode, or composite mode. In some projects, this is a simple decision as only one option is feasible or realistic given the known requirements while other projects entail significant analysis of the pros and cons of either design.

If a data source is considered slow or ill-equipped to handle a high volume of analytical queries, then an import mode dataset is very likely the preferred option. Likewise, if near real-time visibility of a data source is an essential business requirement, then DirectQuery or Live mode are the only options.

The DirectQuery and Live modes are very similar to one another. Both methods do not store data within the dataset itself but rather query source systems directly to retrieve data based upon user interaction with reports and dashboards. However, Live mode is only supported for Power BI datasets, Analysis Services (both multi-dimensional and tabular), and Dataverse.

When DirectQuery/Live is a feasible option or can be made a feasible option via minimal modifications, organizations may be attracted to the prospect of leveraging investments in high-performance database and data warehouse systems. However, the overhead costs and version control concerns of import mode can be reduced via Power BI features, such as the dataset refresh APIs or pipelines discussed in Chapter 10, Managing Application Workspaces and Content, and incremental data refresh.

The following list of questions can help guide an import versus DirectQuery/Live decision:

Is there a single data source for our dataset that Power BI supports as a DirectQuery/Live source?
For example, each fact and dimension table needed by the dataset is stored in a single data warehouse database, such as Oracle, Teradata, SQL Server, or Azure SQL Database.

The following URL identifies the data sources supported for DirectQuery/Live with Power BI, including sources that are currently only in beta: http://bit.ly/2AcMp25.
If DirectQuery/Live is an option per question 1, is this source capable of supporting the analytical query workload of Power BI?
For example, although Azure Synapse (formerly Azure SQL Data Warehouse) technically supports DirectQuery, it’s not recommended to use Azure Synapse as a DirectQuery data source, given the limitations on the volume of concurrent queries supported and a lack of query plan caching.

In many other scenarios, the data source may not be optimized for analytical queries, such as with star schema designs and indexes that target common BI/reporting queries. Additionally, if the database is utilized for Online Transaction Processing (OLTP) workloads and/or other BI/analytical tools, then it’s necessary to evaluate any potential impact on these applications and the availability of resources.
Is an import mode dataset feasible, given the size of the dataset and any requirements for near real-time visibility of the data source?
Currently, Power BI Premium supports import mode datasets up to 400 GB in size. However, the true limit for model sizes in Premium is limited to the total available amount of RAM within the capacity. In addition, PPU datasets are limited to 100 GB and Pro datasets are limited to 10 GB. Therefore, truly massive datasets must either use a DirectQuery data source or a Live connection to an Analysis Services model.

Additionally, Power BI Premium currently supports a maximum of 48 refreshes per day for import mode datasets. Therefore, if there’s a need to view data source data for the last several minutes or seconds, an import mode dataset is not feasible.
If the DirectQuery/Live source is capable of supporting a Power BI workload as per question 2, is the DirectQuery/Live connection more valuable than the additional performance and flexibility provided via the import mode?
In other words, if an import mode dataset is feasible, as per question 3, then an organization should evaluate the trade-offs of the two modes. For example, since an import mode dataset is hosted in the Power BI service and in a compressed and columnar in-memory data store, it is likely to provide a performance advantage. This is particularly the case if the DirectQuery/Live source is hosted on-premises and thus queries from the Power BI cloud service must pass through the on-premises data gateway reviewed in Chapter 11, Managing the On-Premises Data Gateway.
Additionally, any future data sources and most future data transformations need to be integrated into the DirectQuery/Live source. With an import mode dataset, the scheduled import process can include many data transformations and potentially other data sources without negatively impacting query performance.

For organizations that have invested in powerful data source systems for BI workloads, there’s a strong motivation to leverage this system via DirectQuery/Live. In general, business intelligence teams and architects are averse to copying data into another data store and thus creating both another data movement and a source of reporting that must be supported.

Let’s now take a more detailed look at each of the possible dataset modes.

Import mode

An import mode dataset can include multiple data sources, such as SQL Server, Oracle, and an Excel file. Since a snapshot of the source data is loaded into the Power BI cloud service, in addition to its in-memory columnar compressed structure, query performance is usually good for most scenarios.

Another important advantage of import mode datasets is the ability to implement data transformations without negatively impacting query performance. Unlike DirectQuery/Live datasets, the operations of data source SQL views and the M queries of import datasets are executed during the scheduled data refresh process. The Query design per dataset mode section of Chapter 2, Preparing Data Sources, discusses this issue in greater detail.

Given the performance advantage of the in-memory mode relative to DirectQuery/Live, the ability to integrate multiple data sources, and the relatively few use cases where real-time visibility is required, most Power BI datasets are designed using import mode.

Next, we provide more detail about DirectQuery/Live mode.

DirectQuery/Live mode

While DirectQuery and Live connections are different, as explained previously, they are similar to one another and share common traits, such as not storing data within the dataset itself but rather querying source systems directly to retrieve data based upon user interaction with reports and dashboards.

A DirectQuery/Live dataset is traditionally limited to a single data source and serves as merely a thin semantic layer or interface to simplify the report development and data exploration experience. DirectQuery/Live datasets translate report queries into compatible queries for the data source and leverage the data source for query processing, thus eliminating the need to store and refresh an additional copy of the source data.

A common use case of DirectQuery/Live is to provide near real-time visibility to changes in source data. For example, a manufacturer may want to monitor activities occurring on a manufacturing floor and potentially link this monitoring solution to notifications or alerts.

The performance of DirectQuery/Live datasets is strongly influenced by the design and resources available to the source system. Successful DirectQuery/Live datasets generally result from performance optimizations implemented in the source system such as via columnstore indexes, materialized views, and star schema designs that reduce the complexity of report queries.

With import and DirectQuery/Live modes understood, we next cover the relatively new composite mode.

Composite mode

Composite mode is perhaps the most significant enhancement to Power BI in recent years as it enables table-level control over a table’s storage mode (Import, DirectQuery, Dual).

Storage modes are covered in greater detail in Chapter 2, Preparing Data Sources. When designed effectively, a composite model can deliver the performance benefits of import (in-memory) models but also provide scalability for large DirectQuery source systems.

A common design pattern with composite models is to set the storage mode of a massive fact table to DirectQuery but configure a smaller, targeted aggregation table in import mode and related dimension tables in dual mode. Power BI automatically utilizes the in-memory aggregation table to resolve incoming report queries if the given aggregation table(s) and its related Dual mode dimension tables contain the necessary data.

Even more recently, Microsoft has unveiled DirectQuery for Power BI datasets and Azure Analysis Services. With DirectQuery for Power BI datasets and Azure Analysis Services, datasets developed and published to Power BI can be chained or extended to produce new datasets that incorporate additional import or DirectQuery sources.

With data transformation and data model storage mode decisions made, the dataset planning process is complete. Let’s now take a look at how these planning processes and decisions apply to our sample project introduced earlier in this chapter.

You have been reading a chapter from

Mastering Microsoft Power BI – Second Edition - Second Edition

Published in: Jun 2022Publisher: PacktISBN-13: 9781801811484

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (2)

Gregory Deckler

Greg Deckler is Vice President of the Microsoft Practice at Fusion Alliance and has been a professional technology systems consultant for over 25 years. Internationally recognized as an expert in Power BI, Greg Deckler is a Microsoft MVP for Data Platform and a superuser within the Power BI community with over 100,000 messages read, more than 11,000 replies, over 2,300 answers, and more than 75 entries in the Quick Measures Gallery. Greg founded the Columbus Azure ML and Power BI User Group (CAMLPUG) and presents at numerous conferences and events, including SQL Saturday, DogFood, and the Dynamic Communities User Group/Power Platform Summit.
Read more about Gregory Deckler

Brett Powell

Brett Powell is the owner of and business intelligence consultant at Frontline Analytics LLC, a data and analytics research and consulting firm and Microsoft Power BI partner. He has worked with Power BI technologies since they were first introduced as the PowerPivot add-in for Excel 2010 and has been a Power BI architect and lead BI consultant for organizations across the retail, manufacturing, and financial services industries. Additionally, Brett has led Boston's Power BI User Group, delivered presentations at technology events such as Power BI World Tour, and maintains the popular Insight Quest Microsoft BI blog.
Read more about Brett Powell

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages