Reader small image

You're reading from  Power BI Machine Learning and OpenAI

Product typeBook
Published inMay 2023
Reading LevelIntermediate
PublisherPackt
ISBN-139781837636150
Edition1st Edition
Languages
Concepts
Right arrow
Author (1)
Greg Beaumont
Greg Beaumont
author image
Greg Beaumont

Greg Beaumont is a data architect at Microsoft, where he enjoys identifying and solving complex problems backed by his experience in data architecture and a passion for innovation. Focusing on the healthcare industry, Greg works closely with customers to plan enterprise analytics strategies, evaluate new tools and products, conduct training sessions and hackathons, and architect solutions that improve the quality of care and reduce costs. He strives to be a trusted advisor to his customers and is always seeking new ways to drive progress and help organizations thrive. He is a veteran of the Microsoft data speaker network and has worked with hundreds of customers on their data management and analytics strategies.
Read more about Greg Beaumont

Right arrow

Designing a preliminary data model

Earlier in this chapter, we made two simple assumptions about data modeling:

  • Most of the time, a star schema design will provide the most efficient storage and query performance for business intelligence data models
  • Basic ML models, such as the ones you can build in this book, are usually created with a flattened table

Now that you have a grasp of the underlying data and requirements, it is time to think about the data model for your FAA Wildlife Strike data solution. Logically, you can describe your tables of data as follows:

  • STRIKE_REPORTS (from wildlife.accdb): Each row represents a report that was filed. The table of data contains both descriptive values (date, location, and type) along with values that can be summed up and averaged (height and costs).
  • Engine Codes (from read-me.xls): This contains information about the aircraft engines that can be tied to STRIKE_REPORTS.
  • Aircraft Type (from read-me.xls): This contains information about the aircraft that can be tied to STRIKE_REPORTS.
  • Engine Position (from read-me.xls): This contains information about the aircraft engine positions that can be tied to STRIKE_REPORTS.

At this point, you are faced with some data model choices. No matter what decision you make, some people might question your architecture, since there is no perfect design. Depending on how end users will use the data, the data model design may change. This book will demonstrate some of the differences in data model designs for ML models versus traditional BI designs. At a high level, there are three basic approaches you can take in Power BI:

  • Flatten: You can flatten all the data onto a single table by joining Engine Codes, Aircraft Type, and Engine Position onto STRIKE_REPORTS.
  • Star schema: You can build out a true star schema with STRIKE_REPORTS as a fact table and Engine Codes, Aircraft Type, and Engine Position as dimension tables. Some additional data from STRIKE_REPORTS would also be broken out into separate dimension tables. For example, AIRPORT_ID, AIRPORT, STATE, and FAAREGION could be separate dimension tables.
  • Hybrid design: You can build out a hybrid design using both a flattened and star schema design pattern for the sake of practicality and ease of use.

Let’s look at each of these in turn.

Flattening the data

Flattening the FAA Wildlife Strike reports’ data would require joining the Engine Codes, Aircraft Type, and Engine Position tables onto the STRIKE_REPORTS table so that everything is on one big flat table of data. The result would be something that looks like this:

Figure 1.12 – Tables on the left are combined to form a single table on the right

Figure 1.12 – Tables on the left are combined to form a single table on the right

The following table contains some, but not all, of the pros and cons of a flattened table of data:

Pros

Cons

  • Simplicity
  • No joins needed for code
  • Commonly used by data scientists
  • Can compress well with columnar databases
  • No relational data models for business users
  • Repetitive data can lead to an inefficient storage footprint
  • Limitations for queries with advanced logic
  • Less flexibility for future change and evolution of solution
  • Complex logical queries can be less efficient

Figure 1.13 – Pros and cons of a flattened table for BI

Next, let’s look at the star schema.

Star schema

A true star schema built to best practices would include relationships between Engine Codes, Aircraft Type, and Engine Position with the STRIKE_REPORTS table. It would also break off parts of the STRIKE_REPORTS table into smaller dimension tables. The following figure is a representation of the approach for a true star schema. There may be more dimension tables that would need to be broken off of the STRIKE_REPORTS table in addition to Location and Species, but this is an example of how it might look:

Figure 1.14 – Tables on the left are combined into a star schema, and some data is split off into new dimension tables

Figure 1.14 – Tables on the left are combined into a star schema, and some data is split off into new dimension tables

The following table contains some, but not all, of the pros and cons of a true star schema design:

Pros

Cons

  • Tables often line up with business logic
  • Balance of minimal data duplication and efficient queries
  • Usually expandable if the scope of the solution grows and new data is introduced
  • Traditionally considered the gold standard for BI data models
  • With modern tools, the benefits of reducing data duplication are less impactful versus older tools
  • Complicated ETL
  • Machine learning models are usually trained with flat tables of data
  • Don’t always scale well with very large data volumes having tens of billions of rows

Figure 1.15 – Pros and cons of a star schema for BI

Hybrid design

For the FAA Wildlife Strike data, combining aspects of a flattened design and a star schema is also an option. At the time of this book’s writing, the entire STRIKE_REPORTS table is fewer than 300,000 rows and has fewer than 100 columns. Only two columns contain verbose free text, so data volume is not an issue when using Power BI. For this particular use case, the differences in data storage requirements between flattened and star schema data models are minimal. With data volumes of this small size, you can design the data model to meet the needs of the solution without some of the performance concerns that would be introduced for data sources with tens of millions of rows or hundreds of columns containing free text fields. Columns of data left on a transaction table that will be used as categories are technically called degenerate dimensions. A hybrid design could look something like the following example:

Figure 1.16 – Tables from the left are enhanced and combined into a star schema with some descriptive data still in the fact table

Figure 1.16 – Tables from the left are enhanced and combined into a star schema with some descriptive data still in the fact table

The following table contains some, but not all, of the pros and cons of a hybrid design:

Pros

Cons

  • Rapid prototyping
  • Less logic in the data transformation layer
  • Flexible design
  • Possibly less performant than a star schema for traditional BI
  • Additional logical complexity for users versus a big flat table
  • Data will still need to be flattened out for machine learning
  • Not perfect for either BI or ML, but a compromise between the two

Figure 1.17 – Pros and cons of a hybrid design and considerations for additional data

Before finalizing a preliminary logical design for your FAA Wildlife Strike solution, take a step back to think about the data and the requirements. You can review the expected deliverables from earlier in the chapter, including an analytic report and predictions of damage, size, and height.

In addition to the FAA Wildlife Strike data you’ve been using, what other data might be useful for the solution? Also, what is the effort to get the data? Here are a few examples that you could research:

Additional Data Sources

Level of Effort

Date-based table of aggregations such as Month, Quarter, Season, and Holidays

Easy

Time-based table of aggregations such as hour, AM/PM, and so on

Easy

Data for flights that didn’t have a wildlife strike could provide a baseline for the percentage of flights with strikes

Difficult

Weather data that could be mapped to the date and time of wildlife strikes

Difficult

Additional data about wildlife species such as weight ranges, habitat ranges, and so on

Difficult

Figure 1.18 – Additional potential data sources for the solution

Additional flight, weather, and wildlife data could provide greater analytic and predictive value for the solution. However, adding those sources would require quite a bit of effort that exceeds the scope of your project and the length of this book. If the initial project goes well, you can circle back to these options for future enhancements.

There may be value in adding a Time table to the solution, so open up Power Query and take another look at the TIME column. Notice that 95% of the entries are empty:

Figure 1.19 – 95% of the values for TIME are empty

Figure 1.19 – 95% of the values for TIME are empty

Due to a lack of complete data, you decide to leave a Time table out of the initial build.

How about a Date table so that you can roll up data by week, month, quarter, year, holidays, weekends, and more? Looking at the INCIDENT_DATE column in Power Query, it is populated for every entry in the preview:

Figure 1.20 – INCIDENT_DATE is fully populated with date values

Figure 1.20 – INCIDENT_DATE is fully populated with date values

INCIDENT_DATE can be used as a key for connecting to a Date table containing many different date-based aggregations. You decide to pull in a Date table for the architecture. The resulting preliminary data model will now look as follows:

Figure 1.21 – A Date table is added to the preliminary data model

Figure 1.21 – A Date table is added to the preliminary data model

The Date table was not present in the source data, but in your reporting model, it will allow you to slice and dice data by day, week, month, quarter, year, weekend, and more. When you explore data in future chapters, it will add new ways to dive into and explore date-based trends. The Date table will be added in Chapter 2.

In the final section of the chapter, we’ll look at what else we need to take into account for ML.

Previous PageNext Page
You have been reading a chapter from
Power BI Machine Learning and OpenAI
Published in: May 2023Publisher: PacktISBN-13: 9781837636150
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Greg Beaumont

Greg Beaumont is a data architect at Microsoft, where he enjoys identifying and solving complex problems backed by his experience in data architecture and a passion for innovation. Focusing on the healthcare industry, Greg works closely with customers to plan enterprise analytics strategies, evaluate new tools and products, conduct training sessions and hackathons, and architect solutions that improve the quality of care and reduce costs. He strives to be a trusted advisor to his customers and is always seeking new ways to drive progress and help organizations thrive. He is a veteran of the Microsoft data speaker network and has worked with hundreds of customers on their data management and analytics strategies.
Read more about Greg Beaumont