Home Data Power BI Machine Learning and OpenAI

Power BI Machine Learning and OpenAI

By Greg Beaumont
books-svg-icon Book
eBook $39.99 $27.98
Print $49.99 $29.98
Subscription $15.99 $10 p/m for three months
$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
eBook $39.99 $27.98
Print $49.99 $29.98
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
  1. Free Chapter
    Chapter 1: Requirements, Data Modeling, and Planning
About this book
Microsoft Power BI is the ultimate solution for businesses looking to make data-driven decisions and unlock the full potential of their data. Unleashing Your Data with Power BI Machine Learning and OpenAI is designed for data scientists and BI professionals seeking to improve their existing solutions and workloads using AI. The book explains the intricacies of the subject by using a workshop-style data story for data ingestion, data modeling, analytics, and predictive analytics with Power BI machine learning. Along the way, you’ll learn about AI features, AI visuals, R/Python integration, and OpenAI integration. The workshop-style content allows you to practice all your learnings in real-life challenges and gain hands-on experience. Additionally, you’ll gain an understanding of AI/ML, step by step, with replicable examples and references. From enhancing data visualizations to building SaaS Power BI ML models, and integrating Azure OpenAI, this book will help you unlock new capabilities in Power BI. By the end of this book, you’ll be well-equipped to build ML models in Power BI, plan projects for both BI and ML, understand R/Python visuals with Power BI, and introduce OpenAI to enhance your analytics solutions.
Publication date:
May 2023
Publisher
Packt
Pages
308
ISBN
9781837636150

 

Requirements, Data Modeling, and Planning

You begin your journey by assessing the requirements and data for your project. The use case will be a fictional scenario, but everything will be built using real data from the Federal Aviation Administration’s (FAA) Wildlife Strike Database. The data is real, the topic can be understood by anyone, and the findings within the data are interesting and fun. According to the FAA’s website, about 47 animal strikes are reported daily by aircraft. These incidents can damage airplanes, potentially endanger passengers, and negatively impact wild animal (especially bird) populations.

For the use case, you have been assigned to provide your leadership with tools to do an interactive analysis of the FAA Wildlife Strike data, find insights about factors that influence the incidents, and also make predictions about future wildlife strike incidents and the associated costs. The primary goal of your project, predicting the future impact of FAA Wildlife Strikes, will require building some Power BI machine learning models.

Before uploading data to Power BI’s machine learning (ML) tools, you’ll need to create tables of data that will train the ML models. There is an old saying about data and analytics: “Garbage in, garbage out.” Software as a Service (SaaS) machine learning tools are easy to use, but you still need to feed them good-quality curated data. Identifying the right training data and getting it into the right format are crucial steps in an ML project.

This project will encompass data exploration, data transformation, data analysis, and additional downstream data transformations before you begin working with Power BI ML tools. You are already an experienced business intelligence (BI) professional and Power BI user, and now you are ready to take your skills to the next level with ML in Power BI!

Power BI supports connections to source data in many different formats, ranging from relational databases to unstructured sources to big flat tables of raw data. Countless books have been written about the best ways to structure and model data for different use cases. Rather than dive into the specifics of data modeling, for this book, we will begin with two simple assumptions:

  • Most of the time, a star schema design will provide the most efficient storage and query performance for business intelligence data models
  • Basic ML models, such as the ones you will build in this book, are usually created with a flattened table

Just to be clear, not every solution will follow these assumptions. Rather, these assumptions are generalizations that can provide you with a starting point as you approach the design of a new data model. Quite often, there will not be a perfect answer, and the optimal design will be dictated by the types of queries and business logic that are generated by the end consumers of the data model.

If you’ve never heard the terms star schema and flattened data before, don’t worry! The book will progress at a pace that is intended to help you learn and will also stay at a level that makes sense when you review the FAA data. Let’s browse the FAA Wildlife Strike data and decide upon the best data modeling strategy for your new project!

In this chapter, we will take the following steps so that you can understand the data, think through how it will be used, and then formulate a preliminary plan for the data model:

  • Reviewing the source data
  • Reviewing the requirements for the solution
  • Designing a preliminary data model
  • Considerations for ML
 

Technical requirements

For this chapter, you will need the following:

 

Reviewing the source data

You begin your journey by digging into the source data that you will be using for your project. Let’s get started!

Accessing the data

The source data that you will be using for this book is real data from the United States FAA. The data contains reports filed when aircraft struck wildlife. There is a website providing details, documentation, updates, and access instructions at this URL: https://wildlife.faa.gov/home. The URL (and all URLs) will also be linked from the affiliated GitHub site at https://github.com/PacktPublishing/Unleashing-Your-Data-with-Power-BI-Machine-Learning-and-OpenAI in case changes are made after this book has been published.

If you’d prefer to follow along using the finished version of the content from this chapter rather than building it all step by step, you can download the PBIT version of the file at the Packt GitHub site folder for Chapter 1: https://github.com/PacktPublishing/Unleashing-Your-Data-with-Power-BI-Machine-Learning-and-OpenAI/tree/main/Chapter-01.

Within the FAA Wildlife Strike Database website, you can navigate to this link and run basic queries against the data and familiarize yourself with the content: https://wildlife.faa.gov/search. There are also two files linked from this page that you can reference while reviewing the source data.

At the time of this book’s writing, the second heading on the web page is titled Download the FAA Wildlife Strike Database, and it has a link titled Download that allows you to download the entire historical database along with a reference file. You can download the files from the FAA site for the purposes of this book. There will also be a Power BI PBIT file containing the results of the efforts of this chapter at the GitHub repository. A PBIT file is a Power BI template that can be populated with the files that you download from the Packt GitHub site. If the FAA data ever becomes unavailable, you can still proceed with the contents from the GitHub site to recreate the contents of every chapter.

The files you will be using from the FAA are as follows:

Exploring the FAA Wildlife Strike report data

The wildlife.accdb file is in an Access file format that can be opened with many different tools including Microsoft Access, Microsoft Excel, Power BI, and many more. For the purpose of this book, you will open it using Power BI Desktop. Power BI Desktop is available as a free download at this link: https://powerbi.microsoft.com/en-us/downloads/.

  1. First, open up Power BI Desktop. Once it is open on your desktop, select the Get data drop-down menu from the ribbon and click on More… as shown in the following screenshot:
Figure 1.1 – Connecting to data with Power BI Desktop

Figure 1.1 – Connecting to data with Power BI Desktop

  1. Next, within the Get data window, select Access database and click Connect:
Figure 1.2 – Access database connector in Power BI

Figure 1.2 – Access database connector in Power BI

  1. Select the Access database file that was downloaded and unzipped from the FAA Wildlife Strike Database, named wildlife:
Figure 1.3 – The wildlife file shows up in Power BI

Figure 1.3 – The wildlife file shows up in Power BI

  1. Select the STRIKE_REPORTS table and click Transform Data:
Figure 1.4 – Preview of the data before making transformations

Figure 1.4 – Preview of the data before making transformations

  1. The Power Query window will open in Power BI Desktop with a preview of the FAA Wildlife Strike data. On the ribbon, select the View header for Data Preview, and then check the boxes for Column quality, Column distribution, and Column profile. These features will provide some insights for the data preview, that helps you explore and understand the data:
Figure 1.5 – Data Preview features in Power Query

Figure 1.5 – Data Preview features in Power Query

In Figure 1.5, notice that the first column, INDEX_NR, is highlighted. You can see that none of the values are empty, none have errors, and in Column statistics at the bottom of Figure 1.6, every value is a unique integer. The name INDEX_NR gives it away, but this column is the unique identifier for each row of data.

Let’s review another column in Power Query. Go ahead and highlight TIME_OF_DAY. As you can see in Figure 1.6, there are four distinct values and about 12% are blank. Blank values are an important consideration for this solution. Non-empty values include terms such as Day, Dawn, Dusk, and Night. What does an empty value mean? Was the field left blank by the person filing the report? Was it not entered properly into the system? You’ll revisit this topic later in the book.

Figure 1.6 – Column statistics help with understanding data

Figure 1.6 – Column statistics help with understanding data

Since there are over 100 columns in the FAA Wildlife Strike reports’ data, we won’t discuss all of them in this chapter. That being said, reviewing each and every column would be a great way to review the data for errors, empty fields, distribution of values, and more. For the purposes of this chapter, go ahead and open up the read_me.xls file that was included with the ZIP file from the FAA. The first sheet is Column Name and contains the names and descriptive data about the columns in the wildlife.accdb file. Most of the columns fall into one of the following categories:

  • Date and time fields detailing the dates, times, and years for different events related to each report
  • Descriptive information about the event such as height of contact, latitudes and longitudes, originating airports, and flight numbers
  • Descriptive information about the aircraft such as ownership, aircraft type and manufacturer, number of engines, location of engines, and so on
  • Estimates of the damage due to the strike such as costs, costs adjusted for inflation, damage location on the aircraft, and more
  • Information about the wildlife struck by the aircraft including species, size, quantities hit, and so on

Once you’ve finished browsing the report data, close the read_me.xlsx document on your desktop, and then connect to it from Power BI per the following steps. The document version used in this book can be downloaded from the Packt GitHub site here: https://github.com/PacktPublishing/Unleashing-Your-Data-with-Power-BI-Machine-Learning-and-OpenAI/tree/main/Chapter-01.

  1. Click on Excel Workbook in the left-hand panel:
Figure 1.7 – Excel Workbook is a new source of data

Figure 1.7 – Excel Workbook is a new source of data

  1. Select the read_me file from the browser and click Open:
Figure 1.8 – Excel file ready to open in Power Query

Figure 1.8 – Excel file ready to open in Power Query

  1. Tick the Aircraft Type, Engine Codes, and Engine Position boxes. Then, click OK.
Figure 1.9 – Sheets in Power Query can be individually selected

Figure 1.9 – Sheets in Power Query can be individually selected

After clicking OK and importing the three sheets, notice that Aircraft Type, Engine Codes, and Engine Position are now available in Power Query as three separate tables of data:

Figure 1.10 – Three new tables are previewed in Power Query

Figure 1.10 – Three new tables are previewed in Power Query

The three tables contain descriptive information about values that exist in the FAA Wildlife Strike reports’ data:

  • Aircraft Type: A table that maps the Aircraft Code to a description such as Airplane, Helicopter, or Glider
  • Engine Codes: Information about engine manufacturer and model numbers
  • Engine Position: Details about the location of an engine on the aircraft

For all three of these tables, you’ll notice that there are some unnecessary rows and blank values. You will address these later in the book, so there is no need to make any modifications in Power Query at this time.

Once you’ve browsed the different columns from all the different tables in Power Query, click Close & Apply to import the data into Power BI and save it, per the following diagram:

Figure 1.11 – The Close & Apply button will import data into Power BI

Figure 1.11 – The Close & Apply button will import data into Power BI

Once the data is imported, you can save your Power BI Desktop file as a .pbix file. A copy of the PBIT file named Chapter 1 Template.pbit, which can be populated with the data and then saved as a PBIX, can be found at this GitHub link: https://github.com/PacktPublishing/Unleashing-Your-Data-with-Power-BI-Machine-Learning-and-OpenAI/tree/main/Chapter-01.

 

Reviewing the requirements for the solution

Now that you’ve explored the FAA Wildlife Strike data, you have a better understanding of what data is available for your solution. The original assignment that you were given by your stakeholders was as follows:

  • Provide leadership with tools to do interactive analysis of the FAA Wildlife Strike data
  • Find insights about factors that influence the incidents
  • Make predictions about future wildlife strike incidents

Those requirements sound pretty vague! Now that you have a better understanding of the available data, it’s a good time to circle back with the stakeholders and clarify those requirements. You ask them questions such as the following:

  • What types of interactive analysis do you want to do?
  • Are you interested in the impact on endangered bird species?
  • Maybe you’d like to view trends due to wildlife strikes and compare airports and regions?
  • Perhaps you’d like to see the height and frequency of wildlife strikes at different times of the year?
  • Would understanding correlations to factors such as aircraft size, time of day, season, geography, and height be useful?
  • Are you interested in predicting specific risks or outcomes?

Entire books have been written about gathering requirements for data and analytics projects, and discussions on this particular topic could also be extensive. For the sake of keeping this book consumable, let’s assume that your discussions with the stakeholders led to prioritizing the following deliverables for the project:

  • Analytic report: Viewing trends over time such as number of incidents, location of incidents, height, and details such as types of aircraft and wildlife species
  • Predict damage: When a strike is reported, make a prediction as to whether there will be a cost associated with any damage
  • Predict size: When a strike is reported, make a prediction about the size of the wildlife that struck the aircraft
  • Predict height: For wildlife strikes, predict the height of the incidents

Now, you review the notes you took about the FAA Wildlife Strike data during your data exploration efforts. In doing so, you can think about how the data might match up to the use cases. Based on the requirements and your initial exploration of the data, you decide that the FAA Wildlife Strike data from the wildlife.accdb file and the tables from the read_me.xls file (Engine Codes, Aircraft Type, and Engine Position) are appropriate content to include during the initial phases of your project.

 

Designing a preliminary data model

Earlier in this chapter, we made two simple assumptions about data modeling:

  • Most of the time, a star schema design will provide the most efficient storage and query performance for business intelligence data models
  • Basic ML models, such as the ones you can build in this book, are usually created with a flattened table

Now that you have a grasp of the underlying data and requirements, it is time to think about the data model for your FAA Wildlife Strike data solution. Logically, you can describe your tables of data as follows:

  • STRIKE_REPORTS (from wildlife.accdb): Each row represents a report that was filed. The table of data contains both descriptive values (date, location, and type) along with values that can be summed up and averaged (height and costs).
  • Engine Codes (from read-me.xls): This contains information about the aircraft engines that can be tied to STRIKE_REPORTS.
  • Aircraft Type (from read-me.xls): This contains information about the aircraft that can be tied to STRIKE_REPORTS.
  • Engine Position (from read-me.xls): This contains information about the aircraft engine positions that can be tied to STRIKE_REPORTS.

At this point, you are faced with some data model choices. No matter what decision you make, some people might question your architecture, since there is no perfect design. Depending on how end users will use the data, the data model design may change. This book will demonstrate some of the differences in data model designs for ML models versus traditional BI designs. At a high level, there are three basic approaches you can take in Power BI:

  • Flatten: You can flatten all the data onto a single table by joining Engine Codes, Aircraft Type, and Engine Position onto STRIKE_REPORTS.
  • Star schema: You can build out a true star schema with STRIKE_REPORTS as a fact table and Engine Codes, Aircraft Type, and Engine Position as dimension tables. Some additional data from STRIKE_REPORTS would also be broken out into separate dimension tables. For example, AIRPORT_ID, AIRPORT, STATE, and FAAREGION could be separate dimension tables.
  • Hybrid design: You can build out a hybrid design using both a flattened and star schema design pattern for the sake of practicality and ease of use.

Let’s look at each of these in turn.

Flattening the data

Flattening the FAA Wildlife Strike reports’ data would require joining the Engine Codes, Aircraft Type, and Engine Position tables onto the STRIKE_REPORTS table so that everything is on one big flat table of data. The result would be something that looks like this:

Figure 1.12 – Tables on the left are combined to form a single table on the right

Figure 1.12 – Tables on the left are combined to form a single table on the right

The following table contains some, but not all, of the pros and cons of a flattened table of data:

Pros

Cons

  • Simplicity
  • No joins needed for code
  • Commonly used by data scientists
  • Can compress well with columnar databases
  • No relational data models for business users
  • Repetitive data can lead to an inefficient storage footprint
  • Limitations for queries with advanced logic
  • Less flexibility for future change and evolution of solution
  • Complex logical queries can be less efficient

Figure 1.13 – Pros and cons of a flattened table for BI

Next, let’s look at the star schema.

Star schema

A true star schema built to best practices would include relationships between Engine Codes, Aircraft Type, and Engine Position with the STRIKE_REPORTS table. It would also break off parts of the STRIKE_REPORTS table into smaller dimension tables. The following figure is a representation of the approach for a true star schema. There may be more dimension tables that would need to be broken off of the STRIKE_REPORTS table in addition to Location and Species, but this is an example of how it might look:

Figure 1.14 – Tables on the left are combined into a star schema, and some data is split off into new dimension tables

Figure 1.14 – Tables on the left are combined into a star schema, and some data is split off into new dimension tables

The following table contains some, but not all, of the pros and cons of a true star schema design:

Pros

Cons

  • Tables often line up with business logic
  • Balance of minimal data duplication and efficient queries
  • Usually expandable if the scope of the solution grows and new data is introduced
  • Traditionally considered the gold standard for BI data models
  • With modern tools, the benefits of reducing data duplication are less impactful versus older tools
  • Complicated ETL
  • Machine learning models are usually trained with flat tables of data
  • Don’t always scale well with very large data volumes having tens of billions of rows

Figure 1.15 – Pros and cons of a star schema for BI

Hybrid design

For the FAA Wildlife Strike data, combining aspects of a flattened design and a star schema is also an option. At the time of this book’s writing, the entire STRIKE_REPORTS table is fewer than 300,000 rows and has fewer than 100 columns. Only two columns contain verbose free text, so data volume is not an issue when using Power BI. For this particular use case, the differences in data storage requirements between flattened and star schema data models are minimal. With data volumes of this small size, you can design the data model to meet the needs of the solution without some of the performance concerns that would be introduced for data sources with tens of millions of rows or hundreds of columns containing free text fields. Columns of data left on a transaction table that will be used as categories are technically called degenerate dimensions. A hybrid design could look something like the following example:

Figure 1.16 – Tables from the left are enhanced and combined into a star schema with some descriptive data still in the fact table

Figure 1.16 – Tables from the left are enhanced and combined into a star schema with some descriptive data still in the fact table

The following table contains some, but not all, of the pros and cons of a hybrid design:

Pros

Cons

  • Rapid prototyping
  • Less logic in the data transformation layer
  • Flexible design
  • Possibly less performant than a star schema for traditional BI
  • Additional logical complexity for users versus a big flat table
  • Data will still need to be flattened out for machine learning
  • Not perfect for either BI or ML, but a compromise between the two

Figure 1.17 – Pros and cons of a hybrid design and considerations for additional data

Before finalizing a preliminary logical design for your FAA Wildlife Strike solution, take a step back to think about the data and the requirements. You can review the expected deliverables from earlier in the chapter, including an analytic report and predictions of damage, size, and height.

In addition to the FAA Wildlife Strike data you’ve been using, what other data might be useful for the solution? Also, what is the effort to get the data? Here are a few examples that you could research:

Additional Data Sources

Level of Effort

Date-based table of aggregations such as Month, Quarter, Season, and Holidays

Easy

Time-based table of aggregations such as hour, AM/PM, and so on

Easy

Data for flights that didn’t have a wildlife strike could provide a baseline for the percentage of flights with strikes

Difficult

Weather data that could be mapped to the date and time of wildlife strikes

Difficult

Additional data about wildlife species such as weight ranges, habitat ranges, and so on

Difficult

Figure 1.18 – Additional potential data sources for the solution

Additional flight, weather, and wildlife data could provide greater analytic and predictive value for the solution. However, adding those sources would require quite a bit of effort that exceeds the scope of your project and the length of this book. If the initial project goes well, you can circle back to these options for future enhancements.

There may be value in adding a Time table to the solution, so open up Power Query and take another look at the TIME column. Notice that 95% of the entries are empty:

Figure 1.19 – 95% of the values for TIME are empty

Figure 1.19 – 95% of the values for TIME are empty

Due to a lack of complete data, you decide to leave a Time table out of the initial build.

How about a Date table so that you can roll up data by week, month, quarter, year, holidays, weekends, and more? Looking at the INCIDENT_DATE column in Power Query, it is populated for every entry in the preview:

Figure 1.20 – INCIDENT_DATE is fully populated with date values

Figure 1.20 – INCIDENT_DATE is fully populated with date values

INCIDENT_DATE can be used as a key for connecting to a Date table containing many different date-based aggregations. You decide to pull in a Date table for the architecture. The resulting preliminary data model will now look as follows:

Figure 1.21 – A Date table is added to the preliminary data model

Figure 1.21 – A Date table is added to the preliminary data model

The Date table was not present in the source data, but in your reporting model, it will allow you to slice and dice data by day, week, month, quarter, year, weekend, and more. When you explore data in future chapters, it will add new ways to dive into and explore date-based trends. The Date table will be added in Chapter 2.

In the final section of the chapter, we’ll look at what else we need to take into account for ML.

 

Considerations for ML

Now that you’ve created a preliminary data model that will serve as the basis for analytic reporting in Power BI, you start thinking about a process for creating tables of data to be used with Power BI machine learning. You will need to create a single table of flattened data for each machine learning model that you train, test, and deploy.

Creating tables of data to train a machine learning model entails treating each column as a feature of the algorithm that you will be training and then using to make predictions. For example, if you wanted to create a machine learning algorithm that predicts whether something is an insect, the features (ML terminology for columns on a single table) might be [Six Legs Y/N?], [Life Form Y/N?], [Count of Eyes], and [Weight], and then a column that will be predicted, such as [Insect Y/N?]. Each row would represent something that is being evaluated for a prediction to answer the question, “Is this an insect?”

You decide to take the following approach, in the following order, so that you can do everything within Power BI:

  1. Data exploration and initial data model creation in Power BI Desktop Power Query.
  2. Analytic report created in Power BI.
  3. Feature discovery in Power BI.
  4. Create training data sets in Power Query.
  5. Move training data sets to Power BI dataflows.
  6. Train, test, deploy a Power BI machine learning model in Power BI dataflows.

This process is shown in Figure 1.22.

Figure 1.22 – All of the ETL (extract, transform, load) will happen in Power BI Power Query and Power BI dataflows

Figure 1.22 – All of the ETL (extract, transform, load) will happen in Power BI Power Query and Power BI dataflows

Power BI ML offers three different types of predictive model types. Those types, as defined in the Power BI service, are as follows:

  • A binary prediction model predicts whether an outcome will be achieved. Effectively, a prediction of “Yes” or “No” is returned.
  • General classification models predict more than two possible outcomes such as A, B, C, or D.
  • A regression model will predict a numeric value along a spectrum of possible values. For example, it will predict the costs of an event based on similar past events.

As part of your preliminary planning, you consider how these options could map to the deliverables that were prioritized by your stakeholders:

  • Analytic report: This deliverable will be a Power BI analytic report and could use some Power BI AI features, but it will not be a Power BI ML model. The analytic report will help you explore and identify the right data for Power BI machine learning models.
  • Predict damage: Predicting whether or not damage will result from a wildlife strike is a good match for a binary prediction model since the answer will have two possible outcomes: yes or no.
  • Predict size: Predicting the size of the wildlife that struck an aircraft based upon factors such as damage cost, damage location, height, time of year, and airport location will probably have multiple values that can be predicted such as Large, Medium, and Small. This requirement could be a good fit for a general classification model.
  • Predict height: This deliverable predicts the height at which wildlife strikes will happen and provides that prediction as a numeric value representing height above ground level in feet. It is likely a good fit for a regression model, which predicts numeric values.

There is no way of knowing with certainty whether the FAA Wildlife Strike data will support these specific use cases, but you won’t know until you try! Discovery is a key part of the process. First, you must identify features in the data that might have predictive value, and then train and test the machine learning models in Power BI. Only then will you know what types of predictions might be possible for your project.

 

Summary

In this chapter, you explored the data available for your project and reviewed subsequent options for mapping data to the requirements of your stakeholders. You reviewed data architecture options to meet both business intelligence and ML requirements in Power BI and decided upon a hybrid approach that blends a star schema design with flattened data. You also formulated a plan to explore, analyze, design, build, and deploy your solution. Finally, you decided upon three use cases for predictive ML models in Power BI.

In the next chapter, you will ingest and prep data from the FAA Wildlife Strike database using Power Query within Power BI. You’ll deep dive into data characteristics, decide what is needed for your design, and build out a flexible foundation that will support both the current project and future iterations and changes. Your approach in Power Query will support both business intelligence analytics in Power BI and predictive analytics in Power BI ML.

About the Author
  • Greg Beaumont

    Greg Beaumont is a data architect at Microsoft, where he enjoys identifying and solving complex problems backed by his experience in data architecture and a passion for innovation. Focusing on the healthcare industry, Greg works closely with customers to plan enterprise analytics strategies, evaluate new tools and products, conduct training sessions and hackathons, and architect solutions that improve the quality of care and reduce costs. He strives to be a trusted advisor to his customers and is always seeking new ways to drive progress and help organizations thrive. He is a veteran of the Microsoft data speaker network and has worked with hundreds of customers on their data management and analytics strategies.

    Browse publications by this author
Latest Reviews (1 reviews total)