You're reading from Learn Microsoft Fabric

Product type: Book
Published in: Feb 2024
Reading level: N/A
Publisher: Packt
ISBN-13: 9781835082287
Edition: 1st Edition

Tools: Azure Data Factory, Power BI
Concepts: Data Analysis

Authors (2):

Arshad Ali

Bradley Schacht


Building an End-to-End Analytics System – Data Science

Almost every organization is working to take advantage of innovation and advances in artificial intelligence (AI) and machine learning (ML). The challenge is that there are so many products and libraries to consider that you can spend most of your time just figuring out how to do it correctly and quickly.

Microsoft Fabric has been designed from the ground up for the era of ML and AI to drive business value for your organization.

In this chapter, you will learn about the data science capabilities in Fabric by following the end-to-end data science life cycle and building an ML model, all the way from data ingestion to cleansing, feature engineering, training, and operationalizing models. We will cover the following topics:

Understanding the data science project development life cycle and how Fabric’s capabilities help in each of the stages
Data and storage...

Technical requirements

This chapter assumes you have followed the instructions mentioned in the Getting started with Microsoft Fabric section of Chapter 2, Understanding Different Workloads and Getting Started with Microsoft Fabric, to create/enable Fabric in your tenant and have created a Fabric workspace to work in.

The code files for this chapter are available on GitHub at https://github.com/PacktPublishing/Learn-Microsoft-Fabric/tree/main/ch6.

End-to-end data science scenario

A typical data analytics system for data science in Fabric would consist of the components and layers shown in Figure 6.1:

Figure 6.1 – Reference architecture for data science in Fabric

Let’s review these components in detail:

Data sources: Fabric provides native, ready-to-use connectors for ingesting data into the lakehouse from Azure data services, other cloud platforms, or on-premises sources, which makes building a data ingestion flow quick and easy. You can also train your model on data that you have already brought into the lakehouse or data warehouse and transformed.
Data cleansing and preparation: Fabric offers different options for you to efficiently prepare, clean, and transform your data before you train your model. For example, if you prefer a user interface experience, you can use Data Wrangler, with its intuitive interface...

Data and storage – creating a lakehouse and ingesting data using Apache Spark

To ingest data into the lakehouse, you can use an existing lakehouse or create a new one. Follow these steps to create a new lakehouse for this chapter:

After logging into your Fabric tenant, select the Workspaces flyout on the left-hand side.
Search for the workspace (Learn Microsoft Fabric) that you created in Chapter 2, Understanding Different Workloads and Getting Started with Microsoft Fabric. Type its name into the search box at the top, then click your workspace to open it. You can also pin it so that it always appears at the top of the list.
From the workload switcher located at the bottom left of the screen, select Data Engineering.
In the Data Engineering experience, under + New, select Lakehouse to create a lakehouse.
Enter nyctaxilake in the Name box and click Create. The new lakehouse will be created and opened automatically.

Importing notebooks

...

Problem formulation/ideation (business understanding)

In this stage, as a data scientist, you primarily work with different stakeholders to understand the business problem you are trying to solve. You work with business leaders to define the problem and the expected outcome of the project and then work with the data engineering team to get access to data for building and training ML models to solve the defined business problem.

To learn how to conduct an end-to-end data science implementation with Fabric, we will be using the NYC Taxi & Limousine Commission – yellow taxi trip records dataset (https://learn.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow) from Azure Open Datasets. The records in this dataset contain fields such as pickup and drop-off dates/times, pickup and drop-off locations, trip distances, payment types, itemized fares (fare amount, tax amount, and tip amount), rate types, and driver-reported passenger counts. This dataset contains 1.5 billion...

Data acquisition, discovery, and preprocessing

Often, the data for building and training ML models is prepared by the data engineering team and handed to the data science team. In our case, data engineers might have already brought data into the lakehouse, the data warehouse, or both. However, for simplicity’s sake, in this example, we will ingest data from Azure Open Datasets (https://learn.microsoft.com/en-us/azure/open-datasets/overview-what-are-open-datasets) into the lakehouse that we created earlier in this chapter.
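As a rough sketch of what this ingestion step can look like in a Fabric notebook, the snippet below builds the public storage path for the yellow taxi dataset and reads it with Spark. The account, container, and folder names are the ones published on the Azure Open Datasets page for this dataset. The helper names and the table name `nyctaxi_raw` are assumptions made for illustration, and the chapter's own notebook may differ. Depending on the environment, extra Spark configuration for anonymous access to the storage account may also be needed.

```python
# Public storage location of the yellow taxi data, as listed on the
# Azure Open Datasets page for this dataset.
BLOB_ACCOUNT = "azureopendatastorage"
BLOB_CONTAINER = "nyctlc"
BLOB_RELATIVE_PATH = "yellow"


def open_dataset_path(account: str, container: str, relative_path: str) -> str:
    """Build the wasbs:// URL that Spark uses to read from Azure Blob Storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def ingest_yellow_taxi(spark, table_name: str = "nyctaxi_raw"):
    """Read the Parquet files and save them as a Delta table in the default lakehouse.

    `spark` is the session that a Fabric notebook provides; `table_name` is a
    hypothetical name chosen for this sketch.
    """
    path = open_dataset_path(BLOB_ACCOUNT, BLOB_CONTAINER, BLOB_RELATIVE_PATH)
    df = spark.read.parquet(path)
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return df
```

In a notebook with nyctaxilake attached as the default lakehouse, calling `ingest_yellow_taxi(spark)` would be the only step needed to run the sketch.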

Data acquisition

In Chapter 3, Building an End-to-End Analytics System – Lakehouse, we learned how to open an imported notebook and how to attach a lakehouse as a default lakehouse for the opened notebook. Please ensure you attach the lakehouse (nyctaxilake) you created in the Data and storage – creating a lakehouse and ingesting data using Apache Spark section of this chapter. Once you’ve done that, you can import the data you will...

Experimenting and modeling

In this section, we will use a regression ML algorithm to train an ML model to predict trip duration based on several features in the dataset, such as date, time, pickup and drop-off locations, distance, and so on. To learn about the capabilities related to the Data Science experience in Fabric, we will create two versions of the trained model with different sets of hyperparameters and then register each of them in the model registry. While doing this, we will log all the hyperparameters and evaluation metrics by taking advantage of the native integration of MLflow in Fabric.
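To make the prediction target concrete, here is a minimal sketch of how a trip duration label and a few time-based features could be derived from one raw record. It uses only the Python standard library. The function name, the feature names, and the sample record are hypothetical, and the chapter's notebook performs the equivalent work with Spark.

```python
from datetime import datetime


def trip_features(pickup: str, dropoff: str, distance_miles: float) -> dict:
    """Turn one raw trip record into model-ready features plus the label."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(pickup, fmt)
    end = datetime.strptime(dropoff, fmt)
    return {
        "pickup_hour": start.hour,
        "pickup_weekday": start.weekday(),  # Monday is 0
        "trip_distance": distance_miles,
        # Label for the regression model: trip duration in minutes.
        "duration_min": (end - start).total_seconds() / 60.0,
    }


# Made-up record, used only to exercise the function.
example = trip_features("2016-03-01 08:15:00", "2016-03-01 08:42:30", 3.4)
```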

Note

MLflow is an open source platform for managing the end-to-end ML life cycle. You can read more about it at https://mlflow.org/docs/latest/index.html.
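The bookkeeping idea behind experiment tracking can be illustrated without MLflow itself. The sketch below is a plain-Python stand-in, not MLflow's API: it trains two versions of a toy one-feature regression with different hyperparameter values and records the parameters and the evaluation metric for each version. The data, the `l2_penalty` hyperparameter, and the `registry` list are all assumptions made for this illustration.

```python
import math

# Tiny made-up training set: (trip distance in miles, duration in minutes).
DATA = [(1.0, 6.0), (2.0, 11.0), (3.0, 15.0), (5.0, 26.0)]


def fit_slope(data, l2_penalty):
    """Closed-form ridge regression through the origin: w = sum(xy) / (sum(x^2) + penalty)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + l2_penalty)


def rmse(data, slope):
    """Root mean squared error of the predictions slope * x."""
    return math.sqrt(sum((y - slope * x) ** 2 for x, y in data) / len(data))


registry = []  # stand-in for a model registry: one entry per model version
for version, l2_penalty in enumerate([0.0, 10.0], start=1):
    slope = fit_slope(DATA, l2_penalty)
    registry.append({
        "version": version,
        "params": {"l2_penalty": l2_penalty},
        "metrics": {"rmse": rmse(DATA, slope)},
        "model": slope,
    })
```

Keeping parameters and metrics side by side for every version is what makes it possible to compare runs later and to pick a specific version when scoring.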

The code that will be discussed in this section can be found in the Data Science – Model Training notebook. Please make sure you attach the lakehouse (nyctaxilake) you created in the Data and storage – creating...

Enriching and operationalizing

In this section, we will look at loading an already trained ML model from the model registry and generating predictions on new incoming data. Once these predictions have been generated, we will save this data in another Delta table so that we can create a report on it.

Note

The code that will be discussed in this section can be found in the Data Science - Perform Prediction or Scoring notebook. Please make sure you attach the lakehouse (nyctaxilake) you created in the Data and storage – creating a lakehouse and ingesting data using Apache Spark section of this chapter to this notebook.

The steps are as follows:

The first step is to import the required libraries into the current Spark session. Next, we must load the trained ML model from the MLflow-based model registry. While specifying the model’s name, we also need to specify the version of the model – in this case, it’s version = 2:
```
import mlflow
from pyspark...
```
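Since the listing above is cut off, here is a small hedged sketch of how a specific registered version is usually addressed in MLflow: registry URIs follow the `models:/<name>/<version>` scheme. The helper function and the model name `nyctaxi_tripduration` are placeholders for illustration, not the names used in the chapter's notebook.

```python
def registry_model_uri(model_name: str, version: int) -> str:
    """Return an MLflow registry URI of the form models:/<name>/<version>."""
    return f"models:/{model_name}/{version}"


# "nyctaxi_tripduration" is a placeholder name, not the chapter's model name.
uri = registry_model_uri("nyctaxi_tripduration", 2)

# In a notebook with MLflow available, the URI could then be passed to a
# loader such as mlflow.pyfunc.load_model(uri).
```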

Analyzing and getting insights

In this section, we will use Power BI to create a report by connecting it to the lakehouse table we created in the previous section. Power BI is natively integrated into the whole Fabric experience. This integration provides a unique mode of accessing lakehouse data, called Direct Lake, which we discussed in earlier chapters and which delivers the most performant query and reporting experience. Let’s create a report based on the data from the nyctaxilake lakehouse:

Open the nyctaxilake lakehouse and click on SQL endpoint under mode selection at the top right of the screen to switch to SQL endpoint mode for the selected lakehouse.
Once you are in SQL endpoint mode, you should be able to see all the tables you’ve created. If you don’t see them, please click on the Refresh icon at the top. Next, click on the New report icon at the top to create a Power BI report.
On the Power BI report canvas, you can add a title for your report...

Summary

In this chapter, you learned about the data science process and the data science project development life cycle, after which you learned about the different capabilities in Microsoft Fabric that empower you at each step in this journey. You learned about all these capabilities by implementing an end-to-end data science project in Microsoft Fabric based on a regression model and also learned how to leverage advanced capabilities, such as using the model registry and tracking with MLflow, Semantic Link, AutoML, and SynapseML.

This whole experience is natively integrated and built into Fabric, giving you the power and flexibility to build end-to-end data science projects without having to switch to or learn about other technologies. These capabilities range from data ingestion, data transformation, and feature engineering with Notebooks/Spark to training ML models in a distributed manner with SynapseML. Furthermore, it allows you to leverage different open source libraries such...

Authors (2)

Arshad Ali

Arshad Ali is a principal product manager at Microsoft, working on the Microsoft Fabric product team in Redmond, WA. He focuses on Spark Runtime, which empowers both data engineering and data science experiences. In his previous role, he helped strategic customers and partners adopt Azure Synapse and Microsoft Fabric. Arshad has more than 20 years of industry experience and has been with Microsoft for over 16 years. He is the co-author of the book Big Data Analytics with Azure HDInsight and the author of over 200 technical articles and blogs on data and analytics. Arshad holds an MBA from the Foster School of Business at the University of Washington and an MCA from India.

Bradley Schacht

Bradley Schacht is a principal program manager on the Microsoft Fabric product team based in Saint Augustine, Florida. Bradley is a former consultant and trainer and has co-authored five books on SQL Server and Power BI. As a member of the Microsoft Fabric product team, Bradley works directly with customers to solve some of their most complex data problems and helps shape the future of Microsoft Fabric. Bradley gives back to the community by speaking at events, such as the PASS Summit, SQL Saturday, Code Camp, and user groups across the country, including locally at the Jacksonville SQL Server User Group (JSSUG). He is a contributor on SQLServerCentral and blogs on his personal site, BradleySchacht.
