You're reading from Learn Microsoft Fabric

Product type: Book
Published in: Feb 2024
Reading level: N/A
Publisher: Packt
ISBN-13: 9781835082287
Edition: 1st Edition
Authors (2):
Arshad Ali

Arshad Ali is a principal product manager at Microsoft, working on the Microsoft Fabric product team in Redmond, WA. He focuses on Spark Runtime, which empowers both data engineering and data science experiences. In his previous role, he helped strategic customers and partners adopt Azure Synapse and Microsoft Fabric. Arshad has more than 20 years of industry experience and has been with Microsoft for over 16 years. He is the co-author of the book Big Data Analytics with Azure HDInsight and the author of over 200 technical articles and blogs on data and analytics. Arshad holds an MBA from the Foster School of Business at the University of Washington and an MCA from India.

Bradley Schacht

Bradley Schacht is a principal program manager on the Microsoft Fabric product team based in Saint Augustine, Florida. Bradley is a former consultant and trainer and has co-authored five books on SQL Server and Power BI. As a member of the Microsoft Fabric product team, Bradley works directly with customers to solve some of their most complex data problems and helps shape the future of Microsoft Fabric. Bradley gives back to the community by speaking at events, such as the PASS Summit, SQL Saturday, Code Camp, and user groups across the country, including locally at the Jacksonville SQL Server User Group (JSSUG). He is a contributor on SQLServerCentral and blogs on his personal site, BradleySchacht.

Building an End-to-End Analytics System – Data Science

Almost every organization on the planet is on a journey to take advantage of innovations and advancements in artificial intelligence (AI) and machine learning (ML). The challenge, however, is that there are so many products and libraries to consider that you can spend most of your time trying to figure out how to do it right and on time.

Microsoft Fabric has been designed from the ground up for the era of ML and AI to drive business value for your organization.

In this chapter, you will learn about the data science capabilities in Fabric by following the end-to-end data science life cycle and building an ML model, all the way from data ingestion to cleansing, feature engineering, training, and operationalizing models. We will cover the following topics:

  • Understanding the data science project development life cycle and how Fabric’s capabilities help in each of the stages
  • Data and storage...

Technical requirements

This chapter assumes you have followed the instructions mentioned in the Getting started with Microsoft Fabric section of Chapter 2, Understanding Different Workloads and Getting Started with Microsoft Fabric, to create/enable Fabric in your tenant and have created a Fabric workspace to work in.

The code files for this chapter are available on GitHub at https://github.com/PacktPublishing/Learn-Microsoft-Fabric/tree/main/ch6.

End-to-end data science scenario

A typical data analytics system for data science in Fabric would consist of the components and layers shown in Figure 6.1:

Figure 6.1 – Reference architecture for data science in Fabric

Let’s review these components in detail:

  • Data sources: To ingest data into the lakehouse from Azure data services, other cloud platforms, or on-premises sources, Fabric provides built-in, ready-to-use connectors, which make building a data ingestion flow quick and easy. In Fabric, you can also train your model on data that you have already brought into the lakehouse or data warehouse and transformed.
  • Data cleansing and preparation: Fabric offers different options for you to efficiently prepare, clean, and transform your data before you train your model. For example, if you prefer a user interface experience, you can use Data Wrangler, with its intuitive interface...

Data and storage – creating a lakehouse and ingesting data using Apache Spark

To ingest data into the lakehouse, you can use an existing lakehouse or create a new one. Follow these steps to create a new lakehouse for this chapter:

  1. After logging in to your Fabric tenant, select the Workspaces flyout on the left-hand side.
  2. Search for the workspace (Learn Microsoft Fabric) that you created in Chapter 2, Understanding Different Workloads and Getting Started with Microsoft Fabric, by typing its name into the search box at the top, and then click on your workspace to open it. You can also pin it so that it always appears at the top of the list.
  3. From the workload switcher located at the bottom left of the screen, select Data Engineering.
  4. In the Data Engineering experience, under + New, select Lakehouse to create a lakehouse.
  5. Enter nyctaxilake in the Name box and click Create. The new lakehouse will be created and opened automatically.

Importing notebooks

...

Problem formulation/ideation (business understanding)

In this stage, as a data scientist, you primarily work with different stakeholders to understand the business problem you are trying to solve. You work with business leaders to define the problem and the expected outcome of the project and then work with the data engineering team to get access to data for building and training ML models to solve the defined business problem.

To learn how to conduct an end-to-end data science implementation with Fabric, we will be using the NYC Taxi & Limousine Commission – yellow taxi trip records dataset (https://learn.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow) from Azure Open Datasets. The records in this dataset contain fields such as pickup and drop-off dates/times, pickup and drop-off locations, trip distances, payment types, itemized fares (fare amount, tax amount, and tip amount), rate types, and driver-reported passenger counts. This dataset contains 1.5 billion...

Data acquisition, discovery, and preprocessing

Often, data for building and training ML models is prepared by the data engineering team and handed over to the data science team. In our case, data engineers might have already brought data into the lakehouse, the data warehouse, or both. However, for simplicity’s sake, in this example, we will ingest data from Azure Open Datasets (https://learn.microsoft.com/en-us/azure/open-datasets/overview-what-are-open-datasets) into the lakehouse that we created earlier in this chapter.

Data acquisition

In Chapter 3, Building an End-to-End Analytics System – Lakehouse, we learned how to open an imported notebook and how to attach a lakehouse as a default lakehouse for the opened notebook. Please ensure you attach the lakehouse (nyctaxilake) you created in the Data and storage – creating a lakehouse and ingesting data using Apache Spark section of this chapter. Once you’ve done that, you can import the data you will...
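
As a rough illustration of this kind of ingestion step (not the notebook's actual code), the following sketch reads the yellow taxi data with Spark and saves one month of it as a Delta table in the attached lakehouse. The storage account, container, and partition column names (puYear, puMonth) are assumptions based on the public Azure Open Datasets documentation; the table name, year, and month are hypothetical values chosen for the example.

```python
# Sketch only: the storage location and column names are assumptions taken from
# the public Azure Open Datasets documentation, not from the chapter's notebook.
ACCOUNT = "azureopendatastorage"
CONTAINER = "nyctlc"
RELATIVE_PATH = "yellow"


def open_dataset_path(account=ACCOUNT, container=CONTAINER, relative_path=RELATIVE_PATH):
    """Build the wasbs:// URL that Spark uses to read a public blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def ingest_yellow_taxi(spark, table_name="nyc_yellowtaxi_raw", year=2016, month=1):
    """Read one month of trips and save it as a Delta table in the default lakehouse.

    `spark` is the SparkSession that a Fabric notebook provides. `table_name`,
    `year`, and `month` are hypothetical values used only for illustration.
    """
    df = spark.read.parquet(open_dataset_path())
    # puYear and puMonth are assumed to be the dataset's partition columns.
    one_month = df.filter((df.puYear == year) & (df.puMonth == month))
    one_month.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return table_name
```

In a notebook with nyctaxilake attached as the default lakehouse, calling ingest_yellow_taxi(spark) would be expected to create the table under that lakehouse. Restricting the read to a single month keeps the example small, since the full dataset is very large.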

Experimenting and modeling

In this section, we will use a regression ML algorithm to train an ML model to predict trip duration based on several features in the dataset, such as date, time, pickup and drop-off locations, distance, and so on. To learn about the capabilities related to the Data Science experience in Fabric, we will create two versions of the trained model with different sets of hyperparameters and then register each of them in the model registry. While doing this, we will log all the hyperparameters and evaluation metrics by taking advantage of the native integration of MLflow in Fabric.
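
To make the prediction target concrete, here is a minimal sketch of how a trip duration label and a few time-based features could be derived from the pickup and drop-off timestamps. It uses plain Python rather than Spark, and the function names, timestamp format, and choice of features are assumptions made for illustration, not the notebook's actual feature engineering.

```python
from datetime import datetime

# Assumed timestamp format for this example.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_duration_minutes(pickup, dropoff, fmt=TIMESTAMP_FORMAT):
    """Return the trip duration in minutes: the value the regression model predicts."""
    pickup_dt = datetime.strptime(pickup, fmt)
    dropoff_dt = datetime.strptime(dropoff, fmt)
    return (dropoff_dt - pickup_dt).total_seconds() / 60.0


def pickup_time_features(pickup, fmt=TIMESTAMP_FORMAT):
    """Derive simple date and time features from the pickup timestamp."""
    dt = datetime.strptime(pickup, fmt)
    # weekday() counts from Monday = 0 to Sunday = 6.
    return {"hour": dt.hour, "weekday": dt.weekday(), "month": dt.month}


duration = trip_duration_minutes("2016-01-01 08:15:00", "2016-01-01 08:42:30")
features = pickup_time_features("2016-01-01 08:15:00")
print(duration)  # 27.5
print(features)  # {'hour': 8, 'weekday': 4, 'month': 1}
```

On the full dataset, the same derivation would normally be expressed as Spark column operations so that it runs in a distributed manner.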

Note

MLflow is an open source platform for managing the end-to-end ML life cycle. You can read more about it at https://mlflow.org/docs/latest/index.html.

The code that will be discussed in this section can be found in the Data Science – Model Training notebook. Please make sure you attach the lakehouse (nyctaxilake) you created in the Data and storage – creating...

Enriching and operationalizing

In this section, we will look at loading an already trained ML model from the model registry and generating predictions on new incoming data. Once these predictions have been generated, we will save this data in another Delta table so that we can create a report on it.

Note

The code that will be discussed in this section can be found in the Data Science - Perform Prediction or Scoring notebook. Please make sure you attach the lakehouse (nyctaxilake) you created in the Data and storage – creating a lakehouse and ingesting data using Apache Spark section of this chapter to this notebook.

The steps are as follows:

  1. The first step is to import the required libraries into the current Spark session. Next, we must load the trained ML model from the MLflow-based model registry. While specifying the model’s name, we also need to specify the version of the model – in this case, it’s version = 2:
    import mlflow
    from pyspark...
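
For orientation, the sketch below shows one general way to address a specific version of a registered model with MLflow. It is not the notebook's code. MLflow identifies registered models with URIs of the form models:/<name>/<version>, and mlflow.pyfunc.load_model accepts such a URI. The model name trip_duration_model and the helper function names are hypothetical.

```python
def build_model_uri(model_name, version):
    """Return an MLflow registry URI of the form models:/<name>/<version>."""
    return f"models:/{model_name}/{version}"


def score(features, model_name="trip_duration_model", version=2):
    """Load a registered model version and return its predictions.

    `features` is expected to be a pandas DataFrame whose columns match the
    ones used in training. `model_name` is a hypothetical placeholder.
    """
    # Imported lazily so that build_model_uri works without MLflow installed.
    import mlflow.pyfunc

    model = mlflow.pyfunc.load_model(build_model_uri(model_name, version))
    return model.predict(features)


print(build_model_uri("trip_duration_model", 2))  # models:/trip_duration_model/2
```

Pinning the version in the URI means that scoring keeps using the same model even after newer versions are registered.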

Analyzing and getting insights

In this section, we will use Power BI to create a report by connecting it to the lakehouse table we created in the previous section. Power BI is natively integrated into the whole Fabric experience. This integration provides a unique mode of accessing data in the lakehouse, called Direct Lake, which we discussed in earlier chapters and which provides the most performant query and reporting experience. Let’s create a report based on the data from the nyctaxilake lakehouse:

  1. Open the nyctaxilake lakehouse and click on SQL endpoint under mode selection at the top right of the screen to switch to SQL endpoint mode for the selected lakehouse.
  2. Once you are in SQL endpoint mode, you should be able to see all the tables you’ve created. If you don’t see them, please click on the Refresh icon at the top. Next, click on the New report icon at the top to create a Power BI report.
  3. On the Power BI report canvas, you can add a title for your report...

Summary

In this chapter, you learned about the data science process and the data science project development life cycle, and then about the capabilities in Microsoft Fabric that empower you at each step of this journey. You explored these capabilities by implementing an end-to-end data science project in Microsoft Fabric based on a regression model, and you also learned how to leverage advanced capabilities, such as the model registry and tracking with MLflow, Semantic Link, AutoML, and SynapseML.

This whole experience is natively integrated and built into Fabric, giving you the power and flexibility to build end-to-end data science projects without having to switch to or learn about other technologies. Its capabilities range from data ingestion, data transformation, and feature engineering with Notebooks/Spark to training ML models in a distributed manner with SynapseML. Furthermore, it allows you to leverage different open source libraries such...
