
You're reading from Learn Microsoft Fabric

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781835082287
Edition: 1st
Authors (2):
Arshad Ali

Arshad Ali is a principal product manager at Microsoft, working on the Microsoft Fabric product team in Redmond, WA. He focuses on Spark Runtime, which empowers both data engineering and data science experiences. In his previous role, he helped strategic customers and partners adopt Azure Synapse and Microsoft Fabric. Arshad has more than 20 years of industry experience and has been with Microsoft for over 16 years. He is the co-author of the book Big Data Analytics with Azure HDInsight and the author of over 200 technical articles and blogs on data and analytics. Arshad holds an MBA from the Foster School of Business at the University of Washington and an MCA from India.
Read more about Arshad Ali

Bradley Schacht

Bradley Schacht is a principal program manager on the Microsoft Fabric product team based in Saint Augustine, Florida. Bradley is a former consultant and trainer and has co-authored five books on SQL Server and Power BI. As a member of the Microsoft Fabric product team, Bradley works directly with customers to solve some of their most complex data problems and helps shape the future of Microsoft Fabric. Bradley gives back to the community by speaking at events, such as the PASS Summit, SQL Saturday, Code Camp, and user groups across the country, including locally at the Jacksonville SQL Server User Group (JSSUG). He is a contributor on SQLServerCentral and blogs on his personal site, BradleySchacht.
Read more about Bradley Schacht


Building an End-to-End Analytics System – Lakehouse

Traditionally, for their analytics needs, companies have struggled to manage two different systems: a relational data warehouse to manage and process primarily structured data, and a data lake for big data processing of primarily unstructured data. This has not only created data silos and redundancy across multiple systems but has also increased development and management effort and the total cost of ownership. Microsoft Fabric bridges this gap by unifying these data stores (data warehouses and data lakes), standardizing storage on the Delta Lake format in OneLake for lakehouses.

In this chapter, we are going to take an example of a retail organization and build its end-to-end analytics system based on a lakehouse from start to finish—all the way from data ingestion and transformation to reporting and visualization. The key stages are as follows:

  • Creating a lakehouse using...

Technical requirements

This chapter assumes that you have followed the instructions mentioned in the Getting started with Microsoft Fabric section in the previous chapter to create/enable Fabric in your tenant and have created a Fabric workspace to work in.

The code files for this chapter are available on GitHub: https://github.com/PacktPublishing/Learn-Microsoft-Fabric/tree/main/ch3.

Once you arrive at this link, you can open an individual notebook and then click on the Download raw file icon at the top right of the preview pane to download that notebook file.

You can also go to https://github.com/PacktPublishing/Learn-Microsoft-Fabric/ and click on Download ZIP under the Code button near the top of the page to download all the notebook files in one go.

Understanding end-to-end scenarios

A lakehouse in Microsoft Fabric is a data storage layer that allows organizations to store and manage virtually any type of data (structured, semi-structured, and unstructured) in a single location, while allowing various tools and frameworks to process and analyze that data according to organizational needs and individual preference.

A lakehouse combines the best aspects of a data lake and a data warehouse, removing the duplication of data and the friction of ingesting, transforming, and sharing organizational data, all in the open Delta Lake format. Ingested data flows into the lakehouse in the Delta Lake format (https://delta.io/) by default, and tables are automatically discovered and registered in the metastore on behalf of users, so they are seamlessly available to all the engines within Fabric.

A data analytics system based on a lakehouse typically follows Medallion architecture (https://learn.microsoft.com/en-us...

Storage

In this section, you will create three lakehouses (one for each zone of the Medallion architecture) by following these steps:

  1. Once logged into your Fabric tenant, select the Workspaces flyout on the left-hand side.
  2. Search for the workspace that you created in Chapter 2, Understanding Different Workloads and Getting Started with Microsoft Fabric, by typing its name in the search textbox at the top and clicking on your workspace to open it. You can also pin it so that it always appears at the top of the list.
  3. From the workload switcher located at the bottom left of the screen, select Data Engineering.
  4. In the Data Engineering experience, select Lakehouse under + New to create a lakehouse.
  5. Enter wwi_bronze in the Name box and click on Create. The new lakehouse will be created and automatically opened.

Repeat steps 4–5 to create two more lakehouses named wwi_silver and wwi_gold. When you switch to the workspace again, you...
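Each lakehouse created above is backed by storage in OneLake, addressable via an ABFS URI of the form `abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/...`. The following sketch builds those paths for the three Medallion zones; the workspace name `learn_fabric` is a placeholder for whatever workspace you created in Chapter 2.

```python
# Illustrative helper: build OneLake ABFS URIs for the three Medallion-zone
# lakehouses. "learn_fabric" is a placeholder workspace name - substitute
# your own workspace from Chapter 2.

ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"

def lakehouse_path(workspace: str, lakehouse: str,
                   section: str = "Files", subpath: str = "") -> str:
    """Return the ABFS URI for a lakehouse's Files or Tables section."""
    base = f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/{section}"
    return f"{base}/{subpath}" if subpath else base

ZONES = ["wwi_bronze", "wwi_silver", "wwi_gold"]
paths = {zone: lakehouse_path("learn_fabric", zone) for zone in ZONES}
```

These URIs are what you would pass to Spark readers and writers in later sections when addressing a zone directly rather than through the default lakehouse mount.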

Ingestion

In this section, you will use a Data Factory pipeline to ingest sample data from a source (an Azure storage account) into the Files section of the Bronze zone (wwi_bronze) of the Medallion architecture:

  1. Choose the workspace that you created in Chapter 2, Understanding Different Workloads and Getting Started with Microsoft Fabric, from the Workspaces flyout on the left-hand side and open it. Create a Data pipeline from the +New button on the workspace page. If you don’t see an option for Data pipeline, click on the Show All menu item at the bottom and then select Data pipeline under Data Factory.
Figure 3.4 – Creating a new data pipeline

  2. In the New pipeline dialog, enter IngestDataFromSourceToBronze as the name and click on Create. This will create a new data factory pipeline and open its canvas on the screen to work on.
  3. On the newly created data factory pipeline, click on Add pipeline activity to add an activity to...

Transformation

Now that you have ingested the raw data from the source into the Files section of the wwi_bronze lakehouse, the next step is to transform and prepare this data to create Delta Lake tables in the wwi_silver lakehouse.
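In the book, this Bronze-to-Silver step is carried out by the Spark notebooks you import below. As a plain-Python illustration of the kind of cleaning those notebooks perform (casting raw string fields to typed values and dropping duplicate rows), consider this sketch; the field names (SaleKey, SaleDate, Quantity, TotalPrice) are hypothetical stand-ins, not the actual schema from the sample data.

```python
# Plain-Python sketch of typical Bronze -> Silver cleaning logic:
# cast raw string fields to typed values and deduplicate on a key.
# Field names here are hypothetical, not the book's actual schema.
from datetime import date

def to_silver(raw_rows):
    """Cast types and keep only the first occurrence of each SaleKey."""
    seen, silver = set(), []
    for row in raw_rows:
        key = row["SaleKey"]
        if key in seen:  # drop duplicate rows
            continue
        seen.add(key)
        silver.append({
            "SaleKey": int(key),
            "SaleDate": date.fromisoformat(row["SaleDate"]),
            "Quantity": int(row["Quantity"]),
            "TotalPrice": float(row["TotalPrice"]),
        })
    return silver

raw = [
    {"SaleKey": "1", "SaleDate": "2024-02-01", "Quantity": "3", "TotalPrice": "29.97"},
    {"SaleKey": "1", "SaleDate": "2024-02-01", "Quantity": "3", "TotalPrice": "29.97"},  # duplicate
    {"SaleKey": "2", "SaleDate": "2024-02-02", "Quantity": "1", "TotalPrice": "9.99"},
]
clean = to_silver(raw)
```

In the actual notebooks, the same intent is expressed declaratively with Spark DataFrame operations (casts plus `dropDuplicates`), and the result is written out as a Delta table in wwi_silver.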

Importing notebooks

The first step is to import notebooks using the following steps:

  1. Download the notebooks found in the ch3 folder of this chapter’s GitHub repo (https://github.com/PacktPublishing/Learn-Microsoft-Fabric/tree/main/ch3) to your local machine. If required, unzip or uncompress them.
  2. From the workload switcher located at the bottom left of the screen, select Data Engineering. Select Import notebook from the New section at the top of the Data Engineering experience landing page.
Figure 3.12 – The option to import notebooks

  3. Select Upload from the Import status pane that opens on the right-hand side of the screen. Select all three notebooks that were...

Analyze

Now that we have data integrated into the lakehouse and prepared for reporting, we’ll analyze this data to get insights. We will look at two methods: first, we will use Power BI to create visualizations (reports and dashboards), and then we will use the SQL endpoint to connect to the lakehouse and run analytical queries.

Power BI

Power BI is natively integrated within the whole Fabric experience; this native integration brings a unique mode of accessing the data (called Direct Lake, which we discussed in earlier chapters) from the lakehouse to provide the most performant query and reporting experience. Let’s create a report based on the data from the Gold zone:

  1. Open the wwi_gold lakehouse and click on SQL endpoint under the mode selection at the top right of the screen to switch the selected lakehouse to SQL endpoint mode.
Figure 3.23 – Switching to SQL endpoint mode

  2. Once you are in SQL endpoint...
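The SQL endpoint exposes the lakehouse’s Delta tables to standard T-SQL clients. To show the shape of an analytical query without a live Fabric connection, the sketch below runs an equivalent aggregate against an in-memory SQLite database; the table and column names (fact_sale, CityKey, TotalPrice) are hypothetical stand-ins for Gold-zone tables.

```python
# Stand-in for an analytical query against the wwi_gold SQL endpoint.
# SQLite is used here only so the example is self-contained; table and
# column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sale (CityKey INTEGER, TotalPrice REAL)")
conn.executemany(
    "INSERT INTO fact_sale VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)

# Revenue per city, highest first - the kind of aggregate you would run
# in the SQL endpoint's query editor or from any T-SQL client.
rows = conn.execute(
    "SELECT CityKey, SUM(TotalPrice) AS Revenue "
    "FROM fact_sale GROUP BY CityKey ORDER BY Revenue DESC"
).fetchall()
```

Against the real endpoint, the same statement would be issued over the endpoint’s SQL connection string rather than SQLite, but the query text itself is standard SQL.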

Orchestrate data ingestion and transformation flow and schedule notebooks and pipelines

Fabric provides flexibility in how you schedule your jobs. For example, you can schedule a notebook by clicking on the settings (cogwheel) icon at the top under the Home menu tab when the notebook is open, or by clicking on the ellipsis (…) next to the name of the notebook in the workspace item view and then clicking on the Setting menu.

On the Setting page, click on the Schedule tab and define the schedule for this notebook to be executed.

Figure 3.36 – Schedule a notebook

Furthermore, if you have multiple notebooks/jobs, some of which you would like to be executed in parallel while others in sequence, then you can create a data pipeline and define a schedule for when and how frequently this pipeline should be executed. Figure 3.37 shows an example pipeline that has three activities being executed in sequence (this is just one example; you might have...
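The parallel-then-sequential pattern a pipeline expresses can be sketched in plain Python: two independent jobs run concurrently, and a third runs only after both finish. The job functions here are stand-ins for Fabric notebook or pipeline activities, not real Fabric APIs.

```python
# Sketch of pipeline orchestration: two independent "notebook" jobs run in
# parallel, then a downstream job runs only after both complete. The job
# functions are placeholders for Fabric activities.
from concurrent.futures import ThreadPoolExecutor

log = []

def load_customers():
    log.append("customers")

def load_sales():
    log.append("sales")

def build_gold():
    log.append("gold")

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(load_customers), pool.submit(load_sales)]
    for f in futures:
        f.result()   # block until both parallel activities finish
build_gold()         # downstream activity runs only after the parallel stage
```

In Fabric itself, you express the same dependencies visually by wiring activities together on the pipeline canvas and attaching a schedule to the pipeline rather than to each notebook individually.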

Data meshes in Fabric – a primer

A data mesh is a federated data architecture that emphasizes decentralizing data across business functions or domains such as marketing, sales, human resources, and more. It organizes and manages data logically, enabling more targeted and efficient use and governance of data across the organization. This gives more ownership to the producers of a given dataset, encouraging a shift away from giant, monolithic enterprise-wide data architectures.

Important note

The term data mesh was coined by Zhamak Dehghani (https://martinfowler.com/articles/data-mesh-principles.html) and is founded on four principles: “domain-driven ownership of data”, “data as a product”, “self-serve data infrastructure platform”, and “federated governance”. A detailed discussion about data meshes is out of the scope of this chapter; however, you can learn more about it at https...

Summary

Since a lakehouse based on the Medallion architecture combines the best of data lakes and data warehouses by breaking down silos and removing data duplication, it is emerging as the de facto standard for building data platform architectures. Microsoft Fabric, with its native capabilities, makes it easy to build data analytics systems based on lakehouses.

In this chapter, you learned about creating an end-to-end lakehouse-based data analytics system. You learned about the different components in this architecture pattern and how to implement them quickly to derive business value. Further, you learned about ingesting data from a data source into your lakehouse using pipelines, transforming this data with notebooks/Spark, and then using Power BI—with its new Direct Lake mode—to create reports and dashboards. You also learned about the capabilities that Fabric provides to build a decentralized data architecture with data meshes.

In the next chapter, you...
