Modern information systems work with massive amounts of data, arriving in a constant flow that grows at an exponential rate every day. This flow comes from different sources, including sales information, transactional data, social media, and more. Organizations have to transform and aggregate this information to develop applications that extract value from the data.
Apache Spark was developed to process this massive amount of data. Azure Databricks is built on top of Apache Spark, abstracting most of the complexities of implementing it, and with all the benefits that come with integration with other Azure services. This book aims to provide an introduction to Azure Databricks and explore the applications it has in modern data pipelines to transform, visualize, and extract insights from large amounts of data in a distributed computation environment.
In this introductory chapter, we will explore these topics:
These concepts will help us to later understand all of the aspects of the execution of our jobs in Azure Databricks and to move easily between all its assets.
To understand the topics presented in this book, you must be familiar with data science and data engineering terms and have a good understanding of Python, the main programming language used in this book, although we will also use SQL to query views and tables.
In terms of the resources required to execute the steps in this section and those presented in the rest of the book, you will need an Azure account with an active subscription. Bear in mind that this is a paid service, so you will have to enter your credit card details to create an account. When you create a new account, you will receive a certain amount of free credit, but some options are limited to premium users. Always remember to stop all the services when you are not using them.
Apache Spark was created to work with the huge amount of information available to modern consumers. It is a distributed, cluster-based computing system and a highly popular big data framework, providing speed and ease of use, and it includes APIs that support the following use cases:
It can run very quickly on large datasets thanks to its in-memory processing design that allows it to run with very few read/write disk operations. It has a SQL-like interface and its object-oriented design makes it very easy to understand and write code for; it also has a large support community.
Despite its numerous benefits, Apache Spark has its limitations. These limitations include the following:
Let's look at the solution to these issues: Azure Databricks.
With these and other limitations in mind, Databricks was designed. It is a cloud-based platform that uses Apache Spark as a backend and builds on top of it, to add features including the following:
Built as a joint effort by Microsoft and the team that started Apache Spark, Azure Databricks also allows easy integration with other Azure products, such as Blob Storage and SQL databases, alongside AWS services, including S3 buckets. It has a dedicated support team that assists the platform's clients.
Databricks streamlines and simplifies the setup and maintenance of clusters while supporting different languages, such as Scala and Python, making it easy for developers to create ETL pipelines. It also allows data teams to have real-time, cross-functional collaboration thanks to its notebook-like integrated workspace, while keeping a significant amount of backend services managed by Azure Databricks. Notebooks can be used to create jobs that can later be scheduled, meaning that locally developed notebooks can be deployed to production easily. Other features that make Azure Databricks a great tool for any data team include the following:
Let's examine the architecture of Databricks next.
Each Databricks cluster is a Databricks application composed of a set of pre-configured VMs running as Azure resources that are managed as a single group. You can specify the number and type of VMs to use, while Databricks manages the other parameters in the backend. The managed resource group is deployed and populated with a virtual network (VNet), a network security group that manages the permissions of the resources, and a storage account that will be used, among other things, as the Databricks filesystem. Once everything is deployed, users can manage these clusters through the Azure Databricks UI. All the metadata used is stored in a geo-replicated and fault-tolerant Azure database. This can all be seen in Figure 1.1:
The immediate benefit this architecture gives to users is that there is a seamless connection with Azure, allowing them to easily connect Azure Databricks to any resource within the same Azure account and have a centrally managed Databricks from the Azure control center with no additional setup.
As mentioned previously, Azure Databricks is a managed application on the Azure cloud composed of a control plane and a data plane. The control plane is hosted on the Azure cloud and runs services such as cluster management and job services. The data plane includes the aforementioned VNet, the network security group (NSG), and the storage account known as DBFS.
You could also deploy the data plane in a customer-managed VNet to allow data engineering teams to build and secure the network architecture according to their organization policies. This is called VNet injection.
Now that we have seen how everything is laid out under the hood, let's discuss some of the core concepts behind Databricks.
Before diving into the specifics of how to create our cluster and start working with Databricks, there are a certain number of concepts with which we must familiarize ourselves first. Together, these define the fundamental tools that Databricks provides to the user and are available both in the web application UI as well as the REST API:
There is a distinction between all-purpose clusters and job clusters. All-purpose clusters are where we work collaboratively and interactively using notebooks, whereas job clusters are where we execute automated and more specific jobs. The way these clusters are created also differs: the former can be created using the UI, CLI, or REST API, while the latter is created by the job scheduler to run a specific job and is terminated when the job is done.
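As an illustration of the programmatic route, the following Python sketch creates an all-purpose cluster through the Clusters REST API. The workspace URL, personal access token, runtime version, and node type are placeholders that you would replace with values from your own workspace:

import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

cluster_spec = {
    "cluster_name": "interactive-analysis",
    "spark_version": "7.3.x-scala2.12",   # example runtime version
    "node_type_id": "Standard_DS3_v2",    # example Azure VM type
    "num_workers": 2,
    "autotermination_minutes": 60,        # terminate after an hour of inactivity
}

# POST to the Clusters API; the response contains the new cluster_id on success
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(response.json())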
This section focused on the core pieces of Azure Databricks. In the next section, you will learn how to interact with Azure Databricks through the workspace, which is the place where we interact with our assets.
The Azure Databricks workspace is where you can manage objects such as notebooks, libraries, and experiments. It is organized into folders and it also provides access to data, clusters, and jobs:
Access to and control of a workspace and its assets can be managed through the UI, the CLI, or the API. We will focus on using the UI.
In the Azure Databricks workspace, you can manage different assets, most of which we have discussed in the terminology. These assets are as follows:
In the following sections, we will dive deeper into how to work with folders and other workspace objects. The management of these objects is central to running our tasks in Azure Databricks.
All of our static assets within a workspace are stored in folders. The stored assets can be notebooks, libraries, experiments, and other folders. Different icons are used to represent folders, notebooks, directories, and experiments. Click a directory to expand the drop-down list of items:
Clicking on the drop-down arrow in the top-right corner will unfold the menu, allowing the user to perform actions on that specific folder:
The Azure Databricks workspace has three special folders that you cannot rename or move into another folder. These special folders are as follows:
The Workspace root folder is a folder that contains all of your static assets. To navigate to this folder, click the workspace or home icon and then click the go back icon:
Within the Workspace root folder, you can select either Shared or Users. The former is for sharing objects with other users in your organization, and the latter contains a folder for each individual user.
By default, the Workspace root folder and all of its contents are available for all users, but you can control and manage access by enabling workspace access control and setting permissions.
Within your organization, every user has their own directory, which will be their root directory:
Objects in a user folder will be private to a specific user if workspace access control is enabled. If a user's permissions are removed, they will still be able to access their home folder.
To perform an action on a workspace object, right-click the object or click the drop-down icon on the right side of the object to open the drop-down menu:
If the object is a folder, from this menu, the user can do the following:
If it is an object, the user can choose to do the following:
When the user deletes an object, this object goes to the Trash folder, in which everything is deleted after 30 days. Objects can be restored from the Trash folder or be eliminated permanently.
Now that you have learned how to interact with Azure Databricks assets, we can start working with Azure Databricks notebooks to manipulate data, create ETL pipelines, run ML experiments, and more.
In this section, we will describe the basics of working with notebooks within Azure Databricks.
There are different ways to interact with notebooks in Azure Databricks: through the UI, using CLI commands, or by means of the workspace API. We will focus on the UI for now:
Note
If you have an Azure Databricks Premium plan, you can apply access control to the workspace assets.
Azure Databricks supports several notebook formats, which can be scripts in one of the supported languages (Python, Scala, SQL, and R), HTML documents, DBC archives (Databricks native file format), IPYNB Jupyter notebooks, and R Markdown documents.
We can import notebooks into the Azure Databricks workspace by clicking on the drop-down menu and selecting Import. After this, we can specify either a file or a URL that contains the file in one of the supported formats and then click Import:
You can export a notebook in one of the supported file formats by clicking on the File button in the notebook toolbar and then selecting Export. Bear in mind that the results of each cell will be included if you have not cleared them.
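Importing can also be done without the UI. As a hedged illustration, the following Python snippet imports a local Python script as a notebook through the Workspace REST API; the workspace URL, token, and paths are placeholders:

import base64
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

# Read the local file and base64-encode it, as required by the API
with open("my_notebook.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

requests.post(
    f"{DATABRICKS_HOST}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "path": "/Users/someone@example.com/my_notebook",  # destination in the workspace
        "format": "SOURCE",   # other formats include HTML, JUPYTER, and DBC
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)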
To be able to work, a notebook needs to be attached to a running cluster. We will now learn about how notebooks connect to the clusters and how to manage these executions.
When a notebook is attached to a cluster, a read-eval-print-loop (REPL) environment is created. This environment is specific to each one of the supported languages and is contained in an execution context.
There is a limit of 145 execution contexts running in a single cluster. Once that number is reached, you cannot attach any more notebooks to that cluster or create a new execution context.
If an execution context has passed a certain time threshold without any executions, it is considered idle and automatically detached from the notebook. This threshold is, by default, 25 hours.
One thing to consider is that when a cluster reaches its maximum context limit, Azure Databricks will remove the least recently used idle execution contexts. This is called an eviction.
If a notebook gets evicted from the cluster it was attached to, the UI will display a message:
We can configure this behavior when creating the cluster or we can disable it by setting the following:
spark.databricks.chauffeur.enableIdleContextTracking false
Notebooks are attached to a cluster by selecting one from the drop-down menu in the notebook toolbar.
A notebook attached to a running cluster has the following Spark environment variables by default:
We can check the Spark version running in the cluster where the notebook is attached by running the following Python code in one of the cells:
spark.version
We can also see the current Databricks runtime version with the following command:
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
These properties are used by the Clusters and Jobs APIs to communicate with each other.
On the cluster details page, the Notebooks tab will show all the notebooks attached to the cluster, as well as their status and the last time they were used:
Attaching a notebook to a cluster is necessary to make it work; otherwise, we won't be able to execute the code in it.
Notebooks are detached from a cluster by clicking on the currently attached cluster and selecting Detach:
This causes the notebook to lose all the values stored as variables. It is good practice to always detach notebooks from the cluster once we have finished working on them; otherwise, a process still running in the notebook could prevent the cluster from auto-stopping and cause undesired costs.
As mentioned before, notebooks can be scheduled to be executed periodically. To schedule a notebook job to run periodically, click the Schedule button at the top right of the notebook toolbar.
Now, we'll look at how you can use a notebook.
Notebooks have a toolbar that contains information on the cluster to which they are attached and lets us perform actions such as exporting the notebook or changing the default language (depending on the Databricks runtime version):
This toolbar helps us to navigate the general options in our notebook and makes it easier to manage how we interact with the computation cluster.
Cells have code that can be executed:
At the top-left corner of a cell, in the cell actions, you have the following options: Run this cell, Dashboard, Edit, Hide, and Delete:
Specific cells can be run from the cell actions toolbar. To run several cells, we can choose between Run all, Run all above, or Run all below; we can also select Run All, Run All Above, or Run All Below from the Run option in the notebook toolbar. Bear in mind that Run All Below includes the cell you are currently in.
The default language for each notebook is shown in parentheses next to the notebook name, which, in the following example, is SQL:
If you click the name of the language in parentheses, you will be prompted by a dialog box in which you can change the default language of the notebook:
When the default language is changed, magic commands will be added to the cells that are not in the new default language in order to keep them working.
The language can also be specified in each cell by using magic commands. Four magic commands are supported for language specification: %python, %r, %scala, and %sql.
There are also other magic commands, such as %sh, which allows you to run shell code; %fs, to use dbutils filesystem commands; and %md, to specify Markdown for including comments and documentation. We will look at this in a bit more detail.
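To illustrate, here are a few hedged examples of cells using these magic commands; each line would live in its own notebook cell, and the paths shown are purely illustrative:

%sh ls /tmp
%fs ls /FileStore
%md ### Notes about this step

The first cell runs a shell command on the driver node, the second lists a DBFS directory through the dbutils filesystem shorthand, and the third renders a Markdown heading.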
Markdown is a lightweight markup language with plain text-formatting syntax, often used for formatting readme files, which allows the creation of rich text using plain text.
As we have seen before, Azure Databricks allows Markdown to be used for documentation by using the %md magic command. The markup is then rendered into HTML with the desired formatting. For example, the next code is used to format text as a title:
%md # Hello This is a Title
It is rendered as an HTML title:
Documentation blocks are one of the most important features of Azure Databricks notebooks. They allow us to state the purpose of our code and how we interpret our results.
Users can add comments to specific portions of code by highlighting it and clicking on the comment button in the bottom-right corner of the cell:
This will prompt a textbox in which we can place comments to be reviewed by other users. Afterward, the commented text will be highlighted:
Comments allow us to propose changes or require information on specific portions of the notebook without intervening in the content.
You can download the tabular results from a cell to your local machine by clicking on the download button at the bottom of a cell:
By default, Azure Databricks limits you to viewing 1,000 rows of a DataFrame, but if there is more data present, we can click on the drop-down icon and select Download full results to see more.
Formatting SQL code can take up a lot of time, and enforcing standards across notebooks can be difficult.
Azure Databricks has a functionality for formatting SQL code in notebook cells, so as to reduce the amount of time dedicated to formatting code, and also to help in applying the same coding standards in all notebooks. To apply automatic SQL formatting to a cell, you can select it from the cell context menu. This is only applicable to SQL code cells:
Applying the autoformatting of SQL code is a feature that can improve the readability of our code, and reduce possible mistakes due to bad formatting.
In this section, we will dive into how to manage data in Azure Databricks in order to perform analytics, create ETL pipelines, train ML algorithms, and more. First, we will briefly describe types of data in Azure Databricks.
In Azure Databricks, a database is a collection of tables, and a table is a collection of structured data. Users can work with these tables using all of the operations supported by Apache Spark DataFrames, and can query them using the Spark API and Spark SQL.
These tables can be either global or local. Global tables are registered in the Hive metastore and are accessible to all clusters, while local tables are not stored in the metastore and are not accessible from other clusters.
Tables can be populated using files in the DBFS or with data from all of the supported data sources.
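As a brief illustration, assuming a CSV file has already been uploaded to the FileStore (the path, column headers, and table names below are placeholders), the following Python code shows the difference between registering a global table and a local temporary view:

# spark is the SparkSession that is available by default in Databricks notebooks
df = spark.read.option("header", "true").csv("/FileStore/tables/sales.csv")

# Global table: registered in the Hive metastore and visible to all clusters
df.write.saveAsTable("sales")

# Local table (temporary view): only visible to the current notebook session
df.createOrReplaceTempView("sales_temp")

# Both can then be queried with Spark SQL
spark.sql("SELECT COUNT(*) FROM sales").show()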
Tables related to the cluster you are currently using can be viewed by clicking on the data icon button in the sidebar. The Databases folder will display the list of tables in each of the selected databases:
Users can select a different cluster by clicking on the drop-down icon at the top of the Databases folder and selecting the cluster:
We can have several queries on a cluster, each with its own filesystem. This is very important when we reference data in our notebooks.
Local files can be uploaded to the Azure Databricks filesystem using the UI.
Data can be imported into the Azure Databricks DBFS to be stored in the FileStore using the UI. One way to do this is to go to the Upload Data UI and select the files to be uploaded as well as the DBFS target directory:
Another option available to you for uploading data to a table is to use the Create Table UI, accessible in the Import & Explore Data box in the workspace:
For production environments, it is recommended to use the DBFS CLI, the DBFS API, or the Databricks filesystem utilities (dbutils.fs).
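As a quick, hedged sketch of those filesystem utilities (the paths are illustrative and dbutils is available by default inside Azure Databricks notebooks), typical operations look like this:

# List the contents of a DBFS directory
display(dbutils.fs.ls("/FileStore/tables"))

# Copy a file from one DBFS location to another
dbutils.fs.cp("/FileStore/tables/sales.csv", "/FileStore/backup/sales.csv")

# Write a small text file directly to DBFS (the last argument overwrites if it exists)
dbutils.fs.put("/FileStore/notes/readme.txt", "uploaded from a notebook", True)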
Users can create tables either programmatically using SQL, or via the UI, which creates global tables. By clicking on the data icon button in the sidebar, you can select Add Data in the top-right corner of the Databases and Tables display:
After this, you will be prompted by a dialog box in which you can upload a file to create a new table, selecting the data source and cluster, the path to where it will be uploaded into the DBFS, and also be able to preview the table:
Creating tables through the UI or the Add data option are two of the many ways we have to ingest data into Azure Databricks.
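Tables can also be created programmatically, as mentioned earlier. The following hedged sketch registers a global table on top of a CSV file previously uploaded to the FileStore; the file path, table name, and options are illustrative:

# Create a global table on top of a CSV file already present in DBFS
spark.sql("""
    CREATE TABLE IF NOT EXISTS diamonds
    USING CSV
    OPTIONS (path '/FileStore/tables/diamonds.csv', header 'true', inferSchema 'true')
""")

# Preview the newly created table
display(spark.sql("SELECT * FROM diamonds LIMIT 5"))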
Users can preview the contents of a table by clicking the name of the table in the Tables folder. This will show a view of the table where we can see the table schema and a sample of the data that is contained within:
These table details allow us to plan transformations in advance to fit data to our needs.
In this section, we will briefly describe how to manage Azure Databricks clusters, the computational backbone of all of our operations. We will describe how to display information on clusters, as well as how to edit, start, terminate, delete, and monitor logs.
To display the clusters in your workspace, click the clusters icon in the sidebar. You will see the Cluster page, which displays clusters in two tabs: All-Purpose Clusters and Job Clusters:
On top of the common cluster information, the All-Purpose Clusters tab displays the number of notebooks attached to each cluster.
Actions such as terminate, restart, clone, permissions, and delete can be accessed at the far right of an all-purpose cluster:
Cluster actions allow us to operate on our clusters quickly, directly from our notebooks.
Apart from creating a new cluster, you can also start a previously terminated cluster, which recreates it with its original configuration. Clusters can be started from the cluster list, from the cluster detail page, or from the attached-cluster drop-down in a notebook:
You also have the option of using the API to programmatically start a cluster.
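For example, a minimal Python sketch using the Clusters API start endpoint might look like this (the workspace URL, token, and cluster ID are placeholders):

import requests

# Start a previously terminated cluster through the Clusters REST API
requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.0/clusters/start",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"cluster_id": "<cluster-id>"},
)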
Each cluster is uniquely identified and when you start a terminated cluster, Azure Databricks automatically installs libraries and reattaches notebooks to it.
To save resources, you can terminate a cluster. The configuration of a terminated cluster is stored so that it can be reused later on.
Clusters can be terminated manually or automatically following a specified period of inactivity:
It's good to bear in mind that inactive clusters will be terminated automatically.
Deleting a cluster terminates the cluster and removes its configuration. Use this carefully because this action cannot be undone.
To delete a cluster, click the delete icon in the cluster actions on the Job Clusters or All-Purpose Clusters tab:
You can also invoke the permanent delete API endpoint to programmatically delete a cluster.
Detailed information on Spark jobs is displayed in the Spark UI, which can be accessed from the cluster list or the cluster details page. The Spark UI displays the cluster history for both active and terminated clusters:
Cluster information gives us insight into the progress of our processes and helps identify possible bottlenecks that could point us to optimization opportunities.
Azure Databricks provides three kinds of logging of cluster-related activity:
Azure Databricks provides cluster event logs with information on life cycle events that are manually or automatically triggered, such as creation and configuration edits. There are also logs for the Apache Spark driver and workers, as well as cluster init script logs.
Events are stored for 60 days, which is comparable to other data retention times in Azure Databricks.
To view a cluster event log, click on the Clusters button in the sidebar, click on the cluster name, and then finally click on the Event Log tab:
Cluster events provide us with specific information on the actions that were taken on the cluster during the execution of our jobs.
Azure Databricks allows the user to apply access control to manage access to workspace objects, clusters, pools, and data tables. Access control lists are managed by admin users and by users to whom they have delegated permissions.
By default, in Azure Databricks, all users can create or modify clusters. Before using cluster access control, an admin user must enable it. After this, there are two types of cluster permissions, which are as follows:
When cluster access control is enabled, only admins and users with Can Manage permissions can configure, create, terminate, or delete clusters.
Cluster access control can be configured by clicking on the cluster button in the sidebar and, in the Actions options, selecting the Permissions button. This will prompt a permission dialog box where users can do the following:
These options are visible in Figure 1.39:
Cluster permissions allow us to enforce fine-grained control over the computational resources used in our projects.
Folders have five levels of permissions: No Permissions, Read, Run, Edit, and Manage. Any notebook or experiment will inherit the folder permissions that contain them.
Regardless of the workspace access control setting, the following permissions are maintained:
When there is no workspace access control, users can only edit items in their Workspace folder.
With workspace access control enabled, the following permissions exist:
Understanding these permissions helps us to know in advance how possible changes in these policies could affect how users interact with the organization's data.
Notebooks have the same five permission levels as folders: No Permissions, Read, Run, Edit, and Manage.
Users can configure notebook permissions by clicking on the Permissions button in the notebook context bar. Select the folder and then click on Permissions from the drop-down menu:
From there, you can grant permissions to users or groups as well as edit existing permissions:
Access control on notebooks can easily be applied in this way by selecting one of the options from the drop-down menu.
You can assign six permission levels to MLflow Models registered in the MLflow Model Registry: No Permissions, Read, Edit, Manage Staging Versions, Manage Production Versions, and Manage.
Regardless of the workspace access control setting, the following permissions are maintained:
When there is no workspace access control, users can manage any of the models in the registry.
With workspace access control enabled, the following permissions exist:
These options are applied to MLflow Models created in Azure Databricks.
One thing to keep in mind is that only administrators belong to the admins group, which has Manage permissions, while the rest of the users belong to the all users group.
MLflow Model permissions can be modified by clicking on the model's icon in the sidebar, selecting the model name, clicking on the drop-down icon to the right of the model name, and finally selecting Permissions. This will show us a dialog box from which we can select specific users or groups and add specific permissions:
You can update the permissions of a user or group by selecting the new permission from the Permission drop-down menu:
By selecting one of these options, we can control how MLflow experiments interact with our data and which users can create models that work with it.
In this chapter, we have tried to cover all the main aspects of how Azure Databricks works. Some of the things we have discovered include how notebooks can be created to execute code, how we can import data to use, how to create and manage clusters, and so on. This is important because when creating ETLs and ML experiments in Azure Databricks within an organization, aside from how to code the ETL in our notebooks, we will need to know how to manage the data and computational resources required, how to share assets, and how to manage the permissions of each one of them.
In the next chapter, we will apply this knowledge to explore in more detail how to create and manage the resources needed to work with data in Azure Databricks, and learn more about custom VNets and the different alternatives that we have in order to interact with them, either through the Azure Databricks UI or the CLI tool.