You're reading from Distributed Data Systems with Azure Databricks

Product type Book

Published in May 2021

Publisher Packt

ISBN-13 9781838647216

Pages 414 pages

Edition 1st Edition

Languages

Python

Concepts

Data Science

Author (1):

Alan Bernardo Palacio

Exploring data management

In this section, we will look at how to manage data in Azure Databricks in order to perform analytics, create ETL pipelines, train ML algorithms, and more. First, we will briefly describe the types of data available in Azure Databricks.

Databases and tables

In Azure Databricks, a database is a collection of tables, and a table is a collection of structured data. Users can work with these tables using all of the operations supported by Apache Spark DataFrames, and can query them using the Spark API and Spark SQL.

These tables can be either global or local. Global tables are accessible from all clusters and are registered in the Hive metastore. Local tables are not accessible from other clusters and are not registered in the Hive metastore.

Tables can be populated using files in DBFS or with data from any of the supported data sources.
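As a minimal sketch of the difference between the two table types, the following function registers the same DataFrame as a global table and as a local table (a temporary view), and then queries the global table with Spark SQL. It assumes that it runs in an Azure Databricks notebook where the `spark` session is already defined. The function name and the table names `demo_global` and `demo_local` are hypothetical and used only for illustration.

```python
def register_tables(spark, global_name="demo_global", local_name="demo_local"):
    """Register the same data as a global table and as a local table."""
    # Small illustrative DataFrame with a single `id` column (values 0 to 4).
    df = spark.range(5)

    # Global table: registered in the metastore, so other clusters can see it.
    df.write.mode("overwrite").saveAsTable(global_name)

    # Local table (temporary view): not registered in the metastore and
    # scoped to the current Spark session.
    df.createOrReplaceTempView(local_name)

    # Either table can now be queried with Spark SQL or the DataFrame API.
    return spark.sql(f"SELECT COUNT(*) AS n FROM {global_name}")
```

In a notebook, calling `register_tables(spark).show()` would display the row count of the global table, while `spark.table("demo_local")` would return the local table as a DataFrame.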

Viewing databases and tables

Tables related to the cluster you are currently using can be viewed by clicking on the data icon button in the sidebar. The Databases folder displays the list of tables in the selected database:

Figure 1.25 – Default tables

Users can select a different cluster by clicking on the drop-down icon at the top of the Databases folder and selecting the cluster:

Figure 1.26 – Selecting databases in a different cluster

We can have several clusters in a workspace, each with its own local filesystem. This is very important when we reference data in our notebooks.

Importing data

Local files can be uploaded to the Azure Databricks filesystem (DBFS) using the UI, where they are stored in the FileStore. To do this, you can go to the Upload Data UI and select the files to be uploaded as well as the DBFS target directory:

Figure 1.27 – Uploading the data UI

Another option available to you for uploading data to a table is to use the Create Table UI, accessible in the Import & Explore Data box in the workspace:

Figure 1.28 – Creating a table UI in Import & Explore Data

For production environments, it is recommended to use the DBFS CLI, DBFS API, or the Databricks filesystem utilities (dbutils.fs).
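For the `dbutils.fs` route, a minimal sketch might look like the following. It assumes a notebook where `dbutils` is already defined. The helper names and the `dbfs:/FileStore/tables/` target directory are hypothetical choices for illustration. The first helper relies on DBFS being exposed on the driver's local filesystem under `/dbfs`, so that a `dbfs:/` URI can be turned into a path usable by ordinary Python file APIs.

```python
def dbfs_to_local_path(dbfs_uri):
    """Convert 'dbfs:/some/file' to the '/dbfs/some/file' local path."""
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"Expected a path starting with {prefix!r}: {dbfs_uri!r}")
    return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")


def copy_into_filestore(dbutils, source_uri, file_name):
    """Copy a file into the FileStore with dbutils.fs and return its new URI."""
    # Hypothetical target directory; adjust it to your own layout.
    target_uri = f"dbfs:/FileStore/tables/{file_name}"
    dbutils.fs.cp(source_uri, target_uri)
    return target_uri
```

Passing `dbutils` in as an argument keeps the helpers easy to exercise outside a notebook, for example with a stand-in object in a unit test.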

Creating a table

Users can create tables either programmatically using SQL, or via the UI, which creates global tables. By clicking on the data icon button in the sidebar, you can select Add Data in the top-right corner of the Databases and Tables display:

Figure 1.29 – Adding data to create a new table

After this, a dialog box will prompt you to upload a file to create a new table. Here, you can select the data source and cluster, set the DBFS path to which the file will be uploaded, and preview the table:

Figure 1.30 – Creating a new table UI

Creating tables through the Create Table UI and through the Add Data option are two of the many ways we have to ingest data into Azure Databricks.
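The programmatic route using SQL can be sketched as follows: a small helper builds a Spark SQL `CREATE TABLE` statement over a CSV file that is already stored in DBFS, and a second helper runs it. This assumes a notebook where `spark` is already defined. The function names, the table name, and the file path in the usage note are hypothetical, and the helper does no escaping, so it is only suitable for trusted input.

```python
def create_table_sql(table_name, csv_path):
    """Build a CREATE TABLE statement for a CSV file with a header row."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING CSV "
        f"OPTIONS (path '{csv_path}', header 'true', inferSchema 'true')"
    )


def create_table_from_csv(spark, table_name, csv_path):
    """Run the statement on the given Spark session and return it."""
    statement = create_table_sql(table_name, csv_path)
    spark.sql(statement)
    return statement
```

For example, `create_table_from_csv(spark, "sales", "/FileStore/tables/sales.csv")` would register a table named `sales` in the metastore, making it a global table like those created through the UI.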

Table details

Users can preview the contents of a table by clicking the name of the table in the Tables folder. This opens a view of the table showing its schema and a sample of the data it contains: