
You're reading from  Distributed Data Systems with Azure Databricks

Product type: Book
Published in: May 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781838647216
Edition: 1st Edition
Author (1)
Alan Bernardo Palacio

Alan Bernardo Palacio is a data scientist and engineer with extensive experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst and Young and Globant, and now holds a data engineer position at Ebiquity Media, where he helps the company create a scalable data pipeline. Alan graduated with a Mechanical Engineering degree from the National University of Tucuman in 2015, took part in startups as a founder, and later earned a Master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now works and resides in the Netherlands.

What this book covers

Chapter 1, Introduction to Azure Databricks, takes you through the core functionalities of Databricks, including how to interact with the workspace environment, a quick look at the main applications, and how Python users will work with the tool. It covers topics such as the workspace, the interface, computation management, and Databricks notebooks.

Chapter 2, Creating an Azure Databricks Workspace, teaches you how to apply the previously introduced concepts using the different tools that Azure provides for interacting with the workspace. This includes using PowerShell and the Azure CLI to manage Databricks resources.

Chapter 3, Creating ETL Operations with Azure Databricks, shows you how to manage different data sources, transform them, and create an entire event-driven ETL.

Chapter 4, Delta Lake with Azure Databricks, explores Delta Lake and how to implement it for various operations.

Chapter 5, Introducing Delta Engine, explores Delta Engine and shows you how to use it alongside Delta Lake to create efficient ETLs in Databricks.

Chapter 6, Introducing Structured Streaming, provides explanations of notebooks, details on how to use specific types of streaming sources and sinks and how to put streaming into production, and notebooks demonstrating example use cases.

Chapter 7, Using Python Libraries in Azure Databricks, explores the nuances of working with Python and introduces core concepts regarding models and data that are studied in more detail later on.

Chapter 8, Databricks Runtime for Machine Learning, is a deep dive into developing classic ML algorithms to train and deploy models based on tabular data, exploring libraries and algorithms along the way. The examples focus on the particularities and advantages of using Databricks for ML.

Chapter 9, Databricks Runtime for Deep Learning, is a deep dive into developing classic DL algorithms to train and deploy models based on unstructured data, exploring libraries and algorithms along the way. The examples focus on the particularities and advantages of using Databricks for DL.

Chapter 10, Model Tracking and Tuning in Azure Databricks, focuses on model tuning, deployment, and control using Databricks functionalities such as AutoML and Delta Lake, in conjunction with popular libraries such as TensorFlow.

Chapter 11, Managing and Serving Models with MLflow and MLeap, explores in more detail the MLflow library, an open source platform for managing the end-to-end ML life cycle. This library lets you track experiments, record and compare parameters, centralize model storage, and more. You will learn how to use it in combination with what you learned in the previous chapters.

Chapter 12, Distributed Deep Learning in Azure Databricks, demonstrates how to use Horovod to make distributed DL faster by taking single-GPU training scripts and scaling them to train across many GPUs in parallel.
