You're reading from Distributed Data Systems with Azure Databricks

Product type: Book
Published in: May 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781838647216
Edition: 1st
Author: Alan Bernardo Palacio

Alan Bernardo Palacio is a data scientist and engineer with vast experience in different engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in several industries. He has worked for companies such as Ernst & Young and Globant, and now holds a data engineer position at Ebiquity Media, where he helps the company create a scalable data pipeline. Alan graduated with a mechanical engineering degree from the National University of Tucumán in 2015, founded startups, and later earned a master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now lives and works in the Netherlands.

Chapter 12: Distributed Deep Learning in Azure Databricks

In the previous chapter, we learned how to effectively serialize machine learning pipelines and manage the full development life cycle of machine learning models in Azure Databricks. This chapter focuses on how we can apply distributed training in Azure Databricks.

Distributed training of deep learning models is a technique in which the training process is distributed across the workers of a cluster of computers. This process is not trivial, and its implementation requires us to fine-tune the way in which the workers communicate and transmit data between them; otherwise, distributed training can take longer than single-machine training. Azure Databricks Runtime for Machine Learning includes Horovod, a library that solves most of the issues that arise from the distributed training of deep learning algorithms. We will also show how we can leverage the native Spark support of the TensorFlow machine learning framework...

Technical requirements

In this chapter, we will use concepts and apply tools related to common data science and optimization tasks. We will assume that you already have a good understanding of concepts such as neural networks and hyperparameter tuning, as well as general knowledge of machine learning frameworks such as TensorFlow, PyTorch, or Keras.

In this chapter, we will discuss various techniques for distributing and optimizing the training of deep learning models in Azure Databricks, so if you are not familiar with these terms, it is advisable to review the TensorFlow documentation on how neural networks are designed and trained.

To work through the examples given in this chapter, you will need the following:

  • An Azure Databricks subscription.
  • An Azure Databricks notebook attached to a running cluster with Databricks Runtime ML version 7.0 or higher.

Distributed training for deep learning

Deep neural networks (DNNs) have driven the advancement of artificial intelligence (AI) over the last decades in areas such as computer vision and natural language processing. They are applied every day to solve challenges in diverse use cases.

In order to scale the performance of models, it is necessary to develop complex model architectures with millions of trainable parameters, which makes the computation required for training a resource-intensive operation. As the amount of data available to train models increases, we need to scale up the training pipelines of deep learning models in order to make use of all this available data.

Commonly, in order to train a DNN, we need to follow three basic steps, which are listed here (a minimal sketch of these steps follows the list):

  1. Pass the data through the layers of the network to compute the model loss in an operation called forward propagation.
  2. Backpropagate this loss from the output layer to the first layer in order to compute the gradients...
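To make these steps concrete, here is a minimal single-machine TensorFlow sketch of one training iteration, including the weight update that completes the loop; the tiny model and random mini-batch are placeholders for illustration only:

```python
import tensorflow as tf

# Toy stand-ins: a tiny model, a random mini-batch, and an optimizer.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()
x = tf.random.normal((32, 10))  # one mini-batch of 32 examples
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    predictions = model(x, training=True)  # step 1: forward propagation
    loss = loss_fn(y, predictions)         # compute the model loss
# Step 2: backpropagate the loss to compute the gradients.
grads = tape.gradient(loss, model.trainable_variables)
# Step 3: update the model weights using the gradients.
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```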

Using the Horovod distributed learning library in Azure Databricks

horovod is a library for distributed deep learning training. It supports commonly used frameworks such as TensorFlow, Keras, and PyTorch. As mentioned before, it is based on the tensorflow-allreduce library and implements the ring allreduce algorithm in order to ease the migration from single-GPU (graphics processing unit) training to distributed training across multiple GPUs.

In order to do this, we adapt a single-GPU training script of a deep learning model to use the horovod library during the training process. Once we have adapted the script, it can run on single or multiple GPUs without changes to the code.
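As a rough sketch of what those adaptations look like for a Keras script (following the canonical Horovod recipe; the toy data and model here are placeholder assumptions, not the book's example):

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

def train():
    hvd.init()  # 1. Initialize Horovod (one process per GPU/CPU slot).

    # 2. Pin this process to its own GPU, if any are available.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder data and model standing in for a real single-GPU script.
    x = np.random.rand(1024, 20).astype('float32')
    y = np.random.randint(0, 2, size=(1024,))
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])

    # 3. Scale the learning rate by the worker count and wrap the optimizer
    #    so gradients are averaged across workers via ring allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # 4. Broadcast rank 0's initial weights so every worker starts identically.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)
```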

The horovod library uses a data-parallel strategy, distributing training across multiple GPUs efficiently by implementing the ring allreduce algorithm to overcome communication bottlenecks.
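To build intuition for what ring allreduce computes, here is a toy single-process simulation in pure NumPy (not Horovod's actual implementation): every "worker" ends up holding the element-wise sum of all workers' gradient vectors, while each step only transfers one chunk between ring neighbors.

```python
import numpy as np

def ring_allreduce(tensors):
    """Toy simulation: sum the workers' vectors via ring allreduce."""
    n = len(tensors)  # number of "workers" in the ring
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1 (reduce-scatter): after n-1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n  # chunk that worker i passes to its neighbor
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2 (allgather): circulate the completed chunks around the ring.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n  # completed chunk worker i passes on
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]

grads = [np.arange(6.0) + i for i in range(3)]  # 3 workers' gradients
print(ring_allreduce(grads)[0])  # sum: [ 3.  6.  9. 12. 15. 18.]
```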

It is implemented in a way that each GPU gets a mini-batch of data...
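On Azure Databricks, a training function adapted in this way is typically launched with HorovodRunner, which is included in Databricks Runtime ML. A minimal sketch, assuming a train function like the one shown earlier:

```python
from sparkdl import HorovodRunner

# np=2 launches two Horovod processes on the cluster's workers; a negative
# np (for example, np=-1) instead runs the processes locally on the driver,
# which is convenient for debugging before scaling out.
hr = HorovodRunner(np=2)
hr.run(train)
```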

Using the Spark TensorFlow Distributor package

One of the most commonly used frameworks in deep learning is the TensorFlow library, which also supports distributed training on both CPU and GPU clusters. We can use it to train deep learning models in Azure Databricks through Spark TensorFlow Distributor, a library that aims to ease the process of training TensorFlow models with complex architectures and many trainable parameters on distributed computing systems holding large amounts of data.

Spark's support for distributed training used to be limited by its standard execution mode, Map/Reduce, in which tasks are executed independently on each worker without any communication between them. Spark 3.0 introduced a new execution mode, barrier execution, which allows communication between workers during execution and thereby makes it possible to train deep learning models in a distributed way.

Spark TensorFlow Distributor is a TensorFlow-native package that makes...
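As a rough sketch of how the package is used, MirroredStrategyRunner runs a training function inside barrier-mode Spark tasks and sets up TensorFlow's MultiWorkerMirroredStrategy across them; the toy data and model below are placeholder assumptions:

```python
from spark_tensorflow_distributor import MirroredStrategyRunner

def train():
    # Imports live inside the function because it is shipped to the workers.
    import numpy as np
    import tensorflow as tf

    # Placeholder data and model; a real script would load a proper dataset.
    x = np.random.rand(1024, 20).astype('float32')
    y = np.random.randint(0, 2, size=(1024,))
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    model.fit(x, y, batch_size=64, epochs=2)

# num_slots is the total number of GPUs (or CPU slots) to train on; the
# runner executes train() in barrier-mode tasks across the cluster.
MirroredStrategyRunner(num_slots=2).run(train)
```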

Summary

In this chapter, we have learned how to improve the performance of our training pipelines for deep learning algorithms using distributed learning with Horovod and TensorFlow's native Spark support in Azure Databricks. We have discussed the core algorithms that make it possible to effectively distribute key operations such as gradient descent and model weight updates, how these are implemented in the horovod library included with Azure Databricks Runtime for Machine Learning, and how we can use the native support now available for Spark in the TensorFlow framework for the distributed training of deep learning models.

This chapter concludes the book. Hopefully, it has made it easier for you to learn about the vast number of features available in Azure Databricks for data engineering and data science. As mentioned before, most of the code examples are modifications of the official libraries' examples or are taken from the Azure Databricks documentation in order...

