Distributed Data Systems with Azure Databricks
Published in May 2021 by Packt, 1st Edition
ISBN-13: 9781838647216

About the author

Alan Bernardo Palacio

Alan Bernardo Palacio is a data scientist and engineer with broad experience across several engineering fields. His focus has been the development and application of state-of-the-art data products and algorithms in a range of industries. He has worked for companies such as Ernst and Young and Globant, and now holds a data engineer position at Ebiquity Media, where he helps the company build a scalable data pipeline. Alan graduated with a degree in mechanical engineering from the National University of Tucuman in 2015, has been a founder of startups, and later earned a Master's degree from the Faculty of Mathematics at the Autonomous University of Barcelona in 2017. Originally from Argentina, he now lives and works in the Netherlands.

Preface

Microsoft Azure Databricks helps you harness the power of distributed computing and apply it to create robust data pipelines, along with training and deploying Machine Learning (ML) and Deep Learning (DL) models. Databricks' advanced features enable developers to process, transform, and explore data. Distributed Data Systems with Azure Databricks will help you put your knowledge of Databricks to work to create big data pipelines.

The book provides a hands-on approach to implementing Azure Databricks and its associated methodologies that will make you productive in no time. Complete with detailed explanations of essential concepts, practical examples, and self-assessment questions, it begins with a quick introduction to Databricks' core functionalities before moving on to distributed model training and inference using TensorFlow and Spark MLlib. As you advance, you'll explore MLflow Model Serving on Azure Databricks and distributed training pipelines using HorovodRunner in Databricks. Finally, you'll discover how to transform, use, and obtain insights from massive amounts of data to train predictive models and build complete, working data pipelines.

By the end of this MS Azure book, you'll have gained a solid understanding of how to work with Databricks to create and manage an entire big data pipeline.

Who this book is for

This book is for software engineers, ML engineers, data scientists, and data engineers who are new to Azure Databricks and want to build high-quality data pipelines without worrying about infrastructure. Knowledge of Azure Databricks basics is required to learn the concepts covered in this book more effectively. A basic understanding of ML concepts and beginner-level Python programming knowledge are also recommended.

What this book covers

Chapter 1, Introduction to Azure Databricks, takes you through the core functionalities of Databricks, including how to interact with the workspace environment, a quick look at the main applications, and how Python users can work with the tool. It covers topics such as the workspace, the interface, computation management, and Databricks notebooks.
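
To give a flavor of the notebook experience described here, the following is a minimal sketch of a cell you might run in a Databricks Python notebook, where the spark session and the dbutils helper are predefined; the commands are illustrative and not taken from the chapter:

# List a few of the sample datasets that ship with every Databricks workspace.
# dbutils and spark are injected automatically into Databricks notebooks.
files = dbutils.fs.ls('/databricks-datasets')
for f in files[:5]:
    print(f.path)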

Chapter 2, Creating an Azure Databricks Workspace, teaches you how to apply the previous concepts using the different tools Azure provides for interacting with the workspace, including PowerShell and the Azure CLI, to manage all Databricks resources.

Chapter 3, Creating ETL Operations with Azure Databricks, shows you how to manage different data sources, transform them, and create an entire event-driven ETL pipeline.
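
As a taste of what such a pipeline looks like, here is a minimal extract-transform-load sketch in PySpark; the dataset is the sample file also used later in this preface, and the output path is a placeholder rather than one used in the book:

from pyspark.sql import functions as F

# Extract: read a sample CSV shipped with Databricks.
raw = (spark.read.format('csv')
       .options(header='true', inferSchema='true')
       .load('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'))

# Transform: keep a few columns and derive a new one.
clean = (raw.select('carat', 'cut', 'price')
         .withColumn('price_per_carat', F.col('price') / F.col('carat')))

# Load: write the result out as Parquet (placeholder destination).
clean.write.mode('overwrite').parquet('/tmp/diamonds_clean')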

Chapter 4, Delta Lake with Azure Databricks, explores Delta Lake and how to implement it for various operations.
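
For orientation, a minimal sketch of writing and reading a Delta table looks like this; the table path is a hypothetical example:

# Create a small DataFrame and store it in Delta format.
df = spark.range(0, 1000).withColumnRenamed('id', 'value')
df.write.format('delta').mode('overwrite').save('/tmp/delta/example_table')

# Read the Delta table back; Delta adds ACID transactions and time travel
# on top of files stored in the data lake.
delta_df = spark.read.format('delta').load('/tmp/delta/example_table')
print(delta_df.count())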

Chapter 5, Introducing Delta Engine, explores Delta Engine and shows you how to use it alongside Delta Lake to create efficient ETL pipelines in Databricks.

Chapter 6, Introducing Structured Streaming, explains how to use specific types of streaming sources and sinks and how to put streaming into production, and provides notebooks demonstrating example use cases.
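
As a preview, a minimal Structured Streaming sketch using the built-in rate source (which generates timestamped test rows) and the console sink might look as follows; it is illustrative only and not an example from the chapter:

# Read a test stream that produces a fixed number of rows per second.
stream = (spark.readStream.format('rate')
          .option('rowsPerSecond', 10)
          .load())

# Write the stream to the console sink; call query.stop() to end it.
query = (stream.writeStream
         .format('console')
         .outputMode('append')
         .start())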

Chapter 7, Using Python Libraries in Azure Databricks, explores the nuances of working with Python and introduces core concepts regarding models and data that will be studied in more detail later on.

Chapter 8, Databricks Runtime for Machine Learning, is a deep dive into developing classic ML algorithms to train and deploy models based on tabular data, while also exploring the relevant libraries and algorithms. The examples focus on the particularities and advantages of using Databricks for ML.
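
To illustrate the kind of tabular workflow this chapter deals with, here is a hedged Spark MLlib sketch that trains a logistic regression on a toy DataFrame; the column names and values are made up for illustration:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# A toy tabular dataset.
data = spark.createDataFrame(
    [(1.0, 2.0, 0), (2.0, 1.0, 1), (3.0, 4.0, 0), (4.0, 3.0, 1)],
    ['feature_a', 'feature_b', 'label'])

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=['feature_a', 'feature_b'],
                            outputCol='features')
train_df = assembler.transform(data)

model = LogisticRegression(featuresCol='features', labelCol='label').fit(train_df)
print(model.coefficients)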

Chapter 9, Databricks Runtime for Deep Learning, is a deep dive into developing classic DL algorithms to train and deploy models based on unstructured data, while also exploring the relevant libraries and algorithms. The examples focus on the particularities and advantages of using Databricks for DL.
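
For a sense of what this looks like in practice, below is a minimal TensorFlow/Keras sketch for unstructured (image-like) data; the shapes, layer sizes, and random data are placeholders only:

import numpy as np
import tensorflow as tf

x = np.random.rand(128, 28, 28).astype('float32')   # fake 28x28 images
y = np.random.randint(0, 10, size=128)               # fake class labels

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x, y, epochs=1, batch_size=32)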

Chapter 10, Model Tracking and Tuning in Azure Databricks, focuses on model tuning, deployment, and control using Databricks functionalities such as AutoML and Delta Lake, in conjunction with popular libraries such as TensorFlow.
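
One common way to tune hyperparameters on Databricks Runtime ML is the Hyperopt library, which ships with the runtime; the following is a hedged sketch with a toy objective function rather than the chapter's own approach:

from hyperopt import fmin, tpe, hp

def objective(params):
    # Stand-in for training a model and returning its validation loss.
    return (params['lr'] - 0.1) ** 2

best = fmin(fn=objective,
            space={'lr': hp.uniform('lr', 0.001, 1.0)},
            algo=tpe.suggest,
            max_evals=20)
print(best)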

Chapter 11, Managing and Serving Models with MLflow and MLeap, explores in more detail the MLflow library, an open source platform for managing the end-to-end ML life cycle. This library allows you to track experiments, record and compare parameters, centralize model storage, and more. You will learn how to use it in combination with what you learned in the previous chapters.
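
As a preview of the tracking API, a minimal MLflow sketch looks like this; the parameter names and metric values are invented for illustration:

import mlflow

# Log parameters and a metric to a new MLflow run.
with mlflow.start_run(run_name='example_run'):
    mlflow.log_param('learning_rate', 0.01)
    mlflow.log_param('batch_size', 32)
    mlflow.log_metric('val_accuracy', 0.87)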

Chapter 12, Distributed Deep Learning in Azure Databricks, demonstrates how to use Horovod to make distributed DL faster by taking single-GPU training scripts and scaling them to train across many GPUs in parallel.
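
For orientation, a hedged HorovodRunner sketch is shown below; HorovodRunner is available in Databricks Runtime ML, and the training function body here is a simplified placeholder rather than a real training loop:

from sparkdl import HorovodRunner

def train():
    import horovod.tensorflow.keras as hvd
    hvd.init()
    # Each process would build its model and train on its shard of data here.
    print('Horovod rank', hvd.rank(), 'of', hvd.size())

# np=2 requests two parallel processes; a negative value runs locally on the
# driver, which is handy for testing.
hr = HorovodRunner(np=2)
hr.run(train)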

To get the most out of this book

You need a working PC with an internet connection and an Azure subscription.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Distributed-Data-Systems-with-Azure-Databricks. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/9781838647216_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "When using Azure Databricks Runtime ML, we have the option to use the dbfs:/ml folder."

A block of code is set as follows:

diamonds = (spark.read.format('csv')
            .options(header='true', inferSchema='true')
            .load('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'))

Tips or important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at customercare@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave one on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.
