Understanding the typical machine learning process

The following diagram summarizes the ML process in an organization:

Figure 1.1 – The data science development life cycle consists of three main stages – data preparation, modeling, and deployment

Note

Source: https://azure.microsoft.com/mediahandler/files/resourcefiles/standardizing-the-machine-learning-lifecycle/Standardizing%20ML%20eBook.pdf.

The ML process is iterative. Raw structured and unstructured data first lands in a data lake from different sources. A data lake uses the scalable, low-cost object storage offered by cloud providers, such as Amazon Simple Storage Service (S3) or Azure Data Lake Storage (ADLS), depending on which cloud an organization uses. Due to regulations, many organizations have a multi-cloud strategy, making it essential to choose cloud-agnostic technologies and frameworks to simplify infrastructure management and reduce operational overhead.

Databricks defined a design pattern called the medallion architecture to organize data in a data lake. Before moving forward, let’s briefly understand what the medallion architecture is:

Figure 1.2 – Databricks medallion architecture

The medallion architecture is a data design pattern that’s used in a Lakehouse to organize data logically. It involves structuring data into layers (Bronze, Silver, and Gold) to progressively improve its quality and structure. The medallion architecture is also referred to as a “multi-hop” architecture.

The Lakehouse architecture, which combines the best features of data lakes and data warehouses, offers several benefits, including a simple data model, ease of implementation, incremental extract, transform, and load (ETL), and the ability to recreate tables from raw data at any time. It also provides features such as ACID transactions and time travel for data versioning and historical analysis. We will expand more on the lakehouse in the Exploring the Databricks Lakehouse architecture section.
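As a brief illustration of ACID-backed time travel, the following sketch reads earlier versions of a Delta table; the table path and timestamp are hypothetical placeholders rather than examples from the book:

# Minimal sketch of Delta Lake time travel; the path "/mnt/demo/events" is hypothetical.
# In a Databricks notebook, a SparkSession is already available as `spark`.
current_df = spark.read.format("delta").load("/mnt/demo/events")

# Read the table as it existed at an earlier version number...
v0_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/demo/events")

# ...or as of a point in time
snapshot_df = (spark.read.format("delta")
               .option("timestampAsOf", "2023-01-01")
               .load("/mnt/demo/events"))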

In the medallion architecture, the Bronze layer holds raw data sourced from external systems, preserving its original structure along with additional metadata. The focus here is on quick change data capture (CDC) and maintaining a historical archive. The Silver layer, on the other hand, houses cleansed, conformed, and “just enough” transformed data. It provides an enterprise-wide view of key business entities and serves as a source for self-service analytics, ad hoc reporting, and advanced analytics.

The Gold layer is where curated business-level tables reside that have been organized for consumption and reporting purposes. This layer utilizes denormalized, read-optimized data models with fewer joins. Complex transformations and data quality rules are applied here, facilitating the final presentation layer for various projects, such as customer analytics, product quality analytics, inventory analytics, and more. Traditional data marts and enterprise data warehouses (EDWs) can also be integrated into the lakehouse to enable comprehensive “pan-EDW” advanced analytics and ML.
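To make the layer-by-layer flow more tangible, here is a simplified PySpark sketch of a Bronze-to-Silver-to-Gold pipeline on Delta tables; the paths, column names, and transformations are illustrative assumptions, not the book's examples:

# Illustrative multi-hop pipeline; all paths and columns are hypothetical.
from pyspark.sql import functions as F

# Bronze: raw data ingested as-is, with ingestion metadata for CDC and history
bronze = (spark.read.json("/mnt/raw/orders")
          .withColumn("_ingested_at", F.current_timestamp()))
bronze.write.format("delta").mode("append").save("/mnt/bronze/orders")

# Silver: cleansed, de-duplicated, "just enough" transformed records
silver = (spark.read.format("delta").load("/mnt/bronze/orders")
          .dropDuplicates(["order_id"])
          .filter(F.col("order_amount") > 0)
          .withColumn("order_date", F.to_date("order_ts")))
silver.write.format("delta").mode("overwrite").save("/mnt/silver/orders")

# Gold: denormalized, read-optimized aggregates for consumption and reporting
gold = (spark.read.format("delta").load("/mnt/silver/orders")
        .groupBy("customer_id", "order_date")
        .agg(F.sum("order_amount").alias("daily_spend"),
             F.count("order_id").alias("order_count")))
gold.write.format("delta").mode("overwrite").save("/mnt/gold/customer_daily_spend")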

The medallion architecture aligns well with the concept of a data mesh, where Bronze and Silver tables can be joined in a “one-to-many” fashion to generate multiple downstream tables, enhancing data scalability and autonomy.

Over the last six years, Apache Spark has overtaken Hadoop as the de facto standard for processing data at scale, thanks to its performance advancements and broad adoption and support from the developer community. There are many excellent books on Apache Spark written by its creators; these are listed in the Further reading section and offer more insight into Apache Spark's other benefits.

Once the cleansed data lands in the Gold tables, features are generated by combining Gold datasets; these features act as the input for ML model training.
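A hedged sketch of that feature-assembly step is shown below, joining two hypothetical Gold tables into a feature table; the table names and columns are assumptions made for illustration:

# Illustrative feature assembly from Gold tables; names and columns are hypothetical.
from pyspark.sql import functions as F

spend = spark.read.format("delta").load("/mnt/gold/customer_daily_spend")
profile = spark.read.format("delta").load("/mnt/gold/customer_profile")

features = (spend.groupBy("customer_id")
            .agg(F.avg("daily_spend").alias("avg_daily_spend"),
                 F.sum("order_count").alias("total_orders"))
            .join(profile.select("customer_id", "tenure_days", "segment"),
                  on="customer_id", how="inner"))

# Persist the feature table so model training can read it
features.write.format("delta").mode("overwrite").save("/mnt/gold/customer_features")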

During the model development and training phase, various sets of hyperparameters and ML algorithms are tested to identify the optimal combination of the model and corresponding hyperparameters. This process relies on predetermined evaluation metrics such as accuracy, R2 score, and F1 score.
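As a minimal illustration of this search, the following sketch compares two algorithms and a few hyperparameter values using cross-validated F1 scores; it uses scikit-learn on synthetic data, which is an assumption made for illustration rather than the book's workflow:

# Hedged sketch of model/hyperparameter selection driven by an evaluation metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [100, 300], "max_depth": [5, None]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="f1", cv=5)
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Pick the algorithm/hyperparameter combination with the best F1 score
print(results)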

In the context of ML, hyperparameters are parameters that govern the learning process of a model. They are not learned from the data itself but are set before training. Examples of hyperparameters include the learning rate, regularization strength, number of hidden layers in a neural network, or the choice of a kernel function in a support vector machine. Adjusting these hyperparameters can significantly impact the performance and behavior of the model.

On the other hand, training an ML model involves deriving values for other model parameters, such as node weights or model coefficients. These parameters are learned during the training process using the training data to minimize a chosen loss or error function. They are specific to the model being trained and are determined iteratively through optimization techniques such as gradient descent or closed-form solutions.

Expanding beyond node weights, model parameters can also include coefficients in regression models, intercept terms, feature importance scores in decision trees, or filter weights in convolutional neural networks. These parameters are directly learned from the data during the training process and contribute to the model’s ability to make predictions.
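To keep the distinction concrete, here is a small sketch (using scikit-learn as an assumed example) in which the regularization strength is a hyperparameter fixed before training, while the coefficients and intercept are parameters learned from the data:

# Hyperparameters vs. learned parameters, illustrated on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# C (inverse regularization strength) is a hyperparameter: set before training.
model = LogisticRegression(C=0.5, max_iter=1000)

# Coefficients and the intercept are model parameters: learned during fit().
model.fit(X, y)
print(model.coef_)       # learned feature weights
print(model.intercept_)  # learned intercept term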

Parameters

You can learn more about parameters at https://en.wikipedia.org/wiki/Parameter.

The finalized model is deployed for batch, streaming, or real-time inference, with real-time inference typically served as a Representational State Transfer (REST) endpoint using containers. In this phase, we set up monitoring for drift and governance around the deployed models to manage the model life cycle and enforce access control around usage. Let’s take a look at the different personas involved in taking an ML use case from development to production.
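As a final hedged sketch, the snippet below logs a trained model with MLflow and reloads it by URI for batch scoring; the run name and synthetic data are assumptions, and on Databricks the same logged model could also be registered and exposed as a REST endpoint through Model Serving:

# Log a model with MLflow and reload it for inference; names and data are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data so the example is self-contained
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Log the model to the MLflow tracking server
with mlflow.start_run(run_name="demo_model") as run:
    mlflow.sklearn.log_model(model, artifact_path="model")

# Load the logged model by URI for batch inference; the same artifact can be
# registered in the Model Registry and served behind a REST endpoint.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
predictions = loaded.predict(X)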
