You're reading from Practical Machine Learning on Databricks

Published: Nov 2023 · Publisher: Packt · Edition: 1st · ISBN-13: 9781801812030 · Reading level: Intermediate

Author: Debu Sinha

Debu is an experienced Data Science and Engineering leader with deep expertise in Software Engineering and Solutions Architecture. With over 10 years in the industry, Debu has a proven track record in designing scalable Software Applications, Big Data, and Machine Learning systems. As Lead ML Specialist on the Specialist Solutions Architect team at Databricks, Debu focuses on AI/ML use cases in the cloud and serves as an expert on LLMs, Machine Learning, and MLOps. With prior experience as a startup co-founder, Debu has demonstrated skills in team-building, scaling, and delivering impactful software solutions. An established thought leader, Debu has received multiple awards and regularly speaks at industry events.

Model Drift Detection and Retraining

In the last chapter, we covered the workflow management options available in Databricks for automating machine learning (ML) tasks. Now, we will build on our understanding of the ML life cycle so far and introduce the fundamental concept of drift. We will discuss why model monitoring is essential and how you can ensure your ML models perform as expected over time.

At the time of writing this book, Databricks has a product in development that will simplify monitoring model performance and data out of the box. In this chapter, we will go through an example of how to use existing Databricks functionality to implement drift detection and monitoring.

We will be covering the following topics:

  • Motivation for model monitoring
  • Introduction to model drift
  • Introduction to Statistical Drift
  • Techniques for drift detection
  • Implementing drift detection on Databricks

Let’s go through the technical...

Technical requirements

The following are the prerequisites for the chapter:

  • Access to a Databricks workspace
  • A running cluster with Databricks Runtime for Machine Learning (Databricks Runtime ML) with a version higher than 10.3
  • Notebooks from Chapter 9 imported into the Databricks workspace
  • Introductory knowledge of hypothesis testing and interpreting statistical tests

Let’s take a look at the motivation behind why model monitoring is important.

The motivation behind model monitoring

According to a July 21, 2019 Forbes article by Enrique Dans, 87% of data science projects never make it to production (https://www.forbes.com/sites/enriquedans/2019/07/21/stop-experimenting-with-machine-learning-and-start-actually-usingit/?sh=1004ff0c3365).

ML models fail for many reasons; however, if we look purely at why ML projects fail in a production environment, it often comes down to a lack of retraining, and a lack of testing of the deployed models for performance consistency over time.

A model's performance tends to degrade over time, yet many data scientists neglect the maintenance aspect of models post-production. The following visualizations offer a comparative analysis of two distinct approaches to model management: one where the model is trained once and then deployed for an extended period, and another where the model undergoes regular retraining with fresh data while being monitored for...

Introduction to model drift

ML models can experience a decline in performance over time, which is a common issue in production projects. The main cause is change in the input data fed to the model. These changes can occur for various reasons: the underlying distribution of the data shifts, the relationship between the dependent and independent features changes, or the source system that generates the data itself changes.

The performance degradation of deployed models over time is called Model Drift. To effectively identify instances of Model Drift, various metrics can be monitored:

  • Accuracy: A declining trend in accuracy can serve as a strong indicator of model drift.
  • Precision and Recall: A noticeable decrease in these values may highlight the model's diminishing ability to make accurate and relevant predictions.
  • F1 Score: This is a harmonized metric that encapsulates both precision and recall. A drop in the F1 Score...
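One way to track these signals is to recompute the offline metrics on fresh labeled data and compare them against a training-time baseline. The following is a minimal sketch using scikit-learn (which ships with Databricks Runtime ML); the helper names and the 0.05 tolerance are illustrative assumptions, not the book's code:

```python
# Sketch: compare current offline metrics against a training-time baseline.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def offline_metrics(y_true, y_pred):
    """Compute the offline metrics discussed above for one evaluation window."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

def detect_metric_drift(baseline, current, tolerance=0.05):
    """Flag any metric that dropped more than `tolerance` below its baseline."""
    return {m: baseline[m] - current[m] for m in baseline
            if baseline[m] - current[m] > tolerance}

# Example: baseline from training, current from the latest scored batch.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
baseline = offline_metrics(y_true, [1, 0, 1, 1, 0, 1, 0, 1])
current = offline_metrics(y_true, [1, 0, 0, 1, 0, 0, 0, 1])
drifted = detect_metric_drift(baseline, current)
```

Scheduling a comparison like this on each batch of newly labeled data turns the declining-trend observations above into an automated alert.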

Introduction to Statistical Drift

Statistical drift refers to changes in the underlying data distribution itself. It can affect both the input features and the target variable. This drift may or may not affect the model's performance but understanding it is crucial for broader data landscape awareness.

To effectively identify instances of Statistical Drift, various metrics can be monitored:

  • Mean and Standard Deviation: Significant changes can indicate drift.
  • Kurtosis and Skewness: Changes signal data distribution alterations.
  • Quantile Statistics: For example, look at changes in the 25th, 50th, and 75th percentiles.
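These summary statistics can be computed per feature on a reference window (for example, the training data) and a current window, then compared. Here is a minimal sketch using numpy and scipy.stats on synthetic data; the windows and the comparison step are illustrative:

```python
import numpy as np
from scipy import stats

def summary_stats(values):
    """The distribution summaries discussed above for one feature window."""
    arr = np.asarray(values, dtype=float)
    return {
        "mean": float(np.mean(arr)),
        "std": float(np.std(arr)),
        "skewness": float(stats.skew(arr)),
        "kurtosis": float(stats.kurtosis(arr)),
        "p25": float(np.percentile(arr, 25)),
        "p50": float(np.percentile(arr, 50)),
        "p75": float(np.percentile(arr, 75)),
    }

# Reference window vs. a shifted current window (synthetic illustration).
rng = np.random.default_rng(42)
reference = summary_stats(rng.normal(loc=50, scale=5, size=10_000))
current = summary_stats(rng.normal(loc=60, scale=5, size=10_000))
mean_shift = current["mean"] - reference["mean"]  # a large shift suggests drift
```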

To fully grasp how Model Drift and Statistical Drift are interconnected, consider the following key points:

  • Cause and Effect Relationship: Statistical drift in either the features or the target variable frequently serves as a precursor to model drift. For example, should the age demographic of your customer base shift (indicative...

Techniques for drift detection

To ensure effective monitoring of our model’s performance over time, we should track changes in summary statistics and distributions of both the model features and target variables. This will enable us to detect any potential data drift early on.

Furthermore, it’s important to monitor offline model metrics such as accuracy and F1 scores that were utilized during the initial training of the model.

Lastly, we should also keep an eye on online metrics or business metrics to ensure that our model remains relevant to the specific business problem we are trying to solve.

The following table provides an overview of statistical tests and methods that can be employed to identify drift in your data and models. Note that this list is not exhaustive.
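Two widely used tests of this kind, both available in the scipy.stats package used later in this chapter, are the two-sample Kolmogorov–Smirnov test for numerical features and the chi-square test for categorical features. The sketch below runs both on synthetic data; the sample sizes, category counts, and the 0.05 significance level are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Numerical feature: two-sample Kolmogorov-Smirnov test.
reference_num = rng.normal(loc=0.0, scale=1.0, size=5_000)
current_num = rng.normal(loc=0.5, scale=1.0, size=5_000)   # shifted -> drift
ks_stat, ks_p = stats.ks_2samp(reference_num, current_num)
numerical_drift = ks_p < 0.05  # reject "same distribution" at alpha = 0.05

# Categorical feature: chi-square test on observed vs. expected counts.
reference_counts = np.array([700, 200, 100])   # category frequencies (reference)
current_counts = np.array([400, 350, 250])     # proportions changed -> drift
# Scale expected counts to the current sample size before testing.
expected = reference_counts / reference_counts.sum() * current_counts.sum()
chi2_stat, chi2_p = stats.chisquare(f_obs=current_counts, f_exp=expected)
categorical_drift = chi2_p < 0.05
```

A small p-value rejects the hypothesis that the two windows come from the same distribution, which is the drift signal we are after; the significance level should be tuned to how tolerant your use case is of false alarms.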

Implementing drift detection on Databricks

The necessary files for this chapter are located within the Chapter-09 folder. This example demonstrates how you can arrange your code into specific modules to keep it organized.

Figure 9.8 – A screenshot showing the layout of the files in our code base


The setup notebook in the config folder is designed to establish the folder structure for data reading and writing. It also sets up the MLflow experiment for tracking model performance over time and manages other variables that will be utilized in our model-drift notebook.

The datagen notebook within the data folder serves the purpose of creating a synthetic dataset that effectively demonstrates the concept of model drift. This dataset encompasses time series data of online sales for an e-commerce website spanning three months.

In this dataset, we have a set of independent features and a target feature, along with simulated relationships between them...
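A dataset along these lines can be sketched as follows; the column names, coefficients, and the point at which the feature–target relationship changes are illustrative assumptions, not the book's actual datagen code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Three months of daily online-sales data for a hypothetical e-commerce site.
dates = pd.date_range("2023-01-01", periods=90, freq="D")
ad_spend = rng.uniform(1_000, 5_000, size=90)      # independent feature
site_visits = rng.normal(20_000, 2_000, size=90)   # independent feature

# The target depends on the features; after day 60 the relationship changes,
# simulating concept drift that degrades a model trained on the early data.
coef = np.where(np.arange(90) < 60, 0.8, 0.3)
sales = coef * ad_spend + 0.05 * site_visits + rng.normal(0, 200, size=90)

df = pd.DataFrame({"date": dates, "ad_spend": ad_spend,
                   "site_visits": site_visits, "sales": sales})
```

Because the simulated relationship between `ad_spend` and `sales` weakens partway through the period, a model fit on the first weeks will visibly underperform on the final month, which is exactly the behavior the drift-detection notebook is meant to surface.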

Summary

In this chapter, we extensively explored the significance of monitoring both models and data, emphasizing the crucial role of drift detection. Our understanding deepened as we delved into the spectrum of statistical tests at our disposal, which are adept at identifying diverse forms of drift encompassing numerical and categorical features.

Moreover, we engaged in a comprehensive walk-through, exemplifying the application of these concepts. Through a simulated model drift scenario using a synthetic e-commerce dataset, we harnessed the power of various statistical tests from the scipy.stats package to accurately pinpoint instances of drift.

As we venture into the next chapter, our focus will pivot toward elucidating the organization within the Databricks workspace and delving into the realm of continuous integration/continuous deployment (CI/CD).



Table: Data Type to Monitor | Sub-Category | Statistical Measures and Tests
...