You're reading from Practical Machine Learning on Databricks

Published: Nov 2023 · Publisher: Packt · Edition: 1st · ISBN-13: 9781801812030 · Reading level: Intermediate

Author: Debu Sinha

Debu is an experienced Data Science and Engineering leader with deep expertise in Software Engineering and Solutions Architecture. With over 10 years in the industry, Debu has a proven track record in designing scalable Software Applications, Big Data, and Machine Learning systems. As Lead ML Specialist on the Specialist Solutions Architect team at Databricks, Debu focuses on AI/ML use cases in the cloud and serves as an expert on LLMs, Machine Learning, and MLOps. With prior experience as a startup co-founder, Debu has demonstrated skills in team-building, scaling, and delivering impactful software solutions. An established thought leader, Debu has received multiple awards and regularly speaks at industry events.

Model Drift Detection and Retraining

In the last chapter, we covered the workflow management options available in Databricks for automating machine learning (ML) tasks. Now, we will build on our understanding of the ML life cycle so far and introduce the fundamental concept of drift. We will discuss why model monitoring is essential and how you can ensure your ML models perform as expected over time.

At the time of writing this book, Databricks has a product in development that will simplify monitoring model performance and data out of the box. In this chapter, we will go through an example of how to use existing Databricks functionality to implement drift detection and monitoring.

We will be covering the following topics:

  • Motivation for model monitoring
  • Introduction to model drift
  • Introduction to Statistical Drift
  • Techniques for drift detection
  • Implementing drift detection on Databricks

Let’s go through the technical...

Technical requirements

The following are the prerequisites for the chapter:

  • Access to a Databricks workspace
  • A running cluster with Databricks Runtime for Machine Learning (Databricks Runtime ML) with a version higher than 10.3
  • Notebooks from Chapter 9 imported into the Databricks workspace
  • Introductory knowledge of hypothesis testing and interpreting statistical tests

Let’s take a look at the motivation behind why model monitoring is important.

The motivation behind model monitoring

According to a July 21, 2019 Forbes article by Enrique Dans, 87% of data science projects never make it to production (https://www.forbes.com/sites/enriquedans/2019/07/21/stop-experimenting-with-machine-learning-and-start-actually-usingit/?sh=1004ff0c3365).

ML models fail for many reasons; however, if we look purely at why ML projects fail in a production environment, it often comes down to a lack of retraining, and a lack of testing of the deployed models for performance consistency over time.

A model's performance tends to degrade over time, yet many data scientists neglect the maintenance aspect of models post-production. The following visualizations offer a comparative analysis of two distinct approaches to model management: one where the model is trained once and then deployed for an extended period, and another where the model undergoes regular retraining with fresh data while being monitored for...

Introduction to model drift

ML models can experience a decline in performance over time, which is a common issue in production projects. The main cause is change in the input data fed to the model. These changes can occur for various reasons: the underlying distribution of the data shifts, the relationship between the dependent and independent features changes, or the source system that generates the data itself changes.

The performance degradation of deployed models over time is called Model Drift. To effectively identify instances of Model Drift, various metrics can be monitored:

  • Accuracy: A declining trend in accuracy can serve as a strong indicator of model drift.
  • Precision and Recall: A noticeable decrease in these values may highlight the model's diminishing ability to make accurate and relevant predictions.
  • F1 Score: This is a harmonized metric that encapsulates both precision and recall. A drop in the F1 Score...
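One way to track these signals is to recompute the offline metrics on fresh labeled data and compare them against a training-time baseline. The following is a minimal sketch using scikit-learn (which ships with Databricks Runtime ML); the helper names and the 0.05 tolerance are illustrative assumptions, not the book's code:

```python
# Sketch: compare current offline metrics against a training-time baseline.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def offline_metrics(y_true, y_pred):
    """Compute the offline metrics discussed above for one evaluation window."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

def detect_metric_drift(baseline, current, tolerance=0.05):
    """Flag any metric that dropped more than `tolerance` below its baseline."""
    return {m: baseline[m] - current[m] for m in baseline
            if baseline[m] - current[m] > tolerance}

# Example: baseline from training, current from the latest scored batch.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
baseline = offline_metrics(y_true, [1, 0, 1, 1, 0, 1, 0, 1])
current = offline_metrics(y_true, [1, 0, 0, 1, 0, 0, 0, 1])
drifted = detect_metric_drift(baseline, current)
```

Scheduling a comparison like this on each batch of newly labeled data turns the declining-trend observations above into an automated alert.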

Introduction to Statistical Drift

Statistical drift refers to changes in the underlying data distribution itself. It can affect both the input features and the target variable. This drift may or may not affect the model's performance but understanding it is crucial for broader data landscape awareness.

To effectively identify instances of Statistical Drift, various metrics can be monitored:

  • Mean and Standard Deviation: Significant changes can indicate drift.
  • Kurtosis and Skewness: Changes signal data distribution alterations.
  • Quantile Statistics: For example, look at changes in the 25th, 50th, and 75th percentiles.
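These summary statistics can be computed per feature on a reference window (for example, the training data) and a current window, then compared. Here is a minimal sketch using numpy and scipy.stats on synthetic data; the windows and the comparison step are illustrative:

```python
import numpy as np
from scipy import stats

def summary_stats(values):
    """The distribution summaries discussed above for one feature window."""
    arr = np.asarray(values, dtype=float)
    return {
        "mean": float(np.mean(arr)),
        "std": float(np.std(arr)),
        "skewness": float(stats.skew(arr)),
        "kurtosis": float(stats.kurtosis(arr)),
        "p25": float(np.percentile(arr, 25)),
        "p50": float(np.percentile(arr, 50)),
        "p75": float(np.percentile(arr, 75)),
    }

# Reference window vs. a shifted current window (synthetic illustration).
rng = np.random.default_rng(42)
reference = summary_stats(rng.normal(loc=50, scale=5, size=10_000))
current = summary_stats(rng.normal(loc=60, scale=5, size=10_000))
mean_shift = current["mean"] - reference["mean"]  # a large shift suggests drift
```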

To fully grasp how Model Drift and Statistical Drift are interconnected, consider the following key points:

  • Cause and Effect Relationship: Statistical drift in either the features or the target variable frequently serves as a precursor to model drift. For example, should the age demographic of your customer base shift (indicative...

Techniques for drift detection

To ensure effective monitoring of our model’s performance over time, we should track changes in summary statistics and distributions of both the model features and target variables. This will enable us to detect any potential data drift early on.

Furthermore, it’s important to monitor offline model metrics such as accuracy and F1 scores that were utilized during the initial training of the model.

Lastly, we should also keep an eye on online metrics or business metrics to ensure that our model remains relevant to the specific business problem we are trying to solve.

The following table provides an overview of statistical tests and methods that can be employed to identify drift in your data and models. Note that this list is not exhaustive.
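Two widely used tests of this kind, both available in the scipy.stats package used later in this chapter, are the two-sample Kolmogorov–Smirnov test for numerical features and the chi-square test for categorical features. The sketch below runs both on synthetic data; the sample sizes, category counts, and the 0.05 significance level are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Numerical feature: two-sample Kolmogorov-Smirnov test.
reference_num = rng.normal(loc=0.0, scale=1.0, size=5_000)
current_num = rng.normal(loc=0.5, scale=1.0, size=5_000)   # shifted -> drift
ks_stat, ks_p = stats.ks_2samp(reference_num, current_num)
numerical_drift = ks_p < 0.05  # reject "same distribution" at alpha = 0.05

# Categorical feature: chi-square test on observed vs. expected counts.
reference_counts = np.array([700, 200, 100])   # category frequencies (reference)
current_counts = np.array([400, 350, 250])     # proportions changed -> drift
# Scale expected counts to the current sample size before testing.
expected = reference_counts / reference_counts.sum() * current_counts.sum()
chi2_stat, chi2_p = stats.chisquare(f_obs=current_counts, f_exp=expected)
categorical_drift = chi2_p < 0.05
```

A small p-value rejects the hypothesis that the two windows come from the same distribution, which is the drift signal we are after; the significance level should be tuned to how tolerant your use case is of false alarms.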

Implementing drift detection on Databricks

The necessary files for this chapter are located within the Chapter-09 folder. This example demonstrates how you can arrange your code into specific modules to keep it organized.

Figure 9.8 – A screenshot showing the layout of the files in our code base


The setup notebook in the config folder is designed to establish the folder structure for data reading and writing. It also sets up the MLflow experiment for tracking model performance over time and manages other variables that will be utilized in our model-drift notebook.

The datagen notebook within the data folder serves the purpose of creating a synthetic dataset that effectively demonstrates the concept of model drift. This dataset encompasses time series data of online sales for an e-commerce website spanning three months.

In this dataset, we have a set of independent features and a target feature, along with simulated relationships between them...
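A dataset along these lines can be sketched as follows; the column names, coefficients, and the point at which the feature–target relationship changes are illustrative assumptions, not the book's actual datagen code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Three months of daily online-sales data for a hypothetical e-commerce site.
dates = pd.date_range("2023-01-01", periods=90, freq="D")
ad_spend = rng.uniform(1_000, 5_000, size=90)      # independent feature
site_visits = rng.normal(20_000, 2_000, size=90)   # independent feature

# The target depends on the features; after day 60 the relationship changes,
# simulating concept drift that degrades a model trained on the early data.
coef = np.where(np.arange(90) < 60, 0.8, 0.3)
sales = coef * ad_spend + 0.05 * site_visits + rng.normal(0, 200, size=90)

df = pd.DataFrame({"date": dates, "ad_spend": ad_spend,
                   "site_visits": site_visits, "sales": sales})
```

Because the simulated relationship between `ad_spend` and `sales` weakens partway through the period, a model fit on the first weeks will visibly underperform on the final month, which is exactly the behavior the drift-detection notebook is meant to surface.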

Summary

In this chapter, we extensively explored the significance of monitoring both models and data, emphasizing the crucial role of drift detection. Our understanding deepened as we delved into the spectrum of statistical tests at our disposal, which are adept at identifying diverse forms of drift encompassing numerical and categorical features.

Moreover, we engaged in a comprehensive walk-through, exemplifying the application of these concepts. Through a simulated model drift scenario using a synthetic e-commerce dataset, we harnessed the power of various statistical tests from the scipy.stats package to accurately pinpoint instances of drift.

As we venture into the next chapter, our focus will pivot toward elucidating the organization within the Databricks workspace and delving into the realm of continuous integration/continuous deployment (CI/CD).



Table: Data Type to Monitor | Sub-Category | Statistical Measures and Tests
...