Exploring Model Evaluation Methods

A trained deep learning model without any form of validation cannot be deployed to production. Production, in the context of the machine learning software domain, refers to the deployment and operation of a machine learning model in a live environment for actual consumption of its predictions. More broadly, model evaluation serves as a critical component in any deep learning project. Typically, a deep learning project will result in many models being built, and a final model will be chosen to serve in a production environment. A good model evaluation process for any project leads to the following:

  • A better-performing final model through model comparisons and metrics
  • Fewer production prediction mishaps by understanding common model pitfalls
  • More closely aligned practitioner and final model behaviors through model insights
  • A higher probability of project success through success metric evaluation
  • A final model that is less biased...

Technical requirements

This chapter includes a practical implementation in the Python programming language. To complete it, you only need to install the matplotlib library.

The code files are available on GitHub at https://github.com/PacktPublishing/The-Deep-Learning-Architect-Handbook/tree/main/CHAPTER_10.

Exploring the different model evaluation methods

Most practitioners are familiar with accuracy-related metrics, which form the most basic evaluation method. Typically, for supervised problems, a practitioner will treat an accuracy-related metric as the golden source of truth. In the context of model evaluation, the term “accuracy metrics” is often used to refer collectively to various performance metrics such as accuracy, F1 score, recall, precision, and mean squared error. When coupled with a suitable cross-validation partitioning strategy, using metrics as a standalone evaluation strategy can go a long way in most projects. In deep learning, accuracy-related metrics are typically used to monitor the progress of the model at each epoch. The monitoring process can subsequently be extended to perform early stopping, halting training once the model stops improving, and to determine when to reduce the learning rate. Additionally, the best model weights can be...
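
To make the monitoring workflow concrete, here is a minimal sketch of metric-driven training in PyTorch, covering early stopping, learning-rate reduction on a plateau, and keeping the best model weights. The `train_step` and `evaluate` callables and the hyperparameter values are hypothetical placeholders, not code from the book.

```python
# A minimal sketch of metric monitoring with early stopping, LR reduction,
# and best-weight checkpointing. train_step/evaluate are hypothetical helpers.
import copy
import torch

def train_with_monitoring(model, optimizer, train_step, evaluate,
                          max_epochs=100, patience=5):
    # Reduce the learning rate when the monitored metric stops improving
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=2)
    best_score, best_weights, epochs_without_improvement = float("-inf"), None, 0
    for epoch in range(max_epochs):
        train_step()                # one pass over the training data
        score = evaluate()          # e.g., validation accuracy or F1
        scheduler.step(score)       # lower LR if the metric plateaus
        if score > best_score:
            best_score = score
            best_weights = copy.deepcopy(model.state_dict())  # keep best weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:        # early stopping
                break
    model.load_state_dict(best_weights)   # restore the best checkpoint
    return best_score
```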

Engineering the base model evaluation metric

Engineering a metric for your use case is an often overlooked skill. This is most likely because most projects work on publicly available datasets, which almost always come with a proposed metric; this includes Kaggle competitions and the many public datasets used for benchmarking. In real-world projects, however, a metric is rarely handed to you. Let's explore this topic further and build this skillset.

The model evaluation metric is the first essential evaluation method in supervised projects (it does not apply to unsupervised-based projects). A few baseline metrics serve as the de facto choices depending on the problem and target type, and more customized versions of these baseline metrics cater to special objectives. For example, generative-based tasks can be evaluated through a special human-based opinion score called the mean...
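
As a quick illustration, not taken from the book, the snippet below computes a few common baseline metrics for the two usual target types: accuracy, precision, recall, and F1 for binary classification, and RMSE for regression, using NumPy only.

```python
# Illustrative baseline metrics for binary classification and regression.
import numpy as np

def classification_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

print(classification_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
print(rmse([2.0, 3.5], [2.5, 3.0]))
```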

Exploring custom metrics and their applications

Base metrics are generally sufficient to meet the requirements of most use cases. However, custom metrics build upon base metrics and incorporate additional goals that are specific to a given scenario. It’s helpful to think of base metrics as a bachelor’s degree and custom metrics as a master’s or PhD degree. It’s perfectly fine to use only base metrics if they meet your needs and you don’t have any additional requirements.

Custom metric requirements often arise naturally early in a project and are highly dependent on the specific use case. Most real use cases don't expose their chosen metrics to the public, even when the model's predictions are meant to be consumed publicly, as with OpenAI's ChatGPT. In machine learning competitions, however, companies with real use cases and accompanying data publish their chosen metric publicly to find the best model that can be built. In such a setting for...
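
To ground the idea of a custom metric building on a base metric, here is a hypothetical example, not from the book: a cost-weighted error that reuses confusion-matrix counts but penalizes false negatives more heavily than false positives, as a fraud- or defect-detection scenario might demand. The cost values are made-up assumptions.

```python
# A hypothetical cost-weighted custom metric built on confusion-matrix counts.
import numpy as np

def cost_weighted_error(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    # Normalize so scores are comparable across datasets of different sizes
    return (fn_cost * fn + fp_cost * fp) / len(y_true)

print(cost_weighted_error([1, 1, 0, 0], [0, 1, 1, 0]))  # one FN, one FP
```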

Exploring statistical tests for comparing model metrics

In machine learning, metric-based model evaluation often involves using averages of aggregated metrics from different folds or partitions, such as holdout and validation sets, to compare the performance of various models. However, relying solely on these average performance metrics may not provide a comprehensive assessment of a model’s performance and generalizability. A more robust approach to model evaluation is the incorporation of statistical hypothesis tests, which assess whether observed differences in performance are statistically significant or due to random chance.

Statistical hypothesis tests are procedures used to determine whether observed data provides sufficient evidence to reject a null hypothesis in favor of an alternative hypothesis, helping to quantify the likelihood that the observed differences are due to random chance or a genuine effect. In statistical tests, the null hypothesis (H0) is a default...
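
As a minimal sketch of this idea, the snippet below compares two models' per-fold F1 scores with a paired t-test. It assumes SciPy is available (it is not listed in this chapter's technical requirements), and the fold scores are made-up illustrative numbers.

```python
# Paired t-test on per-fold metric scores of two models (illustrative values).
from scipy import stats

model_a_fold_f1 = [0.81, 0.79, 0.83, 0.80, 0.82]
model_b_fold_f1 = [0.78, 0.77, 0.80, 0.79, 0.78]

# Null hypothesis (H0): the mean per-fold difference between the models is zero.
t_stat, p_value = stats.ttest_rel(model_a_fold_f1, model_b_fold_f1)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the difference in fold scores is statistically significant.")
else:
    print("Fail to reject H0: the difference may be due to random chance.")
```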

Relating the evaluation metric to success

Defining success in a machine learning project is crucial and should be done at the early stages of the project, as introduced in the Defining success section of Chapter 1, Deep Learning Life Cycle. Success can be defined as achieving higher-level objectives, such as improving the efficiency of processes or increasing their accuracy in comparison to manual labor. In some rare cases, machine learning can enable processes that were previously impossible due to human limitations. The ultimate aim of achieving these objectives is to save costs or earn more revenue for an organization.

A metric performance score of 0.80 F1 or 0.00123 RMSE doesn't mean anything at face value and has to be translated into something tangible for the use case. For instance, one should answer questions such as: what estimated model score would allow the project to achieve the targeted cost savings or revenue improvements? Quantifying...
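
The following toy calculation, with made-up numbers that are not from the book, shows one way such a translation might look: converting a recall score into estimated annual cost savings for a hypothetical defect-detection use case.

```python
# Toy translation of a recall score into estimated cost savings (made-up numbers).
units_per_year = 100_000
defect_rate = 0.02                 # 2% of units are defective
cost_per_missed_defect = 50.0      # downstream cost of each missed defect
manual_recall = 0.70               # recall of the current manual process
model_recall = 0.85                # recall of the candidate model

defects = units_per_year * defect_rate
missed_manual = defects * (1 - manual_recall)
missed_model = defects * (1 - model_recall)
annual_savings = (missed_manual - missed_model) * cost_per_missed_defect
print(f"Estimated annual savings: ${annual_savings:,.0f}")
```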

Directly optimizing the metric

The loss and the metric used to train a deep learning model are two separate components. One trick you can use to improve a model's performance against the chosen metric is to optimize against it directly, instead of only monitoring it to choose the best-performing model weights and to trigger early stopping. In other words, use the metric as the loss directly!

By directly optimizing for the metric of interest, the model has a chance to improve in a way that is relevant to the end goal, rather than optimizing a proxy loss function that may not be directly related to its ultimate performance. In short, using the metric as the loss can yield a noticeably better-performing model.

However, not all metrics can be used as a loss, because not all metrics are differentiable. Remember that backpropagation requires all functions used to be differentiable so that gradients...
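
A common way around this, shown here as a minimal PyTorch sketch rather than the book's own implementation, is to replace hard 0/1 predictions with predicted probabilities so that a metric such as F1 becomes a differentiable "soft" surrogate that can be minimized directly.

```python
# A differentiable "soft F1" loss: probabilities stand in for hard predictions.
import torch

def soft_f1_loss(probs, targets, eps=1e-7):
    # probs: predicted probabilities in [0, 1]; targets: 0/1 labels
    tp = (probs * targets).sum()
    fp = (probs * (1 - targets)).sum()
    fn = ((1 - probs) * targets).sum()
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - soft_f1              # minimize 1 - F1 to maximize F1

probs = torch.tensor([0.9, 0.2, 0.7, 0.4], requires_grad=True)
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = soft_f1_loss(probs, targets)
loss.backward()                     # gradients flow because every step is differentiable
print(loss.item(), probs.grad)
```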

Summary

In this chapter, we briefly explored different model evaluation methods and how they can be used to measure the performance of a deep learning model. Among the introduced methods, we started with metric engineering: we introduced common base model evaluation metrics, discussed their limitations, and introduced the concept of engineering a model evaluation metric tailored to the specific problem at hand. We also explored the idea of optimizing directly against the evaluation metric by using it as a loss function. While this approach can be beneficial, it is important to consider its potential pitfalls and limitations, as well as the specific use cases for which it may be appropriate.

The evaluation of deep learning models requires careful consideration of appropriate evaluation methods, metrics, and statistical tests. Hopefully, after reading through this chapter, I have helped...
