Home Data The Deep Learning Architect's Handbook

The Deep Learning Architect's Handbook

By Ee Kin Chin
books-svg-icon Book
eBook $42.99 $29.99
Print $52.99
Subscription $15.99 $10 p/m for three months
$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
eBook $42.99 $29.99
Print $52.99
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
  1. Free Chapter
    Chapter 1: Deep Learning Life Cycle
About this book
Deep learning enables previously unattainable feats in automation, but extracting real-world business value from it is a daunting task. This book will teach you how to build complex deep learning models and gain intuition for structuring your data to accomplish your deep learning objectives. This deep learning book explores every aspect of the deep learning life cycle, from planning and data preparation to model deployment and governance, using real-world scenarios that will take you through creating, deploying, and managing advanced solutions. You’ll also learn how to work with image, audio, text, and video data using deep learning architectures, as well as optimize and evaluate your deep learning models objectively to address issues such as bias, fairness, adversarial attacks, and model transparency. As you progress, you’ll harness the power of AI platforms to streamline the deep learning life cycle and leverage Python libraries and frameworks such as PyTorch, ONNX, Catalyst, MLFlow, Captum, Nvidia Triton, Prometheus, and Grafana to execute efficient deep learning architectures, optimize model performance, and streamline the deployment processes. You’ll also discover the transformative potential of large language models (LLMs) for a wide array of applications. By the end of this book, you'll have mastered deep learning techniques to unlock its full potential for your endeavors.
Publication date:
December 2023
Publisher
Packt
Pages
516
ISBN
9781803243795

 

Deep Learning Life Cycle

In this chapter, we will explore the intricacies of the deep learning life cycle. Sharing similar characteristics to the machine learning life cycle, the deep learning life cycle is a framework as much as it is a methodology that will allow a deep learning project idea to be insanely successful or to be completely scrapped when it is appropriate. We will grasp the reasons why the process is cyclical and understand some of the life cycle’s initial processes on a deeper level. Additionally, we will go through some high-level sneak peeks of the later processes of the life cycle that will be explored at a deeper level in future chapters.

Comprehensively, this chapter will help you do the following:

  • Understand the similarities and differences between the deep learning life cycle and its machine learning life cycle counterpart
  • Understand where domain knowledge fits in a deep learning project
  • Understand the few key steps in planning a deep learning project to make sure it can tangibly create real-world value
  • Grasp some deep learning model development details at a high level
  • Grasp the importance of model interpretation and the variety of deep learning interpretation techniques at a high level
  • Explore high-level concepts of model deployments and their governance
  • Learn to choose the necessary tools to carry out the processes in the deep learning life cycle

We’ll cover this material in the following sections:

  • Machine learning life cycle
  • The construction strategy of a deep learning life cycle
  • The data preparation stage
  • Deep learning model development
  • Delivering model insights
  • Managing risks
 

Technical requirements

This chapter includes some practical implementations in the Python programming language. To complete it, you need to have a computer with the following libraries installed:

  • pandas
  • matplotlib
  • seaborn
  • tqdm
  • lingua

The code files are available on GitHub: https://github.com/PacktPublishing/The-Deep-Learning-Architect-Handbook/tree/main/CHAPTER_1.

 

Understanding the machine learning life cycle

Deep learning is a subset of the wider machine learning category. The main characteristic that sets it apart from other machine learning algorithms is the foundational building block called neural networks. As deep learning has advanced tremendously since the early 2000s, it has made many previously unachievable feats possible through its machine learning counterparts. Specifically, deep learning has made breakthroughs in recognizing complex patterns that exist in complex and unstructured data such as text, images, videos, and audio. Some of the successful applications of deep learning today are face recognition with images, speech recognition from audio data, and language translation with textual data.

Machine learning, on the other hand, is a subset of the wider artificial intelligence category. Its algorithms, such as tree-based models and linear models, which are not considered to be deep learning models, still serve a wide range of use cases involving tabular data, which is the bulk of the data that’s stored by small and big organizations alike. This tabular data may exist in multiple structured databases and can span from 1 to 10 years’ worth of historical data that has the potential to be used for building predictive machine learning models. Some of the notable predictive applications for machine learning algorithms are fraud detection in the finance industry, product recommendations in e-commerce, and predictive maintenance in the manufacturing industry. Figure 1.1 shows the relationships between deep learning, machine learning, and artificial intelligence for a clearer visual distinction between them:

Figure 1.1 – Artificial intelligence relationships

Figure 1.1 – Artificial intelligence relationships

Now that we know what deep learning and machine learning are in a nutshell, we are ready for a glimpse of the machine learning life cycle, as shown in Figure 1.2:

Figure 1.2 – Deep learning/machine learning life cycle

Figure 1.2 – Deep learning/machine learning life cycle

As advanced and complex the deep learning algorithm is compared to other machine learning algorithms, the guiding methodologies that are needed to ensure success in both domains are unequivocally the same. The machine learning life cycle involves six stages that interact with each other in different ways:

  1. Planning
  2. Data Preparation
  3. Model Development
  4. Deliver Model Insights
  5. Model Deployment
  6. Model Governance

Figure 1.2 shows these six stages and the possible stage transitions depicted with arrows. Typically, a machine learning project will iterate between stages, depending on the business requirements. In a deep learning project, most of the innovative predictive use cases require manual data collection and data annotation, which is a process that lies in the realm of the Data Preparation stage. As this process is generally time-consuming, especially when the data itself is not readily available, a go-to solution would be to start with an acceptable initial number of data and transition into the Model Development stage and, subsequently, to the Deliver Model Insight stage to make sure results from the ideas are sane.

After the initial validation process, depending again on business requirements, practitioners would then decide to transition back into the Data Preparation stage and continue to iterate through these stages cyclically in different data size milestones until results are satisfactory toward both the model development and business metrics. Once it gets approval from the necessary stakeholders, the project then goes into the Model Deployment stage, where the built machine learning model will be served to allow its predictions to be consumed. The final stage is Model Governance, where practitioners carry out tasks that manage the risk, performance, and reliability of the deployed machine learning model. Model deployment and model governance both deserve more in-depth discussion and will be introduced in separate chapters closer to the end of this book. Whenever any of the key metrics fail to maintain themselves to a certain determined confidence level, the project will fall back into the Data Preparation stage of the cycle and repeat the same flow all over again.

The ideal machine learning project flows through the stages cyclically for as long as the business application needs it. However, machine learning projects are typically susceptible to a high probability of failure. According to a survey conducted by Dimensional Research and Alegion, covering around 300 machine learning practitioners from 20 different business industries, 78% of machine learning projects get held back or delayed at some point before deployment. Additionally, Gartner predicted that 85% of machine learning projects will fail (https://venturebeat.com/2021/06/28/why-most-ai-implementations-fail-and-what-enterprises-can-do-to-beat-the-odds/). By expecting the unexpected, and anticipating failures before they happen, practitioners can likely circumvent potential failure factors early down the line in the planning stage. This also brings us to the trash icon bundled together in Figure 1.2. Proper projects with a good plan typically get discarded only at the Deliver Model Insights stage, when it’s clear that the proposed model and project can’t deliver satisfactory results.

Now that we’ve covered an overview of the machine learning life cycle, let’s dive into each of the stages individually, broken down into sections, to help you discover the key tips and techniques that are needed the complete each stage successfully. These stages will be discussed in an abstract format and are not a concrete depiction of what you should ultimately be doing for your project since all projects are unique and strategies should always be evaluated on a case-by-case basis.

 

Strategizing the construction of a deep learning system

A deep learning model can only realize real-world value by being part of a system that performs some sort of operation. Bringing deep learning models from research papers to actual real-world usage is not an easy task. Thus, performing proper planning before conducting any project is a more reliable and structured way to achieve the desired goals. This section will discuss some considerations and strategies that will be beneficial when you start to plan your deep learning project toward success.

Starting the journey

Today, deep learning practitioners tend to focus a lot on the algorithmic model-building part of the process. It takes a considerable amount of mental strength to not get hooked on the hype of state-of-the-art (SOTA) research-focused techniques. With crazy techniques such as pixtopix, which is capable of generating high-resolution realistic color images from just sketches or image masks, and natural language processing (NLP) techniques such as GPT-3, a 175-billion parameters text generation model from OpenAI, and GPT-4, a multimodal text generation model that is a successor to GPT-3 and its sub-models, that are capable of generating practically anything you ask it to in a text format that ranges from text summarization to generating code, why wouldn’t they?!

Jokes aside, to become a true deep learning architect, we need to come to a consensus that any successful machine learning or deep learning project starts with the business problem and not from the shiny new research paper you just read online complete with a public GitHub repository. The planning stage often involves many business executives who are not savvy about the details of machine learning algorithms and often, the same set of people wouldn’t care about it at all. These algorithms are daunting for business-focused stakeholders to understand and, when added on top of the tough mental barriers of the adoption of artificial intelligence technologies itself, it doesn’t make the project any more likely to be adopted.

Evaluating deep learning’s worthiness

Deep learning shines the most in handling unstructured data. This includes image data, text data, audio data, and video data. This is largely due to the model’s ability to automatically learn and extract complex, high-level features from the raw data. In the case of images and videos, deep learning models can capture spatial and temporal patterns, recognizing objects, scenes, and activities. With audio data, deep learning can understand the nuances of speech, noise, and various sound elements, making it possible to build applications such as speech recognition, voice assistants, and audio classification systems. For text data, deep learning models can capture the context, semantics, and syntax, enabling NLP tasks such as sentiment analysis, machine translation, and text summarization.

This means that if this data exists and is utilized by your company in its business processes, there may be an opportunity to solve a problem with the help of deep learning. However, never overcomplicate problems just so you can solve them with deep learning. Equating this to something more relatable, you wouldn’t use a huge sledgehammer to get a nail into wood. It could work and you might get away with it, but you’d risk bending the nail or injuring yourself while using it.

Once a problem has been identified, evaluate the business value of solving it. Not all problems are born the same and they can be ranked based on their business impact, value, complexity, risks, costs, and suitability for deep learning. Generally, you’d be looking for high impact, high value, low complexity, low risks, low cost, and high suitability to deep learning. Trade-offs between these metrics are expected but simply put, make sure the problem you’ve discovered is worth solving at all with deep learning. A general rule of thumb is to always resort to a simpler solution for a problem, even if it ends up abandoning the usage of deep learning technologies. Simple approaches tend to be more reliable, less costly, less prone to risks, and faster to fruition.

Consider a problem where a solution is needed to remove background scenes in a video feed and leave only humans or necessary objects untouched so that a more suitable background scene can be overlaid as a background instead. This is a common problem in the professional filmmaking industry in all film genres today.

Semantic segmentation, which is the task of assigning a label to every pixel of an image in the width and height dimensions, is a method that is needed to solve such a problem. In this case, the task needs to assign labels that can help identify which pixels need to be removed. With the advent of many publicly available semantic segmentation datasets, deep learning has been able to advance considerably in the semantic segmentation field, allowing itself to achieve a very satisfactory fine-grained understanding of the world, enough so that it can be applied in the industry of autonomous driving and robot navigation most prominently. However, deep learning is not known to be 100% error-free and almost always has some error, even in the controlled evaluation dataset. In the case of human segmentation, for example, the model would likely result in the most errors in the fine hair areas. Most filmmakers aim for perfect depictions of their films and require that every single pixel gets removed appropriately without fail since a lot of money is spent on the time of the actors hired for the film. Additionally, a lot of time and money would be wasted in manually removing objects that could be otherwise simply removed if the scene had been shot with a green screen. This is an example of a case where we should not overcomplicate the problem. A green screen is all you need to solve the problem described: specifically, the rare chromakey green color. When green screens are prepped properly in the areas where the desired imagery will be overlaid digitally, image processing techniques alone can remove the pixels that are considered to be in the small light intensity range centered on the chromakey green color and achieve semantic segmentation effectively with a rule-based solution. The green screen is a simpler solution that is cost-effective, foolproof, and fast to set up.

That was a mouthful! Now, let’s go through a simpler problem. Consider a problem where we want to automatically and digitally identify when it rains. In this use case, it is important to understand the actual requirements and goals of identifying the rain: is it sufficient to detect rain exactly when it happens? Or do we need to identify whether rain will happen in the near future? What will we use the information of rain events for? These questions will guide whether deep learning is required or not. We, as humans, know that rain can be predicted by visual input by either looking at the presence of raindrops falling or looking at cloud conditions. However, if the use case is sufficient to detect rain when it happens, and the goal of detecting rain is to determine when to water the plants, a simpler approach would be to use an electronic sensor to detect the presence of water or humidity. Only when you want to estimate whether it will rain in the future, let’s say in 15 minutes, does deep learning make more sense to be applied as there are a lot of interactions between meteorological factors that can affect rainfall. Only by brainstorming each use case and analyzing all potential solutions, even outside of deep learning, can you make sure deep learning brings tangible business value compared to other solutions. Do not just apply deep learning because you want to.

At times, when value isn’t clear when you’re directly considering a use case, or when value is clear but you have no idea how to execute it, consider finding reference projects from companies in the same industry. Companies in the same industry have a high chance of wanting to optimize the same processes or solve the same pain points. Similar reference projects can serve as a guide to designing a deep learning system and can serve as proof that the use case being considered is worthy of the involvement of deep learning technologies. Of course, not everybody has access to details like this, but you’d be surprised what Google can tell you these days. Even if there isn’t a similar project being carried out for direct reference, you would likely be able to pivot upon the other machine learning project references that already have a track record of bringing value to the same industry.

Admittedly, rejecting deep learning at times would be a hard pill to swallow considering that most practitioners get paid to implement deep learning solutions. However, dismissing it earlier will allow you to focus your time on more valuable problems that would be more useful to solve with deep learning and prevent the risk of undermining the potential of deep learning in cases where simpler solutions can outperform deep learning. Criteria for deep learning worthiness should be evaluated on a case-by-case basis and as a practitioner, the best advice to follow is to simply practice common sense. Spend a good amount of time going through the problem exploration and the worthiness evaluation process. The last thing you want is to spend a painstaking amount of time preparing data, building a deep learning model, and delivering very convincing model insights only to find out that the label you are trying to predict does not provide enough value for the business to invest further.

Defining success

Ever heard sentences like “My deep learning model just got 99% accuracy on my validation dataset!”? Data scientists often make the mistake of determining the success of a machine learning project just by using validation metrics they use to evaluate their machine learning models during the model development process. Model-building metrics such as accuracy, precision, or recall are important metrics to consider in a machine learning project but unless they add business values and connect to the business objectives in some way, they rarely mean anything. A project can achieve a good accuracy score but still fail to achieve the desired business goals. This can happen in cases when no proper success metrics have been defined early and subsequently cause a wrong label to be used in the data preparation and model development stages. Furthermore, even when the model metric positively impacts business processes directly, there is a chance that the achievement won’t be communicated effectively to business stakeholders and the worst case not considered to be successful when reported as-is.

Success metrics, when defined early, act as the machine learning project’s guardrails and ensure that the project goals are aligned with the business goals. One of the guardrails is that a success metric can help guide the choice of a proper label that can at inference time, tangibly improve the business processes or otherwise create value in the business. First, let’s make sure we are aligned with what a label means, which is a value that you want the machine learning model to predict. The purpose of a machine learning model is to assign these labels automatically given some form of input data, and thus during the data preparation and model development stages, a label needs to be chosen to serve that purpose. Choosing the wrong label can be catastrophic to a deep learning project as sometimes, when data is not readily available, it means the project has to start all over again from the data preparation stage. Labels should always be indirectly or directly attributed to the success metric.

Success metrics, as the name suggests, can be plural, and range from time-based success definitions or milestones to the overall project success, and from intangible to tangible. It’s good practice to generally brainstorm and document all the possible success criteria from a low level to a high level. Another best practice is to make sure to always define tangible success metrics alongside intangible metrics. Intangible metrics generate awareness, but tangible metrics make sure things are measurable and thus make them that much more attainable. A few examples of intangible and hard-to-measure metrics are as follows:

  • Increasing customer satisfaction
  • Increasing employee performance
  • Improving shareholder outlook

Metrics are ways to measure something and are tied to goals to seal the deal. Goals themselves can be intangible, similar to the few examples listed previously, but so long as it is tied to tangible metrics, the project is off to a good start. When you have a clear goal, ask yourself in what way the goal can be proven to be achieved, demonstrated, or measured. A few examples of tangible success metrics for machine learning projects that could align with business goals are as follows:

  • Increase the time customers spend, which can be a proxy for customer delight
  • Increase company revenue, which can be a proxy for employee performance
  • Increase the click-through rate (CTR), which can be a proxy for the effectiveness of targeted marketing campaigns
  • Increase the customer lifetime value (CLTV), which can be a proxy for long-term customer satisfaction and loyalty
  • Increase conversion rate, which can be a proxy for the success of promotional campaigns and website user experience

This concept is not new nor limited to just machine learning projects – just about any single project carried out for a company as every single real-world project needs to be aligned with the business goal. Many foundational project management techniques can be applied similarly to machine learning projects, and spending time gaining some project management skills out of the machine learning field would be beneficial and transferable to machine learning projects. Additionally, as machine learning is considered to be a software-based technology, software project management methodologies also apply.

A final concluding thought to take away is that machine learning systems are not about how advanced your machine learning models are, but instead about how humans and machine intelligence can work together to achieve a greater good and create value.

Planning resources

Deep learning often involves neural network architectures with a large set of parameters, otherwise called weights. These architecture’s sizes can go from holding a few parameters up to holding hundreds of billions of parameters. For example, an OpenAI GPT-3 text generation model holds 175 billion neural network parameters, which amounts to around 350 GB in computer storage size. This means that to run GPT-3, you need a machine with a random access memory (RAM) size of at least 350 GB!

Deep learning model frameworks such as PyTorch and TensorFlow have been built to work with devices called graphics processing units (GPUs), which offer tremendous neural network model training and inference speedups. Off-the-shelf GPU devices commonly have a GPU RAM of 12 GB and are nowhere near the requirements needed to load a GPT-3 model in GPU mode. However, there are still methods to partition big models into multiple GPUs and run the model on GPUs. Additionally, some methods can allow for distributed GPU model training and inference to support larger data batch sizes at any one usage point. GPUs are not considered cheap devices and can cost anywhere from a few hundred bucks to hundreds of thousands from the most widely used GPU brand, Nvidia. With the rise of cryptocurrency technologies, the availability of GPUs is also reduced significantly due to people buying them immediately when they are in stock. All these emphasize the need to plan computing resources for training and inferencing deep learning models beforehand.

It is important to align your model development and deployment needs to your computing resource allocation early in the project. Start by gauging the range of sizes of deep learning architectures that are suitable for the task at hand either by browsing research papers or websites that provide a good summary of techniques, and setting aside computing resources for the model development process.

Tip

paperswithcode.com provides summaries of a wide variety of techniques grouped by a wide variety of tasks!

When computing resources are not readily available, make sure you always make purchase plans early, especially if it involves GPUs. But what if a physical machine is not desired? An alternative to using computing resources is to use paid cloud computing resource providers you can access online easily from anywhere in the world. During the model development stage, one of the benefits of having more GPUs with more RAM allocated is that it can allow you to train models faster by either using a larger data batch size during training or allowing the capability to train multiple models at any one time. It is generally fine to also use CPU-only deep learning model training, but the model training time would just inevitably be much longer.

The GPU and CPU-based computing resources that are required during training are often considered overkill to be used during inference time when they are deployed. Different applications have different deployment computing requirements and the decision on what resource specification to allocate can be gauged by asking yourself the following three questions:

  • How often are the inference requests made?
    • Many inference requests in a short period might signal the need to have more than one inference service up in multiple computing devices in parallel
  • What is the average amount of samples that are requested for a prediction at any one time?
    • Device RAM requirements should match batch size expectations
  • How fast do you need a reply?
    • GPUs are needed if it’s seconds or a faster response time requirement
    • CPUs can do the job if you don’t care about the response time

Resource planning is not restricted to just computing resource planning – it also expands to human resource planning. Assumptions for the number of deep learning engineers and data scientists working together in a team would ultimately affect the choices of software libraries and tools used in the model development process. The analogy of choosing these tools will be introduced in future sections.

The next step is to prepare your data.

 

Preparing data

Data is to machine learning models as is the fuel to your car, the electricity to your electronic devices, and the food for your body. A machine learning model works by trying to capture the relationships between the provided input and output data. Similar to how human brains work, a machine learning model will attempt to iterate through collected data examples and slowly build a memory of the patterns required to map the provided input data to the provided target output data. The data preparation stage consists of methods and processes required to prepare ready-to-use data to build a machine learning model that includes the following:

  • Acquisition of raw input and targeted output data
  • Exploratory data analysis of the acquired data
  • Data pre-processing

We will discuss each of these topics in the following subsections.

Deep learning problem types

Deep learning can be broadly categorized into two problem types, namely supervised learning and unsupervised learning. Both of these problem types involve building a deep learning model that is capable of making informed predictions as outputs, given well-defined data inputs.

Supervised learning is a problem type where labels are involved that act as the source of truth to learn from. Labels can exist in many forms and can be broken down into two problem types, namely classification and regression. Classification is the process where a specific discrete class is predicted among other classes when given input data. Many more complex problems derive from the base classification problem types, such as instance segmentation, multilabel classification, and object detection. Regression, on the other hand, is the process where a continuous numerical value is predicted when given input data. Likewise, complex problem types can be derived from the base regression problem type, such as multi-regression and image bounding box regression.

Unsupervised learning, on the other hand, is a problem type where there aren’t any labels involved and the goals can vary widely. Anomaly detection, clustering, and feature representation learning are the most common problem types that belong to the unsupervised learning category.

We will go through these two problem types separately for deep learning in Chapter 8, Exploring Supervised Deep Learning, and Chapter 9, Exploring Unsupervised Deep Learning.

Next, let’s learn about the things you should consider when acquiring data.

Acquiring data

Acquiring data in the context of deep learning usually involves unstructured data, which includes image data, video data, text data, and audio data. Sometimes, data can be readily available and stored through some business processes in a database but very often, it has to be collected manually from the environment from scratch. Additionally, very often, labels for this data are not readily available and require manual annotation work. Along with the capability of deep learning algorithms to process and digest highly complex data comes the need to feed it more data compared to its machine learning counterparts. The requirement to perform data collection and data annotation in high volumes is the main reason why deep learning is considered to have a high barrier of entry today.

Don’t rush into choosing an algorithm quickly in a machine learning project. Spend a quality amount of time formally defining the features that can be acquired to predict the target variable. Get help from domain experts during the process and brainstorm potential predictive features that relate to the target variable. In actual projects, it is common to spend a big portion of your time planning and acquiring the data while making sure the acquired data is fit for a machine learning model’s consumption and subsequently spending the rest of the time in model building, model deployment, and model governance. A lot of research has been done into handling bad-quality data during the model development stage but most of these techniques aren’t comprehensive and are limited in ways that they can cover up the inherent quality of the data. Displaying ignorance in quality assurance during the data acquisition stage and showing enthusiasm only in the data science portion of the workflow is a strong indicator that the project would be doomed to failure right from the inception stage.

Formulating a data acquisition strategy is a daunting task when you don’t know what it means to have good-quality data. Let’s go through a few pillars of data quality you should consider for your data in the context of actual business use cases and machine learning:

  • Representativeness: How representative is the data concerning the real-world data population?
  • Consistency: How consistent are the annotation methods? Does the same pattern match the same label or are there some inconsistencies?
  • Comprehensiveness: Are all variations of a specific label covered in the collected dataset?
  • Uniqueness: Does the data contain a lot of duplicated or similar data?
  • Fairness: Is the collected data biased toward any specific labels or data groups?
  • Validity: Does the data contain invalid fields? Do the data inputs match up with their labels? Is there missing data?

Let’s look at each of these in detail.

Representativeness

Data should be collected in a way that it mimics what data you will receive during model deployment as much as possible. Very often in research-based deep learning projects, researchers collect their data in a closed environment with controlled environmental variables. One of the reasons researchers prefer collecting data from a controlled environment is that they can build stabler machine learning models and generally try to prove a point. Eventually, when the research paper is published, you see amazing results that were applied using handpicked data to impress. These models, which are built on controlled data, fail miserably when you apply them to random uncontrolled real-world examples. Don’t get me wrong – it’s great to have these controlled datasets available to contribute toward a stabler machine learning model at times, but having uncontrolled real-world examples as a main part of the training and evaluation datasets is key to achieving a generalizable model.

Sometimes, the acquired training data has an expiry date and does not stay representative forever. This scenario is called data drift and will be discussed in more detail in the Managing risk section closer to the end of this chapter. The representativeness metric for data quality should also be evaluated based on the future expectations of the data the model will receive during deployment.

Consistency

Data labels that are not consistently annotated make it harder for machine learning models to learn from them. This happens when the domain ideologies and annotation strategies differ among multiple labelers and are just not defined properly. For example, “Regular” and “Normal” mean the same thing, but to the machine, it’s two completely different classes; so are “Normal” and “normal” with just a capitalization difference!

Practice formalizing a proper strategy for label annotation during the planning stage before carrying out the actual annotation process. Cleaning the data for simple consistency errors is possible post-data annotation, but some consistency errors can be hard to detect and complex to correct.

Comprehensiveness

Machine learning thrives in building a decisioning mechanism that is robust to multiple variations and views of any specific label. Being capable and accomplishing it are two different things. One of the prerequisites of decisioning robustness is that the data that’s used for training and evaluation itself has to be comprehensive enough to provide coverage for all possible variations of each provided label. How can comprehensiveness be judged? Well, that depends on the complexity of the labels and how varied they can present themselves naturally when the model is deployed. More complex labels naturally require more samples and less complex labels require fewer samples.

A good point to start with, in the context of deep learning, is to have at least 100 samples for each label and experiment with building a model and deriving model insights to see if there are enough samples for the model to generalize on unseen variations of the label. When the model doesn’t produce convincing results, that’s when you need to cycle back to the data preparation stage again to acquire more data variations of any specific label. The machine learning life cycle is inherently a cyclical process where you will experiment, explore, and verify while transitioning between stages to obtain the answers you need to solve your problems, so don’t be afraid to execute these different stages cyclically.

Uniqueness

While having complete and comprehensive data is beneficial to build a machine learning model that is robust to data variations, having duplicated versions of the same data variation in the acquired dataset risks creating a biased model. A biased model makes biased decisions that can be unethical and illegal and sometimes renders such decisions meaningless. Additionally, the amount of data acquired for any specific label is rendered meaningless when all of them are duplicated or very similar to each other.

Machine learning models are generally trained on a subset of the acquired data and then evaluated on other subsets of the data to verify the model’s performance on unseen data. When the part of the dataset that is not unique gets placed in the evaluation partition of the acquired dataset by chance, the model risks reporting scores that are biased against the duplicated data inputs.

Fairness

Does the acquired dataset represent minority groups properly? Is the dataset biased toward the majority groups in the population? There can be many reasons why a machine learning model turns out to be biased, but one of the main causes is data representation bias. Making sure the data is represented fairly and equitably is an ethical responsibility of all machine learning practitioners. There are a lot of types of bias, so this topic will have its own section and will be introduced along with methods of mitigating it in Chapter 13, Exploring Bias and Fairness.

Validity

Are there outliers in the dataset? Is there missing data in the dataset? Did you accidentally add a blank audio or image file to the properly collected and annotated dataset? Is the annotated label for the data input considered a valid label? These are some of the questions you should ask when considering the validity of your dataset.

Invalid data is useless for machine learning models and some of these complicate the pre-processing required for them. The reasons for invalidity can range from simple human errors to complex domain knowledge mistakes. One of the methods to mitigate invalid data is to separate validated and unvalidated data. Include some form of automated or manual data validation process before a data sample gets included in the validated dataset category. Some of this validation logic can be derived from business processes or just common sense. For example, if we are taking age as input data, there are acceptable age ranges and there are age ranges that are just completely impossible, such as 1,000 years old. Having simple guardrails and verifying these values early when collecting them makes it possible to correct them then and there to get accurate and valid data. Otherwise, these data will likely be discarded when it comes to the model-building stage. Maintaining a structured framework to validate data ensures that the majority of the data stays relevant and usable by machine learning models and free from simple mistakes.

As for more complex invalidity, such as errors in the domain ideology, domain experts play a big part in making sure the data stays sane and logical. Always make sure you include domain experts when defining the data inputs and outputs in the discussion about how data should be collected and annotated for model development.

Making sense of data through exploratory data analysis (EDA)

After the acquisition of data, it is crucial to analyze the data to inspect its characteristics, patterns that exist, and the general quality of the data. Knowing the type of data you are dealing with allows you to plan a strategy for the subsequent model-building stage. Plot distribution graphs, calculate statistics, and perform univariate and multivariate analysis to understand the inherent relationships between the data that can help further ensure the validity of the data. The methods of analysis for different variable types are different and can require some form of domain knowledge beforehand. In the following subsections, we will be practically going through exploratory data analysis (EDA) for text-based data to get a sense of the benefits of carrying out an EDA task.

Practical text EDA

In this section, we will be manually exploring and analyzing a text-specific dataset using Python code, with the motive of building a deep learning model later in this book using the same dataset. The dataset we use will predict the categories of an item on an Indian e-commerce website based on its textual description. This use case will be useful to automatically group advertised items for user recommendation usage and can help increase purchasing volume on the e-commerce website:

  1. Let’s start by defining the libraries that we will use in a notebook. We will be using pandas for data manipulation and structuring, matplotlib and seaborn for plotting graphs, tqdm for visualizing iteration progress, and lingua for text language detection:
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from tqdm import tqdm
    from lingua import Language, LanguageDetectorBuilder
    tqdm.pandas()
  2. Next, let’s load the text dataset using pandas:
    dataset = pd.read_csv('ecommerceDataset.csv')
  3. pandas has some convenient functions to visualize and describe the loaded dataset; let’s use them. Let’s start by visualizing three rows of the raw data:
    dataset.head(3)

    This will display the following figure in your notebook:

Figure 1.3 – Visualizing the text dataset samples

Figure 1.3 – Visualizing the text dataset samples

  1. Next, let’s describe the dataset by visualizing the data column-based statistics:
    dataset.describe()

    This will display the following figure in your notebook:

Figure 1.4 – Showing the statistics of the dataset

Figure 1.4 – Showing the statistics of the dataset

  1. With these visualizations, it’s obvious that the description of the dataset aligns with what exists in the dataset, where the category column contains four unique class categories paired with the text description data column named description with evidence showing that they are strings. One important insight from the describing function is that there are duplicates in the text description. We can remove duplicates by taking the first example row among all the duplicates, but we also have to make sure that the duplicates have the same category, so let’s do that:
    for i in tqdm(range(len(unique_description_information))):
         assert(
        len(
          dataset[
            dataset['description'] ==
            unique_description_information.keys()[i]
          ]['category'].unique()
        ) == 1
      )
    dataset.drop_duplicates(subset=['description'], inplace=True)
  2. Let’s check the data types of the columns:
    dataset.dtypes

    This will display the following figure in your notebook:

Figure 1.5 – Showing the data types of the dataset columns

Figure 1.5 – Showing the data types of the dataset columns

  1. When some samples aren’t inherently a string data type, such as empty data or maybe numbers data, pandas automatically use an Object data type that categorizes the entire column as data types that are unknown to pandas. Let’s check for empty values:
    dataset.isnull().sum()

    This gives us the following output:

Figure 1.6 – Checking empty values

Figure 1.6 – Checking empty values

  1. It looks like the description column has one empty value, as expected. This might be rooted in a mistake when acquiring the data or it might truly be empty. Either way, let’s remove that row as we can’t do anything to recover it and convert the columns into strings:
    dataset.dropna(inplace=True)
    for column in ['category', 'description']:
        dataset[column] = dataset[column].astype("string")
  2. Earlier, we discovered four unique categories. Let’s make sure we have a decent amount of samples for each category by visualizing its distribution:
    sns.countplot(x="category", data=dataset)

    This will result in the following figure:

Figure 1.7 – A graph showing category distribution

Figure 1.7 – A graph showing category distribution

Each category has a good amount of data samples and doesn’t look like there are any anomaly categories.

  1. The goal here is to predict the category of the selling item through the item’s description on the Indian e-commerce website. From that context, we know that Indian citizens speak Hindi, so the dataset might not contain only English data. Let’s try to estimate and verify the available languages in the dataset using an open sourced language detector tool called Lingua. Lingua uses both rule-based and machine learning model-based methods to detect more than 70 text languages that work great for short phrases, single words, and sentences. Because of that, Lingua has a better runtime and memory performance. Let’s start by initializing the language detector instance from the lingua library:
    detector = LanguageDetectorBuilder.from_all_languages(
          ).with_preloaded_language_models().build()
  2. Now, we will randomly sample a small portion of the dataset to detect language as the detection algorithm takes time to complete. Using a 10% fraction of the data should allow us to adequately understand the data:
    sampled_dataset = dataset.sample(frac=0.1, random_state=1234)
    sampled_dataset['language'] = sampled_dataset[
        'description'
    ].progress_apply(lambda x: detector.detect_language_of(x))
  3. Now, let’s visualize the distribution of the language:
    sampled_dataset['language'].value_counts().plot(kind='bar')

    This will show the following graph plot:

Figure 1.8 – Text language distribution

Figure 1.8 – Text language distribution

  1. Interestingly, Lingua detected some anomalous samples that aren’t English. The anomalous languages look like they might be mistakes made by Lingua. Hindi is also detected among them; this is more convincing than the other languages as the data is from an Indian e-commerce website. Let’s check these samples out:
    sampled_dataset[
        sampled_dataset['language'] == Language.HINDI
    ].description.iloc[0]

    This will show the following text:

Figure 1.9 – Visualizing Hindi text

Figure 1.9 – Visualizing Hindi text

  1. It looks like there is a mix of Hindi and English here. How about another language, such as French?
    sampled_dataset[
        sampled_dataset['language'] == Language.FRENCH
    ].description.iloc[0]

    This will show the following text:

Figure 1.10 – Visualizing French text

Figure 1.10 – Visualizing French text

  1. It looks like potpourri was the focused word here as this is a borrowed French word, but the text is still generally English.
  2. Since the list of languages does not include languages that do not use space as a separator between logical word units, let’s attempt to gauge the distribution of words by using a space-based word separation. Word counts and character counts can affect the parameters of a deep learning neural network, so it will be useful to understand these values during EDA:
    dataset['word_count'] = dataset['description'].apply(
        lambda x: len(x.split())
    )
    plt.figure(figsize=(15,4))
    sns.histplot(data=dataset, x="word_count", bins=10)

    This will show the following bar plot:

Figure 1.11 – Word count distribution

Figure 1.11 – Word count distribution

From the exploration and analysis of the text data, we can deduce a couple of reasons that will help set up the model type and structure we should use during the model development stage:

  • The labels are decently sampled with 5,000-11,000 worth of samples per label, making them suitable for deep learning algorithms.
  • The original data is not clean, has missing data, and duplicates but is fixable through manual processing. Using it as-is for model development would have the potential of creating a biased model.
  • The dataset has more than one language but mostly English text; this will allow us to make appropriate model choices during the model development stage.
  • An abundance of samples has fewer than 1,000 words, and some samples have 1,000-8,000 words. In some non-critical use cases, we can safely cap the number of words to around 1,000 words so that we can build a model with better memory and runtime performance.

The preceding practical example should provide a simple experience of performing EDA that will be sufficient to understand the benefit and importance of running an in-depth EDA before going into the model development stage. Similar to the practical text EDA, we prepared a practical EDA sample workflow for other datasets that includes audio, image, and video datasets in our Packt GitHub repository that you should explore to get your hands dirty.

A major concept to grasp in this section is the importance of EDA and the level of curiosity you should display to uncover the truth about your data. Some methods are generalizable to other similar datasets, but treating any specific EDA workflow as a silver bullet blinds you to the increasing research people are contributing to this field. Ask questions about your data whenever you suspect something of it and attempt to uncover the answers yourself by doing manual or automated inspections however possible. Be creative in obtaining these answers and stay hungry in learning new ways you can figure out key information on your data.

In this section, we have methodologically and practically gone through EDA processes for different types of data. Next, we will explore what it takes to prepare the data for actual model consumption.

Data pre-processing

Data pre-processing involves data cleaning, data structuring, and data transforming so that a deep learning model will be capable of using the processed data for model training, evaluation, and inferencing during deployment. The processed data should not only be prepared just for the machine learning model to accept but should generally be processed in a way that optimizes the learning potential and increases the metric performance of the machine learning model.

Data cleaning is a process that aims to increase the quality of the data acquired. An EDA process is a prerequisite to figuring out anything wrong with the dataset before some form of data cleaning can be done. Data cleaning and EDA are often executed iteratively until a satisfactory data quality level is achieved. Cleaning can be as simple as duplicate values removal, empty values removal, or removing values that don’t make logical sense, either in terms of common sense or through business logic. These are concepts that we explained earlier, where the same risks and issues are applied.

Data structuring, on the other hand, is a process that orchestrates the data ingest and loading process from the stored data that is cleaned and verified of its quality. This process determines how data should be loaded from a source or multiple of them and fed into the deep learning model. Sounds simple enough, right? This could be very simple if this is a small, single CSV dataset where there wouldn’t be any performance or memory issues. In reality, this could be very complex in cases where data might be partitioned and stored in different sources due to storage limitations. Here are some concrete factors you’d need to consider in this process:

  • Do you have enough RAM in your computer to process your desired batch size to supply data for your model? Make sure you also take your model size into account so that you won’t get memory overloads and Out of Memory (OOM) errors!
  • Is your data from different sources? Make sure you have permission to access these data sources.
  • Is the speed latency when accessing these sources acceptable? Consider moving this data to a better hardware resource that you can access with higher speeds, such as a solid-state drive (SSD) instead of a hard disk drive (HDD), and from a remote network-accessible source to a direct local hardware source.
  • Do you even have enough local storage to store this data? Make sure you have enough storage to store this data, don’t overload the storage and risk performance slowdowns or worse, computer breakdowns.
  • Optimize the data loading and processing process so that it is fast. Store and cache outputs of data processes that are fixed so that you can save time that can be used to recompute these outputs.
  • Make sure the data structuring process is deterministic, even when there are processes that need randomness. Randomly deterministic is when the randomness can be reproduced in a repeat of the cycle. Determinism helps make sure that the results that have been obtained can be reproduced and make sure model-building methods can be compared fairly and reliably.
  • Log data so that you can debug the process when needed.
  • Data partitioning methods. Make sure a proper cross-validation strategy is chosen that’s suitable for your dataset. If a time-based feature is included, consider whether you should construct a time-based partitioning method where the training data consists of earlier time examples and the evaluation data is in the future. If not, a stratified partitioning method would be your best bet.

Different deep learning frameworks, such as PyTorch and TensorFlow, provide different application programming interfaces (APIs) to implement the data structuring process. Some frameworks provide simpler interfaces that allow for easy setup pipelines while some frameworks provide complex interfaces that allow for a higher level of flexibility. Fortunately, many high-level libraries attempt to simplify the complex interfaces while maintaining flexibility, such as keras on top of TensorFlow, Catalyst on top of PyTorch, fast ai on top of PyTorch, pytorch lightning on top of PyTorch, and ignite on top of PyTorch.

Finally, data transformation is a process that applies unique data variable-specific pre-processing to transform the raw cleaned data into more a representable, usable, and learnable format. An important factor to consider when attempting to execute the data structuring and transformation process is the type of deep learning model you intend to use. Any form of data transformation is often dependent on the deep learning architecture and dependent on the type of inputs it can accept. The most widely known and common deep learning model architectures are invented to tackle specific data types, such as convolutional neural networks for image data, transformer models for sequence-based data, and basic multilayer perceptrons for tabular data. However, deep learning models are considered to be flexible algorithms that can twist and bend to accept data of different forms and sizes, even in multimodal data conditions. Through collaboration with domain experts from the past few years, deep learning experts have been able to build creative forms of deep learning architectures that can handle multiple data modalities and even multiple unstructured data modalities that succeeded in learning cross-modality patterns. Here are two examples:

  • Robust Self-Supervised Audio-Visual Speech Recognition, by Meta AI (formerly Facebook) (https://arxiv.org/pdf/2201.01763v2.pdf):
    • This tackled the problem of speech recognition in the presence of multiple speeches by building a deep learning transformer-based model that can take in both audio and visual data called AV-HuBERT
    • Visual data acted as supplementary data to help the deep learning model discern which speaker to focus on.
    • It achieved the latest state-of-the-art results on the LRS3-TED visual and audio lip reading dataset
  • Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, by DAMO Academy and Alibaba Group (https://arxiv.org/pdf/2202.03052v1.pdf):
    • They built a model that took in text and image data and published a pre-trained model
    • At achieved state-of-the-art results on an image captioning task on the COCO captions dataset

With that being said, data transformations are mainly differentiated into two parts: feature engineering and data scaling. Deep learning is widely known for its feature engineering capabilities, which replace the need to manually craft custom features from raw data for learning. However, this doesn’t mean that it always makes sense to not perform any feature engineering. Many successful deep learning models have utilized engineered forms of features as input.

Now that we know what data pre-processing entails, let’s discuss and explore different data pre-processing techniques for unstructured data, both theoretically and practically.

Text data pre-processing

Text data can be in different languages and exist in different domains, ranging from description data to informational documents and natural human text comments. Some of the most common text data pre-processing methods that are used for deep learning are as follows:

  • Stemming: A process that removes the suffix of words in an attempt to reduce words into their base form. This promotes the cross-usage of the same features for different forms of the same word.
  • Lemmatization: A process that reduces a word into its base form that produces real English words. Lemmatization has many of the same benefits as stemming but is considered better due to the linguistically valid word reduction outputs it produces.
  • Text tokenization, by Byte Pair Encoding (BPE): Tokenization is a process that splits text into different parts that will be encoded and used by the deep learning models. BPE is a sub-word-based text-splitting algorithm that allows common words to be outputted as a single token but rare words get split into multiple tokens. These split tokens can reuse representations from matching sub-words. This is to reduce the vocabulary that can exist at any one time, reduce the amount of out-of-vocabulary tokens, and allow token representations to be learned more efficiently.

One uncommon pre-processing method that will be useful to build more generalizable text deep learning models is text data augmentation. Text data augmentation can be done in a few ways:

  • Replacing verbs with their synonyms: This can be done by using the set of synonym dictionaries from the NLTK library’s WordNet English lexical database. The obtained augmented text will maintain the same meaning with verb synonym replacement.
  • Back translation: This involves translating text into another language and back to the original language using translation services such as Google or using open sourced translation models. The obtained back-translated text will be in a slightly different form.

Audio data pre-processing

Audio data is essentially sequence-based data and, in some cases, multiple sequences exist. One of the most commonly used pre-processing methods for audio is raw audio data transformed into different forms of spectrograms using Short-Time Fourier Transform (STFT), which is a process that converts audio from the time domain into the frequency domain. A spectrogram audio conversion allows audio data to be broken down and represented in a range of frequencies instead of a single waveform representation that is a combination of the signals from all audio frequencies. These spectrograms are two-dimensional data and thus can be treated as an image and fed into convolutional neural networks. Data scaling methods such as log scaling and log-mel scaling are also commonly applied to these spectrograms to further emphasize frequency characteristics.

Image data pre-processing

Image data augmentation is a type of image-based feature engineering technique that is capable of increasing the comprehensiveness potential of the original data. A best practice for applying this technique is to structure the data pipeline to apply image augmentations randomly during the training process by batch instead of providing a fixed augmented set of data for the deep learning model. Choosing the type of image augmentation requires some understanding of the business requirements of the use case. Here are some examples where it doesn’t make sense to apply certain augmentations:

  • When the orientation of the image affects the validity of the target label, orientation modification types of augmentation such as rotation and image flipping wouldn’t be suitable
  • When the color of the image affects the validity of the target label, color modification types of augmentation such as grayscale, channel shuffle, hue saturation shift, and RGB shift aren’t suitable

After excluding obvious augmentations that won’t be suitable, a common but effective method to figure out the best set of augmentations list is iterative experiments and model comparisons.

 

Developing deep learning models

Let’s start with a short recap of what deep learning is. Deep learning’s core foundational building block is a neural network. A neural network is an algorithm that was made to simulate the human brain. Its building blocks are called neurons, which mimic the billions of neurons the human brain contains. Neurons, in the context of neural networks, are objects that store simple information called weights and biases. Think of these as the memory of the algorithm.

Deep learning architectures are essentially neural network architectures that have three or more neural network layers. Neural network layers can be categorized into three high-level groups – the input layer, the hidden layer, and the output layer. The input layer is the simplest layer group and whose functionality is to pass the input data to subsequent layers. This layer group does not contain biases and can be considered passive neurons, but the group still contains weights in its connections to neurons from subsequent layers. The hidden layer comprises neurons that contain biases and weights in their connections to neurons from subsequent layers. Finally, the output layer comprises neurons that relate to the number of classes and problem types and contains bias. A best practice when counting neural network layers is to exclude the input layer when doing so. So, a neural network with one input layer, one hidden layer, and one output layer is considered to be a two-layer neural network. The following figure shows a basic neural network, called a multilayer perceptron (MLP), with a single input layer, a single hidden layer, and a single output layer:

Figure 1.12 – A simple deep learning architecture, also called an MLP

Figure 1.12 – A simple deep learning architecture, also called an MLP

Being a subset of the wider machine learning category, deep learning models are capable of learning patterns from the data through a loss function and an optimizer algorithm that optimizes the loss function. A loss function defines the error made by the model so that its memory (weights and biases) can be updated to perform better in the next iteration. An optimizer algorithm is an algorithm that decides the strategy to update the weights given the loss value.

With this short recap, let’s dive into a summary of the common deep learning model families.

Deep learning model families

These layers can come in many forms as researchers have been able to invent new layer definitions to tackle new problem types and almost always comes with a non-linear activation function that allows the model to capture non-linear relationships between the data. Along with the variation of layers come many different deep learning architecture families that are meant for different problem types. A few of the most common and widely used deep learning models are as follows:

  • MLP for tabular data types. This will be explored in Chapter 2, Designing Deep Learning Architectures.
  • Convolutional neural network for image data types. This will be explored in Chapter 3, Understanding Convolutional Neural Networks.
  • Autoencoders for anomaly detection, data compression, data denoising, and feature representation learning. This will be explored in Chapter 5, Understanding Autoencoders.
  • Gated recurrent unit (GRU), Long Short-Term Memory (LSTM), and Transformers for sequence data types. These will be explored in Chapter 4, Understanding Recurrent Neural Networks, and Chapter 6, Understanding Neural Network Transformers, respectively.

These architectures will be the focus of Chapters 2 to 6, where we will discuss their methodology and go through some practical evaluation. Next, let’s discover the problem types we can tackle in deep learning.

The model development strategy

Today, deep learning models are easy to invent and create due to the advent of deep learning frameworks such as PyTorch and TensorFlow, along with their high-level library wrappers. Which framework you should choose at this point is a matter of preference regarding their interfaces as both frameworks are matured with years of improvement work done. Only when there is a pressing need for a very custom function to tackle a unique problem type will you need to choose the framework that can execute what you need. Once you’ve chosen your deep learning framework, the deep model creation, training, and evaluation process is pretty much covered all around.

However, model management functions do not come out of the box from these frameworks. Model management is an area of technology that allows teams, businesses, and deep learning practitioners to reliably, quickly, and effectively build models, evaluate models, deliver model insights, deploy models to production, and govern models. Model management can sometimes be referred to as machine learning operations (MLOps). You might still be wondering why you’d need such functionalities, especially if you’ve been building some deep learning models off Kaggle, a platform that hosts data and machine learning problems as competitions. So, here are some factors that drive the need to utilize these functionalities:

  • It is cumbersome to compare models manually:
    • Manually typing performance data in an Excel sheet to keep track of model performance is slow and unreliable
  • Model artifacts are hard to keep track of:
    • A model has many artifacts, such as its trained weights, performance graphs, feature importance, and prediction explanations
    • It is also cumbersome to compare model artifacts
  • Model versioning is needed to make sure model-building experiments are not repeated:
    • Overriding the top-performing model with the most reliable model insights is the last thing you want to experience
    • Versioning should depend on the data partitioning method, model settings, and software library versions
  • It is not straightforward to deploy and govern models

Depending on the size of the team involved in the project and how often components need to be reused, different software and libraries would fit the bill. These software and libraries are split into paid and free (usually open sourced) categories. Metaflow, an open sourced software, is suitable for bigger data science teams where there are many chances of components needing to be reused across other projects and MLFlow (open sourced software) would be more suitable for small or single-person teams. Other notable model management tools are Comet (paid), Weights & Biases (paid), Neptune (paid), and Algorithmia (paid).

With that, we have provided a brief overview of deep learning model development methodology and strategy; we will dive deeper into model development topics in the next few chapters. But before that, let’s continue with an overview of the topic of delivering model insights.

 

Delivering model insights

Model metric performance, when used exclusively for model comparisons and the model choosing process, is often not the most effective way to reliably obtain the true best model. When people care about the decisions that can potentially be made by the machine learning model, they typically require more information and insights to eventually put their trust in the ability of the model to make decisions. Ultimately, when models are not trusted, they don’t get deployed. However, trust doesn’t just depend on insights of the model. Building trust in a model involves ensuring accurate, reliable, and unbiased predictions that align with domain expertise and business objectives, while providing stakeholders with insights into the model’s performance metrics, decision-making logic, and rationale behind its predictions. Addressing potential biases and demonstrating fairness are crucial for gaining confidence in the model’s dependability. This ongoing trust-building process extends beyond initial deployment, as the model must consistently exhibit sound decision-making, justify predictions, and maintain unbiased performance. By fostering trust, the model becomes a valuable and reliable tool for real-world applications, leading to increased adoption and utilization across various domains and industries.

Deliver model insights that matter to the business. Other than delivering model insights with the obvious goal of ensuring model trust and eliminating trust issues, actual performance metrics are equally important. Make sure you translate model metrics into layman’s business metrics whenever possible to effectively communicate the potential positive impact that the model can bring to the business. Success metrics, which are defined earlier in the planning phase, should be reported with actual values at this stage.

The process of inducing trust in a model doesn’t stop after the model gets deployed. Similar to how humans are required to explain their decisions in life, machine learning models (if expected to replace humans to automate the decisioning process) are also required to do so. This process is called prediction explanations. In some cases, model decisions are expected to be used as a reference where there is a human in the loop that acts as a domain expert to verify decisions before the decisions are carried out. Prediction explanations are almost always a necessity in these conditions where the users of the model are interested in why the model made its predictions instead of using the predictions directly.

Model insights also allow you to improve a model’s performance. Remember that the machine learning life cycle is naturally an iterative process. Some concrete examples of where this condition could happen are as follows:

  • You realize that the model is biased against a particular group and either go back to the data acquisition stage to acquire more data from the less represented group or change to a modeling technique that is robust to bias
  • You realize that the model performs badly in one class and go back to the model development stage to use a different deep learning model loss function that can focus on the badly performing class

Deep learning models are known to be a black box model. However, in reality, today, there have been many published research papers on deep learning explanation methods that have allowed deep learning to break the boundaries of a black box. We will dive into the different ways we can interpret and provide insights for deep learning models in Part 2 of this book.

Now that we have more context of the processes involved in the deep learning life cycle, in the next section, we will discuss risks that can exist throughout the life cycle of the project that you need to worry about.

 

Managing risks

Deep learning systems are exposed to a multitude of risks starting early, from inception to system adoption. Usually, most people who are assigned to work on a deep learning project are responsible only for a specific stage of the machine learning life cycle, such as model development or data preparation. This practice can be detrimental when the work of one stage in the life cycle generates problems at a later stage, which often happens in cases where the team members involved have little sense of the bigger picture in play. Risks involved in a deep learning system generally involve interactions between stages in the machine learning life cycle and follow the concept of Garbage in, garbage out. Making sure everyone working on building the system has a sense of accountability for what eventually gets outputted from the entirety of the system instead of just the individual stages is one of the foundational keys to managing risks in a machine learning system.

But what are the risks? Let’s start with something that can be handled way before anything tangible is made – something that happens when a use case is evaluated for worthiness.

Ethical and regulatory risks

Deep learning can be applied in practically any industry but some of the hardest industries to get deep learning adopted in are the highly regulated industries, such as the medical and financial industries. The regulations that are imposed in these industries ultimately determine what a deep learning system can or cannot do in the industry. Regulations are mostly introduced by governments and most commonly involve ethical and legal considerations. In these highly regulated industries, it is common to experience audits being conducted monthly, weekly, or even daily to make sure the companies are compliant with the regulations imposed. One of the main reasons certain industries are regulated more aggressively is that the repercussions of any actions in these industries bear a heavy cost to the well-being of the people or the country. Deep learning systems need to be built in a way that they will be compliant with these regulations to avoid the risk of facing decommission, the risk of not getting adopted at all, and, worst of all, the risk of getting a huge fine by regulatory officials.

At times, deep learning models can perform way better than their human counterparts but, on the other side, no deep learning model is perfect. Humans make mistakes, everybody knows that for a fact, but another reality we need to realize is that machines undoubtedly also make mistakes. The highest risk is when humans trust the machines so much that they give 100% trust in them to make decisions! So, how can we account for these mistakes and who will be responsible for them?

Let’s see what we humans do in situations without machines. When the stakes are high, important decisions always go through a hierarchy of approvals before a decision can be final. These hierarchies of approvals signal the need to make important decisions accountable and reliable. The more approvals we get for a decision, the more we can say that we are confident in making that important decision. Some examples of important decisions that commonly require a hierarchy of approvals include the decision to hire an employee in a company, the decision on insurance premiums to charge, or the decision of whether to invest money into a certain business. With that context in mind, deep learning systems need to have similar workflows for obtaining approvals when a deep learning model makes a certain predictive decision in high-stakes use cases. These workflows can include any form of insight, and an explanation of the predictions can be used to make it easier for domain experts to judge the validity of the decisions. Adding a human touch makes the deep learning system many times more ethical and trustable enough to be part of the high-stake decision workflow.

Let’s take a medical industry use case – for instance, a use case to predict lung disease through X-ray scans. From an ethical standpoint, it is not right that a deep learning model can have the complete power to strip a person’s hope of life by predicting extremely harmful diseases such as end-stage lung cancer. If the predicted extreme disease is misdiagnosed, patients may have grieved unnecessarily or spent an unnecessary amount of money on expensive tests to verify the claims. Having an approval workflow in the deep learning system allows doctors to use these results as an assistive method would solve the ethical concerns of using an automated decisioning system.

Business context mismatch

Reiterating the previous point in the Defining success section, aligning the desired deep learning input data and target label choice to the business context and how the target predictions are consumed makes deep learning systems adoptable. The risk involved here is when the business value is either not properly defined or not matched properly by the deep learning system. Even when the target is appropriate for the business context, how these predictions are consumed holds the key to deriving value. Failing to match the business context in any way simply risks the rejection of the systems.

Understanding the business context involves understanding who the targeted user groups are and who they are not. A key step in this process is documenting and building user group personas. User personas hold information about their workflows, history, statistics, and concerns that should provide the context needed to build a system that aligns with the specific needs of potential users. Don’t be afraid to conduct market research and make proper validation of the targeted user needs in the process of building these user personas. After all, it takes a lot of effort to build a deep learning system and it would not be great to waste time building a system that nobody wants to use.

Data collection and annotation risks

Your machine learning model will only be as good as the quality of your data. Deep learning requires a substantially larger quantity of data compared to machine learning. Additionally, very often for new deep learning use cases, data is not available off the shelf and must be manually collected and meticulously annotated. Making sure that the data is collected and annotated in a way that ensures quality is upheld is a very hard job.

The strategies and methods that are used to collect and annotate data for deep learning vary across use cases. Sometimes, the data collection and annotation process can happen at the same time. This can happen either when there are natural labels or when a label is prespecified before data is collected:

  • Natural labels are labels that usually come naturally after some time passes. For example, this can be the gender of a baby through an ultrasound image; another use case is the price of a house as a label with the range of images of the property as the input data.
  • Prespecified labels are labels where the input data is collected with predetermined labels. For example, this can be the gender label of some speech data that was collected in a controlled environment just for building a machine learning model or the age of a person in a survey before taking a facial photo shot to build a machine learning model.

These two types of labels are relatively safe from risks since the labels are likely to be accurate.

A final method of data collection and data annotation that poses a substantial amount of risk is when the labels are post-annotated after the data is collected. Post annotation requires some form of domain knowledge about the characteristics of the desired label and does not always result in a 100% truth value due to human evaluation errors. Unconscious labeling errors happen when the labeler simply makes an error by accident. Conscious labeling errors, on the other hand, happen when the labeler actively decides to make a wrong label decision with intent. This intent can be from a strong or loosely rooted belief tied to a certain pattern in the data, or just a plain error made purposely for some reason. The risk being presented here is the potential of the labels being mistakenly annotated. Fortunately, there are a few strategies that can be executed to eliminate these risks; these will be introduced in the following paragraphs.

Unconscious labeling errors are hard to circumvent as humans are susceptible to a certain amount of error. However, different people generally have varying levels of focus and selective choice of the labelers could be a viable option for some. If the labelers are hired with financial remuneration just to annotate the labels for your data, another viable option is to periodically provide data that is already labeled in secret in between providing data that needs labeling for the labelers to label the data. This strategy allows you to evaluate the performance of individual labelers and provide compensation according to the accuracy of their annotation work (yes, we can use metrics that are used to evaluate machine learning models too). As a positive side effect, labelers also would be incentivized to perform better in their annotation work.

Conscious labeling errors are the most dangerous risk here due to their potential to affect the quality of the entire dataset. When a wrong consensus on the pattern of data associated with any target label is used for the entire labeling process, the legitimacy of the labels won’t be discovered until the later stages of the machine learning life cycle. Machine learning models are only equipped with techniques to learn the patterns required to map the input data to the provided target labels and will likely be able to perform very well, even in conditions where the labels are incorrect. So long as there is a pattern that the labeler enforced during the labeling process, no matter right or wrong, machine learning models will do their best to learn the patterns needed to perform the exact input-to-output mapping. So, how do we mitigate this risk? Making sure ideologies are defined properly with the help of domain experts plays an important part in the mitigation process, but this strategy by itself holds varying levels of effectiveness, depending on the number of domain experts involved, the expertise level of the domain experts themselves, the number of labelers, and the varying degrees of anomalies or quality of the collected data.

Even domain experts can be wrong at times. Let’s take doctors as an example – how many times have you heard about a doctor giving wrong prescriptions and wrong medical assessments? How about the times when you misheard someone’s speech and had to guess what the other person just said? In the last example, the domain expert is you, and the domain is language speech comprehension. Additionally, when the number of labelers exceeds one person and forms a labeling team, one of the most prominent risks that can happen is a mismatch between different preferred domain ideologies to label the data. Sometimes, this happens because of the different variations of a certain label that can exist in a digital format or the existence of confusing patterns that can deter the analytical capabilities of the labeler. Sometimes, this can also happen due to the inherent bias the labeler or the domain expert has toward a certain label or input data. Subsequently, bias in the data creates bias in the model and creates ethical issues that demote trust in the decisions of a machine learning model. When there is a lack of trust in a machine learning model, the project will fail to be adopted and will lose its business value. The topic of bias, fairness, and trust will be discussed more extensively in Chapter 13, Exploring Bias and Fairness, which will elaborate on its origins and ways to mitigate it.

Data with wrong labels that still have correct labels are called noisy labels. Fortunately, there are methods in deep learning that can help you work your way around noisy labels, such as weakly supervised learning. However, remember that it’s always a good idea to fix the issue at the source, which is when the data is collected and annotated, rather than after. Let’s dive into another strategy we can use to make the labeling process less risky. The data labeling process for deep learning projects usually involves using a software tool that allows the specific desired labels to be annotated. Software tools make annotation work faster and easier, and using a labeling collaboration software tool can make annotation work more error-proof. A good collaboration-based labeling tool will allow labelers to align their findings of the data with each other in some way, which will promote a common ideology when labeling. For instance, automatically alerting the entire labeler team when an easily misinterpreted pattern is identified can help prevent more data from being mislabeled and make sure all the previously related data gets rereviewed.

As a final point here, not as a risk, unlabeled data presents huge untapped potential for machine learning. Although it lacks specific labels to achieve a particular goal, there are inherent relationships between the data that can be learned. In Chapter 9, Exploring Unsupervised Deep Learning, we will explore how to use unsupervised learning to leverage this data as a foundation for subsequent supervised learning, which forms the workflow more widely known as semi-supervised learning.

Next, we will go through the next risk related to security in the consumption of data for machine learning models.

Data security risk

Security, in the context of machine learning, relates to preventing the unauthorized usage of data, protecting the privacy of data, and preventing events or attacks that are unwanted relating to the usage of the data. The risk here is when data security is compromised and can result in failure to comply with regulatory standards, the downfall of the model’s performance, or the corruption and destruction of business processes. In this section, we will go through four types of data security risk, namely sensitive data handling, data licensing, software licensing, and adversarial attacks.

Sensitive data handling

Some data is more sensitive than others and can be linked to regulatory risks. Sensitive data gives rise to data privacy regulations that govern the usage of personal data in different jurisdictions. Specifics of regulations vary but they generally revolve around lawfully protecting the rights of the usage of personal data and requiring consent for any actions taken on these types of data along with terms required when using such data. Examples of such regulations are the General Data Protection Regulation (GDPR), which covers the European Union, the Personal Data Protection Act (PDPA), which covers Singapore, the California Consumer Privacy Act (CCPA), which covers only the state of California in the United States, and the Consumer Data Protection Act (CDPA), which covers the state of Virginia in the United States. This means that you can’t just collect data that is categorized as personal, annotate it, build a model, and deploy it without adhering to these regulations as doing so would be considered a crime in some of these jurisdictions. Other than requesting consent, one of the common methods that’s used to mitigate this risk is to anonymize data so that the data can’t be identified by any single person. However, anonymization has to be done reliably so that the key general information is maintained while reliably removing any possibility of reconstructing the identity of a person. Making sure sensitive and personal data is handled properly goes a long way in building trust in the decisions of a machine learning model. Always practice extreme caution in handling sensitive and personal data to ensure the longevity of your machine learning project.

Data and software licensing

Particularly for deep learning projects, a lot of data is required to build a good quality model. The availability of publicly available datasets helps shorten the time and complexity of a project by partially removing the cost and time needed to collect and label data. However, most datasets, like software, have licenses associated with them. These licenses govern how the data associated with the license can be used. The most important criterion of data license about machine learning models for business problems is whether the data allows for commercial usage or not. As most business use cases are considered to be commercial use cases due to profits derived from them, datasets with a license that prevents commercial usage cannot be used. Examples of data licenses with terms that prevent commercial usage are all derivatives of the Creative Commons Attribution-NonCommercial (CC-BY-NC) license. Similarly, open sourced software also poses a risk to deep learning projects. Always make sure you triple-check the licenses before using any publicly available data or software for your project. Using data or code that has terms that prevent commercial usage in your commercial business project puts your project at risk of being fined or sued for license infringement.

Adversarial attacks

When an application of machine learning is widely known, it exposes itself to targeted attacks meant to maliciously derail and manipulate the decisions of a model. This brings us to a type of attack that can affect a deployed deep learning model, called an adversarial attack. Adversarial attacks are a type of attack that involves manipulating the data input in a certain way to affect the predictions from the machine learning model. The most common adversarial attacks are caused by adversarial data examples that are modified from the actual input data in a way that they appear to maintain their legitimacy as input data but are capable of skewing the model’s prediction. The level of risk of such an attack varies across different use cases and depends on how much a user can interact with the system that eventually passes the input data to the machine learning model. One of the most widely known adversarial examples for a deep learning model is an optimized image that looks like random color noise and, when used to perturb pixel values of another image by overlaying its own pixel values, generates an adversarial example. The original image that’s obtained after the perturbation looks as though it’s untouched visually to a human but is capable of producing erroneous misclassification. The following figure shows an example of this perturbation as part of a tutorial from Chapter 14, Analyzing Adversarial Performance. This figure depicts the result of a neural network called ResNet50, trained to classify facial identity, that, when given a facial image, correctly predicted the right identity. However, when added together with another noise image array that was strategically and automatically generated just with access to the predicted class probabilities, the model mistakenly predicts the identity of the facial image, even when the combined image looks visually the same as the original image:

Figure 1.13 – Example of a potential image-based adversarial attack

Figure 1.13 – Example of a potential image-based adversarial attack

Theoretically, adversarial attacks can be made using any type of data and are not limited to images. As the stakes of machine learning use cases grow higher, it’s more likely that attackers are willing to invest in a research team that produces novel adversarial examples.

Some of the adversarial examples and the methods that generate them rely on access to the deep learning model itself so that a targeted adversarial example can be made. This means that it is not remotely possible for a potential attacker to be able to succeed in confusing a model created by someone else unless they have access to the model. However, many businesses utilize publicly available pre-trained models as-is and apply transfer learning methods to reduce the amount of work needed to satisfy a business use case. Any publicly available pre-trained models will also be available to the attackers, allowing them to build targeted adversarial examples for individual models. Examples of such pre-trained models include all the publicly available ImageNet pre-trained convolutional neural network models and weights.

So, how do we attempt to mitigate this risk?

One of the methods we can use to mitigate the risk of an adversarial attack is to train deep learning models with examples from the known set of methods from public research. By training with the known adversarial examples, the model will learn how to ignore the adversarial information through the learning process and be unfazed by such examples during validation and inference time. Evaluating different variations of the adversarial examples also helps set expectations on when the model will fail.

In this section, we discussed an in-depth take on the security issues when dealing with and consuming data for machine learning purposes. In Chapter 14, Analyzing Adversarial Performance, we will go through a more in-depth practical evaluation of adversarial attack technologies for different data modalities and how to mitigate them in the context of deep learning. Next, we will dive into another category of risk that is at the core of the model development process.

Overfitting and underfitting

During the model development process, which is the process of training, validating, and testing a machine learning model, one of the most foundational risks to handle is overfitting a model and underfitting a model.

Overfitting is an event where a machine learning model becomes so biased toward the provided training dataset examples and learned patterns that it can only exclusively distinguish examples that belong in the training dataset while failing to distinguish any examples in the validation and testing dataset.

Underfitting, on the other hand, is an event where a machine learning model fails to capture any patterns of the provided training, validation, and testing dataset.

Learning a generalizable pattern to output mapping capability is the key to building a valuable and usable machine learning model. However, there is no silver bullet to achieving a nicely fitted model, and very often, it requires iterating between the data preparation, model development, and deliver model insights stages.

Here are some tips to prevent overfitting and ensure generalization in the context of deep learning:

  • Augment your data in expectation of the eventual deployment conditions
  • Collect enough data to cover every single variation possible of your targeted label
  • Collaborate with domain experts and understand key indicators and patterns
  • Use cross-validation methods to ensure the model gets evaluated fairly on unseen data
  • Use simpler neural network architectures whenever possible

Here are some tips to prevent underfitting:

  • Evaluate a variety of different models
  • Collaborate with domain experts and understand key indicators and patterns
  • Make sure the data is clean and has as low an error as possible
  • Make sure there is enough data
  • Start with a small set of data inputs when building your model and build your way up to more complex data to ensure models can fit appropriately on your data

In this section, we have discussed issues while building a model. Next, we will be discussing a type of risk that will affect the built model after it has ben trained.

Model consistency risk

One of the major traits of the machine learning process is that it is cyclical. Models get retrained constantly in hopes of finding better settings. These settings can be a different data partitioning method, a different data transformation method, a different model, or the same model with different model settings. Often, the models that are built need to be compared against each other fairly and equitably. Model consistency is the one feature that ensures that a fair comparison can be made between different model versions and performance metrics. Every single process before a model is obtained is required to be consistent so that when anybody tries to execute the same processes with the same settings, the same model should be obtainable and reproducible. We should reiterate that even when some processes require randomness when building the model, it needs to be randomly deterministic. This is needed so that the only difference in the setup is the targeted settings and nothing else.

Model consistency doesn’t stop with just the reproducibility of a model – it extends to the consistency of the predictions of a model. The predictions should be the same when predicted using a different batch size setting, and the same data input should always produce the same predictions. Inconsistencies in model predictions are a major red flag that signals that anything the model produces is not representative of what you will get during deployment and any derived model insights would be misinformation.

To combat model consistency risk, always make sure your code produces consistent results by seeding the random number generator whenever possible. Always validate model consistency either manually or automatically in the model development stage after you build the model.

Model degradation risk

When you have built a model, verified it, and demonstrated its impact on the business, you then take it into the model deployment stage and position it for use. One mistake is to think that this is the end of the deep learning project and that you can take your hands off and just let the model do its job. Sadly, most machine learning models degrade, depending on the level of generalization your model achieved during the model development stage on the data available in the wild. A common scenario that happens when a model gets deployed is that, initially, the model’s performance and the characteristics of the data received during deployment stay the same, and change over time. Time has the potential to change the conditions of the environment and anything in the world. Machines grow old, seasons change, and people change, and expecting that conditions and variables surrounding the machine learning model will change can allow you to make sure models stay relevant and impactful to the business.

How a model can degrade can be categorized into three categories, namely data drift, concept drift, and model drift. Conceptually, drift can be associated with a boat slowly drifting away from its ideal position. In the case of machine learning projects, instead of a boat drifting away, it’s the data or model drifting away from its original perceived behavior or pattern. Let’s briefly go through these types of drift.

Data drift is a form of degradation that is associated with the input data the model needs to provide a prediction value. When a deployed model experiences data drift, it means that the received that’s data during deployment doesn’t belong to the inherent distribution of the data that was trained and validated by the machine learning model. If there were ways to validate the model on the new data supplied during deployment, data drift would potentially cause a shift in the original expected metric performance obtained during model validation. An example of data drift in the context of a deep learning model can be a use case requiring the prediction of human actions outdoors. In this use case, data drift would be that the original data that was collected consisted of humans in summer clothing during the summer season, and due to winter, the people are wearing winter clothing instead.

Concept drift is a form of degradation that is associated with the change in how the input data interacts with the target output data. In the planning stage, domain experts and machine learning practitioners collaborate to define input variables that can affect the targeted output data. This defined input and output setup will subsequently be used for building a machine learning model. Sometimes, however, not all of the context that can affect the targeted output data is included as input variables due to either the availability of that data or just the lack of domain knowledge. This introduces a dependence on the conditions of the missing context associated with the data collected for machine learning. When the conditions of the missing context drift away from the base values that exist in the training and validation data, concept drift occurs, rendering the original concept irrelevant or shifted. In simpler terms, this means that the same input doesn’t map to the same output from the training data anymore. In the context of deep learning, we can take a sentiment classification use case that is based on textual data. Let’s say that a comment or speech can be a negative, neutral, or positive sentiment based on both the jurisdiction and the text itself. This means that in some jurisdictions, people have different thresholds on what is considered negative, neutral, or positive and grade things differently. Training the model with only textual data wouldn’t allow itself to generalize across jurisdictions accurately and thus would face concept drift any time it gets deployed in another jurisdiction.

Lastly, model drift is a form of degradation associated with operational metrics and easily measurable metrics. Some of the factors of degradation that can be categorized into model drift are model latency, model throughput, and model error rates. Model drift metrics are generally easy to measure and track compared to the other two types of drift.

One of the best workflows for mitigating these risks is by tracking the metrics under all of these three types of drift and having a clear path for a new machine learning model to be built, which is a process I call drift reset. Now that we’ve covered a brief overview of model degradation, in Chapter 16, Governing Deep Learning Models, and Chapter 17, Managing Drift Effectively in a Dynamic Environment, we will go more in-depth on these risks and discuss practically how to mitigate this risk in the context of deep learning.

 

Summary

In this chapter, you learned what it takes to complete a deep learning project successfully in a repeatable and consistent manner at a somewhat mid to high-level view. The topics we discussed in this chapter were structured to be more comprehensive at the earlier stages of the deep learning life, which covers the planning and data preparation stages.

In the following chapters, we will explore the mid to final stages of the life cycle more comprehensively. This involves everything after the data preparation stage, which includes model development, model insights, model deployment, and, finally, model governance.

In the next chapter, we will start to explore common and widely used deep learning architectures more extensively.

 

Further reading

  • Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed. Robust Self-Supervised Audio-Visual Speech Recognition, 2022. Toyota Technological Institute at Chicago, Meta AI: https://arxiv.org/pdf/2201.01763v2.pdf.
  • Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang. Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. DAMO Academy, Alibaba Group: https://arxiv.org/pdf/2202.03052v1.pdf.
About the Author
  • Ee Kin Chin

    Ee Kin Chin is a Senior Deep Learning Engineer at DataRobot. He holds a Bachelor of Engineering (Honours) in Electronics with a major in Telecommunications. Ee Kin is an expert in the field of Deep Learning, Data Science, Machine Learning, Artificial Intelligence, Supervised Learning, Unsupervised Learning, Python, Keras, Pytorch, and related technologies. He has a proven track record of delivering successful projects in these areas and is dedicated to staying up to date with the latest advancements in the field.

    Browse publications by this author
Latest Reviews (1 reviews total)
The books are of top quality and easy to use. They contain practical examples, and I would recommend them to anyone wanting to dive into AI.
The Deep Learning Architect's Handbook
Unlock this book and the full library FREE for 7 days
Start now