Deploying Deep Learning Models to Production

In the previous chapters, we delved into the intricacies of data preparation, deep learning (DL) model development, and how to deliver insightful outcomes from our DL models. Through meticulous data analysis, feature engineering, model optimization, and model analysis, we have learned techniques to ensure our DL models perform well and behave as desired. As we transition into the next phase of our journey, the focus now shifts toward deploying these DL models in production environments.

Reaching the stage of deploying a DL model to production is a significant accomplishment, considering that most models don’t make it that far. If your project has reached this milestone, it signifies that you have successfully satisfied stakeholders, presented valuable insights, and performed thorough value and metric analysis. Congratulations, as you are now one step closer to joining the small percentage of successful projects amidst countless...

Technical requirements

The last section of this chapter is a practical tutorial. It requires a Linux machine (ideally running Ubuntu) with an NVIDIA GPU, Python 3.10, and the nvidia-docker tool installed. Additionally, the following Python libraries need to be installed:

  • numpy
  • transformers==4.21.3
  • nvidia-tensorrt==8.4.1.5
  • torch==1.12.0
  • transformer-deploy
  • tritonclient

The code files are available on GitHub: https://github.com/PacktPublishing/The-Deep-Learning-Architect-Handbook/tree/main/CHAPTER_15.
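
Before moving on, it may help to confirm that the environment is set up correctly. The following minimal check is not part of the chapter's code; note that the nvidia-tensorrt package is imported as tensorrt:

import numpy
import torch
import tensorrt  # the nvidia-tensorrt package is imported as tensorrt
import transformers

# Confirm the versions pinned above and that a GPU is visible to PyTorch
print("numpy", numpy.__version__)
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("tensorrt", tensorrt.__version__)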

Exploring the crucial components for DL model deployment

So, what does it take to deploy a DL model? It starts with having a holistic view of each required component and defining clear requirements that guide decision-making for every aspect. Keeping each decision aligned with the business goals, combined with careful planning and diligent execution, maximizes the chances of a successful deployment and of unlocking the model's value for users. We will start by examining the components required to deploy a DL model.

Deploying a DL model to production involves more than just the trained model itself. It requires various components working together seamlessly to enable users to extract value from the model’s predictions effectively. These components are as follows:

  • Architectural choices: The overall design and structure of...

Identifying key DL model deployment requirements

To determine the most suitable deployment strategy from a variety of options, it is essential to identify and define seven key requirements. These are latency and availability, cost, scalability, model hardware, data privacy, safety, and trust and reliability requirements. Let’s dive into each of these requirements in detail:

  • Latency and availability requirements: These two are closely connected and should be defined together. Availability requirements refer to the desired level of uptime and accessibility of the model’s predictions. Latency requirements refer to the maximum acceptable delay or response time that the model must meet to provide timely predictions or results. A deployment with a low availability requirement can usually tolerate high-latency predictions, and vice versa. One reason is that a low-latency-capable infrastructure can’t ensure low latency if it is not available when model...
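
To make a latency requirement measurable, it is common practice to track percentile latencies (p50/p95/p99) rather than averages, since tail latency is what users experience at the worst moments. The sketch below is illustrative only: the predict callable and the 200 ms p95 target are assumptions, not values from this chapter.

import time
import numpy as np

def measure_latency_percentiles(predict, sample, n_requests=100):
    # Time repeated calls to a prediction callable and report percentiles in ms
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {p: float(np.percentile(latencies_ms, p)) for p in (50, 95, 99)}

# Hypothetical stand-in for a real model's predict function
percentiles = measure_latency_percentiles(lambda x: sum(x), list(range(1000)))
print(percentiles)
assert percentiles[95] < 200.0, "assumed 200 ms p95 latency requirement violated"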

Choosing the right DL model deployment options

Selecting the right deployment options for your DL model is a crucial step in ensuring optimal performance, scalability, and cost-effectiveness. To assist you in making an informed decision, we will explore recommended options based on different requirements. These recommendations encompass various aspects, such as hardware and physical infrastructure, monitoring and logging components, and deployment strategies. By carefully evaluating your model’s characteristics, resource constraints, and desired outcomes against this guide, you should be able to identify the deployment solution that best aligns with your objectives while maximizing efficiency and return on investment. The tangible deployment components we will explore here are architectural decisions, computing hardware, model packaging and frameworks, communication protocols, and user interfaces. Let’s dive into each component one by one, starting with architectural...

Exploring deployment decisions based on practical use cases

In this section, we will explore practical deployment decisions for DL models in production, focusing on two distinct use cases: a sentiment analysis application for an e-commerce company and a face detection and recognition system for security cameras. By examining these real-world scenarios, we will gain valuable insights into establishing robust deployment strategies tailored to specific needs and objectives.

Exploring deployment decisions for a sentiment analysis application

Suppose you are developing a sentiment analysis application to be used by an e-commerce company to analyze customer reviews in real time. The system needs to process a large number of reviews every day, and low latency is essential to provide immediate insights for the company. In this case, your choices could be as follows:

  • Architectural choice: Deploy as an independent service, as this would allow better scalability and easier updates to handle...
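
To make the independent service option concrete, here is a minimal sketch of a sentiment model wrapped in its own HTTP service. FastAPI, uvicorn, and the endpoint name are illustrative assumptions rather than this chapter's requirements, and the default transformers sentiment-analysis pipeline stands in for the production model:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Default sentiment-analysis pipeline as a stand-in for the production model
sentiment = pipeline("sentiment-analysis")

class Review(BaseModel):
    text: str

@app.post("/sentiment")
def predict(review: Review):
    # Returns, for example, {"label": "POSITIVE", "score": 0.999}
    return sentiment(review.text)[0]

# Run with: uvicorn service:app --host 0.0.0.0 --port 8000

Because the service owns its own process and dependencies, it can be scaled and updated independently of the e-commerce application that calls it.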

Discovering general recommendations for DL deployment

Here, we will discover DL deployment recommendations related to three verticals: model safety, trust, and reliability assurance; model latency optimization; and tools that help abstract model deployment-related decisions and ease the model deployment process. We will dive into these three verticals one by one.

Model safety, trust, and reliability assurance

Ensuring model safety, trust, and reliability is a crucial aspect of deploying DL systems. In this section, we will explore various recommendations and best practices to help you establish a robust framework for maintaining the integrity of your models. This includes compliance with regulations, implementing guardrails, prediction consistency, comprehensive testing, staging and production deployment strategies, usability tests, retraining and updating deployed models, human-in-the-loop decision-making, and model governance. By adopting these measures, you can effectively...
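
As one concrete example of a guardrail, every prediction can be validated before it is served, with low-confidence outputs routed to a human reviewer in a human-in-the-loop setup. The label set, confidence threshold, and fallback behavior below are illustrative assumptions:

ALLOWED_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}
MIN_CONFIDENCE = 0.6  # assumed threshold for demonstration

def guarded_prediction(raw_output: dict) -> dict:
    label = raw_output.get("label")
    score = raw_output.get("score", 0.0)
    if label not in ALLOWED_LABELS or not (0.0 <= score <= 1.0):
        # Malformed output: serve a safe fallback instead of propagating bad data
        return {"label": "NEUTRAL", "score": 0.0, "guardrail": "invalid_output"}
    if score < MIN_CONFIDENCE:
        # Low confidence: flag for human-in-the-loop review
        return {"label": label, "score": score, "guardrail": "needs_review"}
    return {"label": label, "score": score, "guardrail": "passed"}

print(guarded_prediction({"label": "POSITIVE", "score": 0.97}))
print(guarded_prediction({"label": "POSITIVE", "score": 0.41}))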

Deploying a language model with ONNX, TensorRT, and NVIDIA Triton Server

The three tools we will use are ONNX, TensorRT, and NVIDIA Triton Server. ONNX and TensorRT serve GPU-based inference acceleration, while NVIDIA Triton Server hosts the model behind HTTP or gRPC APIs. We will explore these three tools practically in this section. TensorRT is known to perform the best GPU-targeted model optimization to speed up inference, while NVIDIA Triton Server is a battle-tested tool for hosting DL models with native TensorRT compatibility. ONNX, on the other hand, is the intermediate format in this setup, which we will use primarily to hold the model weights in a form that TensorRT supports directly.
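
To preview how a client consumes a model hosted on Triton, the sketch below sends an HTTP inference request with the tritonclient library from the requirements list. The server address, model name, and tensor names ("my_language_model", "input_ids", "output") are illustrative assumptions that must match the model configuration on the server:

import numpy as np
import tritonclient.http as httpclient  # requires tritonclient with HTTP support

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy token IDs; shape and dtype must match the server's model configuration
input_ids = np.zeros((1, 16), dtype=np.int32)
infer_input = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
infer_input.set_data_from_numpy(input_ids)

response = client.infer(model_name="my_language_model", inputs=[infer_input])
print(response.as_numpy("output"))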

In this practical tutorial, we will deploy a Hugging Face-sourced language model that is supported on most NVIDIA GPU devices. We will convert our PyTorch-based language model from Hugging Face into ONNX weights, which will allow TensorRT to load the Hugging Face...
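
As a sketch of that conversion step, the following exports a Hugging Face PyTorch model to ONNX with torch.onnx.export. The model shown, distilbert-base-uncased, is an illustrative stand-in since this excerpt does not name the chapter's model, and the tensor names, dynamic axes, and opset version are assumptions:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-uncased"  # illustrative stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

inputs = tokenizer("an example customer review", return_tensors="pt")

torch.onnx.export(
    model,
    args=(inputs["input_ids"], inputs["attention_mask"]),
    f="model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    # Dynamic axes let downstream tools handle variable batch and sequence sizes
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)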

Summary

In this chapter, we explored the various aspects of deploying DL models in production environments, focusing on key components, requirements, and strategies. We discussed architectural choices, hardware infrastructure, model packaging, safety, trust, reliability, security, authentication, communication protocols, user interfaces, monitoring, and logging components, along with continuous integration and deployment.

This chapter also provided a step-by-step guide for choosing the right deployment options based on specific needs, such as latency, availability, scalability, cost, model hardware, data privacy, and safety requirements. We also explored general recommendations for ensuring model safety, trust, and reliability, optimizing model latency, and utilizing tools that simplify the deployment process.

A practical tutorial on deploying a language model with ONNX, TensorRT, and NVIDIA Triton Server was presented, showcasing a minimal workflow needed for accelerated deployment...
