Effective Planning of Deep Learning-Driven Projects
In the first chapter of the book, we would like to introduce what deep learning (DL) is and how DL projects are typically carried out. The chapter begins with an introduction to DL, providing some insight into how it shapes our daily lives. Then, we will shift our focus to DL projects by describing how they are structured. Throughout the chapter, we will put extra emphasis on the first phase, project planning; you will learn key concepts such as the comprehension of business objectives, how to define appropriate evaluation metrics, identification of stakeholders, resource planning, and the difference between a minimum viable product (MVP) and a fully featured product (FFP). By the end of this chapter, you should be able to construct a DL project playbook that clearly describes the goal of the project, milestones, tasks, resource allocation, and its timeline.
In this chapter, we’re going to cover the following main topics:
- What is DL?
- Understanding the role of DL in our daily lives
- Overview of DL projects
- Planning a DL project
You can download the supplemental material for this chapter from the following GitHub link:
What is DL?
It has only been a decade since DL emerged but it has rapidly started playing an important role in our daily lives. Within the field of artificial intelligence (AI), a popular set of methods categorized as machine learning (ML) exists. By extracting meaningful patterns from historical data, the goal of ML is to build a model that makes sensible predictions and decisions for newly collected data. DL is an ML technique that exploits artificial neural networks (ANNs) to capture a given pattern. Figure 1.1 presents the key components of the AI evolution that started around 1950s, with Alan Turing conducting discussions about intelligent machines, among other godfathers of the field. While various ML algorithms have been introduced sporadically since the advent of AI, it actually took another decades for the field to bloom. Similarly, it has only been about a decade since DL has became the main stream because many of the algorithms in this field require extensive computational power.
Figure 1.1 – A history of AI
As shown in Figure 1.2, the key advantage of DL comes from ANNs, which enable the automatic selection of necessary features. Similar to the way that human brains are structured, ANNs are also made up of components called neurons. A group of neurons forms a layer and multiple layers are linked together to form a network. This kind of architecture can be understood as multiple steps of nested instructions. As the input data passes through the network, each neuron extracts different information, and the model is trained to select the most relevant features for the given task. Considering the role of neurons as building blocks for pattern recognition, deeper networks generally lead to greater performance, as they learn the details better:
Figure 1.2 – The difference between ML and DL
While typical ML techniques require features to be manually selected, DL learns to select important features on its own. Therefore, it can potentially be adapted to a broader range of problems. However, this advantage does not come for free. In order to train a DL model properly, the datasets for training need to be large and diverse enough, which leads to an increase in training time. Interestingly, graphics processing unit (GPU) has played a major role in reducing the training time. While a central processing unit (CPU) demonstrates its effectiveness in carrying out complex computations with its broader instruction sets, a GPU is more suitable for processing simpler but larger computations with its massive parallelism. By exploiting such an advantage in the matrix multiplications that the DL model heavily depends on, GPU has become a critical component within DL.
As we are still in the early stages of the AI era, chip technology is evolving continuously, and it is not yet clear how these technologies will change in the future. It is worth mentioning that new designs come from start-ups as well as big tech companies. This ongoing race clearly shows that more and more products and services based on AI will be introduced. Considering the growth in the market and job opportunities, we believe that it is a great time to learn about DL.
Things to remember
a. DL is an ML technique that exploits ANNs for pattern recognition.
b. DL is highly flexible because it selects the most relevant features automatically for the given task throughout training.
c. GPUs can speed up DL model training with its massive parallelism.
Understanding the role of DL in our daily lives
By exploiting the flexibility of DL, researchers have made remarkable progress in the domains in which traditional ML techniques have shown limited performance (see Figure 1.3). The first flag has been planted in the field of computer vision (CV) for digit recognition and object detection tasks. Then, DL has been adopted for natural language processing (NLP), showing meaningful progress in translation and speech recognition tasks. It also demonstrates its effectiveness in reinforcement learning (RL) as well as generative modeling.
Following diagram shows various applications of DL:
Figure 1.3 – Applications of DL
However, integrating DL into an existing technology infrastructure is not an easy task; difficulties can arise from various aspects, including but not limited to budget, time, as well as talent. Therefore, a thorough understanding of DL has become an essential skill for those who manage DL projects: project managers, technology leads, as well as C-suite executives. Furthermore, the knowledge in this fast-growing field will allow them to find future opportunities in their specific verticals and across the organization, as people working on AI projects actively gather new knowledge to derive innovative and competitive advantages. Overall, an in-depth understanding of DL pipelines and developing production-ready outputs allows managers to execute projects better by effectively avoiding commonly known pitfalls.
Unfortunately, DL projects are not yet in a plug-and-play state. In many cases, they involve extensive research and adjustment phases, which can quickly drain the available resources. Above all, we have noticed that many companies struggle to move from proof of concept to production because critical decisions are made by the few who only have a limited understanding of the project requirements and DL pipelines. With that being said, our book aims to provide a complete picture of a DL project; we will start with project planning, and then discuss how to develop MVPs and FFPs, how to utilize cloud services to scale up, and finally, how to deliver the product to targeted users.
Things to remember
a. DL has been applied to many problems in various fields, including but not limited to CV, NLP, RL, and generative modeling.
b. An in-depth understanding of DL is crucial for those leading DL projects, regardless of their position or background.
c. This book provides a complete picture of a DL project by describing how DL projects are carried out from project planning to deployment.
Overview of DL projects
While DL and other software engineering projects have a lot in common, DL projects emphasize planning, due to the extensive need for resources, mainly coming from the complexity of the models and the high volume of data involved. In general, DL projects can be split into the following phases:
- Project planning
- Building MVPs
- Building FFPs
- Deployment and maintenance
- Project evaluation
In this section, we provide high-level overviews of these phases. The following sections cover each phase in detail.
As the first step, the project lead must clearly define what needs to be achieved by the project and understand groups that can affect or be affected by the project. The evaluation metrics need to be defined and agreed upon, as they will be revisited during project evaluation. Then, the team members group together to discuss individual responsibilities and achieve business objectives using available resources. This process naturally leads to a timeline, an estimate of how long the project would take. Overall, project planning should result in the generation of a document called a playbook, which includes a thorough description of how the project will be carried out and evaluated.
Building minimum viable products
Once the direction is clear for everyone, the next step is to build an MVP, a simplistic version of the target deliverable that showcases the project’s value. Another important aspect of this phase is to understand the project’s difficulties and reject paths with greater risks or less promising outcomes by following the fail fast, fail often philosophy. Therefore, data scientists and engineers typically work with partially sampled datasets in development settings and ignore insignificant optimizations.
Building fully featured products
Once the feasibility of the project has been confirmed by the MVP, it must be packaged into an FFP. This phase aims to polish up the MVP to build a production-ready deliverable with various optimizations. In the case of DL projects, additional data preparation techniques are introduced to improve the quality of input data, or the model pipeline gets augmented slightly for greater model performance. Additionally, the data preparation pipeline and model training pipeline may be migrated to the cloud, exploiting various web services for higher throughput and quality. In this case, the whole pipeline often gets reimplemented using different tools and services. This book focuses on Amazon Web Services (AWS), the most popular web service for handling high volumes of data and expensive computations.
Deployment and maintenance
In many cases, the deployment settings are different from the development settings. Therefore, different sets of tools are often involved when moving an FFP to production. Furthermore, deployment may introduce problems that weren’t visible during development, which mainly arise as a result of limited computational resources. Consequently, many engineers and scientists spend additional time improving the user experience during this phase. Most people believe that deployment is the last step. However, there is one more step: maintenance. The quality of data and model performance needs to be monitored consistently to provide stable services to targeted users.
In the last phase, project evaluation, the team should revisit the discussions made during project planning to evaluate whether the project has been carried out successfully or not. Furthermore, the details of the project need to be recorded, and potential improvements must be discussed so that the next projects can be achieved more efficiently.
Things to remember
a. The phases within DL projects are split into project planning, building MVPs, building FFPs, deployment and maintenance, and project evaluation.
b. During the project planning phase, the project goal and evaluation metrics are defined, and the team discusses an individual's responsibility, available resources, and the timeline for the project.
c. The purpose of building an MVP is to understand the difficulties of the project and reject paths that pose greater risks or offer less promising outcomes.
d. The FFP is a production-ready deliverable that is an optimized version of the MVP. The data preparation pipeline and model training pipeline may be migrated to the cloud, exploiting various web services for higher throughput and quality.
e. Deployment settings often provide limited computational resources. In this case, the system needs to be tuned to provide stable services to target users.
f. Upon the completion of the project, the team needs to revisit the timeline, assigned responsibilities, and business requirements to evaluate the success of the project.
Planning a DL project
Every project starts with planning. Throughout the planning, the purpose of the project needs to be clearly defined, and key members should have a thorough understanding of the available resources that can be allocated to the project. Once team members and stakeholders are identified, the next step is to discuss each individual’s responsibility and create a timeline for the project.
This phase should result in a well-documented project playbook that precisely defines business objectives and how the project will be evaluated. A typical playbook contains an overview of key deliverables, a list of stakeholders, a Gantt chart defining steps and bottlenecks, definitions of responsibilities, timelines, and evaluation criteria. In the case of highly complex projects, following the Project Management Body of Knowledge (PMBOK®) Guide (https://www.pmi.org/pmbok-guide-standards/foundational/pmbok) and considering every knowledge domain (for example, integration management, project scope management, and time management) are strongly recommended. Of course, other project management frameworks exist, such as PRINCE2 (https://www.prince2.com/usa/what-is-prince2), which can provide a good starting point. Once the playbook is constructed, every stakeholder must review and revise it until everyone agrees with the contents.
In real life, many people underestimate the importance of planning. Especially in start-ups, engineers are eager to dive into MVP development and spend minimal time planning. However, it is especially dangerous to do so in the case of DL projects because the training process can quickly drain all the available resources.
Defining goal and evaluation metrics
The very first step of planning is to understand what purpose the project serves. The goal might be developing a new product, improving the performance of an existing service, or saving on operational costs. The motivation of the project naturally helps define the evaluation metrics.
In the case of DL projects, there are two types of evaluation metrics: business-related metrics and model-based metrics. Some examples of business-related metrics are as follows: conversion rate, click-through rate (CTR), lifetime value, user engagement measure, savings in operational cost, return on investment (ROI), and revenue. These are commonly used in advertising, marketing, and product recommendation verticals.
On the other hand, model-based metrics include accuracy, precision, recall, F1-score, rank accuracy metrics, mean absolute error (MAE), mean squared error (MSE), root-mean-square error (RMSE), and normalized mean absolute error (NMAE). In general, tradeoffs can be made between the various metrics. For example, a slight decrease in accuracy may be acceptable if meeting latency requirements is more critical to the project.
Along with the target evaluation metric, which differs from project to project, there are other metrics that are commonly found in most projects. These are due dates and resource usage. The target state must be reached by a certain date using available resources.
The goal and corresponding evaluation metrics need to be fair. If the goal is too hard to achieve, project members can possibly lose motivation. If the metric for the evaluation is not correct, understanding the impact of the project becomes difficult. As a result, it is recommended that the selected evaluation metrics are shared with others and considered fair for everyone.
Figure 1.4 – A sample playbook with the project description section filled out
As shown in Figure 1.4, the first section of a playbook begins with a general description, an estimated complexity of the technical aspects, and a list of required tools and frameworks. Next, it clearly describes the objective of the project and how the project will be evaluated.
In the same way that the term stakeholder is used for a business, a stakeholder for a project refers to a person or group who can affect or be affected by the project. Stakeholders can be grouped into two types, internal and external. Internal stakeholders are those that are directly involved in project executions, while external stakeholders may be outside of the circle, supporting the project execution in an indirect way.
Each stakeholder has a different role within the project. First, we’ll look at internal stakeholders. Internal stakeholders are the main drivers of the project. Therefore, they work closely together to process and analyze data, develop a model, and build deliverables. Table 1.1 lists internal stakeholders that are commonly found in DL projects:
Table 1.1 – Common internal stakeholders for DL projects
On the other hand, external stakeholders often play supportive roles, such as collecting necessary data for the project or providing feedback about the deliverable. In Table 1.2, we describe some common external stakeholders for DL projects:
Table 1.2 – Common external stakeholders for DL projects
Stakeholders are described in the second section of a playbook. As shown in Figure 1.4, the playbook must list stakeholders and their responsibilities in the project.
A milestone refers to a point in a project where a significant event occurs. Therefore, there is a set of requirements leading up to a milestone. Once the requirements are met, a milestone can be claimed to have been reached. One of the most important steps in project planning is defining milestones and their associated tasks. The ordering of tasks that lead to the goal is called the critical path. It is worth mentioning that tasks don’t need to be tackled sequentially all the time. The understanding of a critical path is important because it allows the team to prioritize tasks appropriately to ensure the success of the project.
In this step, it is also critical to identify level-of-effort (LOE) activities and supportive activities, which are required for project execution. In the case of software development projects, LOE activities include supplementary tasks, such as setting up Git repositories or reviewing others’ merge requests. The following figure (Figure 1.5) describes a typical critical path for a DL project. It will be structured differently if the underlying project consists of different tasks, requirements, technologies, and desired levels of detail:
Figure 1.5 – A typical critical path for a DL project
For a DL project, there are two main resources that require explicit resource allocations: human and computational resources. Human resources refer to employees that will actively work on individual tasks. In general, they hold positions in data engineering, data science, DevOps, or software engineering. When people talk about human resources, they often consider headcount only. However, the knowledge and skills that individuals hold are other critical factors. Human resources are closely related to how fast the project can be carried out.
Computational resources refer to hardware and software resources that are allocated to the project. Unlike typical software engineering projects, such as mobile app development or web page development, DL projects require heavy computation and large amounts of data. Common techniques for speeding up the development process involve parallelism or using computationally stronger machines. In some cases, tradeoffs need to be made between them, as a single machine of high computational power can cost more than multiple machines of low computational power.
Overall, novel DL pipelines take advantage of flexible and stateless resources, such as AWS Spot instances with fault-tolerant code. Besides hardware resources, there are frameworks and services that may provide necessary features out of the box. If the necessary service requires a payment, it is important to understand how it can change the project execution and what the demand on human resources would be if the team decided to handle it in-house.
In this step, available resources need to be allocated to each task. Figure 1.6 lists the tasks described in the previous section and describes the allocated resources, along with estimates of operational costs. Each task has its own risk level indicator. It is designed for a small team of three people working on a simple DL project with limited computational resources on a couple of AWS Elastic Compute Cloud (EC2) instances for around 4 to 6 months. Please note that the cost estimation of human resources is not included in the example, as it differs a lot depending on geographic location and team seniority:
Figure 1.6 – A sample resource allocation section of a DL project
Before we move on to the next step, we would like to mention that it is important to set aside a portion of the resources as a backup, in case the milestone requires more resources than that have been allocated.
Defining a timeline
Now that we know the available resources, we should be able to construct a timeline for the project. In this step, the team needs to discuss how long each step would take to complete. It is worth mentioning that things don’t work out as planned all the time. There will be many difficulties throughout the project that can delay the delivery of the final product.
Therefore, including buffers within the timeline is a common practice in many organizations. It is important that every stakeholder agrees with the timeline. If anyone believes that it’s not reasonable, the adjustment needs to be made right away. Figure 1.7 is a sample Gantt chart with the most likely estimated timeline for the information presented in Figure 1.6:
Figure 1.7 – A sample Gantt chart describing the timeline
It is worth mentioning that the chart can also be used to monitor the progress of each task and the overall project. In such cases, additional comments or visualizations summarizing the progress can be attached to each indicator bar.
Managing a project
Another important aspect of a DL project that needs to be discussed during the project planning phase is the process that the team will follow to update other team members and ensure on-time delivery of the project. Out of various project management methodologies, Agile fits perfectly, as it helps to split work into smaller parts and quickly iterate over development cycles until the FFP emerges. As DL projects are generally considered highly complex, it is more convenient to work within short cycles of research, development, and optimization phases. At the end of each cycle, stakeholders review results and adjust their long-term goals. Agile methodology is particularly suitable for small teams of experienced individuals. In a typical setting, 2-week sprints are found to be the most effective, especially when the short-term goals are clearly defined.
During a sprint meeting, the team reviews goals from the last sprint and defines goals for the upcoming sprint. It is also recommended to have short daily meetings to go over work performed on the previous day and plan for the upcoming day, as this process can help the team to quickly recognize bottlenecks and adjust their priorities as necessary. Commonly used tools for this process are Jira, Asana, and Quickbase. The majority of the aforementioned tools also support budget management, timeline reviewing, idea management, and resource tracking.
Things to remember
a. Project planning should result in a playbook that clearly describes what purpose the project serves and how the team will move together to reach a particular goal state.
b. The first step of project planning is to define a goal and its corresponding evaluation metrics. In the case of DL projects, there are two types of evaluation metrics: business-related metrics and model-based metrics.
c. A stakeholder refers to a person or a group who can affect or be affected by the project. Stakeholders can be grouped into two types: internal and external.
d. The next stage of project planning is task organization. The team needs to identify milestones, identify tasks, along with LOE activities, and understand the critical path.
e. For DL projects, there are two main resources that require explicit resource allocation: human and computational resources. During resource allocation, it is important to put aside a portion of the resources as a backup.
f. When estimating the timeline for the project, it must be shared within the team, and every stakeholder must agree with the schedule.
g. Agile methodology is a perfect fit for managing DL projects, as it helps to split work into smaller parts and quickly iterate over development cycles.
This chapter is an introduction to our journey. In the first two sections, we have described where DL sits within the wider picture of AI and how it continually shapes our daily lives. The key takeaways are the fact that DL is highly flexible due to its unique model architecture and the fact that DL has been actively adopted to the domain which traditional ML techniques have failed to demonstrate notable accomplishments.
Then, we have provided a high-level view of the DL project. In general, DL projects can be split into the following phases: project planning, building MVPs, building FFPs, development and maintenance, and project evaluation.
The main contents of this chapter covered the most important step of the DL project: project planning. In this phase, the purpose of the project needs to be clearly defined, along with the evaluation metrics, everyone must have a solid understanding of the stakeholders and their respective roles, and lastly, the tasks, milestones, and timeline need to be agreed upon by the participants. The outcome of this phase would be a well-formatted document called a playbook. In the next chapter, we will learn how to prepare data for DL projects.
Here is a list of references that can help you gain more knowledge about the topics that are relevant to this chapter. The following research papers summarize popular use cases of DL:
- Gradient-based learning applied to document recognition by LeCun et al.
- ImageNet: A Large-Scale Hierarchical Image Database by Deng et al.
- A Neural Probabilistic Language Model by Bengio et al.
- Speech Recognition with Deep Recurrent Neural Networks by Grave et al.
- An Introduction to Deep Reinforcement Learning by François-Lavet et al.
- Generative modeling
- Generative Adversarial Networks by Goodfellow et al.