Computer vision tasks
Deep learning achieves state-of-the-art results in many CV tasks. The most common CV task is image classification, in which a deep learning model assigns a class label to a given image. Recent advancements in deep learning, however, allow computers to perform more advanced vision tasks.
There are many such tasks, but this book focuses on the more common and important ones: object detection, instance segmentation, keypoint detection, semantic segmentation, and panoptic segmentation. These tasks can be hard to tell apart, so Figure 1.1 depicts the differences between them. This section outlines what they are and when to use them, and the rest of the book focuses on how to implement them using Detectron2. Let’s get started!
![Figure 1.1: Common computer vision tasks](https://static.packt-cdn.com/products/9781800561625/graphics/image/B16704_01_01.jpg)
Figure 1.1: Common computer vision tasks
Object detection
Object detection generally includes object localization and classification. Specifically, deep learning models for this task predict where objects of interest are in an image by drawing bounding boxes around these objects (localization). These models also classify the detected objects into types of interest (classification).
One example of this task is identifying people in pictures and drawing bounding boxes around the detected humans (localization only), as shown in Figure 1.1 (b). Another example is detecting road damage in a recorded road image by drawing bounding boxes around the damage (localization) and further classifying the damage into types such as longitudinal cracks, transverse cracks, alligator cracks, and potholes (classification).
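To make the two parts of object detection concrete, the following is a minimal sketch of what a detector's output might look like for the road damage example. The `Detection` class, the `keep_confident` helper, the confidence scores, and all coordinates are hypothetical and chosen only for illustration; this is not Detectron2's output format.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure, not a Detectron2 class)."""
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels: the localization part
    label: str    # predicted class name: the classification part
    score: float  # assumed model confidence in [0, 1]


def keep_confident(detections, threshold=0.5):
    """Keep only the detections whose score meets the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up predictions for one road image (illustrative values only).
predictions = [
    Detection((40, 120, 310, 150), "longitudinal crack", 0.91),
    Detection((200, 300, 260, 350), "pothole", 0.78),
    Detection((10, 10, 30, 25), "alligator crack", 0.22),
]

confident = keep_confident(predictions, threshold=0.5)
for d in confident:
    print(d.label, d.box, d.score)
```

Each entry pairs a box (where the object is) with a label (what it is), which is the combination this task is defined by.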
Instance segmentation
Like object detection, instance segmentation also involves object localization and classification. However, instance segmentation takes things one step further while localizing the detected objects of interest.
Specifically, besides classification, models for this task localize the detected objects at the pixel level. In other words, they identify all the pixels of each detected object. Instance segmentation is needed in applications that require the shapes of the detected objects in images and need to track every individual object. Figure 1.1 (c) shows the instance segmentation result on the input image in Figure 1.1 (a): besides the bounding boxes, every pixel of each person is also highlighted.
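The difference between a box and a pixel-level mask can be sketched with a toy example. The two masks below are invented 4x6 grids, and `mask_to_box` and `mask_area` are hypothetical helpers written in plain Python for illustration rather than anything taken from Detectron2.

```python
# Two hypothetical instance masks on a 4x6 image (1 = pixel belongs to the object).
mask_person_1 = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
mask_person_2 = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
]


def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around a binary mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return (min(xs), min(ys), max(xs), max(ys))


def mask_area(mask):
    """Count the pixels that belong to the object."""
    return sum(sum(row) for row in mask)


for name, mask in [("person 1", mask_person_1), ("person 2", mask_person_2)]:
    print(name, "box:", mask_to_box(mask), "pixels:", mask_area(mask))
```

A box can always be derived from a mask, but not the other way around, which is why instance segmentation is the richer output when object shapes matter.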
Keypoint detection
Besides detecting objects, keypoint detection also indicates important parts of the detected objects, called keypoints. These keypoints describe the essential traits of a detected object, and these traits are often invariant to image rotation, scaling, translation, or distortion. For instance, the keypoints of humans include the eyes, nose, shoulders, elbows, hands, knees, and feet. Keypoint detection is important for applications such as action recognition, pose estimation, or face detection. Figure 1.1 (d) shows the keypoint detection result on the input image in Figure 1.1 (a): besides the bounding boxes, it highlights all keypoints for every detected individual.
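As a rough illustration of why keypoints capture a trait that survives translation, the sketch below stores a few keypoints as named pixel coordinates and compares the offsets between connected keypoints before and after shifting the whole person. The keypoint names, coordinates, skeleton pairs, and helper functions are all assumptions made for this example.

```python
# Hypothetical keypoints for one detected person: name -> (x, y) in pixels.
keypoints = {
    "nose": (50, 20),
    "left_shoulder": (40, 40),
    "right_shoulder": (60, 40),
    "left_elbow": (35, 60),
    "right_elbow": (65, 60),
}

# Pairs of keypoints joined into a simple skeleton (an illustrative choice).
skeleton = [
    ("left_shoulder", "right_shoulder"),
    ("left_shoulder", "left_elbow"),
    ("right_shoulder", "right_elbow"),
]


def translate(points, dx, dy):
    """Shift every keypoint by the same offset, as if the person moved."""
    return {name: (x + dx, y + dy) for name, (x, y) in points.items()}


def limb_vectors(points, edges):
    """Offset from the first to the second keypoint of every skeleton edge."""
    return [
        (points[b][0] - points[a][0], points[b][1] - points[a][1])
        for a, b in edges
    ]


shifted = translate(keypoints, dx=100, dy=-5)
print(limb_vectors(keypoints, skeleton))
print(limb_vectors(shifted, skeleton))
```

Both calls print the same offsets: the absolute coordinates change when the person moves, but the relative layout of the keypoints, which describes the pose, does not.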
Semantic segmentation
A semantic segmentation task does not detect specific instances of objects but classifies each pixel in an image into one of several classes of interest. For instance, in a self-driving car application, a model for this task classifies regions of images into pedestrians, roads, cars, trees, buildings, and the sky. This task is important when an application needs a broader view of groups of objects with different classes (i.e., a higher level of understanding of the image). Specifically, if several instances of the same class occupy one region, they are grouped into one mask instead of each individual having its own mask.
One example application of semantic segmentation is segmenting images into foreground objects and background objects (e.g., to blur the background and give a portrait image a more artistic look). Figure 1.1 (e) shows the semantic segmentation result on the input image in Figure 1.1 (a): the input picture is divided into regions classified as foreground objects (people) and background objects such as the sky, a mountain, dirt, grass, and a tree.
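A semantic segmentation result can be thought of as a grid with one class ID per pixel. The following sketch uses an invented 4x6 label map in which two people stand side by side; the class IDs, `CLASS_NAMES`, and both helper functions are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter

# Hypothetical class IDs for this toy example.
CLASS_NAMES = {0: "sky", 1: "grass", 2: "person"}

# One class ID per pixel. The block of 2s covers two adjacent people,
# but the label map cannot tell them apart.
label_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 2, 2, 0],
    [1, 2, 2, 2, 2, 1],
    [1, 1, 1, 1, 1, 1],
]


def pixel_counts(labels):
    """Count how many pixels were assigned to each class name."""
    counts = Counter(class_id for row in labels for class_id in row)
    return {CLASS_NAMES[class_id]: n for class_id, n in counts.items()}


def foreground_mask(labels, foreground_ids):
    """Binary mask that is 1 where the pixel belongs to a foreground class."""
    return [[1 if c in foreground_ids else 0 for c in row] for row in labels]


print(pixel_counts(label_map))
for row in foreground_mask(label_map, foreground_ids={2}):
    print(row)
```

A mask like the one produced by `foreground_mask` is what a background-blur effect would need: every pixel where the mask is 0 gets blurred, and the rest is left sharp.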
Panoptic segmentation
Panoptic literally means “everything visible in the image”. In other words, panoptic segmentation can be viewed as combining instance segmentation and semantic segmentation, which gives a unified and global view of segmentation. Generally, it classifies objects in an image into foreground objects (which have well-defined shapes) and background objects (which lack well-defined shapes and are better described as textures or materials).
Examples of foreground objects include people, animals, and cars. Likewise, examples of background objects include the sky, dirt, trees, mountains, and grass. Unlike semantic segmentation, panoptic segmentation does not group adjacent individual objects of the same class into one region. Figure 1.1 (f) shows the panoptic segmentation result on the input image in Figure 1.1 (a). It looks similar to the semantic segmentation result, except that it highlights the individual instances separately.
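One way to picture this combination is to pair each pixel's class with an instance number. The sketch below is a simplified illustration under that assumption, using invented 4x6 grids; the class IDs, the convention that instance 0 means "background, no instance", and the `to_panoptic` helper are hypothetical and do not reflect how Detectron2 encodes its results.

```python
# Hypothetical class IDs: 0 = sky, 1 = grass, 2 = person.
semantic = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 2, 2, 0],
    [1, 2, 2, 2, 2, 1],
    [1, 1, 1, 1, 1, 1],
]

# Instance IDs: 0 = background (no instance), 1..N = separate foreground objects.
instances = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
]


def to_panoptic(semantic_map, instance_map):
    """Pair every pixel's class ID with its instance ID."""
    return [
        list(zip(semantic_row, instance_row))
        for semantic_row, instance_row in zip(semantic_map, instance_map)
    ]


panoptic = to_panoptic(semantic, instances)
segments = sorted({segment for row in panoptic for segment in row})
print(segments)
```

The semantic map alone contains three regions (sky, grass, person), whereas the combined result contains four segments because the two adjacent people are kept apart.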
Important note – other CV tasks
There are other advanced CV projects developed on top of Detectron2, such as DensePose and PointRend. However, Chapter 2 of this book focuses on developing CV applications for the more common tasks: object detection, instance segmentation, keypoint detection, semantic segmentation, and panoptic segmentation. Part 2 and Part 3 of this book then explore developing custom CV applications for the two most important tasks (object detection and instance segmentation). There is also a section that describes how to use PointRend to improve instance segmentation quality. Additionally, it is relatively easy to extend the code to other tasks once you understand these ones.
Let’s get started by getting to know Detectron2 and its architecture!