You're reading from Modern Computer Vision with PyTorch

Product type Book

Published in Nov 2020

Publisher Packt

ISBN-13 9781839213472

Pages 824 pages

Edition 1st Edition

Languages

Python

Concepts

Computer Vision

Authors (2):

V Kishore Ayyadevara

Yeshwanth Reddy

View More author details

Advanced Object Detection

In the previous chapter, we learned about R-CNN and Fast R-CNN techniques, which leveraged region proposals to generate predictions of the locations of objects in an image along with the classes corresponding to objects in the image. Furthermore, we learned about the bottleneck of the speed of inference, which happens because of having two different models – one for region proposal generation and another for object detection. In this chapter, we will learn about different modern techniques, such as Faster R-CNN, YOLO, and Single-Shot Detector (SSD), that overcome slow inference time by employing a single model to make predictions for both the class of object and the bounding box in a single shot. We will start by learning about anchor boxes and then proceed to learn about how each of the techniques works and how to implement them to detect objects...

Components of modern object detection algorithms

The drawback of the R-CNN and Fast R-CNN techniques is that they have two disjointed networks – one to identify the regions that likely contain an object and the other to make corrections to the bounding box where an object is identified. Furthermore, both the models require as many forward propagations as there are region proposals. Modern object detection algorithms focus heavily on training a single neural network and have the capability to detect all objects in one forward pass. In the subsequent sections, we will learn about the various components of a typical modern object detection algorithm:

Anchor boxes
Region proposal network (RPN)
Region of interest pooling

Anchor boxes

So far, we have had region proposals coming from the selectivesearch method. Anchor boxes come in as a handy replacement for selective search – we will learn how they replace selectivesearch-based region proposals in this section.

Typically, a...

Training Faster R-CNN on a custom dataset

In the following code, we will train the Faster R-CNN algorithm to detect the bounding boxes around objects present in images. For this, we will work on the same truck versus bus detection exercise that we worked on in the previous chapter:

The following code is available as Training_Faster_RCNN.ipynb in the Chapter08 folder of this book's GitHub repository - https://tinyurl.com/mcvp-packt.

Download the dataset:

import os
if not os.path.exists('images'):
    !pip install -qU torch_snippets
    from google.colab import files
    files.upload() # upload kaggle.json
    !mkdir -p ~/.kaggle
    !mv kaggle.json ~/.kaggle/
    !ls ~/.kaggle
    !chmod 600 /root/.kaggle/kaggle.json
    !kaggle datasets download \
        -d sixhky/open-images-bus-trucks/
    !unzip -qq open-images-bus-trucks.zip
    !rm open-images-bus-trucks.zip

Read the DataFrame containing metadata of information about images and their bounding box, and classes:

from torch_snippets...

Working details of YOLO

You Only Look Once (YOLO) and its variants are one of the prominent object detection algorithms. In this section, we will understand at a high level how YOLO works and the potential limitations of R-CNN-based object detection frameworks that YOLO overcomes.

First, let's learn about the possible limitations of R-CNN-based detection algorithms. In Faster R-CNN, we slide over the image using anchor boxes and identify the regions that are likely to contain an object, and then we make the bounding box corrections. However, in the fully connected layer, where only the detected region's RoI pooling output is passed as input, in the case of regions that do not fully encompass the object (where the object is beyond the boundaries of the bounding box of region proposal), the network has to guess the real boundaries of object, as it has not seen the full image (but has seen only the region proposal).

YOLO comes in handy in such scenarios, as it looks at the whole...

Training YOLO on a custom dataset

Building on top of others' work is very important to becoming a successful practitioner in deep learning. For this implementation, we will use the official YOLO-v4 implementation to identify the location of buses and trucks in images. We will clone the repository of the authors' own implementation of YOLO and customize it to our needs in the following code.

The following code is available as Training_YOLO.ipynb in the Chapter08 folder of this book's GitHub repository - https://tinyurl.com/mcvp-packt.

Installing Darknet

First, pull the darknet repository from GitHub and compile it in the environment. The model is written in a separate language called Darknet, which is different from PyTorch. We will do so using the following code:

Pull the Git repo:

!git clone https://github.com/AlexeyAB/darknet
%cd darknet

Reconfigure the Makefile file:

!sed -i 's/OPENCV=0/OPENCV=1/' Makefile
# In case you dont have a GPU, make sure to comment...

Working details of SSD

So far, we have seen a scenario where we made predictions after gradually convolving and pooling the output from the previous layer. However, we know that different layers have different receptive fields to the original image. For example, the initial layers have a smaller receptive field when compared to the final layers, which have a larger receptive field. Here, we will learn how SSD leverages this phenomenon to come up with a prediction of bounding boxes for images.

The workings behind how SSD helps overcome the issue of detecting objects with different scales is as follows:

We leverage the pre-trained VGG network and extend it with a few additional layers until we obtain a 1 x 1 block.
Instead of leveraging only the final layer for bounding box and class predictions, we will leverage all of the last few layers to make class and bounding box predictions.
In place of anchor boxes, we will come up with default boxes that have a specific set of scale and aspect...

Training SSD on a custom dataset

In the following code, we will train the SSD algorithm to detect the bounding boxes around objects present in images. We will use the truck versus bus object detection task we have been working on:

The following code is available as Training_SSD.ipynb in the Chapter08 folder of this book's GitHub repository - https://tinyurl.com/mcvp-packt The code contains URLs to download data from and is moderately lengthy. We strongly recommend you to execute the notebook in GitHub to reproduce results while you understand the steps to perform and explanation of various code components from text.

Download the image dataset and clone the Git repository hosting the code for the model and the other utilities for processing the data:

import os
if not os.path.exists('open-images-bus-trucks'):
    !pip install -q torch_snippets
    !wget --quiet https://www.dropbox.com/s/agmzwk95v96ihic/\
    open-images-bus-trucks.tar.xz
    !tar -xf open-images-bus-trucks.tar...

Summary

In this chapter, we have learned about the working details of modern object detection algorithms: Faster R-CNN, YOLO, and SSD. We learned how they overcome the limitation of having two separate models – one for fetching region proposals and the other for fetching class and bounding box offsets on region proposals. Furthermore, we implemented Faster R-CNN using PyTorch, YOLO using darknet, and SSD from scratch.

In the next chapter, we will learn about image segmentation, which goes one step beyond object localization by identifying the pixels that correspond to an object.

Furthermore, in Chapter 15, Combining Computer Vision and NLP Techniques, we will learn about DETR, a transformer-based object detection algorithm, and in Chapter 10, Applications of Object Detection, and Segmentation, we will learn about the Detectron2 framework, which helps in not only detecting objects but also segmenting them in a single shot.