Building Custom Object Detectors

This chapter delves deeper into object detection, which is one of the most common challenges in computer vision. Having come this far in the book, you are perhaps wondering when you will be able to put computer vision into practice on the streets. Do you dream of building a system to detect cars and people? If so, you are not far from your goal.

We have already looked at some specific cases of object detection and recognition in previous chapters. We focused on upright, frontal human faces in Chapter 5, Detecting and Recognizing Faces, and on objects with corner-like or blob-like features in Chapter 6, Retrieving Images and Searching Using Image Descriptors. Now, in the current chapter, we will explore algorithms that have a good ability to generalize or extrapolate, in the sense that they can cope with the real-world...

Technical requirements

This chapter uses Python, OpenCV, and NumPy. Please refer back to Chapter 1, Setting Up OpenCV, for installation instructions.

The completed code for this chapter can be found in this book's GitHub repository, at https://github.com/PacktPublishing/Learning-OpenCV-4-Computer-Vision-with-Python-Third-Edition, in the chapter07 folder. Sample images can be found in the repository in the images folder.

Understanding HOG descriptors

Histogram of oriented gradients (HOG) is a feature descriptor, so it belongs to the same family of algorithms as scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and Oriented FAST and rotated BRIEF (ORB), which we covered in Chapter 6, Retrieving Images and Searching Using Image Descriptors. Like other feature descriptors, HOG delivers the type of information that is vital for feature matching, as well as for object detection and recognition. Most commonly, HOG is used for object detection. The algorithm, and in particular its use as a people detector, was popularized by Navneet Dalal and Bill Triggs in their paper Histograms of Oriented Gradients for Human Detection (INRIA, 2005), which is available online at https://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf.

HOG's internal mechanism is really clever...
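To make the name concrete, here is a minimal sketch of the idea behind a single HOG cell: a histogram of gradient orientations, weighted by gradient magnitude. It is a simplification for illustration only, not OpenCV's implementation. The function name cell_histogram, the choice of 9 bins over 0 to 180 degrees, and the tiny test patch are all assumptions, and the sketch leaves out steps such as block normalization.

```python
import math

def cell_histogram(cell, bins=9):
    """Return a magnitude-weighted histogram of unsigned gradient orientations.

    `cell` is a 2D list of grayscale intensities. Border pixels are skipped
    so that central differences are always defined.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences approximate the horizontal and vertical gradients.
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation, folded into the range [0, 180).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The final modulo guards against a rounding result of exactly 180.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# A hypothetical 4x4 patch with a vertical edge: dark on the left, bright on the right.
patch = [[0, 0, 255, 255]] * 4
histogram = cell_histogram(patch)
print(histogram)
```

For this patch, every interior pixel has a purely horizontal gradient, so all of the weight lands in the first bin.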

Understanding NMS

The concept of non-maximum suppression (NMS) might sound simple: from a set of overlapping solutions, just pick the best one! However, the implementation is more complex than you might initially think. Remember the image pyramid? Overlapping detections can occur at different scales, so we must gather up all our positive detections and convert their bounds back to a common scale before we check for overlap. A typical implementation of NMS takes the following approach:

1. Construct an image pyramid.
2. Scan each level of the pyramid with the sliding window approach, for object detection. For each window that yields a positive detection (beyond a certain arbitrary confidence threshold), convert the window back to the original image's scale. Add the window and its confidence score to a list of positive detections.
3. Sort the list of positive detections by order of descending confidence score...
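The sorting step, and one common way of using the sorted list, can be sketched as follows. This is a greedy variant that we assume for illustration: it keeps the highest-scoring detection, discards any remaining detection that overlaps it too much, and repeats. The helper names iou and non_max_suppression, the (x1, y1, x2, y2) box layout, the 0.5 overlap threshold, and the sample detections are all hypothetical. The boxes are assumed to be in the original image's scale already.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, overlap_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs.

    Boxes must already be expressed in the original image's scale.
    """
    # Sort by descending confidence score.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard every remaining detection that overlaps the kept one too much.
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) <= overlap_threshold]
    return kept

# Made-up detections for illustration.
detections = [
    ((20, 20, 120, 120), 0.8),    # overlaps heavily with the next box
    ((10, 10, 110, 110), 0.9),
    ((200, 200, 300, 300), 0.7),  # far away from the others
]
result = non_max_suppression(detections)
print(result)
```

Here, the 0.8 detection is suppressed because it overlaps the 0.9 detection by more than the threshold, while the distant 0.7 detection survives.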

Understanding SVMs

Without going into the details of how a support vector machine (SVM) works, let's try to grasp what it can help us accomplish in the context of machine learning and computer vision. Given labeled training data, an SVM learns to classify the same kind of data by finding an optimal hyperplane, which, in plain English, is the plane that divides differently labeled data by the largest possible margin. To aid our understanding, let's consider the following diagram, which is provided by Zach Weinberg under the Creative Commons Attribution-Share Alike 3.0 Unported License:

Hyperplane H₁ (shown as a green line) does not divide the two classes (the black dots versus the white dots). Hyperplanes H₂ (shown as a blue line) and H₃ (shown as a red line) both divide the classes; however, only hyperplane H₃ divides the classes by a maximal margin.
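To put numbers on the idea of a margin, here is a small sketch that measures how well three candidate lines divide a toy dataset. It does not train an SVM; it only evaluates hand-picked lines. The points, labels, and line coefficients are made up for illustration and are not taken from the diagram, although the names h1, h2, and h3 are chosen to mirror its three cases.

```python
import math

def margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    A negative result means that at least one point lies on the wrong side,
    so the line does not separate the two classes.
    """
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * x + w[1] * y + b) / norm
               for (x, y), label in zip(points, labels))

# A made-up, linearly separable dataset with labels -1 and +1.
points = [(1, 1), (2, 1), (4, 4), (5, 4)]
labels = [-1, -1, 1, 1]

# Candidate lines as (w, b) pairs.
candidates = {
    "h1": ((0.0, 1.0), -0.5),   # y = 0.5: fails to separate the classes
    "h2": ((0.0, 1.0), -1.5),   # y = 1.5: separates them, but narrowly
    "h3": ((2.0, 3.0), -13.5),  # 2x + 3y = 13.5: separates them widely
}

margins = {name: margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Among these candidates, h3 has the largest margin, which is the property an SVM optimizes for when it searches over all possible hyperplanes.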

Let's suppose we are training an...

Detecting people with HOG descriptors

OpenCV comes with a class called cv2.HOGDescriptor, which is capable of performing people detection. The interface has some similarities to the cv2.CascadeClassifier class that we used in Chapter 5, Detecting and Recognizing Faces. However, unlike cv2.CascadeClassifier, cv2.HOGDescriptor sometimes returns nested detection rectangles. In other words, cv2.HOGDescriptor might tell us that it detected one person whose bounding rectangle is located completely inside another person's bounding rectangle. This situation really is possible; for example, a child could be standing in front of an adult, and the child's bounding rectangle could be completely inside the adult's bounding rectangle. However, in a typical situation, nested detections are probably errors, so cv2.HOGDescriptor is often used along with code to filter out any nested...
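A filter for nested detections can be sketched in a few lines. The version below is one possible approach, not necessarily the one used in this chapter's completed code. It assumes that rectangles are (x, y, w, h) tuples, which is the layout of the rectangles returned by cv2.HOGDescriptor's detectMultiScale method. The function names is_inside and drop_nested, and the sample rectangles, are hypothetical.

```python
def is_inside(inner, outer):
    """Return True if rectangle `inner` lies strictly inside `outer`.

    Both rectangles are (x, y, w, h) tuples.
    """
    ix, iy, iw, ih = inner
    ox, oy, ow, oh = outer
    return (ix > ox and iy > oy and
            ix + iw < ox + ow and iy + ih < oy + oh)

def drop_nested(rects):
    """Keep only the rectangles that are not strictly inside another one."""
    return [r for i, r in enumerate(rects)
            if not any(i != j and is_inside(r, other)
                       for j, other in enumerate(rects))]

# Made-up detections: the second rectangle is nested inside the first.
rects = [(50, 40, 200, 400), (90, 120, 80, 160), (300, 60, 150, 300)]
filtered = drop_nested(rects)
print(filtered)
```

In a real script, rects would typically come from a detector configured with hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector()) and queried with hog.detectMultiScale(image), which returns the rectangles along with their weights.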

Creating and training an object detector

Using a pre-trained detector makes it easy to build a quick prototype, and we are all very grateful to the OpenCV developers for making such useful capabilities as face detection and people detection readily available. However, whether you are a hobbyist or a computer vision professional, it is unlikely that you will only deal with people and faces.

Moreover, if you are like the authors of this book, you will wonder how the people detector was created in the first place and whether you can improve it. Furthermore, you may also wonder whether you can apply the same concepts to detect diverse objects, ranging from cars to goblins.

Indeed, in industry, you may have to detect very specific objects, such as registration plates, book covers, or whatever else matters most to your employer or client.

Thus, the...

Detecting cars

To train any kind of classifier, we must begin by creating or acquiring a training dataset. We are going to train a car detector, so our dataset must contain positive samples that represent cars, as well as negative samples that represent other (non-car) things that the detector is likely to encounter while looking for cars. For example, if the detector is intended to search for cars on a street, then a picture of a curb, a crosswalk, a pedestrian, or a bicycle might be a more representative negative sample than a picture of the rings of Saturn. Besides representing the expected subject matter, ideally, the training samples should represent the way our particular camera and algorithm will see the subject matter.

Ultimately, in this chapter, we intend to use a sliding window of fixed size, so it is important that our training samples conform to a fixed size, and...
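To see why a fixed window size matters, consider the following sketch of a sliding window combined with an image pyramid. It works only with image dimensions, not pixel data, so that it stays self-contained. The function names, the 100x40 window, the 20-pixel step, the 1.5 scale factor, and the 300x120 image are all assumptions chosen for illustration.

```python
def pyramid_sizes(width, height, scale=1.5, min_size=(100, 40)):
    """Yield the (width, height) of each pyramid level, largest first.

    The pyramid stops once a level is too small to contain one window.
    """
    while width >= min_size[0] and height >= min_size[1]:
        yield width, height
        width, height = int(width / scale), int(height / scale)

def sliding_window(width, height, step, window_size):
    """Yield the top-left (x, y) of every window that fits fully in the image."""
    win_w, win_h = window_size
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            yield x, y

window = (100, 40)  # hypothetical fixed window size, as (width, height)
levels = list(pyramid_sizes(300, 120, scale=1.5, min_size=window))
positions = list(sliding_window(300, 120, step=20, window_size=window))
print(levels)
print(len(positions))
```

Because every window has the same size at every pyramid level, a classifier trained on samples of that size can be applied to each window without any further resizing.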

Summary

In this chapter, we covered a wide range of concepts and techniques, including HOG, bag-of-words (BoW), SVMs, image pyramids, sliding windows, and NMS. We learned that these techniques have applications in object detection, as well as other fields. We wrote a script that combined most of these techniques (BoW, SVMs, an image pyramid, a sliding window, and NMS), and we gained practical experience in machine learning through the exercise of training and testing a custom detector. Finally, we demonstrated that we can detect cars!

Our new knowledge forms the foundation of the next chapter, in which we will utilize object detection and classification techniques on sequences of frames in videos. We will learn how to track objects and retain information about them – an important objective in many real-world applications.