OpenCV 4 with Python Blueprints - Second Edition

By Dr. Menua Gevorgyan , Arsen Mamikonyan , Michael Beyeler
About this book
OpenCV is a native cross-platform C++ library for computer vision, machine learning, and image processing. It is increasingly being adopted in Python for development. This book will get you hands-on with a wide range of intermediate to advanced projects using the latest version of the framework and language, OpenCV 4 and Python 3.8, instead of only covering the core concepts of OpenCV in theoretical lessons. This updated second edition will guide you through working on independent hands-on projects that focus on essential OpenCV concepts such as image processing, object detection, image manipulation, object tracking, and 3D scene reconstruction, in addition to statistical learning and neural networks. You’ll begin with concepts such as image filters, Kinect depth sensor, and feature matching. As you advance, you’ll not only get hands-on with reconstructing and visualizing a scene in 3D but also learn to track visually salient objects. The book will help you further build on your skills by demonstrating how to recognize traffic signs and emotions on faces. Later, you’ll understand how to align images, and detect and track objects using neural networks. By the end of this OpenCV Python book, you’ll have gained hands-on experience and become proficient at developing advanced computer vision apps according to specific business needs.
Publication date:
March 2020
Publisher
Packt
Pages
366
ISBN
9781789801811

 

Hand Gesture Recognition Using a Kinect Depth Sensor

The goal of this chapter is to develop an app that detects and tracks simple hand gestures in real time, using the output of a depth sensor, such as that of a Microsoft Kinect 3D sensor or an ASUS Xtion sensor. The app will analyze each captured frame to perform the following tasks:

  • Hand region segmentation: The user's hand region will be extracted in each frame by analyzing the depth map output of the Kinect sensor, which is done by thresholding, applying some morphological operations, and finding connected components.
  • Hand shape analysis: The shape of the segmented hand region will be analyzed by determining contours, convex hull, and convexity defects.
  • Hand gesture recognition: The number of extended fingers will be determined based on the hand contour's convexity defects, and the gesture will be classified accordingly (with no extended fingers corresponding to a fist, and five extended fingers corresponding to an open hand).

Gesture recognition is an ever-popular topic in computer science. This is because it not only enables humans to communicate with machines (Human-Machine Interaction (HMI)) but also constitutes the first step for machines to begin understanding human body language. With affordable sensors such as Microsoft Kinect or Asus Xtion and open source software such as OpenKinect and OpenNI, it has never been easier to get started in the field yourself. So, what shall we do with all this technology?

In this chapter, we will cover the following topics:

  • Planning the app
  • Setting up the app
  • Tracking hand gestures in real time
  • Understanding hand region segmentation
  • Performing hand shape analysis
  • Performing hand gesture recognition

The beauty of the algorithm that we are going to implement in this chapter is that it works well for many hand gestures, yet it is simple enough to run in real time on a generic laptop. Also, if we want, we can easily extend it to incorporate more complicated hand-pose estimations.

Once you complete the app, you will understand how to use depth sensors in your own apps. You will learn how to compose shapes of interest from the depth information with OpenCV, as well as how to analyze those shapes using their geometric properties.

 

Getting started

This chapter requires you to have a Microsoft Kinect 3D sensor installed. Alternatively, you may install an Asus Xtion sensor or any other depth sensor for which OpenCV has built-in support.

First, install OpenKinect and libfreenect from http://www.openkinect.org/wiki/Getting_Started. You can find the code that we present in this chapter at our GitHub repository: https://github.com/PacktPublishing/OpenCV-4-with-Python-Blueprints-Second-Edition/tree/master/chapter2.

Let's first plan the application we are going to create in this chapter.

 

Planning the app

The final app will consist of the following modules and scripts:

  • gestures: This is a module that consists of an algorithm for recognizing hand gestures.
  • gestures.process: This is a function that implements the entire process flow of hand gesture recognition. It accepts a single-channel depth image (acquired from the Kinect depth sensor) and returns an annotated Blue, Green, Red (BGR) color image with an estimated number of extended fingers.
  • chapter2: This is the main script for the chapter.
  • chapter2.main: This is the main function routine that iterates over the frames acquired from the depth sensor, uses gestures.process to process each frame, and then illustrates the results.

The end product looks like this:

No matter how many fingers of a hand are extended, the algorithm correctly segments the hand region (white), draws the corresponding convex hull (the green line surrounding the hand), finds all convexity defects that belong to the spaces between fingers (large green points) while ignoring others (small red points), and infers the correct number of extended fingers (the number in the bottom-right corner), even for a fist.

Now, let's set up the application in the next section.

 

Setting up the app

Before we can get down to the nitty-gritty of our gesture recognition algorithm, we need to make sure that we can access the depth sensor and display a stream of depth frames. In this section, we will cover the following things that will help us set up the app:

  • Accessing the Kinect 3D sensor
  • Utilizing OpenNI-compatible sensors
  • Running the app and main function routine

First, we will look at how to use the Kinect 3D sensor.

Accessing the Kinect 3D sensor

The easiest way to access a Kinect sensor is by using an OpenKinect module called freenect. For installation instructions, take a look at the preceding section.

The freenect module has functions such as sync_get_depth() and sync_get_video(), used to obtain images synchronously from the depth sensor and camera sensor respectively. For this chapter, we will need only the Kinect depth map, which is a single-channel (grayscale) image in which each pixel value is the estimated distance from the camera to a particular surface in the visual scene.

Here, we will design a function that will read a frame from the sensor and convert it to the desired format, and return the frame together with a success status, as follows:

def read_frame() -> Tuple[bool, np.ndarray]:

The function consists of the following steps:

  1. Grab a frame; terminate the function if a frame was not acquired, like this:
    depth, timestamp = freenect.sync_get_depth()
    if depth is None:
        return False, None

The sync_get_depth method returns both the depth map and a timestamp. By default, the map is in an 11-bit format. The lower 10 bits encode the depth, while the leading bit, when equal to 1, indicates that the distance estimation was not successful.

  2. It is a good idea to standardize the data to an 8-bit precision format, as an 11-bit format cannot be visualized with cv2.imshow right away, and, in the future, we might want to use a different sensor that returns its data in a different format. The conversion is done as follows:
    np.clip(depth, 0, 2**10 - 1, depth)
    depth >>= 2

In the previous code, we first clip the values to 1,023 (or 2**10-1) so that they fit in 10 bits. This clipping assigns undetected distances to the farthest possible point. Next, we shift the values 2 bits to the right so that the distance fits in 8 bits.

  3. Finally, we convert the image into an 8-bit unsigned integer array and return the result, as follows:
    return True, depth.astype(np.uint8)

Now, the depth image can be visualized as follows:

cv2.imshow("depth", read_frame()[1]) 
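For reference, the preceding steps assemble into a single function along the following lines (a minimal sketch, assuming the freenect module is installed and a Kinect is connected):

from typing import Tuple

import freenect
import numpy as np


def read_frame() -> Tuple[bool, np.ndarray]:
    # Grab a synchronized depth frame (11-bit) together with its timestamp
    depth, timestamp = freenect.sync_get_depth()
    if depth is None:
        return False, None
    # Clip to 10 bits and shift right by 2 to fit the values into 8 bits
    np.clip(depth, 0, 2**10 - 1, depth)
    depth >>= 2
    return True, depth.astype(np.uint8)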

Let's see how to use OpenNI-compatible sensors in the next section.

Utilizing OpenNI-compatible sensors

To use an OpenNI-compatible sensor, you must first make sure that OpenNI2 is installed and that your version of OpenCV was built with the support of OpenNI. The build information can be printed as follows:

import cv2
print(cv2.getBuildInformation())

If your version was built with OpenNI support, you will find it under the Video I/O section. Otherwise, you will have to rebuild OpenCV with OpenNI support, which is done by passing the -D WITH_OPENNI2=ON flag to cmake.
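For example, a quick way to check for OpenNI support from Python is to filter the build report for the relevant lines (a small convenience snippet, not part of the chapter's code):

import cv2

# print every line of the build report that mentions OpenNI
for line in cv2.getBuildInformation().splitlines():
    if "OpenNI" in line:
        print(line.strip())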

After the installation process is complete, you can access the sensor similarly to other video input devices, using cv2.VideoCapture. In this app, in order to use an OpenNI-compatible sensor instead of a Kinect 3D sensor, you have to cover the following steps:

  1. Create a video capture that connects to your OpenNI-compatible sensor, like this:
    device = cv2.CAP_OPENNI
    capture = cv2.VideoCapture(device)

If you want to connect to Asus Xtion, the device variable should be assigned the cv2.CAP_OPENNI_ASUS value instead.

  2. Change the input frame size to the standard Video Graphics Array (VGA) resolution, as follows:
    capture.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    capture.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
  3. In the previous section, we designed the read_frame function, which accesses the Kinect sensor using freenect. In order to read depth images from the video capture, you have to change that function to the following one:
    def read_frame():
        if not capture.grab():
            return False, None
        return capture.retrieve(cv2.CAP_OPENNI_DEPTH_MAP)

You will note that we have used the grab and retrieve methods instead of the read method. The reason is that the read method of cv2.VideoCapture is inappropriate when we need to synchronize a set of cameras or a multi-head camera, such as a Kinect.

For such cases, you grab frames from multiple sensors at a certain moment in time with the grab method and then retrieve the data of the sensors of interest with the retrieve method. For example, in your own apps, you might also need to retrieve a BGR frame (standard camera frame), which can be done by passing cv2.CAP_OPENNI_BGR_IMAGE to the retrieve method.
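As an illustration, a single grab followed by two retrieve calls might look like this (a sketch, assuming the capture object created in step 1; the flag keyword argument is spelled out here for clarity):

if capture.grab():
    # retrieve the depth map and the standard BGR camera frame
    # captured at the same moment in time
    ok_depth, depth_map = capture.retrieve(flag=cv2.CAP_OPENNI_DEPTH_MAP)
    ok_bgr, bgr_image = capture.retrieve(flag=cv2.CAP_OPENNI_BGR_IMAGE)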

So, now that you can read data from your sensor, let's see how to run the application in the next section.

Running the app and main function routine

The chapter2.py script is responsible for running the app, and it first imports the following modules:

import cv2
import numpy as np
from gestures import recognize
from frame_reader import read_frame

The recognize function is responsible for recognizing a hand gesture, and we will compose it later in this chapter. We have also placed the read_frame method that we composed in the previous section in a separate script, for convenience.

In order to simplify the segmentation task, we will instruct the user to place their hand in the center of the screen. To provide a visual aid for this, we create the following function:

def draw_helpers(img_draw: np.ndarray) -> None:
    # draw some helpers for correctly placing hand
    height, width = img_draw.shape[:2]
    color = (0, 102, 255)
    cv2.circle(img_draw, (width // 2, height // 2), 3, color, 2)
    cv2.rectangle(img_draw, (width // 3, height // 3),
                  (width * 2 // 3, height * 2 // 3), color, 2)

The function draws a rectangle around the image center and highlights the center pixel of the image in orange.

All the heavy lifting is done by the main function, shown in the following code block:

def main():
    for _, frame in iter(read_frame, (False, None)):

The function iterates over grayscale frames from Kinect, and, in each iteration, it covers the following steps:

  1. Recognize hand gestures using the recognize function, which returns the estimated number of extended fingers (num_fingers) and an annotated BGR color image, as follows:
        num_fingers, img_draw = recognize(frame)
  2. Call the draw_helpers function on the annotated BGR image in order to provide a visual aid for hand placement, as follows:
        draw_helpers(img_draw)
  3. Finally, the main function draws the number of fingers on the annotated frame, displays results with cv2.imshow, and sets termination criteria, as follows:
        # print number of fingers on image
        cv2.putText(img_draw, str(num_fingers), (30, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255))
        cv2.imshow("frame", img_draw)
        # Exit on escape
        if cv2.waitKey(10) == 27:
            break
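Putting these steps together, chapter2.py ends up looking roughly like the following (a sketch that mirrors the snippets above; gestures and frame_reader are the chapter's own modules):

import cv2
import numpy as np

from gestures import recognize
from frame_reader import read_frame


def draw_helpers(img_draw: np.ndarray) -> None:
    # draw some helpers for correctly placing hand
    height, width = img_draw.shape[:2]
    color = (0, 102, 255)
    cv2.circle(img_draw, (width // 2, height // 2), 3, color, 2)
    cv2.rectangle(img_draw, (width // 3, height // 3),
                  (width * 2 // 3, height * 2 // 3), color, 2)


def main():
    # iterate until read_frame reports that no frame was acquired
    for _, frame in iter(read_frame, (False, None)):
        num_fingers, img_draw = recognize(frame)
        draw_helpers(img_draw)
        # print number of fingers on image
        cv2.putText(img_draw, str(num_fingers), (30, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255))
        cv2.imshow("frame", img_draw)
        # Exit on escape
        if cv2.waitKey(10) == 27:
            break


if __name__ == '__main__':
    main()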

So, now that we have the main script, you will note that the only function that we are missing is the recognize function. In order to track hand gestures, we need to compose this function, which we will do in the next section.

 

Tracking hand gestures in real time

Hand gestures are analyzed by the recognize function; this is where the real magic takes place. This function handles the entire process flow, from the raw grayscale image to a recognized hand gesture. It returns the number of fingers and the illustration frame. It implements the following procedure:

  1. It extracts the user's hand region by analyzing the depth map (img_gray), and returns a hand region mask (segment), like this:
    def recognize(img_gray: np.ndarray) -> Tuple[int, np.ndarray]:
        # segment arm region
        segment = segment_arm(img_gray)
  2. It performs contour analysis on the hand region mask (segment). Then, it returns the largest contour found in the image (contour) and any convexity defects (defects), as follows:
        # find the hull of the segmented area, and based on that find the
        # convexity defects
        contour, defects = find_hull_defects(segment)
  3. Based on the contour found and the convexity defects, it detects the number of extended fingers (num_fingers) in the image. Then, it creates an illustration image (img_draw) using the segment image as a template, and annotates it with the contour and defect points, like this:
        img_draw = cv2.cvtColor(segment, cv2.COLOR_GRAY2RGB)
        num_fingers, img_draw = detect_num_fingers(contour,
                                                   defects, img_draw)
  4. Finally, the estimated number of extended fingers (num_fingers), as well as the annotated output image (img_draw), are returned, as follows:
        return num_fingers, img_draw
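Assembled in one place, the function reads as follows (a sketch, assuming the segment_arm, find_hull_defects, and detect_num_fingers functions developed in the rest of this chapter):

from typing import Tuple

import cv2
import numpy as np


def recognize(img_gray: np.ndarray) -> Tuple[int, np.ndarray]:
    # segment arm region from the single-channel depth map
    segment = segment_arm(img_gray)
    # find the hull of the segmented area, and based on that find the
    # convexity defects
    contour, defects = find_hull_defects(segment)
    # detect the number of fingers and annotate the output image
    img_draw = cv2.cvtColor(segment, cv2.COLOR_GRAY2RGB)
    num_fingers, img_draw = detect_num_fingers(contour, defects, img_draw)
    return num_fingers, img_draw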

In the next section, let's learn how to accomplish hand region segmentation, which we used at the beginning of the procedure.

 

Understanding hand region segmentation

The automatic detection of an arm—and later, the hand region—could be designed to be arbitrarily complicated, maybe by combining information about the shape and color of an arm or hand. However, using skin color as a determining feature to find hands in visual scenes might fail terribly in poor lighting conditions or when the user is wearing gloves. Instead, we choose to recognize the user's hand by its shape in the depth map.

Allowing hands of all sorts to be present in any region of the image unnecessarily complicates the mission of the present chapter, so we make two simplifying assumptions:

  • We will instruct the user of our app to place their hand in front of the center of the screen, orienting their palm roughly parallel to the orientation of the Kinect sensor so that it is easier to identify the corresponding depth layer of the hand.
  • We will also instruct the user to sit roughly 1 to 2 meters away from the Kinect and to slightly extend their arm in front of their body so that the hand will end up in a slightly different depth layer than the arm. However, the algorithm will still work even if the full arm is visible.

In this way, it will be relatively straightforward to segment the image based on the depth layer alone. Otherwise, we would have to come up with a hand detection algorithm first, which would unnecessarily complicate our mission. If you feel adventurous, feel free to do this on your own.

Let's see how to find the most prominent depth of the image center region in the next section.

Finding the most prominent depth of the image center region

Once the hand is placed roughly in the center of the screen, we can start finding all image pixels that lie on the same depth plane as the hand. This is done by following these steps:

  1. First, we simply need to determine the most prominent depth value of the center region of the image. The simplest approach would be to look only at the depth value of the center pixel, like this:
    height, width = depth.shape
    center_pixel_depth = depth[height // 2, width // 2]
  2. Then, create a mask in which all pixels at a depth of center_pixel_depth are white and all others are black, as follows:
    import numpy as np

    depth_mask = np.where(depth == center_pixel_depth, 255,
                          0).astype(np.uint8)

However, this approach will not be very robust, because there is the chance that it will be compromised by the following:

  • Your hand will not be placed perfectly parallel to the Kinect sensor.
  • Your hand will not be perfectly flat.
  • The Kinect sensor values will be noisy.

Therefore, different regions of your hand will have slightly different depth values.

The segment_arm method takes a slightly better approach—it looks at a small neighborhood in the center of the image and determines the median depth value. This is done by following these steps:

  1. First, we find the center region (for example, 21 x 21 pixels) of the image frame, like this:
    def segment_arm(frame: np.ndarray, abs_depth_dev: int = 14) -> np.ndarray:
        height, width = frame.shape
        # find center (21x21 pixels) region of image frame
        center_half = 10  # half-width of the 21x21 window
        center = frame[height // 2 - center_half:height // 2 + center_half,
                       width // 2 - center_half:width // 2 + center_half]
  2. Then, we determine the median depth value, med_val, as follows:
        med_val = np.median(center)

We can now compare med_val with the depth value of all pixels in the image and create a mask in which all pixels whose depth values are within a particular range [med_val-abs_depth_dev, med_val+abs_depth_dev] are white, and all other pixels are black.

However, for reasons that will become clear in a moment, let's paint the pixels gray instead of white, like this:

        frame = np.where(abs(frame - med_val) <= abs_depth_dev,
                         128, 0).astype(np.uint8)
  3. The result will look like this:

You will note that the segmentation mask is not smooth. In particular, it contains holes at points where the depth sensor failed to make a prediction. Let's learn how to apply morphological closing to smoothen the segmentation mask, in the next section.

Applying morphological closing for smoothening

A common problem with segmentation is that a hard threshold typically results in small imperfections (that is, holes, as in the preceding image) in the segmented region. These holes can be alleviated by using morphological opening and closing. Opening removes small objects from the foreground (assuming that the objects are bright on a dark background), whereas closing removes small holes (dark regions).

This means that we can get rid of the small black regions in our mask by applying morphological closing (dilation followed by erosion) with a small 3 x 3-pixel kernel, as follows:

kernel = np.ones((3, 3), np.uint8)
frame = cv2.morphologyEx(frame, cv2.MORPH_CLOSE, kernel)

The result looks a lot smoother, as follows:

Notice, however, that the mask still contains regions that do not belong to the hand or arm, such as what appears to be one of the knees on the left and some furniture on the right. These objects just happen to be on the same depth layer as my arm and hand. If possible, we could now combine the depth information with another descriptor, maybe a texture- or skeleton-based hand classifier that would weed out all non-skin regions.

An easier approach is to realize that most of the time, hands are not connected to knees or furniture. Let's learn how to find connected components in a segmentation mask.

Finding connected components in a segmentation mask

We already know that the center region belongs to the hand. In such a scenario, we can simply apply cv2.floodFill to find all the connected image regions.

Before we do this, we want to be absolutely certain that the seed point for the flood fill belongs to the right mask region. This can be achieved by assigning a grayscale value of 128 to the seed point. However, we also want to make sure that the center pixel does not, by any coincidence, lie within a cavity that the morphological operation failed to close.

So, let's set a small 7 x 7-pixel region with a grayscale value of 128 instead, like this:

small_kernel = 3
frame[height // 2 - small_kernel:height // 2 + small_kernel,
      width // 2 - small_kernel:width // 2 + small_kernel] = 128

As flood filling (as well as morphological operations) is potentially dangerous, OpenCV requires the specification of a mask that avoids flooding the entire image. This mask has to be 2 pixels wider and taller than the original image and has to be used in combination with the cv2.FLOODFILL_MASK_ONLY flag.

It can be very helpful to constrain the flood filling to a small region of the image or a specific contour so that we need not connect two neighboring regions that should never have been connected in the first place. It's better to be safe than sorry, right?

Nevertheless, today, we feel courageous! Let's make the mask entirely black, like this:

mask = np.zeros((height + 2, width + 2), np.uint8)

Then, we can apply the flood fill to the center pixel (the seed point), and paint all the connected regions white, as follows:

flood = frame.copy()
cv2.floodFill(flood, mask, (width // 2, height // 2), 255,
              flags=4 | (255 << 8))

At this point, it should be clear why we decided to start with a gray mask earlier. We now have a mask that contains white regions (arm and hand), gray regions (neither arm nor hand, but other things in the same depth plane), and black regions (all others). With this setup, it is easy to apply a simple binary threshold to highlight only the relevant regions of the pre-segmented depth plane, as follows:

ret, flooded = cv2.threshold(flood, 129, 255, cv2.THRESH_BINARY) 

This is what the resulting mask looks like:

The resulting segmentation mask can now be returned to the recognize function, where it will be used as an input to the find_hull_defects function, as well as a canvas for drawing the final output image (img_draw). The function analyzes the shape of a hand in order to detect the defects of a hull that corresponds to the hand. Let's learn how to perform hand shape analysis in the next section.
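For reference, the snippets of this section assemble into a complete segmentation routine along these lines (a sketch that follows the constants used above):

import cv2
import numpy as np


def segment_arm(frame: np.ndarray, abs_depth_dev: int = 14) -> np.ndarray:
    height, width = frame.shape
    # find the center (21x21 pixels) region of the image frame
    center_half = 10
    center = frame[height // 2 - center_half:height // 2 + center_half,
                   width // 2 - center_half:width // 2 + center_half]
    # determine the median depth value of the center region
    med_val = np.median(center)
    # paint every pixel close to med_val gray, everything else black
    frame = np.where(abs(frame - med_val) <= abs_depth_dev,
                     128, 0).astype(np.uint8)
    # remove small holes with morphological closing
    kernel = np.ones((3, 3), np.uint8)
    frame = cv2.morphologyEx(frame, cv2.MORPH_CLOSE, kernel)
    # make sure the seed region around the center belongs to the mask
    small_kernel = 3
    frame[height // 2 - small_kernel:height // 2 + small_kernel,
          width // 2 - small_kernel:width // 2 + small_kernel] = 128
    # flood fill the region connected to the center pixel with white
    mask = np.zeros((height + 2, width + 2), np.uint8)
    flood = frame.copy()
    cv2.floodFill(flood, mask, (width // 2, height // 2), 255,
                  flags=4 | (255 << 8))
    # keep only the flooded (white) region
    ret, flooded = cv2.threshold(flood, 129, 255, cv2.THRESH_BINARY)
    return flooded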

 

Performing hand shape analysis

Now that we know (roughly) where the hand is located, we aim to learn something about its shape. In this app, we will make a decision on which exact gesture is shown, based on convexity defects of a contour corresponding to the hand.

Let's move on and learn how to determine the contour of the segmented hand region in the next section, which will be the first step in our hand shape analysis.

Determining the contour of the segmented hand region

The first step involves determining the contour of the segmented hand region. Luckily, OpenCV comes with a pre-canned version of such an algorithm—cv2.findContours. This function acts on a binary image and returns a set of points that are believed to be part of the contour. As there might be multiple contours present in the image, it is possible to retrieve an entire hierarchy of contours, as follows:

def find_hull_defects(segment: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    contours, hierarchy = cv2.findContours(segment, cv2.RETR_TREE,
                                           cv2.CHAIN_APPROX_SIMPLE)

Furthermore, because we do not know which contour we are looking for, we have to make an assumption to clean up the contour result, since it is possible that some small cavities are left over even after the morphological closing. However, we are fairly certain that our mask contains only the segmented area of interest. We will assume that the largest contour found is the one that we are looking for.

Thus, we simply traverse the list of contours, calculate the contour area (cv2.contourArea), and store only the largest one (max_contour), like this:

max_contour = max(contours, key=cv2.contourArea) 

The contour that we found might still have too many corners. We approximate the contour with a simpler one that does not have sides shorter than 1 percent of the contour's perimeter, like this:

epsilon = 0.01 * cv2.arcLength(max_contour, True)
max_contour = cv2.approxPolyDP(max_contour, epsilon, True)

Let's learn how to find the convex hull of a contour area, in the next section.

Finding the convex hull of a contour area

Once we have identified the largest contour in our mask, it is straightforward to compute the convex hull of the contour area. The convex hull is basically the envelope of the contour area. If you think of all the pixels that belong to the contour area as a set of nails poking out of a board, then a tight rubber band encircles all the nails forming the convex hull shape. We can get the convex hull directly from our largest contour (max_contour), like this:

hull = cv2.convexHull(max_contour, returnPoints=False) 

As we now want to look at convexity deficits in this hull, we are instructed by the OpenCV documentation to set the returnPoints optional flag to False.

The convex hull drawn in yellow around a segmented hand region looks like this:

As mentioned previously, we will determine a hand gesture based on convexity defects. Let's move on and learn how to find the convexity defects of a convex hull in the next section, which will bring us one step closer to recognizing hand gestures.

Finding the convexity defects of a convex hull

As is evident from the preceding screenshot, not all points on the convex hull belong to the segmented hand region. In fact, all the fingers and the wrist cause severe convexity defects—that is, points of the contour that are far away from the hull.

We can find these defects by looking at both the largest contour (max_contour) and the corresponding convex hull (hull), as follows:

defects = cv2.convexityDefects(max_contour, hull) 

The output of this function (defects) is a NumPy array containing all defects. Each defect is an array of four integers that are start_index (index of the point in the contour where the defect begins), end_index (index of the point in the contour where the defect ends), farthest_pt_index (the index of the farthest point from the convex hull within the defect), and fixpt_depth (the distance between the farthest point and the convex hull).

We will make use of this information in just a moment when we try to estimate the number of extended fingers.

For now, though, our job is done. The extracted contour (max_contour) and convexity defects (defects) can be returned to recognize, where they will be used as inputs to detect_num_fingers, as follows:

return max_contour, defects 
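Put together, the shape analysis function might look like this (a sketch assembled from the snippets above):

from typing import Tuple

import cv2
import numpy as np


def find_hull_defects(segment: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    # find all contours in the binary segmentation mask
    contours, hierarchy = cv2.findContours(segment, cv2.RETR_TREE,
                                           cv2.CHAIN_APPROX_SIMPLE)
    # assume the largest contour is the hand region
    max_contour = max(contours, key=cv2.contourArea)
    # simplify the contour to remove spurious corners
    epsilon = 0.01 * cv2.arcLength(max_contour, True)
    max_contour = cv2.approxPolyDP(max_contour, epsilon, True)
    # compute the convex hull (as indices) and its convexity defects
    hull = cv2.convexHull(max_contour, returnPoints=False)
    defects = cv2.convexityDefects(max_contour, hull)
    return max_contour, defects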

So, now that we have found the defects, let's move on and learn how to perform hand gesture recognition using the convexity defects, which will bring us toward the completion of the app.

 

Performing hand gesture recognition

What remains to be done is to classify the hand gesture based on the number of extended fingers. For example, if we find five extended fingers, we assume the hand to be open, whereas no extended fingers implies a fist. All that we are trying to do is count from zero to five, and make the app recognize the corresponding number of fingers.

This is actually trickier than it might seem at first. For example, people in Europe might count to three by extending their thumb, index finger, and middle finger. If you do that in the US, people there might get horrendously confused, because they do not tend to use their thumbs when signaling the number two.

This might lead to frustration, especially in restaurants (trust me). If we could find a way to generalize these two scenarios, maybe by appropriately counting the number of extended fingers, we would have an algorithm that could teach simple hand gesture recognition to not only a machine but also (maybe) to a person of average intellect.

As you might have guessed, the answer is related to convexity defects. As mentioned earlier, extended fingers cause defects in the convex hull. However, the inverse is not true; that is, not all convexity defects are caused by fingers! There might be additional defects caused by the wrist, as well as the overall orientation of the hand or the arm. How can we distinguish between these different causes of defects?

Let's distinguish between different cases of convexity defects, in the next section.

Distinguishing between different causes of convexity defects

The trick is to look at the angle between the farthest point from the convex hull within the defect (farthest_pt_index) and the start and end points of the defect (start_index and end_index, respectively), as illustrated in the following screenshot:

In the previous screenshot, the orange markers serve as a visual aid to center the hand in the middle of the screen, and the convex hull is outlined in green. Each red dot corresponds to the point farthest from the convex hull (farthest_pt_index) for every convexity defect detected. If we compare a typical angle that belongs to two extended fingers (such as θj) to an angle that is caused by general hand geometry (such as θi), we notice that the former is much smaller than the latter.

This is obviously because humans can spread their fingers only a little, thus creating a narrow angle made by the farthest defect point and the neighboring fingertips. Therefore, we can iterate over all convexity defects and compute the angle between the said points. For this, we will need a utility function that calculates the angle (in radians) between two arbitrary, array-like vectors, v1 and v2, as follows:

def angle_rad(v1, v2):
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)),
                      np.dot(v1, v2))

This method uses the cross product to compute the angle, rather than doing it in the standard way. The standard way of calculating the angle between two vectors v1 and v2 is to divide their dot product by the product of the norms of v1 and v2 and then take the arccosine. However, that method has two imperfections:

  • You have to manually avoid division by zero if either the norm of v1 or the norm of v2 is zero.
  • The method returns relatively inaccurate results for small angles.

Similarly, we provide a simple function to convert an angle from degrees to radians, illustrated here:

def deg2rad(angle_deg): 
    return angle_deg/180.0*np.pi 
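As a quick sanity check (not part of the chapter's code), two perpendicular vectors should yield an angle of roughly 90 degrees:

v1 = np.array([1, 0])
v2 = np.array([0, 1])
print(angle_rad(v1, v2) / deg2rad(1.0))  # prints approximately 90.0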

In the next section, we'll see how to classify hand gestures based on the number of extended fingers.

Classifying hand gestures based on the number of extended fingers

What remains to be done is to actually classify the hand gesture based on the number of instances of extended fingers. The classification is done using the following function:

def detect_num_fingers(contour: np.ndarray, defects: np.ndarray,
                       img_draw: np.ndarray,
                       thresh_deg: float = 80.0) -> Tuple[int, np.ndarray]:

The function accepts the detected contour (contour), the convexity defects (defects), a canvas to draw on (img_draw), and a cutoff angle that can be used as a threshold to classify whether convexity defects are caused by extended fingers or not (thresh_deg).

Except for the angle between the thumb and the index finger, it is rather hard to get anything close to 90 degrees, so anything close to that number should work. We do not want the cutoff angle to be too high, because that might lead to errors in classifications. The complete function will return the number of fingers and the illustration frame, and consists of the following steps:

  1. First, let's focus on special cases. If we do not find any convexity defects, it means that we possibly made a mistake during the convex hull calculation, or there are simply no extended fingers in the frame, so we return 0 as the number of detected fingers, as follows:
    if defects is None:
        return [0, img_draw]
  2. However, we can take this idea even further. Due to the fact that arms are usually slimmer than hands or fists, we can assume that the hand geometry will always generate at least two convexity defects (which usually belong to the wrist). So, if there are no additional defects, it implies that there are no extended fingers:
    if len(defects) <= 2:
        return [0, img_draw]
  3. Now that we have ruled out all special cases, we can begin counting real fingers. If there is a sufficient number of defects, we will find a defect between every pair of fingers. Thus, in order to get the number right (num_fingers), we should start counting at 1, like this:
    num_fingers = 1
  4. Then, we start iterating over all convexity defects. For each defect, we extract the three points and draw its hull for visualization purposes, as follows:
    # Defects are of shape (num_defects, 1, 4)
    for defect in defects[:, 0, :]:
        # Each defect is an array of four integers. The first three are
        # indices of the start, end, and farthest points, respectively
        start, end, far = [contour[i][0] for i in defect[:3]]
        # draw the hull
        cv2.line(img_draw, tuple(start), tuple(end), (0, 255, 0), 2)
  5. Then, we compute the angle between the two edges from far to start and from far to end. If the angle is smaller than thresh_deg degrees, it means that we are dealing with a defect that is most likely caused by two extended fingers. In such cases, we want to increment the number of detected fingers (num_fingers) and draw the point in green. Otherwise, we draw the point in red, as follows:
        # if angle is below a threshold, defect point belongs to two
        # extended fingers
        if angle_rad(start - far, end - far) < deg2rad(thresh_deg):
            # increment number of fingers
            num_fingers += 1

            # draw point as green
            cv2.circle(img_draw, tuple(far), 5, (0, 255, 0), -1)
        else:
            # draw point as red
            cv2.circle(img_draw, tuple(far), 5, (0, 0, 255), -1)
  6. After iterating over all convexity defects, we return the number of detected fingers and the assembled output image, like this:
    return min(5, num_fingers), img_draw

Computing the minimum will make sure that we do not exceed the common number of fingers per hand.
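Taken together, the classification function might be assembled as follows (a sketch that repeats the angle_rad and deg2rad helpers from the previous section so that it runs on its own):

from typing import Tuple

import cv2
import numpy as np


def angle_rad(v1, v2):
    # angle between two vectors, computed via the cross product
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))


def deg2rad(angle_deg):
    # convert degrees to radians
    return angle_deg / 180.0 * np.pi


def detect_num_fingers(contour: np.ndarray, defects: np.ndarray,
                       img_draw: np.ndarray,
                       thresh_deg: float = 80.0) -> Tuple[int, np.ndarray]:
    # no convexity defects at all: assume no extended fingers
    if defects is None:
        return [0, img_draw]
    # the hand geometry alone (mostly the wrist) produces a couple of
    # defects, so very few defects also means no extended fingers
    if len(defects) <= 2:
        return [0, img_draw]
    # a defect sits between every pair of fingers, so start counting at 1
    num_fingers = 1
    # defects are of shape (num_defects, 1, 4)
    for defect in defects[:, 0, :]:
        start, end, far = [contour[i][0] for i in defect[:3]]
        # draw the hull segment
        cv2.line(img_draw, tuple(start), tuple(end), (0, 255, 0), 2)
        # a narrow angle means the defect lies between two extended fingers
        if angle_rad(start - far, end - far) < deg2rad(thresh_deg):
            num_fingers += 1
            cv2.circle(img_draw, tuple(far), 5, (0, 255, 0), -1)
        else:
            cv2.circle(img_draw, tuple(far), 5, (0, 0, 255), -1)
    # never report more than five fingers per hand
    return min(5, num_fingers), img_draw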

The result can be seen in the following screenshot:

Interestingly, our app is able to detect the correct number of extended fingers in a variety of hand configurations. Defect points between extended fingers are easily classified as such by the algorithm, and others are successfully ignored.

 

Summary

This chapter showed a relatively simple—and yet surprisingly robust—way of recognizing a variety of hand gestures by counting the number of extended fingers.

The algorithm first shows how a task-relevant region of the image can be segmented using depth information acquired from a Microsoft Kinect 3D sensor, and how morphological operations can be used to clean up the segmentation result. By analyzing the shape of the segmented hand region, the algorithm comes up with a way to classify hand gestures based on the types of convexity defects found in the image.

Once again, mastering our use of OpenCV to perform the desired task did not require us to produce a large amount of code. Instead, we were challenged to gain an important insight that made us use the built-in functionality of OpenCV in an effective way.

Gesture recognition is a popular but challenging field in computer science, with applications in a large number of areas, such as Human-Computer Interaction (HCI), video surveillance, and even the video game industry. You can now use your advanced understanding of segmentation and structure analysis to build your own state-of-the-art gesture recognition system. Another approach you might want to use for hand gesture recognition is to train a deep image classification network on hand gestures. We will discuss deep networks for image classifications in Chapter 9, Learning to Classify and Localize Objects.

In the next chapter, we will continue to focus on detecting objects of interest in visual scenes, but we will assume a much more complicated case: viewing the object from an arbitrary perspective and distance. To do this, we will combine perspective transformations with scale-invariant feature descriptors to develop a robust feature-matching algorithm.

About the Authors
  • Dr. Menua Gevorgyan

    Dr. Menua Gevorgyan is an experienced researcher with a demonstrated history of working in the information technology and services industry. He is skilled in computer vision, deep learning, machine learning, and data science as well as having a lot of experience with OpenCV and Python programming. He is interested in machine perception and machine understanding problems, and wonders if it is possible to make a machine perceive the world as a human does.

  • Arsen Mamikonyan

    Arsen Mamikonyan is an experienced machine learning specialist with demonstrated work experience in Silicon Valley and London, and teaching experience at the American University of Armenia. He is skilled in applied machine learning and data science and has built real-life applications using Python and OpenCV, among others. He holds a master's degree in engineering (MEng) with a concentration on artificial intelligence from the Massachusetts Institute of Technology.

  • Michael Beyeler

    Michael Beyeler is a postdoctoral fellow in neuroengineering and data science at the University of Washington, where he is working on computational models of bionic vision in order to improve the perceptual experience of blind patients implanted with a retinal prosthesis (bionic eye). His work lies at the intersection of neuroscience, computer engineering, computer vision, and machine learning. He is also an active contributor to several open source software projects, and has professional programming experience in Python, C/C++, CUDA, MATLAB, and Android. Michael received a PhD in computer science from the University of California, Irvine, and an MSc in biomedical engineering and a BSc in electrical engineering from ETH Zurich, Switzerland.
