Hands-On Image Generation with TensorFlow

Chapter 9: Video Synthesis

We have learned about and built many models for image generation in the previous chapters, including the state-of-the-art StyleGAN and Self-Attention GAN (SAGAN) models. You have now learned about most, if not all, of the important techniques used to generate images, and we can now move on to video generation (synthesis). In essence, a video is simply a series of still images. Therefore, the most basic video generation method is to generate images individually and put them together in sequence to make a video. Video synthesis is a complex and broad topic in its own right, and we won't be able to cover everything in a single chapter.

In this chapter, we will get an overview of video synthesis. We will then implement what is probably the most well-known video generation technique, deepfake. We will use this to swap a person's face in a video with someone else's face. I'm sure you have seen such fake videos before. If you haven't, then just...

Technical requirements

The code for this chapter can be accessed here:

https://github.com/PacktPublishing/Hands-On-Image-Generation-with-TensorFlow-2.0/tree/master/Chapter09

The notebook used in this chapter is:

  • ch9_deepfake.ipynb

Video synthesis overview

Let's say your doorbell rings while you're watching a video, so you pause the video and go to answer the door. What would you see on your screen when you come back? A still picture where everything is frozen and not moving. If you press the play button and pause it again quickly, you will see another image that looks very similar to the previous one but with slight differences. Yes – when you play a series of images sequentially, you get a video.

Image data has three dimensions, (H, W, C); video data has four, (T, H, W, C), where T is the temporal (time) dimension. In other words, a video is just a big batch of images, except that we cannot shuffle the batch: there must be temporal consistency between consecutive frames, which I'll explain further.
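
To make this concrete, here is a minimal sketch in NumPy (the shapes and variable names are illustrative) of how a video maps onto a 4D tensor:

import numpy as np

# A 2-second clip at 25 fps: T=50 frames of 64x64 RGB images
video = np.zeros((50, 64, 64, 3), dtype=np.uint8)  # (T, H, W, C)

# Indexing the first (temporal) axis yields a single image frame
frame = video[0]  # shape (64, 64, 3), that is, (H, W, C)
print(video.shape, frame.shape)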

Let's say we extract images from some video datasets and train an unconditional GAN to generate images from random noise input. As you can imagine, the...

Implementing face image processing

We will mainly use two Python libraries, dlib and OpenCV, to implement most of the face processing tasks. OpenCV is good for general-purpose computer vision tasks and includes low-level functions and algorithms. dlib was originally a C++ toolkit for machine learning, but it also has a Python interface and is the go-to Python library for facial landmark detection. Most of the image processing code used in this chapter is adapted from https://github.com/deepfakes/faceswap.
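
As a quick preview, a minimal sketch of face and landmark detection with dlib and OpenCV might look like the following (the image path is illustrative, and the predictor file is dlib's standard 68-point landmark model, which must be downloaded separately):

import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# The 68-point landmark model, available from dlib.net
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

image = cv2.imread('face.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):     # detect face bounding boxes
    shape = predictor(gray, rect)  # predict 68 facial landmarks
    points = [(p.x, p.y) for p in shape.parts()]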

Extracting images from video

The first step in the production pipeline is to extract images from the video. A video is made up of a series of images separated by a fixed time interval. If you check a video file's properties, you may find something that says frame rate = 25 fps. FPS is the number of image frames per second in a video, and 25 fps is a standard video frame rate. That means 25 images are played within a 1-second...
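
A minimal sketch of frame extraction with OpenCV might look like this (the file paths are illustrative, and the frames directory is assumed to exist):

import cv2

cap = cv2.VideoCapture('input_video.mp4')
print('Frame rate:', cap.get(cv2.CAP_PROP_FPS))  # e.g., 25.0

count = 0
while cap.isOpened():
    ret, frame = cap.read()  # read one frame at a time
    if not ret:              # no more frames to read
        break
    cv2.imwrite('frames/frame_%04d.png' % count, frame)
    count += 1
cap.release()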

Building a DeepFake model

The deep learning model used in the original deepfake is autoencoder-based. There are two autoencoders, one for each face domain, and they share the same encoder, so the model contains one encoder and two decoders in total. The autoencoders expect an image size of 64×64 for both the input and the output. Now, let's build the encoder.
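
Before building the layers, it may help to see the overall wiring. The following is a minimal sketch, assuming tf.keras, with toy stand-ins for the real encoder and decoders we build in this section; only the sharing pattern matters here:

from tensorflow.keras import Input, Model, Sequential
from tensorflow.keras.layers import Dense, Flatten, Reshape

# Toy stand-ins for the real convolutional encoder/decoder built below
def build_encoder():
    return Sequential([Flatten(input_shape=(64, 64, 3)), Dense(1024)])

def build_decoder():
    return Sequential([Dense(64 * 64 * 3, input_shape=(1024,),
                             activation='sigmoid'),
                       Reshape((64, 64, 3))])

encoder = build_encoder()    # one encoder, shared by both face domains
decoder_a = build_decoder()  # decoder for person A's face
decoder_b = build_decoder()  # decoder for person B's face

x = Input(shape=(64, 64, 3))
autoencoder_a = Model(x, decoder_a(encoder(x)))  # trained on face A
autoencoder_b = Model(x, decoder_b(encoder(x)))  # trained on face B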

Building the encoder

As we learned in the previous chapter, the encoder is responsible for converting high-dimensional images into a low-dimensional representation. We'll first write a function that encapsulates a strided convolutional layer followed by a leaky ReLU activation; this block is used for downsampling:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, LeakyReLU

def downsample(filters):
    # Strided convolution halves the spatial dimensions; leaky ReLU activation
    return Sequential([
        Conv2D(filters, kernel_size=5, strides=2, padding='same'),
        LeakyReLU(0.1)])
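
As a usage sketch, the encoder can then be assembled by stacking these blocks (the filter counts and latent size here are illustrative, not necessarily the book's exact architecture):

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Flatten

encoder = Sequential([
    Input(shape=(64, 64, 3)),
    downsample(128),  # 64x64 -> 32x32
    downsample(256),  # 32x32 -> 16x16
    downsample(512),  # 16x16 -> 8x8
    Flatten(),
    Dense(1024)])     # low-dimensional latent representation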

In the usual autoencoder implementation, the output...

Swapping faces

Here comes the last step of the deepfake pipeline, but let's first recap it. The production pipeline involves three main stages:

  1. Extract a face from an image using dlib and OpenCV.
  2. Translate the face using the trained encoder and decoders.
  3. Swap the new face back into the original image.

The new face generated by the autoencoder is an aligned face of size 64×64, so we will need to warp it to the position, size, and angle of the face in the original image. We'll use the affine matrix obtained from step 1, the face extraction stage. We'll use cv2.warpAffine as before, but this time, the cv2.WARP_INVERSE_MAP flag is used to reverse the direction of the image transformation, as follows:

h, w, _ = image.shape
size = 64
new_image = np.zeros_like(image, dtype=np.uint8)
new_image = cv2.warpAffine(np.array(new_face, dtype=np.uint8),
                           ...
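
For reference, a complete call might look something like the following sketch, where image, new_face, and the affine matrix mat are illustrative stand-ins for the values produced in the earlier stages:

import cv2
import numpy as np

# Dummy stand-ins: an original frame, a generated 64x64 face, and an
# affine matrix from the extraction step (an identity matrix here)
image = np.zeros((256, 256, 3), dtype=np.uint8)
new_face = np.zeros((64, 64, 3), dtype=np.uint8)
mat = np.float32([[1, 0, 0], [0, 1, 0]])

h, w, _ = image.shape
new_image = np.zeros_like(image, dtype=np.uint8)
cv2.warpAffine(new_face, mat, (w, h),
               dst=new_image,
               flags=cv2.WARP_INVERSE_MAP | cv2.INTER_CUBIC,
               borderMode=cv2.BORDER_TRANSPARENT)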

Improving DeepFakes with GANs

The output image of deepfake's autoencoders can be a little blurry, so how can we improve that? To recap, the deepfake algorithm can be broken into two main techniques – face image processing and face generation. The latter can be thought of as an image-to-image translation problem, which we learned a lot about in Chapter 4, Image-to-Image Translation. Therefore, the natural thing to do is to use a GAN to improve the quality. One such model is faceswap-GAN, and we will now go over it at a high level. In faceswap-GAN, the autoencoder from the original deepfake is enhanced with residual blocks and self-attention blocks (see Chapter 8, Self-Attention for Image Generation) and used as the generator. The discriminator architecture is as follows:

Figure 9.10 - faceswap-GAN's discriminator architecture (Redrawn from: https://github.com/shaoanlu/faceswap-GAN)
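
As a reminder of one of the building blocks mentioned above, here is a minimal sketch of a residual block of the kind used to enhance the generator, assuming tf.keras; the exact faceswap-GAN layer configuration may differ, and filters is assumed to match the input's channel count:

from tensorflow.keras.layers import Add, Conv2D, LeakyReLU

def residual_block(x, filters):
    # Two convolutions plus a skip connection: the block learns a
    # residual that is added back onto its input
    out = Conv2D(filters, kernel_size=3, padding='same')(x)
    out = LeakyReLU(0.1)(out)
    out = Conv2D(filters, kernel_size=3, padding='same')(out)
    return Add()([x, out])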

We can learn a lot about the discriminator...

Summary

Congratulations! We have now finished all the coding in this book. We have learned how to use dlib to detect faces and facial landmarks, and how to use OpenCV to warp and align a face. We also learned how to use warping and masking to do face swapping. As a matter of fact, we spent most of the chapter learning about face image processing and spent very little time on the deep learning side. We implemented the autoencoders by reusing and modifying code from the previous chapter.

Finally, we went over an example of improving deepfake by using GANs. faceswap-GAN improves deepfake by adding a residual block, a self-attention block, and a discriminator for adversarial training, all of which we have already learned about in previous chapters.

In the next chapter, which is also the final chapter, we will review the techniques we have learned in this book and look at some of the pitfalls in training GANs for real-world applications. Then, we will go over a few...
