You're reading from Deep Learning with Theano

Product type: Book
Published in: Jul 2017
Publisher: Packt
ISBN-13: 9781786465825
Edition: 1st

Author: Christopher Bourez

Christopher Bourez graduated from Ecole Polytechnique and Ecole Normale Supérieure de Cachan in Paris in 2005 with a Master of Science in Math, Machine Learning and Computer Vision (MVA). For 7 years, he led a computer vision company that launched Pixee, a visual recognition application for iPhone, in 2007, in partnership with a major movie theater brand, the city of Paris, and a major ticket broker: with a snap of a picture, the user could get information about events and products, with access to purchase. While working on computer vision missions with Caffe, TensorFlow, and Torch, he helped other developers succeed by writing a blog on computer science. One of his blog posts, a tutorial on the Caffe deep learning framework, became the most successful tutorial on the web after the official Caffe website. On the initiative of Packt Publishing, the recipes behind the success of his Caffe tutorial have been ported to write this book on Theano. In the meantime, a wide range of deep learning problems were studied to gain more practice with Theano and its applications.
Chapter 6. Locating with Spatial Transformer Networks

In this chapter, we leave the NLP field and come back to images, with an example of applying recurrent neural networks to image tasks. In Chapter 2, Classifying Handwritten Digits with a Feedforward Network, we addressed image classification, which consists of predicting the class of an image. Here, we'll address object localization, another common task in computer vision, which consists of predicting the bounding box of an object in the image.

While Chapter 2, Classifying Handwritten Digits with a Feedforward Network, solved the classification task with neural nets built from linear layers, convolutions, and non-linearities, the spatial transformer is a new module built on very specific equations dedicated to the localization task.

In order to locate multiple objects in the image, spatial transformers are composed with recurrent networks. This chapter takes the opportunity to show how to use prebuilt recurrent networks in Lasagne,...

MNIST CNN model with Lasagne


The Lasagne library has packaged layers and tools to handle neural nets easily. Let's first install the latest version of Lasagne:

pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip

Let us reprogram the MNIST model from Chapter 2, Classifying Handwritten Digits with a Feedforward Network with Lasagne:

def model(l_input, input_dim=28, num_units=256, num_classes=10, p=.5):
    # First convolution: 32 filters of 5x5, ReLU, Glorot initialization
    network = lasagne.layers.Conv2DLayer(
            l_input, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.rectify,
            W=lasagne.init.GlorotUniform())

    # 2x2 max-pooling halves the spatial dimensions
    network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

    # Second convolution + pooling stage
    network = lasagne.layers.Conv2DLayer(
            network, num_filters=32, filter_size=(5, 5),
            nonlinearity=lasagne.nonlinearities.rectify)

    network = lasagne.layers.MaxPool2DLayer(network, pool_size=(2, 2))

    if num_units > 0:
        network = lasagne.layers.DenseLayer(
...
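To check that these layers fit together, we can trace the spatial dimensions through the network. A quick sanity check in plain Python, assuming the Lasagne defaults used above ('valid' convolutions and non-overlapping pooling):

```python
def conv_out(size, kernel):
    # A 'valid' convolution shrinks each side by kernel - 1
    return size - kernel + 1

def pool_out(size, pool):
    # Non-overlapping pooling divides each side by the pool size
    return size // pool

size = 28                  # MNIST input side
size = conv_out(size, 5)   # first 5x5 convolution  -> 24
size = pool_out(size, 2)   # 2x2 max-pooling        -> 12
size = conv_out(size, 5)   # second 5x5 convolution -> 8
size = pool_out(size, 2)   # 2x2 max-pooling        -> 4

print(size)  # 4: the dense layer sees 32 feature maps of 4x4
```

This is why the dense layer at the end receives a 32 x 4 x 4 tensor per image.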

A localization network


In Spatial Transformer Networks (STN), instead of applying the network directly to the input image signal, the idea is to add a module that preprocesses the image, cropping, rotating, and scaling it to fit the object, to assist classification:

Spatial Transformer Networks

For that purpose, STNs use a localization network to predict the affine transformation parameters and process the input:

Spatial transformer networks

In Theano, differentiation through the affine transformation is automatic: we simply have to connect the localization net to the input of the classification net through the affine transformation.

First, we create a localization network, close in design to the MNIST CNN model, to predict the six parameters of the affine transformation:

l_in = lasagne.layers.InputLayer((None, dim, dim))
l_dim = lasagne.layers.DimshuffleLayer(l_in, (0, 'x', 1, 2))
l_pool0_loc = lasagne.layers.MaxPool2DLayer(l_dim, pool_size=(2, 2))
l_dense_loc = mnist_cnn.model(l_pool0_loc, input_dim...
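The six predicted parameters form a 2x3 matrix that maps each output (target) coordinate back to an input (source) coordinate, where the image is then sampled. A minimal NumPy sketch of this sampling step, using the normalized [-1, 1] coordinates of the STN paper and nearest-neighbour interpolation for brevity (Lasagne's built-in transformer layer works on Theano tensors and uses bilinear interpolation):

```python
import numpy as np

def affine_grid_sample(image, theta):
    """Sample `image` through the 2x3 affine map `theta`, expressed in
    normalized coordinates in [-1, 1]. Nearest-neighbour interpolation
    keeps the sketch short."""
    h, w = image.shape
    out = np.zeros_like(image)
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing='ij')
    # Map every target coordinate back to a source coordinate
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = theta @ coords                              # shape (2, h*w)
    # Convert normalized source coordinates back to pixel indices
    sx = np.round((src[0] + 1) * (w - 1) / 2).astype(int)
    sy = np.round((src[1] + 1) * (h - 1) / 2).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out.flat[np.flatnonzero(valid)] = image[sy[valid], sx[valid]]
    return out

# The identity transform leaves the image unchanged
identity = np.array([[1., 0., 0.],
                     [0., 1., 0.]])
img = np.arange(16.).reshape(4, 4)
assert np.allclose(affine_grid_sample(img, identity), img)
```

Because the bilinear version of this sampling is differentiable in both the image and theta, gradients flow from the classifier back into the localization network.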

Unsupervised learning with co-localization


The first layers of the digit classifier trained in Chapter 2, Classifying Handwritten Digits with a Feedforward Network, can serve as an encoding function to represent the image in an embedding space, as for words:

It is possible to train the localization network of the spatial transformer network in an unsupervised fashion, by minimizing a hinge loss objective function on random pairs of images assumed to contain the same digit:

Minimizing this sum leads to modifying the weights in the localization network, so that two localized digits become closer than two random crops.
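A plain NumPy sketch of one term of this objective (the function name and margin value are illustrative, not the book's code): given embeddings of the two localized crops and of two random crops, the hinge penalizes the case where the localized pair is not closer than the random pair by at least a margin:

```python
import numpy as np

def colocalization_hinge(e_loc1, e_loc2, e_rand1, e_rand2, margin=1.0):
    """Hinge loss encouraging two localized digits to be closer in
    embedding space than two random crops of the same images."""
    d_loc = np.sum((e_loc1 - e_loc2) ** 2)    # localized-pair distance
    d_rand = np.sum((e_rand1 - e_rand2) ** 2) # random-crop distance
    return max(0.0, d_loc - d_rand + margin)

# When the localized crops already match much better than the random
# crops, the loss is zero and no gradient flows.
e1, e2 = np.array([1., 0.]), np.array([1., 0.1])
r1, r2 = np.array([1., 0.]), np.array([-1., 2.])
print(colocalization_hinge(e1, e2, r1, r2))  # 0.0
```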

Here are the results:

(Spatial transformer networks paper, Jaderberg et al., 2015)

Region-based localization networks


Historically, the basic approach to object localization was to run a classification network in a sliding window: the window is slid pixel by pixel in each direction, and the classifier is applied at every position and every scale in the image. The classifier learns to say whether the object is present and centered. This requires a large amount of computation, since the model has to be evaluated at every position and scale.
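The cost of the sliding-window approach is easy to see in a sketch (the classifier here is a stub; a real one would be a CNN such as the MNIST model above):

```python
import numpy as np

def sliding_window_detect(image, window, stride, classifier):
    """Apply `classifier` to every window position; returns a score map.
    With stride 1, the model is evaluated (H-h+1) x (W-w+1) times
    for a single scale."""
    H, W = image.shape
    h, w = window
    scores = np.zeros(((H - h) // stride + 1, (W - w) // stride + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            patch = image[i * stride:i * stride + h,
                          j * stride:j * stride + w]
            scores[i, j] = classifier(patch)
    return scores

# Stub classifier: mean intensity of the patch
image = np.random.rand(100, 100)
scores = sliding_window_detect(image, window=(28, 28), stride=1,
                               classifier=np.mean)
print(scores.shape)  # (73, 73): 5329 classifier evaluations, one scale
```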

To accelerate this process, the Region Proposal Network (RPN) of the Faster R-CNN paper (Ren et al., including Ross Girshick) transforms the fully connected layers of a neural net classifier, such as the MNIST CNN, into convolutional layers as well: for a dense layer acting on a 28x28 image, there is no difference between a convolution and a linear layer when the convolution kernel has the same dimensions as the input. So, any fully connected layer can be rewritten as a convolutional layer, with the same weights and the appropriate...
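This equivalence can be checked numerically: a 'valid' convolution whose kernel has the same size as its input produces a single number identical to the dot product computed by a dense layer with the same weights (using the correlation convention of deep learning frameworks, i.e. no kernel flipping). A small NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))   # input "image"
w = rng.standard_normal((28, 28))   # dense-layer weights, reshaped 2-D

# Dense layer: dot product of flattened input and weights
dense_out = x.ravel() @ w.ravel()

# 'Valid' correlation with a kernel the same size as the input:
# the window fits in exactly one position, giving one output value.
conv_out = np.sum(x * w)
assert np.isclose(dense_out, conv_out)

# On a larger image, the same kernel slides, turning the dense layer
# into a detector evaluated densely at every position.
big = rng.standard_normal((32, 32))
score_map = np.array([[np.sum(big[i:i+28, j:j+28] * w)
                       for j in range(5)] for i in range(5)])
print(score_map.shape)  # (5, 5)
```

The resulting score map is exactly what a fully convolutional classifier computes in one pass, which is what makes the RPN fast.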

Further reading


You can further refer to these sources for more information:

  • Spatial Transformer Networks, Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu, Jun 2015

  • Recurrent Spatial Transformer Networks, Søren Kaae Sønderby, Casper Kaae Sønderby, Lars Maaløe, Ole Winther, Sept 2015

  • Original code: https://github.com/skaae/recurrent-spatial-transformer-code

  • Google Street View Character Recognition, Jiyue Wang, Peng Hui How

  • Reading Text in the Wild with Convolutional Neural Networks, Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

  • Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks, Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet, 2013

  • Recognizing Characters From Google Street View Images, Guan Wang, Jingrui Zhang

  • Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

  • R-CNN minus R, Karel Lenc...

Summary


The spatial transformer layer is an original module that localizes an area of the image, then crops and resizes it, helping the classifier focus on the relevant part of the image and increasing its accuracy. The layer is composed of a differentiable affine transformation, whose parameters are computed by another model, the localization network, and can be learned via backpropagation as usual.

As an example application, reading multiple digits in an image can be addressed with the help of recurrent neural units. To simplify our work, the Lasagne library was introduced.

Spatial transformers are one solution among many for localization; region-based approaches, such as YOLO, SSD, and Faster R-CNN, provide state-of-the-art results for bounding box prediction.

In the next chapter, we'll continue with image recognition to discover how to classify full size images that contain a lot more information than digits, such as natural images of indoor scenes and outdoor landscapes...

