You're reading from Deep Learning with Theano

Product type: Book
Published in: Jul 2017
Publisher: Packt
ISBN-13: 9781786465825
Edition: 1st

Author: Christopher Bourez

Christopher Bourez graduated from École Polytechnique and École Normale Supérieure de Cachan in Paris in 2005 with a Master of Science in Math, Machine Learning and Computer Vision (MVA). For 7 years, he led a computer vision company that launched Pixee, a visual recognition application for iPhone, in 2007, with the major movie theater brand, the city of Paris, and the major ticket broker: with the snap of a picture, the user could get information about events and products, and access to purchases. While working on computer vision missions with Caffe, TensorFlow, or Torch, he helped other developers succeed by writing a blog on computer science. One of his blog posts, a tutorial on the Caffe deep learning technology, became the most successful tutorial on the web after the official Caffe website. On the initiative of Packt Publishing, the same recipes that made his Caffe tutorial successful have been ported to write this book on the Theano technology. In the meantime, he studied a wide range of deep learning problems to gain more practice with Theano and its applications.

Chapter 8. Translating and Explaining with Encoding-decoding Networks

Encoding-decoding techniques occur when inputs and outputs belong to the same space. For example, image segmentation consists of transforming an input image into a new image, the segmentation mask; translation consists of transforming a character sequence into a new character sequence; and question-answering consists of replying to a sequence of words with a new sequence of words.

To address these challenges, encoding-decoding networks are composed of two symmetric parts: an encoding network and a decoding network. The encoder network encodes the input data into a vector, which the decoder network then uses to produce an output, such as a translation, an answer to the input question, an explanation, or an annotation of an input sentence or an input image.

An encoder network is usually composed of the first layers of one of the network types presented in the previous chapters, without the last layers...

Sequence-to-sequence networks for natural language processing


Rule-based systems are being replaced by end-to-end neural networks because of their superior performance.

An end-to-end neural network means that the network infers all possible rules directly from examples, without being given the underlying rules, such as syntax and conjugation; the words (or the characters) are fed directly into the network as input. The same is true for the output format, which can be the word indexes themselves. The architecture of the network takes care of learning the rules with its coefficients.
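As a toy illustration of feeding words directly as indexes (the helper names here are illustrative, not the book's code), a vocabulary simply maps each word to an integer, and sentences become integer sequences:

```python
# Build a toy vocabulary and convert a sentence into the index
# sequence that an end-to-end network would consume directly.
def build_vocab(sentences):
    vocab = {"<unk>": 0}  # reserve an index for unknown words
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_indexes(sentence, vocab):
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]

vocab = build_vocab(["the cat sat", "the dog ran"])
print(to_indexes("the dog sat", vocab))  # → [1, 4, 3]
```

The output indexes can be decoded back into words with the inverse mapping, which is exactly how the network's output sequence is turned into text.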

The architecture of choice for such end-to-end encoding-decoding networks applied to Natural Language Processing (NLP) is the sequence-to-sequence network, displayed in the following figure:

Word indexes are converted into their continuous multi-dimensional values in the embedded space with a lookup table. This conversion, presented in Chapter 3, Encoding Word into Vector, is a crucial step to encode the discrete...
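The lookup itself is just row indexing into an embedding matrix; a minimal NumPy sketch (array names and sizes are illustrative):

```python
import numpy as np

# A lookup table maps each discrete word index to a continuous
# embedding vector: one row per vocabulary entry.
vocab_size, embedding_dim = 6, 4
rng = np.random.RandomState(0)
embeddings = rng.randn(vocab_size, embedding_dim).astype("float32")

word_indexes = np.array([1, 4, 3])           # a sentence as indexes
sentence_vectors = embeddings[word_indexes]  # advanced indexing = lookup
print(sentence_vectors.shape)                # (3, 4): one vector per word
```

During training, the gradient flows back into only the rows that were looked up, so the embedding matrix is learned like any other weight.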

Seq2seq for translation


Sequence-to-sequence (Seq2seq) networks found their first application in language translation.

A translation task was designed for the conferences of the Association for Computational Linguistics (ACL), with a dataset, WMT16, composed of news translations in different languages. The purpose of this dataset is to evaluate new translation systems and techniques. We'll use the German-English dataset.

  1. First, preprocess the data:

    python 0-preprocess_translations.py --srcfile data/src-train.txt --targetfile data/targ-train.txt --srcvalfile data/src-val.txt --targetvalfile data/targ-val.txt --outputfile data/demo
    First pass through data to get vocab...
    Number of sentences in training: 10000
    Number of sentences in valid: 2819
    Source vocab size: Original = 24995, Pruned = 24999
    Target vocab size: Original = 35816, Pruned = 35820
    (2819, 2819)
    Saved 2819 sentences (dropped 181 due to length/unk filter)
    (10000, 10000)
    Saved 10000 sentences (dropped 0 due to length/unk filter...

Seq2seq for chatbots


A second target application of sequence-to-sequence networks is question-answering, or chatbots.

For that purpose, download the Cornell Movie-Dialogs Corpus and preprocess it:

wget http://www.mpi-sws.org/~cristian/data/cornell_movie_dialogs_corpus.zip -P /sharedfiles/
unzip /sharedfiles/cornell_movie_dialogs_corpus.zip  -d /sharedfiles/cornell_movie_dialogs_corpus

python 0-preprocess_movies.py

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts.

Since source and target sentences are in the same language, they use the same vocabulary, and the decoding network can use the same word embedding as the encoding network:

if opt.dataset == "chatbot":
    embeddings = encoder_params[0]
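The effect of this sharing can be sketched outside the book's code as follows (the class and variable names here are hypothetical): both networks reference the same parameter array, so any update made through one is seen by the other.

```python
import numpy as np

# Toy sketch of embedding sharing: encoder and decoder hold a
# reference to the same weight array rather than separate copies.
class Embedding:
    def __init__(self, weights):
        self.weights = weights
    def __call__(self, indexes):
        return self.weights[indexes]

shared = np.zeros((5, 3), dtype="float32")
encoder_embed = Embedding(shared)
decoder_embed = Embedding(shared)   # same vocabulary, same weights

encoder_embed.weights[2] += 1.0     # a (fake) gradient update
print(decoder_embed(np.array([2]))[0])  # the decoder sees the update
```

Sharing halves the number of embedding parameters and ensures both sides of the network agree on the representation of each word.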

The same commands apply to the chatbot dataset:

python 1-train.py  --dataset chatbot # training
python 1-train.py  --dataset chatbot --model model_chatbot_e100_n2_h500 # answer my question

Improving the efficiency of sequence-to-sequence networks


A first interesting point to notice in the chatbot example is the reverse-ordered input sequence: this technique has been shown to improve results.
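A minimal sketch of this reversal trick, assuming index sequences padded with 0 (the function name is illustrative): only the real tokens are reversed, while the padding stays at the end.

```python
# Reversing the source sequence shortens the distance between the
# first source words and the first target words the decoder emits.
def reverse_batch(batch, pad=0):
    out = []
    for seq in batch:
        tokens = [t for t in seq if t != pad]   # drop trailing padding
        out.append(list(reversed(tokens)) + [pad] * (len(seq) - len(tokens)))
    return out

batch = [[5, 6, 7, 0], [8, 9, 0, 0]]
print(reverse_batch(batch))  # → [[7, 6, 5, 0], [9, 8, 0, 0]]
```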

For translation, it is then very common to use a bidirectional LSTM to compute the internal state, as seen in Chapter 5, Analyzing Sentiment with a Bidirectional LSTM: two LSTMs, one running in the forward order and the other in the reverse order, run in parallel on the sequence, and their outputs are concatenated:

Such a mechanism captures information better, given both past and future context.
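The forward/backward concatenation can be sketched with a toy recurrence standing in for real LSTM cells (a simplification, not the book's implementation): the same kind of step runs in both directions, and the per-timestep outputs are joined along the feature axis.

```python
import numpy as np

# Toy bidirectional encoder: one simplified recurrence run forward,
# one run backward, outputs concatenated per timestep.
def run(x, W, reverse=False):
    steps = reversed(range(len(x))) if reverse else range(len(x))
    h, outputs = np.zeros(W.shape[0]), [None] * len(x)
    for t in steps:
        h = np.tanh(W @ h + x[t])   # simplified recurrent update
        outputs[t] = h
    return np.stack(outputs)

T, H = 5, 4
x = np.random.RandomState(1).randn(T, H)
W = np.eye(H) * 0.5
forward = run(x, W)
backward = run(x, W, reverse=True)
bidir = np.concatenate([forward, backward], axis=-1)
print(bidir.shape)  # (5, 8): hidden size doubled by concatenation
```

At each position, the concatenated vector thus summarizes everything to the left (forward pass) and everything to the right (backward pass) of that word.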

Another technique is the attention mechanism that will be the focus of the next chapter.

Lastly, refinement techniques have been developed and tested with two-dimensional Grid LSTMs, which are not very far from stacked LSTMs (the only difference is a gating mechanism in the depth/stack direction):

Grid long short-term memory

The principle of refinement is to run the stack in both orders on the input sentence as well, sequentially...

Deconvolutions for images


In the case of images, researchers have been looking for decoding operations acting as the inverse of the encoding convolutions.

The first application was the analysis and understanding of convolutional networks, as seen in Chapter 2, Classifying Handwritten Digits with a Feedforward Network, composed of convolutional layers, max-pooling layers, and rectified linear units. To better understand the network, the idea is to visualize the parts of an image that are most discriminative for a given unit of a network: one single neuron in a high-level feature map is left non-zero, and from that activation, the signal is retro-propagated back to the 2D input.

To reconstruct the signal through the max-pooling layers, the idea is to keep track of the position of the maxima within each pooling region during the forward pass. Such an architecture, named DeConvNet, can be shown as follows:

Visualizing and understanding convolutional networks

The signal is retro-propagated to the position that...
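The switch bookkeeping can be sketched in NumPy as follows (a toy 2×2 pooling example, not the book's code): the forward pass records the argmax position of each pooling region, and unpooling routes each value back to exactly that position.

```python
import numpy as np

# Max pooling that records the argmax position ("switch") of each
# region, plus the unpooling that places the signal back there.
def max_pool_with_switches(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    pooled = np.zeros((h, w))
    switches = np.zeros((h, w, 2), dtype=int)
    for i in range(h):
        for j in range(w):
            region = x[i*size:(i+1)*size, j*size:(j+1)*size]
            r, c = np.unravel_index(np.argmax(region), region.shape)
            pooled[i, j] = region[r, c]
            switches[i, j] = (i*size + r, j*size + c)  # absolute position
    return pooled, switches

def unpool(pooled, switches, shape):
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]   # everything else stays zero
    return out

x = np.array([[1., 9., 2., 4.],
              [3., 5., 8., 6.],
              [7., 0., 1., 2.],
              [4., 6., 3., 5.]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches, x.shape)
print(pooled)  # [[9. 8.] [7. 5.]]
```

Note that unpooling is lossy by design: only the maximum of each region survives, which is precisely why the reconstruction highlights the most discriminative parts of the input.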

Multimodal deep learning


To open up further applications, the encoding-decoding framework can be applied across different modalities, for example, to image captioning.

Image captioning consists of describing the content of the image with words. The input is an image, naturally encoded into a thought vector with a deep convolutional network.

The text to describe the content of the image can be produced from this internal state vector with the same stack of LSTM networks as a decoder, as in Seq2seq networks:
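A greedy decoding loop from a thought vector can be sketched as follows (all weights, dimensions, and token conventions here are illustrative; a real decoder would use trained LSTM cells): the hidden state starts from the image encoding, and the most likely word is emitted at each step until an end-of-sentence token.

```python
import numpy as np

EOS = 0  # hypothetical end-of-sentence token index

def greedy_decode(thought, W_h, W_out, embed, max_len=10):
    # Start the hidden state from the image's thought vector and feed
    # the EOS embedding as a start-of-sentence input.
    h, word, caption = thought, EOS, []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + embed[word])   # simplified recurrent step
        word = int(np.argmax(W_out @ h))     # most likely next word
        if word == EOS:                      # stop at end of sentence
            break
        caption.append(word)
    return caption

vocab_size, hidden = 5, 4
rng = np.random.RandomState(2)
W_h = rng.randn(hidden, hidden) * 0.1
W_out = rng.randn(vocab_size, hidden)
embed = rng.randn(vocab_size, hidden)
thought = rng.randn(hidden)                  # stands in for the CNN output
print(greedy_decode(thought, W_h, W_out, embed))
```

The emitted indexes are then mapped back to words with the vocabulary's inverse lookup to form the caption.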

Further reading


Please refer to the following topics for better insights:

  • Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le, Dec 2014

  • Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, Sept 2014

  • Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, May 2016

  • A Neural Conversational Model, Oriol Vinyals, Quoc Le, July 2015

  • Fast and Robust Neural Network Joint Models for Statistical Machine Translation, Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, John Makhoul, 2014

  • SYSTRAN's Pure Neural Machine Translation Systems, Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua...

Summary


As in love, head-to-toe positions provide exciting new possibilities: encoder and decoder networks use the same stack of layers, but in opposite directions.

Although it does not provide new modules to deep learning, such an encoding-decoding technique is quite important because it enables networks to be trained 'end-to-end', that is, by directly feeding the inputs and corresponding outputs, without specifying any rules or patterns for the networks and without decomposing encoding training and decoding training into two separate steps.

While image classification was a one-to-one task, and sentiment analysis a many-to-one task, encoding-decoding techniques illustrate many-to-many tasks, such as translation or image segmentation.

In the next chapter, we'll introduce an attention mechanism that gives the encoder-decoder architecture the ability to focus on certain parts of the input in order to produce a more accurate output.

