
You're reading from  Deep Learning for Computer Vision

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788295628
Edition: 1st Edition
Author: Rajalingappaa Shanmugamani

Rajalingappaa Shanmugamani is currently working as an Engineering Manager for a deep learning team at Kairos. Previously, he worked as a Senior Machine Learning Developer at SAP, Singapore, and at various startups developing machine learning products. He has a master's degree from the Indian Institute of Technology Madras. He has published articles in peer-reviewed journals and conferences and has submitted applications for several patents in the area of machine learning. In his spare time, he coaches school students and engineers in programming and machine learning.

Chapter 7. Image Captioning

In this chapter, we will deal with the problem of captioning images. This involves detecting the objects in an image and generating a text caption for it. Image captioning can also be called image-to-text translation. Once considered a very tough problem, it can now be solved with reasonably good results. This chapter requires a dataset of images with corresponding captions. We will discuss the techniques and applications of image captioning in detail.

We will cover the following topics in this chapter:

  • Understanding the different datasets and the metrics used for evaluation
  • Understanding some techniques used for natural language processing problems
  • Different word-to-vector models
  • Several algorithms for image captioning
  • Adverse results and scope for improvement

Understanding the problem and datasets


The process of automatically generating captions for images is a key deep learning task, as it combines the two worlds of language and vision. The uniqueness of the problem makes it one of the primary problems in computer vision. A deep learning model for image captioning should be able to identify the objects present in the image and also generate natural language text that expresses the relationships between the objects and their actions. There are a few datasets for this problem. The most famous of them is an extension of the COCO dataset covered in Chapter 4, Object Detection.

Understanding natural language processing for image captioning


As natural language has to be generated from the image, getting familiar with natural language processing (NLP) becomes important. NLP is a vast subject, and hence we will limit our scope to topics that are relevant to image captioning. One form of natural language is text. Text is a sequence of words or characters. The basic unit of text used for processing is called a token, which is a sequence of characters. A character is the smallest element of text.

In order to process natural language in the form of text, the text first has to be preprocessed by removing punctuation, brackets, and so on. Then, the text has to be tokenized into words by splitting it on spaces. Finally, the words have to be converted to vectors. Next, we will see how this vector conversion can help.
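The preprocessing and tokenization steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's own code: the function names, the lowercasing step, and the integer vocabulary (a common intermediate step before words are mapped to vectors) are assumptions made for this example.

```python
import string


def preprocess(text):
    """Lowercase the text (an assumption) and strip punctuation and brackets."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)


def tokenize(text):
    """Split the preprocessed text into word tokens on whitespace."""
    return preprocess(text).split()


def build_vocabulary(tokens):
    """Give each distinct token an integer index in order of first appearance."""
    vocabulary = {}
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
    return vocabulary


# Hypothetical caption used only for illustration.
caption = "A man (in a red shirt) is riding a horse."
tokens = tokenize(caption)
vocabulary = build_vocabulary(tokens)
indices = [vocabulary[token] for token in tokens]
```

The integer indices are what an embedding layer would typically consume when converting words to vectors.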

Expressing words in vector form

Expressing words in vector form makes it possible to perform arithmetic operations on them. The vector has to be compact, with fewer dimensions...
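To make the idea of arithmetic on word vectors concrete, here is a small sketch in plain Python. The three-dimensional vectors are hand-made, hypothetical values chosen for illustration; real word vectors are learned from data and have many more dimensions. The helper names are also assumptions of this example.

```python
import math

# Hypothetical hand-made vectors, not learned embeddings.
word_vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(vector, exclude=()):
    """Return the vocabulary word most similar to the vector, skipping excluded words."""
    candidates = (word for word in word_vectors if word not in exclude)
    return max(candidates,
               key=lambda word: cosine_similarity(vector, word_vectors[word]))


# Element-wise arithmetic: king - man + woman
king, man, woman = (word_vectors[w] for w in ("king", "man", "woman"))
result = [k - m + w for k, m, w in zip(king, man, woman)]
closest = nearest_word(result, exclude=("king", "man", "woman"))
```

With these toy values, the word closest to the result is "queen", which shows how relationships between words can be captured by vector offsets.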

Implementing attention-based image captioning


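Let's define the CNN, based on VGG, and the LSTM model using the following code:

import tensorflow as tf

# input_tensor, input_shape, previous_words, cnn_features, and the size
# hyperparameters used below are assumed to be defined earlier.

# VGG16 pretrained on ImageNet, without its fully connected classification layers
vgg_model = tf.keras.applications.vgg16.VGG16(weights='imagenet',
                                              include_top=False,
                                              input_tensor=input_tensor,
                                              input_shape=input_shape)

# Embed the previous words of the caption, then apply ReLU and dropout
word_embedding = tf.keras.layers.Embedding(
    vocabulary_size, embedding_dimension, input_length=sequence_length)
embedding = word_embedding(previous_words)
embedding = tf.keras.layers.Activation('relu')(embedding)
embedding = tf.keras.layers.Dropout(dropout_prob)(embedding)

# Flatten the spatial grid of CNN features into a sequence of locations,
# then average over the locations to get one feature vector per image
cnn_features_flattened = tf.keras.layers.Reshape((height * height, shape))(cnn_features)
net = tf.keras.layers.GlobalAveragePooling1D()(cnn_features_flattened)

# Project the pooled image features to the embedding dimension
net = tf.keras.layers.Dense(embedding_dimension, activation='relu')(net)
net = tf.keras.layers.Dropout(dropout_prob)(net)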
net = tf.keras.layers.RepeatVector(sequence_length...
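The listing above is cut off, so the following is only a plain-Python sketch of what the reshape, global average pooling, and repeat steps compute for a single image. It is not the Keras implementation: the helper names and the toy feature map are hypothetical, the dense projection and dropout are omitted, and it assumes the pooled vector is repeated once per caption position.

```python
def flatten_locations(feature_map):
    """Turn a height x width grid of channel vectors into a flat list of locations."""
    return [channels for row in feature_map for channels in row]


def global_average_pool(locations):
    """Average the channel vectors over all locations, giving one vector."""
    count = len(locations)
    return [sum(values) / count for values in zip(*locations)]


def repeat_vector(vector, times):
    """Copy one vector `times` times, giving one copy per sequence position."""
    return [list(vector) for _ in range(times)]


# Toy 2 x 2 feature map with 3 channels per location (hypothetical values).
feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 4.0, 2.0]],
]
locations = flatten_locations(feature_map)  # 4 locations x 3 channels
pooled = global_average_pool(locations)     # 3 channels
repeated = repeat_vector(pooled, times=5)   # 5 positions x 3 channels
```

Repeating the image vector along the sequence axis gives it the same number of positions as the word embeddings, so the two can be combined position by position.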

Summary


In this chapter, we looked at the problems associated with image captioning. We saw a few techniques involving natural language processing and various word embedding models such as GloVe. We covered several algorithms such as CNN2RNN, metric learning, and combined objective. Later, we implemented a model that combines a CNN and an LSTM.

In the next chapter, we will come to understand generative models. We will learn and implement style transfer algorithms from scratch and cover a few of the best models. We will also cover Generative Adversarial Networks (GANs) and their various applications.

 
