Image Captioning with Transformers

Transformer models changed the playing field for many NLP problems, redefining the state of the art by a significant margin compared to the previous leaders: RNN-based models. We have already studied Transformers and understood what makes them tick. Unlike RNN-based models, which look at one item at a time, Transformers have access to the whole sequence of items (for example, a sequence of tokens) at once, making them well suited for sequential problems. Following their success in NLP, researchers have successfully used Transformers to solve computer vision problems. Here we will learn how to use Transformers to solve a multi-modal problem involving both images and text: image captioning.

Automated image captioning, or image annotation, has a wide variety of applications. One of the most prominent applications is image retrieval in search engines. Automated image captioning can be used to retrieve all the images belonging to a certain...

Getting to know the data

Let’s first understand the data we are working with, both directly and indirectly. There are two datasets we will rely on:

  • The ILSVRC ImageNet dataset
  • The MS-COCO dataset

We will not use the first dataset directly, but it is essential for caption learning. This dataset contains images and their respective class labels (for example, cat, dog, and car). We will use a vision model that has already been trained on this dataset, so we do not have to download it or train a model on it from scratch. Next, we will use the MS-COCO dataset, which contains images and their respective captions. We will learn directly from this dataset by mapping each image to a fixed-size feature vector using the Vision Transformer and then mapping this vector to the corresponding caption using a text-based Transformer (we will discuss this process in detail later).

ILSVRC ImageNet dataset

ImageNet...

Downloading the data

The MS-COCO dataset we will be using is quite large, so we will download the files manually. To do that, follow the instructions below:

  1. Create a folder called data in the Ch11-Image-Caption-Generation folder
  2. Download the 2014 Train images set (http://images.cocodataset.org/zips/train2014.zip) containing 83K images (train2014.zip)
  3. Download the 2017 Val images set (http://images.cocodataset.org/zips/val2017.zip) containing 5K images (val2017.zip)
  4. Download the annotation sets for 2014 (annotations_trainval2014.zip) (http://images.cocodataset.org/annotations/annotations_trainval2014.zip) and 2017 (annotations_trainval2017.zip) (http://images.cocodataset.org/annotations/annotations_trainval2017.zip)
  5. Copy the downloaded zip files to the Ch11-Image-Caption-Generation/data folder
  6. Extract the zip files using the Extract to option so that the contents are unzipped into a sub-folder (if you prefer to script these steps, see the sketch after this list)
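
If you prefer to script the download and extraction, here is a minimal sketch using only Python's standard library. It assumes you run it from inside the Ch11-Image-Caption-Generation folder; the URLs are the ones listed above, and extracting each zip into a sub-folder named after it reproduces the folder layout used in the next section:

import os
import zipfile
import urllib.request

data_dir = 'data'
os.makedirs(data_dir, exist_ok=True)

urls = [
    'http://images.cocodataset.org/zips/train2014.zip',
    'http://images.cocodataset.org/zips/val2017.zip',
    'http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
    'http://images.cocodataset.org/annotations/annotations_trainval2017.zip'
]

for url in urls:
    zip_path = os.path.join(data_dir, os.path.basename(url))
    # Download the zip file if it is not already present
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    # Extract into a sub-folder named after the zip file (mimics the "Extract to" option)
    extract_dir = os.path.join(data_dir, os.path.splitext(os.path.basename(url))[0])
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)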

Once you complete the above...

Processing and tokenizing data

With the data downloaded and placed in the correct folders, let’s define the directories containing the required data:

trainval_image_dir = os.path.join('data', 'train2014', 'train2014')
trainval_captions_dir = os.path.join('data', 'annotations_trainval2014', 'annotations')
test_image_dir = os.path.join('data', 'val2017', 'val2017')
test_captions_dir = os.path.join('data', 'annotations_trainval2017', 'annotations')
trainval_captions_filepath = os.path.join(trainval_captions_dir, 'captions_train2014.json')
test_captions_filepath = os.path.join(test_captions_dir, 'captions_val2017.json')

Here we have defined the directories containing training and testing images as well as the file paths of the JSON files that contain the captions of the training and testing images.
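
This excerpt does not show how the caption files are read, but the MS-COCO caption files are plain JSON whose annotations list holds image_id and caption fields. A minimal sketch of loading them into DataFrames is shown below; the exact variable names and column handling in the book may differ:

import json
import pandas as pd

def load_captions_df(captions_filepath):
    """ Load an MS-COCO captions JSON file into a pandas DataFrame """
    with open(captions_filepath, 'r') as f:
        captions = json.load(f)
    # Each annotation record contains an id, an image_id and a caption string
    return pd.DataFrame(captions['annotations'])

trainval_captions_df = load_captions_df(trainval_captions_filepath)
test_captions_df = load_captions_df(test_captions_filepath)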

Preprocessing data

As the...

Defining a tf.data.Dataset

Now let’s look at how we can create a tf.data.Dataset using the data. We will first write a few helper functions. Namely, we’ll define:

  • parse_image() to load and process an image from a filepath
  • generate_tokenizer() to generate a tokenizer trained on the data passed to the function

First let’s discuss the parse_image() function. It takes three arguments:

  • filepath – Location of the image
  • resize_height – Height to resize the image to
  • resize_width – Width to resize the image to

The function is defined as follows:

def parse_image(filepath, resize_height, resize_width):
    """ Reading an image from a given filepath """
    
    # Reading the image
    image = tf.io.read_file(filepath)
    # Decode the JPEG, make sure there are 3 channels in the output
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype...
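
The function is cut off here in this excerpt. A minimal sketch of how the remainder might look, assuming the pixel values are converted to float32 and the image is resized to the requested dimensions (this completion is an assumption, not the book's exact code):

    image = tf.image.convert_image_dtype(image, tf.float32)
    # Resize the image to the requested height and width
    image = tf.image.resize(image, [resize_height, resize_width])

    return image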

The machine learning pipeline for image caption generation

Here we will look at the image caption generation pipeline at a very high level and then discuss it piece by piece until we have the full model. The image caption generation framework consists of two main components:

  • A pretrained Vision Transformer model to produce an image representation
  • A text-based decoder model that can decode the image representation to a series of token IDs. This uses a text tokenizer to convert tokens to token IDs and vice versa

Though Transformer models were initially used for text-based NLP problems, they have outgrown the domain of text data and have since been applied in other areas, such as image and audio data.

Here we will be using one Transformer model that can process image data and another that can process text data.

Vision Transformer (ViT)

First, let’s look at the Transformer generating the encoded vector representations of images. We will be...

Implementing the model with TensorFlow

We will now implement the model we just studied. First let’s import a few things:

import tensorflow_hub as hub
import tensorflow as tf
import tensorflow.keras.backend as K

Implementing the ViT model

Next, we are going to download the pretrained ViT model from TensorFlow Hub. We will be using a model submitted by Sayak Paul. The model is available at https://tfhub.dev/sayakpaul/vit_s16_fe/1. You can see other Vision Transformer models available at https://tfhub.dev/sayakpaul/collections/vision_transformer/1.

image_encoder = hub.KerasLayer("https://tfhub.dev/sayakpaul/vit_s16_fe/1", trainable=False)

We then define an input layer to input images and pass that to the image_encoder to get the final feature vector for that image:

image_input = tf.keras.layers.Input(shape=(224, 224, 3))
image_features = image_encoder(image_input)

You can look at the size of the final image representation by running:

...
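
The exact command is omitted in this excerpt, but printing the shape of the symbolic output is enough:

# The batch dimension is None; the feature size comes from the ViT-S/16 model
# (384 according to the model's documentation on TensorFlow Hub)
print(image_features.shape)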

Training the model

Now that the data pipeline and the model are defined, training it is quite easy. First let’s define a few parameters:

n_vocab = 4000
batch_size = 96
train_fraction = 0.6
valid_fraction = 0.2

We use a vocabulary size of 4,000 and a batch size of 96. To speed up training, we'll only use 60% of the training data and 20% of the validation data. However, you could increase these fractions to get better results. Then we get the tokenizer trained on the full training dataset:

tokenizer = generate_tokenizer(
    train_captions_df, n_vocab=n_vocab
)
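
The generate_tokenizer() implementation is not shown in this excerpt. Purely as a hypothetical sketch, a word-level version could be built with the Keras Tokenizer, assuming the captions live in a caption column of the DataFrame (the book's actual function likely also handles special tokens such as [START] and [END]):

from tensorflow.keras.preprocessing.text import Tokenizer

def generate_tokenizer(captions_df, n_vocab):
    """ Hypothetical sketch: fit a word-level tokenizer on the caption text """
    tokenizer = Tokenizer(num_words=n_vocab, oov_token='[UNK]')
    tokenizer.fit_on_texts(captions_df['caption'].tolist())
    return tokenizer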

Next we define the BLEU metric. This is the same BLEU computation from Chapter 9, Sequence-to-Sequence Learning – Neural Machine Translation, with some minor differences. Therefore, we will not repeat the discussion here.

bleu_metric = BLEUMetric(tokenizer=tokenizer)

Sample the smaller set of validation data outside the training loop to keep the set constant:

sampled_validation_captions_df ...
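
The excerpt cuts off here; the idea is simply to draw the validation sample once so that the same subset is reused every epoch. A hypothetical one-liner, assuming the validation captions live in a valid_captions_df DataFrame (a name not shown in this excerpt):

sampled_validation_captions_df = valid_captions_df.sample(frac=valid_fraction, random_state=4321)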

Evaluating the results quantitatively

There are many different techniques for evaluating the quality and relevancy of the generated captions. Here we will briefly discuss four such metrics: BLEU, ROUGE, METEOR, and CIDEr.

All these measures share a key objective: to measure the adequacy (how well the generated text conveys the intended meaning) and fluency (the grammatical correctness) of the generated text. To calculate these measures, we use a candidate sentence and a reference sentence, where the candidate is the sentence/phrase predicted by our algorithm and the reference is the true sentence/phrase we want to compare it with.
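
To make the candidate/reference distinction concrete, here is a small illustration using NLTK's BLEU implementation (NLTK is used here purely for illustration and is not the implementation used in this chapter):

from nltk.translate.bleu_score import sentence_bleu

reference = 'a dog is running on the grass'.split()
candidate = 'a dog runs on the grass'.split()

# sentence_bleu accepts a list of references for a single candidate;
# we restrict it to unigrams and bigrams so this short example gives a non-zero score
score = sentence_bleu([reference], candidate, weights=(0.5, 0.5))
print(f"BLEU score: {score:.4f}")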

BLEU

Bilingual Evaluation Understudy (BLEU) was proposed by Papineni and others in BLEU: A Method for Automatic Evaluation of Machine Translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July (2002)...

Evaluating the model

With the model trained, let's test it on the unseen test dataset. The testing logic is almost identical to the validation logic we discussed earlier during model training, so we will not repeat the discussion here.

bleu_metric = BLEUMetric(tokenizer=tokenizer)
test_dataset, _ = generate_tf_dataset(
    test_captions_df, tokenizer=tokenizer, n_vocab=n_vocab, batch_size=batch_size, training=False
)
test_loss, test_accuracy, test_bleu = [], [], []
for ti, t_batch in enumerate(test_dataset):
    print(f"{ti+1} batches processed", end='\r')
    # Compute the loss and accuracy on the batch without updating the weights
    loss, accuracy = full_model.test_on_batch(t_batch[0], t_batch[1])
    # Get the predicted token probabilities and compute BLEU against the true captions
    batch_predicted = full_model.predict_on_batch(t_batch[0])
    bleu_score = bleu_metric.calculate_bleu_from_predictions(t_batch[1], batch_predicted)
    test_loss.append(loss)
    test_accuracy.append(accuracy)
    test_bleu.append(bleu_score)
print(
    f"\ntest_loss: {np.mean(test_loss)} - test_accuracy: {np.mean...

Captions generated for test images

With the help of metrics such as accuracy and BLEU, we have ensured our model is performing well. But one of the most important tasks a trained model has to perform is generating outputs for new data. Here we will learn how to use our model to generate actual captions. Let's first understand caption generation at a conceptual level. Generating the image representation from an image is straightforward; the tricky part is adapting the text decoder to generate captions. As you can imagine, the decoder needs to operate differently at inference time than during training, because at inference we don't have caption tokens to feed to the model.

The way we predict with our model is by starting with the image and a seed caption containing the single token [START]. We feed these two inputs to the model to generate the next token. We then append the new token to the current input and predict the next token...
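
Below is a sketch of that greedy decoding loop. The exact input format expected by full_model, the special token IDs, and the maximum caption length are assumptions made for illustration, and the ID-to-word mapping follows the hypothetical Keras Tokenizer sketched earlier; this is not the book's exact code:

import numpy as np

def generate_caption(full_model, image, tokenizer, start_id, end_id, max_length=30):
    """ Greedily generate a caption for one preprocessed image (illustrative sketch) """
    # Start the caption with just the [START] token
    caption_ids = [start_id]
    for _ in range(max_length - 1):
        # Pad the current caption to the decoder's expected length (assumed to be max_length)
        decoder_input = np.array([caption_ids + [0] * (max_length - len(caption_ids))])
        # Predict token probabilities for every position of the caption
        preds = full_model.predict_on_batch([np.expand_dims(image, 0), decoder_input])
        # Pick the most likely token at the position we are currently generating
        next_id = int(np.argmax(preds[0, len(caption_ids) - 1]))
        caption_ids.append(next_id)
        if next_id == end_id:
            break
    # Map token IDs back to words, dropping special and padding tokens
    return ' '.join(
        tokenizer.index_word.get(i, '') for i in caption_ids
        if i not in (start_id, end_id, 0)
    )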

Summary

In this chapter, we focused on a very interesting task that involves generating captions for given images. Our image-captioning model was one of the most complex models in this book. It included the following:

  • A Vision Transformer (ViT) model that produces an image representation
  • A text-based Transformer decoder

Before we began with the model, we analyzed our dataset to understand various characteristics, such as image sizes and the vocabulary size. Then we learned how to use a tokenizer to tokenize caption strings. We then used this knowledge to build a TensorFlow data pipeline.

We discussed each component in detail. The Vision Transformer (ViT) takes in an image and produces a hidden representation of that image. Specifically, the ViT breaks an image into a sequence of 16x16 patches of pixels. It then treats each patch as a token, feeding its embedding (along with positional information) to the Transformer to produce a representation of each...
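
To see what this patching step looks like in isolation, here is a small sketch using tf.image.extract_patches. This only illustrates the patch extraction; the pretrained ViT we downloaded performs this step internally:

import tensorflow as tf

# A dummy batch containing one 224x224 RGB image
images = tf.random.uniform((1, 224, 224, 3))

# Cut the image into non-overlapping 16x16 patches
patches = tf.image.extract_patches(
    images=images,
    sizes=[1, 16, 16, 1],
    strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1],
    padding='VALID'
)

# 224/16 = 14, so we get a 14x14 grid of patches, each flattened to 16*16*3 = 768 values
print(patches.shape)  # (1, 14, 14, 768)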
