Reader small image

You're reading from  Transformers for Natural Language Processing and Computer Vision - Third Edition

Product typeBook
Published inFeb 2024
Reading LevelN/a
PublisherPackt
ISBN-139781805128724
Edition3rd Edition
Languages
Tools
Right arrow
Author (1)
Denis Rothman
Denis Rothman
author image
Denis Rothman

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first word2matrix patented embedding and patented AI conversational agents. He began his career authoring one of the first AI cognitive Natural Language Processing (NLP) chatbots applied as an automated language teacher for Moet et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an Advanced Planning and Scheduling (APS) solution used worldwide.
Read more about Denis Rothman

Right arrow

Preface

Transformer-driven Generative AI models are a game-changer for Natural Language Processing (NLP) and computer vision. Large Language Generative AI transformer models have achieved superhuman performance through services such as ChatGPT with GPT-4V for text, image, data science, and hundreds of domains. We have gone from primitive Generative AI to superhuman AI performance in just a few years!

Language understanding has become the pillar of language modeling, chatbots, personal assistants, question answering, text summarizing, speech-to-text, sentiment analysis, machine translation, and more. The expansion from the early Large Language Models (LLMs) to multimodal (text, image, sound) algorithms has taken AI into a new era.

For the past few years, we have been witnessing the expansion of social networks versus physical encounters, e-commerce versus physical shopping, digital newspapers, streaming versus physical theaters, remote doctor consultations versus physical visits, remote work instead of on-site tasks, and similar trends in hundreds more domains. This digital activity is now increasingly driven by transformer copilots in hundreds of applications.

The transformer architecture began just a few years ago as revolutionary and disruptive. It broke with the past, leaving the dominance of RNNs and CNNs behind. BERT and GPT models abandoned recurrent network layers and replaced them with self-attention. But in 2023, OpenAI GPT-4 proposed AI into new realms with GPT-4V (vision transformer), which is paving the path for functional (everyday tasks) AGI. Google Vertex AI offered similar technology. 2024 is not a new year in AI; it’s a new decade! Meta (formerly Facebook) has released Llama 2, which we can deploy seamlessly on Hugging Face.

Transformer encoders and decoders contain attention heads that train separately, parallelizing cutting-edge hardware. Attention heads can run on separate GPUs, opening the door to billion-parameter models and soon-to-come trillion-parameter models.

The increasing amount of data requires training AI models at scale. As such, transformers pave the way to a new era of parameter-driven AI. Learning to understand how hundreds of millions of words and images fit together requires a tremendous amount of parameters. Transformer models such as Google Vertex AI PaLM 2 and OpenAI GPT-4V have taken emergence to another level. Transformers can perform hundreds of NLP tasks they were not trained for.

Transformers can also learn image classification and reconstruction by embedding images as sequences of words. This book will introduce you to cutting-edge computer vision transformers such as Vision Transformers (ViTs), CLIP, GPT-4V, DALL-E 3, and Stable Diffusion.

Think of how many humans it would take to control the content of the billions of messages posted on social networks per day to decide if they are legal and ethical before extracting the information they contain.

Think of how many humans would be required to translate the millions of pages published each day on the web. Or imagine how many people it would take to manually control the millions of messages and images made per minute!

Imagine how many humans it would take to write the transcripts of all of the vast amount of hours of streaming published per day on the web. Finally, think about the human resources that would be required to replace AI image captioning for the billions of images that continuously appear online.

This book will take you from developing code to prompt engineering, a new “programming” skill that controls the behavior of a transformer model. Each chapter will take you through the key aspects of language understanding and computer vision from scratch in Python, PyTorch, and TensorFlow.

You will learn the architecture of the Original Transformer, Google BERT, GPT-4, PaLM 2, T5, ViT, Stable Diffusion, and several other models. You will fine-tune transformers, train models from scratch, and learn to use powerful APIs.

You will keep close to the market and its demand for language understanding in many fields, such as media, social media, and research papers, for example. You will learn how to improve Generative AI models with Retrieval Augmented Generation (RAG), embedding-based searches, prompt engineering, and automated ideation with AI-generated prompts.

Throughout the book, you will work hands-on with Python, PyTorch, and TensorFlow. You will be introduced to the key AI language understanding neural network models. You will then learn how to explore and implement transformers.

You will learn the skills required not only to adapt to the present market but also to acquire the vision to face innovative projects and AI evolutions. This book aims to give readers both the knowledge and the vision to select the right models and environment for any given project.

Who this book is for

This book is not an introduction to Python programming or machine learning concepts. Instead, it focuses on deep learning for machine translation, speech-to-text, text-to-speech, language modeling, question answering, and many more NLP domains, as well as computer vision multimodal tasks.

Readers who can benefit the most from this book are:

  • Deep learning, vision, and NLP practitioners familiar with Python programming.
  • Data analysts, data scientists, and machine learning/AI engineers who want to understand how to process and interrogate the increasing amounts of language-driven and image data.

What this book covers

Part I: The Foundations of Transformers

Chapter 1, What Are Transformers?, explains, at a high level, what transformers and Foundation Models are. We will first unveil the incredible power of the deceptively simple O(1) time complexity of transformer models that changed everything. We will continue to discover how a hardly known transformer algorithm in 2017 rose to dominate so many domains and brought us Foundation Models.

Chapter 2, Getting Started with the Architecture of the Transformer Model, goes through the background of NLP to understand how RNN, LSTM, and CNN architectures were abandoned and how the transformer architecture opened a new era. We will go through the Original Transformer’s architecture through the unique Attention Is All You Need approach invented by the Google Research and Google Brain authors. We will describe the theory of transformers. We will get our hands dirty in Python to see how multi-attention head sublayers work.

Chapter 3, Emergent vs Downstream Tasks: The Unseen Depths of Transformers, bridges the gap between the functional and mathematical architecture of transformers by introducing emergence. We will then see how to measure the performance of transformers before exploring several downstream tasks, such as the Standard Sentiment TreeBank (SST-2), linguistic acceptability, and Winograd schemas.

Chapter 4, Advancements in Translations with Google Trax, Google Translate, and Gemini, goes through machine translation in three steps. We will first define what machine translation is. We will then preprocess a Workshop on Machine Translation (WMT) dataset. Finally, we will see how to implement machine translations.

Chapter 5, Diving into Fine-Tuning through BERT, builds on the architecture of the Original Transformer. Bidirectional Encoder Representations from Transformers (BERT) takes transformers into a vast new way of perceiving the world of NLP. Instead of analyzing a past sequence to predict a future sequence, BERT attends to the whole sequence! We will first go through the key innovations of BERT’s architecture and then fine-tune a BERT model by going through each step in a Google Colaboratory notebook. Like humans, BERT can learn tasks and perform other new ones without having to learn the topic from scratch.

Chapter 6, Pretraining a Transformer from Scratch through RoBERTa, builds a RoBERTa transformer model from scratch using the Hugging Face PyTorch modules. The transformer will be both BERT-like and DistilBERT-like. First, we will train a tokenizer from scratch on a customized dataset. Finally, we will put the knowledge acquired in this chapter to work and pretrain a Generative AI customer support model on X (formerly Twitter) data.

Part II: The Rise of Suprahuman NLP

Chapter 7, The Generative AI Revolution with ChatGPT, goes through the tremendous improvements and diffusion of ChatGPT models into the everyday lives of developers and end-users. We will first examine the architecture of OpenAI’s GPT models before working with the GPT-4 API and its hyperparameters to implement several NLP examples. Finally, we will learn how to obtain better results with Retrieval Augmented Generation (RAG). We will implement an example of automated RAG with GPT-4.

Chapter 8, Fine-Tuning OpenAI GPT Models, explores fine-tuning to make sense of the choices we can make for a project to go in this direction or not. We will introduce risk management perspectives. We will prepare a dataset and fine-tune a cost-effective babbage-02 model for a completion task.

Chapter 9, Shattering the Black Box with Interpretable Tools, lifts the lid on the black box that is transformer models by visualizing their activity. We will use BertViz to visualize attention heads, Language Interpretability Tool (LIT) to carry out a Principal Component Analysis (PCA), and LIME to visualize transformers via dictionary learning. OpenAI LLMs will take us deeper and visualize the activity of a neuron in a transformer with an interactive interface. This approach opens the door to GPT-4 explaining a transformer, for example.

Chapter 10, Investigating the Role of Tokenizers in Shaping Transformer Models, introduces some tokenizer-agnostic best practices to measure the quality of a tokenizer. We will describe basic guidelines for datasets and tokenizers from a tokenization perspective. We will explore word and subword tokenizers and show how a tokenizer can shape a transformer model’s training and performance. Finally, we will build a function to display and control token-ID mappings.

Chapter 11, Leveraging LLM Embeddings as an Alternative to Fine-Tuning, explains why searching with embeddings can sometimes be a very effective alternative to fine-tuning. We will go through the advantages and limits of this approach. We will go through the fundamentals of text embeddings. We will build a program that reads a file, tokenizes it, and embeds it with Gensim and Word2Vec. We will implement a question-answering program on sports events and use OpenAI Ada to embed Amazon fine food reviews. By the end of the chapter, we will have taken a system from prompt design to advanced prompt engineering using embeddings for RAG.

Chapter 12, Toward Syntax-Free Semantic Role Labeling with ChatGPT and GPT-4, goes through the revolutionary concepts of syntax-free, nonrepetitive stochastic models. We will use ChatGPT Plus with GPT-4 to run easy to complex Semantic Role Labeling (SRL) samples. We will see how a general-purpose, emergent model reacts to our SRL requests. We will progressively push the transformer model to the limits of SRL.

Chapter 13, Summarization with T5 and ChatGPT, goes through the concepts and architecture of the T5 transformer model. We will then apply T5 to summarize documents with Hugging Face models. The examples in this chapter will be legal and medical to explore domain-specific summarization beyond simple texts. We are not looking for an easy way to implement NLP but preparing ourselves for the reality of real-life projects. We will then compare T5 and ChatGPT approaches to summarization.

Chapter 14, Exploring Cutting-Edge LLMs with Vertex AI and PaLM 2, examines Pathways to understand PaLM. We will continue and look at the main features of PaLM (Pathways Language Model), a decoder-only, densely activated, and autoregressive transformer model with 540 billion parameters trained on Google’s Pathways system. We will see how Google PaLM 2 can perform a chat task, a discriminative task (such as classification), a completion task (also known as a generative task), and more. We will implement the Vertex AI PaLM 2 API for several NLP tasks, including question-answering and summarization. Finally, we will go through Google Cloud’s fine-tuning process.

Chapter 15, Guarding the Giants: Mitigating Risks in Large Language Models, examines the risks of LLMs, risk management, and risk mitigation tools. The chapter explains hallucinations, memorization, risky emergent behavior, disinformation, influence operations, harmful content, adversarial attacks (“jailbreaks”), privacy, cybersecurity, overreliance, and memorization. We will then go through some risk mitigation tools through advanced prompt engineering, such as implementing a moderation model, a knowledge base, keyword parsing, prompt pilots, post-processing moderation, and embeddings.

Part III: Generative Computer Vision: A New Way to See the World

Chapter 16, Beyond Text: Vision Transformers in the Dawn of Revolutionary AI, explores the innovative transformer models that respect the basic structure of the Original Transformer but make some significant changes. We will discover powerful computer vision transformers like ViT, CLIP, DALL-E, and GPT-4V. We will implement vision transformers in code, including GPT-4V, and expand the text-image interactions of DALL-3 to divergent semantic association. We will take OpenAI models into the nascent world of highly divergent semantic association creativity.

Chapter 17, Transcending the Image-Text Boundary with Stable Diffusion, delves into to diffusion models, introducing Stable Vision, which has created a disruptive generative image AI wave rippling through the market. We will then dive into the principles, math, and code of the remarkable Keras Stable Diffusion model. We will go through each of the main components of a Stable Diffusion model and peek into the source code provided by Keras and run the model. We will run a text-to-video synthesis model with Hugging Face and a video-to-text task with Meta’s TimeSformer.

Chapter 18, Hugging Face AutoTrain: Training Vision Models without Coding, explores how to train a vision transformer using Hugging Face’s AutoTrain. We will go through the automated training process and discover the unpredictable problems that show why even automated ML requires human AI expertise. The goal of this chapter is also to show how to probe the limits of a computer vision model, no matter how sophisticated it is.

Chapter 19, On the Road to Functional AGI with HuggingGPT and its Peers, shows how we can use cross-platform chained models to solve difficult image classification problems. We will put HuggingGPT and Google Cloud Vision to work to identify easy, difficult, and very difficult images. We will go beyond classical pipelines and explore how to chain heterogeneous competing models.

Chapter 20, Beyond Human-Designed Prompts with Generative Ideation, explores generative ideation, an ecosystem that automates the production of an idea to text and image content. The development phase requires highly skilled human AI experts. For an end user, the ecosystem is a click-and-run experience. By the end of this chapter, we will be able to deliver ethical, exciting, generative ideation to companies with no marketing resources. We will be able to expand generative ideation to any field in an exciting, cutting-edge, yet ethical ecosystem.

To get the most out of this book

Most of the programs in the book are Jupyter notebooks. All you will need is a free Google Gmail account, and you will be able to run the notebooks on Google Colaboratory’s free VM.

Take the time to read Chapter 2, Getting Started with the Architecture of the Transformer Model. Chapter 2 contains the description of the Original Transformer. If you find it difficult, then pick up the general intuitive ideas from the chapter. You can then go back to that chapter when you feel more comfortable with transformers after a few chapters.

After reading each chapter, consider how you could implement transformers for your customers or use them to move up in your career with novel ideas.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that contains color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781805128724.

Conventions used

There are several text conventions used throughout this book.

CodeInText: Indicates sentences and words run through the models in the book, code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example, “However, if you wish to explore the code, you will find it in the Google Colaboratory positional_encoding.ipynb notebook and the text.txt file in this chapter’s GitHub repository.”

A block of code is set as follows:

import numpy as np
from scipy.special import softmax

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

The black cat sat on the couch and the brown dog slept on the rug.

Any command-line input or output is written as follows:

vector similarity
[[0.9627094]] final positional encoding similarity

Bold: Indicates a new term, an important word, or words that you see on the screen.

For instance, words in menus or dialog boxes also appear in the text like this. For example:

“In our case, we are looking for t5-large, a t5-large model we can smoothly run in

Google Colaboratory.”

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email feedback@packtpub.com and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at questions@packtpub.com.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packtpub.com with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you’ve read Transformers for Natural Language Processing and Computer Vision - Third Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily

Follow these simple steps to get the benefits:

  1. Scan the QR code or visit the link below
Qr code  Description automatically generated

https://packt.link/free-ebook/9781805128724

  1. Submit your proof of purchase
  2. That’s it! We’ll send your free PDF and other benefits to your email directly
lock icon
The rest of the chapter is locked
You have been reading a chapter from
Transformers for Natural Language Processing and Computer Vision - Third Edition
Published in: Feb 2024Publisher: PacktISBN-13: 9781805128724
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime

Author (1)

author image
Denis Rothman

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first word2matrix patented embedding and patented AI conversational agents. He began his career authoring one of the first AI cognitive Natural Language Processing (NLP) chatbots applied as an automated language teacher for Moet et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an Advanced Planning and Scheduling (APS) solution used worldwide.
Read more about Denis Rothman