Leveraging LLM Embeddings as an Alternative to Fine-Tuning

Do not overlook embeddings as an alternative to fine-tuning a large language transformer model. Fine-tuning requires a reliable dataset, the right model configuration, and hardware resources, and building a high-quality dataset alone takes considerable time and effort.

Leveraging the embedding abilities of a Large Language Model (LLM) such as OpenAI’s Ada will enable you to customize your model with reduced cost and effort. Your model will be able to access updated data in real time. You will be implementing Retrieval Augmented Generation (RAG) through embedded texts. We used web pages and customized text for RAG in Chapter 7, The Generative AI Revolution with ChatGPT. This time, we will go further and use embeddings.

This chapter begins by explaining why searching with embeddings can sometimes be a very effective alternative to fine-tuning. We will go through the advantages and limits of this approach.

Then, we will go through...

LLM embeddings as an alternative to fine-tuning

ChatGPT models are impressive. They have taken everyone by surprise. However, ChatGPT has a memory problem: it only remembers what it learned from its training data. For example, in January 2024, ChatGPT’s cutoff date was April 2023, so it cannot answer questions about events after April 2023. OpenAI has found a workaround for some of these issues by using the Bing search engine, but this isn’t enough.

Also, ChatGPT only knows what the training set contains. For example, maybe you have information that hasn’t been made public and that ChatGPT cannot find.

In this chapter, we will build two methods:

  • An ask method using Retrieval Augmented Generation (RAG) by adding information to the prompt
  • A RAG search and ask function that leverages the Ada embedding model

In both cases, these approaches take us from prompt design to advanced prompt engineering.
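The sketch below shows, in a few lines, the flow both methods rely on: embed a question and a small set of documents, rank the documents by cosine similarity, and prepend the best match to the prompt. It is a minimal illustration rather than the book’s implementation; the embed helper is a toy bag-of-words stand-in for a real embedding model such as Ada, so the code runs without an API key, and the sample documents are invented.

import numpy as np

def embed(text, vocab):
    # Toy embedding: term counts over a fixed vocabulary (a stand-in for Ada embeddings).
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in vocab], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

documents = [
    "The 2024 product line ships in March with a new embedding service.",
    "Our 2022 annual report covers revenue and headcount.",
]
vocab = sorted(set(" ".join(documents).lower().split()))

def search_and_ask(question):
    # 1. Embed the question and the documents, 2. rank by cosine similarity,
    # 3. add the best match to the prompt: the "augmentation" step of RAG.
    question_vector = embed(question, vocab)
    scores = [cosine(question_vector, embed(doc, vocab)) for doc in documents]
    best_document = documents[int(np.argmax(scores))]
    return f"Answer using this context:\n{best_document}\n\nQuestion: {question}"

print(search_and_ask("When does the 2024 product line ship?"))

In the full pipeline, the returned prompt would be sent to the LLM; the sketch stops at the prompt to stay self-contained.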

From prompt design to prompt engineering

...

Fundamentals of text embedding with NLTK and Gensim

In this section, we will go through the fundamentals of text embedding: tokenizing a book, embedding the tokens, and exploring the vector space we created.

Open Embedding_with_NLKT_Gensim.ipynb in the chapter directory of the GitHub repository.

We will first install the libraries we will need.

Installing libraries

The program first installs the Natural Language Toolkit (NLTK):

!pip install --upgrade nltk -qq
import nltk

NLTK will take us down to the token level, as in Chapter 10, Investigating the Role of Tokenizers in Shaping Transformer Models.

We’ll use the punkt sentence tokenizer:

nltk.download('punkt')
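As a quick illustration of what punkt gives us (the sample text is ours, not the notebook’s), the sentence tokenizer splits text into sentences, and the word tokenizer splits each sentence into word-level tokens:

from nltk.tokenize import sent_tokenize, word_tokenize

sample = "Transformers changed NLP. Embeddings map tokens to vectors."
sentences = sent_tokenize(sample)               # punkt splits the text into sentences
tokens = [word_tokenize(s) for s in sentences]  # each sentence into word-level tokens
print(sentences)
print(tokens)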

The program installs gensim for the similarity tools:

!pip install gensim -qq
import gensim
print(gensim.__version__)

The output is the version:

4.3.2
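Before reading the book’s text file, here is a minimal sketch of the gensim workflow this section builds toward: train a Word2Vec model on tokenized sentences and query the vector space it creates. The toy corpus below is ours, not the chapter’s file:

from gensim.models import Word2Vec

# Toy tokenized corpus (the chapter tokenizes a full book instead).
corpus = [
    ["embeddings", "map", "tokens", "to", "vectors"],
    ["similar", "tokens", "receive", "similar", "vectors"],
    ["the", "model", "learns", "vectors", "from", "context"],
]
model = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, seed=1)

# Explore the vector space: nearest neighbors of a token.
print(model.wv.most_similar("tokens", topn=3))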

The first step is to read the file.

1. Reading the text file

The program downloads a file containing...

When studying transformer models, we tend to focus on their architecture and the datasets provided to train them. This book covers the original Transformer, BERT, RoBERTa, ChatGPT, GPT-4, PaLM, LaMDA, DALL-E, and more. In addition, the book reviews several benchmark tasks and datasets. We have fine-tuned a BERT-like model and trained a RoBERTa tokenizer using tokenizers to encode data. In the previous chapter, Chapter 9, Shattering the Black Box with Interpretable Tools, we also shattered the black box and analyzed the inner workings of a transformer model.

However, we did not explore the critical role tokenizers play or evaluate how they shape the models we build. AI is data-driven. Raffel et al. (2019), like all the authors cited in this book, spent time preparing datasets for transformer models. In this chapter, we will go through some of the issues of tokenizers that hinder or boost the performance of transformer...

Matching datasets and tokenizers

Downloading benchmark datasets to train transformers has many advantages. The data has been prepared, and every research lab uses the same references. Also, the performance of a transformer model can be compared to that of another model trained on the same data.

However, more needs to be done to improve the performance of transformers. Furthermore, implementing a transformer model in production requires careful planning and clearly defined best practices.

In this section, we will define some best practices to avoid critical stumbling blocks. Then, we will go through a few examples in Python, using cosine similarity to measure the limits of tokenization and encoding datasets (a hedged preview of such a check is sketched just below).

Let's start with best practices.
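Here is that preview: a toy cosine-similarity probe. The dataset and word pairs are our own illustrative choices, assuming gensim’s Word2Vec; they are not the book’s examples.

from gensim.models import Word2Vec

# Toy tokenized dataset (illustrative, not the book's corpus).
dataset = [
    ["the", "treaty", "was", "signed", "by", "the", "congress"],
    ["the", "congress", "ratified", "the", "treaty", "in", "september"],
]
model = Word2Vec(sentences=dataset, vector_size=32, window=3, min_count=1, seed=1)

def probe(word1, word2):
    # Cosine similarity is only defined for words the encoded dataset contains;
    # missing words expose the limits of the tokenized dataset.
    if word1 in model.wv and word2 in model.wv:
        print(f"similarity({word1!r}, {word2!r}) = {model.wv.similarity(word1, word2):.3f}")
    else:
        print(f"{word1!r} or {word2!r} is not in the vocabulary")

probe("treaty", "congress")
probe("treaty", "corslet")  # an archaic word the dataset never saw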

Best practices

Raffel et al. (2019) defined a standard text-to-text T5 transformer model. They also went further: they helped debunk the myth that raw data can be used without preprocessing it first. Preprocessing data reduces training...

Exploring sentence and WordPiece tokenizers to understand the efficiency of subword tokenizers for transformers

Transformer models commonly use BPE and WordPiece tokenization. In this section, we will understand why choosing a subword tokenizer over other tokenizers significantly impacts transformer models. The goal of this section is thus to first review some of the main word and sentence tokenizers. We will then implement subword tokenizers. But first, we will detect whether a tokenizer is a BPE or a WordPiece tokenizer. Then, we'll create a function to display the token-ID mappings. Finally, we'll analyze and control the quality of the token-ID mappings.

The first step is to review some of the main word and sentence tokenizers.
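Before turning to word and sentence tokenizers, the sketch below shows one simple way such a detection check and token-ID display might look. It is a hedged illustration using Hugging Face’s transformers library, not the book’s function, and the model names are our own choices:

from transformers import AutoTokenizer

def show_token_ids(tokenizer, text):
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # WordPiece marks word continuations with "##"; byte-level BPE marks new words with "Ġ".
    if any(token.startswith("##") for token in tokens):
        family = "WordPiece"
    elif any(token.startswith("Ġ") for token in tokens):
        family = "byte-level BPE"
    else:
        family = "undetermined"
    print(f"{family}: {list(zip(tokens, ids))}")

text = "Tokenizers shape transformer models."
show_token_ids(AutoTokenizer.from_pretrained("bert-base-uncased"), text)  # WordPiece
show_token_ids(AutoTokenizer.from_pretrained("gpt2"), text)               # byte-level BPE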

Word and sentence tokenizers

Choosing a tokenizer depends on the objectives of the NLP project. Although subword tokenizers are more efficient for transformer models, word and sentence tokenizers provide useful functionality. Sentence and word tokenizers...

Summary

In this chapter, we measured the impact of tokenization on the subsequent layers of a transformer model. A transformer model can only attend to the tokens that come out of a stack's embedding and positional encoding sub-layers. It does not matter whether the model is an encoder-decoder, encoder-only, or decoder-only model, and it does not matter whether the dataset seems good enough for training. If the tokenization process fails, even partly, our transformer model will miss critical tokens.

We first saw that raw datasets might be enough to train a transformer for standard language tasks. However, we discovered that even if a pretrained tokenizer has gone through a billion words, it only creates a dictionary with a small portion of the vocabulary it comes across. Like us, a tokenizer captures the essence of the language it is learning and only remembers the most important words if these words are also frequently used. This approach works well for a standard task and creates problems...

Questions

  1. A tokenized dictionary contains every word that exists in a language. (True/False)
  2. Pretrained tokenizers can encode any dataset. (True/False)
  3. It is good practice to check a database before using it. (True/False)
  4. It is good practice to eliminate obscene data from datasets. (True/False)
  5. It is good practice to delete data containing discriminating assertions. (True/False)
  6. Raw datasets might sometimes produce relationships between noisy content and useful content. (True/False)
  7. A standard pretrained tokenizer contains the English vocabulary of the past 700 years. (True/False)
  8. Old English can create problems when encoding data with a tokenizer trained in modern English. (True/False)
  9. Medical and other types of jargon can create problems when encoding data with a tokenizer trained in modern English. (True/False)
  10. Controlling the output of the encoded data produced by a pretrained tokenizer is good practice. (True/False)

References

You have been reading a chapter from Transformers for Natural Language Processing and Computer Vision - Third Edition, published by Packt in February 2024 (ISBN-13: 9781805128724).

Author

Denis Rothman

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first word2matrix patented embeddings and patented AI conversational agents. He began his career authoring one of the first AI cognitive Natural Language Processing (NLP) chatbots, applied as an automated language teacher for Moët et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an Advanced Planning and Scheduling (APS) solution used worldwide.