Leveraging LLM Embeddings as an Alternative to Fine-Tuning

Do not overlook embeddings as an alternative to fine-tuning a large language transformer model. Fine-tuning requires a reliable dataset, the right model configuration, and hardware resources, and building a high-quality dataset alone takes considerable time and effort.

Leveraging the embedding abilities of a Large Language Model (LLM) such as OpenAI’s Ada will enable you to customize your model with reduced cost and effort. Your model will be able to access updated data in real time. You will be implementing Retrieval Augmented Generation (RAG) through embedded texts. We used web pages and customized text for RAG in Chapter 7, The Generative AI Revolution with ChatGPT. This time, we will go further and use embeddings.

This chapter begins by explaining why searching with embeddings can sometimes be a very effective alternative to fine-tuning. We will go through the advantages and limits of this approach.

Then, we will go through...

LLM embeddings as an alternative to fine-tuning

ChatGPT models are impressive. They have taken everyone by surprise. However, ChatGPT has a memory problem: it only remembers what it learned from its training data. For example, in January 2024, ChatGPT’s cutoff date was April 2023, so it cannot answer questions about events after April 2023. OpenAI has found a workaround for some of these issues by using the Bing search engine, but this isn’t enough.

Also, ChatGPT only knows what the training set contains. For example, maybe you have information that hasn’t been made public and that ChatGPT cannot find.

In this chapter, we will build two methods:

  • An ask method using Retrieval Augmented Generation (RAG) by adding information to the prompt
  • A RAG search and ask function that leverages the Ada embedding model

In both cases, these approaches take us from prompt design to advanced prompt engineering.
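The sketch below shows, in a few lines, the flow both methods rely on: embed a question and a small set of documents, rank the documents by cosine similarity, and prepend the best match to the prompt. It is a minimal illustration rather than the book’s implementation; the embed helper is a toy bag-of-words stand-in for a real embedding model such as Ada, so the code runs without an API key, and the sample documents are invented.

import numpy as np

def embed(text, vocab):
    # Toy embedding: term counts over a fixed vocabulary (a stand-in for Ada embeddings).
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in vocab], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

documents = [
    "The 2024 product line ships in March with a new embedding service.",
    "Our 2022 annual report covers revenue and headcount.",
]
vocab = sorted(set(" ".join(documents).lower().split()))

def search_and_ask(question):
    # 1. Embed the question and the documents, 2. rank by cosine similarity,
    # 3. add the best match to the prompt: the "augmentation" step of RAG.
    question_vector = embed(question, vocab)
    scores = [cosine(question_vector, embed(doc, vocab)) for doc in documents]
    best_document = documents[int(np.argmax(scores))]
    return f"Answer using this context:\n{best_document}\n\nQuestion: {question}"

print(search_and_ask("When does the 2024 product line ship?"))

In the full pipeline, the returned prompt would be sent to the LLM; the sketch stops at the prompt to stay self-contained.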

From prompt design to prompt engineering

...

Fundamentals of text embedding with NLTK and Gensim

In this section, we will go through the fundamentals of text embedding: tokenizing a book, embedding the tokens, and exploring the vector space we created.

Open Embedding_with_NLKT_Gensim.ipynb in the chapter directory of the GitHub repository.

We will first install the libraries we will need.

Installing libraries

The program first installs the Natural Language Toolkit (NLTK):

!pip install --upgrade nltk -qq
import nltk

NLTK will take us down to the token level, as in Chapter 10, Investigating the Role of Tokenizers in Shaping Transformer Models.

We’ll use the punkt sentence tokenizer:

nltk.download('punkt')
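As a quick illustration of what punkt gives us (the sample text is ours, not the notebook’s), the sentence tokenizer splits text into sentences, and the word tokenizer splits each sentence into word-level tokens:

from nltk.tokenize import sent_tokenize, word_tokenize

sample = "Transformers changed NLP. Embeddings map tokens to vectors."
sentences = sent_tokenize(sample)               # punkt splits the text into sentences
tokens = [word_tokenize(s) for s in sentences]  # each sentence into word-level tokens
print(sentences)
print(tokens)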

The program installs gensim for the similarity tools:

!pip install gensim -qq
import gensim
print(gensim.__version__)

The output is the version:

4.3.2
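Before reading the book’s text file, here is a minimal sketch of the gensim workflow this section builds toward: train a Word2Vec model on tokenized sentences and query the vector space it creates. The toy corpus below is ours, not the chapter’s file:

from gensim.models import Word2Vec

# Toy tokenized corpus (the chapter tokenizes a full book instead).
corpus = [
    ["embeddings", "map", "tokens", "to", "vectors"],
    ["similar", "tokens", "receive", "similar", "vectors"],
    ["the", "model", "learns", "vectors", "from", "context"],
]
model = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, seed=1)

# Explore the vector space: nearest neighbors of a token.
print(model.wv.most_similar("tokens", topn=3))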

The first step is to read the file.

1. Reading the text file

The program downloads a file containing...

When studying transformer models, we tend to focus on their architecture and the datasets provided to train them. This book covers the original Transformer, BERT, RoBERTa, ChatGPT, GPT-4, PaLM, LaMDA, DALL-E, and more. In addition, the book reviews several benchmark tasks and datasets. We have fine-tuned a BERT-like model and trained a RoBERTa tokenizer using tokenizers to encode data. In the previous chapter, Chapter 9, Shattering the Black Box with Interpretable Tools, we also shattered the black box and analyzed the inner workings of a transformer model.

However, we did not explore the critical role tokenizers play or evaluate how they shape the models we build. AI is data-driven. Raffel et al. (2019), like all the authors cited in this book, spent time preparing datasets for transformer models. In this chapter, we will go through some of the issues of tokenizers that hinder or boost the performance of transformer...

Matching datasets and tokenizers

Downloading benchmark datasets to train transformers has many advantages. The data has been prepared, and every research lab uses the same references. Also, the performance of a transformer model can be compared to that of another model trained on the same data.

However, more needs to be done to improve the performance of transformers. Furthermore, implementing a transformer model in production requires careful planning and clearly defined best practices.

In this section, we will define some best practices to avoid critical stumbling blocks. Then, we will go through a few examples in Python, using cosine similarity to measure the limits of tokenization and encoding datasets (a hedged preview of such a check is sketched just below).

Let's start with best practices.
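Here is that preview: a toy cosine-similarity probe. The dataset and word pairs are our own illustrative choices, assuming gensim’s Word2Vec; they are not the book’s examples.

from gensim.models import Word2Vec

# Toy tokenized dataset (illustrative, not the book's corpus).
dataset = [
    ["the", "treaty", "was", "signed", "by", "the", "congress"],
    ["the", "congress", "ratified", "the", "treaty", "in", "september"],
]
model = Word2Vec(sentences=dataset, vector_size=32, window=3, min_count=1, seed=1)

def probe(word1, word2):
    # Cosine similarity is only defined for words the encoded dataset contains;
    # missing words expose the limits of the tokenized dataset.
    if word1 in model.wv and word2 in model.wv:
        print(f"similarity({word1!r}, {word2!r}) = {model.wv.similarity(word1, word2):.3f}")
    else:
        print(f"{word1!r} or {word2!r} is not in the vocabulary")

probe("treaty", "congress")
probe("treaty", "corslet")  # an archaic word the dataset never saw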

Best practices

Raffel et al. (2019) defined a standard text-to-text T5 transformer model. They also went further: they helped debunk the myth that raw data can be used without preprocessing it first. Preprocessing data reduces training...

Exploring sentence and WordPiece tokenizers to understand the efficiency of subword tokenizers for transformers

Transformer models commonly use BPE and WordPiece tokenization. In this section, we will understand why choosing a subword tokenizer over other tokenizers significantly impacts transformer models. The goal of this section is thus to first review some of the main word and sentence tokenizers. We will then implement subword tokenizers. But first, we will detect whether a tokenizer is a BPE or a WordPiece tokenizer. Then, we'll create a function to display the token-ID mappings. Finally, we'll analyze and control the quality of the token-ID mappings.

The first step is to review some of the main word and sentence tokenizers.
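Before turning to word and sentence tokenizers, the sketch below shows one simple way such a detection check and token-ID display might look. It is a hedged illustration using Hugging Face’s transformers library, not the book’s function, and the model names are our own choices:

from transformers import AutoTokenizer

def show_token_ids(tokenizer, text):
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # WordPiece marks word continuations with "##"; byte-level BPE marks new words with "Ġ".
    if any(token.startswith("##") for token in tokens):
        family = "WordPiece"
    elif any(token.startswith("Ġ") for token in tokens):
        family = "byte-level BPE"
    else:
        family = "undetermined"
    print(f"{family}: {list(zip(tokens, ids))}")

text = "Tokenizers shape transformer models."
show_token_ids(AutoTokenizer.from_pretrained("bert-base-uncased"), text)  # WordPiece
show_token_ids(AutoTokenizer.from_pretrained("gpt2"), text)               # byte-level BPE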

Word and sentence tokenizers

Choosing a tokenizer depends on the objectives of the NLP project. Although subword tokenizers are more efficient for transformer models, word and sentence tokenizers provide useful functionality. Sentence and word tokenizers...

Summary

In this chapter, we measured the impact of tokenization on the subsequent layers of a transformer model. A transformer model can only attend to the tokens that come out of a stack's embedding and positional encoding sub-layers. It does not matter whether the model is an encoder-decoder, encoder-only, or decoder-only model, and it does not matter whether the dataset seems good enough for training. If the tokenization process fails, even partly, our transformer model will miss critical tokens.

We first saw that raw datasets might be enough to train a transformer for standard language tasks. However, we discovered that even if a pretrained tokenizer has gone through a billion words, it only creates a dictionary with a small portion of the vocabulary it comes across. Like us, a tokenizer captures the essence of the language it is learning and only remembers the most important words if these words are also frequently used. This approach works well for a standard task and creates problems...

Questions

  1. A tokenized dictionary contains every word that exists in a language. (True/False)
  2. Pretrained tokenizers can encode any dataset. (True/False)
  3. It is good practice to check a database before using it. (True/False)
  4. It is good practice to eliminate obscene data from datasets. (True/False)
  5. It is good practice to delete data containing discriminating assertions. (True/False)
  6. Raw datasets might sometimes produce relationships between noisy content and useful content. (True/False)
  7. A standard pretrained tokenizer contains the English vocabulary of the past 700 years. (True/False)
  8. Old English can create problems when encoding data with a tokenizer trained in modern English. (True/False)
  9. Medical and other types of jargon can create problems when encoding data with a tokenizer trained in modern English. (True/False)
  10. Controlling the output of the encoded data produced by a pretrained tokenizer is good practice. (True/False)

References

You have been reading a chapter from Transformers for Natural Language Processing and Computer Vision - Third Edition, published by Packt in February 2024 (ISBN-13: 9781805128724).

Author

Denis Rothman

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first word2matrix patented embeddings and patented AI conversational agents. He began his career authoring one of the first AI cognitive Natural Language Processing (NLP) chatbots, applied as an automated language teacher for Moët et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an Advanced Planning and Scheduling (APS) solution used worldwide.