Chapter 4: Autoregressive and Other Language Models

We looked at the details of Autoencoder (AE) language models in Chapter 3, Autoencoding Language Models, and studied how an AE language model can be trained from scratch. In the current chapter, you will see the theoretical details of Autoregressive (AR) language models and learn how to pre-train them on your own corpus. You will learn how to pre-train any language model, such as Generative Pre-trained Transformer 2 (GPT-2), on your own text and use it in various tasks such as Natural Language Generation (NLG). You will understand the basics of the Text-to-Text Transfer Transformer (T5) model and train a Multilingual T5 (mT5) model on your own Machine Translation (MT) data. After finishing this chapter, you will have an overview of AR language models and their various use cases in text-to-text applications, such as summarization, paraphrasing, and MT.

The following topics will be covered in this chapter:

  • Working with AR language models...

Technical requirements

The following libraries/packages are required to successfully complete this chapter:

  • Anaconda
  • transformers 4.0.0
  • pytorch 1.0.2
  • tensorflow 2.4.0
  • datasets 1.4.1
  • tokenizers
  • simpletransformers 0.61

All notebooks with coding exercises will be available at the following GitHub link: https://github.com/PacktPublishing/Mastering-Transformers/tree/main/CH04.

Check out the following link to see the Code in Action: https://bit.ly/3yjn55X
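
As a minimal setup sketch, you could install the Python packages with pip; the version pins below mirror the list above where they are unambiguous, while torch, tokenizers, and simpletransformers are left unpinned here (Anaconda, if you use it, is installed separately):

    pip install transformers==4.0.0 tensorflow==2.4.0 datasets==1.4.1 tokenizers simpletransformers torch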

Working with AR language models

The Transformer architecture was originally intended to be effective for Seq2Seq tasks such as MT or summarization, but it has since been used in diverse NLP problems ranging from token classification to coreference resolution. Subsequent works began to use the left (encoder) and right (decoder) parts of the architecture separately and more creatively. The autoencoding objective, also known as the denoising objective, is to fully recover the original input from its corrupted version in a bidirectional fashion, as shown on the left side of Figure 4.1, which you will see shortly. As seen in the Bidirectional Encoder Representations from Transformers (BERT) architecture, which is a notable example of AE models, they can incorporate the context on both sides of a word. However, the first issue is that the corrupting [MASK] symbols used during the pre-training phase are absent from the data during the fine-tuning phase, leading to a pre-training/fine-tuning discrepancy. Secondly, the BERT...
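
To make the contrast concrete, here is a minimal sketch using the transformers pipeline API; the public bert-base-uncased and gpt2 checkpoints and the prompt sentences are purely illustrative choices:

    from transformers import pipeline

    # AE (denoising) objective: recover the masked token using context from both sides
    fill_masker = pipeline("fill-mask", model="bert-base-uncased")
    print(fill_masker("Transformers are [MASK] for many NLP tasks.")[0]["sequence"])

    # AR objective: predict the next token using only the left context
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Transformers are", max_length=15)[0]["generated_text"])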

Working with Seq2Seq models

The left encoder and the right decoder parts of the transformer are connected with cross-attention, which lets each decoder layer attend over the final encoder layer. This naturally pushes the model toward producing output that is closely tied to the original input. A Seq2Seq model, which is the original transformer, achieves this by using the following scheme:

Input tokens -> embeddings -> encoder -> decoder -> output tokens

Seq2Seq models keep both the encoder and the decoder parts of the transformer. T5, Bidirectional and Auto-Regressive Transformer (BART), and Pre-training with Extracted Gap-sentences for Abstractive Summarization Sequence-to-Sequence models (PEGASUS) are among the popular Seq2Seq models.
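
As a minimal sketch of this scheme (the facebook/bart-large-cnn checkpoint and the input sentence are just illustrative choices), a pre-trained BART model can be asked to summarize a piece of text; generate() runs the encoder once and then lets the decoder produce output tokens while attending to the encoder states through cross-attention:

    from transformers import BartTokenizer, BartForConditionalGeneration

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

    text = ("The Transformer encoder reads the whole input sequence, and the decoder "
            "generates the output one token at a time while attending to the final "
            "encoder layer through cross-attention.")

    # input tokens -> embeddings -> encoder -> decoder -> output tokens
    inputs = tokenizer(text, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=30)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))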

T5

Most NLP architectures, ranging from Word2Vec to transformers, learn embeddings and other parameters by predicting masked words using context (neighboring) words. We treat NLP problems as word prediction problems. Some studies cast almost all...
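
A minimal sketch of this unified text-to-text framing, using the public t5-small checkpoint (the prompts below use the standard task prefixes that checkpoint was pre-trained with, chosen here purely for illustration; the T5 tokenizer also needs the sentencepiece package installed):

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Different NLP problems, all cast as text-in/text-out prediction via task prefixes
    prompts = [
        "translate English to German: The book is on the table.",
        "cola sentence: The course is jumping well.",  # grammatical acceptability
    ]
    for prompt in prompts:
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        output_ids = model.generate(input_ids, max_length=40)
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))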

AR language model training

In this section, you will learn how to train your own AR language models. We will start with GPT-2 and take a deeper look at its different training functions, using the transformers library.

You can use any corpus to train your own GPT-2, but for this example, we used Emma by Jane Austen, a romantic novel. Training on a much bigger corpus is highly recommended if you want more general language generation.

Before we start, it's worth noting that we used TensorFlow's native training functionality to show that all Hugging Face models can be trained directly with TensorFlow or PyTorch if you wish. Follow these steps:

  1. You can download the raw text of the Emma novel by using the following command:
    wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/gutenberg/austen-emma.txt
  2. The first step is to train the BytePairEncoding tokenizer for GPT-2 on a corpus that you intend to train your...
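
A minimal sketch of that tokenizer-training step, assuming the corpus file downloaded above; the vocabulary size and output directory name are illustrative choices:

    import os
    from tokenizers import ByteLevelBPETokenizer

    # Train a byte-level BPE tokenizer (the kind GPT-2 uses) on the corpus
    bpe_tokenizer = ByteLevelBPETokenizer()
    bpe_tokenizer.train(
        files=["austen-emma.txt"],
        vocab_size=10000,                 # illustrative size for a small corpus
        special_tokens=["<|endoftext|>"]
    )

    # Writes vocab.json and merges.txt, which a GPT-2 tokenizer can load later
    os.makedirs("tokenizer_gpt", exist_ok=True)
    bpe_tokenizer.save_model("tokenizer_gpt")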

NLG using AR models

In the previous section, you learned how to train an AR model on your own corpus; as a result, you have trained your own version of GPT-2. But the question How can I use it? remains unanswered. To answer it, let's proceed as follows:

  1. Let's start generating sentences from the model you have just trained, as follows:
    def generate(start, model):
        # Encode the prompt, generate up to 500 tokens with beam search,
        # and decode the result back to text
        input_token_ids = tokenizer_gpt.encode(start, return_tensors='tf')
        output = model.generate(
            input_token_ids,
            max_length=500,
            num_beams=5,
            temperature=0.7,
            no_repeat_ngram_size=2,
            num_return_sequences=1
        )
        return tokenizer_gpt.decode(output[0])
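
As a quick usage sketch (the prompt string and the model variable name are just placeholders for your own trained model and tokenizer):

    print(generate("Emma was", model))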

Summarization and MT fine-tuning using simpletransformers

Up to now, you have learned both basic and advanced methods of training language models, but it is not always feasible to train your own language model from scratch because of impediments such as limited computational power. In this section, you will look at how to fine-tune language models on your own datasets for the specific tasks of MT and summarization. Follow these steps:

  1. To start, you need to install the simpletransformers library, as follows:
    pip install simpletransformers
  2. The next step is to download the dataset that contains your parallel corpus. This parallel corpus can be for any type of Seq2Seq task. For this example, we are going to use an MT example, but you can use any other dataset for other tasks such as paraphrasing, summarization, or even converting text to Structured Query Language (SQL); a fine-tuning sketch follows these steps.

    You can download the dataset from https://www.kaggle.com/seymasa/turkish-to-english-translation...
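
As a hedged sketch of what the fine-tuning step can look like with simpletransformers (the tiny in-memory DataFrame below stands in for the downloaded parallel corpus, and google/mt5-small plus the hyperparameters are illustrative choices, not the book's exact settings):

    import pandas as pd
    from simpletransformers.t5 import T5Model, T5Args

    # Stand-in for the Turkish-English parallel corpus, in the three columns
    # simpletransformers' T5Model expects: prefix, input_text, target_text
    train_df = pd.DataFrame({
        "prefix": ["translate english to turkish"] * 2,
        "input_text": ["I am reading a book.", "The weather is nice today."],
        "target_text": ["Bir kitap okuyorum.", "Bugün hava güzel."],
    })

    model_args = T5Args()
    model_args.num_train_epochs = 1          # illustrative; raise for real training
    model_args.overwrite_output_dir = True

    # mT5 is the multilingual T5 variant, so it covers both Turkish and English
    model = T5Model("mt5", "google/mt5-small", args=model_args, use_cuda=False)
    model.train_model(train_df)

    print(model.predict(["translate english to turkish: How are you?"]))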

Summary

In this chapter, we have learned various aspects of AR language models, from pre-training to fine-tuning. We looked at the best features of such models by training generative language models and fine-tuning them on tasks such as MT. We understood the basics of more complex models such as T5 and used such a model to perform MT. We also used the simpletransformers library. We trained GPT-2 on our own corpus and generated text with it. We learned how to save it and use it with AutoModel. We also had a deeper look at how BPE can be trained and used, using the tokenizers library.

In the next chapter, we will see how to fine-tune models for text classification.

References

Here are a few references that you can use to expand on what we learned in this chapter:

  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI blog, 1(8), 9.
  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O. and Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint arXiv:1910.13461.
  • Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A. and Raffel, C. (2020). mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M. and Liu, P. J. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv preprint arXiv:1910.10683.
  • Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,...