Chapter 7: Text Representation

So far, we have addressed classification and generation problems with the transformers library. Text representation is another crucial task in modern Natural Language Processing (NLP), especially for unsupervised tasks such as clustering, semantic search, and topic modeling. In this chapter, we will explain how to represent sentences using models such as Universal Sentence Encoder (USE) and Siamese BERT (Sentence-BERT), together with additional libraries such as sentence-transformers. We will also cover zero-shot learning with BART and show how to utilize it. Few-shot learning methodologies and unsupervised use cases such as semantic text clustering and topic modeling will also be described. Finally, one-shot learning use cases such as semantic search will be covered.

The following topics will be covered in this chapter:

  • Introduction to sentence embeddings
  • Benchmarking sentence similarity models
  • Using BART for zero-shot learning...

Technical requirements

We will be using a Jupyter notebook to run our coding exercises. For this, you will need Python 3.6+ and the following packages:

  • sklearn
  • transformers >= 4.0.0
  • datasets
  • sentence-transformers
  • tensorflow-hub
  • flair
  • umap-learn
  • bertopic
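
For convenience, these can all be installed with a single pip command (note that sklearn is provided by the scikit-learn package; pin versions if you need reproducibility):

!pip install scikit-learn transformers datasets sentence-transformers tensorflow-hub flair umap-learn bertopic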

All the notebooks for the coding exercises in this chapter will be available at the following GitHub link:

https://github.com/PacktPublishing/Mastering-Transformers/tree/main/CH07

Check out the following link to see the Code in Action video: https://bit.ly/2VcMCyI

Introduction to sentence embeddings

Pre-trained BERT models do not produce efficient and independent sentence embeddings, as they always need to be fine-tuned in an end-to-end supervised setting. This is because a pre-trained BERT model can be thought of as an indivisible whole, with semantics spread across all layers, not just the final one. Without fine-tuning, using its internal representations independently may be ineffective, and it is hard to handle unsupervised tasks such as clustering, topic modeling, information retrieval, or semantic search. For instance, clustering requires evaluating many sentence pairs, which causes massive computational overhead.
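
To see why, consider the naive approach of mean-pooling BERT's final-layer token vectors into a single sentence vector. The following is a minimal sketch (the model name and sentence are illustrative); without task-specific fine-tuning, such vectors typically perform poorly on similarity tasks:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers changed NLP.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool the last hidden states into one fixed-size sentence vector
sentence_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)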

Luckily, many modifications have been made to the original BERT model, such as Sentence-BERT (SBERT), to derive semantically meaningful and independent sentence embeddings. We will talk about these approaches in a moment. In the NLP literature, many neural sentence embedding methods have been proposed...
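
As a preview of what is to come, the sentence-transformers library exposes SBERT-style models behind a very simple API. Here is a minimal sketch (the sentences are illustrative; the model is the one we use later in this chapter):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-distilroberta-base-v1")
sentences = ["A man is playing a guitar.", "Someone plays an instrument."]
# Each sentence gets an independent, fixed-size embedding
embeddings = model.encode(sentences, convert_to_tensor=True)
# Semantically similar sentences should score close to 1.0
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))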

Semantic similarity experiment with FLAIR

In this experiment, we will qualitatively evaluate the sentence representation models using the flair library, which greatly simplifies obtaining document embeddings.

We will perform experiments while taking on the following approaches:

  • Document average pool embeddings
  • RNN-based embeddings
  • BERT embeddings
  • SBERT embeddings

We need to install these libraries before we can start the experiments:

!pip install sentence-transformers
!pip install datasets
!pip install flair

For qualitative evaluation, we define a list of similar sentence pairs and a list of dissimilar sentence pairs (five pairs each). We expect the embedding models to assign high similarity scores to the similar pairs and low scores to the dissimilar ones.

The sentence pairs are extracted from the STSB (Semantic Textual Similarity Benchmark) dataset, which we are already familiar with from the sentence-pair regression part of Chapter 6, Fine-Tuning Language Models...
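
To illustrate the shape of the evaluation (the sentence pair below is illustrative, not one of the benchmark pairs), here is a minimal sketch using flair's Transformer-based document embeddings and cosine similarity:

import torch
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings

embedder = TransformerDocumentEmbeddings("bert-base-uncased")
s1, s2 = Sentence("A man is eating food."), Sentence("A man is eating a meal.")
embedder.embed(s1)
embedder.embed(s2)
# Cosine similarity between the two document vectors
score = torch.nn.functional.cosine_similarity(
    s1.embedding.unsqueeze(0), s2.embedding.unsqueeze(0))
print(float(score))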

Text clustering with Sentence-BERT

For clustering algorithms, we need a model that is suitable for measuring textual similarity. Let's use the paraphrase-distilroberta-base-v1 model here for a change. We will start by loading the Amazon Polarity dataset for our clustering experiment. This dataset consists of Amazon web reviews spanning a period of 18 years, up to March 2013. The original dataset includes over 35 million reviews, which comprise product information, user information, user ratings, and review text. Let's get started:

  1. First, randomly select 10K reviews by shuffling, as follows:
    import pandas as pd, numpy as np
    import torch, os, scipy
    from datasets import load_dataset
    # Load the training split and shuffle with a fixed seed for reproducibility
    dataset = load_dataset("amazon_polarity", split="train")
    # Keep the review text ('content') of the first 10K shuffled examples
    corpus = dataset.shuffle(seed=42)[:10000]['content']
  2. The corpus is now ready for clustering. The following code instantiates a sentence-transformer object using the pre-trained paraphrase-distilroberta...
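
Since the excerpt is cut off here, the following is a minimal sketch of how that step can continue: encode the corpus with the pre-trained model and cluster the embeddings with k-means (the choice of K=5 is illustrative, not the book's):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("paraphrase-distilroberta-base-v1")
# Encode the 10K reviews selected in step 1 into dense vectors
corpus_embeddings = model.encode(corpus)
kmeans = KMeans(n_clusters=5, random_state=42).fit(corpus_embeddings)
cluster_labels = kmeans.labels_  # one cluster ID per review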

Semantic search with Sentence-BERT

We may already be familiar with keyword-based search (the Boolean model), where, for a given keyword or pattern, we retrieve the results that match it. Alternatively, we can use regular expressions to define advanced patterns, such as lexico-syntactic patterns. These traditional approaches cannot handle synonymy (for example, car means the same as automobile) or word sense ambiguity (for example, bank as the side of a river versus bank as a financial institution). The synonymy case causes low recall, since documents that should be retrieved are missed, while the ambiguity case causes low precision, since documents that should not be retrieved are returned. Vector-based, or semantic, search approaches can overcome these drawbacks by building dense numerical representations of both queries and documents.

Let's set up a case study for Frequently Asked Questions (FAQs) that lie idle on websites. We will exploit FAQ resources within a semantic...
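
As a minimal sketch of such a pipeline (the FAQ entries here are hypothetical placeholders), we can encode the FAQ questions once, encode each incoming query, and return the nearest neighbor by cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-distilroberta-base-v1")
# Hypothetical FAQ entries; a real case study would load them from the site
faq = ["How can I reset my password?",
       "What payment methods do you accept?",
       "How do I cancel my subscription?"]
faq_embeddings = model.encode(faq, convert_to_tensor=True)

query_embedding = model.encode("I forgot my login credentials",
                               convert_to_tensor=True)
hits = util.semantic_search(query_embedding, faq_embeddings, top_k=1)
print(faq[hits[0][0]["corpus_id"]])  # best-matching FAQ question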

Summary

In this chapter, we learned about text representation methods. We learned how to perform tasks such as zero-/few-/one-shot learning using diverse semantic models. We also learned about NLI and its importance in capturing the semantics of text. Moreover, we looked at useful use cases such as semantic search, semantic clustering, and topic modeling with Transformer-based semantic models, learned how to visualize clustering results, and understood the importance of centroids in such problems.

In the next chapter, you will learn about efficient Transformer models. You will learn about distillation, pruning, and quantizing Transformer-based models. You will also learn about different and efficient Transformer architectures that make improvements to computational and memory efficiency, as well as how to use them in NLP problems.

Further reading

Please refer to the following works/papers for more information about the topics that were covered in this chapter:

  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Pushp, P. K., & Srivastava, M. M. (2017). Train once, test anywhere: Zero-shot learning for text classification. arXiv preprint arXiv:1712.05972.
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Williams, A., Nangia, N., & Bowman, S. R. (2017). A broad-coverage challenge corpus for sentence understanding...