Python Natural Language Processing Cookbook

By Zhenya Antić

About this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization.

Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and part-of-speech tagging to help you prepare your data. You'll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you'll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data.

By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.

Publication date: March 2021
Publisher: Packt
Pages: 284
ISBN: 9781838987312

 

Chapter 3: Representing Text – Capturing Semantics

Representing the meaning of words, phrases, and sentences in a form that's understandable to computers is one of the pillars of NLP. Machine learning, for example, represents each data point as a fixed-size vector, which raises the question of how to turn words and sentences into vectors. Almost any NLP task starts with representing the text in some numeric form, and this chapter will show several ways of doing that. Once you've learned how to represent text as vectors, you will be able to perform tasks such as classification, which will be described in later chapters.

We will also learn how to turn phrases such as fried chicken into vectors, how to train a word2vec model, and how to create a small search engine with semantic search.

The following recipes will be covered in this chapter:

  • Putting documents into a bag of words
  • Constructing the N-gram model
  • Representing texts...
 

Technical requirements

The code for this chapter is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook/tree/master/Chapter03. In this chapter, we will need some additional packages. The installation commands, to be run in an Anaconda prompt, are as follows (the pickle module used later in this chapter is part of the Python standard library and does not need to be installed separately):

pip install scikit-learn
pip install gensim
pip install langdetect
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
pip install transformers
pip install -U sentence-transformers
pip install whoosh

In addition, we will use the models and datasets located at the following URLs:

 

Putting documents into a bag of words

A bag of words is the simplest way of representing text. We treat our text as a collection of documents, where documents are anything from sentences to book chapters to whole books. Since we usually compare documents to each other or use them in the larger context of other documents, we typically work with a collection of documents rather than a single document.

The bag of words method uses a training text that provides it with a list of words that it should consider. When encoding new sentences, it counts the number of times each word from that vocabulary occurs in the document, and the final vector contains those counts. This representation can then be fed into a machine learning algorithm.

The decision of what represents a document lies with the engineer, and in many cases will be obvious. For example, if you are working on classifying tweets as belonging to a particular topic, a single tweet will be your document...
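To make the recipe concrete, here is a minimal sketch of the approach using scikit-learn's CountVectorizer; the toy documents and variable names are illustrative and are not the book's own code:

from sklearn.feature_extraction.text import CountVectorizer

# A toy "collection of documents"; in the recipe, these would be the
# sentences of the training text.
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats make good pets.",
]

# Fit the vectorizer on the training documents to build the vocabulary.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

# Encode a new sentence as word counts over the learned vocabulary.
new_vector = vectorizer.transform(["The cat chased the dog."])
print(new_vector.toarray())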

 

Constructing the N-gram model

Representing a document as a bag of words is useful, but semantics is about more than just words in isolation. To capture word combinations, we can use an n-gram model, whose vocabulary consists not just of words, but of word sequences, or n-grams. In this recipe, we will build a bigram model, where bigrams are sequences of two words.

Getting ready

The CountVectorizer class is very versatile and allows us to construct n-gram models. We will use it again in this recipe. We will also explore how to build character n-gram models using this class.

How to do it…

Follow these steps:

  1. Import the CountVectorizer class and helper functions from Chapter 1, Learning NLP Basics, from the Putting documents into a bag of words recipe:
    from sklearn.feature_extraction.text import CountVectorizer
    from Chapter01.dividing_into_sentences import read_text_file, preprocess_text, divide_into_sentences_nltk
    from Chapter03.bag_of_words import get_sentences, get_new_sentence_vector...
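The remaining steps build the bigram model on top of these helpers. As a rough, self-contained sketch of what the CountVectorizer-based approach looks like, with toy sentences standing in for the book's helper functions:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog sleeps all day.",
]

# Word bigrams: each feature is a sequence of two consecutive words.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(documents)
print(bigram_vectorizer.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0

# Character n-grams (here, sequences of 3 to 4 characters) use the same
# class with a different analyzer.
char_vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 4))
char_vectorizer.fit(documents)
print(len(char_vectorizer.get_feature_names_out()))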
 

Representing texts with TF-IDF

We can go one step further and use the TF-IDF algorithm to weight words and n-grams in incoming documents. TF-IDF stands for term frequency-inverse document frequency, and it gives more weight to words that are characteristic of a particular document than to words that are frequent but appear throughout most documents. You can find out more at https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting.

In this recipe, we will use a different type of vectorizer that can apply the TF-IDF algorithm to the input text. Like the CountVectorizer class, it has an analyzer that we will use to show the representations of new sentences.

Getting ready

We will be using the TfidfVectorizer class from the sklearn package. We will also be using the stopwords list from Chapter 1, Learning NLP Basics.

How to do it…

The TfidfVectorizer class allows for...
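For reference, the following is a minimal sketch of how TfidfVectorizer can be used; the toy documents are illustrative, and scikit-learn's built-in English stopword list stands in for the book's own list from Chapter 1:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats make good pets.",
]

# Build a TF-IDF model over the collection; the built-in English stopword
# list is used here in place of the Chapter 1 list.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(documents)

# The analyzer shows how a new sentence is split into features, and
# transform() produces its TF-IDF-weighted vector.
analyze = vectorizer.build_analyzer()
print(analyze("The cat chased the dog."))
print(vectorizer.transform(["The cat chased the dog."]).toarray())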

 

Using word embeddings

In this recipe, we switch gears and learn how to represent words using word embeddings. Word embeddings are powerful because they are the result of training a neural network to predict a word from the words around it (or the surrounding words from the word itself). The resulting vector embeddings are similar for words that occur in similar contexts, and we will use the embeddings to show these similarities.

Getting ready

In this recipe, we will use a pretrained word2vec model, which can be found at http://vectors.nlpl.eu/repository/20/40.zip. Download the model and unzip it in the Chapter03 directory. You should now have a file whose path is …/Chapter03/40/model.bin.

We will also be using the gensim package to load and use the model. Install it using pip:

pip install gensim

How to do it…

We will load the model, demonstrate some features of the gensim package, and then compute a sentence vector using the word embeddings. Let's get started:

  1. Import the KeyedVectors object...
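In outline, loading and querying the pretrained model with gensim looks roughly like this; the exact vocabulary, and therefore the outputs, depend on the model you downloaded:

from gensim.models import KeyedVectors

# Path to the unzipped pretrained model described in Getting ready.
model_path = "Chapter03/40/model.bin"

# The model is stored in binary word2vec format.
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

# Words that occur in similar contexts get similar vectors; depending on the
# pretrained model, vocabulary entries may be lowercased or POS-tagged.
print(w2v_model.most_similar("king", topn=5))
print(w2v_model.similarity("king", "queen"))

# The vector for a single word:
vector = w2v_model["king"]
print(vector.shape)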
 

Training your own embeddings model

We can now train our own word2vec model on a corpus. For this task, we will use the top 20 Project Gutenberg books, which include The Adventures of Sherlock Holmes. Training a model on just one book would produce suboptimal results; the more text we use, the better the resulting embeddings.

Getting ready

You can download the dataset for this recipe from Kaggle: https://www.kaggle.com/currie32/project-gutenbergs-top-20-books. The dataset includes files in RTF format, so you will have to save them as text. We will use the same package, gensim, to train our custom model.

We will use the pickle module, which is part of the Python standard library, to save the model to disk.

How to do it…

We will read in all 20 books and use the text to create a word2vec model. Make sure all the books are located in one directory. Let's get started:

  1. Import the necessary packages...
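The steps boil down to reading the books, tokenizing them into sentences and words, and passing the result to gensim. A rough sketch follows, with an assumed books directory and illustrative hyperparameters (NLTK's punkt tokenizer must be downloaded first); it is not the book's exact code:

import pickle
from pathlib import Path

from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk.download("punkt")

# Assumed directory holding the 20 books saved as plain text files.
books_dir = Path("Chapter03/books")

sentences = []
for book_path in books_dir.glob("*.txt"):
    text = book_path.read_text(encoding="utf-8", errors="ignore")
    # word2vec expects a list of tokenized sentences.
    for sentence in sent_tokenize(text):
        sentences.append(word_tokenize(sentence.lower()))

# Train the model; these hyperparameters are illustrative
# (use size= instead of vector_size= with gensim 3.x).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Save the trained model to disk with pickle.
with open("Chapter03/gutenberg_word2vec.pkl", "wb") as f:
    pickle.dump(model, f)

print(model.wv.most_similar("holmes", topn=5))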
 

Representing phrases – phrase2vec

Encoding words is useful, but usually, we deal with more complex units, such as phrases and sentences. Phrases are important because they carry more detail than individual words. For example, the phrase delicious fried rice is very different from just the word rice.

In this recipe, we will train a word2vec model that uses phrases as well as words.

Getting ready

We will be using the Yelp restaurant review dataset in this recipe, which is available at https://www.yelp.com/dataset (the file is about 4 GB). Download the file and unzip it in the Chapter03 folder. I downloaded the dataset in September 2020, and the results in this recipe are from that download. Your results might differ, since Yelp updates the dataset periodically.

The dataset is multilingual, and we will be working with the English reviews. In order to filter them, we will need the langdetect package. Install it using pip:

pip install langdetect

How to do...
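In outline, the recipe filters the reviews by language, detects frequent word pairs with gensim's Phrases, and then trains word2vec on the phrase-merged text. The following is one possible shape of that pipeline, not the book's exact code; the file name follows Yelp's dataset documentation, and the limits and hyperparameters are illustrative:

import json

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from langdetect import detect
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

# Assumed path to the unzipped Yelp review file (one JSON object per line).
reviews_path = "Chapter03/yelp_academic_dataset_review.json"

sentences = []
with open(reviews_path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 10000:  # keep the sketch small
            break
        text = json.loads(line)["text"]
        try:
            if detect(text) != "en":  # keep only English reviews
                continue
        except Exception:
            continue
        sentences.append(word_tokenize(text.lower()))

# Learn which word pairs behave as phrases (for example, fried_rice),
# then retokenize the corpus with those phrases merged into single tokens.
phrases = Phrases(sentences, min_count=5, threshold=10)
phraser = Phraser(phrases)
phrased_sentences = [phraser[s] for s in sentences]

# Train word2vec on the phrase-merged corpus.
model = Word2Vec(phrased_sentences, vector_size=100, window=5, min_count=5)
# fried_rice will only be in the vocabulary if it occurs often enough.
print(model.wv.most_similar("fried_rice", topn=5))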

 

Using BERT instead of word embeddings

A recent development in the embeddings world is BERT (Bidirectional Encoder Representations from Transformers). Like word embeddings, BERT produces vector representations, but it takes context into account and can represent a whole sentence. We can use the Hugging Face sentence_transformers package to represent sentences as vectors.

Getting ready

For this recipe, we need to install PyTorch with Torchvision, and then the transformers and sentence transformers from Hugging Face. Follow these installation steps in an Anaconda prompt. For Windows, use the following code:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
pip install transformers
pip install -U sentence-transformers

For macOS, use the following code:

conda install pytorch torchvision torchaudio -c pytorch
pip install transformers
pip install -U sentence-transformers

How to do it…

The Hugging Face code makes using BERT very easy. The...
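As an illustration of how little code this takes, here is a minimal sketch using the sentence_transformers package; the model name and example sentences are placeholders rather than the book's choices:

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Any pretrained sentence-BERT model can be plugged in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The restaurant serves delicious fried rice.",
    "Their fried rice dish is excellent.",
    "The weather was cold and rainy.",
]

# encode() returns one fixed-size vector per sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)

# Sentences with similar meanings end up close together in vector space.
print(1 - cosine(embeddings[0], embeddings[1]))  # higher similarity
print(1 - cosine(embeddings[0], embeddings[2]))  # lower similarity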

 

Getting started with semantic search

In this recipe, we will get a glimpse of how to expand search with the help of a word2vec model. When we search for a term, we expect the search engine to also return results that contain synonyms of that term, even when the document does not contain the exact term we used. Real search engines are far more complicated than what we'll show in this recipe, but this should give you a taste of what it's like to build a customizable search engine.

Getting ready

We will be using an IMDb dataset from Kaggle, which can be downloaded from https://www.kaggle.com/PromptCloudHQ/imdb-data. Download the dataset and unzip the CSV file.

We will also use a small-scale Python search engine called Whoosh. Install it using pip:

pip install whoosh

We will also be using the pretrained word2vec model from the Using word embeddings recipe.

How to do it…

We will create a class for the Whoosh search engine that will create a document index based...
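To give a flavour of the Whoosh side of the recipe, here is a minimal, self-contained sketch that indexes a couple of toy descriptions and runs a keyword query; in the recipe itself, the documents come from the IMDb CSV file, and the word2vec model is used to expand the query with related terms:

import os

from whoosh import index
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import QueryParser

# Define the index schema and create an on-disk index in a new directory.
schema = Schema(title=ID(stored=True), description=TEXT(stored=True))
os.makedirs("imdb_index", exist_ok=True)
ix = index.create_in("imdb_index", schema)

# Add a couple of toy documents; the recipe would add one per movie.
writer = ix.writer()
writer.add_document(title="1", description="A thrilling heist movie set in Paris.")
writer.add_document(title="2", description="A quiet drama about a family farm.")
writer.commit()

def search(query_string):
    """Run a keyword search against the description field."""
    with ix.searcher() as searcher:
        query = QueryParser("description", ix.schema).parse(query_string)
        return [hit["title"] for hit in searcher.search(query)]

# A word2vec model could expand the query with neighbours of a term, e.g.
# " OR ".join(word for word, _ in w2v_model.most_similar("exciting", topn=3)),
# so that a search for "exciting" also matches "thrilling".
print(search("thrilling OR exciting"))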

About the Author

  • Zhenya Antić

    Zhenya Antić is a Natural Language Processing (NLP) professional working at Practical Linguistics Inc. She helps businesses improve processes and increase productivity by automating text processing. Zhenya holds a PhD in linguistics from the University of California, Berkeley, and a BS in computer science from the Massachusetts Institute of Technology.
