Representing the meaning of words, phrases, and sentences in a form that computers can process is one of the pillars of NLP. Machine learning algorithms, for example, represent each data point as a fixed-size vector, which raises the question of how to turn words and sentences into vectors. Almost any NLP task starts with representing the text in some numeric form, and this chapter will show several ways of doing that. Once you've learned how to represent text as vectors, you will be able to perform tasks such as classification, which will be described in later chapters.
We will also learn how to turn phrases such as fried chicken into vectors, how to train a word2vec model, and how to create a small search engine with semantic search.
The following recipes will be covered in this chapter:
- Putting documents into a bag of words
- Constructing the N-gram model
- Representing texts...
The code for this chapter is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook/tree/master/Chapter03. In this chapter, we will need additional packages. The installation instructions for Anaconda are as follows:
pip install scikit-learn
pip install gensim
pip install langdetect
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
pip install transformers
pip install -U sentence-transformers
pip install whoosh

We will also use the pickle module, which is part of the Python standard library and needs no installation.
In addition, we will use the models and datasets located at the following URLs:
Putting documents into a bag of words
A bag of words is the simplest way of representing text. We treat our text as a collection of documents, where documents are anything from sentences to book chapters to whole books. Since we usually compare different documents to each other or use them in a larger context of other documents, we typically work with a collection of documents, not just a single document.
The bag of words method uses a training text that provides it with a list of words that it should consider. When encoding new sentences, it counts the number of times each word occurs in the document, and the final vector includes those counts for each word in the vocabulary. This representation can then be fed into a machine learning algorithm.
The decision of what represents a document lies with the engineer, and in many cases will be obvious. For example, if you are working on classifying tweets as belonging to a particular topic, a single tweet will be your document...
Constructing the N-gram model
Representing a document as a bag of words is useful, but semantics is about more than just words in isolation. To capture word combinations, an n-gram model is useful. Its vocabulary consists not just of words, but word sequences, or n-grams. We will build a bigram model in this recipe, where bigrams are sequences of two words.
The CountVectorizer class is very versatile and allows us to construct n-gram models, so we will use it again in this recipe. We will also explore how to build character n-gram models using this class.
How to do it…
- Import the CountVectorizer class and the helper functions from Chapter 1, Learning NLP Basics, and from the Putting documents into a bag of words recipe:

from sklearn.feature_extraction.text import CountVectorizer
from Chapter01.dividing_into_sentences import read_text_file, preprocess_text, divide_into_sentences_nltk
from Chapter03.bag_of_words import get_sentences, get_new_sentence_vector...
Representing texts with TF-IDF
We can go one step further and use the TF-IDF algorithm to weight words and n-grams in incoming documents. TF-IDF stands for term frequency-inverse document frequency, and it gives more weight to words that are unique to a document than to words that are frequent but appear throughout most documents. This allows us to emphasize the words that are uniquely characteristic of particular documents. You can find out more at https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting.
In this recipe, we will use a different type of vectorizer that can apply the TF-IDF algorithm to the input text. Like the CountVectorizer class, it has an analyzer that we will use to show the representations of new sentences.

We will be using the TfidfVectorizer class from the sklearn package. We will also be using the stopwords list from Chapter 1, Learning NLP Basics.
How to do it…
TfidfVectorizer class allows for...
Using word embeddings
In this recipe, we switch gears and learn how to represent words using word embeddings. Word embeddings are powerful because they are the result of training a neural network that predicts a word from its surrounding context; as a result, words that occur in similar contexts get similar vectors. We will use the embeddings to show these similarities.
In this recipe, we will use a pretrained word2vec model, which can be found at http://vectors.nlpl.eu/repository/20/40.zip. Download the model and unzip it in the Chapter03 directory. You should now have a file whose path is
We will also be using the gensim package to load and use the model. Install it using pip:

pip install gensim
How to do it…
- Import the
Training your own embeddings model
We can now train our own word2vec model on a corpus. For this task, we will use the top 20 Project Gutenberg books, which include The Adventures of Sherlock Holmes. We use several books because training a model on just one book would give suboptimal results; the more text we have, the better the resulting model.
You can download the dataset for this recipe from Kaggle: https://www.kaggle.com/currie32/project-gutenbergs-top-20-books. The dataset includes files in RTF format, so you will have to save them as text. We will use the same package, gensim, to train our custom model.
We will use the pickle package to save the model on disk. pickle is part of the Python standard library, so it needs no installation.
How to do it…
- Import the necessary packages...
Representing phrases – phrase2vec
Encoding words is useful, but usually, we deal with more complex units, such as phrases and sentences. Phrases are important because they specify more detail than single words do. For example, the phrase delicious fried rice is very different from just the word rice.
In this recipe, we will train a word2vec model that uses phrases as well as words.
We will be using the Yelp restaurant review dataset in this recipe, which is available at https://www.yelp.com/dataset (the file is about 4 GB). Download the file and unzip it in the Chapter03 folder. I downloaded the dataset in September 2020, and the results in the recipe are from that download. Your results might differ, since Yelp updates the dataset periodically.
The dataset is multilingual, and we will be working with the English reviews. In order to filter them, we will need the langdetect package. Install it using pip:

pip install langdetect
How to do...
Using BERT instead of word embeddings
A recent development in the embeddings world is BERT (Bidirectional Encoder Representations from Transformers), which, like word embeddings, gives a vector representation, but takes context into account and can represent a whole sentence. We can use the Hugging Face sentence_transformers package to represent sentences as vectors.
For this recipe, we need to install PyTorch with Torchvision, and then the transformers and sentence transformers from Hugging Face. Follow these installation steps in an Anaconda prompt. For Windows, use the following code:
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
pip install transformers
pip install -U sentence-transformers
For macOS, use the following code:
conda install pytorch torchvision torchaudio -c pytorch
pip install transformers
pip install -U sentence-transformers
How to do it…
The Hugging Face code makes using BERT very easy. The...
Getting started with semantic search
In this recipe, we will get a glimpse of how to get started on expanding search with the help of a word2vec model. When we search for a term, we expect the search engine to show us results that contain a synonym even when we didn't use the exact term contained in the document. Search engines are far more complicated than what we'll show in this recipe, but this should give you a taste of what it's like to build a customizable search engine.
We will be using an IMDb dataset from Kaggle, which can be downloaded from https://www.kaggle.com/PromptCloudHQ/imdb-data. Download the dataset and unzip the CSV file.
We will also use a small-scale Python search engine called Whoosh. Install it using pip:
pip install whoosh
We will also be using the pretrained word2vec model from the Using word embeddings recipe.
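The core idea, expanding a query with related terms before matching, can be sketched without the full setup. This simplified stand-in uses a hand-made synonym map in place of word2vec nearest neighbors and TF-IDF cosine similarity in place of Whoosh (all names and documents below are illustrative assumptions, not the recipe's code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection and a hand-made synonym map standing in for
# word2vec nearest neighbors
documents = [
    "a thrilling science fiction film about space travel",
    "a romantic comedy movie set in Paris",
    "a documentary about deep sea creatures",
]
synonyms = {"movie": ["film"], "film": ["movie"]}

def expand_query(query):
    # Add synonyms of each query word so documents using either term match
    words = query.split()
    expanded = list(words)
    for word in words:
        expanded.extend(synonyms.get(word, []))
    return " ".join(expanded)

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

query = expand_query("science fiction movie")  # -> "science fiction movie film"
query_vector = vectorizer.transform([query])
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()  # the sci-fi document matches, despite saying "film"
```

In the recipe itself, the synonym map is replaced by the pretrained word2vec model's most similar words, and the matching is done by Whoosh's inverted index rather than dense similarity.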
How to do it…
We will create a class for the Whoosh search engine that will create a document index based...