Essentials of NLP
Language has been a part of human evolution. The development of language allowed better communication between people and tribes. The evolution of written language, initially as cave paintings and later as characters, allowed information to be distilled, stored, and passed on from generation to generation. Some would even say that the hockey stick curve of advancement is because of the ever-accumulating cache of stored information. As this stored information trove becomes larger and larger, the need for computational methods to process and distill the data becomes more acute. In the past decade, a lot of advances were made in the areas of image and speech recognition. Advances in Natural Language Processing (NLP) are more recent, though computational methods for NLP have been an area of research for decades. Processing textual data requires many different building blocks upon which advanced models can be built. Some of these building blocks themselves can be quite challenging and advanced. This chapter and the next focus on these building blocks and the problems that can be solved with them through simple models.
In this chapter, we will focus on the basics of pre-processing text and build a simple spam detector. Specifically, we will learn about the following:
- The typical text processing workflow
- Data collection and labeling
- Text normalization, including case normalization, text tokenization, stemming, and lemmatization
- Modeling datasets that have been text normalized
- Vectorizing text
- Modeling datasets with vectorized text
Let's start by getting to grips with the text processing workflow most NLP models use.
A typical text processing workflow
Figure 1.1: Typical stages of a text processing workflow
The first two steps of the process in the preceding diagram involve collecting labeled data. A supervised model or even a semi-supervised model needs data to operate. The next step is usually normalizing and featurizing the data. Models have a hard time processing text data as is. There is a lot of hidden structure in a given text that needs to be processed and exposed. These two steps focus on that. The last step is building a model with the processed inputs. While NLP has some unique models, this chapter will use only a simple deep neural network and focus more on the normalization and vectorization/featurization. Often, the last three stages operate in a cycle, even though the diagram may give the impression of linearity. In industry, additional features require more effort to develop and more resources to keep running. Hence, it is important that features add value. Taking this approach, we will use a simple model to validate different normalization/vectorization/featurization steps. Now, let's look at each of these stages in detail.
Data collection and labeling
The first step of any Machine Learning (ML) project is to obtain a dataset. Fortunately, in the text domain, there is plenty of data to be found. A common approach is to use libraries such as scrapy or Beautiful Soup to scrape data from the web. However, such data is usually unlabeled, and as such can't be used in supervised models directly. This data is quite useful though. Through the use of transfer learning, a language model can be trained using unsupervised or semi-supervised methods and can be further used with a small training dataset specific to the task at hand. We will cover transfer learning in more depth in Chapter 3, Named Entity Recognition (NER) with BiLSTMs, CRFs, and Viterbi Decoding, when we look at transfer learning using BERT embeddings.
In the labeling step, textual data sourced in the data collection step is labeled with the right classes. Let's take some examples. If the task is to build a spam classifier for emails, then the previous step would involve collecting lots of emails. This labeling step would be to attach a spam or not spam label to each email. Another example could be sentiment detection on tweets. The data collection step would involve gathering a number of tweets. This step would label each tweet with a label that acts as a ground truth. A more involved example would involve collecting news articles, where the labels would be summaries of the articles. Yet another example of such a case would be an email auto-reply functionality. Like the spam case, a number of emails with their replies would need to be collected. The labels in this case would be short pieces of text that would approximate replies. If you are working on a specific domain without much public data, you may have to do these steps yourself.
Given that text data is generally available (outside of specific domains like health), labeling is usually the biggest challenge. It can be quite time consuming or resource intensive to label data. There has been a lot of recent focus on using semi-supervised approaches to labeling data. We will cover some methods for labeling data at scale using semi-supervised methods and the snorkel library in Chapter 7, Multi-modal Networks and Image Captioning with ResNets and Transformer, when we look at weakly supervised learning for classification using Snorkel.
There are a number of commonly used datasets available on the web for use in training models. Using transfer learning, these generic datasets can be used to prime ML models, and then a small amount of domain-specific data can be used to fine-tune the model. Using these publicly available datasets gives us a few advantages. First, all the data collection has already been performed. Second, labeling has already been done. Lastly, using such a dataset allows the comparison of results with the state of the art; most papers use specific datasets in their area of research and publish benchmarks. For example, the Stanford Question Answering Dataset (SQuAD for short) is often used as a benchmark for question-answering models. It is a good source to train on as well.
Collecting labeled data
In this book, we will rely on publicly available datasets. The appropriate datasets will be called out in their respective chapters along with instructions on downloading them. To build a spam detection system on an email dataset, we will be using the SMS Spam Collection dataset made available by University of California, Irvine. This dataset can be downloaded using instructions available in the tip box below. Each SMS is tagged as "SPAM" or "HAM," with the latter indicating it is not a spam message.
University of California, Irvine, is a great source of machine learning datasets. You can see all the datasets they provide by visiting http://archive.ics.uci.edu/ml/datasets.php. Specifically for NLP, you can see some publicly available datasets on https://github.com/niderhoff/nlp-datasets.
Before we start working with the data, the development environment needs to be set up. Let's take a quick moment to set up the development environment.
Development environment setup
In this chapter, we will be using Google Colaboratory, or Colab for short, to write code. You can use your Google account, or register a new account. Google Colab is free to use, requires no configuration, and also provides access to GPUs. The user interface is very similar to a Jupyter notebook, so it should seem familiar. To get started, please navigate to colab.research.google.com using a supported web browser. A web page similar to the screenshot below should appear:
Figure 1.2: Google Colab website
The next step is to create a new notebook. There are a couple of options. The first option is to create a new notebook in Colab and type in the code as you go along with the chapter. The second option is to upload a notebook from the local drive into Colab. It is also possible to pull in notebooks from GitHub into Colab, the process for which is detailed on the Colab website. For the purposes of this chapter, a complete notebook named SMS_Spam_Detection.ipynb is available in the GitHub repository of the book, in the chapter1-nlp-essentials folder. Please upload this notebook into Google Colab by clicking File | Upload Notebook. Specific sections of this notebook will be referred to at the appropriate points in the chapter in tip boxes. The instructions for creating the notebook from scratch are in the main description.
Click on the File menu option at the top left and click on New Notebook. A new notebook will open in a new browser tab. Click on the notebook name at the top left, just above the File menu option, and edit it to read SMS_Spam_Detection. Now the development environment is set up. It is time to begin loading in the data.
First, let us edit the first line of the notebook and import TensorFlow 2. Enter the following code in the first cell and execute it:
%tensorflow_version 2.x
import tensorflow as tf
import os
import io

tf.__version__
The output of running this cell should look like this:
TensorFlow 2.x selected.
'2.4.0'
This confirms that version 2.4.0 of the TensorFlow library was loaded. The %tensorflow_version line in the preceding code block is a magic command for Google Colab, instructing it to use TensorFlow version 2+. The next step is to download the data file and unzip it to a location in the Colab notebook on the cloud.
The code for loading the data is in the Download Data section of the notebook. Also note that as of writing, the release version of TensorFlow was 2.4.
This can be done with the following code:
# Download the zip file
path_to_zip = tf.keras.utils.get_file("smsspamcollection.zip",
    origin="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
    extract=True)

# Unzip the file into a folder
!unzip $path_to_zip -d data
The following output confirms that the data was downloaded and extracted:
Archive:  /root/.keras/datasets/smsspamcollection.zip
  inflating: data/SMSSpamCollection
  inflating: data/readme
Reading the data file is trivial:
# Let's see if we read the data correctly
lines = io.open('data/SMSSpamCollection').read().strip().split('\n')
lines[0]
The last line of code shows a sample line of data:
'ham\tGo until jurong point, crazy.. Available only in bugis n great world'
This example is labeled as not spam. The next step is to split each line into two columns, one with the text of the message and the other with the label. While separating these columns, we will also convert the labels to numeric values. Since we are interested in predicting spam messages, we will assign a value of 1 to spam messages and a value of 0 to legitimate messages.
The code for this part is in the Pre-Process Data section of the notebook.
Please note that the following code is verbose for clarity:
spam_dataset = []
for line in lines:
    label, text = line.split('\t')
    if label.strip() == 'spam':
        spam_dataset.append((1, text.strip()))
    else:
        spam_dataset.append((0, text.strip()))

print(spam_dataset)
(0, 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')
Now the dataset is ready for further processing in the pipeline. However, let's take a short detour to see how to configure GPU access in Google Colab.
Enabling GPUs on Google Colab
One of the advantages of using Google Colab is access to free GPUs for small tasks. GPUs make a big difference in the training time of NLP models, especially ones that use Recurrent Neural Networks (RNNs). The first step in enabling GPU access is to start a runtime, which can be done by executing a command in the notebook. Then, click on the Runtime menu option and select the Change runtime type option, as shown in the following screenshot:
Figure 1.3: Colab runtime settings menu option
Figure 1.4: Enabling GPUs on Colab
For now, let's turn our attention back to the data that has been loaded and is ready to be processed further for use in models.
Text normalization
Text normalization is a pre-processing step aimed at improving the quality of the text and making it suitable for machines to process. Four main steps in text normalization are case normalization, tokenization and stop word removal, Parts-of-Speech (POS) tagging, and stemming.
Case normalization applies to languages that use uppercase and lowercase letters. All languages based on the Latin alphabet or the Cyrillic alphabet (Russian, Mongolian, and so on) use upper- and lowercase letters. Other languages that sometimes use this are Greek, Armenian, Cherokee, and Coptic. In case normalization, all letters are converted to the same case. It is quite helpful in semantic use cases. However, in other cases, this may hinder performance. In the spam example, spam messages may have more words in all-caps compared to regular messages.
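Case normalization itself is a one-liner in Python via str.lower(). A minimal sketch, using an invented sample message:

```python
# A minimal sketch of case normalization; the sample message is made up
message = "WINNER!! You have WON a free PRIZE"
normalized = message.lower()
print(normalized)  # winner!! you have won a free prize
```

Note that lowercasing here erases the all-caps signal that spam messages often carry, which is exactly the trade-off described above.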
Another common normalization step removes punctuation in the text. Again, this may or may not be useful given the problem at hand. In most cases, this should give good results. However, in some cases, such as spam or grammar models, it may hinder performance. It is more likely for spam messages to use more exclamation marks or other punctuation for emphasis.
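As a quick sketch (not part of the notebook), punctuation can be stripped with Python's standard library; the message below is invented:

```python
import string

# Strip punctuation using str.translate; the sample message is made up
message = "Free entry!! Text WIN to 80086 now!!!"
stripped = message.translate(str.maketrans('', '', string.punctuation))
print(stripped)  # Free entry Text WIN to 80086 now
```

For a spam model, the emphatic exclamation marks removed here would themselves have been a useful signal, as noted above.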
The code for this part is in the Data Normalization section of the notebook.
Before normalizing the text, we will first compute three simple features for each message to gauge the usefulness of capitalization and punctuation:
- Number of characters in the message
- Number of capital letters in the message
- Number of punctuation symbols in the message
To do so, first, we will convert the data into a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(spam_dataset, columns=['Spam', 'Message'])
Next, let's build some simple functions that can count the length of the message, and the numbers of capital letters and punctuation symbols. Python's regular expression package,
re, will be used to implement these:
import re

def message_length(x):
    # returns total number of characters
    return len(x)

def num_capitals(x):
    _, count = re.subn(r'[A-Z]', '', x)  # only works in English
    return count

def num_punctuation(x):
    _, count = re.subn(r'\W', '', x)
    return count
In the num_capitals() function, substitutions are performed for the capital letters in English. The count of these substitutions provides the count of capital letters. The same technique is used to count the number of punctuation symbols. Please note that the method used to count capital letters is specific to English. Also note that the \W pattern matches every non-word character, so spaces are included in this punctuation count.
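To make the counting trick concrete, here is a small standalone check of re.subn(); the sample string is invented:

```python
import re

# re.subn() returns (new_string, number_of_substitutions), so the
# substitution count doubles as a count of matches for the pattern
_, caps = re.subn(r'[A-Z]', '', "WINNER!! Claim your PRIZE now")
print(caps)  # 12 capital letters
```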
Additional feature columns will be added to the DataFrame, and then the set will be split into test and train sets:
df['Capitals'] = df['Message'].apply(num_capitals) df['Punctuation'] = df['Message'].apply(num_punctuation) df['Length'] = df['Message'].apply(message_length) df.describe()
Figure 1.5: Base dataset for initial spam model
The following code can be used to split the dataset into training and test sets, with 80% of the records in the training set and the rest in the test set. Furthermore, the labels will be separated from the features in both the training and test sets:
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

x_train = train[['Length', 'Capitals', 'Punctuation']]
y_train = train[['Spam']]
x_test = test[['Length', 'Capitals', 'Punctuation']]
y_test = test[['Spam']]
Now we are ready to build a simple classifier to use this data.
Modeling normalized data
Recall that modeling was the last part of the text processing pipeline described earlier. In this chapter, we will use a very simple model, as the objective is to show different basic NLP data processing techniques more than modeling. Here, we want to see if three simple features can aid in the classification of spam. As more features are added, passing them through the same model will help in seeing if the featurization aids or hampers the accuracy of the classification.
The Model Building section of the workbook has the code shown in this section.
A function is defined that allows the construction of models with different numbers of inputs and hidden units:
# Basic 1-layer neural network model for evaluation
def make_model(input_dims=3, num_units=12):
    model = tf.keras.Sequential()

    # Adds a densely-connected layer with 12 units to the model:
    model.add(tf.keras.layers.Dense(num_units,
                                    input_dim=input_dims,
                                    activation='relu'))

    # Add a sigmoid layer with a binary output unit:
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
This model uses binary cross-entropy for computing loss and the Adam optimizer for training. The key metric, given that this is a binary classification problem, is accuracy. The default parameters passed to the function are sufficient as only three features are being passed in.
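For intuition, binary cross-entropy for a single prediction can be worked out by hand; the label and probability below are arbitrary:

```python
import math

# Binary cross-entropy for one example:
#   loss = -(y * log(p) + (1 - y) * log(1 - p))
# where y is the true label (0 or 1) and p is the predicted probability
y, p = 1, 0.9  # a fairly confident, correct prediction
loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
print(round(loss, 4))  # 0.1054
```

A confident wrong prediction (say p = 0.9 when the true label is 0) would cost about 2.3 instead, which is what pushes the model toward well-calibrated outputs.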
We can train our simple baseline model with only three features like so:
model = make_model()
model.fit(x_train, y_train, epochs=10, batch_size=10)
Train on 4459 samples
Epoch 1/10
4459/4459 [==============================] - 1s 281us/sample - loss: 0.6062 - accuracy: 0.8141
Epoch 2/10
…
Epoch 10/10
4459/4459 [==============================] - 1s 145us/sample - loss: 0.1976 - accuracy: 0.9305
This is not bad, as our three simple features get us to 93% accuracy. A quick check shows that there are 592 spam messages in the training set, out of a total of 4,459 messages. So, this model is doing better than a naive model that classifies everything as not spam; that model would have an accuracy of about 87%. This number may be surprising but is fairly common in classification problems where there is a severe class imbalance in the data. Evaluating the model on the test set gives an accuracy of around 93.4%:
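The majority-class baseline mentioned above follows directly from the class counts:

```python
# Accuracy of a model that always predicts "not spam",
# using the training set counts quoted above
total_messages = 4459
spam_messages = 592
baseline_accuracy = (total_messages - spam_messages) / total_messages
print(round(baseline_accuracy, 3))  # 0.867
```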
1115/1115 [==============================] - 0s 94us/sample - loss: 0.1949 - accuracy: 0.9336
[0.19485870356516988, 0.9336323]
Please note that the actual performance you see may be slightly different due to the data splits and computational vagaries. A quick verification can be performed by plotting the confusion matrix to see the performance:
y_train_pred = model.predict_classes(x_train)

# confusion matrix
tf.math.confusion_matrix(tf.constant(y_train.Spam), y_train_pred)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[3771,   96],
       [ 186,  406]], dtype=int32)>
                  Predicted Not Spam   Predicted Spam
Actual Not Spam                 3771               96
Actual Spam                      186              406
This shows that 3,771 out of 3,867 regular messages were classified correctly, while 406 out of 592 spam messages were classified correctly. Again, you may get a slightly different result.
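Accuracy hides how the errors are split between classes. Although the notebook does not compute them, precision and recall for the spam class follow directly from the confusion matrix numbers above:

```python
# Per-class metrics from the confusion matrix values quoted above
tp = 406  # spam correctly flagged
fn = 186  # spam missed
fp = 96   # regular messages wrongly flagged as spam
recall = tp / (tp + fn)     # fraction of actual spam that was caught
precision = tp / (tp + fp)  # fraction of spam predictions that were correct
print(round(recall, 3), round(precision, 3))  # 0.686 0.809
```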
To test the value of the features, try re-running the model by removing one of the features, such as punctuation or a number of capital letters, to get a sense of their contribution to the model. This is left as an exercise for the reader.
Tokenization
This step takes a piece of text and converts it into a list of tokens. If the input is a sentence, then separating the words would be an example of tokenization. Depending on the model, different granularities can be chosen. At the lowest level, each character could become a token. In some cases, entire sentences or paragraphs can be considered as a token:
Figure 1.6: Tokenizing a sentence
The preceding diagram shows two ways a sentence can be tokenized. One way to tokenize is to chop a sentence into words. Another way is to chop into individual characters. However, this can be a complex proposition in some languages such as Japanese and Mandarin.
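The two granularities described above can be sketched with plain Python; the sentence is a made-up example:

```python
# Word-level versus character-level tokenization of the same sentence
sentence = "I saw a girl"
word_tokens = sentence.split()  # split on whitespace
char_tokens = list(sentence)    # every character becomes a token
print(word_tokens)      # ['I', 'saw', 'a', 'girl']
print(char_tokens[:5])  # ['I', ' ', 's', 'a', 'w']
```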
Segmentation in Japanese
Many languages use a word separator, a space, to separate words. This makes the task of tokenizing on words trivial. However, there are other languages that do not use any markers or separators between words. Some examples of such languages are Japanese and Chinese. In such languages, the task is referred to as segmentation.
Specifically, in Japanese, there are mainly three different types of characters that are used: Hiragana, Kanji, and Katakana. Kanji is adapted from Chinese characters, and similar to Chinese, there are thousands of characters. Hiragana is used for grammatical elements and native Japanese words. Katakana is mostly used for foreign words and names. Depending on the preceding characters, a character may be part of an existing word or the start of a new word. This makes Japanese one of the most complicated writing systems in the world. Compound words are especially hard. Consider the following compound word that reads Election Administration Committee:
This can be tokenized in two different ways, outside of the entire phrase being considered one word. Here are two examples of tokenizing (from the Sudachi library):
(Election / Administration / Committee)
(Election / Administration / Committee / Meeting)
Common libraries that are used specifically for Japanese segmentation or tokenization are MeCab, Juman, Sudachi, and Kuromoji. MeCab is used in Hugging Face, spaCy, and other libraries.
The code shown in this section is in the Tokenization and Stop Word Removal section of the notebook.
Fortunately, most languages are not as complex as Japanese and use spaces to separate words. In Python, splitting by spaces is trivial. Let's take an example:
sentence = 'Go until jurong point, crazy.. Available only in bugis n great world'
sentence.split()
The output of the preceding split operation results in the following:
['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world']
Tokens such as 'point,' and 'crazy..' in the preceding output show that the naïve approach in Python results in punctuation being attached to words, among other issues. Consequently, this step is usually done through a library like StanfordNLP. Using pip, let's install this package in our Colab notebook:
!pip install stanfordnlp
The StanfordNLP package uses PyTorch under the hood as well as a number of other packages. These and other dependencies will be installed. By default, the package does not install language files. These have to be downloaded. This is shown in the following code:
import stanfordnlp as snlp

en = snlp.download('en')
The English file is approximately 235 MB. A prompt will be displayed to confirm the download and the location to store it in:
Figure 1.7: Prompt for downloading English models
Google Colab recycles the runtimes upon inactivity. This means that if you perform commands in the book at different times, you may have to re-execute every command again from the start, including downloading and processing the dataset, downloading the StanfordNLP English files, and so on. A local notebook server would usually maintain the state of the runtime but may have limited processing power. For simpler examples as in this chapter, Google Colab is a decent solution. For the more advanced examples later in the book, where training may run for hours or days, a local runtime or one running on a cloud Virtual Machine (VM) would be preferred.
This package provides capabilities for tokenization, POS tagging, and lemmatization out of the box. To start with tokenization, we instantiate a pipeline and tokenize a sample text to see how this works:
en = snlp.Pipeline(lang='en', processors='tokenize')

The lang parameter is used to indicate that an English pipeline is desired. The second parameter, processors, indicates the type of processing that is desired in the pipeline. Apart from tokenization, this library can also perform the following processing steps in the pipeline:
- pos labels each token with a POS tag. The next section provides more details on POS tags.
- lemma converts different forms of verbs, for example, to the base form. This will be covered in detail in the Stemming and lemmatization section later in this chapter.
- depparse performs dependency parsing between words in a sentence. Consider the following example sentence, "Hari went to school." Hari is interpreted as a noun by the POS tagger and becomes the governor of the word went. The word school is dependent on went, as it describes the object of the verb.
For now, only tokenization of text is desired, so only the tokenizer is used:
tokenized = en(sentence)
len(tokenized.sentences)
This shows that the tokenizer correctly divided the text into two sentences. To inspect the tokens in each sentence, the following code can be used:
for snt in tokenized.sentences:
    for word in snt.tokens:
        print(word.text)
    print("<End of Sentence>")
Go
until
jurong
point
,
crazy
..
<End of Sentence>
Available
only
in
bugis
n
great
world
<End of Sentence>
Note the punctuation tokens in the preceding output. Punctuation marks were separated out into their own tokens, and the text was split into multiple sentences. This is an improvement over only using spaces to split. In some applications, removal of punctuation may be required. This will be covered in the next section.
Consider the preceding example of Japanese. To see the performance of StanfordNLP on Japanese tokenization, the following piece of code can be used:
jp = snlp.download('ja')
This is the first step, which involves downloading the Japanese language model, similar to the English model that was downloaded and installed previously. Next, a Japanese pipeline will be instantiated and the words will be processed:
jp = snlp.Pipeline(lang='ja')
jp_line = jp("")  # the Japanese compound word shown earlier goes here
You may recall that the Japanese text reads Election Administration Committee. Correct tokenization should produce three words, where the first two are two characters each, and the last is three characters:
for snt in jp_line.sentences:
    for word in snt.tokens:
        print(word.text)
This matches the expected output. StanfordNLP supports 53 languages, so the same code can be used for tokenizing any language that is supported.
Coming back to the spam detection example, a new feature can be implemented that counts the number of words in the message using this tokenization functionality.
This word count feature is implemented in the Adding Word Count Feature section of the notebook.
en = snlp.Pipeline(lang='en')

def word_counts(x, pipeline=en):
    doc = pipeline(x)
    count = sum([len(sentence.tokens) for sentence in doc.sentences])
    return count
Next, using the train and test splits, add a column for the word count feature:
train['Words'] = train['Message'].apply(word_counts)
test['Words'] = test['Message'].apply(word_counts)

x_train = train[['Length', 'Punctuation', 'Capitals', 'Words']]
y_train = train[['Spam']]
x_test = test[['Length', 'Punctuation', 'Capitals', 'Words']]
y_test = test[['Spam']]

model = make_model(input_dims=4)
The last line in the preceding code block creates a new model with four input features.
When you execute functions in the StanfordNLP library, you may see a warning like this:
/pytorch/aten/src/ATen/native/LegacyDefinitions.cpp:19: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead.
Internally, StanfordNLP uses the PyTorch library. This warning is due to StanfordNLP using an older version of a function that is now deprecated. For all intents and purposes, this warning can be ignored. It is expected that maintainers of StanfordNLP will update their code.
Modeling tokenized data
model.fit(x_train, y_train, epochs=10, batch_size=10)

Train on 4459 samples
Epoch 1/10
4459/4459 [==============================] - 1s 202us/sample - loss: 2.4261 - accuracy: 0.6961
…
Epoch 10/10
4459/4459 [==============================] - 1s 142us/sample - loss: 0.2061 - accuracy: 0.9312
There is only a marginal improvement in accuracy. One hypothesis is that the number of words is not useful. It would be useful if the average number of words in spam messages were smaller or larger than regular messages. Using pandas, this can be quickly verified:
train.loc[train.Spam == 1].describe()
Figure 1.8: Statistics for spam message features
train.loc[train.Spam == 0].describe()
Figure 1.9: Statistics for regular message features
Some interesting patterns can quickly be seen. Spam messages usually have much less deviation from the mean. Focus on the Capitals feature column. It shows that regular messages use far fewer capitals than spam messages. At the 75th percentile, there are 3 capitals in a regular message versus 21 in a spam message. On average, regular messages have 4 capital letters while spam messages have 15. This variation is much less pronounced for the number of words. Regular messages have 17 words on average, while spam messages have 29. At the 75th percentile, regular messages have 22 words while spam messages have 35. This quick check gives an indication as to why adding the word count feature wasn't that useful. However, there are still a couple of things to consider. First, the tokenization model split punctuation marks out as separate tokens. Ideally, these should be removed from the word counts, as the punctuation feature already shows that spam messages use many more punctuation characters. This will be covered in the Parts-of-speech tagging section. Second, languages have some common words that are usually excluded. This is called stop word removal and is the focus of the next section.
Stop word removal
Stop word removal involves removing common words such as articles (the, an) and conjunctions (and, but), among others. In the context of information retrieval or search, these words would not be helpful in identifying documents or web pages that would match the query. As an example, consider the query "Where is Google based?". In this query, is is a stop word. The query would produce similar results irrespective of the inclusion of is. To determine the stop words, a simple approach is to use grammar clues.
In English, articles and conjunctions are examples of classes of words that can usually be removed. A more robust way is to consider the frequency of occurrence of words in a corpus, set of documents, or text. The most frequent terms can be selected as candidates for the stop word list. It is recommended that this list be reviewed manually. There can be cases where words may be frequent in a collection of documents but are still meaningful. This can happen if all the documents in the collection are from a specific domain or on a specific topic. Consider a set of documents from the Federal Reserve. The word economy may appear quite frequently in this case; however, it is unlikely to be a candidate for removal as a stop word.
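The frequency-based approach can be sketched with collections.Counter; the two-document corpus here is invented:

```python
from collections import Counter

# Count word frequencies across a (made-up) corpus and take the most
# frequent terms as stop word candidates for manual review
corpus = ["the cat sat on the mat", "the dog chased the cat"]
freq = Counter(word for doc in corpus for word in doc.split())
candidates = [word for word, _ in freq.most_common(2)]
print(candidates)  # ['the', 'cat']
```

A content word like cat surfacing among the candidates shows exactly why the list needs the manual review recommended above.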
In some cases, stop words may actually contain information. This may be applicable to phrases. Consider the fragment "flights to Paris." In this case, to provides valuable information, and its removal may change the meaning of the fragment.
Recall the stages of the text processing workflow. The step after text normalization is vectorization. This step is discussed in detail later in the Vectorizing text section of this chapter, but the key step in vectorization is to build a vocabulary or dictionary of all the tokens. The size of this vocabulary can be reduced by removing stop words. While training and evaluating models, removing stop words reduces the number of computation steps that need to be performed. Hence, the removal of stop words can yield benefits in terms of computation speed and storage space. Modern advances in NLP see smaller and smaller stop words lists as more efficient encoding schemes and computation methods evolve. Let's try and see the impact of stop words on the spam problem to develop some intuition about its usefulness.
Many NLP packages provide lists of stop words. These can be removed from the text after tokenization. Tokenization was done through the StanfordNLP library previously; however, this library does not come with a list of stop words. NLTK and spaCy supply stop words for a set of languages. For this example, we will use an open source package called stopwordsiso.
The Stop Word Removal section of the notebook contains the code for this section.
This Python package takes the list of stop words from the stopwords-iso GitHub project at https://github.com/stopwords-iso/stopwords-iso. This package provides stop words in 57 languages. The first step is to install the Python package that provides access to the stop words lists.
The following command will install the package through the notebook:
!pip install stopwordsiso
Supported languages can be checked with the following commands:
import stopwordsiso as stopwords

stopwords.langs()
English language stop words can be checked as well to get an idea of some of the words:
["'ll", "'tis", "'twas", "'ve", '10', '39', 'a', "a's", 'able', 'ableabout', 'about', 'above', 'abroad', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'ad', 'added',
Given that tokenization was already implemented in the preceding
word_counts() method, the implementation of that method can be updated to include removing stop words. However, all the stop words are in lowercase. Case normalization was discussed earlier, and capital letters were a useful feature for spam detection. In this case, tokens need to be converted to lowercase to effectively remove them:
en_sw = stopwords.stopwords('en')

def word_counts(x, pipeline=en):
    doc = pipeline(x)
    count = 0
    for sentence in doc.sentences:
        for token in sentence.tokens:
            if token.text.lower() not in en_sw:
                count += 1
    return count
A consequence of using stop words is that a message such as "When are you going to ride your bike?" counts as only 3 words. When we see if this has had any effect on the statistics for word length, the following picture emerges:
Figure 1.10: Word counts for spam messages after removing stop words
Compared to the word counts prior to stop word removal, the average number of words has been reduced from 29 to 18, a decrease of nearly 40%. The 25th percentile changed from 26 to 14. The maximum has also reduced from 49 to 33.
The impact on regular messages is even more dramatic:
Figure 1.11: Word counts for regular messages after removing stop words
Comparing these statistics to those from before stop word removal, the average number of words has more than halved to almost 8. The maximum number of words has also reduced from 209 to 147. The standard deviation of regular messages is about the same as its mean, indicating that there is a lot of variation in the number of words in regular messages. Now, let's see if this helps us train a model and improve its accuracy.
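As an aside, the summary statistics shown in the preceding figures (count, mean, standard deviation, percentiles, minimum, maximum) are exactly what pandas' describe() produces. A minimal sketch, using made-up word counts rather than the actual dataset:

```python
import pandas as pd

# Illustrative word counts only -- not the real SMS data
df = pd.DataFrame({
    "Spam": [1, 1, 0, 0, 0],
    "Words": [18, 21, 7, 9, 6],
})

# Per-class summary statistics, like those shown in the figures
print(df.groupby("Spam")["Words"].describe())
```

Running describe() on the real train DataFrame, grouped by the Spam label, reproduces the figures above.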
Modeling data with stop words removed
train['Words'] = train['Message'].apply(word_counts)
test['Words'] = test['Message'].apply(word_counts)

x_train = train[['Length', 'Punctuation', 'Capitals', 'Words']]
y_train = train[['Spam']]
x_test = test[['Length', 'Punctuation', 'Capitals', 'Words']]
y_test = test[['Spam']]

model = make_model(input_dims=4)
model.fit(x_train, y_train, epochs=10, batch_size=10)
Epoch 1/10
4459/4459 [==============================] - 2s 361us/sample - loss: 0.5186 - accuracy: 0.8652
Epoch 2/10
...
Epoch 9/10
4459/4459 [==============================] - 2s 355us/sample - loss: 0.1790 - accuracy: 0.9417
Epoch 10/10
4459/4459 [==============================] - 2s 361us/sample - loss: 0.1802 - accuracy: 0.9421
Evaluating on the test set gives the following result:

1115/1115 [==============================] - 0s 74us/sample - loss: 0.1954 - accuracy: 0.9372
[0.19537461110027382, 0.93721974]
In NLP, stop word removal used to be standard practice. In more modern applications, stop words may actually end up hindering performance in some use cases, rather than helping. It is becoming more common not to exclude stop words. Depending on the problem you are solving, stop word removal may or may not help.
Note that StanfordNLP will separate words like can't into ca and n't. This represents the expansion of the short form into its constituents, can and not. These contractions may or may not appear in the stop word list. Implementing a more robust stop word detector is left to the reader as an exercise.
StanfordNLP uses a supervised RNN with Bi-directional Long Short-Term Memory (BiLSTM) units. This architecture uses a vocabulary to generate embeddings through the vectorization of the vocabulary. The vectorization and generation of embeddings is covered later in the chapter, in the Vectorizing text section. This architecture of BiLSTMs with embeddings is often a common starting point in NLP tasks. This will be covered and used in successive chapters in detail. This particular architecture for tokenization is considered the state of the art as of the time of writing this book. Prior to this, Hidden Markov Model (HMM)-based models were popular.
Depending on the language in question, regular expression-based tokenization is another approach. The NLTK library provides the Penn Treebank tokenizer, based on regular expressions from a sed script. In future chapters, other tokenization or segmentation schemes such as Byte Pair Encoding (BPE) and WordPiece will be explained.
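As a quick illustration of the regex-based Treebank tokenizer (assuming NLTK is installed), note how it splits contractions in the same way discussed earlier for StanfordNLP:

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Contractions such as "can't" are split into their constituents
tokens = tokenizer.tokenize("I can't wait to bake cookies.")
print(tokens)
```

Unlike the StanfordNLP tokenizer, this is a purely rule-based approach, so it requires no model download and runs very fast.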
The next task in text normalization is to understand the structure of a text through POS tagging.
POS tagging

Languages have a grammatical structure. In most languages, words can be categorized primarily into verbs, adverbs, nouns, and adjectives. The objective of this processing step is to take a piece of text and tag each word token with a POS identifier. Note that this makes sense only in the case of word-level tokens. Commonly, the Penn Treebank POS tagger is used by libraries, including StanfordNLP, to tag words. By convention, POS tags are added by using a code after the word, separated by a slash. As an example, NNS is the tag for a plural noun. If the word goats were encountered, it would be represented as goats/NNS. The StanfordNLP library uses Universal POS (UPOS) tags. The following tags are part of the UPOS tag set. More details on the mapping of standard POS tags to UPOS tags can be seen at https://universaldependencies.org/docs/tagset-conversion/en-penn-uposf.html. The following is a table of the most common tags:
- ADJ (adjective): Usually describes a noun. Separate tags are used for comparatives and superlatives.
- ADP (adposition): Used to modify an object such as a noun, pronoun, or phrase; for example, "Walk up the stairs." Some languages, such as English, use prepositions, while others, such as Hindi and Japanese, use postpositions.
- ADV (adverb): A word or phrase that modifies or qualifies an adjective, verb, or another adverb.
- AUX (auxiliary verb): Used in forming the mood, voice, or tense of other verbs. Examples: will, can, may.
- CCONJ (coordinating conjunction): Joins two phrases, clauses, or sentences. Examples: and, but, that.
- INTJ (interjection): An exclamation, interruption, or sudden remark. Examples: oh, uh, lol.
- NOUN (noun): Identifies people, places, or things.
- NUM (numeral): Represents a quantity.
- DET (determiner): Identifies a specific noun, usually as a singular. Examples: a, an, the.
- PART (particle): Parts of speech outside of the main types.
- PRON (pronoun): Substitutes for other nouns, especially proper nouns.
- PROPN (proper noun): A name for a specific person, place, or thing.
- PUNCT (punctuation): Different punctuation symbols. Examples: , ? /
- SCONJ (subordinating conjunction): Connects an independent clause to a dependent clause.
- SYM (symbol): Symbols including currency signs, emojis, and so on. Examples: $, #, %, :)
- VERB (verb): Denotes action or occurrence.
- X (other): That which cannot be classified elsewhere. Examples: etc, 4. (a numbered list bullet)
The best way to understand how POS tagging works is to try it out:
The code for this section is in the POS Based Features section of the notebook.
en = snlp.Pipeline(lang='en')
txt = "Yo you around? A friend of mine's lookin."
pos = en(txt)
def print_pos(doc):
    text = ""
    for sentence in doc.sentences:
        for token in sentence.tokens:
            text += token.words[0].text + "/" + \
                    token.words[0].upos + " "
        text += "\n"
    return text
This method can be used to investigate the tagging for the preceding example sentence:
Yo/PRON you/PRON around/ADV ?/PUNCT A/DET friend/NOUN of/ADP mine/PRON 's/PART lookin/NOUN ./PUNCT
Most of these tags would make sense, though there may be some inaccuracies. For example, the word lookin is miscategorized as a noun. Neither StanfordNLP, nor a model from another package, will be perfect. This is something that we have to account for when building models using such features. There are a couple of different features that can be built using these POS tags. First, we can update the word_counts() method to exclude punctuation from the count of words. The current method is unaware of the punctuation when it counts the words. Additional features can be created that look at the proportion of different types of grammatical elements in the messages. Note that so far, all features are based on the structure of the text, and not on the content itself. Working with content features will be covered in more detail as this book continues.
As a next step, let's update the
word_counts() method and add a feature to show the percentages of symbols and punctuation in a message – with the hypothesis that maybe spam messages use more punctuation and symbols. Other features around types of different grammatical elements can also be built. These are left to you to implement. Our
word_counts() method is updated as follows:
en_sw = stopwords.stopwords('en')

def word_counts_v3(x, pipeline=en):
    doc = pipeline(x)
    totals = 0.
    count = 0.
    non_word = 0.
    for sentence in doc.sentences:
        totals += len(sentence.tokens)  # total tokens, including punctuation
        for token in sentence.tokens:
            if token.text.lower() not in en_sw:
                if token.words[0].upos not in ['PUNCT', 'SYM']:
                    count += 1.
                else:
                    non_word += 1.
    non_word = non_word / totals
    return pd.Series([count, non_word],
                     index=['Words_NoPunct', 'Punct'])
This function is a little different compared to the previous one. Since there are multiple computations that need to be performed on the message in each row, these operations are combined and a
Series object with column labels is returned. This can be merged with the main DataFrame like so:
train_tmp = train['Message'].apply(word_counts_v3)
train = pd.concat([train, train_tmp], axis=1)
A similar process can be performed on the test set:
test_tmp = test['Message'].apply(word_counts_v3)
test = pd.concat([test, test_tmp], axis=1)
Figure 1.12: Statistics for regular messages after using POS tags
And then for spam messages:
Figure 1.13: Statistics for spam messages after using POS tags
In general, word counts have been reduced even further now that punctuation tokens are excluded. Furthermore, the new Punct feature computes the ratio of punctuation tokens in a message relative to the total tokens. Now we can build a model with this data.
Modeling data with POS tagging
x_train = train[['Length', 'Punctuation', 'Capitals',
                 'Words_NoPunct', 'Punct']]
y_train = train[['Spam']]
x_test = test[['Length', 'Punctuation', 'Capitals',
               'Words_NoPunct', 'Punct']]
y_test = test[['Spam']]

model = make_model(input_dims=5)
model.fit(x_train, y_train, epochs=10, batch_size=10)
Train on 4459 samples
Epoch 1/10
4459/4459 [==============================] - 1s 236us/sample - loss: 3.1958 - accuracy: 0.6028
Epoch 2/10
...
Epoch 10/10
4459/4459 [==============================] - 1s 139us/sample - loss: 0.1788 - accuracy: 0.9466
The accuracy shows a slight increase and is now up to 94.66%. Upon testing, it seems to hold:
1115/1115 [==============================] - 0s 91us/sample - loss: 0.2076 - accuracy: 0.9426 [0.20764057086989485, 0.9426009]
Stemming and lemmatization
In certain languages, the same word can take a slightly different form depending on its usage. Consider the word depend itself. The following are all valid forms of the word depend: depends, depending, depended, dependent. Often, these variations are due to tenses. In some languages like Hindi, verbs may have different forms for different genders. Another case is derivatives of the same word such as sympathy, sympathetic, sympathize, and sympathizer. These variations can take different forms in other languages. In Russian, proper nouns take different forms based on usage. Suppose there is a document talking about London (Лондон). The phrase in London (в Лондоне) spells London differently than from London (из Лондона). These variations in the spelling of London can cause issues when matching some input to sections or words in a document.
When processing and tokenizing text to construct a vocabulary of words appearing in the corpora, the ability to identify the root word can reduce the size of the vocabulary while expanding the accuracy of matches. In the preceding Russian example, any form of the word London can be matched to any other form if all the forms are normalized to a common representation post-tokenization. This process of normalization is called stemming or lemmatization.
Stemming and lemmatization differ in their approach and sophistication but serve the same objective. Stemming is a simpler, heuristic rule-based approach that chops off the affixes of words. The most famous stemmer is called the Porter stemmer, published by Martin Porter in 1980. The official website is https://tartarus.org/martin/PorterStemmer/, where various versions of the algorithm implemented in various languages are linked.
This stemmer only works for English and has rules including removing the s at the end of words for plurals, and removing endings such as -ed or -ing. Consider the following sentence:

"Stemming is aimed at reducing vocabulary and aid understanding of morphological processes. This helps people understand the morphology of words and reduce size of corpus."

After stemming using Porter's algorithm, different forms of the same root collapse to a single token: morphology and morphological both become morpholog, understanding and understand become understand, and reducing and reduce become reduc. Note that the stemmed forms are often not valid words themselves.
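NLTK ships an implementation of the Porter stemmer, so this behavior is easy to verify directly; a minimal sketch:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["morphology", "morphological", "understanding",
         "understand", "reducing", "reduce"]
# Different surface forms collapse to the same (often non-word) stem
print({w: stemmer.stem(w) for w in words})
```

Since stemming is purely rule-based, it is extremely fast, at the cost of sometimes producing fragments that are not dictionary words.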
Lemmatization approaches this task in a more sophisticated manner, using vocabularies and morphological analysis of words. In the study of linguistics, a morpheme is a unit smaller than or equal to a word. When a morpheme is a word in itself, it is called a root or a free morpheme. Conversely, every word can be decomposed into one or more morphemes. The study of morphemes is called morphology. Using this morphological information, a word's root form can be returned post-tokenization. This base or dictionary form of the word is called a lemma, hence the process is called lemmatization. StanfordNLP includes lemmatization as part of processing.
The Lemmatization section of the notebook has the code shown here.
Here is a simple piece of code to take the preceding sentences and parse them:
text = "Stemming is aimed at reducing vocabulary and aid " \
       "understanding of morphological processes. This helps people " \
       "understand the morphology of words and reduce size of corpus."
lemma = en(text)
After processing, we can iterate through the tokens to get the lemma of each word. This is shown in the following code fragment. The lemma of a word is exposed as the
.lemma property of each word inside a token. For the sake of brevity of code, a simplifying assumption is made here that each token has only one word.
lemmas = ""
for sentence in lemma.sentences:
    for token in sentence.tokens:
        lemmas += token.words[0].lemma + "/" + \
                  token.words[0].upos + " "
    lemmas += "\n"
print(lemmas)
stem/NOUN be/AUX aim/VERB at/SCONJ reduce/VERB vocabulary/NOUN and/CCONJ aid/NOUN understanding/NOUN of/ADP morphological/ADJ process/NOUN ./PUNCT this/PRON help/VERB people/NOUN understand/VERB the/DET morphology/NOUN of/ADP word/NOUN and/CCONJ reduce/VERB size/NOUN of/ADP corpus/ADJ ./PUNCT
Compare this output to the output of the Porter stemmer earlier. One immediate thing to notice is that lemmas are actual words, as opposed to the fragments produced by the Porter stemmer. In the case of reduce, the usage in both sentences is as a verb, so the choice of lemma is consistent. Focus on the words understand and understanding in the preceding output. As the POS tags show, they are used in two different forms, so they are not reduced to the same lemma. This is different from the Porter stemmer, and is quite sophisticated behavior.
Now that text normalization is completed, we can begin the vectorization of text.
Vectorizing text

While building models for SMS message spam detection thus far, only aggregate features based on counts or distributions of lexical or grammatical features have been considered. The actual words in the messages have not been used. There are a couple of challenges in using the text content of messages. The first is that text can be of arbitrary length. Compare this to image data, where each image has a fixed width and height. Even if a corpus contains images of mixed sizes, they can be resized to a common size with minimal loss of information by using a variety of compression mechanisms. In NLP, this is a bigger problem than in computer vision. A common approach to handle this is to truncate the text. We will see various ways to handle variable-length texts in examples throughout the book.
The second issue is that of the representation of words with a numerical quantity or feature. In computer vision, the smallest unit is a pixel. Each pixel has a set of numerical values indicating color or intensity. In a text, the smallest unit could be a word. Aggregating the Unicode values of the characters does not convey or embody the meaning of the word. In fact, these character codes embody no information at all about the character, such as its prevalence, whether it is a consonant or a vowel, and so on. However, averaging the pixels in a section of an image could be a reasonable approximation of that region of the image. It may represent how that region would look if seen from a large distance. A core problem then is to construct a numerical representation of words. Vectorization is the process of converting a word to a vector of numbers that embodies the information contained in the word. Depending on the vectorization technique, this vector may have additional properties that may allow comparison with other words, as will be shown in the Word vectors section later in this chapter.
The simplest approach to vectorizing is to use counts of words. The second approach is more sophisticated, with its origins in information retrieval, and is called TF-IDF. The third approach is relatively new, having been published in 2013, and uses shallow neural networks to generate embeddings or word vectors. This method is called Word2Vec. The newest method in this area as of the time of writing was BERT, which came out in the last quarter of 2018. The first three methods will be discussed in this chapter. BERT will be discussed in detail in Chapter 3, Named Entity Recognition (NER) with BiLSTMs, CRFs, and Viterbi Decoding.
Count-based vectorization

The idea behind count-based vectorization is really simple. Each unique word appearing in the corpus is assigned a column in the vocabulary. Each document, which would correspond to an individual message in the spam example, is assigned a row. The counts of the words appearing in that document are entered in the cell corresponding to that document and word. With n unique documents containing m unique words, this results in a matrix of n rows by m columns. Consider a corpus like so:
corpus = [
    "I like fruits. Fruits like bananas",
    "I love bananas but eat an apple",
    "An apple a day keeps the doctor away"
]
Modeling after count-based vectorization
!pip install sklearn
sklearn's CountVectorizer class provides a built-in tokenizer that separates tokens of two or more characters in length. This class takes a variety of options, including a custom tokenizer, a stop word list, the option to convert characters to lowercase prior to tokenization, and a binary mode that converts every positive count to 1. The defaults provide a reasonable choice for an English language corpus:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names()
['an', 'apple', 'away', 'bananas', 'but', 'day', 'doctor', 'eat', 'fruits', 'keeps', 'like', 'love', 'the']
In the preceding code, a model is fit to the corpus. The last line prints out the tokens that are used as columns. The full matrix can be seen as follows:
array([[0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0],
       [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
       [1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1]])
This process has now converted a sentence such as "I like fruits. Fruits like bananas" into the vector (0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0). This is an example of context-free vectorization. Context-free refers to the fact that the order of the words in the document makes no difference in the generation of the vector. This is merely counting the instances of the words in a document. Consequently, words with multiple meanings are grouped into one, for example, bank. This may refer to a place near a river or a place to keep money. However, it does provide a method to compare documents and derive similarity. The cosine similarity or distance can be computed between two documents, to see which documents are similar to which other documents:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(X.toarray())

array([[1.        , 0.13608276, 0.        ],
       [0.13608276, 1.        , 0.3086067 ],
       [0.        , 0.3086067 , 1.        ]])
This shows that the first and second sentences have a similarity score of 0.136 (on a scale of 0 to 1). The first and third sentences have nothing in common. The second and third sentences have a similarity score of 0.308 – the highest in this set. Another use case of this technique is to check the similarity of documents against given keywords. Let's say that the query is apple and bananas. The first step is to compute the vector of this query, and then compute the cosine similarity scores against the documents in the corpus:
query = vectorizer.transform(["apple and bananas"])
cosine_similarity(X, query)

array([[0.23570226],
       [0.57735027],
       [0.26726124]])
This shows that this query matches the second sentence in the corpus the best. The third sentence would rank second, and the first sentence would rank lowest. In a few lines, a basic search engine has been implemented, along with logic to serve queries! At scale, this is a very difficult problem, as the number of words or columns in a web crawler would top 3 billion. Every web page would be represented as a row, so that would also require billions of rows. Computing a cosine similarity in milliseconds to serve an online query and keeping the content of this matrix updated is a massive undertaking.
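Under the hood, cosine similarity is just a dot product normalized by vector magnitudes. A minimal NumPy sketch reproduces the first score above, using the count vector of the first sentence and of the query (the word and does not occur in the corpus vocabulary, so it contributes nothing):

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product normalized by the vector magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Count vector of "I like fruits. Fruits like bananas"
doc = np.array([0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0], dtype=float)
# Count vector of the query "apple and bananas" ("and" is out of vocabulary)
qry = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)

print(cosine_sim(doc, qry))  # matches the first score above, ~0.23570226
```

The only overlap is the single occurrence of bananas, which is why the score is low.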
The next step from this rather simple vectorization scheme is to consider the information content of each word in constructing this matrix.
Term Frequency-Inverse Document Frequency (TF-IDF)
In creating a vector representation of the document, only the presence of words was included – it does not factor in the importance of a word. If the corpus of documents being processed is about a set of recipes with fruits, then one may expect words like apples, raspberries, and washing to appear frequently. Term Frequency (TF) represents how often a word or token occurs in a given document. This is exactly what we did in the previous section. In a set of documents about fruits and cooking, a word like apple may not be terribly specific to help identify a recipe. However, a word like tuile may be uncommon in that context. Therefore, it may help to narrow the search for recipes much faster than a word like raspberry. On a side note, feel free to search the web for raspberry tuile recipes. If a word is rare, we want to give it a higher weight, as it may contain more information than a common word. A term can be upweighted by the inverse of the number of documents it appears in. Consequently, words that occur in a lot of documents will get a smaller score compared to terms that appear in fewer documents. This is called the Inverse Document Frequency (IDF).
Mathematically, the score of each term in a document can be computed as follows:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Here, t represents the word or term, and d represents a specific document.
It is common to normalize the TF of a term in a document by the total number of tokens in that document.
The IDF is defined as follows:

IDF(t) = log(N / (1 + n_t))

Here, N represents the total number of documents in the corpus, and n_t represents the number of documents where the term is present. The addition of 1 in the denominator avoids a divide-by-zero error. Fortunately, sklearn provides methods to compute TF-IDF.
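Before handing things off to sklearn, the TF and IDF definitions above can be sanity-checked with a small hand-rolled sketch. The helper names here are our own, and note that sklearn's implementation uses a slightly different IDF formula and normalization, so the exact numbers will differ:

```python
import math

def tf(term, doc):
    # Term frequency, normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency; the +1 avoids division by zero
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + n_t))

docs = [
    ["i", "like", "fruits", "fruits", "like", "bananas"],
    ["i", "love", "bananas", "but", "eat", "an", "apple"],
    ["an", "apple", "a", "day", "keeps", "the", "doctor", "away"],
]
# "fruits" occurs twice in a six-token document, and in one of three documents
score = tf("fruits", docs[0]) * idf("fruits", docs)
print(score)
```

A rare term like fruits gets a positive weight, while a term appearing in many documents is pushed toward zero.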
The TF-IDF Vectorization section of the notebook contains the code for this section.
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(smooth_idf=False)
tfidf = transformer.fit_transform(X.toarray())
pd.DataFrame(tfidf.toarray(),
             columns=vectorizer.get_feature_names())
This produces the following output:
This should give some intuition on how TF-IDF is computed. Even with three toy sentences and a very limited vocabulary, many of the columns in each row are 0. This vectorization produces sparse representations.
Now, this can be applied to the problem of detecting spam messages. Thus far, the features for each message have been computed based on some aggregate statistics and added to the
pandas DataFrame. Now, the content of the message will be tokenized and converted into a set of columns. The TF-IDF score for each word or token will be computed for each message in the array. This is surprisingly easy to do with
sklearn, as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

tfidf = TfidfVectorizer(binary=True)
X = tfidf.fit_transform(train['Message']).astype('float32')
X_test = tfidf.transform(test['Message']).astype('float32')
X.shape
The second dimension of the shape shows that 7,741 tokens were uniquely identified. These are the columns of features that will be used in the model later. Note that the vectorizer was created with the binary flag. This implies that even if a token appears multiple times in a message, it is counted once. The fit_transform() line trains the TF-IDF model on the training dataset. Then, the words in the test set are converted according to the TF-IDF scores learned from the training set. Let's train a model on just these TF-IDF features.
Modeling using TF-IDF features
_, cols = X.shape
model2 = make_model(cols)  # to match the number of TF-IDF dimensions

y_train = train[['Spam']]
y_test = test[['Spam']]

model2.fit(X.toarray(), y_train, epochs=10, batch_size=10)
Train on 4459 samples
Epoch 1/10
4459/4459 [==============================] - 2s 380us/sample - loss: 0.3505 - accuracy: 0.8903
...
Epoch 10/10
4459/4459 [==============================] - 1s 323us/sample - loss: 0.0027 - accuracy: 1.0000
Evaluating on the test set gives the following result:

1115/1115 [==============================] - 0s 134us/sample - loss: 0.0581 - accuracy: 0.9839
[0.05813191874545786, 0.9838565]
An accuracy rate of 98.39% is by far the best we have gotten in any model so far. Checking the confusion matrix, it is evident that this model is indeed doing very well:
y_test_pred = model2.predict_classes(X_test.toarray())
tf.math.confusion_matrix(tf.constant(y_test.Spam),
                         y_test_pred)

<tf.Tensor: shape=(2, 2), dtype=int32, numpy=
array([[958,   2],
       [ 16, 139]], dtype=int32)>
Only 2 regular messages were misclassified as spam, while only 16 spam messages were misclassified as not spam. This is indeed a very good model. Note that this dataset contains Indonesian (Bahasa Indonesia) words as well as English words. Indonesian uses the Latin alphabet. This model, without much pretraining or knowledge of language, vocabulary, and grammar, was able to do a very reasonable job with the task at hand.
However, this model ignores the relationships between words completely. It treats the words in a document as unordered items in a set. There are better models that vectorize the tokens in a way that preserves some of the relationships between the tokens. This is explored in the next section.
Word vectors

In the previous example, a row vector was used to represent a document. This was used as a feature for the classification model to predict spam labels. However, no information can be gleaned reliably from the relationships between words. In NLP, a lot of research has focused on learning representations of words in an unsupervised way. This is called representation learning. The output of this approach is a representation of a word in some vector space, and the word can be considered embedded in that space. Consequently, these word vectors are also called embeddings.
The core hypothesis behind word vector algorithms is that words that occur near each other are related to each other. To see the intuition behind this, consider two words, bake and oven. Given a sentence fragment of five words, where one of these words is present, what would be the probability of the other being present as well? You would be right in guessing that the probability is likely quite high. Suppose now that words are being mapped into some two-dimensional space. In that space, these two words should be closer to each other, and probably further away from words like astronomy and tractor.
The task of learning these embeddings for the words can be then thought of as adjusting words in a giant multidimensional space where similar words are closer to each other and dissimilar words are further apart from each other.
A revolutionary approach to do this is called Word2Vec. This algorithm was published by Tomas Mikolov and collaborators from Google in 2013. This approach produces dense vectors of the order of 50-300 dimensions generally (though larger are known), where most of the values are non-zero. In contrast, in our previous trivial spam example, the TF-IDF model had 7,741 dimensions. The original paper had two algorithms proposed in it: continuous bag-of-words and continuous skip-gram. On semantic tasks and overall, the performance of skip-gram was state of the art at the time of its publication. Consequently, the continuous skip-gram model with negative sampling has become synonymous with Word2Vec. The intuition behind this model is fairly straightforward.
Consider this sentence fragment from a recipe: "Bake until the cookie is golden brown all over." Under the assumption that a word is related to the words that appear near it, a word from this fragment can be picked and a classifier can be trained to predict the words around it:
Figure 1.14: A window of 5 centered on cookie
Taking an example of a window of five words, the word in the center is used to predict two words before and two words after it. In the preceding figure, the fragment is until the cookie is golden, with the focus on the word cookie. Assuming that there are 10,000 words in the vocabulary, a network can be trained to predict binary decisions given a pair of words. The training objective is that the network predicts
true for pairs like (cookie, golden) while predicting
false for (cookie, kangaroo). This particular approach is called Skip-Gram Negative Sampling (SGNS) and it considerably reduces the training time required for large vocabularies. A model very similar to the single-layer neural model in the previous section can be trained, with the sigmoid activation changed to a softmax function over the output. If the hidden layer has 300 units, then its weight matrix would be 10,000 x 300; that is, for each word, there is a set of weights. The objective of the training is to learn these weights. In fact, these weights become the embedding for that word once training is complete.
The choice of units in the hidden layer is a hyperparameter that can be adapted for specific applications. 300 is commonly found as it is available through pretrained embeddings on the Google News dataset. Finally, the error is computed as the sum of the categorical cross-entropy of all the word pairs in negative and positive examples.
The beauty of this model is that it does not require any supervised training data. Running sentences can be used to provide positive examples. For the model to learn effectively, it is important to provide negative samples as well. Words are randomly sampled using their probability of occurrence in the training corpus and fed as negative examples.
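The pair-generation step can be sketched in a few lines of plain Python. This is only an illustration of the idea, and the function name is our own: real implementations draw negatives from a smoothed unigram distribution (frequency raised to the 3/4 power) rather than uniformly, as done here for simplicity:

```python
import random

def skipgram_pairs(tokens, window=2, num_neg=2, seed=42):
    """Generate (center, context, label) pairs for skip-gram negative sampling.

    Positives come from words inside the window around each center word;
    negatives are sampled uniformly from the vocabulary here for simplicity
    (Word2Vec uses a unigram^0.75 distribution instead).
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((center, tokens[j], 1))       # positive example
            for _ in range(num_neg):                   # negative examples
                pairs.append((center, rng.choice(vocab), 0))
    return pairs

tokens = "bake until the cookie is golden brown all over".split()
pairs = skipgram_pairs(tokens)
print(pairs[:3])
```

These (word, word, label) triples are exactly the binary classification examples the SGNS network is trained on.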
To understand how the Word2Vec embeddings work, let's download a set of pretrained embeddings.
The code shown in the following section can be found in the Word Vectors section of the notebook.
Pretrained models using Word2Vec embeddings
Since we are only interested in experimenting with a pretrained model, we can use the Gensim library and its pretrained embeddings. Gensim should already be installed in Google Colab. It can be installed like so:
!pip install gensim
After the requisite imports, pretrained embeddings can be downloaded and loaded. Note that these particular embeddings are approximately 1.6 GB in size, so may take a very long time to load (you may encounter some memory issues as well):
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

model_w2v = api.load("word2vec-google-news-300")
Another issue that you may run into is the Colab session expiring if left alone for too long while waiting for the download to finish. This may be a good time to switch to a local notebook, which will also be helpful in future chapters. Now, we are ready to inspect the words most similar to a given word:

model_w2v.most_similar("cookies")
[('cookie', 0.745154082775116),
 ('oatmeal_raisin_cookies', 0.6887780427932739),
 ('oatmeal_cookies', 0.662139892578125),
 ('cookie_dough_ice_cream', 0.6520504951477051),
 ('brownies', 0.6479344964027405),
 ('homemade_cookies', 0.6476464867591858),
 ('gingerbread_cookies', 0.6461867690086365),
 ('Cookies', 0.6341644525527954),
 ('cookies_cupcakes', 0.6275068521499634),
 ('cupcakes', 0.6258294582366943)]
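The number next to each word is the cosine similarity between the vectors of the query word and that word. As a quick illustration of the metric with hand-made 3-dimensional vectors (made up for this example, not real embeddings), vectors pointing in similar directions score close to 1:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors: 'cookies' and 'brownies' point in similar directions.
cookies = np.array([0.9, 0.8, 0.1])
brownies = np.array([0.8, 0.9, 0.2])
kangaroo = np.array([0.1, 0.0, 0.9])

print(cosine_similarity(cookies, brownies))  # close to 1
print(cosine_similarity(cookies, kangaroo))  # close to 0
```

The real 300-dimensional embeddings produce the similarity scores listed above.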
This is pretty good. Let's see how this model does at picking the odd one out from a set of words, using Gensim's doesnt_match() method, which returns the word whose vector is least similar to the mean of all the vectors in the list. Given a list of words in which all but one are countries and the remaining one is Tokyo, the model is able to work out that Tokyo is the odd one out, as it is a city. Now, let's try a very famous example of mathematics on these word vectors:
king = model_w2v['king']
man = model_w2v['man']
woman = model_w2v['woman']
queen = king - man + woman
model_w2v.similar_by_vector(queen)
[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.6454660892486572),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376776456832886),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
Given that king was provided as an input to the equation, it also appears in the output; once the input words are filtered out, queen is the top result. SMS spam classification could be attempted using these embeddings. However, future chapters will cover the use of GloVe embeddings and BERT embeddings for sentiment analysis.
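The arithmetic and the filtering of input words can be sketched with hand-crafted toy vectors (axis 0 as a hypothetical "royalty" direction and axis 1 as "gender"; these values are illustrative, not real Word2Vec embeddings):

```python
import numpy as np

# Hand-crafted 2-d vectors: axis 0 = "royalty", axis 1 = "gender".
vectors = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "sultan": np.array([0.9,  0.9]),
}

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def similar_by_vector(v, exclude=()):
    """Rank words by cosine similarity to v, skipping the input words."""
    candidates = [w for w in vectors if w not in exclude]
    return sorted(candidates, key=lambda w: cosine(vectors[w], v), reverse=True)

target = vectors["king"] - vectors["man"] + vectors["woman"]
print(similar_by_vector(target, exclude=("king", "man", "woman")))  # queen first
```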
A pretrained model like the preceding one can be used to vectorize a document. Using these embeddings, models can be trained for specific purposes. In later chapters, newer methods of generating contextual embeddings, such as BERT, will be discussed in detail.
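One simple way to vectorize a document with pretrained embeddings is to average the vectors of its words. A minimal sketch with a toy 3-dimensional lookup table (with the real model, embeddings[w] would be model_w2v[w] and dim would be 300):

```python
import numpy as np

# Toy lookup table standing in for the 300-d Word2Vec vectors.
embeddings = {
    "free":   np.array([0.9, 0.1, 0.0]),
    "prize":  np.array([0.8, 0.2, 0.1]),
    "hello":  np.array([0.0, 0.1, 0.9]),
    "friend": np.array([0.1, 0.0, 0.8]),
}

def document_vector(text, dim=3):
    """Average the embeddings of the known words; zeros if none are known."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

print(document_vector("Free prize"))    # one fixed-length vector per message
print(document_vector("Hello friend"))
```

The resulting fixed-length vectors can then be fed to a classifier such as the spam detector built earlier in the chapter.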
In this chapter, we worked through the basics of NLP, including collecting and labeling training data, tokenization, stop word removal, case normalization, POS tagging, stemming, and lemmatization. Some vagaries of these in languages such as Japanese and Russian were also covered. Using a variety of features derived from these approaches, we trained a model to classify spam messages, where the messages had a combination of English and Bahasa Indonesian words. This got us to a model with 94% accuracy.
However, the major challenge in using the content of the messages was in defining a way to represent words as vectors such that computations could be performed on them. We started with a simple count-based vectorization scheme and then graduated to a more sophisticated TF-IDF approach, both of which produced sparse vectors. This TF-IDF approach gave a model with 98%+ accuracy in the spam detection task.
Finally, we saw a contemporary method of generating dense word embeddings, called Word2Vec. This method, though a few years old, is still very relevant in many production applications. Once the word embeddings are generated, they can be cached for inference and that makes an ML model using these embeddings run with relatively low latency.
We used a very basic deep learning model to solve the SMS spam classification task. Just as Convolutional Neural Networks (CNNs) are the predominant architecture in computer vision, Recurrent Neural Networks (RNNs), especially those based on Long Short-Term Memory (LSTM) units and Bi-directional LSTMs (BiLSTMs), are the most commonly used architectures for building NLP models. In the next chapter, we will cover the structure of LSTMs and build a sentiment analysis model using BiLSTMs. These models will be used extensively in creative ways to solve different NLP problems in future chapters.