Natural Language Processing with TensorFlow - Second Edition

By Thushan Ganegedara
About this book

Learning how to solve natural language processing (NLP) problems is an important skill to master due to the explosive growth of data combined with the demand for machine learning solutions in production. Natural Language Processing with TensorFlow, Second Edition, will teach you how to solve common real-world NLP problems with a variety of deep learning model architectures.

The book starts by getting readers familiar with NLP and the basics of TensorFlow. Then, it gradually teaches you different facets of TensorFlow 2.x. In the following chapters, you then learn how to generate powerful word vectors, classify text, generate new text, and generate image captions, among other exciting use-cases of real-world NLP.

TensorFlow has evolved into an ecosystem that supports the machine learning workflow, from ingesting and transforming data to building models, monitoring, and productionization. We will read text directly from files and perform the required transformations through a TensorFlow data pipeline. We will also see how to use TensorBoard, a versatile visualization tool, to visualize our models.

By the end of this NLP book, you will be comfortable with using TensorFlow to build deep learning models with many different architectures, and efficiently ingest data using TensorFlow. Additionally, you’ll be able to confidently use TensorFlow throughout your machine learning workflow.

Publication date:
July 2022


Introduction to Natural Language Processing

Natural Language Processing (NLP) offers a much-needed set of tools and algorithms for understanding and processing the large volume of unstructured data in today’s world. Recently, deep learning has been widely adopted for many NLP tasks because of the remarkable performance deep learning algorithms have shown in a plethora of challenging tasks, such as image classification, speech recognition, and realistic text generation. TensorFlow is one of the most intuitive and efficient deep learning frameworks currently in existence that enables such amazing feats. This book will enable aspiring deep learning developers to handle massive amounts of data using NLP and TensorFlow. This chapter covers the following topics:

  • What is Natural Language Processing?
  • Tasks of Natural Language Processing
  • The traditional approach to Natural Language Processing
  • The deep learning approach to Natural Language Processing
  • Introduction to the technical tools

In this chapter, we will provide an introduction to NLP and to the rest of the book. We will answer the question, “What is Natural Language Processing?”. Also, we’ll look at some of its most important use cases. We will also consider the traditional approaches and the more recent deep learning-based approaches to NLP, including a Fully Connected Neural Network (FCNN). Finally, we will conclude with an overview of the rest of the book and the technical tools we will be using.


What is Natural Language Processing?

According to DOMO, an analytics company, 1.7 MB of data was being created for every person on earth every second by 2020, with a staggering 4.6 billion active users on the internet. This includes roughly 500,000 tweets sent and 306 billion emails circulated every day. These figures are only going in one direction as this book is being written, and that is up! Of all this data, a large fraction is unstructured text and speech, as billions of emails and social media posts are created and phone calls made every day.

These statistics provide a good basis for us to define what NLP is. Simply put, the goal of NLP is to make machines understand our spoken and written languages. Moreover, NLP is ubiquitous and is already a large part of human life. Virtual Assistants (VAs), such as Google Assistant, Cortana, Alexa, and Apple Siri, are largely NLP systems. Numerous NLP tasks take place when one asks a VA, “Can you show me a good Italian restaurant nearby?” First, the VA needs to convert the utterance to text (that is, speech-to-text). Next, it must understand the semantics of the request (for example, identify the most important keywords like restaurant and Italian) and formulate a structured request (for example, cuisine = Italian, rating = 3–5, distance < 10 km). Then, the VA must search for restaurants filtering by the location and cuisine, and then, rank the restaurants by the ratings received. To calculate an overall rating for a restaurant, a good NLP system may look at both the rating and text description provided by each user. Finally, once the user is at the restaurant, the VA might assist the user by translating various menu items from Italian to English. This example shows that NLP has become an integral part of human life.

It should be understood that NLP is an extremely challenging field of research as words and semantics have a highly complex nonlinear relationship, and it is even more difficult to capture this information as a robust numerical representation. To make matters worse, each language has its own grammar, syntax, and vocabulary. Therefore, processing textual data involves various complex tasks such as text parsing (for example, tokenization and stemming), morphological analysis, word sense disambiguation, and understanding the underlying grammatical structure of a language. For example, in these two sentences, I went to the bank and I walked along the river bank, the word bank has two entirely different meanings, due to the context it’s used in. To distinguish (or disambiguate) the meaning of the word bank, we need to understand the context in which the word is being used. Machine learning has become a key enabler for NLP, helping to accomplish the aforementioned tasks through machines. Below we discuss some of the important tasks that fall under NLP.


Tasks of Natural Language Processing

NLP has a multitude of real-world applications. A good NLP system is one that performs many NLP tasks. When you search for today’s weather on Google or use Google Translate to find out how to say, “How are you?” in French, you rely on a subset of such tasks in NLP. We will list some of the most ubiquitous tasks here, and this book covers most of these tasks:

  • Tokenization: Tokenization is the task of separating a text corpus into atomic units (for example, words or characters). Although it may seem trivial for a language like English, tokenization is an important task. For example, in the Japanese language, words are not delimited by spaces or punctuation marks.
  • Word-Sense Disambiguation (WSD): WSD is the task of identifying the correct meaning of a word. For example, in the sentences, The dog barked at the mailman and Tree bark is sometimes used as a medicine, the word bark has two different meanings. WSD is critical for tasks such as question answering.
  • Named Entity Recognition (NER): NER attempts to extract entities (for example, person, location, and organization) from a given body of text or a text corpus. For example, the sentence, John gave Mary two apples at school on Monday will be transformed to [John]_name gave [Mary]_name [two]_number apples at [school]_organization on [Monday]_time. NER is an important topic in fields such as information retrieval and knowledge representation.
  • Part-of-Speech (PoS) tagging: PoS tagging is the task of assigning words to their respective parts of speech. The tags can either be basic, such as noun, verb, adjective, adverb, and preposition, or more granular, such as proper noun, common noun, phrasal verb, verb, and so on. The Penn Treebank project, a popular project focusing on PoS, defines a comprehensive list of PoS tags.
  • Sentence/synopsis classification: Sentence or synopsis (for example, movie reviews) classification has many use cases such as spam detection, news article classification (for example, political, technology, and sport), and product review ratings (that is, positive or negative). This is achieved by training a classification model with labeled data (that is, reviews annotated by humans, with either a positive or negative label).
  • Text generation: In text generation, a learning model (for example, a neural network) is trained with text corpora (a large collection of textual documents), and it then predicts new text that follows. For example, language modeling can output an entirely new science fiction story by using existing science fiction stories for training.

Recently, OpenAI released a language model known as OpenAI-GPT-2, which can generate incredibly realistic text. Furthermore, this task plays a very important role in understanding language, which helps a downstream decision-support model get off the ground quickly.

  • Question Answering (QA): QA techniques possess a high commercial value, and such techniques are found at the foundation of chatbots and VAs (for example, Google Assistant and Apple Siri). Chatbots have been adopted by many companies for customer support. Chatbots can be used to answer and resolve straightforward customer concerns (for example, changing a customer’s monthly mobile plan), which can be solved without human intervention. QA touches upon many other aspects of NLP such as information retrieval and knowledge representation. Consequently, all this makes developing a QA system very difficult.
  • Machine Translation (MT): MT is the task of transforming a sentence/phrase from a source language (for example, German) to a target language (for example, English). This is a very challenging task, as different languages have different syntactical structures, which means that it is not a one-to-one transformation. Furthermore, word-to-word relationships between languages can be one-to-many, one-to-one, many-to-one, or many-to-many. This is known as the word alignment problem in MT literature.

Finally, to develop a system that can assist a human in day-to-day tasks (for example, VA or a chatbot) many of these tasks need to be orchestrated in a seamless manner. As we saw in the previous example where the user asks, “Can you show me a good Italian restaurant nearby?” several different NLP tasks, such as speech-to-text conversion, semantic and sentiment analyses, question answering, and machine translation, need to be completed. In Figure 1.1, we provide a hierarchical taxonomy of different NLP tasks categorized into several different types. It is a difficult task to attribute an NLP task to a single classification. Therefore, you can see some tasks spanning multiple categories. We will split the categories into two main types: language-based (light-colored with black text) and problem formulation-based (dark-colored with white text). The linguistic breakdown has two categories: syntactic (structure-based) and semantic (meaning-based). The problem formulation-based breakdown has three categories: preprocessing tasks (tasks that are performed on text data before feeding to a model), discriminative tasks (tasks where we attempt to assign an input text to one or more categories from a set of predefined categories) and generative tasks (tasks where we attempt to generate a new textual output). Of course, this is one classification among many. But it will show how difficult it is to assign a specific NLP task to a specific category.

Figure 1.1: A taxonomy of the popular tasks of NLP categorized under broader categories

Having understood the various tasks in NLP, let us now move on to understand how we can solve these tasks with the help of machines. We will discuss both the traditional method and the deep-learning-based approach.


The traditional approach to Natural Language Processing

The traditional or classical approach to solving NLP is a sequential flow of several key steps, and it is a statistical approach. When we take a closer look at a traditional NLP learning model, we will be able to see a set of distinct tasks taking place, such as preprocessing data by removing unwanted data, feature engineering to get good numerical representations of textual data, learning to use machine learning algorithms with the aid of training data, and predicting outputs for novel, unseen data. Of these, feature engineering was the most time-consuming and crucial step for obtaining good performance on a given NLP task.

Understanding the traditional approach

The traditional approach to solving NLP tasks involves a collection of distinct subtasks. First, the text corpora need to be preprocessed, focusing on reducing the vocabulary and distractions.

By distractions, I refer to the things that distract the algorithm (for example, punctuation marks and stop words) from capturing the vital linguistic information required for the task.

Next come several feature engineering steps. The main objective of feature engineering is to make learning easier for the algorithms. Often the features are hand-engineered and biased toward the human understanding of a language. Feature engineering was of the utmost importance for classical NLP algorithms, and consequently, the best-performing systems often had the best-engineered features. For example, for a sentiment classification task, you can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node/subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database that can provide insights into how different words are related to each other – e.g. synonyms) to develop better features. We will soon look at a simple feature engineering technique known as bag-of-words.

Next, the learning algorithm learns to perform well at the given task using the obtained features and, optionally, the external resources. For example, for a text summarization task, a parallel corpus containing common phrases and succinct paraphrases would be a good external resource. Finally, prediction occurs. Prediction is straightforward, where you will feed a new input and obtain the predicted label by forwarding the input through the learning model. The entire process of the traditional approach is depicted in Figure 1.2:


Figure 1.2: The general approach of classical NLP

Next, let’s discuss a use case where we use NLP to generate football game summaries.

Example – generating football game summaries

To gain an in-depth understanding of the traditional approach to NLP, let’s consider a task of automatic text generation from the statistics of a game of football. We have several sets of game statistics (for example, the score, penalties, and yellow cards) and the corresponding articles generated for that game by a journalist, as the training data. Let’s also assume that for a given game, we have a mapping from each statistical parameter to the most relevant phrase of the summary for that parameter. Our task here is that, given a new game, we need to generate a natural-looking summary of the game. Of course, this can be as simple as finding the best-matching statistics for the new game from the training data and retrieving the corresponding summary. However, there are more sophisticated and elegant ways of generating text.

If we were to incorporate machine learning to generate natural language, a sequence of operations, such as preprocessing the text, feature engineering, learning, and prediction, is likely to be performed.

Preprocessing: This step involves operations such as tokenization (for example, splitting “I went home” into “I”, “went”, “home”), stemming (for example, converting listened to listen), and removing punctuation (for example, ! and ;), in order to reduce the vocabulary (that is, the features), thus reducing the dimensionality of the data. Tokenization might appear trivial for a language such as English, as the words are isolated; however, this is not the case for certain languages such as Thai, Japanese, and Chinese, as words in these languages are not consistently delimited. Next, it is important to understand that stemming is not a trivial operation either. It might appear that stemming is a simple operation that relies on a simple set of rules such as removing ed from a verb (for example, the stemmed result of listened is listen); however, it requires more than a simple rule base to develop a good stemming algorithm, as stemming certain words can be tricky (for example, using rule-based stemming, the stemmed result of argued is argu). In addition, the effort required for proper stemming can vary for other languages.
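The preprocessing operations described above can be sketched in a few lines of plain Python. This is only an illustration, not the book's implementation: the function names are hypothetical, tokenization assumes space-delimited text, and the stemmer is a deliberately naive suffix-stripping rule that reproduces the argued/argu problem mentioned above.

```python
import string


def tokenize(text):
    """Split on whitespace; adequate only for space-delimited languages."""
    return text.split()


def strip_punctuation(token):
    """Remove leading and trailing ASCII punctuation from a token."""
    return token.strip(string.punctuation)


def naive_stem(token):
    """Strip a trailing 'ed' or 'ing' (a deliberately simplistic rule)."""
    for suffix in ("ed", "ing"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, lowercase, drop punctuation, and stem each token."""
    tokens = [strip_punctuation(t).lower() for t in tokenize(text)]
    return [naive_stem(t) for t in tokens if t]


print(preprocess("I listened; they argued!"))
# prints ['i', 'listen', 'they', 'argu']
```

The rule handles listened correctly but turns argued into argu, which is why practical stemmers need more than a single suffix rule.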

Feature engineering is used to transform raw text data into an appealing numerical representation so that a model can be trained on that data, for example, converting text into a bag-of-words representation or using n-gram representation, which we will discuss later. However, remember that state-of-the-art classical models rely on much more sophisticated feature engineering techniques.

The following are some of the feature engineering techniques:

Bag-of-words: This is a feature engineering technique that creates feature representations based on the word occurrence frequency. For example, let’s consider the following sentences:

  • Bob went to the market to buy some flowers
  • Bob bought the flowers to give to Mary

The vocabulary for these two sentences would be:

[“Bob”, “went”, “to”, “the”, “market”, “buy”, “some”, “flowers”, “bought”, “give”, “Mary”]

Next, we will create a feature vector of size V (vocabulary size) for each sentence, showing how many times each word in the vocabulary appears in the sentence. In this example, the feature vectors for the sentences would respectively be as follows:

[1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]

[1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]

A crucial limitation of the bag-of-words method is that it loses contextual information as the order of words is no longer preserved.
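As a minimal sketch of the bag-of-words computation above, the following plain-Python snippet builds the vocabulary in order of first appearance and counts word occurrences per sentence. The function names are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Collect unique words in order of first appearance."""
    vocab, seen = [], set()
    for sentence in sentences:
        for word in sentence.split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def bag_of_words(sentence, vocab):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]


sentences = [
    "Bob went to the market to buy some flowers",
    "Bob bought the flowers to give to Mary",
]
vocab = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]

print(vocab)
for vector in vectors:
    print(vector)
# prints the vocabulary, then
# [1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
# [1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]
```

Shuffling the words of either sentence leaves its vector unchanged, which is exactly the loss of word order noted above.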

n-gram: This is another feature engineering technique that breaks down text into smaller components consisting of n letters (or words). For example, 2-gram would break the text into two-letter (or two-word) entities. For example, consider this sentence:

Bob went to the market to buy some flowers

The letter level n-gram decomposition for this sentence is as follows:

[“Bo”, “ob”, “b ”, “ w”, “we”, “en”, ..., “me”, “e ”, “ f”, “fl”, “lo”, “ow”, “we”, “er”, “rs”]

The word-based n-gram decomposition is this:

[“Bob went”, “went to”, “to the”, “the market”, ..., “to buy”, “buy some”, “some flowers”]

The advantage in this representation (letter level) is that the vocabulary will be significantly smaller than if we were to use words as features for large corpora.
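Both decompositions can be produced with a sliding window. The sketch below is one straightforward way to do it in plain Python; the function names are hypothetical, and the word-level version assumes whitespace tokenization.

```python
def char_ngrams(text, n=2):
    """Slide a window of n characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Slide a window of n words over the whitespace-split text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "Bob went to the market to buy some flowers"

print(char_ngrams(sentence)[:4])   # prints ['Bo', 'ob', 'b ', ' w']
print(word_ngrams(sentence)[:3])   # prints ['Bob went', 'went to', 'to the']
```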

Next, we need to structure our data to be able to feed it into a learning model. For example, we will have data tuples of the form (a statistic, a phrase explaining the statistic) as follows:

Total goals = 4, “The game was tied with 2 goals for each team at the end of the first half”

Team 1 = Manchester United, “The game was between Manchester United and Barcelona”

Team 1 goals = 5, “Manchester United managed to get 5 goals”

The learning process may comprise three sub-modules: a Hidden Markov Model (HMM), a sentence planner, and a discourse planner. An HMM is a recurrent model that can be used to solve time-series problems. For example, generating text is a time-series problem as the order of generated words matters. In our example, an HMM might learn to model language (i.e. generate meaningful text) by training on a corpus of statistics and related phrases. We will train the HMM so that it produces a relevant sequence of text, given the statistics as the starting input. Once trained, the HMM can be used for inference in a recursive manner, where we start with a seed (e.g. a statistic) and predict the first word of the description, then use the predicted word to generate the next word, and so on.
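The recursive generation loop described above (predict a word, feed it back in, predict the next) can be illustrated with a much simpler stand-in for an HMM: a first-order Markov chain over words, which has no hidden states. Everything in this sketch is an assumption made for illustration, including the `<end>` marker, the greedy choice of the most frequent next word, and the alphabetical tie-breaking.

```python
from collections import Counter, defaultdict

END = "<end>"  # hypothetical end-of-phrase marker


def train_bigram_model(phrases):
    """Count, for each word, which words follow it in the training phrases."""
    transitions = defaultdict(Counter)
    for phrase in phrases:
        words = phrase.split() + [END]
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions


def generate(transitions, seed, max_words=20):
    """Start from a seed word and repeatedly append the most likely next word."""
    words = [seed]
    while len(words) < max_words:
        candidates = transitions.get(words[-1])
        if not candidates:
            break
        # Highest count wins; ties are broken alphabetically for determinism.
        next_word = min(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[0]
        if next_word == END:
            break
        words.append(next_word)
    return " ".join(words)


phrases = [
    "The game was tied with 2 goals for each team at the end of the first half",
    "The game was between Manchester United and Barcelona",
    "Manchester United managed to get 5 goals",
]
model = train_bigram_model(phrases)
print(generate(model, "Manchester"))  # prints Manchester United and Barcelona
```

A real HMM would additionally model hidden states, and the approach described above would condition generation on the input statistic rather than on a seed word.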

Next, we can have a sentence planner that corrects any syntactical or grammatical errors that might have been introduced by the model. For example, a sentence planner might take in the phrase, I go house and output I go home. For this, it can use a database of rules, which contains the correct way of conveying meanings, such as the need for a preposition between a verb and the word house.

Using the HMM and the sentence planner, we will have syntactically and grammatically correct sentences. Then, we need to collate these phrases in such a way that the essay made from the collection of phrases is human readable and flows well. For example, consider the three phrases, Player 10 of the Barcelona team scored a goal in the second half, Barcelona played against Manchester United, and Player 3 from Manchester United got a yellow card in the first half; having these sentences in this order does not make much sense. We would like to have them in this order: Barcelona played against Manchester United, Player 3 from Manchester United got a yellow card in the first half, and Player 10 of the Barcelona team scored a goal in the second half. To do this, we use a discourse planner; discourse planners can organize a set of messages so that their meaning is conveyed properly.
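To make the idea of a discourse planner concrete, here is a toy rule-based sketch. It assumes each message carries two hypothetical annotations, a `kind` ("overview" or "event") and the `half` in which it happened, and it orders messages with a single hand-written rule. Real discourse planners are considerably more elaborate.

```python
# Hypothetical ordering rule: the match overview comes first,
# followed by events sorted by the half in which they occurred.
KIND_PRIORITY = {"overview": 0, "event": 1}


def plan_discourse(messages):
    """Return the message texts in a reader-friendly order."""
    ordered = sorted(
        messages,
        key=lambda m: (KIND_PRIORITY[m["kind"]], m["half"]),
    )
    return [m["text"] for m in ordered]


messages = [
    {"kind": "event", "half": 2,
     "text": "Player 10 of the Barcelona team scored a goal in the second half"},
    {"kind": "overview", "half": 0,
     "text": "Barcelona played against Manchester United"},
    {"kind": "event", "half": 1,
     "text": "Player 3 from Manchester United got a yellow card in the first half"},
]

for line in plan_discourse(messages):
    print(line)
```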

Now, we can get a set of arbitrary test statistics and obtain an essay explaining the statistics by following the preceding workflow, which is depicted in Figure 1.3:


Figure 1.3: The classical approach to solving a language modeling task

Here, it is important to note that this is a very high-level explanation that only covers the main general-purpose components that are most likely to be included in traditional NLP. The details can largely vary according to the particular application we are interested in solving. For example, additional application-specific crucial components might be needed for certain tasks (a rule base and an alignment model in machine translation). However, in this book, we do not stress about such details as the main objective here is to discuss more modern ways of Natural Language Processing.

Drawbacks of the traditional approach

Let’s list several key drawbacks of the traditional approach as this would lay a good foundation for discussing the motivation for deep learning:

  • The preprocessing steps used in traditional NLP force a trade-off of potentially useful information embedded in the text (for example, punctuation and tense information) in order to make the learning feasible by reducing the vocabulary. Though preprocessing is still used in modern deep-learning-based solutions, it is not as crucial for them as it is for the traditional NLP workflow due to the large representational capacity of deep networks and their ability to make effective use of high-end hardware like GPUs.
  • Feature engineering is very labor-intensive. In order to design a reliable system, good features need to be devised. This process can be very tedious as different feature spaces need to be extensively explored and evaluated. Additionally, in order to effectively explore robust features, domain expertise is required, which can be scarce and expensive for certain NLP tasks.
  • Various external resources are needed for the traditional approach to perform well, and there are not many freely available ones. Such external resources often consist of manually created information stored in large databases. Creating one for a particular task can take several years, depending on the complexity of the task (for example, a machine translation rule base).

Now, let’s discuss how deep learning can help to solve NLP problems.


The deep learning approach to Natural Language Processing

I think it is safe to say that deep learning revolutionized machine learning, especially in fields such as computer vision, speech recognition, and of course, NLP. Deep models created a wave of paradigm shifts in many of the fields in machine learning, as deep models learned rich features from raw data instead of using limited human-engineered features. This consequently made the pesky and expensive feature engineering obsolete. With this, deep models made the traditional workflow more efficient, as deep models perform feature learning and task learning simultaneously. Moreover, due to the massive number of parameters (that is, weights) in a deep model, it can encompass significantly more features than a human could have engineered. However, deep models are considered a black box due to their poor interpretability. For example, understanding “how” deep models learn and “what” features they learn for a given problem is still an active area of research, and a growing body of work focuses on the interpretability of deep learning models.

A deep neural network is essentially an artificial neural network that has an input layer, many interconnected hidden layers in the middle, and finally, an output layer (for example, a classifier or a regressor). As you can see, this forms an end-to-end model from raw data to predictions. These hidden layers in the middle give the power to deep models as they are responsible for learning the good features from raw data, eventually succeeding at the task at hand. Let’s now understand the history of deep learning briefly.

History of deep learning

Let’s briefly discuss the roots of deep learning and how the field evolved to be a very promising technique for machine learning. In 1960, Hubel and Wiesel performed an interesting experiment and discovered that a cat’s visual cortex is made of simple and complex cells, and that these cells are organized in a hierarchical form. Also, these cells react differently to different stimuli. For example, simple cells are activated by variously oriented edges while complex cells are insensitive to spatial variations (for example, the orientation of the edge). This kindled the motivation for replicating a similar behavior in machines, giving rise to the concept of artificial neural networks.

In the years that followed, neural networks gained the attention of many researchers. In 1965, a neural network trained by a method known as the Group Method of Data Handling (GMDH) and based on the famous Perceptron by Rosenblatt, was introduced by Ivakhnenko and others. Later, in 1979, Fukushima introduced the Neocognitron, which planted the seeds for one of the most famous variants of deep models—Convolutional Neural Networks (CNNs). Unlike the perceptrons, which always took in a 1D input, a Neocognitron was able to process 2D inputs using convolution operations.

Artificial neural networks backpropagate the error signal to optimize the network parameters by computing the gradients of the loss with respect to the weights of a given layer. Then, the weights are updated by pushing them in the opposite direction of the gradient, in order to minimize the loss. For a layer further away from the output layer (i.e. where the loss is computed), the algorithm uses the chain rule to compute gradients. The chain rule used with many layers led to a practical problem known as the vanishing gradients problem, strictly limiting the potential number of layers (depth) of the neural network. The gradients of layers closer to the inputs (i.e. further away from the output layer), being very small, cause the model training to stop prematurely, leading to an underfitted model.
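The shrinking effect of the chain rule can be seen numerically in a toy setting. The sketch below assumes a chain of scalar layers with sigmoid activations and weights fixed at 1.0 (illustrative choices, not taken from the text). Because the derivative of the sigmoid is at most 0.25, each additional layer multiplies the gradient by a factor no larger than 0.25.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_through_layers(x, weights):
    """Return d(output)/d(x) for a chain of scalar sigmoid layers."""
    activation = x
    gradient = 1.0
    for w in weights:
        z = w * activation
        activation = sigmoid(z)
        # Chain rule: d sigmoid(w * a) / d a = sigmoid'(z) * w,
        # where sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
        gradient *= activation * (1.0 - activation) * w
    return gradient


for depth in (1, 5, 10, 20):
    print(depth, gradient_through_layers(0.5, [1.0] * depth))
```

With 20 layers the gradient reaching the input is below 1e-12 in this setup, so a weight update driven by it would be negligible.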

Then, in 2006, it was found that pretraining a deep neural network by minimizing the reconstruction error (obtained by trying to compress the input to a lower dimensionality and then reconstructing it back into the original dimensionality) for each layer of the network provides a good initial starting point for the weights of the neural network; this allows a consistent flow of gradients from the output layer to the input layer. This essentially allowed neural network models to have more layers without the ill effects of the vanishing gradient. Also, these deeper models were able to surpass traditional machine learning models in many tasks, mostly in computer vision (for example, test accuracy for the MNIST handwritten digit dataset). With this breakthrough, deep learning became the buzzword in the machine learning community.

Things started gaining progressive momentum when, in 2012, AlexNet (a deep convolutional neural network created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 with an error decrease of 10% from the previous best. During this time, advances were made in speech recognition, wherein state-of-the-art speech recognition accuracies were reported using deep neural networks. Furthermore, people began to realize that Graphics Processing Units (GPUs) enable more parallelism, which allows for faster training of larger and deeper networks compared with Central Processing Units (CPUs).

Deep models were further improved with better model initialization techniques (for example, Xavier initialization), making the time-consuming pretraining redundant. Also, better nonlinear activation functions, such as Rectified Linear Units (ReLUs), were introduced, which alleviated the adversities of the vanishing gradient in deeper models. Better optimization (or learning) techniques, such as the Adam optimizer, automatically tweaked individual learning rates of each parameter among the millions of parameters that we have in the neural network model, which rewrote the state-of-the-art performance in many different fields of machine learning, such as object classification and speech recognition. These advancements also allowed neural network models to have large numbers of hidden layers. The ability to increase the number of hidden layers (that is, to make the neural networks deep) is one of the primary contributors to the significantly better performance of neural network models compared with other machine learning models. Furthermore, better intermediate regularizers, such as batch normalization layers, have improved the performance of deep nets for many tasks.

Later, even deeper models such as ResNets, Highway Nets, and Ladder Nets were introduced, which had hundreds of layers and billions of parameters. It was possible to have such an enormous number of layers with the help of various empirically and theoretically inspired techniques. For example, ResNets use shortcut connections or skip connections to connect layers that are far apart, which minimizes the diminishing of gradients layer to layer, as discussed earlier.
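The benefit of a skip connection can be illustrated with a scalar toy model. The sketch below compares a plain chain of sigmoid layers with a chain in which each layer adds its input back to its output. The scalar layers, the sigmoid activation, and the weights of 1.0 are all illustrative assumptions; real ResNets wrap blocks of convolutional layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(x, weights, use_skip):
    """Return d(output)/d(x) for a chain of scalar layers.

    Plain layer:    a_next = sigmoid(w * a)
    Residual layer: a_next = a + sigmoid(w * a)
    """
    activation = x
    gradient = 1.0
    for w in weights:
        s = sigmoid(w * activation)
        local = s * (1.0 - s) * w  # derivative of sigmoid(w * a) w.r.t. a
        if use_skip:
            activation = activation + s
            gradient *= 1.0 + local  # the identity path contributes the 1
        else:
            activation = s
            gradient *= local
    return gradient


depth = 20
print("plain:", chain_gradient(0.5, [1.0] * depth, use_skip=False))
print("skip: ", chain_gradient(0.5, [1.0] * depth, use_skip=True))
```

In the residual chain every per-layer factor is at least 1 (for positive weights), so the gradient reaching the input does not shrink with depth in this toy setting.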

The current state of deep learning and NLP

Many different deep models have seen the light since their inception in the early 2000s. Even though they share a resemblance, such as all of them using nonlinear transformations of the inputs and parameters, the details can vary vastly. For example, a CNN can learn from two-dimensional data (for example, RGB images) as it is, while a multilayer perceptron model requires the input to be flattened into a one-dimensional vector, causing the loss of important spatial information.

When processing text, as one of the most intuitive interpretations of text is to perceive it as a sequence of characters, the learning model should be able to do time-series modeling, thus requiring the memory of the past. To understand this, think of a language modeling task; the next word for the word cat should be different from the next word for the word climbed. One such popular model that encompasses this ability is known as a Recurrent Neural Network (RNN). We will see in Chapter 6, Recurrent Neural Networks, how exactly RNNs achieve this by going through interactive exercises.

It should be noted that memory is not a trivial operation that is inherent to a learning model. Conversely, ways of persisting memory should be carefully designed.

Also, the term memory should not be confused with the learned weights of a non-sequential deep network, which only looks at the current input. A sequential model (for example, an RNN), by contrast, uses both the learned weights and the previous element of the sequence to predict the next output.
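The distinction between weights and memory can be illustrated with a toy recurrent step. This is a minimal sketch assuming a scalar input, a scalar state, and a tanh activation; the weight values and the function names are arbitrary choices for illustration, not a full RNN implementation.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    # The new state depends on the current input AND the previous state.
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, h0=0.0):
    # The same weights are reused at every step; only the state h changes.
    h = h0
    for x_t in sequence:
        h = rnn_step(x_t, h)
    return h

# Both sequences end with the same input (1.0) but have different histories.
state_a = run_rnn([1.0, 1.0, 1.0])
state_b = run_rnn([-1.0, -1.0, 1.0])
print(state_a, state_b)
```

The weights are identical in both runs, yet the final states differ because the state carries information about earlier inputs. A non-sequential network given only the last input would produce the same output in both cases.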

One prominent drawback of RNNs is that they cannot remember more than a few (approximately seven) time steps, thus lacking long-term memory. Long Short-Term Memory (LSTM) networks are an extension of RNNs that encapsulate long-term memory. Therefore, LSTMs are often preferred over standard RNNs nowadays. We will peek under the hood in Chapter 7, Understanding Long Short-Term Memory Networks, to understand them better.

Finally, a model known as the Transformer was introduced by Google fairly recently, and it has outperformed many of the previous state-of-the-art models, such as LSTMs, on a plethora of NLP tasks. Previously, both recurrent models (for example, LSTMs) and convolutional models (for example, CNNs) dominated the NLP domain. For example, CNNs have been used for sentence classification, machine translation, and sequence-to-sequence learning tasks. However, Transformers take an entirely different approach: they use neither recurrence nor convolution, but an attention mechanism. The attention mechanism allows the model to look at the entire sequence at once to produce a single output. For example, consider the sentence “The animal didn’t cross the road because it was tired.” While generating intermediate representations for the word “it,” it would be useful for the model to learn that “it” refers to the “animal.” The attention mechanism allows the Transformer model to learn such relationships. This capability cannot be replicated with standard recurrent models or convolutional models. We will investigate these models further in Chapter 10, Transformers, and Chapter 11, Image Captioning with Transformers.
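A rough sketch of the attention idea follows. It implements scaled dot-product self-attention in plain Python under simplifying assumptions: the queries, keys, and values are the input vectors themselves (a real Transformer applies learned projection matrices and multiple heads), and the two-dimensional embeddings are made up purely for illustration.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    # Each position attends to every position in the sequence at once.
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        scores = [dot(q, k) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["animal", "road", "it"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # made-up vectors
_, weights = self_attention(embeddings)
for token, w in zip(tokens, weights[2]):
    print(f"it -> {token}: {w:.2f}")
```

With these invented vectors, the row of weights for “it” places more weight on “animal” than on “road”. In a trained Transformer, such relationships are learned from data rather than set by hand.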

In summary, we can mainly separate deep networks into three categories: the non-sequential models that deal with only a single input at a time for both training and prediction (for example, image classification), the sequential models that cope with sequences of inputs of arbitrary length (for example, text generation where a single word is a single input), and finally, attention-based models that look at the whole sequence at once, such as the Transformer, as well as BERT and XLNet, which are pretrained models based on the Transformer architecture. We can categorize non-sequential (also called feed-forward) models into deep networks (roughly fewer than 20 layers) and very deep networks (which can have hundreds of layers). The sequential models are categorized into short-term memory models (for example, RNNs), which can only memorize short-term patterns, and long-term memory models, which can memorize longer patterns. In Figure 1.4, we outline the discussed taxonomy. You don’t have to understand these different deep learning models fully at this point, but the figure illustrates their diversity:


Figure 1.4: A general taxonomy of the most commonly used deep learning methods, categorized into several classes

Now, let’s take our first steps toward understanding the inner workings of a neural network.

Understanding a simple deep model – a fully connected neural network

Now, let’s have a closer look at a deep neural network in order to gain a better understanding. Although there are numerous variants of deep models, let’s look at one of the earliest (dating back to the 1950s and 1960s), known as a fully connected neural network (FCNN), sometimes called a multilayer perceptron. Figure 1.5 depicts a standard three-layer FCNN.

The goal of an FCNN is to map an input (for example, an image or a sentence) to a certain label or annotation (for example, the object category for images). This is achieved by using an input x to compute h, a hidden representation of x, using a transformation such as $h = \sigma(Wx + b)$; here, W and b are the weights and bias of the FCNN, respectively, and $\sigma$ is the sigmoid activation function. Neural networks use non-linear activation functions at every layer. Sigmoid activation is one such activation. It is an element-wise transformation applied to the output of a layer, where the sigmoidal output of x is given by $\sigma(x) = 1 / (1 + e^{-x})$.

Next, a classifier is placed on top of the FCNN that gives the ability to leverage the learned features in hidden layers to classify inputs. The classifier is a part of the FCNN and is yet another layer with some weights, $W_s$, and a bias, $b_s$. We can then calculate the final output of the FCNN as $y = \mathrm{softmax}(W_s h + b_s)$. For example, a softmax classifier can be used for multi-class classification problems. It provides a normalized representation of the scores output by the classifier layer. That is, it will produce a valid probability distribution over the classes in the classifier layer. The label is considered to be the output node with the highest softmax value.

Then, with this, we can define a classification loss that is calculated as the difference between the predicted output label and the actual output label. An example of such a loss function is the mean squared loss. You don’t have to worry if you don’t understand the actual intricacies of the loss function. We will discuss quite a few of them in later chapters. Next, the neural network parameters, W, b, $W_s$, and $b_s$, are optimized using a standard stochastic optimizer (for example, stochastic gradient descent) to reduce the classification loss of all the inputs. Figure 1.5 depicts the process explained in this section for a three-layer FCNN. We will walk through the details on how to use such a model for NLP tasks, step by step, in Chapter 3, Word2vec – Learning Word Embeddings.


Figure 1.5: An example of a fully connected neural network (FCNN)
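The forward pass described above can be sketched in plain Python. This is a minimal illustration, assuming a sigmoid hidden layer followed by a softmax classifier layer and a mean squared loss; the parameter values are arbitrary, and the helper names (`affine`, `fcnn_forward`, `mean_squared_loss`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affine(weights, x, bias):
    # Computes weights @ x + bias, where weights is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fcnn_forward(x, W, b, Ws, bs):
    h = [sigmoid(z) for z in affine(W, x, b)]  # hidden representation of x
    y = softmax(affine(Ws, h, bs))             # probabilities over classes
    return h, y

def mean_squared_loss(y_pred, y_true):
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

# Arbitrary illustrative parameters: 3 inputs, 2 hidden units, 2 classes.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b = [0.0, 0.1]
Ws = [[1.0, -1.0], [-1.0, 1.0]]
bs = [0.0, 0.0]

h, y = fcnn_forward([1.0, 2.0, 3.0], W, b, Ws, bs)
predicted_label = y.index(max(y))  # the node with the highest softmax value
loss = mean_squared_loss(y, [1.0, 0.0])
print(h, y, predicted_label, loss)
```

In practice, an optimizer would adjust W, b, Ws, and bs to reduce the loss over many inputs; here the parameters are fixed so that only the forward computation is shown.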

Let’s look at an example of how to use a neural network for a sentiment analysis task. Consider that we have a dataset where the input is a sentence expressing a positive or negative opinion about a movie and a corresponding label saying if the sentence is actually positive (1) or negative (0). Then, we are given a test dataset, where we have single-sentence movie reviews, and our task is to classify these new sentences as positive or negative.

It is possible to use a neural network (which can be deep or shallow, depending on the difficulty of the task) for this task by adhering to the following workflow:

  1. Tokenize the sentence by words.
  2. Convert the sentences into a fixed-size numerical representation (for example, a Bag-of-Words representation). A fixed-size representation is needed because fully connected neural networks require a fixed-size input.
  3. Feed the numerical inputs to the neural network, predict the output (positive or negative), and compare that with the true target.
  4. Optimize the neural network using a desired loss function.
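The four steps above can be sketched end to end with a toy example. The following is an illustration under stated assumptions, not the approach used later in the book: the four training sentences are invented, the model is the shallowest possible network (a single sigmoid output unit), and it is trained with stochastic gradient descent on a cross-entropy loss. All function names are hypothetical.

```python
import math

def tokenize(sentence):
    # Step 1: tokenize the sentence by words.
    return sentence.lower().split()

def build_vocab(sentences):
    vocab = {}
    for s in sentences:
        for tok in tokenize(s):
            vocab.setdefault(tok, len(vocab))
    return vocab

def bag_of_words(sentence, vocab):
    # Step 2: fixed-size count vector; unknown words are ignored.
    vec = [0.0] * len(vocab)
    for tok in tokenize(sentence):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def predict(x, w, b):
    # Step 3: probability that the sentence is positive.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, epochs=200, lr=0.5):
    # Step 4: stochastic gradient descent on the cross-entropy loss.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            err = predict(x, w, b) - t  # gradient of the loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

train_sentences = ["a great movie", "a terrible movie",
                   "great acting", "terrible plot"]
train_labels = [1, 0, 1, 0]
vocab = build_vocab(train_sentences)
X = [bag_of_words(s, vocab) for s in train_sentences]
w, b = train(X, train_labels)

for review in ["a great film", "a terrible film"]:
    p = predict(bag_of_words(review, vocab), w, b)
    print(review, "->", "positive" if p > 0.5 else "negative")
```

A real dataset would be far larger, and a deeper network or a richer representation than Bag-of-Words would usually be needed, but the workflow stays the same.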

In this section we looked at deep learning in more detail. We looked at the history and the current state of NLP. Finally, we looked at a fully connected neural network (a type of deep learning model) in more detail.

Now that we’ve introduced NLP, its tasks, and how approaches to it have evolved over the years, let’s take a moment to look at the technical tools required for the rest of this book.


Introduction to the technical tools

In this section, you will be introduced to the technical tools that will be used in the exercises of the following chapters. First, we will present a brief introduction to the main tools provided. Next, we will present a rough guide on how to install each tool along with hyperlinks to detailed guides provided by the official websites. Additionally, we will share tips on how to make sure that the tools were installed properly.

Description of the tools

We will use Python as the coding/scripting language. Python is a very versatile, easy-to-set-up coding language that is heavily used by the scientific and machine learning communities.

Additionally, there are numerous scientific libraries built for Python, catering to areas ranging from deep learning to probabilistic inference to data visualization. TensorFlow is one such library that is well known among the deep learning community, providing many basic and advanced operations that are useful for deep learning. Next, we will use Jupyter Notebook in all our exercises as it provides a rich and interactive environment for coding compared to using Python scripts. We will also use pandas, NumPy, and scikit-learn, three popular libraries for Python, for various miscellaneous purposes such as data preprocessing. Another library we will be using for various text-related operations is NLTK, the Python Natural Language Toolkit. Finally, we will use Matplotlib for data visualization.

Installing Anaconda and Python

Python is hassle-free to install in any of the commonly used operating systems, such as Windows, macOS, or Linux. We will use Anaconda to set up Python, as it does all the laborious work for setting up Python as well as the essential libraries.

To install Anaconda, follow these steps:

  1. Download Anaconda from
  2. Select the appropriate OS and download Python 3.7
  3. Install Anaconda by following the instructions at

To check whether Anaconda was properly installed, open a Terminal window (Command Prompt in Windows), and then run the following command:

conda --version

If installed properly, the version of the current Anaconda distribution should be shown in the Terminal.

Creating a Conda environment

One of the attractive features of Anaconda is that it allows you to create multiple Conda, or virtual, environments. Each Conda environment can have its own environment variables and Python libraries. For example, one Conda environment can be created to run TensorFlow 1.x, whereas another can run TensorFlow 2.x. This is great because it allows you to separate your development environments from any changes taking place in the host’s Python installation. Then, you can activate or deactivate Conda environments depending on which environment you want to use.

To create a Conda environment, follow these instructions:

  1. Create the environment by running conda create -n packt.nlp.2 python=3.7 in the terminal window.
  2. Change directory (cd) to the project directory.
  3. Activate the new Conda environment by entering conda activate packt.nlp.2 in the terminal (on Windows, activate packt.nlp.2 may also work from the Anaconda Prompt). If successfully activated, you should see (packt.nlp.2) appearing before the user prompt in the terminal.
  4. Install the required libraries using one of the following options:
     • If you have a GPU, use pip install -r requirements-base.txt -r requirements-tf-gpu.txt
     • If you do not have a GPU, use pip install -r requirements-base.txt -r requirements-tf.txt

Next, we’ll discuss some prerequisites for GPU support for TensorFlow.

TensorFlow (GPU) software requirements

If you are using the TensorFlow GPU version, you will need to satisfy certain software requirements such as installing CUDA 11.0. An exhaustive list is available at

Accessing Jupyter Notebook

After running the pip install command, you should have Jupyter Notebook available in the Conda environment. To check whether Jupyter Notebook is properly installed and can be accessed, follow these steps:

  1. Open a Terminal window.
  2. Activate the packt.nlp.2 Conda environment, if it is not already active, by running conda activate packt.nlp.2
  3. Run the command: jupyter notebook

You should be presented with a new browser window that looks like Figure 1.6:


Figure 1.6: Jupyter Notebook installed successfully

Verifying the TensorFlow installation

In this book, we are using TensorFlow 2.7.0. It is important that you install the exact version used in the book as TensorFlow can undergo many changes while migrating from one version to the other. TensorFlow should be installed in the packt.nlp.2 Conda environment if everything went well. If you are having trouble installing TensorFlow, you can find guides and troubleshooting instructions at

To check whether TensorFlow installed properly, follow these steps:

  1. Open Command Prompt in Windows or Terminal in Linux or macOS.
  2. Activate the packt.nlp.2 Conda environment.
  3. Type python to enter the Python prompt. You should now see the Python version right below. Make sure that you are using Python 3.
  4. Next, enter the following commands:
    import tensorflow as tf
    print(tf.__version__)

If all went well, you should not have any errors (there might be warnings if your computer does not have a dedicated GPU, but you can ignore them) and TensorFlow version 2.7.0 should be shown.
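If you prefer to script this check, the following sketch compares the installed version of a module against the one expected by the book. It is only an illustration: the helper names (`installed_version`, `check_versions`) are hypothetical, and it relies on the module exposing a `__version__` attribute, which TensorFlow does.

```python
import importlib

def installed_version(module_name):
    # Return the module's version string, or None if it cannot be imported.
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

def check_versions(expected):
    # Map each module name to (installed, expected, matches).
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        report[name] = (found, wanted, found == wanted)
    return report

expected = {"tensorflow": "2.7.0"}  # the version used in this book
for name, (found, wanted, ok) in check_versions(expected).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={found}, expected={wanted} [{status}]")
```

Run the script inside the activated packt.nlp.2 environment; a MISMATCH line indicates that the module is missing or that a different version is installed.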

Many cloud-based computational platforms are also available, where you can set up your own machine with various customizations (operating system, GPU card type, number of GPU cards, and so on). Many are migrating to such cloud-based services due to the following benefits:

  • More customization options
  • Less maintenance effort
  • No infrastructure requirements

Several popular cloud-based computational platforms are as follows:

Google Colab is a great cloud-based platform that allows you to write TensorFlow code and execute it on CPU/GPU hardware for free.



In this chapter, we broadly explored NLP to get an impression of the kind of tasks involved in building a good NLP-based system. First, we explained why we need NLP and then discussed various tasks of NLP to generally understand the objective of each task and how difficult it is to succeed at them.

After that, we looked at the classical approach to solving NLP problems and went into the details of the workflow using an example of generating sport summaries for football games. We saw that the traditional approach usually involves cumbersome and tedious feature engineering. For example, in order to check the correctness of a generated phrase, we might need to generate a parse tree for that phrase. Then, we discussed the paradigm shift that transpired with deep learning and saw how deep learning made the feature engineering step obsolete. We started with a bit of time-traveling to go back to the inception of deep learning and artificial neural networks and worked our way through to the massive modern networks with hundreds of hidden layers. Afterward, we walked through a simple example illustrating a deep model (a multilayer perceptron) to understand the mathematical wizardry taking place in such a model (on the surface, of course!).

With a foundation in both the traditional and modern ways of approaching NLP, we then discussed the roadmap to understand the topics we will be covering in the book, from learning word embeddings to mighty LSTMs, and to state-of-the-art Transformers! Finally, we set up our virtual Conda environment by installing Python, scikit-learn, Jupyter Notebook, and TensorFlow.

In the next chapter, you will learn the basics of TensorFlow. By the end of the chapter, you should be comfortable with writing a simple algorithm that can take some input, transform the input through a defined function and output the result.

To access the code files for this book, visit our GitHub page at:


About the Author
  • Thushan Ganegedara

    Thushan is a seasoned ML practitioner with 4+ years of experience in the industry. Currently, he is a senior machine learning engineer at Canva, an Australian startup that founded the online visual design software, Canva, serving millions of customers. His efforts are particularly concentrated in the search and recommendations group, working on both visual and textual content. Prior to Canva, Thushan was a senior data scientist at QBE Insurance, an Australian insurance company, where he developed ML solutions for use cases related to insurance claims. He also led efforts in developing a Speech2Text pipeline there. He obtained his PhD specializing in machine learning from the University of Sydney in 2018.
