
Natural Language Processing with TensorFlow

By Motaz Saad, Thushan Ganegedara
About this book
Natural language processing (NLP) supplies the majority of data available to deep learning applications, while TensorFlow is the most important deep learning framework currently available. Natural Language Processing with TensorFlow brings TensorFlow and NLP together to give you invaluable tools to work with the immense volume of unstructured data in today's data streams, and to apply these tools to specific NLP tasks. Thushan Ganegedara starts by giving you a grounding in NLP and TensorFlow basics. You'll then learn how to use Word2vec, including advanced extensions, to create word embeddings that turn sequences of words into vectors accessible to deep learning algorithms. Chapters on classical deep learning algorithms, like convolutional neural networks (CNNs) and recurrent neural networks (RNNs), demonstrate important NLP tasks such as sentence classification and language generation. You will learn how to apply high-performance RNN models, like long short-term memory (LSTM) cells, to NLP tasks. You will also explore neural machine translation and implement a neural machine translator. After reading this book, you will have an understanding of NLP, the skills to apply TensorFlow in deep learning NLP applications, and the ability to perform specific NLP tasks.
Publication date: May 2018
Publisher: Packt
Pages: 472
ISBN: 9781788478311

 

Chapter 1. Introduction to Natural Language Processing

Natural Language Processing (NLP) is an important tool for understanding and processing the immense volume of unstructured data in today's world. Recently, deep learning has been widely adopted for many NLP tasks because of the remarkable performance that deep learning algorithms have shown in a plethora of challenging tasks, such as image classification, speech recognition, and realistic text generation. TensorFlow, in turn, is one of the most intuitive and efficient deep learning frameworks currently in existence. This book will enable aspiring deep learning developers to handle massive amounts of data using NLP and TensorFlow.

In this chapter, we will provide an introduction to NLP and to the rest of the book. We will answer the question, "What is Natural Language Processing?" Also, we'll look at some of its most important uses. We will also consider the traditional approaches and the more recent deep learning-based approaches to NLP, including a Fully-Connected Neural Network (FCNN). Finally, we will conclude with an overview of the rest of the book and the technical tools we will be using.

 

What is Natural Language Processing?


According to IBM, 2.5 exabytes (1 exabyte = 1,000,000,000 gigabytes) of data were generated every day in 2017, and this figure is still growing as this book is being written. To put that into perspective, if all the human beings in the world had to process that data, it would be roughly 300 MB for each of us to process every day. Of all this data, a large fraction is unstructured text and speech, as millions of emails and social media posts are created and phone calls are made every day.

These statistics provide a good basis for us to define what NLP is. Simply put, the goal of NLP is to make machines understand our spoken and written languages. Moreover, NLP is ubiquitous and is already a large part of human life. Virtual Assistants (VAs), such as Google Assistant, Cortana, and Apple Siri, are largely NLP systems. Numerous NLP tasks take place when one asks a VA, "Can you show me a good Italian restaurant nearby?". First, the VA needs to convert the utterance to text (that is, speech-to-text). Next, it must understand the semantics of the request (for example, the user is looking for a good restaurant serving Italian cuisine) and formulate a structured request (for example, cuisine = Italian, rating = 3-5, distance < 10 km). Then, the VA must search for restaurants, filtering by location and cuisine, and sort them by the ratings received. To calculate an overall rating for a restaurant, a good NLP system may look at both the rating and the text description provided by each user. Finally, once the user is at the restaurant, the VA might assist the user by translating various menu items from Italian to English. This example shows that NLP has become an integral part of human life.

It should be understood that NLP is an extremely challenging field of research, as words and semantics have a highly complex nonlinear relationship, and it is even more difficult to capture this information as a robust numerical representation. To make matters worse, each language has its own grammar, syntax, and vocabulary. Therefore, processing textual data involves various complex tasks such as text parsing (for example, tokenization and stemming), morphological analysis, word sense disambiguation, and understanding the underlying grammatical structure of a language. For example, in the two sentences, I went to the bank and I walked along the river bank, the word bank has two entirely different meanings. To distinguish (or disambiguate) the word bank, we need to understand the context in which the word is being used. Machine learning has become a key enabler for NLP, helping to accomplish the aforementioned tasks through machines.

 

Tasks of Natural Language Processing


NLP has a multitude of real-world applications. A good NLP system is one that performs many NLP tasks well. When you search for today's weather on Google or use Google Translate to find out how to say, "How are you?" in French, you rely on a subset of such tasks in NLP. We will list some of the most ubiquitous tasks here, and this book covers most of them (a short code illustration follows the list):

  • Tokenization: Tokenization is the task of separating a text corpus into atomic units (for example, words). Although it may seem trivial, tokenization is an important task. For example, in the Japanese language, words are not delimited by spaces or punctuation marks.

  • Word-sense Disambiguation (WSD): WSD is the task of identifying the correct meaning of a word. For example, in the sentences, The dog barked at the mailman, and Tree bark is sometimes used as a medicine, the word bark has two different meanings. WSD is critical for tasks such as question answering.

  • Named Entity Recognition (NER): NER attempts to extract entities (for example, person, location, and organization) from a given body of text or a text corpus. For example, the sentence, John gave Mary two apples at school on Monday, will be transformed to [John]_name gave [Mary]_name [two]_number apples at [school]_organization on [Monday]_time. NER is an imperative topic in fields such as information retrieval and knowledge representation.

  • Part-of-Speech (PoS) tagging: PoS tagging is the task of assigning words to their respective parts of speech. It can either be basic tags such as noun, verb, adjective, adverb, and preposition, or it can be granular such as proper noun, common noun, phrasal verb, verb, and so on.

  • Sentence/Synopsis classification: Sentence or synopsis (for example, movie reviews) classification has many use cases such as spam detection, news article classification (for example, political, technology, and sport), and product review ratings (that is, positive or negative). This is achieved by training a classification model with labeled data (that is, reviews annotated by humans, with either a positive or negative label).

  • Language generation: In language generation, a learning model (for example, a neural network) is trained with text corpora (a large collection of textual documents) and learns to predict new text that follows. For example, language generation can output an entirely new science fiction story by using existing science fiction stories for training.

  • Question Answering (QA): QA techniques possess a high commercial value, and such techniques are found at the foundation of chatbots and VAs (for example, Google Assistant and Apple Siri). Chatbots have been adopted by many companies for customer support. Chatbots can be used to answer and resolve straightforward customer concerns (for example, changing a customer's monthly mobile plan), which can be solved without human intervention. QA touches upon many other aspects of NLP, such as information retrieval and knowledge representation. Consequently, all this makes developing a QA system very difficult.

  • Machine Translation (MT): MT is the task of transforming a sentence/phrase from a source language (for example, German) to a target language (for example, English). This is a very challenging task, as different languages have highly different morphological structures, which means that it is not a one-to-one transformation. Furthermore, word-to-word relationships between languages can be one-to-many, one-to-one, many-to-one, or many-to-many. This is known as the word alignment problem in the MT literature.
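
As a quick, concrete illustration of a couple of these tasks (tokenization and PoS tagging), here is a minimal sketch using NLTK (the Python natural language toolkit introduced later in this chapter); it is only an illustration and not part of the book's TensorFlow exercises:

    import nltk

    # Fetch the models NLTK needs for tokenization and PoS tagging
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    sentence = "John gave Mary two apples at school on Monday"

    tokens = nltk.word_tokenize(sentence)  # Tokenization
    print(tokens)  # ['John', 'gave', 'Mary', 'two', 'apples', ...]

    tags = nltk.pos_tag(tokens)            # Part-of-Speech tagging
    print(tags)    # [('John', 'NNP'), ('gave', 'VBD'), ('Mary', 'NNP'), ...]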

Finally, to develop a system that can assist a human in day-to-day tasks (for example, VA or a chatbot) many of these tasks need to be performed together. As we saw in the previous example where the user asks, "Can you show me a good Italian restaurant nearby?" several different NLP tasks, such as speech-to-text conversion, semantic and sentiment analyses, question answering, and machine translation, need to be completed. In Figure 1.1, we provide a hierarchical taxonomy of different NLP tasks categorized into several different types. We first have two broad categories: analysis (analyzing existing text) and generation (generating new text) tasks. Then we divide analysis into three different categories: syntactic (language structure-based tasks), semantic (meaning-based tasks), and pragmatic (open problems difficult to solve):

Figure 1.1: A taxonomy of the popular tasks of NLP categorized under broader categories

Having understood the various tasks in NLP, let us now move on to understand how we can solve these tasks with the help of machines.

 

The traditional approach to Natural Language Processing


The traditional or classical approach to solving NLP is a sequential flow of several key steps, and it is a statistical approach. When we take a closer look at a traditional NLP learning model, we will be able to see a set of distinct tasks taking place, such as preprocessing data by removing unwanted content, feature engineering to obtain good numerical representations of textual data, learning with machine learning algorithms with the aid of training data, and predicting outputs for novel, unseen data. Of these, feature engineering was the most time-consuming and crucial step for obtaining good performance on a given NLP task.

Understanding the traditional approach

The traditional approach to solving NLP tasks involves a collection of distinct subtasks. First, the text corpora need to be preprocessed, focusing on reducing the vocabulary and distractions. By distractions, I refer to the things that distract the algorithm (for example, punctuation marks and stop words) from capturing the vital linguistic information required for the task.

Next, comes several feature engineering steps. The main objective of feature engineering is to make the learning easier for the algorithms. Often the features are hand-engineered and biased toward the human understanding of a language. Feature engineering was of utter importance for classical NLP algorithms, and consequently, the best performing systems often had the best engineered features. For example, for a sentiment classification task, you can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node/subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database) to develop better features. We will soon look at a simple feature engineering technique known as bag-of-words.

Next, the learning algorithm learns to perform well at the given task using the obtained features and optionally, the external resources. For example, for a text summarization task, a thesaurus that contains synonyms of words can be a good external resource. Finally, prediction occurs. Prediction is straightforward, where you will feed a new input and obtain the predicted label by forwarding the input through the learning model. The entire process of the traditional approach is depicted in Figure 1.2:

Figure 1.2: The general approach of classical NLP

Example – generating football game summaries

To gain an in-depth understanding of the traditional approach to NLP, let's consider a task of automatic text generation from the statistics of a game of football. We have several sets of game statistics (for example, score, penalties, and yellow cards) and the corresponding articles generated for that game by a journalist, as the training data. Let's also assume that for a given game, we have a mapping from each statistical parameter to the most relevant phrase of the summary for that parameter. Our task here is that, given a new game, we need to generate a natural looking summary about the game. Of course, this can be as simple as finding the best-matching statistics for the new game from the training data and retrieving the corresponding summary. However, there are more sophisticated and elegant ways of generating text.

If we were to incorporate machine learning to generate natural language, a sequence of operations, such as preprocessing the text, tokenization, feature engineering, learning, and prediction, is likely to be performed.

Preprocessing the text involves operations, such as stemming (for example, converting listened to listen) and removing punctuation (for example, ! and ;), in order to reduce the vocabulary (that is, features), thus reducing the memory requirement. It is important to understand that stemming is not a trivial operation. It might appear that stemming is a simple operation that relies on a simple set of rules such as removing ed from a verb (for example, the stemmed result of listened is listen); however, it requires more than a simple rule base to develop a good stemming algorithm, as stemming certain words can be tricky (for example, the stemmed result of argued is argue). In addition, the effort required for proper stemming can vary in complexity for other languages.
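
As a quick illustration of this point, here is a minimal sketch using NLTK's Porter stemmer (one of several off-the-shelf stemmers we could have chosen). Notice that such a rule-based stemmer returns crude stems such as argu rather than the dictionary form argue, which is precisely why good stemming is not trivial:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["listened", "argued", "flowers"]:
        print(word, "->", stemmer.stem(word))
    # listened -> listen
    # argued -> argu
    # flowers -> flower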

Tokenization is another preprocessing step that might need to be performed. Tokenization is the process of dividing a corpus into small entities (for example, words). This might appear trivial for a language such as English, as the words are isolated; however, this is not the case for certain languages such as Thai, Japanese, and Chinese, as these languages are not consistently delimited.

Feature engineering is used to transform raw text data into a numerical format that a model can be trained on, for example, converting text into a bag-of-words representation or using the n-gram representation, which we will discuss later. However, remember that state-of-the-art classical models rely on much more sophisticated feature engineering techniques.

The following are some of the feature engineering techniques:

Bag-of-words: This is a feature engineering technique that creates feature representations based on the word occurrence frequency. For example, let's consider the following sentences:

  • Bob went to the market to buy some flowers

  • Bob bought the flowers to give to Mary

The vocabulary for these two sentences would be:

["Bob", "went", "to", "the", "market", "buy", "some", "flowers", "bought", "give", "Mary"]

Next, we will create a feature vector of size V (vocabulary size) for each sentence showing how many times each word in the vocabulary appears in the sentence. In this example, the feature vectors for the sentences would respectively be as follows:

[1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]

[1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]

A crucial limitation of the bag-of-words method is that it loses contextual information as the order of words is no longer preserved.
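
The following minimal sketch (written purely for illustration) reproduces the feature vectors above; the vocabulary order follows the list given earlier:

    vocabulary = ["Bob", "went", "to", "the", "market", "buy",
                  "some", "flowers", "bought", "give", "Mary"]

    def bag_of_words(sentence, vocabulary):
        """Count how many times each vocabulary word appears in the sentence."""
        tokens = sentence.split()
        return [tokens.count(word) for word in vocabulary]

    print(bag_of_words("Bob went to the market to buy some flowers", vocabulary))
    # [1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
    print(bag_of_words("Bob bought the flowers to give to Mary", vocabulary))
    # [1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]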

n-gram: This is another feature engineering technique that breaks down text into smaller components consisting of n letters (or words). For example, 2-gram would break the text into two-letter (or two-word) entities. For example, consider this sentence:

Bob went to the market to buy some flowers

The letter-level 2-gram decomposition of this sentence is as follows:

["Bo", "ob", "b ", " w", "we", "en", ..., "me", "e "," f", "fl", "lo", "ow", "we", "er", "rs"]

The word-level 2-gram decomposition is as follows:

["Bob went", "went to", "to the", "the market", ..., "to buy", "buy some", "some flowers"]

The advantage of this representation (at the letter level) is that the vocabulary will be significantly smaller than if we were to use whole words as features for a large corpus.
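
Here is a minimal sketch, again written purely for illustration, that extracts the letter-level and word-level 2-grams shown above:

    def ngrams(items, n=2):
        """Return all contiguous subsequences of length n."""
        return [items[i:i + n] for i in range(len(items) - n + 1)]

    sentence = "Bob went to the market to buy some flowers"

    letter_bigrams = ngrams(sentence, n=2)  # slices of the raw string
    word_bigrams = [" ".join(pair) for pair in ngrams(sentence.split(), n=2)]

    print(letter_bigrams[:6])  # ['Bo', 'ob', 'b ', ' w', 'we', 'en']
    print(word_bigrams[:4])    # ['Bob went', 'went to', 'to the', 'the market']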

Next, we need to structure our data to be able to feed it into a learning model. For example, we will have data tuples of the form, (statistic, a phrase explaining the statistic) as follows:

Total goals = 4, "The game was tied with 2 goals for each team at the end of the first half"

Team 1 = Manchester United, "The game was between Manchester United and Barcelona"

Team 1 goals = 5, "Manchester United managed to get 5 goals"

The learning process may comprise three sub-modules: a Hidden Markov Model (HMM), a sentence planner, and a discourse planner. In our example, an HMM might learn the morphological structure and grammatical properties of the language by analyzing the corpus of related phrases. More specifically, we will concatenate each phrase in our dataset to form a sequence, where the first element is the statistic, followed by the phrase explaining it. Then, we will train the HMM by asking it to predict the next word, given the current sequence. Concretely, we will first input the statistic to the HMM and then take the prediction it makes; then, we will append that prediction to the current sequence and ask the HMM for another prediction, and so on. This enables the HMM to output meaningful phrases, given statistics.
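
The generation loop described above can be sketched as follows. The trained HMM is represented here by a hypothetical predict_next_word(sequence) function (not a real library call), which stands in for whatever model produces the next-word prediction:

    def generate_phrase(statistic, predict_next_word, max_words=20, end_token="</s>"):
        sequence = [statistic]           # start the sequence with the statistic
        for _ in range(max_words):
            next_word = predict_next_word(sequence)
            if next_word == end_token:   # stop when the model emits an end marker
                break
            sequence.append(next_word)   # feed the prediction back in
        return " ".join(sequence[1:])    # drop the statistic, keep the phrase

    # Usage with a toy stand-in for the trained HMM:
    canned = "Manchester United managed to get 5 goals".split() + ["</s>"]
    toy_model = lambda seq: canned[len(seq) - 1]
    print(generate_phrase("Team 1 goals = 5", toy_model))
    # Manchester United managed to get 5 goals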

Next, we can have a sentence planner that corrects any linguistic mistakes (for example, morphological or grammatical) that we might have in the phrases. For example, a sentence planner might correct the phrase, I go house, to I go home; it can use a database of rules that contains the correct ways of conveying meaning (for example, the need for a preposition between a verb and the word house).

Now, we can generate a set of phrases for a given set of statistics using the HMM. Then, we need to aggregate these phrases in such a way that the essay made from the collection of phrases is human readable and flows correctly. For example, consider the three phrases, Player 10 of the Barcelona team scored a goal in the second half, Barcelona played against Manchester United, and Player 3 from Manchester United got a yellow card in the first half; having these sentences in this order does not make much sense. We would like to have them in this order: Barcelona played against Manchester United, Player 3 from Manchester United got a yellow card in the first half, and Player 10 of the Barcelona team scored a goal in the second half. To do this, we use a discourse planner; discourse planners can order and structure a set of messages that need to be conveyed.

Now we can get a set of arbitrary test statistics and obtain an essay explaining the statistics by following the preceding workflow, which is depicted in Figure 1.3:

Figure 1.3: The steps of the classical approach for solving a language generation task

Here, it is important to note that this is a very high-level explanation that only covers the main general-purpose components most likely to be included in the traditional approach to NLP. The details can vary largely according to the particular application we are interested in solving. For example, certain tasks might need additional application-specific components (a rule base and an alignment model in machine translation). However, in this book, we do not stress such details, as the main objective here is to discuss more modern ways of natural language processing.

Drawbacks of the traditional approach

Let's list several key drawbacks of the traditional approach as this would lay a good foundation for discussing the motivation for deep learning:

  • The preprocessing steps used in traditional NLP force us to trade away potentially useful information embedded in the text (for example, punctuation and tense information) in order to make the learning feasible by reducing the vocabulary. Though preprocessing is still used in modern deep-learning-based solutions, it is not as crucial as it is for the traditional NLP workflow, due to the large representational capacity of deep networks.

  • Feature engineering needs to be performed by hand. In order to design a reliable system, good features need to be devised. This process can be very tedious, as different feature spaces need to be extensively explored. Additionally, effectively exploring robust features requires domain expertise, which can be scarce for certain NLP tasks.

  • Various external resources are needed for traditional systems to perform well, and there are not many freely available ones. Such external resources often consist of manually created information stored in large databases. Creating one for a particular task can take several years, depending on the complexity of the task (for example, a machine translation rule base).

 

The deep learning approach to Natural Language Processing


I think it is safe to assume that deep learning revolutionized machine learning, especially in fields such as computer vision, speech recognition, and, of course, NLP. Deep models created a wave of paradigm shifts in many fields of machine learning, as deep models learn rich features from raw data instead of using limited human-engineered features. This consequently rendered the pesky and expensive feature engineering obsolete. With this, deep models made the traditional workflow more efficient, as deep models perform feature learning and task learning simultaneously. Moreover, due to the massive number of parameters (that is, weights) in a deep model, it can encompass significantly more features than a human could have engineered. However, deep models are considered a black box, due to the poor interpretability of the model. For example, understanding how and what features are learned by deep models for a given problem still remains an open problem.

A deep model is essentially an artificial neural network that has an input layer, many interconnected hidden layers in the middle, and finally, an output layer (for example, a classifier or a regressor). As you can see, this forms an end-to-end model from raw data to predictions. These hidden layers in the middle give the power to deep models as they are responsible for learning the good features from raw data, eventually succeeding at the task at hand.

History of deep learning

Let's briefly discuss the roots of deep learning and how the field evolved to be a very promising technique for machine learning. In 1960, Hubel and Wiesel performed an interesting experiment and discovered that a cat's visual cortex is made of simple and complex cells, and that these cells are organized in a hierarchical form. Also, these cells react differently to different stimuli. For example, simple cells are activated by variously oriented edges, while complex cells are insensitive to spatial variations (for example, the orientation of the edge). This kindled the motivation for replicating a similar behavior in machines, giving rise to the concept of deep learning.

In the years that followed, neural networks gained the attention of many researchers. In 1965, a neural network trained by a method known as the Group Method of Data Handling (GMDH) and based on the famous Perceptron by Rosenblatt, was introduced by Ivakhnenko and others. Later, in 1979, Fukushima introduced the Neocognitron, which laid the base for one of the most famous variants of deep models—Convolution Neural Networks. Unlike the perceptrons, which always took in a 1D input, a neocognitron was able to process 2D inputs using convolution operations.

Artificial neural networks used to backpropagate the error signal to optimize the network parameters by computing a Jacobian matrix from one layer to the layer before it. However, the problem of vanishing gradients strictly limited the potential number of layers (depth) of the neural network. The gradients of the layers closer to the inputs being very small is known as the vanishing gradient phenomenon, and it arises from the application of the chain rule to compute the gradients (the Jacobian matrices) of the lower-layer weights. This, in turn, limited the plausible maximum depth of classical neural networks.
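
As a toy numerical illustration (not from the book) of why this happens: by the chain rule, the gradient reaching the lowest layers is a product of per-layer derivative terms, and since the sigmoid's derivative is at most 0.25, that product shrinks exponentially with depth:

    max_sigmoid_derivative = 0.25  # upper bound of the sigmoid's derivative
    for depth in [2, 5, 10, 20]:
        gradient_scale = max_sigmoid_derivative ** depth
        print("%2d layers -> gradient scaled by at most %.1e" % (depth, gradient_scale))
    # 20 layers -> gradient scaled by at most 9.1e-13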

Then, in 2006, it was found that pretraining a deep neural network by minimizing the reconstruction error (obtained by trying to compress the input to a lower dimensionality and then reconstructing it back into the original dimensionality) for each layer of the network provides a good initial starting point for the weights of the neural network; this allows a consistent flow of gradients from the output layer to the input layer. This essentially allowed neural network models to have more layers without the ill-effects of the vanishing gradient. Also, these deeper models were able to surpass traditional machine learning models in many tasks, mostly in computer vision (for example, test accuracy on the MNIST hand-written digit dataset). With this breakthrough, deep learning became the buzzword in the machine learning community.

Things started gaining progressive momentum when, in 2012, AlexNet (a deep convolution neural network created by Alex Krizhevsky (http://www.cs.toronto.edu/~kriz/), Ilya Sutskever (http://www.cs.toronto.edu/~ilya/), and Geoff Hinton) won the Large Scale Visual Recognition Challenge (LSVRC) 2012 with an error decrease of 10% from the previous best. During this time, advances were made in speech recognition, wherein state-of-the-art speech recognition accuracies were reported using deep neural networks. Furthermore, people began realizing that Graphics Processing Units (GPUs) enable more parallelism, which allows for faster training of larger and deeper networks compared with Central Processing Units (CPUs).

Deep models were further improved with better model initialization techniques (for example, Xavier initialization), making the time-consuming pretraining redundant. Also, better nonlinear activation functions, such as Rectified Linear Units (ReLUs), were introduced, which alleviated the ill-effects of the vanishing gradient in deeper models. Better optimization (or learning) techniques, such as Adam, which automatically tweak the individual learning rates of each of the millions of parameters in a neural network model, rewrote the state-of-the-art performance in many different fields of machine learning, such as object classification and speech recognition. These advancements also allowed neural network models to have large numbers of hidden layers. The ability to increase the number of hidden layers (that is, to make the neural networks deep) is one of the primary contributors to the significantly better performance of neural network models compared with other machine learning models. Furthermore, better intermediate regularizers, such as batch normalization layers, have improved the performance of deep nets for many tasks.

Later, even deeper models such as ResNets, Highway Nets, and Ladder Nets were introduced, which have hundreds of layers and many millions of parameters. It was possible to have such an enormous number of layers with the help of various empirically and theoretically inspired techniques. For example, ResNets use shortcut connections that connect layers which are far apart, minimizing the layer-to-layer diminishing of gradients discussed earlier.

The current state of deep learning and NLP

Many different deep models have seen the light of day since their inception in the early 2000s. Even though they share a resemblance, such as all of them using nonlinear transformations of the inputs and parameters, the details can vary vastly. For example, a Convolution Neural Network (CNN) can learn from two-dimensional data (for example, RGB images) as it is, while a multilayer perceptron model requires the input to be unwrapped into a one-dimensional vector, causing loss of important spatial information.

When processing text, as one of the most intuitive interpretations of text is to perceive it as a sequence of characters, the learning model should be able to do time-series modelling, thus requiring a memory of the past. To understand this, think of a language modelling task; the next word following the word cat should be different from the next word following the word climbed. One such popular model that encompasses this ability is known as a Recurrent Neural Network (RNN). We will see how exactly RNNs achieve this in Chapter 6, Recurrent Neural Networks, by going through interactive exercises.

It should be noted that memory is not a trivial operation that is inherent to a learning model; conversely, ways of persisting memory should be carefully designed. Also, the term memory should not be confused with the learned weights of a non-sequential deep network that only looks at the current input, whereas a sequential model (for example, an RNN) will look at both the learned weights and the previous element of the sequence to predict the next output.

One prominent drawback of RNNs is that they cannot remember more than a few (approximately seven) time steps, thus lacking long-term memory. Long Short-Term Memory (LSTM) networks are an extension of RNNs that encapsulate long-term memory. Therefore, LSTMs are often preferred over standard RNNs nowadays. We will peek under the hood in Chapter 7, Long Short-Term Memory Networks, to understand them better.

In summary, we can mainly separate deep networks into two categories: non-sequential models that deal with only a single input at a time for both training and prediction (for example, image classification), and sequential models that cope with sequences of inputs of arbitrary length (for example, text generation, where a single word is a single input). Then, we can categorize non-sequential (also called feed-forward) models into deep networks (approximately fewer than 20 layers) and very deep networks (which can have hundreds of layers). The sequential models are categorized into short-term memory models (for example, RNNs), which can only memorize short-term patterns, and long-term memory models, which can memorize longer patterns. In Figure 1.4, we outline this taxonomy. You are not expected to fully understand these different deep learning models at this point; the figure only illustrates their diversity:

Figure 1.4: A general taxonomy of the most commonly used deep learning methods, categorized into several classes

Understanding a simple deep model – a Fully-Connected Neural Network

Now let's have a closer look at a deep neural network in order to gain a better understanding. Although there are numerous different variants of deep models, let's look at one of the earliest models (dating back to 1950-60), known as a Fully-Connected Neural Network (FCNN), or sometimes called a multilayer perceptron. Figure 1.5 depicts a standard three-layered FCNN.

The goal of an FCNN is to map an input (for example, an image or a sentence) to a certain label or annotation (for example, the object category for images). This is achieved by using an input x to compute h—a hidden representation of x—using a transformation such as h = σ(Wx + b); here, W and b are the weights and bias of the FCNN, respectively, and σ is the sigmoid activation function. Next, a classifier (for example, a softmax classifier) is placed on top of the FCNN, which gives the ability to leverage the features learned in the hidden layers to classify inputs. The classifier is essentially a part of the FCNN: yet another layer with some weights, W_s, and a bias, b_s. We can then calculate the final output of the FCNN as output = softmax(W_s h + b_s). The softmax classifier provides a normalized representation of the scores output by the classifier layer; the label is taken to be the output node with the highest softmax value. With this, we can define a classification loss, calculated as the difference between the predicted output label and the actual output label. An example of such a loss function is the mean squared loss. You don't have to worry if you don't understand the actual intricacies of the loss function; we will discuss quite a few of them in later chapters. Next, the neural network parameters, W, b, W_s, and b_s, are optimized using a standard stochastic optimizer (for example, stochastic gradient descent) to reduce the classification loss over all the inputs. Figure 1.5 depicts the process explained in this paragraph for a three-layer FCNN. We will walk through the details of how to use such a model for NLP tasks, step by step, in Chapter 3, Word2vec – Learning Word Embeddings.

Figure 1.5: An example of a Fully Connected Neural Network (FCNN)
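
To make the computation above concrete, the following is a minimal NumPy sketch (not the book's code; the layer sizes and random weights are arbitrary, and no training is shown) of the forward pass h = σ(Wx + b) followed by output = softmax(W_s h + b_s):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - np.max(z))    # subtract the max for numerical stability
        return e / e.sum()

    np.random.seed(0)
    input_size, hidden_size, num_classes = 100, 32, 3

    x = np.random.rand(input_size)                           # an arbitrary input vector
    W = np.random.randn(hidden_size, input_size) * 0.01      # hidden-layer weights
    b = np.zeros(hidden_size)                                # hidden-layer bias
    W_s = np.random.randn(num_classes, hidden_size) * 0.01   # classifier weights
    b_s = np.zeros(num_classes)                              # classifier bias

    h = sigmoid(W.dot(x) + b)              # hidden representation: h = sigma(Wx + b)
    output = softmax(W_s.dot(h) + b_s)     # normalized scores: softmax(W_s h + b_s)
    print(output, int(np.argmax(output)))  # scores and the predicted label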

Let's look at an example of how to use a neural network for a sentiment analysis task. Consider that we have a dataset where the input is a sentence expressing a positive or negative opinion about a movie and a corresponding label saying if the sentence is actually positive (1) or negative (0). Then, we are given a test data set, where we have single sentence movie reviews, and our task is to classify these new sentences as positive or negative.

It is possible to use a neural network (which can be deep or shallow, depending on the difficulty of the task) for this task by adhering to the following workflow (a minimal code sketch follows the list):

  1. Tokenize the sentence by words

  2. Pad the sentences with a special token if necessary, to bring all sentences to a fixed length

  3. Convert the sentences into a numerical representation (for example, Bag-of-Words representation)

  4. Feed the numerical inputs to the neural network and predict the output (positive or negative)

  5. Optimize the neural network using a desired loss function
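
The following is a minimal sketch of this workflow on a tiny toy dataset, assuming TensorFlow 1.x as used throughout this book; later chapters use proper tokenization, padding, and much richer models. Bag-of-words vectors already have a fixed length, so the padding step is not needed here:

    import numpy as np
    import tensorflow as tf

    # Toy labeled data: 1 = positive, 0 = negative
    sentences = ["a great movie", "a boring movie", "great acting", "boring plot"]
    labels = np.array([[1.0], [0.0], [1.0], [0.0]], dtype=np.float32)

    # Steps 1 and 3: tokenize and convert each sentence to a bag-of-words vector
    vocabulary = sorted({w for s in sentences for w in s.split()})
    bow = np.array([[s.split().count(w) for w in vocabulary] for s in sentences],
                   dtype=np.float32)

    # Steps 4 and 5: a single-layer network trained with a mean squared loss
    x = tf.placeholder(tf.float32, [None, len(vocabulary)])
    y = tf.placeholder(tf.float32, [None, 1])
    W = tf.Variable(tf.random_normal([len(vocabulary), 1], stddev=0.1))
    b = tf.Variable(tf.zeros([1]))
    prediction = tf.nn.sigmoid(tf.matmul(x, W) + b)
    loss = tf.reduce_mean(tf.square(prediction - y))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(200):
            sess.run(optimizer, feed_dict={x: bow, y: labels})
        print(sess.run(prediction, feed_dict={x: bow}))  # values near 1 and 0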

 

The roadmap – beyond this chapter


This section delineates the details of the rest of the book; it's brief, but has informative details about what each chapter of the book covers. In this book, we will be looking at numerous exciting fields of NLP, from algorithms that find word similarities without any sort of annotated data, to algorithms that can write a story by themselves.

Starting from the next chapter, we will dive into the details of several popular and interesting NLP tasks. In order to gain in-depth knowledge and to make the learning interactive, various exercises are also provided. We will use Python and TensorFlow, an open-source library for distributed numerical computations, for all the implementations. TensorFlow encapsulates advanced technicalities, such as optimizing your code for GPUs using Compute Unified Device Architecture (CUDA), which can be challenging. Furthermore, TensorFlow provides built-in functions for implementing deep learning algorithms, for example, activations, stochastic optimization methods, and convolutions, making everyone's life easier.

We will embark on a journey that covers many hot topics of NLP and how they perform, while using TensorFlow to see the state-of-the-art algorithms in action. This is what we will look at in this book:

  • Chapter 2, Understanding TensorFlow, provides you with a sound guide to understanding how to write client programs and run them in TensorFlow. This is important, especially if you are new to TensorFlow, because TensorFlow behaves differently from a traditional coding language such as Python. This chapter will first offer an in-depth explanation of how TensorFlow executes a client. This will help you to understand the TensorFlow execution workflow and feel comfortable with TensorFlow terminology. Next, the chapter will walk you through various elements of a TensorFlow client, such as defining variables, defining operations/functions, feeding inputs to an algorithm, and obtaining the results. We will finally discuss how all this knowledge of TensorFlow can be used to implement a moderately complex neural network to classify images of handwritten digits.

  • Chapter 3, Word2vec – Learning Word Embeddings. The objective of this chapter is to introduce Word2vec—a method for learning numerical representations of words that reflect the semantics of the words. But before diving straight into the Word2vec techniques, we will first discuss some classical approaches used to represent word semantics. One of the early approaches was to rely on WordNet—a large lexical database. WordNet can be used to measure the semantic similarity between different words. However, maintaining such a large lexical database is costly. Therefore, there exist other, simpler representation techniques, such as one-hot-encoded representations and the term frequency-inverse document frequency method, that don't rely on external resources. Following this, we will move on to the modern way of learning word vectors, known as Word2vec, where we use a neural network to learn word representations. We will discuss two popular Word2vec techniques: the skip-gram and continuous bag-of-words (CBOW) models.

  • Chapter 4, Advanced Word2vec. We will start this chapter with several comparisons, including a comparison between the skip-gram and CBOW algorithms to see if there is a clear-cut winner. Then we will discuss several extensions that have been introduced to the original Word2vec techniques over the course of the past few years. For example, ignoring common words in the text that occur with high probability, such as "the" and "a", improves the performance of Word2vec models. On the other hand, the Word2vec model only considers the local context of a word and ignores the global statistics of the entire corpus. Consequently, a word embedding learning technique known as GloVe, which incorporates both global and local statistics in finding word vectors, will be discussed.

  • Chapter 5, Sentence Classification with Convolution Neural Networks, introduces you to convolution neural networks (CNNs). Convolution networks are a powerful family of deep models that can leverage the spatial structure of an input to learn from data. In other words, a CNN can process images in their two-dimensional form, where a multilayer perceptron needs the image to be unwrapped into a one-dimensional vector. We will first discuss, in detail, the various operations that take place in CNNs, such as the convolution and pooling operations. Then we will see an example where we learn to classify hand-written digit images with a CNN. Then we will transition into an application of CNNs in NLP. Precisely, we will be investigating how to apply a CNN to classify sentences, where the task is to classify whether a sentence is about a person, location, object, and so on.

  • Chapter 6, Recurrent Neural Networks, focuses on introducing recurrent neural networks (RNNs) and using RNNs for language generation. RNNs are different from feed-forward neural networks (for example, CNNs) in that RNNs have memory. The memory is stored as a continuously updated system state. We will start with a representation of a feed-forward neural network and modify that representation to learn from sequences of data instead of individual data points. This process will transform the feed-forward network into an RNN. This will be followed by a technical description of the exact equations used for computations within the RNN. Next, we will discuss the optimization process of RNNs that is used to update the RNN's weights. Thereafter, we will iterate through different types of RNNs, such as one-to-one RNNs and one-to-many RNNs. We will then walk through an exciting application of RNNs, where the RNN will learn to tell new stories by learning from a corpus of existing stories. We achieve this by training the RNN to predict the next word given the preceding sequence of words of the story. Finally, we will discuss a variant of standard RNNs, which we call the RNN-CF (RNN with contextual features), and compare it with the standard RNN to see which one performs better.

  • Chapter 7, Long Short-Term Memory Networks, discusses LSTMs by initially providing a solid intuition for how these models work and progressively diving into the technical details adequate to implement them on your own. Standard RNNs suffer from the crucial limitation of being unable to persist long-term memory. However, advanced RNN models (for example, long short-term memory (LSTM) cells and gated recurrent units (GRUs)) have been proposed, which can remember sequences for a large number of time steps. We will also examine how exactly LSTMs alleviate the problem of persisting long-term memory (this is known as the vanishing gradient problem). We will then discuss several improvements that can be used to improve LSTM models further, such as predicting several time steps ahead at once and reading sequences both forward and backward. Finally, we will discuss several variants of LSTM models, such as GRUs and LSTMs with peephole connections.

  • Chapter 8, Applications of LSTM – Generating Text, explains how to implement the LSTMs, GRUs, and LSTMs with peephole connections discussed in Chapter 7, Long Short-Term Memory Networks. Furthermore, we will compare the performance of these extensions both qualitatively and quantitatively. We will also discuss how to implement some of the extensions examined in Chapter 7, Long Short-Term Memory Networks, such as predicting several time steps ahead (known as beam search) and using word vectors as inputs instead of one-hot-encoded representations. Finally, we will discuss how we can use the RNN API, a sub-library of TensorFlow that simplifies the implementation of recurrent models.

  • Chapter 9, Applications of LSTM – Image Caption Generation, looks at another exciting application, where the model learns how to generate captions (that is, descriptions) for images using an LSTM and a CNN. This application is interesting because it shows us how to combine two different types of models as well as how to learn with multimodal data (for example, images and text). The specific way to achieve this is to first learn image representations (similar to word vectors) with the CNN and then train the LSTM by feeding that image vector followed by the words of the description of the image as a sequence. We will first discuss how we can use a pretrained CNN to obtain the image representations. Then we will discuss how to learn the word embeddings. Next, we will discuss how to feed the image vectors along with word embeddings to train the LSTM. This is followed by a description of the different evaluation metrics that exist for evaluating image captioning systems. Afterwards, we will evaluate the captions generated by our model, both qualitatively and quantitatively. We will conclude the chapter with a guide on how to implement the same system using the TensorFlow RNN API.

  • Chapter 10, Sequence-to-Sequence Learning – Neural Machine Translation. Machine translation has gained a lot of attention, both due to the necessity of automating translation and the inherent difficulty of the task. We will start the chapter with a brief historical flashback of how machine translation was implemented in the early days. This discussion ends with an introduction to neural machine translation (NMT) systems. We will see how well current NMT systems are doing compared to old systems (such as statistical machine translation systems), which will motivate us to learn about NMT systems. Afterwards, we will discuss the intuition behind the design of NMT systems and continue with the technical details. Then we will discuss the evaluation metric we use to evaluate our system. Following this, we will investigate how we can implement a German-to-English translator from scratch. Next, we will learn about ways to improve NMT systems. We will look at one of those extensions in detail, called the attention mechanism. The attention mechanism has become essential in sequence-to-sequence learning problems. Finally, we will compare the performance improvement obtained with the attention mechanism and analyze the reasons behind the performance gain. This chapter concludes with a section on how the same concept of NMT systems can be extended to implement chatbots. Chatbots are systems that can communicate with humans and are used to fulfill various customer requests.

  • Chapter 11, Current Trends and the Future of Natural Language Processing. Natural language processing has branched out into a vast spectrum of different tasks. Here we will discuss some of the current trends in NLP and the developments we can expect in the future. We will first discuss various word embedding extensions that have emerged recently. We will also look at the implementation of one such embedding learning technique, known as tv-embeddings. Next, we will examine various trends growing in the field of neural machine translation. Then we will look at how NLP is combined with other fields, such as computer vision and reinforcement learning, to solve some interesting problems, such as teaching computer agents to communicate by devising their own language. Another booming area these days is artificial general intelligence, which is about developing systems that can do multiple tasks (classify images, translate text, caption images, and so on) with a single system. We will investigate several such systems. Afterwards, we will talk about the introduction of NLP into mining social media. We will conclude this chapter with some of the new tasks emerging (for example, language grounding – developing common sense NLP systems) and new models (for example, phased LSTMs).

  • Appendix, Mathematical Foundations and Advanced TensorFlow, will introduce the reader to various mathematical data structures (for example, matrices) and operations (for example, matrix inverse). We will also discuss several important concepts in probability. We will then introduce Keras—a high-level library that uses TensorFlow underneath. Keras makes implementing neural networks simpler by hiding some of the details in TensorFlow, which some might find challenging. Concretely, we will see how we can implement a CNN with Keras, to get a feel for how to use Keras. Next, we will discuss how we can use the seq2seq library in TensorFlow to implement a neural machine translation system with much less code than we used in Chapter 10, Sequence-to-Sequence Learning – Neural Machine Translation. Finally, we will walk you through a guide aimed at teaching you to use TensorBoard to visualize word embeddings. TensorBoard is a handy visualization tool that is shipped with TensorFlow. It can be used to visualize and monitor various variables in your TensorFlow client.

 

Introduction to the technical tools


In this section, you will be introduced to the technical tools that will be used in the exercises of the following chapters. First, we will present a brief introduction to the main tools provided. Next, we will present a coarse guide on how to install each tool along with hyperlinks to detailed guides provided by the official websites. Additionally, we will share tips on how to make sure that the tools were installed properly.

Description of the tools

We will use Python as the coding/scripting language. Python is a very versatile, easy-to-set-up coding language that is heavily used by the scientific community. Additionally, there are numerous scientific libraries available for Python, catering to areas ranging from deep learning to probabilistic inference to data visualization. TensorFlow is one such library that is well-known among the deep learning community, providing many basic and advanced operations that are useful for deep learning. Next, we will use Jupyter notebooks in all our exercises, as they provide a more interactive environment for coding compared to using an IDE. We will also use scikit-learn—another popular machine learning toolkit for Python—for various miscellaneous purposes, such as data preprocessing. Another library we will be using for various text-related operations is NLTK—the Python natural language toolkit. Finally, we will use Matplotlib for data visualization.

Installing Python and scikit-learn

Python is hassle-free to install in any of the commonly used operating systems such as Windows, macOS, or Linux. We will use Anaconda to set up Python, as it does all the laborious work for setting up Python as well as the essential libraries.

To install Anaconda, follow these steps:

  1. Download Anaconda from https://www.continuum.io/downloads

  2. Select the appropriate OS and download Python 3.5

  3. Install Anaconda by following the instructions at https://docs.continuum.io/anaconda/install/

To check whether Anaconda was properly installed, follow these steps:

  1. Open a Terminal window (Command Prompt in Windows)

  2. Now, run the following command:

    conda --version
    

If installed properly, the version of the current Anaconda distribution should be shown in the Terminal. You can then install scikit-learn by following the instructions at http://scikit-learn.org/stable/install.html, NLTK from https://www.nltk.org/install.html, and Matplotlib from https://matplotlib.org/users/installing.html.

Installing Jupyter Notebook

You can install Jupyter Notebook by following the instruction at http://jupyter.readthedocs.io/en/latest/install.html.

To check whether Jupyter Notebook is properly installed, follow these steps:

  1. Open a Terminal window

  2. Run this command:

    jupyter notebook
    

    You should be presented with a new browser window that looks like Figure 1.6:

    Figure 1.6. Jupyter Notebook installed successfully

Installing TensorFlow

Follow the instructions given at https://www.tensorflow.org/install/ under the Installing with Anaconda subsection to install TensorFlow. We will use TensorFlow 1.8.x throughout all the exercises.

When providing the tfBinaryURL as asked in the instruction, make sure that you provide a TensorFlow 1.8.x version. We stress this as the API has undergone many changes compared to the previous TensorFlow versions.

To check whether TensorFlow installed properly, follow these steps:

  1. Open Command Prompt in Windows or Terminal in Linux or macOS.

  2. Type python to enter the Python environment. You should now see the Python version right below. Make sure that you are using Python 3.

  3. Next, enter the following commands:

    import tensorflow as tf
    print(tf.__version__)

If all went well, you should not have any errors (there might be warnings if your computer does not have a dedicated GPU, but you can ignore them) and the TensorFlow version 1.8.x should be shown.

Note

Many cloud-based computational platforms are also available, where you can set up your own machine with various customization (operating system, GPU card type, number of GPU cards, and so on). Many are migrating to such cloud-based services due to the following benefits:

  • More customization options

  • Less maintenance effort

  • No infrastructure requirements

Several popular cloud-based computational platforms are as follows:

 

Summary


In this chapter, we broadly explored NLP to get an impression of the kind of tasks involved in building a good NLP-based system. First, we explained why we need NLP and then discussed various tasks of NLP to generally understand the objective of each task and how difficult it is to succeed at these tasks.

Next, we looked at the classical approach of solving NLP and went into the details of the workflow using an example of generating sport summaries for football games. We saw that the traditional approach usually involves cumbersome and tedious feature engineering. For example, in order to check the correctness of a generated phrase, we might need to generate a parse tree for that phrase. Next, we discussed the paradigm shift that transpired with deep learning and saw how deep learning made the feature engineering step obsolete. We started with a bit of time-travelling to go back to the inception of deep learning and artificial neural networks and worked our way to the massive modern networks with hundreds of hidden layers. Afterward, we walked through a simple example illustrating a deep model—a multilayer perceptron model—to understand the mathematical wizardry taking place in such a model (on the surface of course!).

With a nice foundation to both traditional and modern ways of approaching NLP, we then discussed the roadmap to understand the topics we will be covering in the book, from learning word embeddings to mighty LSTMs, generating captions for images to neural machine translators! Finally, we set up our environment by installing Python, scikit-learn, Jupyter Notebook, and TensorFlow.

In the next chapter, you will learn the basics of TensorFlow. By the end of the chapter, you should be comfortable with writing a simple algorithm that can take some input, transform the input through a defined function and output the result.

About the Authors
  • Motaz Saad
  • Thushan Ganegedara

    Thushan is a seasoned ML practitioner with over 4 years of experience in the industry. Currently, he is a senior machine learning engineer at Canva, the Australian startup behind the online visual design software Canva, which serves millions of customers. His efforts are particularly concentrated in the search and recommendations group, working on both visual and textual content. Prior to Canva, Thushan was a senior data scientist at QBE Insurance, an Australian insurance company, where he developed ML solutions for use cases related to insurance claims. He also led efforts in developing a Speech2Text pipeline there. He obtained his PhD specializing in machine learning from the University of Sydney in 2018.
