Java Deep Learning Cookbook

You're reading from Java Deep Learning Cookbook

Product type: Book
Published in: Nov 2019
Publisher: Packt
ISBN-13: 9781788995207
Pages: 304
Edition: 1st
Author: Rahul Raj

Table of Contents (14 chapters)

1. Preface
2. Introduction to Deep Learning in Java
3. Data Extraction, Transformation, and Loading
4. Building Deep Neural Networks for Binary Classification
5. Building Convolutional Neural Networks
6. Implementing Natural Language Processing
7. Constructing an LSTM Network for Time Series
8. Constructing an LSTM Neural Network for Sequence Classification
9. Performing Anomaly Detection on Unsupervised Data
10. Using RL4J for Reinforcement Learning
11. Developing Applications in a Distributed Environment
12. Applying Transfer Learning to Network Models
13. Benchmarking and Neural Network Optimization
14. Other Books You May Enjoy

Implementing Natural Language Processing

In this chapter, we will discuss word vectors (Word2Vec) and paragraph vectors (Doc2Vec) in DL4J. We will develop a complete running example step by step, covering all the stages, such as ETL, model configuration, training, and evaluation. Word2Vec and Doc2Vec are natural language processing (NLP) implementations in DL4J. It is worth mentioning a little about the bag-of-words algorithm before we talk about Word2Vec.

Bag-of-words is an algorithm that counts the occurrences of words in documents, which allows us to perform document classification. Bag-of-words and Word2Vec are two different approaches to representing text. Word2Vec can use a bag of words extracted from a document to create vectors. In addition to these text-representation methods, term frequency–inverse document frequency (TF-IDF) can be used to judge the topic...
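To make the bag-of-words idea concrete, here is a minimal, library-free sketch in plain Java (not the DL4J API) that counts word occurrences in a document. The class and method names are hypothetical, chosen only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Count how often each word occurs in a document.
    static Map<String, Integer> count(String document) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase, split on whitespace, and strip simple punctuation.
        for (String token : document.toLowerCase().split("\\s+")) {
            String word = token.replaceAll("[^a-z]", "");
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> bag = count("The cat sat on the mat. The mat was flat.");
        System.out.println(bag.get("the")); // 3
        System.out.println(bag.get("mat")); // 2
    }
}
```

The resulting word counts are exactly the kind of raw material that document classifiers and TF-IDF weighting build on.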

Technical requirements

The examples discussed in this chapter can be found at https://github.com/PacktPublishing/Java-Deep-Learning-Cookbook/tree/master/05_Implementing_NLP/sourceCode/cookbookapp/src/main/java/com/javadeeplearningcookbook/examples.

After cloning our GitHub repository, navigate to the directory called Java-Deep-Learning-Cookbook/05_Implementing_NLP/sourceCode. Then, import the cookbookapp project as a Maven project by importing pom.xml.

To get started with NLP in DL4J, add the following Maven dependency in pom.xml:

<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-nlp</artifactId>
    <version>1.0.0-beta3</version>
</dependency>

Data requirements

...

Reading and loading text data

We need to load raw sentences in text format and iterate over them using an underlying sentence iterator that serves the purpose. A text corpus can also be preprocessed, for example by converting it to lowercase. Stop words can be specified while configuring the Word2Vec model. In this recipe, we will extract and load text data from various data-input scenarios.

Getting ready

Select an iterator approach from steps 1 to 5, depending on the kind of data you have and how you want to load it.

How to do it...

  1. Create a sentence iterator...

Tokenizing data and training the model

We need to perform tokenization in order to build Word2Vec models. The context of a sentence (document) is determined by the words in it. Word2Vec models take words rather than sentences (documents) as input, so we need to break each sentence into atomic units, creating a token each time whitespace is hit. DL4J has a tokenizer factory that is responsible for creating the tokenizer: the TokenizerFactory generates a tokenizer for a given string. In this recipe, we will tokenize the text data and train the Word2Vec model on top of it.

How to do it...

  1. Create a tokenizer factory and set the token preprocessor:
TokenizerFactory tokenFactory = new DefaultTokenizerFactory();
tokenFactory.setTokenPreProcessor(new CommonPreprocessor());
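The whitespace tokenization described above can be sketched without DL4J. The following plain-Java class roughly mirrors what a default whitespace tokenizer plus a common preprocessor (lowercase, strip punctuation) does to each sentence; it is an illustrative sketch, not the DL4J source:

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenizer {
    // Split a sentence on whitespace and normalize each token:
    // lowercase it and strip everything except letters and digits.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.split("\\s+")) {
            String token = raw.toLowerCase().replaceAll("[^a-z0-9]", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Word2Vec models require words, not sentences!"));
    }
}
```

In DL4J itself, you would not call the tokenizer directly; you pass the TokenizerFactory into the Word2Vec builder, and the model tokenizes each sentence internally during fit().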

Evaluating the model

We need to check the feature vector quality during the evaluation process. This will give us an idea of the quality of the Word2Vec model that was generated. In this recipe, we will follow two different approaches to evaluate the Word2Vec model.

How to do it...

  1. Find similar words to a given word:
Collection<String> words = model.wordsNearest("season", 10);

You will see an output similar to the following:

week
game
team
year
world
night
time
country
last
group
  2. Find the cosine similarity of the given two words:
double cosSimilarity = model.similarity("season","program");
System.out.println(cosSimilarity);

For the preceding example, the cosine similarity is calculated as...
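Under the hood, similarity() computes the cosine of the angle between the two words' vectors. A hand-rolled sketch of that calculation on two toy vectors (the values are invented for illustration, not actual model output):

```java
public class CosineSimilarity {
    // cos(a, b) = (a · b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] season = {0.2, 0.8, 0.1};  // toy vectors, not real embeddings
        double[] program = {0.1, 0.6, 0.4};
        System.out.println(cosine(season, program));
    }
}
```

Identical directions score 1.0, orthogonal vectors score 0.0, so a score close to 1.0 means the model places the two words in similar contexts.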

Generating plots from the model

We mentioned that we used a layer size of 100 while training the Word2Vec model. This means the word vectors have 100 features and, hence, form a 100-dimensional feature space. Since it is impossible to plot a 100-dimensional space directly, we rely on t-SNE to perform dimensionality reduction. In this recipe, we will generate 2D plots from the Word2Vec model.

Getting ready

Saving and reloading the model

Model persistence is a key topic, especially when operating across different platforms. We can also reuse the model for further training (transfer learning) or for inference.

In this recipe, we will persist (save and reload) the Word2Vec models.

How to do it...

  1. Save the Word2Vec model using WordVectorSerializer:
WordVectorSerializer.writeWord2VecModel(model, "model.zip");
  2. Reload the Word2Vec model using WordVectorSerializer:
Word2Vec word2Vec = WordVectorSerializer.readWord2VecModel("model.zip");
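WordVectorSerializer handles the file format for us; conceptually, persisting a Word2Vec model means writing out the vocabulary together with each word's vector and reading it back. A minimal, library-free sketch of that round trip (the text format and class names here are invented for illustration, not DL4J's actual format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class VectorPersistence {
    // Write each word and its vector as "word v1 v2 ..." on one line.
    static void save(Map<String, double[]> vectors, Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            sb.append(e.getKey());
            for (double v : e.getValue()) sb.append(' ').append(v);
            sb.append('\n');
        }
        Files.writeString(path, sb.toString());
    }

    // Parse the same format back into a word -> vector map.
    static Map<String, double[]> load(Path path) throws IOException {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        for (String line : Files.readAllLines(path)) {
            String[] parts = line.split(" ");
            double[] v = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) v[i - 1] = Double.parseDouble(parts[i]);
            vectors.put(parts[0], v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("vectors", ".txt");
        save(Map.of("season", new double[]{0.2, 0.8}), tmp);
        System.out.println(load(tmp).get("season")[1]); // 0.8
    }
}
```

DL4J's real serializer additionally stores model configuration and uses a compressed archive, which is why reloading with readWord2VecModel restores a model you can keep training.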

How it works...

...

Importing Google News vectors

Google provides a large, pretrained Word2Vec model with around 3 million 300-dimensional English word vectors. Being both large and pretrained, it tends to display promising results. We will use the Google vectors as our input word vectors for evaluation. You will need at least 8 GB of RAM to run this example. In this recipe, we will import the Google News vectors and then perform an evaluation.

How to do it...

  1. Import the Google News vectors:
File file = new File("GoogleNews-vectors-negative300.bin.gz");
Word2Vec model = WordVectorSerializer.readWord2VecModel(file);
  2. Run an evaluation on the Google News vectors:
System.out.println(model.wordsNearest("season", 10));
...

Troubleshooting and tuning Word2Vec models

Word2Vec models can be tuned further to produce better results. Runtime errors can occur in situations where memory demand is high and resources are limited. We need to troubleshoot them to understand why they occur and take preventive measures. In this recipe, we will troubleshoot Word2Vec models and tune them.

How to do it...

  1. Monitor the application console/logs for OutOfMemoryError to check whether the heap space needs to be increased.
  2. If you see out-of-memory errors in your IDE console, add VM options to your IDE's run configuration to increase the Java heap size.
  3. Monitor StackOverflowError while running Word2Vec models. Watch out for the...
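One way to verify how much heap the JVM actually has available (for example, after changing the VM options in your IDE) is to query the Runtime from code. A small sketch:

```java
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxMb = rt.maxMemory() / (1024 * 1024);     // upper heap bound (-Xmx)
        long totalMb = rt.totalMemory() / (1024 * 1024); // heap currently reserved
        long freeMb = rt.freeMemory() / (1024 * 1024);   // unused part of reserved heap
        System.out.println("max=" + maxMb + "MB total=" + totalMb + "MB free=" + freeMb + "MB");
    }
}
```

If the reported maximum is far below what your corpus needs, raise it with a VM option such as -Xmx8G (8 GB matches the requirement mentioned earlier for the Google News vectors).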

Using Word2Vec for sentence classification using CNNs

Neural networks require numerical inputs to perform their operations as expected, so we cannot feed text data directly into a neural network. Since Word2Vec converts text data to vectors, we can exploit Word2Vec to use text with neural networks. We will use a pretrained Google News vector model as a reference and train a CNN on top of it. At the end of this process, we will develop an IMDB review classifier that classifies reviews as positive or negative. As per the paper found at https://arxiv.org/abs/1408.5882, combining a pretrained Word2Vec model with a CNN gives us better results.

We will employ a custom CNN architecture along with the pretrained word vector model, as suggested by Yoon Kim in his 2014 publication (https://arxiv.org/abs/1408.5882). The architecture is slightly more...

Using Doc2Vec for document classification

Word2Vec correlates words with words, whereas the purpose of Doc2Vec (also known as paragraph vectors) is to correlate labels with words. We will discuss Doc2Vec in this recipe. Documents are labeled in such a way that the subdirectories under the document root represent the document labels. For example, all finance-related data should be placed under the finance subdirectory. In this recipe, we will perform document classification using Doc2Vec.

How to do it...

  1. Extract and load the data using FileLabelAwareIterator:
LabelAwareIterator labelAwareIterator = new FileLabelAwareIterator.Builder()
    .addSourceFolder(new ClassPathResource("label").getFile()).build();
  1. Create...
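FileLabelAwareIterator derives each document's label from the name of its parent subdirectory, as described above. A library-free sketch of that convention (the class name and directory layout are hypothetical, for illustration only):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class LabeledDocuments {
    // Map each document file to a label taken from its parent directory,
    // e.g. root/finance/report.txt -> label "finance".
    static Map<Path, String> labelByDirectory(Path root) throws IOException {
        Map<Path, String> labels = new LinkedHashMap<>();
        try (var files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .forEach(f -> labels.put(f, f.getParent().getFileName().toString()));
        }
        return labels;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("label");
        Path finance = Files.createDirectories(root.resolve("finance"));
        Path doc = Files.writeString(finance.resolve("report.txt"), "quarterly results");
        System.out.println(labelByDirectory(root).get(doc)); // finance
    }
}
```

This is why organizing the training corpus into one subdirectory per class is all the labeling Doc2Vec needs in this recipe.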