Java Deep Learning Cookbook

You're reading from Java Deep Learning Cookbook

Product type: Book
Published in: Nov 2019
Publisher: Packt
ISBN-13: 9781788995207
Pages: 304
Edition: 1st
Author: Rahul Raj

Table of Contents (14 chapters)

1. Preface
2. Introduction to Deep Learning in Java
3. Data Extraction, Transformation, and Loading
4. Building Deep Neural Networks for Binary Classification
5. Building Convolutional Neural Networks
6. Implementing Natural Language Processing
7. Constructing an LSTM Network for Time Series
8. Constructing an LSTM Neural Network for Sequence Classification
9. Performing Anomaly Detection on Unsupervised Data
10. Using RL4J for Reinforcement Learning
11. Developing Applications in a Distributed Environment
12. Applying Transfer Learning to Network Models
13. Benchmarking and Neural Network Optimization
14. Other Books You May Enjoy

Implementing Natural Language Processing

In this chapter, we will discuss word vectors (Word2Vec) and paragraph vectors (Doc2Vec) in DL4J. We will develop a complete running example step by step, covering all the stages, such as ETL, model configuration, training, and evaluation. Word2Vec and Doc2Vec are natural language processing (NLP) implementations in DL4J. It is worth mentioning a little about the bag-of-words algorithm before we talk about Word2Vec.

Bag-of-words is an algorithm that counts the occurrences of words in documents, which allows us to perform document classification. Bag-of-words and Word2Vec are two different approaches to representing text. Word2Vec can use a bag of words extracted from a document to create vectors. In addition to these text-representation methods, term frequency–inverse document frequency (TF-IDF) can be used to judge the topic...
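To make the bag-of-words idea concrete, here is a minimal, library-free sketch in plain Java (not the DL4J API) that counts word occurrences in a document. The class and method names are hypothetical, chosen only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Count how often each word occurs in a document.
    static Map<String, Integer> count(String document) {
        Map<String, Integer> counts = new HashMap<>();
        // Lowercase, split on whitespace, and strip simple punctuation.
        for (String token : document.toLowerCase().split("\\s+")) {
            String word = token.replaceAll("[^a-z]", "");
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> bag = count("The cat sat on the mat. The mat was flat.");
        System.out.println(bag.get("the")); // 3
        System.out.println(bag.get("mat")); // 2
    }
}
```

The resulting word counts are exactly the kind of raw material that document classifiers and TF-IDF weighting build on.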

Technical requirements

The examples discussed in this chapter can be found at https://github.com/PacktPublishing/Java-Deep-Learning-Cookbook/tree/master/05_Implementing_NLP/sourceCode/cookbookapp/src/main/java/com/javadeeplearningcookbook/examples.

After cloning our GitHub repository, navigate to the directory called Java-Deep-Learning-Cookbook/05_Implementing_NLP/sourceCode. Then, import the cookbookapp project as a Maven project by importing pom.xml.

To get started with NLP in DL4J, add the following Maven dependency in pom.xml:

<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-nlp</artifactId>
    <version>1.0.0-beta3</version>
</dependency>

Data requirements

...

Reading and loading text data

We need to load raw sentences in text format and iterate over them using an underlying sentence iterator that serves the purpose. A text corpus can also be preprocessed, for example by converting it to lowercase. Stop words can be specified while configuring the Word2Vec model. In this recipe, we will extract and load text data from various data-input scenarios.

Getting ready

Select an iterator approach from steps 1 to 5, depending on the kind of data you have and how you want to load it.

How to do it...

  1. Create a sentence iterator...

Tokenizing data and training the model

We need to perform tokenization in order to build Word2Vec models. The context of a sentence (document) is determined by the words in it. Word2Vec models take words rather than sentences (documents) as input, so we need to break each sentence into atomic units, creating a token each time whitespace is hit. DL4J has a tokenizer factory that is responsible for creating the tokenizer: the TokenizerFactory generates a tokenizer for a given string. In this recipe, we will tokenize the text data and train the Word2Vec model on top of it.

How to do it...

  1. Create a tokenizer factory and set the token preprocessor:
TokenizerFactory tokenFactory = new DefaultTokenizerFactory();
tokenFactory.setTokenPreProcessor(new CommonPreprocessor());
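The whitespace tokenization described above can be sketched without DL4J. The following plain-Java class roughly mirrors what a default whitespace tokenizer plus a common preprocessor (lowercase, strip punctuation) does to each sentence; it is an illustrative sketch, not the DL4J source:

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenizer {
    // Split a sentence on whitespace and normalize each token:
    // lowercase it and strip everything except letters and digits.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.split("\\s+")) {
            String token = raw.toLowerCase().replaceAll("[^a-z0-9]", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Word2Vec models require words, not sentences!"));
    }
}
```

In DL4J itself, you would not call the tokenizer directly; you pass the TokenizerFactory into the Word2Vec builder, and the model tokenizes each sentence internally during fit().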

Evaluating the model

We need to check the feature vector quality during the evaluation process. This will give us an idea of the quality of the Word2Vec model that was generated. In this recipe, we will follow two different approaches to evaluate the Word2Vec model.

How to do it...

  1. Find similar words to a given word:
Collection<String> words = model.wordsNearest("season", 10);

You will see an output similar to the following:

week
game
team
year
world
night
time
country
last
group
  2. Find the cosine similarity of the given two words:
double cosSimilarity = model.similarity("season","program");
System.out.println(cosSimilarity);

For the preceding example, the cosine similarity is calculated as...
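Under the hood, similarity() computes the cosine of the angle between the two words' vectors. A hand-rolled sketch of that calculation on two toy vectors (the values are invented for illustration, not actual model output):

```java
public class CosineSimilarity {
    // cos(a, b) = (a · b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] season = {0.2, 0.8, 0.1};  // toy vectors, not real embeddings
        double[] program = {0.1, 0.6, 0.4};
        System.out.println(cosine(season, program));
    }
}
```

Identical directions score 1.0, orthogonal vectors score 0.0, so a score close to 1.0 means the model places the two words in similar contexts.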

Generating plots from the model

We mentioned that we used a layer size of 100 while training the Word2Vec model. This means the word vectors have 100 features and, hence, form a 100-dimensional feature space. Since it is impossible to plot a 100-dimensional space directly, we rely on t-SNE to perform dimensionality reduction. In this recipe, we will generate 2D plots from the Word2Vec model.

Getting ready

Saving and reloading the model

Model persistence is a key topic, especially when operating across different platforms. We can also reuse the model for further training (transfer learning) or for inference.

In this recipe, we will persist (save and reload) the Word2Vec models.

How to do it...

  1. Save the Word2Vec model using WordVectorSerializer:
WordVectorSerializer.writeWord2VecModel(model, "model.zip");
  2. Reload the Word2Vec model using WordVectorSerializer:
Word2Vec word2Vec = WordVectorSerializer.readWord2VecModel("model.zip");
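WordVectorSerializer handles the file format for us; conceptually, persisting a Word2Vec model means writing out the vocabulary together with each word's vector and reading it back. A minimal, library-free sketch of that round trip (the text format and class names here are invented for illustration, not DL4J's actual format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class VectorPersistence {
    // Write each word and its vector as "word v1 v2 ..." on one line.
    static void save(Map<String, double[]> vectors, Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, double[]> e : vectors.entrySet()) {
            sb.append(e.getKey());
            for (double v : e.getValue()) sb.append(' ').append(v);
            sb.append('\n');
        }
        Files.writeString(path, sb.toString());
    }

    // Parse the same format back into a word -> vector map.
    static Map<String, double[]> load(Path path) throws IOException {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        for (String line : Files.readAllLines(path)) {
            String[] parts = line.split(" ");
            double[] v = new double[parts.length - 1];
            for (int i = 1; i < parts.length; i++) v[i - 1] = Double.parseDouble(parts[i]);
            vectors.put(parts[0], v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("vectors", ".txt");
        save(Map.of("season", new double[]{0.2, 0.8}), tmp);
        System.out.println(load(tmp).get("season")[1]); // 0.8
    }
}
```

DL4J's real serializer additionally stores model configuration and uses a compressed archive, which is why reloading with readWord2VecModel restores a model you can keep training.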

How it works...

...

Importing Google News vectors

Google provides a large, pretrained Word2Vec model with around 3 million 300-dimensional English word vectors. Being both large and pretrained, it tends to display promising results. We will use the Google vectors as our input word vectors for evaluation. You will need at least 8 GB of RAM to run this example. In this recipe, we will import the Google News vectors and then perform an evaluation.

How to do it...

  1. Import the Google News vectors:
File file = new File("GoogleNews-vectors-negative300.bin.gz");
Word2Vec model = WordVectorSerializer.readWord2VecModel(file);
  2. Run an evaluation on the Google News vectors:
System.out.println(model.wordsNearest("season", 10));
...

Troubleshooting and tuning Word2Vec models

Word2Vec models can be tuned further to produce better results. Runtime errors can occur in situations where memory demand is high and resources are limited. We need to troubleshoot them to understand why they occur and take preventive measures. In this recipe, we will troubleshoot Word2Vec models and tune them.

How to do it...

  1. Monitor the application console/logs for OutOfMemoryError to check whether the heap space needs to be increased.
  2. If you see out-of-memory errors in your IDE console, add VM options to your IDE's run configuration to increase the Java heap size.
  3. Monitor StackOverflowError while running Word2Vec models. Watch out for the...
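One way to verify how much heap the JVM actually has available (for example, after changing the VM options in your IDE) is to query the Runtime from code. A small sketch:

```java
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxMb = rt.maxMemory() / (1024 * 1024);     // upper heap bound (-Xmx)
        long totalMb = rt.totalMemory() / (1024 * 1024); // heap currently reserved
        long freeMb = rt.freeMemory() / (1024 * 1024);   // unused part of reserved heap
        System.out.println("max=" + maxMb + "MB total=" + totalMb + "MB free=" + freeMb + "MB");
    }
}
```

If the reported maximum is far below what your corpus needs, raise it with a VM option such as -Xmx8G (8 GB matches the requirement mentioned earlier for the Google News vectors).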

Using Word2Vec for sentence classification using CNNs

Neural networks require numerical inputs to perform their operations as expected, so we cannot feed text data directly into a neural network. Since Word2Vec converts text data to vectors, we can exploit Word2Vec to use text with neural networks. We will use a pretrained Google News vector model as a reference and train a CNN on top of it. At the end of this process, we will develop an IMDB review classifier that classifies reviews as positive or negative. As per the paper found at https://arxiv.org/abs/1408.5882, combining a pretrained Word2Vec model with a CNN gives us better results.

We will employ a custom CNN architecture along with the pretrained word vector model, as suggested by Yoon Kim in his 2014 publication (https://arxiv.org/abs/1408.5882). The architecture is slightly more...

Using Doc2Vec for document classification

Word2Vec correlates words with words, whereas the purpose of Doc2Vec (also known as paragraph vectors) is to correlate labels with words. We will discuss Doc2Vec in this recipe. Documents are labeled in such a way that the subdirectories under the document root represent the document labels. For example, all finance-related data should be placed under the finance subdirectory. In this recipe, we will perform document classification using Doc2Vec.

How to do it...

  1. Extract and load the data using FileLabelAwareIterator:
LabelAwareIterator labelAwareIterator = new FileLabelAwareIterator.Builder()
    .addSourceFolder(new ClassPathResource("label").getFile()).build();
  1. Create...
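FileLabelAwareIterator derives each document's label from the name of its parent subdirectory, as described above. A library-free sketch of that convention (the class name and directory layout are hypothetical, for illustration only):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class LabeledDocuments {
    // Map each document file to a label taken from its parent directory,
    // e.g. root/finance/report.txt -> label "finance".
    static Map<Path, String> labelByDirectory(Path root) throws IOException {
        Map<Path, String> labels = new LinkedHashMap<>();
        try (var files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .forEach(f -> labels.put(f, f.getParent().getFileName().toString()));
        }
        return labels;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("label");
        Path finance = Files.createDirectories(root.resolve("finance"));
        Path doc = Files.writeString(finance.resolve("report.txt"), "quarterly results");
        System.out.println(labelByDirectory(root).get(doc)); // finance
    }
}
```

This is why organizing the training corpus into one subdirectory per class is all the labeling Doc2Vec needs in this recipe.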