Preparing Text for Analysis and Tokenization
One of the first steps required for Natural Language Processing (NLP) is the extraction of tokens from text. The process of tokenization splits text into tokens—that is, words. Normally, tokens are split based upon delimiters, such as white space. White space includes blanks, tabs, and carriage-return line feeds. However, specialized tokenizers can split tokens according to other delimiters. In this chapter, we will illustrate several tokenizers that you will find useful in your analysis.
Another important NLP task involves determining the stem and lexical meaning of a word. This is useful for deriving more meaning about the words being processed, as illustrated in the fifth and sixth recipes. The stem of a word refers to the root of a word. For example, the stem of the word antiquated is antiqu. While this may not seem to be the correct stem, the stem of a word is the ultimate base of the word.
The lexical meaning of a word is not concerned with the context in which it is being used. We will be examining the process of performing lemmatization of a word. Lemmatization is also concerned with finding the root of a word, but it uses a more detailed dictionary to do so. The stem of a word may vary depending on the form the word takes, whereas with lemmatization, the root will always be the same. Stemming is often used when a less precise determination of the root of a word is acceptable. A more thorough discussion of stemming versus lemmatization can be found at https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/.
The last task in this chapter deals with the process of text normalization. Here, we are concerned with converting the extracted tokens to a form that can be more easily processed during later analysis. Typical normalization activities include converting case, expanding abbreviations, removing stop words, stemming, and lemmatization. Stop words are those words that can often be ignored with certain types of analyses. For example, in some contexts, the word the does not always need to be included.
In this chapter, we will cover the following recipes:
- Tokenization using the Java SDK
- Tokenization using OpenNLP
- Tokenization using maximum entropy
- Training a neural network tokenizer for specialized text
- Identifying the stem of a word
- Training an OpenNLP lemmatization model
- Determining the lexical meaning of a word using OpenNLP
- Removing stop words using LingPipe
Technical requirements
In this chapter, you will need to install the following software, if it has not already been installed:
- Eclipse Photon 4.8.0
- Java JDK 8 or later
We will be using the following APIs, which you will be instructed to add for each recipe as appropriate:
- OpenNLP 1.9.0
- LingPipe 4.1.0
The code files for this chapter can be found at https://github.com/PacktPublishing/Natural-Language-Processing-with-Java-Cookbook/tree/master/Chapter01.
Tokenization using the Java SDK
Tokenization can be achieved using a number of Java classes, including the String, StringTokenizer, and StreamTokenizer classes. In this recipe, we will demonstrate the use of the Scanner class. While frequently used for console input, it can also be used to tokenize a string.
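Before turning to the Scanner class, here is a minimal, self-contained sketch of the simplest alternative mentioned previously: splitting a string on white space with the String class's split method. It is independent of the steps that follow and is only meant to show the idea:
String sampleText =
    "In addition, the rook was moved too far to be effective.";
// Split on one or more white space characters
String tokens[] = sampleText.split("\\s+");
for (String token : tokens) {
    System.out.println(token);
}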
Getting ready
To prepare, we need to create a new Java project.
How to do it...
Let's go through the following steps:
- Add the following import statements to your project's class:
import java.util.ArrayList;
import java.util.Scanner;
- Add the following statements to the main method to declare the sample string, create an instance of the Scanner class, and add a list to hold the tokens:
String sampleText =
"In addition, the rook was moved too far to be effective.";
Scanner scanner = new Scanner(sampleText);
ArrayList<String> list = new ArrayList<>();
- Insert the following loops to populate the list and display the tokens:
while (scanner.hasNext()) {
String token = scanner.next();
list.add(token);
}
for (String token : list) {
System.out.println(token);
}
- Execute the program. You should get the following output:
In
addition,
the
rook
was
moved
too
far
to
be
effective.
How it works...
The Scanner class's constructor took a string as an argument, which allowed us to apply the Scanner class's methods to the sample text. The next method returns a single token at a time, delimited by white space. While it was not necessary to store the tokens in a list, doing so permits us to use them later for different purposes.
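As noted in the chapter introduction, tokens can also be split on delimiters other than white space. The following sketch uses the Scanner class's useDelimiter method, which accepts a regular expression; here, both commas and white space are treated as delimiters, so addition is returned without its trailing comma. The variable name customScanner is used to avoid clashing with the scanner variable from the recipe:
Scanner customScanner = new Scanner(sampleText);
// Treat commas and white space as delimiters
customScanner.useDelimiter("[,\\s]+");
while (customScanner.hasNext()) {
    System.out.println(customScanner.next());
}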
Tokenization using OpenNLP
In this recipe, we will create an instance of the OpenNLP SimpleTokenizer class to illustrate tokenization. We will use its tokenize method against a sample text.
Getting ready
To prepare, we need to do the following:
- Create a new Java project
- Add the following POM dependency to your project:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>
How to do it...
Let's go through the following steps:
- Start by adding the following import statement to your project's class:
import opennlp.tools.tokenize.SimpleTokenizer;
- Next, add the following main method to your project:
public static void main(String[] args) {
String sampleText =
"In addition, the rook was moved too far to be effective.";
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String tokenList[] = simpleTokenizer.tokenize(sampleText);
for (String token : tokenList) {
System.out.println(token);
}
}
After executing the program, you should get the following output:
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.
How it works...
The SimpleTokenizer instance, accessed through the class's INSTANCE field, represents a tokenizer that splits text based on character classes rather than white space alone, which is why punctuation characters are returned as separate tokens. With this tokenizer, we pass a single string to its tokenize method, which returns an array of strings, as shown in the following code:
String sampleText =
"In addition, the rook was moved too far to be effective.";
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE;
String tokenList[] = simpleTokenizer.tokenize(sampleText);
We then iterated through the list of tokens and displayed one per line. Note how the tokenizer treats the comma and the period as tokens.
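For comparison, OpenNLP also provides a WhitespaceTokenizer, which splits only on white space; with it, the comma and period would remain attached to their neighbouring words. The following is a minimal sketch using the same sampleText string and assumes the import of opennlp.tools.tokenize.WhitespaceTokenizer:
WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
// "addition," and "effective." are returned as single tokens
for (String token : whitespaceTokenizer.tokenize(sampleText)) {
    System.out.println(token);
}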
See also
- The OpenNLP API documentation can be found at https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html
Tokenization using maximum entropy
Maximum entropy is a statistical classification technique. It takes various characteristics of a subject, such as the use of specialized words or the presence of whiskers in a picture, and assigns a weight to each characteristic. These weights are eventually added up and normalized to a value between 0 and 1, indicating the probability that the subject is of a particular kind. With a high enough level of confidence, we can conclude that the text is all about high-energy physics or that we have a picture of a cat.
If you're interested, you can find a more complete explanation of this technique at https://nadesnotes.wordpress.com/2016/09/05/natural-language-processing-nlp-fundamentals-maximum-entropy-maxent/. In this recipe, we will demonstrate the use of maximum entropy with the OpenNLP TokenizerME class.
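The following toy calculation is not OpenNLP code; it only illustrates how feature weights are combined and normalized into a probability between 0 and 1. The feature names and weight values are invented for illustration:
// Invented weights for two features ("whiskers", "purrs") and two classes
double catScore = Math.exp(1.2 + 0.8);   // weighted sum for the "cat" class
double dogScore = Math.exp(0.3 - 0.5);   // weighted sum for the "dog" class
// Normalizing forces the result into the 0 to 1 range
double probabilityOfCat = catScore / (catScore + dogScore);
System.out.printf("P(cat) = %.3f%n", probabilityOfCat);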
Getting ready
To prepare, we need to do the following:
- Create a new Maven project.
- Download the en-token.bin file from http://opennlp.sourceforge.net/models-1.5/. Save it in the root directory of the project.
- Add the following POM dependency to your project:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>
How to do it...
Let's go through the following steps:
- Add the following imports to the project:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
- Next, add the following code to the main method. This sequence initializes the text to be processed and creates an input stream to read in the tokenization model. Modify the first argument of the File constructor to reflect the path to the model files:
String sampleText =
"In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
new File("...", "en-token.bin"))) {
...
} catch (FileNotFoundException e) {
// Handle exception
} catch (IOException e) {
// Handle exception
}
- Add the following code to the try block. It creates a tokenizer model and then the actual tokenizer:
TokenizerModel tokenizerModel =
new TokenizerModel(modelInputStream);
Tokenizer tokenizer = new TokenizerME(tokenizerModel);
- Insert the following code sequence that uses the tokenize method to create a list of tokens and then display the tokens:
String tokenList[] = tokenizer.tokenize(sampleText);
for (String token : tokenList) {
System.out.println(token);
}
- Next, execute the program. You should get the following output:
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.
How it works...
The sampleText variable holds the test string. A try-with-resources block is used to automatically close the InputStream. The FileInputStream constructor throws a FileNotFoundException, while the new TokenizerModel(modelInputStream) statement throws an IOException, both of which need to be handled.
An instance of the TokenizerModel class is created using the en-token.bin model. This model has been trained to recognize English text. An instance of the TokenizerME class represents the tokenizer, and its tokenize method is executed against the sample text. This method returns an array of strings that are then displayed. Note that the comma and period are treated as separate tokens.
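In addition to tokenize, the tokenizer can report where each token starts and ends in the original string. The following sketch uses the tokenizePos method, which returns an array of Span objects; it could be added inside the same try block and assumes the import of opennlp.tools.util.Span:
Span[] spans = tokenizer.tokenizePos(sampleText);
for (Span span : spans) {
    // Display each token along with its character offsets
    System.out.println(span.getCoveredText(sampleText)
        + " [" + span.getStart() + ", " + span.getEnd() + ")");
}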
See also
- The OpenNLP API documentation can be found at https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html
Training a neural network tokenizer for specialized text
Sometimes, we need to work with specialized text, such as an uncommon language or text that is unique to a problem domain. In such cases, the standard tokenizers are not always sufficient. This necessitates the creation of a unique model that will work better with the specialized text. In this recipe, we will demonstrate how to train a model using OpenNLP.
Getting ready
To prepare, we need to do the following:
- Create a new Maven project
- Add the following dependency to the POM file:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>
How to do it...
Let's go through the following steps:
- Create a file called training-data.train. Add the following to the file:
The first sentence is terminated by a period<SPLIT>. We will want to be able to identify tokens that are separated by something other than whitespace<SPLIT>. This can include commas<SPLIT>, numbers such as 100.204<SPLIT>, and other punctuation characters including colons:<SPLIT>.
- Next, add the following imports to the program:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
- Next, add the following code to the project's main method. It creates an InputStreamFactory that provides access to the training data:
InputStreamFactory inputStreamFactory = new InputStreamFactory() {
public InputStream createInputStream()
throws FileNotFoundException {
return new FileInputStream(
"C:/NLP Cookbook/Code/chapter2a/training-data.train");
}
};
- Insert the following try block, which will train the model and save it:
try (
ObjectStream<String> stringObjectStream =
new PlainTextByLineStream(inputStreamFactory, "UTF-8");
ObjectStream<TokenSample> tokenSampleStream =
new TokenSampleStream(stringObjectStream);) {
TokenizerModel tokenizerModel = TokenizerME.train(
tokenSampleStream, new TokenizerFactory(
"en", null, true, null),
TrainingParameters.defaultParams());
BufferedOutputStream modelOutputStream =
new BufferedOutputStream(new FileOutputStream(
new File(
"C:/NLP Cookbook/Code/chapter2a/mymodel.bin")));
tokenizerModel.serialize(modelOutputStream);
} catch (IOException ex) {
// Handle exception
}
- To test the new model, we will reuse the code found in the Tokenization using OpenNLP recipe. Add the following code after the preceding try block:
String sampleText = "In addition, the rook was moved too far to be effective.";
try (InputStream modelInputStream = new FileInputStream(
new File("C:/Downloads/OpenNLP/Models", "mymodel.bin"));) {
TokenizerModel tokenizerModel =
new TokenizerModel(modelInputStream);
Tokenizer tokenizer = new TokenizerME(tokenizerModel);
String tokenList[] = tokenizer.tokenize(sampleText);
for (String token : tokenList) {
System.out.println(token);
}
} catch (FileNotFoundException e) {
// Handle exception
} catch (IOException e) {
// Handle exception
}
- When executing the program, you will get an output similar to the following. Some of the training model output has been removed to save space:
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 36 events
Indexing... done.
Sorting and merging events... done. Reduced 36 events to 12.
Done indexing in 0.21 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 12
Number of Outcomes: 2
Number of Predicates: 9
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-24.95329850015802 0.8611111111111112
2: ... loglikelihood=-14.200654164477221 0.8611111111111112
3: ... loglikelihood=-11.526745527757855 0.8611111111111112
4: ... loglikelihood=-9.984657035211438 0.8888888888888888
...
97: ... loglikelihood=-0.7805227945549726 1.0
98: ... loglikelihood=-0.7730211829010772 1.0
99: ... loglikelihood=-0.765664507836384 1.0
100: ... loglikelihood=-0.7584485899716518 1.0
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.
How it works...
To understand how this all works, we will explain the training code, the testing code, and the output. We will start with the training code.
To create a model, we need the training data that was saved in the training-data.train file in step 1. Its contents are as follows:
The first sentence is terminated by a period<SPLIT>. We will want to be able to identify tokens that are separated by something other than whitespace<SPLIT>. This can include commas<SPLIT>, numbers such as 100.204<SPLIT>, and other punctuation characters including colons:<SPLIT>.
The <SPLIT> markup has been added just before the places where the tokenizer should split tokens at locations other than white space. Normally, we would use a larger set of data to obtain a better model. For our purposes, this file will work.
We created an instance of the InputStreamFactory to represent the training data file, as shown in the following code:
InputStreamFactory inputStreamFactory = new InputStreamFactory() {
public InputStream createInputStream()
throws FileNotFoundException {
return new FileInputStream("training-data.train");
}
};
An object stream that reads from the file is created in the try block. The PlainTextByLineStream class processes plain text line by line. This stream was then used to create another input stream of TokenSample objects, providing a usable form for training the model, as shown in the following code:
try (
ObjectStream<String> stringObjectStream =
new PlainTextByLineStream(inputStreamFactory, "UTF-8");
ObjectStream<TokenSample> tokenSampleStream =
new TokenSampleStream(stringObjectStream);) {
...
} catch (IOException ex) {
// Handle exception
}
The train method performed the training. It takes the token stream, a TokenizerFactory instance, and a set of training parameters. The TokenizerFactory instance provides the basic tokenizer. Its arguments include the language used and other factors, such as an abbreviation dictionary. In this example, English is the language, and the other arguments are not used. We used the default set of training parameters, as shown in the following code:
TokenizerModel tokenizerModel = TokenizerME.train(
tokenSampleStream, new TokenizerFactory("en", null, true, null),
TrainingParameters.defaultParams());
Once the model was trained, we saved it to the mymodel.bin file using the serialize method:
BufferedOutputStream modelOutputStream = new BufferedOutputStream(
new FileOutputStream(new File("mymodel.bin")));
tokenizerModel.serialize(modelOutputStream);
To test the model, we reused the tokenization code found in the Tokenization using OpenNLP recipe. You can refer to that recipe for an explanation of the code.
The output of the preceding code displays various statistics, such as the number of passes and iterations performed, followed by one token per line, as shown in the following output. Note that the comma and period are treated as separate tokens using this model:
In
addition
,
the
rook
was
moved
too
far
to
be
effective
.
There's more...
The training process can be tailored using training parameters. Details of how to use these parameters are hard to find; however, cut-off and iteration are described at: https://stackoverflow.com/questions/30238014/what-is-the-meaning-of-cut-off-and-iteration-for-trainings-in-opennlp.
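As a rough sketch of how the defaults could be overridden in this recipe's training code, a TrainingParameters instance can be built by hand and passed to TokenizerME.train in place of TrainingParameters.defaultParams(). The parameter values below are arbitrary examples, not recommendations:
TrainingParameters trainingParameters = new TrainingParameters();
// Run more iterations and lower the cutoff for rare events
trainingParameters.put(TrainingParameters.ITERATIONS_PARAM, "200");
trainingParameters.put(TrainingParameters.CUTOFF_PARAM, "3");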
See also
- The OpenNLP API can be found at: https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html
- See the Tokenization using OpenNLP recipe for an explanation of how the model is tested
Identifying the stem of a word
Finding the stem of a word is easy to do. We will illustrate this process using OpenNLP’s PorterStemmer class.
Getting ready
To prepare, we need to do the following:
- Create a new Maven project
- Add the following dependency to the POM file:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>
How to do it...
Let's go through the following steps:
- Add the following import statement to the program:
import opennlp.tools.stemmer.PorterStemmer;
- Then, add the following code to the main method:
String wordList[] =
{ "draft", "drafted", "drafting", "drafts",
"drafty", "draftsman" };
PorterStemmer porterStemmer = new PorterStemmer();
for (String word : wordList) {
String stem = porterStemmer.stem(word);
System.out.println("The stem of " + word + " is " + stem);
}
- Execute the program. The output should be as follows:
The stem of draft is draft
The stem of drafted is draft
The stem of drafting is draft
The stem of drafts is draft
The stem of drafty is drafti
The stem of draftsman is draftsman
How it works...
We start by creating an array of strings that will hold words that we will use with the stemmer:
String wordList[] =
{ "draft", "drafted", "drafting", "drafts", "drafty", "draftsman" };
The OpenNLP PorterStemmer class supports finding the stem of a word. It has a single, default constructor, which is used to create an instance of the class, as shown in the following code:
PorterStemmer porterStemmer = new PorterStemmer();
The remainder of the code iterates over the array and invokes the stem method against each word in the array, as shown in the following code:
for (String word : wordList) {
String stem = porterStemmer.stem(word);
System.out.println("The stem of " + word + " is " + stem);
}
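OpenNLP also includes a Snowball stemmer, which supports several languages and may produce slightly different stems than the Porter implementation. The following is a minimal sketch using the same word list; it assumes the import of opennlp.tools.stemmer.snowball.SnowballStemmer:
SnowballStemmer snowballStemmer =
    new SnowballStemmer(SnowballStemmer.ALGORITHM.ENGLISH);
for (String word : wordList) {
    // stem accepts a CharSequence and returns a CharSequence
    System.out.println("The stem of " + word + " is "
        + snowballStemmer.stem(word));
}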
See also
- The OpenNLP API can be found at https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html
- The process of lemmatization is discussed in the Determining the lexical meaning of a word recipe
- A comparison of stemming versus lemmatization can be found at https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
Training an OpenNLP lemmatization model
We will train a model using OpenNLP that can be used to perform lemmatization. The actual process of performing lemmatization is illustrated in the following recipe, Determining the lexical meaning of a word using OpenNLP.
Getting ready
The most straightforward technique to train a model is to use the OpenNLP command-line tools. Download these tools from the OpenNLP page at https://opennlp.apache.org/download.html. We will not need the source code for these tools, so download the file named apache-opennlp-1.9.0-bin.tar.gz. Selecting that file will take you to a page that lists mirror sites for the file. Choose one that will work best for your location.
Once the file has been saved, expand the file. This will extract a .tar file. Next, expand this file, which will create a directory called apache-opennlp-1.9.0. In its bin subdirectory, you will find the tools that we need.
We will need training data for the training process. We will use the en-lemmatizer.dict file found at https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict. Use a browser to open this page and then save it using the file name en-lemmatizer.dict.
How to do it...
Let's go through the following steps:
- Open a command-line window. We used the Windows cmd program in this example.
- Add the OpenNLP tools' bin directory to your path and then navigate to the directory containing the en-lemmatizer.dict file.
- Execute the following command:
opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data en-lemmatizer.dict -encoding UTF-8
You will get the following output. It has been shortened here to save space:
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 301403 events
Indexing... done.
Sorting and merging events... done. Reduced 301403 events to 297777.
Done indexing in 9.09 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 297777
Number of Outcomes: 432
Number of Predicates: 69122
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-1829041.6775780176 3.317817009120679E-6
2: ... loglikelihood=-452333.43760414346 0.876829361353404
3: ... loglikelihood=-211099.05280473927 0.9506806501594212
4: ... loglikelihood=-132195.3981804198 0.9667554735686108
...
98: ... loglikelihood=-6702.5821153954375 0.9988420818638168
99: ... loglikelihood=-6652.6134177562335 0.998845399680826
100: ... loglikelihood=-6603.518040975329 0.9988553531318534
Writing lemmatizer model
... done (1.274s)
Wrote lemmatizer model to
path: C:\Downloads\OpenNLP\en-lemmatizer.bin
Execution time: 275.369 seconds
How it works...
To understand the output, we need to explain the following command:
opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data en-lemmatizer.dict -encoding UTF-8
The opennlp command is used with a number of OpenNLP tools. The tool to be used is specified by the command's first argument. In this example, we used the LemmatizerTrainerME tool. The arguments that follow control how the training process works. The LemmatizerTrainerME arguments are documented at https://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.cli.lemmatizer.LemmatizerTrainerME.
We use the -model, -lang, -data, and -encoding arguments, as detailed in the following list:
- The -model argument specifies the name of the model output file. This is the file that holds the trained model that we will use in the next recipe.
- The -lang argument specifies the natural language used. In this case, we use en, which indicates the training data is English.
- The -data argument specifies the file containing the training data. We used the en-lemmatizer.dict file.
- The -encoding parameter specifies the character set used by the training data. We used UTF-8, which indicates the data is Unicode data.
The output shows the training process. It displays various statistics, such as the number of passes and iterations performed. During each iteration, the probability increases, as shown in the following output. By the 100th iteration, the probability approaches 1.0.
Performing 100 iterations:
1: ... loglikelihood=-1829041.6775780176 3.317817009120679E-6
2: ... loglikelihood=-452333.43760414346 0.876829361353404
3: ... loglikelihood=-211099.05280473927 0.9506806501594212
4: ... loglikelihood=-132195.3981804198 0.9667554735686108
...
98: ... loglikelihood=-6702.5821153954375 0.9988420818638168
99: ... loglikelihood=-6652.6134177562335 0.998845399680826
100: ... loglikelihood=-6603.518040975329 0.9988553531318534
Writing lemmatizer model ... done (1.274s)
The final part of the output shows where the file is written. We wrote the lemmatizer model to the path C:\Downloads\OpenNLP\en-lemmatizer.bin.
There's more...
If you have specialized lemmatization needs, then you will need to create a training file. The training data file consists of a series of lines. Each line consists of three entries separated by spaces. The first entry contains a word. The second entry is the POS tag for the word. The third entry is the lemma for the word.
For example, in en-lemmatizer.dict, there are several lines for variations of the word bump, as shown in the following code:
bump NN bump
bump VB bump
bump VBP bump
bumped VBD bump
bumped VBN bump
bumper JJ bumper
bumper NN bumper
As you can see, a word may be used in different contexts and with different suffixes. Other datasets can be used for training. These include the Penn Treebank (https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu/~treebank/) and the CoNLL 2009 datasets (https://www.ldc.upenn.edu/).
Training parameters other than the default parameters can be specified depending on the needs of the problem.
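The LemmatizerTrainerME command-line tool is a thin wrapper around the OpenNLP API, so the same training can also be performed in Java, and a custom TrainingParameters instance can be supplied at that point instead of the defaults. The following is only a sketch, assuming the opennlp.tools.lemmatizer and opennlp.tools.util imports are in place and that en-lemmatizer.dict is in the working directory; consult the API documentation before relying on it:
InputStreamFactory dictInputStreamFactory = new InputStreamFactory() {
    public InputStream createInputStream() throws IOException {
        return new FileInputStream("en-lemmatizer.dict");
    }
};
try (ObjectStream<String> lineStream =
        new PlainTextByLineStream(dictInputStreamFactory, "UTF-8");
    ObjectStream<LemmaSample> sampleStream =
        new LemmaSampleStream(lineStream)) {
    // TrainingParameters.defaultParams() could be replaced with a custom instance
    LemmatizerModel lemmatizerModel = LemmatizerME.train("en", sampleStream,
        TrainingParameters.defaultParams(), new LemmatizerFactory());
    try (BufferedOutputStream modelOutputStream = new BufferedOutputStream(
            new FileOutputStream("en-lemmatizer.bin"))) {
        lemmatizerModel.serialize(modelOutputStream);
    }
} catch (IOException e) {
    // Handle exception
}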
In the next recipe, Determining the lexical meaning of a word using OpenNLP, we will use the model to determine the lexical meaning of a word.
See also
- The OpenNLP API can be found at https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html
Determining the lexical meaning of a word using OpenNLP
In this recipe, we will use the model we created in the previous recipe to perform lemmatization. We will perform lemmatization on the following sentence:
The girls were leaving the clubhouse for another adventurous afternoon.
In the example, the lemmas for each word in the sentence will be displayed.
Getting ready
To prepare, we need to do the following:
- Create a new Maven project
- Add the following dependency to the POM file:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.9.0</version>
</dependency>
How to do it...
Let's go through the following steps:
- Add the following imports to the project:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;
- Add the following try block to the main method. An input stream is created, followed by the instantiation of the lemmatization model and the lemmatizer:
LemmatizerModel lemmatizerModel = null;
try (InputStream modelInputStream = new FileInputStream(
"C:\\Downloads\\OpenNLP\\en-lemmatizer.bin")) {
lemmatizerModel = new LemmatizerModel(modelInputStream);
LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);
…
} catch (FileNotFoundException e) {
// Handle exception
} catch (IOException e) {
// Handle exception
}
- Add the following code to the end of the try block. It sets up arrays holding the words of the sample text and their POS tags. It then performs the lemmatization and displays the results:
String[] tokens = new String[] {
"The", "girls", "were", "leaving", "the",
"clubhouse", "for", "another", "adventurous",
"afternoon", "." };
String[] posTags = new String[] { "DT", "NNS", "VBD",
"VBG", "DT", "NN", "IN", "DT", "JJ", "NN", "." };
String[] lemmas = lemmatizer.lemmatize(tokens, posTags);
for (int i = 0; i < tokens.length; i++) {
System.out.println(tokens[i] + " - " + lemmas[i]);
}
- Upon executing the program, you will get the following output that displays each word and then its lemma:
The - the
girls - girl
were - be
leaving - leave
the - the
clubhouse - clubhouse
for - for
another - another
adventurous - adventurous
afternoon - afternoon
. - .
How it works...
We performed lemmatization on the sentence The girls were leaving the clubhouse for another adventurous afternoon. A LemmatizerModel was declared and instantiated from the en-lemmatizer.bin file. A try-with-resources block was used to obtain an input stream for the file, as shown in the following code:
LemmatizerModel lemmatizerModel = null;
try (InputStream modelInputStream = new FileInputStream(
"C:\\Downloads\\OpenNLP\\en-lemmatizer.bin")) {
lemmatizerModel = new LemmatizerModel(modelInputStream);
Next, the lemmatizer was created using the LemmatizerME class, as shown in the following code:
LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);
The sentence to be processed is represented as an array of strings (tokens). We also need an array of POS tags for the lemmatization process to work. This array was defined in parallel with the token array. As we will see in Chapter 4, Detecting POS Using Neural Networks, there are often alternative tags that are possible for a sentence. For this example, we used tags generated by the Cognitive Computation Group's online tool at http://cogcomp.org/page/demo_view/pos:
String[] tokens = new String[] {
"The", "girls", "were", "leaving", "the",
"clubhouse", "for", "another", "adventurous",
"afternoon", "." };
String[] posTags = new String[] { "DT", "NNS", "VBD",
"VBG", "DT", "NN", "IN", "DT", "JJ", "NN", "." };
The lemmatization then occurred, where the lemmatize method uses the two arrays to build an array of lemmas for each word in the sentence, as shown in the following code:
String[] lemmas = lemmatizer.lemmatize(tokens, posTags);
The lemmas are then displayed, as shown in the following code:
for (int i = 0; i < tokens.length; i++) {
System.out.println(tokens[i] + " - " + lemmas[i]);
}
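The POS tags do not have to be typed in by hand. Although POS tagging is covered in Chapter 4, the following sketch shows how the tags could be produced with OpenNLP's POSTaggerME class, assuming the pre-trained en-pos-maxent.bin model (available from the same models-1.5 page as en-token.bin) has been downloaded and the opennlp.tools.postag imports are in place; the tags it generates may differ slightly from the hand-built array used here:
try (InputStream posModelStream =
        new FileInputStream("en-pos-maxent.bin")) {
    POSModel posModel = new POSModel(posModelStream);
    POSTaggerME posTagger = new POSTaggerME(posModel);
    // Generate the tags for the token array declared earlier
    String[] generatedTags = posTagger.tag(tokens);
    System.out.println(String.join(" ", generatedTags));
} catch (IOException e) {
    // Handle exception
}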
See also
- The OpenNLP API can be found at https://opennlp.apache.org/docs/1.9.0/apidocs/opennlp-tools/index.html
- The Training an OpenNLP lemmatization model recipe shows how the model was trained
Removing stop words using LingPipe
Normalization is the process of preparing text for subsequent analysis. This is frequently performed once the text has been tokenized. Normalization activities include such tasks as converting the text to lowercase, validating data, inserting missing elements, stemming, lemmatization, and removing stop words.
We have already examined the stemming and lemmatization process in earlier recipes. In this recipe, we will show how stop words can be removed. Stop words are those words that are not always useful. For example, some downstream NLP tasks do not need to have words such as a, the, or and. These types of words are the common words found in a language. Analysis can often be enhanced by removing them from a text.
Getting ready
To prepare, we need to do the following:
- Create a new Maven project
- Add the following dependency to the POM file:
<dependency>
<groupId>de.julielab</groupId>
<artifactId>aliasi-lingpipe</artifactId>
<version>4.1.0</version>
</dependency>
How to do it...
Let's go through the following steps:
- Add the following import statements to your program:
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
- Add the following code to the main method:
String sentence =
"The blue goose and a quiet lamb stopped to smell the roses.";
TokenizerFactory tokenizerFactory =
IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory =
new EnglishStopTokenizerFactory(tokenizerFactory);
Tokenizer tokenizer = tokenizerFactory.tokenizer(
sentence.toCharArray(), 0, sentence.length());
for (String token : tokenizer) {
System.out.println(token);
}
- Execute the program. You will get the following output:
The
blue
goose
quiet
lamb
stopped
smell
roses
.
How it works...
The example started with the declaration of a sample sentence, as shown in the following code. The program will display the words found in the sentence with the stop words removed:
String sentence =
"The blue goose and a quiet lamb stopped to smell the roses.";
An instance of LingPipe's IndoEuropeanTokenizerFactory is used to provide a means of tokenizing the sentence. It is used as the argument to the EnglishStopTokenizerFactory constructor, which provides a stop word tokenizer, as shown in the following code:
TokenizerFactory tokenizerFactory =
IndoEuropeanTokenizerFactory.INSTANCE;
tokenizerFactory = new EnglishStopTokenizerFactory(tokenizerFactory);
The tokenizer method is invoked against the sentence, where its second and third parameters specify which part of the sentence to tokenize. The Tokenizer class represents the tokens extracted from the sentence:
Tokenizer tokenizer = tokenizerFactory.tokenizer(
sentence.toCharArray(), 0, sentence.length());
The Tokenizer class implements the Iterable<String> interface that we utilized in the following for-each statement to display the tokens:
for (String token : tokenizer) {
System.out.println(token);
}
Note that in the output, repeated as follows, the first word of the sentence, The, was not removed because the stop list only matches lowercase tokens, and the terminating period was also kept. Otherwise, common stop words were removed:
The
blue
goose
quiet
lamb
stopped
smell
roses
.
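Because the stop list matches lowercase tokens, a common refinement is to lowercase each token before the stop word filter is applied. The following is a minimal sketch that chains LingPipe's LowerCaseTokenizerFactory between the two factories used earlier; it assumes the import of com.aliasi.tokenizer.LowerCaseTokenizerFactory. With this change, the leading The is also removed, and the remaining tokens are reported in lowercase:
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
// Lowercase each token before the stop word filter sees it
factory = new LowerCaseTokenizerFactory(factory);
factory = new EnglishStopTokenizerFactory(factory);
Tokenizer lowerCaseTokenizer = factory.tokenizer(
    sentence.toCharArray(), 0, sentence.length());
for (String token : lowerCaseTokenizer) {
    System.out.println(token);
}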
See also
- The LingPipe API can be found at http://alias-i.com/lingpipe/docs/api/index.html