Natural Language Processing (NLP) is concerned with the interaction between natural language and computers. It is one of the major components of Artificial Intelligence (AI) and computational linguistics. It provides seamless interaction between computers and human beings and gives computers the ability to understand human language with the help of machine learning. The fundamental data type used to represent the contents of a file or a document in programming languages (for example, C, C++, Java, Python, and so on) is the string. In this chapter, we will explore various operations that can be performed on strings, which will be useful for accomplishing various NLP tasks.
This chapter will include the following topics:
Tokenization of text
Normalization of text
Substituting and correcting tokens
Applying Zipf's law to text
Applying similarity measures using the Edit Distance Algorithm
Applying similarity measures using Jaccard's Coefficient
Applying similarity measures using Smith Waterman
Tokenization may be defined as the process of splitting the text into smaller parts called tokens, and is considered a crucial step in NLP.
When NLTK is installed and Python IDLE is running, we can perform the tokenization of text or paragraphs into individual sentences. To perform tokenization, we can import the sentence tokenization function. The argument of this function will be the text that needs to be tokenized. The sent_tokenize function uses an instance of NLTK's PunktSentenceTokenizer. This instance has already been trained to perform tokenization on different European languages on the basis of letters or punctuation that mark the beginning and end of sentences.
Now, let's see how a given text is tokenized into individual sentences:
>>> import nltk
>>> text=" Welcome readers. I hope you find it interesting. Please do reply."
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']
So, a given text is split into individual sentences. Further, we can perform processing on the individual sentences.
To tokenize a large number of sentences, we can load PunktSentenceTokenizer and use its tokenize() function to perform tokenization. This can be seen in the following code:
>>> import nltk
>>> tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
>>> text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
>>> tokenizer.tokenize(text)
[' Hello everyone.', 'Hope all are fine and doing well.', 'Hope you find the book interesting']
To perform tokenization in languages other than English, we can load the respective language pickle file found in tokenizers/punkt and then tokenize the text in that language, which is passed as the argument of the tokenize() function. For the tokenization of French text, we will use the french.pickle file as follows:
>>> import nltk
>>> french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
>>> french_tokenizer.tokenize('Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britanniquedeLevallois-Perret. Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire')
['Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britanniquedeLevallois-Perret.', 'Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois.', 'L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire.', 'L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire']
Now, we'll perform processing on individual sentences. Individual sentences are tokenized into words. Word tokenization is performed using the word_tokenize() function. The word_tokenize function uses an instance of NLTK's TreebankWordTokenizer to perform word tokenization.
The tokenization of English text using word_tokenize is shown here:
>>> import nltk
>>> text=nltk.word_tokenize("PierreVinken , 59 years old , will join as a nonexecutive director on Nov. 29 .")
>>> print(text)
['PierreVinken', ',', '59', 'years', 'old', ',', 'will', 'join', 'as', 'a', 'nonexecutive', 'director', 'on', 'Nov.', '29', '.']
Tokenization of words can also be done by loading TreebankWordTokenizer and then calling its tokenize() function, whose argument is a sentence that needs to be tokenized into words. This instance of NLTK has already been trained to perform the tokenization of sentences into words on the basis of spaces and punctuation.
The following code will help us obtain user input, tokenize it, and evaluate its length:
>>> import nltk
>>> from nltk import word_tokenize
>>> r=input("Please write a text")
Please write a textToday is a pleasant day
>>> print("The length of text is",len(word_tokenize(r)),"words")
The length of text is 5 words
Let's have a look at the code that performs tokenization using TreebankWordTokenizer:
>>> import nltk
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize("Have a nice day. I hope you find the book interesting")
['Have', 'a', 'nice', 'day.', 'I', 'hope', 'you', 'find', 'the', 'book', 'interesting']
TreebankWordTokenizer uses the conventions of the Penn Treebank corpus. It works by separating contractions. This is shown here:
>>> import nltk >>> text=nltk.word_tokenize(" Don't hesitate to ask questions") >>> print(text) ['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
Another word tokenizer is PunktWordTokenizer. It works by splitting on punctuation, but keeps the punctuation with the word instead of creating an entirely new token. Yet another word tokenizer is WordPunctTokenizer. It splits by making punctuation an entirely new token. This type of splitting is usually desirable:
>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer=WordPunctTokenizer()
>>> tokenizer.tokenize(" Don't hesitate to ask questions")
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
The tokenization of words can be performed by constructing regular expressions in these two ways:
By matching with words
By matching spaces or gaps
We can import RegexpTokenizer from NLTK and create a regular expression that matches the tokens present in the text:
>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']
Instead of instantiating the class, an alternative way to tokenize is to use this function:
>>> import nltk
>>> from nltk.tokenize import regexp_tokenize
>>> sent="Don't hesitate to ask questions"
>>> print(regexp_tokenize(sent, pattern='\w+|\$[\d\.]+|\S+'))
['Don', "'t", 'hesitate', 'to', 'ask', 'questions']
RegexpTokenizer uses the re.findall() function to perform tokenization by matching tokens. It uses the re.split() function to perform tokenization by matching gaps or spaces.
Let's have a look at an example of how to tokenize using whitespaces:
>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer('\s+',gaps=True)
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']
To select the words starting with a capital letter, the following code is used:
>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> capt = RegexpTokenizer('[A-Z]\w+')
>>> capt.tokenize(sent)
['She', 'She']
The following code shows how a predefined regular expression is used by a subclass of RegexpTokenizer:
>>> import nltk >>> sent=" She secured 90.56 % in class X . She is a meritorious student" >>> from nltk.tokenize import BlanklineTokenizer >>> BlanklineTokenizer().tokenize(sent) [' She secured 90.56 % in class X \n. She is a meritorious student\n']
The tokenization of strings can be done using whitespace (tab, space, or newline):
>>> import nltk >>> sent=" She secured 90.56 % in class X . She is a meritorious student" >>> from nltk.tokenize import WhitespaceTokenizer >>> WhitespaceTokenizer().tokenize(sent) ['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
WordPunctTokenizer makes use of the regular expression \w+|[^\w\s]+ to perform the tokenization of text into alphabetic and non-alphabetic characters.
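As a quick illustration (this snippet is ours, not part of NLTK), applying the same pattern directly with re.findall() reproduces this behavior:

>>> import re
>>> re.findall(r"\w+|[^\w\s]+", "Don't hesitate to ask questions")
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']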
Tokenization using the split() method is depicted in the following code:
>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> sent.split()
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
>>> sent.split(' ')
['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> sent.split('\n')
[' She secured 90.56 % in class X ', '. She is a meritorious student', '']
Similar to sent.split('\n'), LineTokenizer works by tokenizing text into lines:
>>> import nltk
>>> from nltk.tokenize import BlanklineTokenizer
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> BlanklineTokenizer().tokenize(sent)
[' She secured 90.56 % in class X \n. She is a meritorious student\n']
>>> from nltk.tokenize import LineTokenizer
>>> LineTokenizer(blanklines='keep').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']
>>> LineTokenizer(blanklines='discard').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']
SpaceTokenizer works similarly to sent.split(' '):
>>> import nltk >>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n" >>> from nltk.tokenize import SpaceTokenizer >>> SpaceTokenizer().tokenize(sent) ['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '\n.', 'She', 'is', 'a', 'meritorious', 'student\n']
The nltk.tokenize.util module works by returning a sequence of tuples that are the offsets of the tokens in a sentence:
>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(WhitespaceTokenizer().span_tokenize(sent))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31), (33, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 63)]
Given a sequence of spans, the sequence of relative spans can be returned:
>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> from nltk.tokenize.util import spans_to_relative
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(spans_to_relative(WhitespaceTokenizer().span_tokenize(sent)))
[(1, 3), (1, 7), (1, 5), (1, 1), (1, 2), (1, 5), (1, 1), (2, 1), (1, 3), (1, 2), (1, 1), (1, 11), (1, 7)]
The nltk.tokenize.util.string_span_tokenize(sent, separator) function returns the offsets of the tokens in sent, computed by splitting at each occurrence of the separator:
>>> import nltk
>>> from nltk.tokenize.util import string_span_tokenize
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(string_span_tokenize(sent, " "))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31), (32, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 64)]
In order to carry out processing on natural language text, we need to perform normalization that mainly involves eliminating punctuation, converting the entire text into lowercase or uppercase, converting numbers into words, expanding abbreviations, canonicalization of text, and so on.
Sometimes, while tokenizing, it is desirable to remove punctuation. Removal of punctuation is considered one of the primary tasks while doing normalization in NLTK.
Consider the following example:
>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."] >>> from nltk.tokenize import word_tokenize >>> tokenized_docs=[word_tokenize(doc) for doc in text] >>> print(tokenized_docs) [['It', 'is', 'a', 'pleasant', 'evening', '.'], ['Guests', ',', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was', 'tasty', '.']]
The preceding code obtains the tokenized text. The following code will remove punctuation from tokenized text:
>>> import re
>>> import string
>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
>>> from nltk.tokenize import word_tokenize
>>> tokenized_docs=[word_tokenize(doc) for doc in text]
>>> x=re.compile('[%s]' % re.escape(string.punctuation))
>>> tokenized_docs_no_punctuation = []
>>> for review in tokenized_docs:
        new_review = []
        for token in review:
            new_token = x.sub(u'', token)
            if not new_token == u'':
                new_review.append(new_token)
        tokenized_docs_no_punctuation.append(new_review)
>>> print(tokenized_docs_no_punctuation)
[['It', 'is', 'a', 'pleasant', 'evening'], ['Guests', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was', 'tasty']]
A given text can be converted entirely into lowercase or uppercase using the lower() and upper() functions. The task of converting text into uppercase or lowercase falls under the category of normalization.
Consider the following example of case conversion:
>>> text='HARdWork IS KEy to SUCCESS'
>>> print(text.lower())
hardwork is key to success
>>> print(text.upper())
HARDWORK IS KEY TO SUCCESS
Stop words are words that need to be filtered out during information retrieval or other natural language tasks, as these words do not contribute much to the overall meaning of the sentence. Many search engines work by deleting stop words so as to reduce the search space. The elimination of stop words is one of the normalization tasks that is crucial in NLP.
NLTK has a list of stop words for many languages. We need to unzip the data file so that the list of stop words can be accessed from nltk_data/corpora/stopwords/:
>>> import nltk
>>> from nltk.corpus import stopwords
>>> stops=set(stopwords.words('english'))
>>> words=["Don't", 'hesitate','to','ask','questions']
>>> [word for word in words if word not in stops]
["Don't", 'hesitate', 'ask', 'questions']
The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. It has a words() function, whose argument is a fileid. Here, the fileid is english; this refers to all the stop words present in the English file. If the words() function has no argument, it refers to all the stop words of all the languages.
The other languages for which stop word removal can be done, that is, the languages whose stop word files are present in NLTK, can be found using the fileids() function:
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']
Any of these previously listed languages can be used as an argument to the words() function so as to get the stop words in that language.
Let's see an example that lists the English stop words and then uses them to calculate the fraction of content words in a text:
>>> import nltk
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> def para_fraction(text):
        stopwords = nltk.corpus.stopwords.words('english')
        para = [w for w in text if w.lower() not in stopwords]
        return len(para) / len(text)
>>> para_fraction(nltk.corpus.reuters.words())
0.7364374824583169
>>> para_fraction(nltk.corpus.inaugural.words())
0.5229560503653893
Normalization may also involve converting numbers into words (for example, 1 can be replaced by one) and expanding abbreviations (for instance, can't can be replaced by cannot). This can be achieved by representing them in replacement patterns, as discussed in detail in the next section.
In this section, we will discuss the replacement of tokens with other tokens. We will also discuss how we can correct the spelling of tokens by replacing incorrectly spelled tokens with correctly spelled ones.
In order to remove errors or perform text normalization, word replacement is done. One way text replacement can be done is by using regular expressions. Previously, we faced problems while performing tokenization on contractions. Using text replacement, we can replace contractions with their expanded versions. For example, doesn't can be replaced by does not.
We will begin by writing the following code, naming this program replacers.py, and saving it in the nltkdata folder:
import re

replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', '\g<1> will'),
    (r'(\w+)n\'t', '\g<1> not'),
    (r'(\w+)\'ve', '\g<1> have'),
    (r'(\w+)\'s', '\g<1> is'),
    (r'(\w+)\'re', '\g<1> are'),
    (r'(\w+)\'d', '\g<1> would')
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
Here, the replacement patterns are defined as pairs in which the first term denotes the pattern to be matched and the second term is its corresponding replacement. The RegexpReplacer class has been defined to compile the pattern pairs, and it provides a method called replace(), whose function is to perform the replacement of a pattern with another pattern.
Let's see an example of how we can substitute a text with another text:
>>> import nltk
>>> from replacers import RegexpReplacer
>>> replacer= RegexpReplacer()
>>> replacer.replace("Don't hesitate to ask questions")
'Do not hesitate to ask questions'
>>> replacer.replace("She must've gone to the market but she didn't go")
'She must have gone to the market but she did not go'
The function of RegexpReplacer.replace() is to substitute every instance of a replacement pattern with its corresponding substitution pattern. Here, must've is replaced by must have and didn't is replaced by did not, since the replacement patterns in replacers.py have already been defined by the tuple pairs (r'(\w+)\'ve', '\g<1> have') and (r'(\w+)n\'t', '\g<1> not').
We can not only perform the replacement of contractions; we can also substitute a token with any other token. Token substitution can be performed prior to tokenization so as to avoid the problem that occurs during the tokenization of contractions:
>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> from replacers import RegexpReplacer
>>> replacer=RegexpReplacer()
>>> word_tokenize("Don't hesitate to ask questions")
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
>>> word_tokenize(replacer.replace("Don't hesitate to ask questions"))
['Do', 'not', 'hesitate', 'to', 'ask', 'questions']
Sometimes, people write words with repeating characters, which causes grammatical errors. For instance, consider the sentence I like it lotttttt. Here, lotttttt refers to lot. So now, we'll eliminate these repeating characters using the backreference approach, in which a character refers to the previous characters in a group in a regular expression. This is also considered one of the normalization tasks.
First, append the following code to the previously created replacers.py:
class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word
Let's see an example of how we can delete repeating characters from a token:
>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer=RepeatReplacer()
>>> replacer.replace('lotttt')
'lot'
>>> replacer.replace('ohhhhh')
'oh'
>>> replacer.replace('ooohhhhh')
'oh'
The RepeatReplacer class works by compiling the regular expression and the replacement string, which is defined using backreferences. The repeat_regexp pattern in replacers.py matches zero or more starting characters (\w*), a character (\w) followed by the same character (the backreference \2), and zero or more ending characters (\w*).
For example, lotttt is matched as (lo)(t)t(tt). Here, one t is removed and the string becomes lottt. The process continues, and finally, the resultant string obtained is lot.
The problem with RepeatReplacer is that it will convert happy to hapy, which is inappropriate. To avoid this problem, we can embed a wordnet lookup along with it.
In the replacers.py program created previously, add the following lines to include wordnet:
import re
from nltk.corpus import wordnet

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word
Now, let's take a look at how the previously mentioned problem can be overcome:
>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer=RepeatReplacer()
>>> replacer.replace('happy')
'happy'
Now we will see how we can substitute a given word with its synonym. To the already existing replacers.py, we can add a class called WordReplacer that provides a mapping between a word and its synonym:
class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        return self.word_map.get(word, word)
Let's have a look at an example of substituting a word with its synonym:
>>> import nltk
>>> from replacers import WordReplacer
>>> replacer=WordReplacer({'congrats':'congratulations'})
>>> replacer.replace('congrats')
'congratulations'
>>> replacer.replace('maths')
'maths'
In this code, the replace() function looks for the corresponding synonym of a word in word_map. If a synonym is present for a given word, the word will be replaced by its synonym. If the synonym for a given word is not present, then no replacement is performed and the same word is returned.
Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in the frequency-sorted list. This law describes how tokens are distributed in languages: some tokens occur very frequently, some occur with intermediate frequency, and some tokens rarely occur.
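In its simplest form, the law may be written as follows, where f is the frequency of a token, r is its rank in the frequency-sorted list, and C is a corpus-dependent constant:

f(r) ≈ C / r

In other words, the product of rank and frequency is roughly constant: the most frequent token occurs about twice as often as the second most frequent token, three times as often as the third, and so on.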
Let's see the code for obtaining the log-log plot in NLTK that is based on Zipf's law:
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.probability import FreqDist
>>> import matplotlib
>>> matplotlib.use('TkAgg')
>>> import matplotlib.pyplot as plt
>>> fd = FreqDist()
>>> for text in gutenberg.fileids():
...     for word in gutenberg.words(text):
...         fd[word] += 1
>>> ranks = []
>>> freqs = []
>>> for rank, (word, freq) in enumerate(fd.most_common()):
...     ranks.append(rank+1)
...     freqs.append(freq)
>>> plt.loglog(ranks, freqs)
>>> plt.xlabel('rank(r)', fontsize=14, fontweight='bold')
>>> plt.ylabel('frequency(f)', fontsize=14, fontweight='bold')
>>> plt.grid(True)
>>> plt.show()
The preceding code will obtain a log-log plot of rank versus the frequency of words in a document. So, we can check whether Zipf's law holds for a given document by examining the inverse relationship between the rank and the frequency of words.
There are many similarity measures that can be used for performing NLP tasks. The nltk.metrics package in NLTK provides various evaluation and similarity measures, which are useful for many NLP tasks.
In order to test the performance of taggers, chunkers, and so on, in NLP, the standard scores retrieved from information retrieval can be used.
Let's have a look at how the output of a named entity recognizer can be analyzed using the standard scores obtained from a training file:
>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> training='PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split()
>>> testing='PERSON OTHER OTHER OTHER OTHER OTHER'.split()
>>> print(accuracy(training,testing))
0.6666666666666666
>>> trainset=set(training)
>>> testset=set(testing)
>>> precision(trainset,testset)
1.0
>>> print(recall(trainset,testset))
0.6666666666666666
>>> print(f_measure(trainset,testset))
0.8
The edit distance, or Levenshtein edit distance, between two strings is the minimum number of characters that must be inserted, substituted, or deleted in order to make the two strings equal.
The operations performed in Edit Distance include the following:

Copying a letter from the first string to the second string (cost 0) or substituting a letter with another (cost 1)

Deleting a letter from the first string (cost 1)

Inserting a letter into the second string (cost 1)

These operations give the following recurrence, where D(i,j) is the distance between the first i letters of s and the first j letters of t:

D(i,j) = min( D(i-1,j-1) + d(si,tj)   (substitution / copy),
              D(i,j-1) + 1            (deletion),
              D(i-1,j) + 1            (insertion) )
The Python code for Edit Distance that is included in the nltk.metrics package is as follows:
from __future__ import print_function

def _edit_dist_init(len1, len2):
    lev = []
    for i in range(len1):
        lev.append([0] * len2)  # initialize 2D array to zero
    for i in range(len1):
        lev[i][0] = i  # column 0: 0,1,2,3,4,...
    for j in range(len2):
        lev[0][j] = j  # row 0: 0,1,2,3,4,...
    return lev

def _edit_dist_step(lev, i, j, s1, s2, transpositions=False):
    c1 = s1[i - 1]
    c2 = s2[j - 1]
    # skipping a character in s1
    a = lev[i - 1][j] + 1
    # skipping a character in s2
    b = lev[i][j - 1] + 1
    # substitution
    c = lev[i - 1][j - 1] + (c1 != c2)
    # transposition
    d = c + 1  # never picked by default
    if transpositions and i > 1 and j > 1:
        if s1[i - 2] == c2 and s2[j - 2] == c1:
            d = lev[i - 2][j - 2] + 1
    # pick the cheapest
    lev[i][j] = min(a, b, c, d)

def edit_distance(s1, s2, transpositions=False):
    # set up a 2-D array
    len1 = len(s1)
    len2 = len(s2)
    lev = _edit_dist_init(len1 + 1, len2 + 1)
    # iterate over the array
    for i in range(len1):
        for j in range(len2):
            _edit_dist_step(lev, i + 1, j + 1, s1, s2, transpositions=transpositions)
    return lev[len1][len2]
Let's have a look at the Edit Distance calculated in NLTK using the nltk.metrics package:
>>> import nltk
>>> from nltk.metrics import *
>>> edit_distance("relate","relation")
3
>>> edit_distance("suggestion","calculation")
7
Here, when we calculate the edit distance between relate and relation, three operations (one substitution and two insertions) are performed. While calculating the edit distance between suggestion and calculation, seven operations (six substitutions and one insertion) are performed.
Jaccard's coefficient, or Tanimoto coefficient, may be defined as a measure of the overlap of two sets, X and Y.
It may be defined as follows:

Jaccard(X,Y) = |X ∩ Y| / |X ∪ Y|
Jaccard(X,X) = 1
Jaccard(X,Y) = 0 if X ∩ Y = ∅
The code for Jaccard's similarity may be given as follows:
def jacc_similarity(query, document):
    first = set(query).intersection(set(document))
    second = set(query).union(set(document))
    return len(first)/len(second)
Let's have a look at the implementation of Jaccard's similarity coefficient using NLTK:
>>> import nltk
>>> from nltk.metrics import *
>>> X=set([10,20,30,40])
>>> Y=set([20,30,60])
>>> print(jaccard_distance(X,Y))
0.6
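Note that NLTK computes the Jaccard distance, which is the complement of Jaccard's coefficient. Assuming the jacc_similarity() function defined earlier is available in the session, we can verify that the two values sum to 1:

>>> jacc_similarity(X, Y)
0.4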
The Smith Waterman distance is similar to edit distance. This similarity metric was developed in order to detect optimal alignments between related DNA and protein sequences. It uses a function d that maps pairs of letters to substitution costs, along with a cost G assigned to a gap (insertion or deletion):
D(i,j) = max( 0                       (start over),
              D(i-1,j-1) - d(si,tj)   (substitution / copy),
              D(i-1,j) - G            (insertion),
              D(i,j-1) - G            (deletion) )

Here, G = 1 is an example value for the gap cost, d(c,c) = -2 is an example context-dependent substitution cost for matching letters, and d(c,d) = +1 is an example cost for mismatched letters.
Similar to Edit Distance, Python code for Smith Waterman can be used alongside the nltk.metrics package to perform string similarity measurements with Smith Waterman in NLTK.
Binary distance is a string similarity metric that returns the value 0.0 if two labels are identical and 1.0 otherwise.
The Python code for the Binary distance metric is:
def binary_distance(label1, label2):
    return 0.0 if label1 == label2 else 1.0
Let's see how the Binary distance metric is implemented in NLTK:
>>> import nltk
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y = set([30,50,70])
>>> binary_distance(X, Y)
1.0
Masi distance is based on partial agreement when multiple labels are present.
The Python code included in nltk.metrics for the masi distance is as follows:
def masi_distance(label1, label2):
    len_intersection = len(label1.intersection(label2))
    len_union = len(label1.union(label2))
    len_label1 = len(label1)
    len_label2 = len(label2)
    if len_label1 == len_label2 and len_label1 == len_intersection:
        m = 1
    elif len_intersection == min(len_label1, len_label2):
        m = 0.67
    elif len_intersection > 0:
        m = 0.33
    else:
        m = 0
    return 1 - (len_intersection / float(len_union)) * m
Let's see the implementation of masi distance in NLTK:
>>> import nltk
>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y = set([30,50,70])
>>> print(masi_distance(X,Y))
0.945
In this chapter, you have learned various operations that can be performed on a text that is a collection of strings. You have understood the concept of tokenization, substitution, and normalization, and applied various similarity measures to strings using NLTK. We have also discussed Zipf's law, which may be applicable to some of the existing documents.
In the next chapter, we'll discuss various language modeling techniques and different NLP tasks.