Morphology – Getting Our Feet Wet

This article is by Deepti Chopra, Nisheeth Joshi, and Iti Mathur, authors of the book Mastering Natural Language Processing with Python. Morphology may be defined as the study of the composition of words using morphemes. A morpheme is the smallest unit of a language that carries meaning. In this article, we will discuss stemming and lemmatizing, creating a stemmer and lemmatizer for non-English languages, developing a morphological analyzer and morphological generator using machine learning tools, creating a search engine, and many other concepts.

In brief, this article will include the following topics:

  • Introducing morphology
  • Creating a stemmer and lemmatizer
  • Developing a stemmer for non-English languages
  • Creating a morphological analyzer
  • Creating a morphological generator
  • Creating a search engine


Introducing morphology

Morphology may be defined as the study of the production of tokens with the help of morphemes. A morpheme is the basic unit of language, which carries a meaning. There are two types of morphemes: stems and affixes (suffixes, prefixes, infixes, and circumfixes).

Stems are also referred to as free morphemes since they can exist on their own, without any affixes. Affixes are referred to as bound morphemes since they cannot exist in a free form and always occur together with free morphemes. Consider the word "unbelievable". Here, "believe" is a stem or free morpheme; it can exist on its own. The morphemes "un" and "able" are affixes or bound morphemes. They cannot exist in a free form but only together with a stem.

There are three kinds of languages, namely isolating languages, agglutinative languages, and inflecting languages. Morphology plays a different role in each of them. Isolating languages are those languages in which words are merely free morphemes, and they do not carry any tense (past, present, and future) or number (singular or plural) information. Mandarin Chinese is an example of an isolating language. Agglutinative languages are those languages in which small units combine together to convey compound information. Turkish is an example of an agglutinative language. Inflecting languages are languages in which words can be broken down into simpler units, but these simpler units each exhibit different meanings. Latin is an example of an inflecting language.

There are morphological processes such as inflection, derivation, semi-affixes, combining forms, and cliticization. Inflection transforms a word into a form that expresses person, number, tense, gender, case, aspect, or mood. Here, the syntactic category of the token remains the same. In derivation, the syntactic category of the word also changes. Semi-affixes are bound morphemes that exhibit a word-like quality, for example, noteworthy, antisocial, anticlockwise, and so on.
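To make the stem and affix decomposition described above concrete, here is a minimal sketch in plain Python. The affix lists and the split_affixes function are hypothetical and only cover a handful of English affixes; it is an illustration, not a real morphological analyzer.

```python
# Toy affix lists chosen for illustration; real languages need far richer resources.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("able", "ness", "ing", "ed")


def split_affixes(word):
    """Split a word into (prefix, stem, suffix) using the toy lists above."""
    # Take the first listed prefix that the word starts with, if any
    prefix = next((p for p in PREFIXES if word.startswith(p) and len(word) > len(p)), "")
    rest = word[len(prefix):]
    # Take the first listed suffix that the remainder ends with, if any
    suffix = next((s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), "")
    stem = rest[:len(rest) - len(suffix)]
    return prefix, stem, suffix


print(split_affixes("unbelievable"))  # ('un', 'believ', 'able')
print(split_affixes("rain"))          # ('', 'rain', '')
```

Note that pure string splitting yields "believ" rather than "believe", because the final "e" of the stem is dropped in the written form. Handling such spelling changes is one reason practical analyzers need more than affix lists.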

Understanding stemmers

Stemming may be defined as the process of obtaining a stem from a word by eliminating its affixes. For example, for the word "raining", a stemmer would return the root word or stem "rain" by removing the affix "ing". In order to increase the accuracy of information retrieval, search engines mostly use stemming to obtain stems and store them as index words. Search engines treat words with the same stem as synonyms, which is a kind of query expansion known as conflation. Martin Porter designed a well-known stemming algorithm known as the Porter Stemming Algorithm. This algorithm is designed to replace and eliminate some well-known suffixes present in English words. To perform stemming in NLTK, we can simply instantiate the PorterStemmer class and then call its stem() method.

Let's take a look at the code for stemming using the PorterStemmer class in NLTK:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmerporter = PorterStemmer()
>>> stemmerporter.stem('working')
'work'
>>> stemmerporter.stem('happiness')
'happi'

The PorterStemmer class encodes rules for many suffixes and word forms of the English language. Stemming takes place in a series of steps and transforms a word into a shorter word, which may not be a dictionary word but usually keeps a meaning similar to that of the root word. The StemmerI interface defines the stem() method, and all stemmers inherit from this interface: PorterStemmer, LancasterStemmer, RegexpStemmer, and SnowballStemmer are all subclasses of StemmerI.

Another stemming algorithm, known as the Lancaster stemming algorithm, was introduced at Lancaster University. Similar to the PorterStemmer class, the LancasterStemmer class is used in NLTK to implement Lancaster stemming.

Let's consider the following code, which depicts Lancaster stemming in NLTK:

>>> import nltk
>>> from nltk.stem import LancasterStemmer
>>> stemmerlan=LancasterStemmer()
>>> stemmerlan.stem('working')
'work'
>>> stemmerlan.stem('happiness')
'happy'

We can also build our own stemmer in NLTK using RegexpStemmer. It accepts a regular expression and removes the part of a word (typically a prefix or suffix) that matches it.

Let's consider an example of stemming using RegexpStemmer in NLTK:

>>> import nltk
>>> from nltk.stem import RegexpStemmer
>>> stemmerregexp=RegexpStemmer('ing')
>>> stemmerregexp.stem('working')
'work'
>>> stemmerregexp.stem('happiness')
'happiness'
>>> stemmerregexp.stem('pairing')
'pair'

We can use RegexpStemmer in cases where stemming cannot be performed using PorterStemmer and LancasterStemmer.
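The idea behind a regular-expression stemmer can be sketched with Python's standard re module. This is an illustration of the technique, not NLTK's implementation; the regexp_stem function and its min_length parameter are hypothetical names. It also shows why anchoring the pattern with $ matters when only a suffix should be removed.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Remove every match of pattern from word, unless the word is too short."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


print(regexp_stem("working", "ing"))    # work
print(regexp_stem("ringing", "ing"))    # r    (unanchored: both matches are removed)
print(regexp_stem("ringing", "ing$"))   # ring (anchored: only the suffix is removed)
```

With an unanchored pattern such as 'ing', a match in the middle of a word is removed as well, so a pattern like 'ing$' is usually the safer choice for suffix stripping.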

The SnowballStemmer class is used to perform stemming in 13 languages other than English. In order to perform stemming using SnowballStemmer, we first create an instance for the language in which stemming needs to be performed and then call its stem() method.

Consider the following example to perform stemming in Spanish and French in NLTK using SnowballStemmer:

>>> import nltk
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanishstemmer=SnowballStemmer('spanish')
>>> spanishstemmer.stem('comiendo')
'com'
>>> frenchstemmer=SnowballStemmer('french')
>>> frenchstemmer.stem('manger')
'mang'

The nltk.stem.api module contains the StemmerI class, which declares the stem() method.

Consider the following code present in NLTK, which enables stemming to be performed:

class StemmerI(object):
    """
    It is an interface that helps to eliminate morphological affixes from tokens; this process is known as stemming.
    """
    def stem(self, token):
        """
        Eliminate affixes from token and return the stem.
        """
        raise NotImplementedError()

Here's the code used to perform stemming using multiple stemmers:

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

def obtain_tokens():
    # Read the sample file and split its contents into tokens
    with open('/home/p/NLTK/sample1.txt') as stem:
        tok = nltk.word_tokenize(stem.read())
    return tok

def stemming(filtered):
    # PorterStemmer is used here; LancasterStemmer() or SnowballStemmer('english') can be swapped in
    stemmer = PorterStemmer()
    stem = []
    for x in filtered:
        stem.append(stemmer.stem(x))
    return stem

if __name__ == "__main__":
    tok = obtain_tokens()
    print("tokens is %s" % (tok))
    stem_tokens = stemming(tok)
    print("After stemming is %s" % stem_tokens)
    res = dict(zip(tok, stem_tokens))
    print("{tok:stemmed}=%s" % (res))
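Because every stemmer exposes the same stem() method, a stemming step like the one above does not need to know which stemmer it is given. The following is a minimal sketch of that idea in plain Python, without NLTK. SuffixStemmer and stem_all are hypothetical names introduced for illustration, and the suffix list is a toy assumption.

```python
class SuffixStemmer:
    """Hypothetical stemmer that exposes a stem(token) method, like the interface above."""

    def __init__(self, suffixes, min_stem_length=3):
        # Try longer suffixes first so that 'ness' wins over 's'
        self.suffixes = sorted(suffixes, key=len, reverse=True)
        self.min_stem_length = min_stem_length

    def stem(self, token):
        for suffix in self.suffixes:
            if token.endswith(suffix) and len(token) - len(suffix) >= self.min_stem_length:
                return token[:-len(suffix)]
        return token


def stem_all(stemmer, tokens):
    """Map each token to its stem using any object that has a stem() method."""
    return {tok: stemmer.stem(tok) for tok in tokens}


stemmer = SuffixStemmer(["ing", "ness", "s"])
print(stem_all(stemmer, ["working", "happiness", "cats", "is"]))
```

The min_stem_length guard keeps very short words such as "is" from being reduced to a single letter.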

Understanding lemmatization

Lemmatization is the process in which we transform a word into its dictionary form, known as the lemma. Unlike a stem, a lemma is always a valid word, and the result depends on the word category (part of speech) that is assumed for the input word.

Consider an example of lemmatization in NLTK:

>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer_output=WordNetLemmatizer()
>>> lemmatizer_output.lemmatize('working')
'working'
>>> lemmatizer_output.lemmatize('working',pos='v')
'work'
>>> lemmatizer_output.lemmatize('works')
'work'

WordNetLemmatizer may be defined as a wrapper around the WordNet corpus, and it makes use of the morphy() function present in WordNetCorpusReader to extract a lemma. If no lemma is found, the word is returned unchanged. For example, for 'works', the lemma that is returned is the singular form 'work'.

This code snippet illustrates the difference between stemming and lemmatization:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer_output=PorterStemmer()
>>> stemmer_output.stem('happiness')
'happi'
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer_output=WordNetLemmatizer()
>>> lemmatizer_output.lemmatize('happiness')
'happiness'

In the preceding code, 'happiness' is converted to 'happi' by stemming. Lemmatization finds no shorter root form for 'happiness', which is already a dictionary word, so it returns 'happiness' unchanged.
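The lookup behavior of a lemmatizer can be sketched with a small dictionary keyed by word and part of speech. This is a toy illustration, not how WordNet is organized; the LEMMAS table and the lemmatize function are hypothetical, and the default part of speech 'n' is chosen to mirror the noun default seen in the examples above.

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative only.
LEMMAS = {
    ("working", "v"): "work",
    ("works", "n"): "work",
    ("better", "a"): "good",
}


def lemmatize(word, pos="n"):
    """Return the dictionary form if it is known, otherwise the word unchanged."""
    return LEMMAS.get((word, pos), word)


print(lemmatize("working"))           # working (no noun entry, so unchanged)
print(lemmatize("working", pos="v"))  # work
print(lemmatize("happiness"))         # happiness (unknown words are returned as-is)
```

A stemmer would cut "better" by rule, whereas a dictionary lookup can map it to "good", which is why a lemmatizer needs a lexical resource and a part of speech.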

Developing a stemmer for non-English languages

Polyglot is a software package that provides trained models, called morfessor models, which are used to obtain morphemes from tokens. The goal of the Morpho project is to create unsupervised, data-driven processes. It focuses on the discovery of morphemes, which are the smallest units of syntax. Morphemes play an important role in natural language processing; they are useful in the automatic recognition and generation of language. Using the vocabulary dictionaries of polyglot, morfessor models were trained on 50,000 tokens of each language.

Here's the code to obtain the table of supported languages using polyglot:

from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))

The preceding code lists the supported languages as follows:

1. Piedmontese language       2. Lombard language           3. Gan Chinese
4. Sicilian                   5. Scots                     6. Kirghiz, Kyrgyz
7. Pashto, Pushto             8. Kurdish                   9. Portuguese
10. Kannada                   11. Korean                   12. Khmer
13. Kazakh                   14. Ilokano                   15. Polish
16. Panjabi, Punjabi         17. Georgian                 18. Chuvash
19. Alemannic                 20. Czech                     21. Welsh
22. Chechen                   23. Catalan; Valencian       24. Northern Sami
25. Sanskrit (Saṁskṛta)       26. Slovene                   27. Javanese
28. Slovak                   29. Bosnian-Croatian-Serbian 30. Bavarian
31. Swedish                   32. Swahili                   33. Sundanese
34. Serbian                   35. Albanian                 36. Japanese
37. Western Frisian           38. French                   39. Finnish
40. Upper Sorbian             41. Faroese                   42. Persian
43. Sinhala, Sinhalese       44. Italian                   45. Amharic
46. Aragonese                 47. Volapük                   48. Icelandic
49. Sakha                     50. Afrikaans                 51. Indonesian
52. Interlingua              53. Azerbaijani               54. Ido
55. Arabic                   56. Assamese                 57. Yoruba
58. Yiddish                   59. Waray-Waray               60. Croatian
61. Hungarian                 62. Haitian; Haitian Creole   63. Quechua
64. Armenian                 65. Hebrew (modern)           66. Silesian
67. Hindi                     68. Divehi; Dhivehi; Mald... 69. German
70. Danish                   71. Occitan                   72. Tagalog
73. Turkmen                   74. Thai                     75. Tajik
76. Greek, Modern             77. Telugu                   78. Tamil
79. Oriya                     80. Ossetian, Ossetic         81. Tatar
82. Turkish                   83. Kapampangan               84. Venetian
85. Manx                     86. Gujarati                 87. Galician
88. Irish                     89. Scottish Gaelic; Gaelic   90. Nepali
91. Cebuano                   92. Zazaki                   93. Walloon
94. Dutch                     95. Norwegian                 96. Norwegian Nynorsk
97. West Flemish             98. Chinese                   99. Bosnian
100. Breton                   101. Belarusian               102. Bulgarian
103. Bashkir                 104. Egyptian Arabic         105. Tibetan Standard, Tib...
106. Bengali                 107. Burmese                 108. Romansh
109. Marathi (Marāthī)       110. Malay                   111. Maltese
112. Russian                 113. Macedonian               114. Malayalam
115. Mongolian               116. Malagasy                 117. Vietnamese
118. Spanish; Castilian       119. Estonian                120. Basque
121. Bishnupriya Manipuri     122. Asturian                 123. English
124. Esperanto               125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur           128. Ukrainian               129. Limburgish, Limburgan...
130. Latvian                 131. Urdu                     132. Lithuanian
133. Fiji Hindi               134. Uzbek                   135. Romanian, Moldavian, ...

The necessary models can be downloaded using the following code:

%%bash
polyglot download morph2.en morph2.ar

[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!

Consider this example, which obtains the morphemes of a few English tokens using polyglot:

from polyglot.text import Text, Word
tokens = ["unconditional", "precooked", "impossible", "painful", "entered"]
for s in tokens:
    s = Word(s, language="en")
    print("{:<20}{}".format(s, s.morphemes))

unconditional   ['un','conditional']
precooked   ['pre','cook','ed']
impossible   ['im','possible']
painful     ['pain','ful']
entered     ['enter','ed']

If the text has not been tokenized properly, for example when the spaces between words are missing, morphological analysis can be used to split it into its original constituents:

sent="Ihopeyoufindthebookinteresting"
para=Text(sent)
para.language="en"
para.morphemes
WordList(['I','hope','you','find','the','book','interesting'])

A morphological analyzer

Morphological analysis may be defined as the process of obtaining grammatical information about a token given its suffix information. Morphological analysis can be performed in three ways: Morpheme-based morphology (or the item and arrangement approach), Lexeme-based morphology (or the item and process approach), and Word-based morphology (or the word and paradigm approach). A morphological analyzer may be defined as a program that is responsible for the analysis of the morphology of a given input token. It analyzes a given token and generates morphological information, such as gender, number, class, and so on, as an output.

In order to perform morphological analysis on a given string that contains no whitespace, the PyEnchant dictionary can be used.

Consider the following code that performs morphological analysis:

>>> import enchant
>>> s = enchant.Dict("en_US")
>>> tok=[]
>>> def tokenize(st1):
   if not st1:return
   # Try the longest prefix first and shorten it until a dictionary word is found
   for j in range(len(st1),0,-1):
     if s.check(st1[0:j]):
       tok.append(st1[0:j])
       st1=st1[j:]
       tokenize(st1)
       break
>>> tokenize("itismyfavouritebook")
>>> tok
['it', 'is', 'my','favourite','book']
>>> tok=[ ]
>>> tokenize("ihopeyoufindthebookinteresting")
>>> tok
['i','hope','you','find','the','book','interesting']
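The same greedy longest-match idea can be tried without PyEnchant by replacing the dictionary with a small set of known words. The sketch below is written under that assumption; VOCAB and segment are hypothetical names, and the vocabulary only covers the two example strings.

```python
# Toy vocabulary standing in for a real spelling dictionary.
VOCAB = {"i", "it", "is", "my", "favourite", "book",
         "hope", "you", "find", "the", "interesting"}


def segment(text, vocab=VOCAB):
    """Greedy longest-match segmentation of a string that contains no spaces."""
    tokens = []
    while text:
        # Try the longest prefix first, then shorten it one character at a time
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:
            # No prefix is a known word: keep the remainder as a single token
            tokens.append(text)
            break
    return tokens


print(segment("itismyfavouritebook"))
print(segment("ihopeyoufindthebookinteresting"))
```

Greedy matching always commits to the longest known prefix, so with a large dictionary it can pick an unintended word (for example "ism" before "is"); the result therefore depends heavily on the vocabulary that is used.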

We can determine the category of a word as follows:

  • Morphological hints: Suffix information helps us to detect the category of a word. For example, the suffixes -ness and -ment occur with nouns.
  • Syntactic hints: Contextual information helps in determining the category of a word. For example, if we have found a word that has a noun category, then syntactic hints will be useful in determining whether an adjective will appear before or after the noun.
  • Semantic hints: A semantic hint is also useful in determining the category of a word. For example, if we already know that a word represents the name of a location, then it will fall under the noun category.
  • Open class: This refers to a class of words that is not fixed; its size keeps increasing as new words are added to the language. Words in an open class are usually nouns. Prepositions are mostly a closed class.
  • Morphology captured by the part of speech tagset: A part of speech tagset captures information that helps us to perform morphology. For example, the word 'plays' would be tagged as a third person singular form.
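The morphological hints in the list above can be sketched as a simple suffix lookup. The -ness and -ment rules come from the text; the -ly and -ing rules are added here as assumptions for illustration, and guess_category is a hypothetical helper rather than a real tagger.

```python
# (suffix, category) pairs; only -ness and -ment are taken from the text above.
SUFFIX_HINTS = [
    ("ness", "noun"),
    ("ment", "noun"),
    ("ly", "adverb"),   # assumption added for illustration
    ("ing", "verb"),    # assumption added for illustration
]


def guess_category(word):
    """Guess a word category from its suffix, or return 'unknown'."""
    for suffix, category in SUFFIX_HINTS:
        if word.endswith(suffix) and len(word) > len(suffix):
            return category
    return "unknown"


for w in ["happiness", "agreement", "quickly", "book"]:
    print(w, guess_category(w))
```

Suffix hints alone are easy to fool (for example "cement" ends in -ment), which is why they are normally combined with the syntactic and semantic hints listed above.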

Omorfi (the open morphology of Finnish) is a package that is licensed under version 3 of the GNU GPL. It is used for numerous tasks such as language modeling, morphological analysis, rule-based machine translation, information retrieval, statistical machine translation, morphological segmentation, ontologies, and spell checking and correction.

A morphological generator

A morphological generator is a program that performs the task of morphological generation. Morphological generation may be considered the opposite of morphological analysis. Here, given the description of a word in terms of its number, category, stem, and so on, the original word is retrieved. For example, if root = go, part of speech = verb, tense = present, and the word occurs with a third person singular subject, then the morphological generator would generate its surface form, that is, goes.
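The "go" to "goes" example can be turned into a very small rule-based generator. The sketch below covers only one feature combination (present tense, third person singular verbs) with a few common English spelling rules; the function names and the tiny irregular table are assumptions made for illustration.

```python
def third_person_singular(root):
    """Generate the present-tense, third person singular form of an English verb."""
    irregular = {"be": "is", "have": "has"}  # tiny sample of irregular verbs
    if root in irregular:
        return irregular[root]
    if root.endswith(("s", "sh", "ch", "x", "z", "o")):
        return root + "es"
    # Consonant + y becomes -ies (try -> tries); vowel + y just takes -s (play -> plays)
    if root.endswith("y") and len(root) > 1 and root[-2] not in "aeiou":
        return root[:-1] + "ies"
    return root + "s"


def generate(root, pos, tense, person, number):
    """Return the surface form for a feature description of a word."""
    if (pos, tense, person, number) == ("verb", "present", 3, "singular"):
        return third_person_singular(root)
    raise NotImplementedError("only one feature combination is covered in this sketch")


print(generate("go", "verb", "present", 3, "singular"))    # goes
print(generate("play", "verb", "present", 3, "singular"))  # plays
```

A full generator would need rules or learned models for every feature combination and language, which is what the tools listed in this section provide.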

There are many Python-based software packages that perform morphological analysis and generation. Some of them are as follows:

  • ParaMorfo: This is used to perform the morphological generation and analysis of Spanish and Guarani nouns, adjectives, and verbs
  • HornMorpho: This is used for the morphological generation and analysis of Oromo and Amharic nouns and verbs as well as Tigrinya verbs
  • AntiMorfo: This is used for the morphological generation and analysis of Quechua adjectives, verbs, and nouns as well as Spanish verbs
  • MorfoMelayu: This is used for the morphological analysis of Malay words

Other examples of software that is used to perform morphological analysis and generation are as follows:

  • Morph is a morphological generator and analyzer for the English language and the RASP system
  • Morphy is a morphological generator, analyzer, and POS tagger for German
  • Morphisto is a morphological generator and analyzer for German
  • Morfette performs supervised learning (inflectional morphology) for Spanish and French

Search engines

PyStemmer 1.0.1 provides the Snowball stemming algorithms, which are useful for information retrieval tasks and for the construction of a search engine. It includes the Porter stemming algorithm and many other stemming algorithms that can be used for stemming and information retrieval in many languages, including many European languages.

We can construct a vector space search engine by converting the texts into vectors.

Here are the steps needed to construct a vector space search engine:

  1. Stemming and elimination of stop words.

    A stemmer is a program that accepts words and converts them into stems. Tokens that have the same stem have almost the same meaning. Stop words are also eliminated from the text.

    Consider the following code for the removal of stop words and tokenization:

    def eliminatestopwords(self, words):
        """
        Eliminate words that occur often and carry little significance from the context point of view.
        """
        return [word for word in words if word not in self.stopwords]

    def tokenize(self, string):
        """
        Split the text into words and stem each word
        """
        string = self.clean(string)
        words = string.split(" ")
        return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]

  2. Mapping keywords into vector dimensions.

    Here's the code required to perform the mapping of keywords into vector dimensions:

    def obtainvectorkeywordindex(self, documentList):
        """
        Generate the keyword for each element position in the document vectors
        """
        # Join all the documents into a single string
        vocabstring = " ".join(documentList)

        vocablist = self.parser.tokenize(vocabstring)
        # Eliminate common words that have no search significance
        vocablist = self.parser.eliminatestopwords(vocablist)
        uniqueVocablist = util.removeDuplicates(vocablist)

        vectorIndex = {}
        offset = 0
        # Associate each keyword with the position of the dimension that represents it
        for word in uniqueVocablist:
            vectorIndex[word] = offset
            offset += 1
        return vectorIndex  # (keyword:position)

  3. Mapping of text strings to vectors

    Here, a simple term count model is used. The code to convert text strings into vectors is as follows:

    def constructVector(self, wordString):
        # Initialise the vector with 0's
        vector = [0] * len(self.vectorKeywordIndex)
        tokList = self.parser.tokenize(wordString)
        tokList = self.parser.eliminatestopwords(tokList)
        for word in tokList:
            vector[self.vectorKeywordIndex[word]] += 1  # simple term count model is used
        return vector

  4. Searching similar documents

    By computing the cosine of the angle between the vectors of two documents, we can estimate how similar the documents are. If the cosine value is 1, then the angle is 0 degrees and the vectors are said to be parallel (this means that the documents are related). If the cosine value is 0, then the angle is 90 degrees and the vectors are said to be perpendicular (this means that the documents are not related).

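    This is the code to compute the cosine between two text vectors (the dot and norm functions are imported from NumPy here):

    from numpy import dot
    from numpy.linalg import norm

    def cosine(vec1, vec2):
        """
        cosine = (X . Y) / (||X|| * ||Y||)
        """
        return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))
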
  5. Search keywords

    We perform the mapping of keywords to a vector space. We construct a temporary text that represents the items to be searched and then compare it with the document vectors with the help of a cosine measurement.

    Here is the code needed to search the vector space:

    def searching(self, searchinglist):
        """ Search for texts that match a list of items """
        askVector = self.buildQueryVector(searchinglist)

        ratings = [util.cosine(askVector, textVector) for textVector in self.documentVectors]
        ratings.sort(reverse=True)
        return ratings
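The five steps can be combined into one small, self-contained sketch that uses only the standard library. It is a simplified illustration under stated assumptions: the stop word list is a toy, stemming is left out for brevity, and all function names are hypothetical rather than taken from the methods shown above.

```python
import math

STOPWORDS = {"the", "a", "is", "on", "and"}  # toy stop word list (assumption)


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def build_index(documents):
    """Map every keyword to one vector dimension."""
    vocab = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: pos for pos, word in enumerate(vocab)}


def make_vector(text, index):
    """Term count vector of a text over the keyword index."""
    vector = [0] * len(index)
    for word in tokenize(text):
        if word in index:  # ignore words that never occur in the documents
            vector[index[word]] += 1
    return vector


def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def search(query, documents):
    """Return (score, document) pairs, best match first."""
    index = build_index(documents)
    query_vector = make_vector(query, index)
    scores = [(cosine(query_vector, make_vector(doc, index)), doc) for doc in documents]
    return sorted(scores, reverse=True)


docs = ["the cat sat on the mat", "the dog chased the cat", "stock markets fell"]
for score, doc in search("cat mat", docs):
    print(round(score, 3), doc)
```

Returning (score, document) pairs instead of bare scores is a design choice made in this sketch so that the ranking still identifies which document each score belongs to.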

    The following code can be used to detect the language of a source text:

    import sys

    try:
        from nltk import wordpunct_tokenize
        from nltk.corpus import stopwords
    except ImportError:
        print('Error has occurred')
        sys.exit(1)

    #----------------------------------------------------------------------
    def _calculate_languages_ratios(text):
        """
        Count, for each language, how many of its stop words appear in the given
        document, and return a dictionary that looks like
        {'german': 2, 'french': 4, 'english': 1}
        """
        languages_ratios = {}

        # wordpunct_tokenize() splits all punctuation into separate tokens:
        # wordpunct_tokenize("I hope you like the book interesting .")
        # ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
        tok = wordpunct_tokenize(text)
        words = [word.lower() for word in tok]
        words_set = set(words)

        # Compute the number of unique stop words of each language in the text
        for language in stopwords.fileids():
            stopwords_set = set(stopwords.words(language))
            common_elements = words_set.intersection(stopwords_set)
            languages_ratios[language] = len(common_elements)  # language "score"
        return languages_ratios

    #----------------------------------------------------------------
    def detect_language(text):
        """
        Score the given text against the stop word lists of different languages and
        return the language with the highest score. It uses the stop word counting
        approach, that is, it finds the unique stop words present in the analyzed text.
        """
        ratios = _calculate_languages_ratios(text)
        most_rated_language = max(ratios, key=ratios.get)
        return most_rated_language


    if __name__ == '__main__':

        text = '''
    All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things. As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a
    new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another.
    The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A
    personal or institutionalized system grounded in belief in a God or Gods and the activities connected
    with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good
    life'. It has never been the purpose of religion to divide people into groups of isolated followers that
    cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties
    and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the
    name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a
    number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with
    the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Banglsdeshi Doctor-cum-Writer in her controversial novel 'Lajja' (1993) in which, she seems to utilizes fiction's mass emotional appeal, rather than its potential for nuance and universality.
        '''

        language = detect_language(text)
        print(language)

The preceding code will search for stop words and detect the language of the text, which is English.
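The stop word counting approach can also be tried without downloading the NLTK stop word corpus. The following sketch replaces the corpus with tiny hand-picked stop word sets, which are an assumption made purely for illustration and far too small for real use; the function names are likewise hypothetical.

```python
import re

# Tiny hand-picked stop word samples (assumption; real lists are much longer).
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "french": {"le", "la", "et", "est", "de", "les"},
    "spanish": {"el", "la", "y", "es", "de", "los"},
}


def language_scores(text):
    """Count how many distinct stop words of each language occur in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return {lang: len(words & stops) for lang, stops in STOPWORDS.items()}


def detect_language(text):
    """Return the language whose stop words overlap most with the text."""
    scores = language_scores(text)
    return max(scores, key=scores.get)


print(detect_language("The cat is in the garden and the dog is out"))
print(detect_language("Le chat est dans le jardin et la maison"))
```

Short texts can contain too few stop words to separate closely related languages, so this approach works best on passages of at least a few sentences.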

Summary

In this article, we discussed stemming, lemmatization, and morphological analysis and generation.
