Reader small image

You're reading from  The Natural Language Processing Workshop

Product typeBook
Published inAug 2020
Reading LevelIntermediate
PublisherPackt
ISBN-139781800208421
Edition1st Edition
Languages
Tools
Right arrow
Authors (6):
Rohan Chopra
Rohan Chopra
author image
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelors degree in computer science. Rohan has an experience of more than 2 years in designing, implementing, and optimizing end-to-end deep neural network systems. His research is centered around the use of deep learning to solve computer vision-related problems and has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.
Read more about Rohan Chopra

Aniruddha M. Godbole
Aniruddha M. Godbole
author image
Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with inter-disciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and has done MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion pages contributor to Mint, which is a leading business newspaper in India. He has fifteen years of experience.
Read more about Aniruddha M. Godbole

Nipun Sadvilkar
Nipun Sadvilkar
author image
Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at US healthcare company leading a team of data scientists and subject matter expertise to design and build the clinical NLP engine to revamp medical coding workflows, enhance coder efficiency, and accelerate revenue cycle. He has experience of more than 3 years in building NLP solutions and web-based data science platforms in the area of healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering with a fair understanding of the business domain. He is a member of the regional and national python community. He is author of pySBD - an NLP open-source python library for sentence segmentation which is recognized by ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.
Read more about Nipun Sadvilkar

Muzaffar Bashir Shah
Muzaffar Bashir Shah
author image
Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with vast experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a masters degree in computer science from the University of Kashmir and is currently working in a Bangalore based startup named Datoin.
Read more about Muzaffar Bashir Shah

Sohom Ghosh
Sohom Ghosh
author image
Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in the data science arena with a specialization in deep learning-based text analytics, NLP, and recommendation systems. He has publications in several international conferences and journals.
Read more about Sohom Ghosh

Dwight Gunning
Dwight Gunning
author image
Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools such as NLTK, gensim, and spacy.
Read more about Dwight Gunning

View More author details
Right arrow

Word Sense Disambiguation

There's a popular saying: "A man is known by the company he keeps.'' Similarly, a word's meaning depends on its association with other words in a sentence. This means two or more words with the same spelling may have different meanings in different contexts. This often leads to ambiguity. Word sense disambiguation is the process of mapping a word to the sense that it should carry. We need to disambiguate words based on the sense they carry so that they can be treated as different entities when being analyzed. The following figure displays a perfect example of how ambiguity is caused due to the usage of the same word in different sentences:

Figure 1.3: Word sense disambiguation

Figure 1.3: Word sense disambiguation

One of the algorithms to solve word sense disambiguation is the Lesk algorithm. It has a huge corpus in the background (generally WordNet is used) that contains definitions of all the possible synonyms of all the possible words in a language. Then it takes a word and the context as input and finds a match between the context and all the definitions of the word. The meaning with the highest number of matches with the context of the word will be returned.

For example, suppose we have a sentence such as "We play only soccer" in a given text. Now, we need to find the meaning of the word "play" in this sentence. In the Lesk algorithm, each word with ambiguous meaning is saved in background synsets. In this case, the word "play" will be saved with all possible definitions. Let's say we have two definitions of the word "play":

  1. Play: Participating in a sport or game
  2. Play: Using a musical instrument

Then, we will find the similarity between the context of the word "play" in the text and both of the preceding definitions using text similarity techniques. The definition best suited to the context of "play" in the sentence will be considered the meaning or definition of the word. In this case, we will find that our first definition fits best in context, as the words "sport" and "game" are present in the preceding sentences.

In the next exercise, we will be using the Lesk module from NLTK. It takes a sentence and the word as input, and returns the meaning or definition of the word. The output of the Lesk method is synset, which contains the ID of the matched definition. These IDs can be matched with their definitions using the definition() method of wsd.synset('word').

To get a better understanding of this process, let's look at an exercise.

Exercise 1.10: Word Sense Disambiguation

In this exercise, we will find the sense of the word "bank" in two different sentences. Follow these steps to implement this exercise:

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:
    import nltk
    nltk.download('wordnet')
    from nltk.wsd import lesk
    from nltk import word_tokenize
  3. Declare two variables, sentence1 and sentence2, and assign them with appropriate strings. Insert a new cell and the following code to implement this:
    sentence1 = "Keep your savings in the bank"
    sentence2 = "It's so risky to drive over the banks of the road"
  4. To find the sense of the word "bank" in the preceding two sentences, use the Lesk algorithm provided by the nltk.wsd library. Insert a new cell and add the following code to implement this:
    def get_synset(sentence, word):
        return lesk(word_tokenize(sentence), word)
    get_synset(sentence1,'bank')

    This code generates the following output:

    Synset('savings_bank.n.02')
  5. Here, savings_bank.n.02 refers to a container for keeping money safely at home. To check the other sense of the word "bank," write the following code:
    get_synset(sentence2,'bank')

    This code generates the following output:

    Synset('bank.v.07')

    Here, bank.v.07 refers to a slope in the turn of a road.

    Thus, with the help of the Lesk algorithm, we were able to identify the sense of a word in whatever context.

    Note

    To access the source code for this specific section, please refer to https://packt.live/399JCq5.

    You can also run this example online at https://packt.live/30haCQ6.

In the next section, we will focus on sentence boundary detection, which helps detect the start and end points of sentences.

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
The Natural Language Processing Workshop
Published in: Aug 2020Publisher: PacktISBN-13: 9781800208421

Authors (6)

author image
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelors degree in computer science. Rohan has an experience of more than 2 years in designing, implementing, and optimizing end-to-end deep neural network systems. His research is centered around the use of deep learning to solve computer vision-related problems and has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.
Read more about Rohan Chopra

author image
Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with inter-disciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and has done MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion pages contributor to Mint, which is a leading business newspaper in India. He has fifteen years of experience.
Read more about Aniruddha M. Godbole

author image
Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at US healthcare company leading a team of data scientists and subject matter expertise to design and build the clinical NLP engine to revamp medical coding workflows, enhance coder efficiency, and accelerate revenue cycle. He has experience of more than 3 years in building NLP solutions and web-based data science platforms in the area of healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering with a fair understanding of the business domain. He is a member of the regional and national python community. He is author of pySBD - an NLP open-source python library for sentence segmentation which is recognized by ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.
Read more about Nipun Sadvilkar

author image
Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with vast experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a masters degree in computer science from the University of Kashmir and is currently working in a Bangalore based startup named Datoin.
Read more about Muzaffar Bashir Shah

author image
Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in the data science arena with a specialization in deep learning-based text analytics, NLP, and recommendation systems. He has publications in several international conferences and journals.
Read more about Sohom Ghosh

author image
Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools such as NLTK, gensim, and spacy.
Read more about Dwight Gunning