Chapter 11. Similarity Queries and Summarization

Once we have begun to represent text documents as vectors, it is possible to start finding the similarity or distance between documents, and that is exactly what we will learn about in this chapter. We are now aware of a variety of different vector representations, from standard bag-of-words or TF-IDF to topic model representations of text documents. We will also learn about a very useful feature implemented in Gensim and how to use it: summarization and keyword extraction. Here's a summary of what we'll learn from this chapter:

  • Similarity metrics
  • Similarity queries
  • Text summarization

Similarity metrics


Similarity metrics [1] are mathematical constructs that are particularly useful in natural language processing, especially in information retrieval. Let's first try to understand what a metric is. We can understand a metric as a function that defines a distance between each pair of elements of a set, or vectors. It's clear how this would be useful to us: we can compare how similar two documents are based on the distance between them. A low value returned by the distance function means that the two documents are similar, and a high value means they are quite different.

While we mention documents in the example, we can technically compare any two elements of a set; this also means we can compare, for example, two sets of topics created by a topic model. We can compare the TF-IDF representations of documents, as well as their LSI or LDA representations, as shown in the sketch below.
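As a quick illustration, here is a minimal sketch of such comparisons using the helper functions in Gensim's matutils module; the tiny corpus and the model settings below are made up purely for this example and are not taken from the book.

from gensim import corpora, models, matutils

# A toy corpus: two related documents and one unrelated one
texts = [['bank', 'river', 'water'],
         ['river', 'water', 'flow'],
         ['finance', 'money', 'bank']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Compare TF-IDF representations with cosine similarity (higher = more similar)
tfidf = models.TfidfModel(corpus)
print(matutils.cossim(tfidf[corpus[0]], tfidf[corpus[1]]))
print(matutils.cossim(tfidf[corpus[0]], tfidf[corpus[2]]))

# Compare LDA topic distributions with Hellinger distance (lower = more similar);
# on such a tiny corpus the topic distributions will be rough, but the code runs
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(matutils.hellinger(lda[corpus[0]], lda[corpus[1]]))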

Most of us would be aware of one distance or similarity metric already: the Euclidean metric...
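For instance, computing the Euclidean distance between two bag-of-words vectors could look like the following sketch; the sparse vectors and the vocabulary size of 6 are illustrative values, not taken from any corpus in the book.

import numpy as np
from gensim import matutils

# Two sparse bag-of-words vectors over a 6-word vocabulary (illustrative values)
vec_1 = [(0, 2), (1, 1), (3, 1)]
vec_2 = [(0, 1), (2, 1), (3, 2)]

# Convert the sparse vectors to dense arrays and take the Euclidean (L2) distance
dense_1 = matutils.sparse2full(vec_1, length=6)
dense_2 = matutils.sparse2full(vec_2, length=6)
print(np.linalg.norm(dense_1 - dense_2))  # a low value means the documents are similar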


Similarity queries

Now that we have the capability to compare two documents, it is possible for us to set up our algorithms to extract the most similar documents for an input query: simply index each of the documents, then search for the lowest distance value returned between the corpus and the query, and return the documents with the lowest distance values; these would be the most similar. Luckily for us, however, Gensim has in-built structures to do this document similarity task!
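A brute-force version of this idea, before reaching for Gensim's built-in structures, might look like the sketch below; the query and corpus here are hypothetical placeholders, and because cosine similarity is used rather than a distance, we sort for the highest value instead of the lowest.

from gensim import matutils

def most_similar(query_bow, corpus_bow, topn=3):
    # Score every document in the corpus against the query
    sims = [(doc_id, matutils.cossim(query_bow, doc_bow))
            for doc_id, doc_bow in enumerate(corpus_bow)]
    # Highest cosine similarity (smallest angular distance) first
    return sorted(sims, key=lambda item: -item[1])[:topn]

# Hypothetical usage, assuming a dictionary and bag-of-words corpus already exist:
# query_bow = dictionary.doc2bow(['river', 'water'])
# print(most_similar(query_bow, corpus))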

We will be using the similarities module to construct this structure.

from gensim import similarities

We previously mentioned creating an index; we can do this far faster with the similarities module. As mentioned in the Gensim documentation for the Similarity class, the Similarity class splits the index into several smaller subindexes (shards), which are disk...
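To make this concrete, a minimal sketch of building and querying such an index might look as follows; the toy corpus and the index path /tmp/sim_index are placeholders invented for this example, and the shard handling happens internally in Gensim.

from gensim import corpora, models, similarities

# Hypothetical preprocessed corpus: a list of tokenised documents
texts = [['bank', 'river', 'water'],
         ['river', 'water', 'flow'],
         ['finance', 'money', 'bank']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

# Build a sharded, disk-backed similarity index over the TF-IDF corpus
index = similarities.Similarity('/tmp/sim_index',       # prefix for the shard files on disk
                                tfidf[corpus],
                                num_features=len(dictionary))

# Query: convert new text into the same TF-IDF space, then look up similarities
query = dictionary.doc2bow(['river', 'bank'])
sims = index[tfidf[query]]
print(sorted(enumerate(sims), key=lambda item: -item[1]))  # most similar documents first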

Summarizing text

Often in text analysis, it is useful to summarize large bodies of text, either to have a brief overview of the text before deeply analyzing it or to identify the keywords in a text. It is also often the end game, a text analysis task of its own. We will not be working on building our own text summarization pipeline, but will rather focus on using the built-in summarization API which Gensim offers us.

It is important to remember that the algorithms included in Gensim do not create their own sentences, but rather extract the key sentences from the text which we run the algorithm on. This summarizer is based on the TextRank algorithm, from an article by Mihalcea and others, called TextRank [10]. This algorithm was later improved upon by Barrios and others in another article, Variations of the Similarity Function of TextRank for Automated Summarization ...
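As a short illustration of this API (available as gensim.summarization in the Gensim 3.x releases current when this book was written; the placeholder text below is invented for the example, and longer inputs with many sentences give better results):

from gensim.summarization import summarize, keywords

# Placeholder text; any document with several sentences will do
text = ("Text summarization is the task of producing a shorter version of a document. "
        "Extractive summarizers select the most important sentences from the original text. "
        "TextRank builds a graph of sentences and ranks them by their similarity to each other. "
        "Gensim implements a variation of TextRank with an improved similarity function. "
        "Keyword extraction works in a similar graph-based way, but on words instead of sentences.")

print(summarize(text, ratio=0.4))   # extract roughly 40% of the sentences
print(keywords(text))               # extract the most important keywords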
