Chapter 5. Text Summarization and Clustering
High-dimensional, unstructured data is difficult to organize, query, and retrieve information from. If we can learn to extract the latent thematic structure of a text document, or a collection of such documents, we can harness a wealth of information that would not have been accessible without advances in natural language processing. In this chapter, we will learn about topic modeling and text summarization. We will learn how to extract hidden themes from documents and collections and use them effectively for purposes such as corpus summarization, document organization, document classification, taxonomy generation for web documents, organizing search engine query results, news or article recommendation systems, and duplicate content detection. We will also discuss an interesting application of probabilistic language models: sentence completion.
Topic models can be used for discovering the underlying themes or topics that are present in an unstructured collection of documents. The collection can then be organized by the discovered topics, so that users can easily browse the documents by topics of interest. There are various topic modeling algorithms that can be applied to a collection of documents to achieve this. Clustering is a very useful technique for grouping documents, but it doesn't always fit the requirements: when we cluster text documents, each document ends up belonging to exactly one cluster. Consider this scenario: we have a book called Text Mining with R Programming Language. Should this book be grouped with R programming-related books, or with text mining-related books? The book is about R programming as well as text mining, and thus should be listed in both sections. In this chapter, we will learn methods that do not force documents into completely disjoint clusters.
Latent Semantic Analysis (LSA) is a modeling technique that can be used to understand a given collection of documents. It provides insights into the relationships between the words in the documents, unravels the concealed structure of the document contents, and produces a set of topics, each of which captures part of the variation in the data that explains the context of the corpus. This technique comes in handy in a variety of natural language processing and information retrieval tasks. LSA can filter out noisy features in the data, represent the data in a simpler form, and discover topics with high affinity.
The topics that are extracted from the collection of documents have the following properties:
The amount of similarity each topic has with each document in the corpus.
The amount of similarity each topic has with each term in the corpus.
LSA also provides a significance score for each topic, reflecting the topic's importance and the variance in the data that it explains.
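The core of LSA is a truncated singular value decomposition (SVD) of the term-document matrix. The following is a minimal sketch using base R on a tiny illustrative matrix (the terms, documents, and counts here are made up for demonstration):

```r
# Toy term-document matrix: rows are terms, columns are documents
tdm <- matrix(c(1, 0, 2, 0,
                0, 1, 0, 1,
                2, 0, 1, 0,
                0, 2, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("mining", "programming", "text", "language"),
                              paste0("doc", 1:4)))

# LSA: singular value decomposition, truncated to k latent topics
s <- svd(tdm)
k <- 2

term_topic  <- s$u[, 1:k]   # similarity of each term to each topic
doc_topic   <- s$v[, 1:k]   # similarity of each document to each topic
topic_score <- s$d[1:k]     # significance: variation explained per topic

round(topic_score, 2)
```

The singular values in `topic_score` are sorted in decreasing order, which is why keeping the first `k` of them retains the topics that explain the most variation in the corpus.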
Text clustering is an unsupervised learning technique that finds and groups similar objects together. The objective is to create groups, or clusters, that are internally coherent but substantially dissimilar from each other; that is, far apart when similarity is expressed in terms of distance. In simple words, the objects inside a cluster are as similar to each other as possible, while the objects in one cluster are as dissimilar from, or far from, the objects in another cluster as possible.
Traditionally, clustering has been applied to numeric data; lately, it has found its use in text data as well. Text clustering is utilized to group text objects of different granularities, such as documents, paragraphs, sentences, or terms. We can find applications of text clustering in many tasks related to text data, for example, corpus summarization, document organization, document classification, taxonomy generation for web documents, and organizing search engine query results.
Document clustering is the process of grouping or partitioning text documents into meaningful groups. The hypothesis behind a clustering algorithm is to minimize the intra-cluster distance, the distance between objects within a cluster, while maximizing the inter-cluster distance, the distance between clusters.
For example, if we have a collection of news articles and we perform clustering on the collection, we will find that similar documents lie close to each other within the same cluster.
Some of the commonly used text clustering methods are as follows:
Standard methods:
K-means
Hierarchical clustering
Specialized clustering:
Suffix tree clustering
Frequent-term set-based
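As a toy sketch of the standard K-means approach listed above, the following base R snippet clusters four "documents" represented as rows in a small feature matrix (the data here is illustrative, not a real corpus):

```r
# Each row is a document described by counts of two terms;
# docs 1-2 are term-1-heavy, docs 3-4 are term-2-heavy
docs <- rbind(c(5, 1), c(4, 2), c(1, 5), c(2, 4))
rownames(docs) <- paste0("doc", 1:4)

set.seed(42)  # fix the random initialization for reproducibility
km <- kmeans(docs, centers = 2)

# Cluster assignment for each document;
# docs 1-2 and docs 3-4 fall into separate clusters
km$cluster
```

In a real application the rows would come from a (possibly TF-IDF weighted) document-term matrix, but the mechanics of the call are the same.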
Let's take a simple example of a term-document matrix created from data available with the tm package in R:
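A minimal example, assuming the tm package is installed, uses the `crude` corpus of 20 Reuters news articles that ships with tm:

```r
library(tm)

data("crude")  # 20 Reuters news articles bundled with the tm package

# Build a term-document matrix, removing punctuation and stop words
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

# Rows are terms, columns are the 20 documents
dim(tdm)
inspect(tdm[1:5, 1:5])
```

The resulting matrix is sparse: most terms appear in only a few of the documents, which is exactly the structure that clustering and topic modeling methods exploit.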
This is an interesting application of natural language processing. Sentence auto-completion is a feature that is surprisingly absent from our modern-day browsers and mobile interfaces. Getting grammatically and contextually relevant suggestions about what to type next, as we type the first few words, would be a great feature to have.
Coursera, in one of the data science courses by Johns Hopkins, provided four compressed datasets containing the terms and frequencies of unigrams, bigrams, trigrams, and 4-grams. The problem at hand was to come up with a model that learns to predict the next relevant word to type.
The following code uses the Katz back-off algorithm, leveraging the four n-gram term-frequency datasets to predict the next word in a sentence:
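Since the Coursera datasets themselves are not reproduced here, the sketch below uses tiny made-up n-gram tables in their place and implements a simplified back-off (falling from the trigram table to the bigram table when a prefix is unseen) rather than full Katz discounting:

```r
# Toy n-gram frequency tables standing in for the Coursera datasets
bigrams  <- data.frame(prefix = c("new", "new", "of"),
                       word   = c("york", "year", "the"),
                       freq   = c(50, 30, 100),
                       stringsAsFactors = FALSE)
trigrams <- data.frame(prefix = c("in new", "in new"),
                       word   = c("york", "jersey"),
                       freq   = c(40, 10),
                       stringsAsFactors = FALSE)

# Predict the next word: try the trigram table first,
# then back off to the bigram table if the prefix is unseen
predict_next <- function(phrase) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  n <- length(words)
  if (n >= 2) {
    key  <- paste(words[n - 1], words[n])
    hits <- trigrams[trigrams$prefix == key, ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  hits <- bigrams[bigrams$prefix == words[n], ]
  if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  NA_character_  # no prediction available
}

predict_next("in new")   # matched in the trigram table
predict_next("of")       # backs off to the bigram table
```

The full Katz method additionally discounts the higher-order counts and redistributes the freed probability mass to the lower-order models; the back-off structure of the lookup, however, is the same as shown here.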
Topic modeling is an excellent method with a wide range of applications in information retrieval from text data. In this chapter, we learned about a few topic modeling methods and their implementation in R. We also learned about feature extraction and text clustering using R. Last but not least, we took a practical real-world problem and built a baseline sentence-completion application.
In the next chapter, we are going to dive into supervised learning algorithms and their use in text classification.