
You're reading from Mastering Text Mining with R

Product type: Book
Published in: Dec 2016
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783551811
Edition: 1st Edition
Concepts: Data Mining
Author: KUMAR ASHISH

Chapter 5. Text Summarization and Clustering

High-dimensional unstructured data is difficult to organize, query, and search. If we can learn to extract the latent thematic structure of a text document or a collection of documents, we can harness a wealth of information that would not have been retrievable without advances in natural language processing. In this chapter, we will learn about topic modeling and text summarization. We will learn how to extract hidden themes from documents and collections so that we can use them for purposes such as corpus summarization, document organization, document classification, taxonomy generation of web documents, organizing search engine query results, news or article recommendation systems, and duplicate content detection. We will also discuss an interesting application of probabilistic language models in sentence completion:

Topic...

Topic modeling

Topic models can be used to discover the underlying themes or topics present in an unstructured collection of documents. The collection can then be organized by the discovered topics, so that users can easily browse documents by the topics that interest them. Various topic modeling algorithms can be applied to a collection of documents to achieve this. Clustering is a very useful technique for grouping documents, but it doesn't always fit the requirements. When we cluster text documents, each document belongs to exactly one cluster. Consider this scenario: we have a book called Text Mining with R Programming Language. Should this book be grouped with R programming-related books, or with text mining-related books? The book is about R programming as well as text mining, and thus should be listed in both sections. In this topic, we will learn methods that do not cluster documents into completely...

Latent semantic analysis

Latent Semantic Analysis (LSA) is a modeling technique that can be used to understand a given collection of documents. It provides insight into the relationships between words in the documents, uncovers the concealed structure in the document contents, and creates a group of suitable topics, where each topic carries information about the variation in the data that explains the context of the corpus. This technique is useful in a variety of natural language processing and information retrieval tasks. LSA can filter out noisy features, represent the data in a simpler form, and discover topics with high affinity.

The topics that are extracted from the collection of documents have the following properties:

The amount of similarity each topic has with each document in the corpus.
The amount of similarity each topic has with each term in the corpus.
It also provides a significance score that highlights the importance of the topic and the variance...
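
As a minimal sketch of these properties, LSA can be approximated with a truncated singular value decomposition in base R. The toy term-document counts, the term and document names, and the choice of k = 2 topics are assumptions made for illustration and are not taken from the book.

```r
# Toy term-document matrix (rows = terms, columns = documents); counts are made up.
tdm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 2, 1,
    0, 0, 1, 2,
    1, 1, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("cluster", "kmeans", "topic", "latent", "text"),
                  c("d1", "d2", "d3", "d4"))
)

# LSA rests on a truncated SVD: tdm is approximated by U_k %*% diag(d_k) %*% t(V_k).
decomp <- svd(tdm)
k <- 2  # number of topics to keep (an assumption for this toy example)

term_topic     <- decomp$u[, 1:k, drop = FALSE]  # term-to-topic loadings
doc_topic      <- decomp$v[, 1:k, drop = FALSE]  # document-to-topic loadings
topic_strength <- decomp$d[1:k]                  # singular values: topic importance

rownames(term_topic) <- rownames(tdm)
rownames(doc_topic)  <- colnames(tdm)

# Share of the total variation (squared singular values) captured by the k topics.
explained <- sum(topic_strength^2) / sum(decomp$d^2)

# Rank-k reconstruction with the remaining (noise) dimensions dropped.
tdm_k <- term_topic %*% diag(topic_strength, nrow = k) %*% t(doc_topic)
```

Here `doc_topic` and `term_topic` play the role of the topic-to-document and topic-to-term similarities, while the singular values in `topic_strength` act as the significance score of each topic.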

Text clustering

Text clustering is an unsupervised learning technique that finds and groups similar objects together. The objective is to create groups or clusters that are internally coherent but substantially dissimilar from each other, or far from each other when similarity is expressed as distance. In simple words, the objects inside a cluster are as similar to each other as possible, while the objects in one cluster are as dissimilar to, or as far from, the objects in another cluster as possible.

Traditionally, clustering has been applied to numeric data. More recently, it has also been applied to text data. Text clustering is used to group text objects of different granularities, such as documents, paragraphs, sentences, or terms. We can find applications of text clustering in many tasks related to text data, for example, corpus summarization, document organization, document classification, taxonomy generation of web documents, organizing search engine...

Document clustering

Document clustering is the process of grouping or partitioning text documents into meaningful groups. The clustering algorithm works by minimizing the distance between objects within a cluster (the intra-cluster distance), while keeping the distance between different clusters (the inter-cluster distance) at a maximum.

For example, if we have a collection of news articles and we perform clustering on the collection, we will find that similar documents are closer to each other and lie in the same cluster.
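
This objective can be illustrated with a small k-means run in base R. The document-term counts below are invented, and the document names and the choice of two clusters are assumptions made for the example.

```r
# Toy document-term matrix (rows = documents); the counts are invented.
docs <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 0,
    0, 0, 3, 2,
    0, 0, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("oil_news_1", "oil_news_2", "sports_news_1", "sports_news_2"),
                  c("crude", "barrel", "match", "goal"))
)

set.seed(42)  # k-means starts from randomly chosen centers
fit <- kmeans(docs, centers = 2, nstart = 10)

fit$cluster       # cluster label assigned to each document
fit$tot.withinss  # total within-cluster sum of squares (the algorithm keeps this small)
fit$betweenss     # between-cluster sum of squares (large when clusters are far apart)
```

The two oil articles end up in one cluster and the two sports articles in the other, with a within-cluster sum of squares that is much smaller than the between-cluster sum of squares.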

Some of the commonly used text clustering methods are as follows:

Standard methods:
- K-means
- Hierarchical clustering
Specialized clustering:
- Suffix tree clustering
- Frequent-term set-based
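
Of the standard methods listed, hierarchical clustering can be sketched in base R as follows. The toy matrix, the use of cosine distance, and the average linkage are assumptions chosen for illustration, not the book's settings.

```r
# Toy document-term matrix (rows = documents); the counts are invented.
docs <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 0,
    0, 0, 3, 2,
    0, 0, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("r_book_1", "r_book_2", "nlp_book_1", "nlp_book_2"),
                  c("vector", "function", "corpus", "token"))
)

# Cosine distance = 1 - cosine similarity between document vectors.
unit_rows   <- docs / sqrt(rowSums(docs^2))        # scale each row to unit length
cosine_dist <- as.dist(1 - unit_rows %*% t(unit_rows))

# Agglomerative clustering: repeatedly merge the two closest groups.
tree   <- hclust(cosine_dist, method = "average")
groups <- cutree(tree, k = 2)                      # cut the tree into two clusters
groups
```

Calling `plot(tree)` draws the dendrogram, which can help when deciding where to cut the tree.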

Let's take a simple example of a document-term matrix created from data available with the tm package in R:

library(tm)
data("crude")
dtm <- DocumentTermMatrix(
  crude,
  control = list(
    weighting = function(x) weightTfIdf(x, normalize = FALSE),
    stopwords = TRUE
  )
)
dtm
<<DocumentTermMatrix (documents: 20, terms: 1200)>>
Non-/sparse entries: 1890...

Sentence completion

Sentence completion is an interesting application of natural language processing. Sentence auto-completion is a feature that is still absent from many modern-day browsers and mobile interfaces. Getting grammatically and contextually relevant suggestions for what to type next, after we have typed a few words, would be a great feature to have.

Coursera, in one of the data science courses by Johns Hopkins, provided four compressed datasets that contain the terms and frequencies of unigrams, bigrams, trigrams, and 4-grams, one dataset per n-gram order. The problem at hand was to come up with a model that can learn to predict relevant words to type next.

The following code uses the Katz back-off algorithm, drawing on the four n-gram term frequency datasets to predict the next word in a sentence:

library(tm)
library(stringr)
# Load the four n-gram term frequency datasets
load("/data_frame1.RData")
load("/data_frame2.RData")
load("/data_frame3.RData")
load("/data_frame4.RData")
CleanInputString<- function(input_string...
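
The back-off idea itself can be sketched independently of the data files above. This is a simplified illustration rather than the book's implementation: it looks up the longest matching context and backs off to shorter n-grams when nothing matches, but it omits the discounting and back-off weights of the full Katz method. The tiny n-gram tables and the function name `predict_next_word` are hypothetical.

```r
# Hypothetical n-gram frequency tables: context, candidate next word, count.
unigrams <- data.frame(word = c("the", "of", "end"), freq = c(50, 30, 10),
                       stringsAsFactors = FALSE)
bigrams <- data.frame(context = c("the", "the", "of"),
                      word    = c("end", "best", "the"),
                      freq    = c(5, 7, 9),
                      stringsAsFactors = FALSE)
trigrams <- data.frame(context = c("at the", "one of"),
                       word    = c("end", "the"),
                       freq    = c(4, 6),
                       stringsAsFactors = FALSE)

predict_next_word <- function(input_string) {
  # Normalise: lower-case, drop anything that is not a letter or space, tokenise.
  cleaned <- gsub("[^a-z ]", "", tolower(input_string))
  tokens  <- strsplit(trimws(cleaned), "\\s+")[[1]]
  tokens  <- tokens[nzchar(tokens)]

  # Try the longest context first (2 words), then back off to 1 word.
  tables <- list(trigrams, bigrams)
  for (i in seq_along(tables)) {
    n_context <- 3 - i
    if (length(tokens) < n_context) next
    context <- paste(tail(tokens, n_context), collapse = " ")
    hits <- tables[[i]][tables[[i]]$context == context, ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  # Final fallback: the most frequent unigram.
  unigrams$word[which.max(unigrams$freq)]
}

predict_next_word("We met at the")  # trigram match on "at the": "end"
predict_next_word("This is the")    # backs off to bigram "the": "best"
predict_next_word("Hello")          # no match, falls back to unigram: "the"
```

A fuller Katz back-off model would additionally discount the observed counts and redistribute the freed probability mass to the lower-order n-grams, so that the candidates are ranked by probability rather than by raw frequency.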

Summary

Topic modeling is an excellent method with a wide range of applications in information retrieval from text data. In this chapter, we learned about a few topic modeling methods and their implementation using R. We also learned about feature extraction and text clustering using R. Last, but not least, we took a practical real-world problem and built a baseline sentence completion application.

In the next chapter, we are going to dive into supervised learning algorithms and their use in text classification.


Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. Natural Language Processing, IoT analytics, R Shiny product development, and ensemble ML methods are among his core areas of expertise. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.