Chapter 6. Text Classification

Text classification is an extensively used technique in natural language processing, with widespread utility across different domains. Also known as text categorization, it finds use in various tasks related to information retrieval and management. Spam detection in e-mails, opinion mining or sentiment analysis on social media data, priority e-mail sorting, intent identification from user queries in chatbots, and automated query answering mechanisms are a few examples where text categorization has proved to be highly effective. In earlier chapters, we discussed various feature selection and dimensionality reduction methods, which are preprocessing steps before text classification. We will briefly discuss supervised learning or classification mechanisms and how a learner is designed, and then move on to their implementation on text data. We will also discuss the different cross-validation and evaluation mechanisms...

Text classification


The digital era has seen a humongous increase in data, most of it unstructured, which needs to be processed to extract any information from it. Research in the field of natural language processing has paved the way towards the automatic organization and classification of documents into the categories they belong to. Document classification finds its utility in numerous applications such as spam classification, mail routing, priority inbox or mail relevance ranking, news monitoring and censoring, identification of article genre, and indexing of documents. The text classification process flow is described in the following diagram. We discussed the preprocessing steps, which involve basic data cleansing, in Chapter 5, Text Summarization and Clustering. After this step, we choose the document representation method. Feature extraction and selection is performed on the cleansed data as per the document representation method chosen in the previous step. We have discussed feature...

Document representation


The first step in the text classification process is to figure out how to represent the documents in a manner that is suitable for classification tasks and learning algorithms. This step is intended to reduce the complexity of the documents, making them easier to work with. While doing so, the following questions come to mind:

  • Do we need to preserve the order of words?

  • Is losing information about word order a concern for us?

An attribute-value representation of documents implies that the order of words in a document is not of high significance; each unique word in a document can be considered a feature, and the frequency of its occurrence is represented as the feature's value. Further, discarding features with a very low value or occurrence in the document can reduce the high dimensionality.

Vector space representation of words considers each word in a document as a vector. The attribute-value representation may take a Boolean form, a set-of-words approach that...
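
As a minimal sketch of the attribute-value representation in R, the following builds a document-term matrix with the tm package, where each unique term is a feature and its count (or Boolean presence) is the value; the toy documents and the 0.75 sparsity threshold are illustrative assumptions, not the book's own example:

    # Attribute-value (bag-of-words) representation with tm;
    # the documents here are toy examples
    library(tm)

    docs   <- c("cheap meds buy now", "meeting agenda for monday",
                "buy cheap tickets now", "monday project meeting")
    corpus <- VCorpus(VectorSource(docs))

    # Each unique term becomes a feature; its frequency is the value
    dtm <- DocumentTermMatrix(corpus)

    # Boolean (set-of-words) form: record presence/absence, not counts
    dtm_bool <- DocumentTermMatrix(corpus,
                                   control = list(weighting = weightBin))

    # Discard very sparse terms to reduce dimensionality
    dtm_small <- removeSparseTerms(dtm, sparse = 0.75)
    inspect(dtm_small)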

Kernel methods


Kernel methods exploit the similarity between documents, for example by length, topic, or language, to extract patterns from them. Inner products between data items can reveal a lot of latent information; in fact, many standard algorithms can be represented in the form of inner products between data items in a potentially complex feature space. The reason kernel methods are suitable for high-dimensional data is that their complexity depends only on the choice of kernel, not on the number of features of the data in use. Kernels address the computational issues by computing inner products in a richer, non-linear feature space without constructing it explicitly, so that a linear classifier can be applied to the implicitly transformed data, as shown in the following diagram:

Some of the kernel methods available are:

  • Linear kernel

  • Polynomial kernel

  • Radial basis function (RBF) kernel

  • Sigmoid kernel
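
In R, the kernlab package provides implementations of all four kernels above as functions that can be evaluated directly on pairs of vectors; the example vectors and the sigma and degree parameter values in this sketch are arbitrary illustrative choices:

    # Evaluating the four kernels on a pair of example vectors with kernlab
    library(kernlab)

    x <- c(1, 2, 3)
    y <- c(2, 1, 0)

    lin  <- vanilladot()          # linear kernel: a plain inner product
    poly <- polydot(degree = 2)   # polynomial kernel of degree 2
    rbf  <- rbfdot(sigma = 0.5)   # radial basis function (RBF) kernel
    sig  <- tanhdot()             # sigmoid (hyperbolic tangent) kernel

    lin(x, y)
    poly(x, y)
    rbf(x, y)
    sig(x, y)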

Support vector machines

The support vector machine (SVM) is a kernel method of classification, which gained a lot...
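
As a hedged sketch of how an SVM text classifier might be trained in R, the following combines tm's document-term matrix with kernlab's ksvm; the toy spam/ham documents and labels, the linear kernel, and the cost value C = 1 are illustrative assumptions rather than the book's own example:

    # Training an SVM on a document-term matrix; data are toy examples
    library(tm)
    library(kernlab)

    docs   <- c("win cash now", "cheap meds now",
                "project meeting notes", "agenda for the review meeting")
    labels <- factor(c("spam", "spam", "ham", "ham"))

    dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
    x   <- as.matrix(dtm)

    # A linear kernel is a common first choice for sparse, high-dimensional
    # text features; scaling is disabled because the counts are sparse
    model <- ksvm(x, labels, kernel = "vanilladot", C = 1, scaled = FALSE)
    predict(model, x)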

Bias–variance trade-off and learning curve


It has been observed that non-linear classifiers are usually more powerful than linear classifiers for text classification problems. However, that does not necessarily imply that a non-linear classifier is the solution to every text classification problem. It is quite interesting to note that there is no single optimal learning algorithm that is universally applicable; thus, algorithm selection becomes quite a pivotal part of any modeling exercise. Also, the complexity of a model should not be judged entirely by whether it is a linear or non-linear classifier; there are multiple other aspects of the modeling process, such as feature selection, regularization, and so on, that can contribute to the complexity of the model.

The error components in a learning model can be broadly categorized as irreducible errors and reducible errors. Irreducible errors are caused by inherent variability in a system; not much can be done about this component...
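
For squared-error loss, this split is captured by the standard decomposition of the expected prediction error at a point x, where the first two terms together make up the reducible error:

    E[(y - \hat{f}(x))^2] = \underbrace{(E[\hat{f}(x)] - f(x))^2}_{\text{bias}^2}
                          + \underbrace{E[(\hat{f}(x) - E[\hat{f}(x)])^2]}_{\text{variance}}
                          + \underbrace{\sigma^2}_{\text{irreducible error}}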

Learning curve


A learning curve is a plot of the amount of training data used against the training and test error; it is plotted to diagnose the learning algorithm and minimize the reducible errors. The following example is a typical case of high variance:

The following diagram is a typical case of high bias. The training error and test error converge to a similarly high value, and thus the model has underfit. We need to choose a more complex algorithm that can fit this data well and provide us with better generalization ability.
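
A learning curve of this kind can be sketched by hand in R by training on progressively larger subsets and recording both errors; the simulated data and the logistic regression model below are illustrative assumptions, not the book's example:

    # Hand-rolled learning curve: train/test error vs. training set size;
    # the data are simulated for illustration
    set.seed(42)
    n  <- 500
    df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
    df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(n) > 0, "pos", "neg"))

    train_idx <- sample(n, 350)
    test      <- df[-train_idx, ]
    sizes     <- seq(50, 350, by = 50)

    # Misclassification error of a fitted logistic regression on a dataset
    err <- function(fit, data) {
      p <- predict(fit, newdata = data, type = "response")
      mean(ifelse(p > 0.5, "pos", "neg") != data$y)
    }

    train_err <- test_err <- numeric(length(sizes))
    for (i in seq_along(sizes)) {
      train <- df[train_idx[1:sizes[i]], ]
      fit   <- glm(y ~ ., data = train, family = binomial)
      train_err[i] <- err(fit, train)
      test_err[i]  <- err(fit, test)
    }

    plot(sizes, test_err, type = "b", xlab = "Training set size",
         ylab = "Error", ylim = range(train_err, test_err))
    lines(sizes, train_err, type = "b", lty = 2)
    legend("topright", legend = c("test error", "training error"),
           lty = c(1, 2))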

Dealing with reducible error components


High bias:

  • Add more features

  • Apply a more complex model

  • Use fewer instances to train

  • Reduce regularization

High variance:

  • Conduct feature selection and use fewer features

  • Get more training data

  • Use regularization to help overcome the issues caused by complex models, as sketched below
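
As a hedged illustration of the first and last of these remedies, the glmnet package fits penalized logistic regression: the penalty shrinks coefficients, which reduces variance, and the lasso penalty (alpha = 1) additionally drives many coefficients to zero, performing feature selection. The simulated data below are an assumption for illustration:

    # Regularization to combat high variance; data are simulated
    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(200 * 50), ncol = 50)  # many features, few instances
    y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(200) > 0, "pos", "neg"))

    # alpha = 1 is the lasso (L1); alpha = 0 would give ridge (L2);
    # cv.glmnet chooses the penalty strength lambda by cross-validation
    fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

    fit$lambda.min               # penalty selected by cross-validation
    coef(fit, s = "lambda.min")  # sparse coefficients: most are zero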

Cross-validation

Cross-validation is an important step in the model validation and evaluation process. It is a technique for validating the performance of a model before we apply it to an unobserved dataset. It is not advisable to use the full training data to train the model, because in that case we would have no idea how the model is going to perform in practice. As we learnt in the previous section, a good learner should be able to generalize well to an unseen dataset; that can happen only if the model is able to extract and learn the underlying patterns or relations between the dependent and independent attributes. If we train the model on the full training data and apply the same on a test data, it is...
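
As a brief sketch, k-fold cross-validation can be run in R with the caret package; the choice of a decision tree (method = "rpart") and the built-in iris data are illustrative assumptions:

    # 10-fold cross-validation with caret; model and data are examples
    library(caret)

    ctrl <- trainControl(method = "cv", number = 10)
    fit  <- train(Species ~ ., data = iris,
                  method = "rpart", trControl = ctrl)

    fit$results  # accuracy and kappa averaged over the 10 held-out folds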

Summary


In this chapter, we learned about text classification methods and their implementation in the R language. We learned how to use classifiers to build applications such as spam filters, topic models, and so on. We also looked into model evaluation and validation methods. In the coming chapters, we will take up more practical examples, utilizing the tools and knowledge gained in the previous chapters to mine and extract insights from social media. We will also look into different distributed text mining methods that help applications built in R scale up to high-dimensional data.
