You're reading from Mastering Text Mining with R

Product type: Book
Published in: Dec 2016
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783551811
Edition: 1st Edition
Author (1)
KUMAR ASHISH

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. Natural Language Processing, IoT Analytics, R Shiny product development, and Ensemble ML methods are his core areas of expertise. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

Chapter 3. Categorizing and Tagging Text

In corpus linguistics, categorizing or tagging text into word classes (lexical categories) is considered the second step in the NLP pipeline, after tokenization. We have all studied parts of speech in our elementary classes; we were familiarized with nouns, pronouns, verbs, adjectives, and their utility in English grammar. These word classes are not just the salient pillars of grammar, but are also pivotal in many language processing activities. The process of categorizing and labeling words into different parts of speech is known as parts of speech (POS) tagging, or simply tagging.

The goal of this chapter is to equip you with the tools and the associated knowledge about different tagging, chunking, and entailment approaches and their usage in natural language processing.

Earlier chapters focused on basic text processing; this chapter builds on those concepts to explain the different approaches of tagging texts into lexical categories...

Parts of speech tagging


In text mining we tend to view free text as a bag of tokens (words, n-grams). This approach is quite useful for quantitative analysis, search, and information retrieval. However, when we take a bag-of-tokens approach, we lose much of the information contained in the free text, such as sentence structure, word order, and context. These are some of the attributes of natural language that humans use to interpret the meaning of a given text. NLP is a field focused on understanding free text; it attempts to understand a document as completely as a human reader would.
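The loss of word order can be made concrete with a small sketch. It is written in plain Python purely for illustration, and the helper name `bag_of_tokens` and the two example sentences are assumptions introduced here, not taken from the book. Two sentences with opposite meanings end up with identical bags of tokens:

```python
from collections import Counter

def bag_of_tokens(text):
    """Lower-case the text, split on whitespace, and count tokens.

    The Counter keeps frequencies only; token order is discarded.
    (Hypothetical helper, introduced for this illustration.)
    """
    return Counter(text.lower().split())

a = "the dog bit the man"
b = "the man bit the dog"

# Same multiset of tokens, even though the meaning is reversed.
same_bag = bag_of_tokens(a) == bag_of_tokens(b)
# The token sequences themselves differ.
same_order = a.split() == b.split()

print(same_bag, same_order)
```

Anything that depends on who did what to whom is invisible at the bag-of-tokens level, which is the gap that tagging and the other techniques in this chapter aim to close.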

POS tagging is a prerequisite and one of the most important steps in text analysis. It is the annotation of words with the right POS tags, based on the context in which they appear: POS taggers categorize words based on what they mean in a sentence or on the order in which they appear. POS taggers provide information about the semantic meaning of the word. POS taggers use some...

Hidden Markov Models for POS tagging


Hidden Markov Models (HMMs) are generative sequence models that are well suited to sequence classification problems. In natural language processing, HMMs can be used for a variety of tasks such as phrase chunking, parts of speech tagging, and information extraction from documents. If we treat the words as the input, any prior information on the input as the states, and the estimated conditional probabilities as the output, then POS tagging becomes a typical sequence classification problem that can be solved using an HMM.
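As a rough illustration of how an HMM can resolve an ambiguous word from its context, here is a minimal Viterbi decoding sketch in plain Python. It assumes the usual formulation in which tags are the hidden states and words are the observations. The three-tag set and every probability below are invented for the example rather than estimated from a corpus, and the function name `viterbi` and its `floor` smoothing parameter are likewise assumptions.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable tag sequence for `words` under a toy HMM.

    Missing probabilities are replaced by `floor` so that the log is defined.
    """
    if not words:
        return []

    def lp(p):
        return math.log(p if p > 0 else floor)

    # best[i][tag] = (log-probability of the best path ending in tag, back-pointer)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # Pick the previous tag that maximizes path score plus transition.
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Trace the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Toy parameters (assumed values, for illustration only).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dogs": 0.4, "walk": 0.2, "park": 0.4},
    "VERB": {"walk": 0.6, "barks": 0.4},
}

print(viterbi(["the", "walk"], TAGS, START, TRANS, EMIT))
print(viterbi(["dogs", "walk"], TAGS, START, TRANS, EMIT))
```

With these made-up numbers, "walk" is tagged as a noun after a determiner and as a verb after a noun, which is the kind of context sensitivity a bag of tokens cannot provide. Working in log space avoids numeric underflow on longer sentences.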

Basic definitions and notations

According to Rabiner, there are five elements needed to define an HMM:

  • N denotes the number of states (which are hidden) in the model. For parts of speech tagging, N is the number of tags that can be used by the system. Each possible tag for the system corresponds to one state of the HMM. The individual states are denoted by S = {S1, ..., SN}. Let qt denote...

Collocation and contingency tables


When we look into a corpus, some words tend to appear in combination, for example: "I need a strong coffee", "John kicked the bucket", "He is a heavy smoker". J. R. Firth drew attention to such words that are not combined randomly into a phrase or sentence. Firth coined the term collocations for such word combinations; the meaning of a word is in part determined by its characteristic collocations. In the field of natural language processing (NLP), the combination of words plays an important role.

Word combinations that are considered collocations can be compound nouns, idiomatic expressions, or combinations that are lexically restricted. This variability in definition is reflected in terms such as multi-word expressions (MWE), multi-word units (MWU), bigrams, and idioms.
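One common way to measure how strongly two words attract each other is a 2x2 contingency table of bigram counts. The following is a minimal sketch in plain Python. The toy token list, the function names `contingency_table` and `pmi`, and the choice of pointwise mutual information as the association score are all assumptions made for illustration; other scores, such as chi-square or log-likelihood, can be computed from the same table.

```python
import math

def contingency_table(tokens, w1, w2):
    """Count adjacent token pairs into a 2x2 table for the bigram (w1, w2).

    Returns (o11, o12, o21, o22):
      o11: w1 followed by w2        o12: w1 followed by another word
      o21: another word before w2   o22: neither
    """
    bigrams = list(zip(tokens, tokens[1:]))
    o11 = sum(1 for a, b in bigrams if a == w1 and b == w2)
    o12 = sum(1 for a, b in bigrams if a == w1 and b != w2)
    o21 = sum(1 for a, b in bigrams if a != w1 and b == w2)
    o22 = len(bigrams) - o11 - o12 - o21
    return o11, o12, o21, o22

def pmi(table):
    """Pointwise mutual information: log2(observed / expected under independence)."""
    o11, o12, o21, o22 = table
    n = o11 + o12 + o21 + o22
    if o11 == 0:
        return float("-inf")
    expected = (o11 + o12) * (o11 + o21) / n
    return math.log2(o11 / expected)

# Toy corpus, invented for the example.
tokens = ("i need a strong coffee he likes strong coffee "
          "she drinks strong tea and weak coffee").split()

table = contingency_table(tokens, "strong", "coffee")
score = pmi(table)
print(table, round(score, 3))
```

A positive score means the pair occurs more often than it would if the two words were independent. On a corpus this small the numbers are only illustrative, since association scores are unreliable for low counts.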

Collocations can be observed in corpora and can be quantified. Multi-word expressions have to be stored as units in order to understand their complete meaning. Three characteristic properties emerge...

Feature extraction


Feature extraction is an important and valuable step in text mining. A system that can extract features from text has the potential to be used in many applications. The initial step in feature extraction is tagging the document; the tagged document is then processed to extract the meaningful entities that are required.

The elements that can be extracted from the text are:

  • Entities: These are some of the pieces of meaningful information that can be found in the document, for example, location, companies, people, and so on

  • Attributes: These are the features of the extracted entities, for example the title of the person, type of organization, and so on

  • Events: These are the activities in which the entities participate, for example, dates
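As a minimal sketch of the tag-then-extract idea, the plain-Python snippet below groups consecutive proper-noun tokens from an already tagged sentence into candidate entities. The tagged input is written by hand using Penn Treebank-style tags (NNP, NNPS), and both the sentence and the function name `extract_entities` are assumptions for illustration. A real pipeline would obtain the tags from a POS tagger and would usually rely on a trained named-entity recognizer instead of this simple rule.

```python
def extract_entities(tagged_tokens):
    """Group runs of consecutive proper nouns (NNP/NNPS) into candidate entities.

    `tagged_tokens` is a list of (token, tag) pairs.
    """
    entities, current = [], []
    for token, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(token)
        elif current:
            # A non-proper-noun token closes the current run.
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

# Hand-tagged example sentence (invented for the illustration).
tagged = [
    ("Maria", "NNP"), ("Lopez", "NNP"), ("joined", "VBD"),
    ("Acme", "NNP"), ("Corp", "NNP"), ("in", "IN"),
    ("Berlin", "NNP"), (".", "."),
]

print(extract_entities(tagged))
```

This rule finds entity candidates but not their types. Deciding whether a candidate is a person, a company, or a location, and attaching attributes or events to it, needs further context or a trained model.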

Textual entailment


Human communication is diverse: many different expressions can be used to communicate the same message. This proves to be quite a challenging problem in natural language processing. Consider, for example...

Summary


In this chapter, we learned about different text categorization and tagging methods, and how words can be grouped into different lexical categories or parts of speech to analyze the syntactic structure of a sentence. We also learned about approaches that can be leveraged to build language models which extract concepts or sense from a sentence, using sentence entailment.

In subsequent chapters, we are going to learn more about practical approaches to performing real-time text mining tasks.
