Mastering Text Mining with R


Product type Book
Published in Dec 2016
Publisher Packt
ISBN-13 9781783551811
Pages 258 pages
Edition 1st Edition
Author (1):
KUMAR ASHISH

Chapter 3. Categorizing and Tagging Text

In corpus linguistics, categorizing or tagging text into word classes (lexical categories) is considered the second step in the NLP pipeline, after tokenization. We all studied parts of speech in elementary school, where we were introduced to nouns, pronouns, verbs, and adjectives and their roles in English grammar. These word classes are not just the pillars of grammar; they are also pivotal in many language processing activities. The process of categorizing and labeling words with their parts of speech is known as part-of-speech (POS) tagging, or simply tagging.

The goal of this chapter is to equip you with the tools and the associated knowledge of different tagging, chunking, and entailment approaches and their use in natural language processing.

Earlier chapters focused on basic text processing; this chapter builds on those concepts to explain the different approaches to tagging text with lexical categories...

Parts of speech tagging


In text mining, we tend to view free text as a bag of tokens (words, n-grams). This approach is quite useful for quantitative analysis, searching, and information retrieval. However, a bag-of-tokens approach loses much of the information contained in free text, such as sentence structure, word order, and context. These are some of the attributes of natural language that humans use to interpret the meaning of a given text. NLP is a field focused on understanding free text; it attempts to understand a document as completely as a human reader would.
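To make the loss of word order concrete, here is a minimal sketch in Python (used only for illustration; it is not code from the book). The helper name `bag_of_tokens` and the two sentences are assumptions made up for this example. Both sentences produce the same bag even though they mean different things.

```python
from collections import Counter

def bag_of_tokens(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two illustrative sentences with the same words in a different order.
a = "the dog bit the man"
b = "the man bit the dog"

same_bag = bag_of_tokens(a) == bag_of_tokens(b)   # identical token counts
same_order = a.split() == b.split()               # but different sequences

print(same_bag, same_order)
```

The bag comparison succeeds while the sequence comparison fails, which is exactly the information a tagger or parser needs and a bag of tokens discards.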

POS tagging is a prerequisite for, and one of the most important steps in, text analysis. It is the annotation of words with the right POS tags based on the context in which they appear. POS taggers categorize words according to their role in a sentence and the order in which they appear, and so provide information about how each word is being used. POS taggers use some...
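As a rough illustration of tagging by context, the following Python sketch tags the ambiguous word "book" differently depending on the preceding tag. Everything here is an assumption for the example: the tiny `LEXICON`, the tag names, and the two hand-written rules. Real taggers learn such preferences from annotated corpora rather than hard-coding them.

```python
# Hypothetical mini-lexicon: each word maps to the tags it may take.
LEXICON = {
    "i": ["PRON"], "they": ["PRON"],
    "the": ["DET"], "a": ["DET"],
    "read": ["VERB"], "flight": ["NOUN"],
    "book": ["NOUN", "VERB"],  # ambiguous without context
}

def tag(sentence):
    """Tag each word, using the previous tag to resolve ambiguity."""
    tagged = []
    prev = None
    for word in sentence.lower().split():
        options = LEXICON.get(word, ["NOUN"])  # unknown words default to NOUN
        if len(options) == 1:
            choice = options[0]
        elif prev == "DET" and "NOUN" in options:
            choice = "NOUN"   # a determiner is usually followed by a noun
        elif prev == "PRON" and "VERB" in options:
            choice = "VERB"   # a subject pronoun is usually followed by a verb
        else:
            choice = options[0]
        tagged.append((word, choice))
        prev = choice
    return tagged

print(tag("I read the book"))
print(tag("They book a flight"))
```

In the first sentence "book" follows a determiner and is tagged as a noun; in the second it follows a pronoun and is tagged as a verb.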

Hidden Markov Models for POS tagging


Hidden Markov Models (HMMs) are well suited to classification problems over generative sequences. In natural language processing, HMMs can be used for a variety of tasks, such as phrase chunking, parts of speech tagging, and information extraction from documents. If we treat the words as input, any prior information on the input as states, and the estimated conditional probabilities as the output, then POS tagging becomes a typical sequence classification problem that can be solved using an HMM.
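A minimal sketch of how an HMM can assign tags is the Viterbi algorithm, shown below in Python for illustration. It assumes the common formulation in which tags are the hidden states and words are the observations. All probabilities, the three-tag set, and the example sentence are invented for this sketch; in practice they would be estimated from a tagged corpus.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable state (tag) sequence for the given words."""
    # best[t][s]: probability of the best path that ends in state s at position t
    best = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s].get(words[t], 0.0)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy model: every number below is an illustrative assumption.
STATES = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "barks": 0.1, "walk": 0.4},
    "VERB": {"barks": 0.6, "dog": 0.1, "walk": 0.3},
}

tags = viterbi(["the", "dog", "barks"], STATES, START_P, TRANS_P, EMIT_P)
print(tags)  # ['DET', 'NOUN', 'VERB']
```

For longer sentences, multiplying many small probabilities underflows, so a practical version would typically sum log-probabilities instead.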

Basic definitions and notations

According to Rabiner, there are five elements needed to define an HMM:

  • N denotes the number of states (which are hidden) in the model. For parts of speech tagging, N is the number of tags that can be used by the system. Each possible tag for the system corresponds to one state of the HMM. The individual states are denoted by S = {S1, S2, ..., SN}. Let qt denote...

Collocation and contingency tables


When we look into a corpus, some words tend to appear in combination; for example, "I need a strong coffee", "John kicked the bucket", and "He is a heavy smoker". J. R. Firth drew attention to such words, which are not combined randomly into a phrase or sentence. Firth coined the term collocations for such word combinations; the meaning of a word is in part determined by its characteristic collocations. In the field of natural language processing (NLP), the combination of words plays an important role.

Word combinations that are considered collocations can be compound nouns, idiomatic expressions, or combinations that are lexically restricted. This variability in definition is reflected in terms such as multi-word expressions (MWE), multi-word units (MWU), bigrams, and idioms.

Collocations can be observed in corpora and can be quantified. Multi-word expressions have to be stored as units in order to understand their complete meaning. Three characteristic properties emerge...
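One common way to quantify a candidate collocation is to build a 2x2 contingency table of bigram counts and derive an association score from it. The Python sketch below is illustrative only: the function names, the toy token list, and the choice of pointwise mutual information (PMI) as the score are assumptions, not the book's method.

```python
import math

def contingency_table(tokens, w1, w2):
    """Count bigrams into the four cells of a 2x2 table for the pair (w1, w2)."""
    bigrams = list(zip(tokens, tokens[1:]))
    o11 = sum(1 for a, b in bigrams if a == w1 and b == w2)  # w1 then w2
    o12 = sum(1 for a, b in bigrams if a == w1 and b != w2)  # w1 then other
    o21 = sum(1 for a, b in bigrams if a != w1 and b == w2)  # other then w2
    o22 = len(bigrams) - o11 - o12 - o21                     # neither
    return o11, o12, o21, o22

def pmi(o11, o12, o21, o22):
    """Pointwise mutual information: log2 of observed over expected count."""
    n = o11 + o12 + o21 + o22
    expected = (o11 + o12) * (o11 + o21) / n
    if o11 == 0:
        return float("-inf")  # the pair never occurs together
    return math.log2(o11 / expected)

# Toy corpus, invented for this example.
tokens = ("he drinks strong coffee and she drinks strong tea "
          "but strong coffee wins").split()

table = contingency_table(tokens, "strong", "coffee")
score = pmi(*table)
print(table, score)  # (2, 1, 0, 9) 2.0
```

A positive score means the pair occurs more often than it would if the two words were independent. On a corpus this small the number is not meaningful; the sketch only shows how the table cells feed the measure.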

Feature extraction


Feature extraction is an important and valuable step in text mining. A system that can extract features from text has the potential to be used in many applications. The initial step in feature extraction is tagging the document; the tagged document is then processed to extract the meaningful entities that are required.

The elements that can be extracted from the text are:

  • Entities: These are some of the pieces of meaningful information that can be found in the document, for example, location, companies, people, and so on

  • Attributes: These are the features of the extracted entities, for example the title of the person, type of organization, and so on

  • Events: These are the activities in which the entities participate, for example, dates
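As a sketch of the "tag first, then extract" idea, the Python snippet below groups consecutive proper-noun tokens from an already tagged sentence into candidate entities. The hand-tagged input, the function name `extract_entities`, and the use of the `NNP` tag as the entity signal are assumptions for illustration; a real system would obtain the tags from a POS tagger and use a richer entity model.

```python
def extract_entities(tagged_tokens, entity_tag="NNP"):
    """Join runs of consecutive tokens carrying entity_tag into entity strings."""
    entities, current = [], []
    for word, tag in tagged_tokens:
        if tag == entity_tag:
            current.append(word)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the sentence
        entities.append(" ".join(current))
    return entities

# Hand-tagged example sentence (hypothetical data).
tagged = [
    ("John", "NNP"), ("Smith", "NNP"), ("joined", "VBD"),
    ("Acme", "NNP"), ("Corp", "NNP"), ("in", "IN"),
    ("London", "NNP"), (".", "."),
]

entities = extract_entities(tagged)
print(entities)  # ['John Smith', 'Acme Corp', 'London']
```

Attributes and events would need further steps, such as classifying each entity as a person, organization, or location.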

Textual Entailment


Human communication is diverse in terms of the usage of different expressions to communicate the same message. This proves to be quite a challenging problem in natural language processing. Consider, for example...

Summary


In this chapter, we learned about different text categorization and tagging methods, and how words can be grouped into lexical categories or parts of speech to analyze the syntactic structure of a sentence. We also learned about approaches that can be leveraged to build language models that extract concepts or sense from a sentence, using textual entailment.

In subsequent chapters, we are going to learn more about practical approaches to performing real-time text mining tasks.
