Chapter 3. Categorizing and Tagging Text
In corpus linguistics, categorizing or tagging text into word classes or lexical categories is considered the second step in the NLP pipeline, after tokenization. We all studied parts of speech in elementary school, where we were introduced to nouns, pronouns, verbs, adjectives, and their roles in English grammar. These word classes are not just the salient pillars of grammar; they are also pivotal in many language processing activities. The process of categorizing and labeling words into different parts of speech is known as part-of-speech tagging, or simply tagging.
The goal of this chapter is to equip you with the tools and knowledge needed to apply different tagging, chunking, and entailment approaches in natural language processing.
Earlier chapters focused on basic text processing; this chapter builds on those concepts to explain different approaches to tagging text with lexical categories...
In text mining, we tend to view free text as a bag of tokens (words, n-grams). This approach is quite useful for quantitative analyses, searching, and information retrieval. However, with a bag-of-tokens approach, we lose much of the information contained in the free text, such as sentence structure, word order, and context. These are attributes that humans rely on to interpret the meaning of a text. NLP is a field focused on understanding free text; it attempts to understand a document as completely as a human reader would.
POS tagging is a prerequisite for, and one of the most important steps in, text analysis. POS tagging is the annotation of words with the correct POS tags, based on the context in which they appear: POS taggers categorize words according to what they mean in a sentence and the order in which they appear, and so provide information about the semantic role of each word. POS taggers use some...
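To make the idea concrete, here is a minimal sketch of the simplest possible tagger: a unigram tagger that assigns each word its most frequent tag from a tiny hand-made training set. The corpus, words, and tags below are illustrative assumptions, not the chapter's data; a real tagger would be trained on a large annotated corpus such as the Penn Treebank and would also use context.

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs; purely illustrative.
tagged_corpus = [
    ("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
    ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ"),
    ("a", "DT"), ("dog", "NN"), ("runs", "VBZ"),
]

# Count how often each word carries each tag.
counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def unigram_tag(tokens, default="NN"):
    """Assign each token its most frequent tag; unseen words get the default."""
    return [(t, counts[t].most_common(1)[0][0] if t in counts else default)
            for t in tokens]

print(unigram_tag(["the", "dog", "sleeps"]))
# → [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')]
```

Note that a unigram tagger ignores context entirely; context-sensitive taggers such as the HMM tagger discussed in this chapter improve on exactly this weakness.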
Hidden Markov Models for POS tagging
Hidden Markov Models (HMMs) are well suited to classification problems over generative sequences. In natural language processing, HMMs can be used for a variety of tasks, such as phrase chunking, part-of-speech tagging, and information extraction from documents. If we treat the words as the observed input, the prior information about the input (the tags) as hidden states, and the estimated conditional probabilities as the output, then POS tagging becomes a typical sequence classification problem that can be solved with an HMM.
Basic definitions and notations
According to Rabiner, five elements are needed to define an HMM:
N denotes the number of (hidden) states in the model. For part-of-speech tagging, N is the number of tags that can be used by the system; each possible tag corresponds to one state of the HMM. The possible interconnections of the individual states are denoted by S = {S1, ..., SN}. Let qt denote...
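The definitions above can be turned into working code with the Viterbi algorithm, which finds the most likely hidden tag sequence for an observed word sequence. The states, transition, and emission probabilities below are invented toy numbers, not estimates from any corpus; this is a sketch of the decoding step only, assuming the model parameters are already known.

```python
import math

# Toy HMM: hidden states are POS tags, observations are words.
# All probabilities are illustrative, not estimated from a real corpus.
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NN": {"the": 0.05, "dog": 0.8, "barks": 0.15},
    "VB": {"the": 0.05, "dog": 0.05, "barks": 0.9},
}

def viterbi(words):
    """Return the most likely tag sequence for the observed words."""
    # V[t][s] = (log-prob of best path ending in state s at time t, back-pointer)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][words[0]]), None)
          for s in states}]
    for t in range(1, len(words)):
        V.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: V[t - 1][p][0] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev][0] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][words[t]]), prev)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # → ['DT', 'NN', 'VB']
```

In a real tagger, the transition and emission tables would be estimated from tagged training data, with smoothing for unseen words.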
Collocation and contingency tables
When we look into a corpus, some words tend to appear in combination; for example, I need a strong coffee, John kicked the bucket, He is a heavy smoker. J. R. Firth drew attention to the fact that words are not combined randomly into phrases or sentences. Firth coined the term collocations for such word combinations; the meaning of a word is in part determined by its characteristic collocations. In the field of natural language processing (NLP), such word combinations play an important role.
Word combinations that are considered collocations can be compound nouns, idiomatic expressions, or lexically restricted combinations. This variability in definition is reflected in terms such as multi-word expressions (MWEs), multi-word units (MWUs), bigrams, and idioms.
Collocations can be observed in corpora and can be quantified. Multi-word expressions have to be stored as units in order to understand their complete meaning. Three characteristic properties emerge...
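One standard way to quantify collocations from a contingency table of bigram counts is pointwise mutual information (PMI), which measures how much more often two words co-occur than chance would predict. The sketch below uses an invented toy corpus; the `pmi` helper and its counts are illustrative assumptions, not the chapter's own code.

```python
import math

# Toy corpus; in practice the counts come from a large corpus.
tokens = ("I need a strong coffee because I need a strong start "
          "a strong coffee helps").split()

bigrams = list(zip(tokens, tokens[1:]))
N = len(bigrams)

def pmi(w1, w2):
    """Pointwise mutual information of the bigram (w1, w2)."""
    c12 = bigrams.count((w1, w2))                 # joint count
    c1 = sum(1 for a, _ in bigrams if a == w1)    # w1 as first word
    c2 = sum(1 for _, b in bigrams if b == w2)    # w2 as second word
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# A positive PMI means the pair co-occurs more often than chance predicts.
print(round(pmi("strong", "coffee"), 2))
```

For rare words, raw PMI is unstable, which is why measures such as the log-likelihood ratio or t-score are often preferred in practice.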
Feature extraction is a very important and valuable step in text mining. A system that can extract features from text has the potential to be used in many applications. The first step in feature extraction is tagging the document; the tagged document is then processed to extract the required meaningful entities.
The elements that can be extracted from the text are:
Entities: These are some of the pieces of meaningful information that can be found in the document, for example, location, companies, people, and so on
Attributes: These are the features of the extracted entities, for example the title of the person, type of organization, and so on
Events: These are the activities in which the entities participate, along with associated details such as dates
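A rough illustration of this pipeline can be given with simple regular expressions over a tagged-style input. The sentence and the patterns below are hypothetical; real systems use trained named-entity recognizers rather than regexes, so treat this only as a sketch of what "extracting entities and dates" means.

```python
import re

text = ("John Smith joined Acme Corp in London on 12 March 2021 "
        "as Chief Analyst.")

# Naive patterns, for illustration only: multi-word capitalized
# sequences as candidate entities, and day-month-year strings as dates.
entity_pat = re.compile(r"\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b")
date_pat = re.compile(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b")

print("Entities:", entity_pat.findall(text))
print("Event dates:", date_pat.findall(text))
```

Note how the naive entity pattern misses the single-word location "London"; distinguishing people, organizations, and locations, and attaching attributes such as job titles to them, is exactly what trained extraction systems add.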
Textual entailment
Human communication is diverse in its use of different expressions to communicate the same message. This proves to be quite a challenging problem in natural language processing. Consider, for example...
In this chapter, we learned different text categorization and tagging methods, and how words can be grouped into lexical categories, or parts of speech, to analyze the syntactic structure of a sentence. We also learned about approaches that can be leveraged to build language models that extract concepts, or sense, from a sentence using sentence entailment.
In subsequent chapters, we will learn more about practical approaches to performing real-time text mining tasks.