You're reading from Mastering Text Mining with R

Product type: Book
Published in: Dec 2016
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783551811
Edition: 1st Edition
Author (1)
KUMAR ASHISH

Ashish Kumar is a seasoned data science professional, a published author and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. Natural Language Processing, IoT Analytics, R Shiny product development, Ensemble ML methods etc. are his core areas of expertise. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

Chapter 7. Entity Recognition

Extracting information from unstructured text is a tedious process because of the complex nature of natural language. Despite advances in the field of natural language processing (NLP), we are still far from the point where any unrestricted text can be analyzed and its meaning extracted for general purposes. However, if we focus on a specific set of questions, we can extract a significant amount of information from text data. Named entity recognition (NER) identifies the important entities in a text, which helps derive meaning from unstructured data. It is a vital component of NLP applications such as question-answering systems and product discovery on e-commerce websites.

In this chapter, we will cover the following topics:

  • Entity extraction

  • Coreference and relationship extraction

  • Sentence boundary detection

  • Named entity recognition

Entity extraction


The process of extracting information from unstructured documents is called information extraction. Most of the data produced on the internet today is semi-structured or unstructured, and it is mostly in a human-readable form that we call natural language, so natural language processing usually comes into play during information extraction. Named entity recognition (NER) is a vital subprocess in the information extraction chain; it is sometimes also called entity extraction or entity chunking. The main job of NER is to extract the rigid designators in a document and classify them into predefined categories. A named entity extractor has a set of predefined categories such as the following:

  • persons

  • organizations

  • locations

  • time

  • money

  • percentages

  • dates
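
To make the idea of predefined categories concrete, here is a minimal sketch in base R. It is not the statistical approach used later in the chapter: it swaps in plain regular expressions, covers only two of the categories above (money and percentages), and the sample sentence, the patterns, and the helper name extract_entities are all hypothetical.

```r
# Illustrative sample text (not from the book).
text <- "Oracle paid $500 for the licence and gave a 20% discount."

# One hand-written pattern per category; real NER models learn these
# distinctions from annotated data instead.
patterns <- c(
  money      = "\\$[0-9]+(\\.[0-9]+)?",
  percentage = "[0-9]+(\\.[0-9]+)?%"
)

# Return, for each category, every substring that matches its pattern.
extract_entities <- function(text, patterns) {
  lapply(patterns, function(p) regmatches(text, gregexpr(p, text))[[1]])
}

found <- extract_entities(text, patterns)
print(found)
```

Patterns like these work for rigidly formatted entities such as amounts, but they do not generalize to persons, organizations, or locations, which is why trained models are normally used for those categories.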

Given an unstructured document, NER will annotate the block or extract the relevant features. Consider...

Sentence boundary detection


Sentence boundary detection (SBD) is an important step in NLP and an essential problem to solve before the text is analyzed further for information extraction, word tokenization, part-of-speech tagging, and so on. A sentence is a basic unit of text. Though SBD has been solved to a good extent, extracting sentences from a text document is not a simple process. Sentence boundary detection is language dependent, since the sentence termination character may differ from language to language. It can be done either with a machine learning approach, by training a model, or with a rule-based approach. For the English language, a simple set of rules that gives fairly accurate results is:

  • Text is terminated by a period ( . )

  • Text is terminated by an exclamation mark ( ! )

  • Text is terminated by a question mark ( ? )

Consider the following example:

NLP is a vast topic. Lots of research has been done in this field.

When we apply the preceding set of rules, we can extract...
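
The three rules above can be sketched in base R as follows. This is a minimal illustration, assuming that a terminator counts as a sentence boundary only when it is followed by whitespace; the function name split_sentences is hypothetical and not part of any package.

```r
# Split text into sentences wherever '.', '!' or '?' is followed by whitespace.
split_sentences <- function(text) {
  # The lookbehind keeps the terminator attached to the sentence it ends;
  # only the whitespace after it is consumed by the split.
  parts <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  trimws(parts[nzchar(parts)])
}

sentences <- split_sentences(
  "NLP is a vast topic. Lots of research has been done in this field."
)
print(sentences)
```

A purely rule-based splitter like this one breaks on abbreviations such as "Dr." or "e.g.", which is one reason trained sentence detectors are often preferred.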

Named entity recognition


Named entity recognition is a subprocess in the natural language processing pipeline. It identifies the names and numbers in the input document. The names can be names of persons, companies, or locations; the numbers can be money or percentages, to name a few. To perform named entity recognition, we will use the Apache OpenNLP TokenNameFinderModel API. To invoke the code from the R environment, we will use the openNLP R package:

  1. Load the required libraries:

    library(rJava)
    library(NLP)
    library(openNLP)
  2. Create a sample text; we will extract the entities from this text:

    txt <- " IBM is an MNC with headquarters in New York. Oracle is a cloud company in California. James works in IBM. Oracle hired John for cloud expertise. They give 100% to their profession"
  3. We will convert it to a String object for processing:

    txt_str <- as.String(txt)
  4. We will process the text through the MaxEnt sentence token annotator and the MaxEnt word token annotator, both available in R packages and...
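
The annotation pipeline outlined in these steps can be sketched as follows. This is a hedged sketch rather than the book's own listing: it assumes Java, rJava, and the NLP and openNLP packages are installed, and the entity step additionally assumes that the English model package openNLPmodels.en is available. The shortened sample text and the variable names are illustrative.

```r
library(NLP)
library(openNLP)

# Shortened, illustrative sample text.
txt_str <- as.String(
  "IBM is an MNC with headquarters in New York. James works in IBM."
)

# Sentence and word annotations must exist before entity annotation.
sent_annotator <- Maxent_Sent_Token_Annotator()
word_annotator <- Maxent_Word_Token_Annotator()
annotations <- annotate(txt_str, list(sent_annotator, word_annotator))

# Assumption: the entity models ship separately in openNLPmodels.en,
# so this step is skipped when that package is not installed.
if (requireNamespace("openNLPmodels.en", quietly = TRUE)) {
  org_annotator <- Maxent_Entity_Annotator(kind = "organization")
  annotations <- annotate(txt_str, org_annotator, annotations)

  # Keep only the entity annotations and map them back to text spans.
  entities <- subset(annotations, type == "entity")
  print(txt_str[entities])
}
```

Other values of kind, such as "person" or "location", select different pre-trained models, so one entity annotator is typically created per category of interest.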

Summary


In this chapter, we learnt about entity extraction and recognition using the R implementation of OpenNLP. We also learnt to use other functionality that is present in Apache OpenNLP but not yet implemented in the openNLP package in R.

In the next chapters, we will learn about some real-life applications of the techniques we have learnt so far, on social media data.
