You're reading from  Mastering Data Mining with Python - Find patterns hidden in your data

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785889950
Edition: 1st Edition
Author (1)
Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

Chapter 6. Named Entity Recognition in Text

The next text mining tool we are going to add to our toolbox is actually from the domain of information extraction. When we talk about information extraction, we typically mean text mining techniques that use natural language processing to pull out key pieces of desired information from a large amount of unstructured text. I like to think of information extraction as being like a gold miner's sifting pan. Using these tools, we extract only the good stuff - the gold nuggets - and let the rest of the dirt fall away. In this chapter, the gold nuggets we will be sifting for are called named entities. Given a semi-structured or unstructured body of text, can we locate and extract all the named entities, such as people, places, or organizations, and leave the rest of the text behind?

In this chapter, we will learn:

  • What named entities are and why they are useful to search for

  • What the different techniques are for finding named entities, and what the benefits...

Why look for named entities?


Named Entity Recognition (NER) is the act of locating certain people, places, and things in a larger body of text. Finding the specific entities that are being discussed in a text is a critical task for creating better chatbots, for creating better Question Answering (QA) systems, or for helping speech recognition systems do a better job. When I am preparing dinner in my kitchen, if I ask Amazon Echo to tell me about meatloaf, will I get a description of the food, or of Meatloaf the musician? (For those who are wondering, I tried this at home and Echo responded with a description of the musician!)

Note

Named entity recognition should not be confused with the tasks we performed in Chapter 3, Entity Matching, earlier in this book. The two tasks are similar in that they both deal with nouns, called entities, but the comparison ends there. While NER tries to locate the entities in text, entity matching (EM) tries to figure out whether two entities are the same thing.

Named entity recognition...

Techniques for named entity recognition


Before we tackle the strategies for named entity recognition, we should differentiate between some similar terms that we will come across when doing this work. Usually, when English speakers first begin to think about named entities, they assume named entities are just proper nouns. What is a proper noun? Proper nouns are typically capitalized in English, and refer to a specific named person, place, or thing. Proper names can include proper nouns as well as noun phrases. Alaska, Barack Obama, January, and The Grateful Dead are all proper names. Are all proper nouns and names capitalized? Not necessarily, as we saw with iPhone and iPad, and also eBay, and the author bell hooks. Are all capitalized nouns proper? No. For example, we write "the Englishman came around for tea," where "Englishman" is capitalized, yet "Englishman" is a common noun in English.
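To see why capitalization alone is a weak signal, here is a minimal sketch of a naive "capitalized word" detector. The regular expression, the function name, and the sample sentence are all assumptions made for illustration; they are not taken from the book's code.

```python
import re

# Naive heuristic (illustration only): a word that starts with one uppercase
# letter followed by lowercase letters is assumed to be a proper noun.
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def capitalized_words(text):
    """Return every word in text that matches the naive capitalization rule."""
    return CAPITALIZED.findall(text)


# Hypothetical sample sentence built from the examples in this section.
sample = "The Englishman asked bell hooks about eBay and Alaska."
print(capitalized_words(sample))  # ['The', 'Englishman', 'Alaska']
```

The heuristic picks up the sentence-initial "The" and the common noun "Englishman", while it misses "bell hooks" and "eBay" entirely. Only "Alaska" is handled as we would hope.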

NER is considerably more interesting than just recognizing nouns, or proper nouns. In a linguistic sense...

Building and evaluating NER systems


Based on our discussion so far in this chapter, we know that building an NER system will start with the following steps:

  1. Separate our document into sentences.

  2. Separate our sentences into tokens.

  3. Tag each token with a part of speech.

  4. Identify named entities from this tagged token set.

  5. Identify the class of each named entity.
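The five steps above can be sketched with NLTK's standard functions. This is a minimal sketch, not necessarily the program developed later in the chapter: the two function names are hypothetical, and it assumes NLTK is installed along with its tokenizer, tagger, and chunker models (fetched with nltk.download; the exact resource names vary between NLTK versions).

```python
def entities_from_tree(tree):
    """Steps 4 and 5: collect (entity text, class) pairs from one chunked sentence.

    In the tree returned by nltk.ne_chunk, a named entity is a subtree with a
    label such as PERSON or GPE; every other token is a plain (word, tag) tuple.
    """
    found = []
    for node in tree:
        if hasattr(node, "label"):  # a subtree, so a candidate named entity
            words = [word for word, tag in node.leaves()]
            found.append((" ".join(words), node.label()))
    return found


def find_named_entities(text):
    """Run all five steps on raw text. Requires NLTK and its models."""
    import nltk  # imported here so the helper above can be used without NLTK

    found = []
    for sentence in nltk.sent_tokenize(text):    # step 1: split into sentences
        tokens = nltk.word_tokenize(sentence)    # step 2: split into tokens
        tagged = nltk.pos_tag(tokens)            # step 3: part-of-speech tags
        tree = nltk.ne_chunk(tagged)             # steps 4 and 5: chunk and classify
        found.extend(entities_from_tree(tree))
    return found
```

Calling find_named_entities on a passage of text returns a list of (entity, class) pairs. The exact entities found will depend on the NLTK version and the models installed.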

To help us correctly find tokens at step 2, separate the real named entities from the impostors at step 4, and to ensure that the entities are placed into the correct class at step 5, it is common to leverage a machine learning approach, similar to what NLTK and its sentiment mining functions did for us in Chapter 5, Sentiment Analysis in Text. Relying on a large set of pre-classified examples will help us work out some of the more complicated issues we introduced above for recognizing named entities, for example, choosing the correct boundary in multi-word noun phrases, or recognizing novel approaches to capitalization, or knowing what kind...

Named entity recognition project


In this set of small projects, we will try our NER techniques on a variety of different types of text that we have seen already in prior chapters, as well as some new text. For variety, we will look for named entities in e-mail texts, board meeting minutes, IRC chat dialogue, and human-created summaries of IRC chat dialogue. With these different types of data sources, we will be able to see how writing style and content both affect the accuracy of the NER system.

A simple NER tool

Our first step is to write a simple named entity recognition program that will allow us to find and extract named entities from a text sample. We will take this program and point it at several different text samples in turn. The code and text files for this project are all available on the GitHub site for this book, at https://github.com/megansquire/masteringDM/tree/master/ch6.

The code we will write is a short Python program that uses the same NLTK library we introduced in Chapter 3...

Summary


In this chapter, we learned about the task of Named Entity Recognition (NER) and how that works in practice. We reviewed the characteristics of a named entity, and compared many strategies for finding named entities in text and classifying found entities into their correct type. We implemented a simple NER program using NLTK and used it to detect named entities in four different types of technical communication: chat, chat summaries, e-mails, and meeting minutes. We calculated the accuracy of our NER program using precision, recall, and the F1-measure against each of these text samples, and learned how the characteristics of the text sample will affect the accuracy of the program.
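As a reminder of how those three measures relate, here is a minimal sketch that scores a set of predicted entities against a hand-labeled set. It assumes an entity counts as correct only when both its text and its class match exactly, which may differ from how matches are counted in the chapter. The function name and the sample data are made up for illustration.

```python
def precision_recall_f1(predicted, actual):
    """Score predicted entities against hand-labeled ones (exact matches only)."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)

    # Precision: what fraction of the entities we reported are correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the real entities did we find?
    recall = true_positives / len(actual) if actual else 0.0
    # F1: the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical results: 4 entities reported, 5 really present, 3 in common.
predicted = [("Alaska", "GPE"), ("Barack Obama", "PERSON"),
             ("Englishman", "PERSON"), ("Amazon", "ORGANIZATION")]
actual = [("Alaska", "GPE"), ("Barack Obama", "PERSON"),
          ("Amazon", "ORGANIZATION"), ("eBay", "ORGANIZATION"),
          ("bell hooks", "PERSON")]

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With this made-up data, precision is 3/4, recall is 3/5, and the F1-measure is their harmonic mean, 2/3.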

One of the outcomes of this chapter was to demonstrate that text that is written in plain language with fewer technical terms will be easier to mine for named entities than very technical language with a lot of code snippets, function names, acronyms, and the like. We noticed that we got the best results...
