
You're reading from Mastering Data Mining with Python - Find patterns hidden in your data

Product type: Book
Published in: Aug 2016
Reading Level: Intermediate
ISBN-13: 9781785889950
Edition: 1st Edition
Languages: Python
Tools: NLTK, Scikit-learn
Concepts: Data Mining
Author: Megan Squire

Chapter 6. Named Entity Recognition in Text

The next text mining tool we are going to add to our toolbox is actually from the domain of information extraction. When we talk about information extraction, we typically mean text mining techniques that use natural language processing to pull out key pieces of desired information from a large amount of unstructured text. I like to think of information extraction as being like a gold miner's sifting pan. Using these tools, we extract only the good stuff - the gold nuggets - and let the rest of the dirt fall away. In this chapter, the gold nuggets we will be sifting for are called named entities. Given a semi-structured or unstructured body of text, can we locate and extract all the named entities, such as people, places, or organizations, and leave the rest of the text behind?

In this chapter, we will learn:

What named entities are and why they are useful to search for
What the different techniques are for finding named entities, and what the benefits...

Why look for named entities?

Named Entity Recognition (NER) is the act of locating certain people, places, and things in a larger body of text. Finding the specific entities that are being discussed in a text is a critical task for creating better chatbots, for creating better Question Answering (QA) systems, or for helping speech recognition systems do a better job. When I am preparing dinner in my kitchen, if I ask Amazon Echo to tell me about meatloaf, will I get a description of the food, or of Meatloaf the musician? (For those who are wondering, I tried this at home and Echo responded with a description of the musician!)

Note

Named entity recognition should not be confused with the entity matching (EM) tasks we performed in Chapter 3, Entity Matching, earlier in this book. The two tasks are similar in that they both deal with nouns, called entities, but the similarity ends there. While NER tries to locate the entities in text, EM tries to figure out whether two entities are the same thing.

Named entity recognition...

Techniques for named entity recognition

Before we tackle the strategies for named entity recognition, we should differentiate between some similar terms that we will come across when doing this work. Usually, when English speakers first begin to think about named entities, they assume named entities are just proper nouns. What is a proper noun? Proper nouns are typically capitalized in English, and refer to a specific named person, place, or thing. Proper names can include proper nouns as well as noun phrases. Alaska, Barack Obama, January, and The Grateful Dead are all proper names. Are all proper nouns and names capitalized? Not necessarily, as we saw with iPhone and iPad, and also eBay, and the author bell hooks. Are all capitalized nouns proper? No. For example, we write "the Englishman came around for tea", where Englishman is capitalized, yet Englishman is a common noun in English.
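To see why capitalization alone is a poor test for named entities, here is a minimal sketch of a naive detector that flags every capitalized word after the first one in a sentence. The function name and the sample sentence are made up for illustration; this is not code from the book.

```python
import re

def capitalized_tokens(sentence):
    """Return words that start with an uppercase letter, ignoring the first word."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    # Skip the sentence-initial word, which is capitalized whatever its type.
    return [tok for tok in tokens[1:] if tok[0].isupper()]

# Hypothetical sample sentence built from the examples discussed above.
sample = "Yesterday the Englishman bought an iPhone on eBay in Alaska."
print(capitalized_tokens(sample))  # ['Englishman', 'Alaska']
```

The heuristic flags Englishman, which the text above treats as a common noun, and misses iPhone and eBay because they start with a lowercase letter. Only Alaska is handled as we would like.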

NER is considerably more interesting than just recognizing nouns, or proper nouns. In a linguistic sense...

Building and evaluating NER systems

Based on our discussion so far in this chapter, we know that building an NER system will start with the following steps:

Separate our document into sentences.
Separate our sentences into tokens.
Tag each token with a part of speech.
Identify named entities from this tagged token set.
Identify the class of each named entity.
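The five steps above can be sketched with nothing but the standard library. This is a toy illustration of how the stages fit together, not the NLTK-based approach the chapter goes on to use: the "tagger" simply marks capitalized tokens, and the `GAZETTEER` lookup table is a hypothetical stand-in for a trained classifier. All function names are assumptions made for this example.

```python
import re

# Hypothetical lookup table standing in for a trained entity classifier.
GAZETTEER = {
    "Barack Obama": "PERSON",
    "Alaska": "LOCATION",
    "The Grateful Dead": "ORGANIZATION",
}

def split_sentences(text):
    # Step 1: naive split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Step 2: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag(tokens):
    # Step 3: crude stand-in for POS tagging: capitalized tokens get "NNP".
    return [(tok, "NNP" if tok[0].isupper() else "O") for tok in tokens]

def chunk(tagged):
    # Step 4: group runs of consecutive "NNP" tokens into candidate entities.
    candidates, current = [], []
    for tok, label in tagged:
        if label == "NNP":
            current.append(tok)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

def classify(candidates):
    # Step 5: keep only candidates the gazetteer knows, paired with their class.
    return [(c, GAZETTEER[c]) for c in candidates if c in GAZETTEER]

def find_entities(text):
    entities = []
    for sentence in split_sentences(text):
        entities.extend(classify(chunk(tag(tokenize(sentence)))))
    return entities

text = "Barack Obama visited Alaska. He listened to The Grateful Dead."
print(find_entities(text))
```

In the second sentence, step 4 proposes both "He" and "The Grateful Dead" as candidates. The sentence-initial "He" is an impostor and is dropped at step 5 only because the gazetteer does not list it, which hints at why hand-written rules and fixed lists do not scale well.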

To help us correctly find tokens at step 2, separate the real named entities from the impostors at step 4, and to ensure that the entities are placed into the correct class at step 5, it is common to leverage a machine learning approach, similar to what NLTK and its sentiment mining functions did for us in Chapter 5, Sentiment Analysis in Text. Relying on a large set of pre-classified examples will help us work out some of the more complicated issues we introduced above for recognizing named entities, for example, choosing the correct boundary in multi-word noun phrases, or recognizing novel approaches to capitalization, or knowing what kind...

Named entity recognition project

In this set of small projects, we will try our NER techniques on a variety of different types of text that we have seen already in prior chapters, as well as some new text. For variety, we will look for named entities in e-mail texts, board meeting minutes, IRC chat dialogue, and human-created summaries of IRC chat dialogue. With these different types of data sources, we will be able to see how writing style and content both affect the accuracy of the NER system.

A simple NER tool

Our first step is to write a simple named entity recognition program that will allow us to find and extract named entities from a text sample. We will take this program and point it at several different text samples in turn. The code and text files for this project are all available on the GitHub site for this book, at https://github.com/megansquire/masteringDM/tree/master/ch6.

The code we will write is a short Python program that uses the same NLTK library we introduced in Chapter 3...

Summary

In this chapter, we learned about the task of Named Entity Recognition (NER) and how that works in practice. We reviewed the characteristics of a named entity, and compared many strategies for finding named entities in text and classifying found entities into their correct type. We implemented a simple NER program using NLTK and used it to detect named entities in four different types of technical communication: chat, chat summaries, e-mails, and meeting minutes. We calculated the accuracy of our NER program using precision, recall, and the F1-measure against each of these text samples, and learned how the characteristics of the text sample will affect the accuracy of the program.
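As a reminder of how the three scores relate, here is a minimal sketch that compares a set of predicted entities against a hand-labeled set. It assumes exact string matching and ignores entity position and class; the scoring procedure used in the chapter's project is not shown in this excerpt, and the sample entity sets are invented for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted entity strings against hand-labeled (gold) ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what fraction of the predictions were correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the real entities were found?
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example: two of three predictions are right, two of four entities found.
predicted = {"Alaska", "Barack Obama", "Englishman"}
gold = {"Alaska", "Barack Obama", "The Grateful Dead", "January"}
p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here precision is 2/3, recall is 1/2, and F1 works out to 4/7. The false positive (Englishman) lowers precision, while the two missed entities lower recall.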

One of the outcomes of this chapter was to demonstrate that text that is written in plain language with fewer technical terms will be easier to mine for named entities than very technical language with a lot of code snippets, function names, acronyms, and the like. We noticed that we got the best results...


Author (1)

Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.