Natural Language Processing with Java and LingPipe Cookbook

Product type: Book
Published: Nov 2014
ISBN-13: 9781783284672
Pages: 312
Edition: 1st


Understanding precision and recall


The false positive from the preceding recipe is one of four possible outcomes when a classifier's guess is compared with the true category. Two of the outcomes are correct and two are errors. The categories and their interpretations are as follows:

  • For a given category X:

    • True positive: The classifier guessed X, and the true category is X

    • False positive: The classifier guessed X, but the true category is a category that is different from X

    • True negative: The classifier guessed a category that is different from X, and the true category is different from X

    • False negative: The classifier guessed a category different from X, but the true category is X
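As a minimal sketch of these definitions in plain Java, the following tallies the four outcomes for a single category X from a list of guess/truth pairs. The class OutcomeCounts, its method names, and the sample labels are hypothetical and are not part of LingPipe.

```java
/** Hypothetical helper (not part of LingPipe): tallies the four outcomes for one category. */
class OutcomeCounts {
    int truePositive;
    int falsePositive;
    int trueNegative;
    int falseNegative;

    /** Records one item, given the classifier's guess and the true category. */
    void add(String categoryX, String guess, String truth) {
        boolean guessedX = categoryX.equals(guess);
        boolean isX = categoryX.equals(truth);
        if (guessedX && isX) {
            truePositive++;
        } else if (guessedX) {
            falsePositive++;   // guessed X, but the truth is another category
        } else if (isX) {
            falseNegative++;   // missed an item that really was X
        } else {
            trueNegative++;    // neither the guess nor the truth is X
        }
    }

    /** Tallies rows of {guess, truth} relative to categoryX. */
    static OutcomeCounts tally(String categoryX, String[][] guessTruthPairs) {
        OutcomeCounts counts = new OutcomeCounts();
        for (String[] pair : guessTruthPairs) {
            counts.add(categoryX, pair[0], pair[1]);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative data only: each row is {guess, truth}.
        String[][] pairs = {
            {"english", "english"},
            {"english", "japanese"},
            {"japanese", "japanese"},
            {"japanese", "english"},
            {"spanish", "japanese"},
        };
        OutcomeCounts c = tally("english", pairs);
        System.out.println("TP=" + c.truePositive + " FP=" + c.falsePositive
            + " TN=" + c.trueNegative + " FN=" + c.falseNegative);
    }
}
```

Note that the outcomes are always relative to the chosen category: the last row is a wrong guess overall, yet it counts as a true negative for "english" because neither the guess nor the truth is "english".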

With these definitions in hand, we can define some additional common evaluation metrics as follows:

  • Precision for a category X is true positive / (false positive + true positive)

    • The degenerate case is to make a single, very confident guess; if that guess is correct, precision is 100 percent. This minimizes false positives, but recall will be horrible.

  • Recall or sensitivity for a category X is true positive / (false negative + true positive)

    • The degenerate case is to guess all the data as belonging to category X for 100 percent recall. This minimizes false negatives but will have horrible precision.

  • Specificity for a category X is true negative / (true negative + false positive)

    • The degenerate case is to guess that all data is not in category X for 100 percent specificity. This eliminates false positives, but recall drops to zero.
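The three formulas and their degenerate cases can be sketched in plain Java as follows. The class BinaryMetrics and the counts (100 items, 10 of which are truly X) are made up for illustration; returning NaN when a denominator is zero is an assumption of this sketch, not something the text prescribes.

```java
/** Hypothetical helper (not part of LingPipe): metrics computed from outcome counts. */
class BinaryMetrics {

    /** true positive / (false positive + true positive) */
    static double precision(int tp, int fp) {
        return ratio(tp, tp + fp);
    }

    /** true positive / (false negative + true positive) */
    static double recall(int tp, int fn) {
        return ratio(tp, tp + fn);
    }

    /** true negative / (true negative + false positive) */
    static double specificity(int tn, int fp) {
        return ratio(tn, tn + fp);
    }

    // Assumption of this sketch: an empty denominator yields NaN (metric undefined).
    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? Double.NaN : (double) numerator / denominator;
    }

    public static void main(String[] args) {
        // Illustrative setup: 100 items, 10 of which are truly in category X.

        // One confident, correct guess of X: tp=1, fp=0, fn=9, tn=90.
        System.out.println("one guess:   precision=" + precision(1, 0)
            + " recall=" + recall(1, 9));

        // Everything guessed as X: tp=10, fp=90, fn=0, tn=0.
        System.out.println("all X:       precision=" + precision(10, 90)
            + " recall=" + recall(10, 0));

        // Nothing guessed as X: tp=0, fp=0, fn=10, tn=90.
        System.out.println("nothing X:   specificity=" + specificity(90, 0)
            + " recall=" + recall(0, 10));
    }
}
```

In the last case, precision is undefined because no item was guessed as X, which is why the sketch reports NaN for it instead of a number.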

The degenerate cases are provided to make clear what each metric focuses on. There are metrics, such as the F-measure, that balance precision and recall, but even they do not include true negatives, which can be highly informative. See the Javadoc at com.aliasi.classify.PrecisionRecallEvaluation for more details on evaluation.
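To make the point about true negatives concrete, here is a minimal plain-Java sketch of the standard F-measure written directly in terms of counts. The class FMeasureSketch is hypothetical and is not the LingPipe API; the beta parameter is the usual weighting, where beta = 1 gives the harmonic mean of precision and recall, and returning NaN for an empty denominator is an assumption of the sketch.

```java
/** Hypothetical helper (not part of LingPipe): F-measure from outcome counts. */
class FMeasureSketch {

    /**
     * F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), rewritten in counts as
     * (1 + beta^2) * tp / ((1 + beta^2) * tp + beta^2 * fn + fp).
     * True negatives do not appear anywhere in the formula.
     */
    static double fMeasure(double beta, int tp, int fp, int fn) {
        double b2 = beta * beta;
        double denominator = (1 + b2) * tp + b2 * fn + fp;
        // Assumption of this sketch: undefined (NaN) when there is nothing to count.
        return denominator == 0 ? Double.NaN : (1 + b2) * tp / denominator;
    }

    public static void main(String[] args) {
        // Illustrative counts: one confident, correct guess with nine items missed.
        // Precision is 1.0 and recall is 0.1, so F1 is pulled toward the low recall.
        System.out.println("F1 = " + fMeasure(1.0, 1, 0, 9));

        // A balanced case: 8 true positives, 2 false positives, 2 false negatives.
        System.out.println("F1 = " + fMeasure(1.0, 8, 2, 2));
    }
}
```

Because the method never takes a true negative count, two classifiers with very different numbers of correctly rejected items get the same F-measure as long as their true positives, false positives, and false negatives match.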

In our experience, most business needs map to one of the following three scenarios:

  • High precision / high recall: The language ID needs both good coverage and good accuracy; otherwise, many things downstream will go wrong. Fortunately, for distinct languages where a mistake will be costly (such as Japanese versus English or English versus Spanish), the language model (LM) classifiers perform quite well.

  • High precision / usable recall: Most business use cases have this shape. For example, a search engine that automatically rewrites a misspelled query had better not make many mistakes. Changing "Breck Baldwin" to "Brad Baldwin" looks pretty bad, but no one really notices if "Bradd Baldwin" is not corrected.

  • High recall / usable precision: An intelligence analyst looking for a particular needle in a haystack will tolerate a lot of false positives in order to find the intended target. This was an early lesson from our DARPA days.
