Natural Language Processing with Java and LingPipe Cookbook

Product type: Book
Published: Nov 2014
ISBN-13: 9781783284672
Pages: 312
Edition: 1st

Viewing error categories – false positives


The best way to improve classifier performance is to examine the errors and change the system accordingly. Developers and machine-learning practitioners have a bad habit of not looking at errors, particularly as systems mature. To be clear: by the end of a project, the developers responsible for tuning a classifier should be very familiar with the domain being classified, if not expert in it, because they will have looked at so much data while tuning the system. If you cannot do a reasonable job of emulating the classifier you are tuning, then you are not looking at enough data.

This recipe performs the most basic form of error analysis: looking at false positives, which are examples from the training data that the classifier assigned to a category when the correct category was something else.
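The definition of a false positive can be made concrete with a small plain-Java illustration (this is not LingPipe's API; the class, method, and sample inputs are hypothetical stand-ins for the tweets in this recipe):

```java
import java.util.*;

public class FalsePositiveDemo {

    // Returns the inputs the classifier labeled as `category` whose true
    // label was something else -- the false positives for `category`.
    // Assumes `truth` and `guess` are keyed by the same inputs.
    static List<String> falsePositives(String category,
                                       Map<String, String> truth,
                                       Map<String, String> guess) {
        List<String> fps = new ArrayList<>();
        for (Map.Entry<String, String> entry : guess.entrySet()) {
            if (entry.getValue().equals(category)
                    && !truth.get(entry.getKey()).equals(category)) {
                fps.add(entry.getKey());
            }
        }
        return fps;
    }

    public static void main(String[] args) {
        Map<String, String> truth = new LinkedHashMap<>();
        truth.put("ES INSUPERABLE DISNEY", "n");   // truly non-English
        truth.put("let's get tricky", "e");        // truly English
        Map<String, String> guess = new LinkedHashMap<>();
        guess.put("ES INSUPERABLE DISNEY", "e");   // classifier said English: FP for e
        guess.put("let's get tricky", "e");        // correct
        System.out.println(falsePositives("e", truth, guess)); // prints [ES INSUPERABLE DISNEY]
    }
}
```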

How to do it...

Perform the following steps in order to view error categories using false positives:

  1. This recipe extends the previous How to train and evaluate with cross validation recipe by accessing more of what the evaluation class provides. Get a command prompt and type:

    java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.ReportFalsePositivesOverXValidation
    
  2. This will result in:

    Training data is: data/disney_e_n.csv
    reference\response
              \e,n,
             e 10,1,
             n 6,4,
    False Positives for e
    Malisímos los nuevos dibujitos de disney, nickelodeon, cartoon, etc, no me gustannn : n
    @meeelp mas que venha um filhinho mais fofo que o próprio pai, com covinha e amando a Disney kkkkkkkkkkkkkkkkk : n
    @HedyHAMIDI au quartier pas a Disney moi : n
    @greenath_ t'as de la chance d'aller a Disney putain j'y ai jamais été moi. : n
    Prefiro gastar uma baba de dinheiro pra ir pra cancun doq pra Disney por exemplo : n
    ES INSUPERABLE DISNEY !! QUIERO VOLVER:( : n
    False Positives for n
    request now "let's get tricky" by @bellathorne and @ROSHON on @radiodisney!!! just call 1-877-870-5678 or at http://t.co/cbne5yRKhQ!! <3 : e
    
  3. The output starts with a confusion matrix. Then, we see the actual six instances of false positives for e, taken from the lower-left cell of the confusion matrix, labeled with the category that the classifier guessed. Then, we see the false positives for n, of which there is a single example. The true category is appended after a colon (:), which is helpful for classifiers that have more than two categories.

How it works…

This recipe is based on the previous one, but it has its own source in com/lingpipe/cookbook/chapter1/ReportFalsePositivesOverXValidation.java. There are two differences. First, storeInputs is set to true for the evaluator:

boolean storeInputs = true;
BaseClassifierEvaluator<CharSequence> evaluator = new BaseClassifierEvaluator<CharSequence>(null, categories, storeInputs);

Second, a Util method is added to print false positives:

for (String category : categories) {
  Util.printFalsePositives(category, evaluator, corpus);
}

The preceding code works by taking each category of focus in turn (for example, e, the English tweets) and extracting all of its false positives from the classifier evaluator. For the e category, false positives are tweets that are non-English in truth but that the classifier thought were English. The referenced Util method is as follows:

public static <E> void printFalsePositives(String category, BaseClassifierEvaluator<E> evaluator, Corpus<ObjectHandler<Classified<E>>> corpus) throws IOException {
  final Map<E,Classification> truthMap = new HashMap<E,Classification>();
  corpus.visitCorpus(new ObjectHandler<Classified<E>>() {
    @Override
    public void handle(Classified<E> data) {
      truthMap.put(data.getObject(),data.getClassification());
    }
  });

The preceding code takes the corpus containing all the truth data and populates a Map<E,Classification> that allows the truth annotation to be looked up for a given input. Note that if the same input exists in two categories, this method is not robust: it will record only the last example seen. The method continues:

  List<Classified<E>> falsePositives = evaluator.falsePositives(category);
  System.out.println("False Positives for " + category);
  for (Classified<E> classified : falsePositives) {
    E data = classified.getObject();
    Classification truthClassification = truthMap.get(data);
    System.out.println(data + " : " + truthClassification.bestCategory());
  }
}
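The duplicate-input caveat noted above can be avoided by keeping every label seen for an input rather than only the last. Here is a minimal plain-Java sketch of that idea (not LingPipe's API; the class and method names are hypothetical):

```java
import java.util.*;

public class TruthMultimap {
    private final Map<String, List<String>> labels = new HashMap<>();

    // Record every label seen for an input, rather than overwriting
    // earlier ones as a plain Map.put would.
    public void put(String input, String label) {
        labels.computeIfAbsent(input, k -> new ArrayList<>()).add(label);
    }

    public List<String> get(String input) {
        return labels.getOrDefault(input, Collections.emptyList());
    }

    public static void main(String[] args) {
        TruthMultimap truth = new TruthMultimap();
        truth.put("Disney", "e");
        truth.put("Disney", "n"); // same input, second category: both are kept
        System.out.println(truth.get("Disney")); // prints [e, n]
    }
}
```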

The code gets the false positives from the evaluator, then iterates over all of them, looking each one up in the truthMap built in the preceding code, and prints the relevant information. The evaluator also has methods to get false negatives, true positives, and true negatives.
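For a given category, every classification falls into exactly one of those four outcome sets. A plain-Java sketch of that partition (again illustrative, not LingPipe's API; the class and method names are hypothetical):

```java
import java.util.*;

public class OutcomeCounts {

    // For one category, count the four outcomes from parallel lists of
    // reference (true) and response (guessed) labels:
    // returns {truePositives, falsePositives, falseNegatives, trueNegatives}.
    static int[] count(String category, List<String> truth, List<String> guess) {
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < truth.size(); i++) {
            boolean isCategory = truth.get(i).equals(category);
            boolean saidCategory = guess.get(i).equals(category);
            if (isCategory && saidCategory) tp++;        // correctly in category
            else if (!isCategory && saidCategory) fp++;  // wrongly in category
            else if (isCategory && !saidCategory) fn++;  // wrongly out of category
            else tn++;                                   // correctly out of category
        }
        return new int[]{tp, fp, fn, tn};
    }

    public static void main(String[] args) {
        List<String> truth = List.of("e", "n", "n", "e");
        List<String> guess = List.of("e", "e", "n", "n");
        System.out.println(Arrays.toString(count("e", truth, guess))); // prints [1, 1, 1, 1]
    }
}
```

These four counts are exactly what the cells of the confusion matrix in step 2 summarize, and they feed the precision and recall metrics discussed in the next recipe.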

The ability to identify mistakes is crucial to improving performance. This advice might seem obvious, but it is very common for developers to not look at mistakes; they look at system output and make a rough estimate of whether the system is good enough, which does not produce top-performing classifiers.

The next recipe works through more evaluation metrics and their definitions.
