Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Natural Language Processing with Java and LingPipe Cookbook

You're reading from  Natural Language Processing with Java and LingPipe Cookbook

Product type Book
Published in Nov 2014
Publisher
ISBN-13 9781783284672
Pages 312 pages
Edition 1st Edition
Languages

Table of Contents (14) Chapters

Natural Language Processing with Java and LingPipe Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Simple Classifiers Finding and Working with Words Advanced Classifiers Tagging Words and Tokens Finding Spans in Text – Chunking String Comparison and Clustering Finding Coreference Between Concepts/People Index

Eliminate near duplicates with the Jaccard distance


It often happens that the data has duplicates or near duplicates that should be filtered. Twitter data has lots of duplicates that can be quite frustrating to work with even with the -filter:retweets option available for the search API. A quick way to see this is to sort the text in the spreadsheet, and tweets with common prefixes will be neighbors:

Duplicate tweets that share a prefix

This sort only reveals shared prefixes; there are many more that don't share a prefix. This recipe will allow you to find other sources of overlap and threshold, the point at which duplicates are removed.

How to do it…

Perform the following steps to eliminate near duplicates with the Jaccard distance:

  1. Type in the command prompt:

    java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.DeduplicateCsvData
    
  2. You will be overwhelmed with a torrent of text:

    Tweets too close, proximity 1.00
       @britneyspears do you ever miss the Disney days? and iilysm   please follow me. kiss from Turkey #AskBritneyJean ??
      @britneyspears do you ever miss the Disney days? and iilysm please follow me. kiss from Turkey #AskBritneyJean ??? 
    Tweets too close, proximity 0.50
      Sooo, I want to have a Disney Princess movie night....
      I just want to be a Disney Princess
    
  3. Two example outputs are shown—the first is a near-exact duplicate with only a difference in a final ?. It has a proximity of 1.0; the next example has proximity of 0.50, and the tweets are different but have a good deal of word overlap. Note that the second case does not share a prefix.

How it works…

This recipe jumps a bit ahead of the sequence, using a tokenizer to drive the deduplication process. It is here because the following recipe, for sentiment, really needs deduplicated data to work well. Chapter 2, Finding and Working with Words, covers tokenization in detail.

The source for main() is:

String inputPath = args.length > 0 ? args[0] : "data/disney.csv";
String outputPath = args.length > 1 ? args[1] : "data/disneyDeduped.csv";  
List<String[]> data = Util.readCsvRemoveHeader(new File(inputPath));
System.out.println(data.size());

There is nothing new in the preceding code snippet, but the following code snippet has TokenizerFactory:

TokenizerFactory tokenizerFactory = new RegExTokenizerFactory("\\w+");

Briefly, the tokenizer breaks the text into text sequences defined by matching the regular expression \w+ (the first \ escapes the second one in the preceding code—it is a Java thing). It matches contiguous word characters. The string "Hi, you here??" produces tokens "Hi", "you", and "here". The punctuation is ignored.

Next up, Util.filterJaccard is called with a cutoff of .5, which roughly eliminates tweets that overlap with half their words. Then, the filter data is written to disk:

double cutoff = .5;
List<String[]> dedupedData = Util.filterJaccard(data, tokenizerFactory, cutoff);
System.out.println(dedupedData.size());
Util.writeCsvAddHeader(dedupedData, new File(outputPath));
}

The Util.filterJaccard() method's source is as follows:

public static List<String[]> filterJaccard(List<String[]> texts, TokenizerFactory tokFactory, double cutoff) {
  JaccardDistance jaccardD = new JaccardDistance(tokFactory);

In the preceding snippet, a JaccardDistance class is constructed with a tokenizer factory. The Jaccard distance divides the intersection of tokens from the two strings over the union of tokens from both strings. Look at the Javadoc for more information.

The nested for loops in the following example explore each row with every other row until a higher threshold proximity is found or until all data has been looked at. Do not use this for large datasets because it is the O(n2)algorithm. If no row is above proximity, then the row is added to filteredTexts:

List<String[]> filteredTexts = new ArrayList<String[]>();
for (int i = 0; i < texts.size(); ++i) {
  String targetText = texts.get(i)[TEXT_OFFSET];
  boolean addText = true;
  for (int j = i + 1; j < texts.size(); ++j ) {
    String comparisionText = texts.get(j)[TEXT_OFFSET];
    double proximity = jaccardD.proximity(targetText,comparisionText);
    if (proximity >= cutoff) {
      addText = false;
      System.out.printf(" Tweets too close, proximity %.2f\n", proximity);
      System.out.println("\t" + targetText);
      System.out.println("\t" + comparisionText);
      break;
    }
  }
  if (addText) {
    filteredTexts.add(texts.get(i));
  }
}
return filteredTexts;
}

There are much better ways to efficiently filter the texts at a cost of extra complexity—a simple reverse-word lookup index to compute an initial covering set will be vastly more efficient—search for a shingling text lookup for O(n) to O(n log(n)) approaches.

Setting the threshold can be a bit tricky, but looking a bunch of data should make the appropriate cutoff fairly clear for your needs.

You have been reading a chapter from
Natural Language Processing with Java and LingPipe Cookbook
Published in: Nov 2014 Publisher: ISBN-13: 9781783284672
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}