Mastering Spark for Data Science


Product type Book
Published in Mar 2017
Publisher Packt
ISBN-13 9781785882142
Pages 560 pages
Edition 1st Edition
Authors (4): Andrew Morgan, Antoine Amend, Matthew Hallett, David George

Table of Contents (22) Chapters

Mastering Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
The Big Data Science Ecosystem
Data Acquisition
Input Formats and Schema
Exploratory Data Analysis
Spark for Geographic Analysis
Scraping Link-Based External Data
Building Communities
Building a Recommendation System
News Dictionary and Real-Time Tagging System
Story De-duplication and Mutation
Anomaly Detection on Sentiment Analysis
TrendCalculus
Secure Data
Scalable Algorithms

Chapter 10. Story De-duplication and Mutation

How large is the World Wide Web? Although it is impossible to know its exact size, not to mention that of the Deep and Dark Web, it was estimated to hold more than a trillion pages in 2008, which, in the data era, was more or less the Middle Ages. Almost a decade later, it is safe to assume that the Internet's collective brain has more neurons than the gray matter between our ears. But out of these trillion-plus URLs, how many web pages are truly identical, similar, or covering the same topic?

In this chapter, we will de-duplicate and index the GDELT database into stories. Then, we will track stories over time and examine the links between them, how they may mutate, and whether they could lead to any subsequent event in the near future.

We will cover the following topics:

  • Understand the concept of Simhash to detect near duplicates

  • Build an online de-duplication API

  • Build vectors using TF-IDF and reduce dimensionality using Random Indexing

  • Build...

Detecting near duplicates


While this chapter is about grouping articles into stories, this first section is all about detecting near duplicates. Before delving into the de-duplication algorithm itself, it is worth introducing the notions of story and de-duplication in the context of news articles. Given two distinct articles (by distinct, we mean two different URLs), we may observe the following scenarios:

  • The URL of article 1 actually redirects to article 2 or is an extension of the URL provided in article 2 (some additional URL parameters, for instance, or a shortened URL). Both articles have the same content and are considered true duplicates even though their URLs differ.

  • Both article 1 and article 2 cover the exact same event, but could have been written by two different publishers. They share a lot of content, but are not strictly identical. Based on certain rules explained hereafter, they might be considered near-duplicates.

  • Both article 1 and article 2 are covering the...
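The near-duplicate scenario above is the case Simhash is meant to catch: two texts that share most of their content should produce fingerprints that differ in only a few bits. The following is a minimal Python sketch of the general Simhash idea, not the implementation used in this chapter. The word-level tokenizer, the MD5-based token hash, the 32-bit width, and the function names are all assumptions made for illustration.

```python
import hashlib
import re


def tokens(text):
    # Lower-case word tokens (assumption: a real pipeline may use shingles).
    return re.findall(r"[a-z0-9]+", text.lower())


def token_hash(token, bits=32):
    # Stable hash of a token, truncated to `bits` bits.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


def simhash(text, bits=32):
    # Every token votes +1 or -1 on each bit position of the fingerprint.
    votes = [0] * bits
    for token in tokens(text):
        h = token_hash(token, bits)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when its position gathered a strictly positive vote.
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")
```

Two articles would then be treated as near-duplicate candidates when `hamming(simhash(a), simhash(b))` falls below a small threshold; the right threshold depends on the data and is not fixed by this sketch.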

Building stories


Simhash should be used to detect near-duplicate articles only. Extending our search to a 3-bit or 4-bit difference becomes terribly inefficient (detecting up to a 3-bit difference requires 5,488 distinct queries to Cassandra, while 41,448 queries are needed to detect up to a 4-bit difference) and seems to bring much more noise than related articles. Should the user want to build larger stories, a conventional clustering technique must be applied instead.
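The query counts quoted above can be reproduced with a short calculation, under the assumption that the fingerprint is 32 bits wide and that one query is issued for every fingerprint differing from the original in 1 to d bits. The excerpt does not state the hash width; 32 bits is inferred here only because it matches both figures. The function name is hypothetical.

```python
from math import comb


def neighbour_queries(bits, max_diff):
    # Count the fingerprints that differ from a given one in 1..max_diff bits.
    return sum(comb(bits, k) for k in range(1, max_diff + 1))


print(neighbour_queries(32, 3))  # 32 + 496 + 4960 = 5488
print(neighbour_queries(32, 4))  # 5488 + 35960 = 41448
```

The count grows roughly by an order of magnitude with each extra bit, which is why wider searches quickly become impractical.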

Building term frequency vectors

We will start grouping events into stories using a KMeans algorithm, taking the articles' word frequencies as input vectors. TF-IDF is a simple, efficient, and proven technique to build vectors out of text content. The basic idea is to compute a word frequency that we normalize using the inverse document frequency across the dataset, hence decreasing the weight of common words (such as stop words) while increasing the weight of words specific to a given document. Its implementation is part of the basics...
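To make the weighting concrete, here is a minimal Python sketch of TF-IDF over already tokenized documents. It uses the plain textbook form, term frequency multiplied by log(N / document frequency), which is an assumption: library implementations commonly apply smoothing, and this is not the code used in the chapter. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    # docs: list of token lists, one list per document.
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Sparse vector: term -> normalized frequency * inverse document frequency.
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [["the", "paris", "attack"], ["the", "galaxy", "note"]]
vectors = tf_idf(docs)
```

In this toy example, "the" appears in every document and receives a weight of zero, while terms found in a single document keep a positive weight.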

Story mutation


We now have enough material to enter the heart of the subject. We were able to detect near-duplicate events and group similar articles within a story. In this section, we will be working in real time (on a Spark Streaming context), listening for news articles, grouping them into stories, and also looking at how these stories may change over time. The number of stories is not known in advance, since we cannot tell which events will arise in the coming days. Because re-optimizing KMeans at every batch interval (15 minutes in GDELT) would be neither ideal nor efficient, we decided to treat this constraint not as a limiting factor but as an advantage in the detection of breaking news articles.

The Equilibrium state

If we were to divide the world's news articles into, say, 10 or 15 clusters, and fix that number so that it never changes over time, then training a KMeans model should group similar (but not necessarily duplicate) articles into generic stories....
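As an illustration of clustering with a fixed number of clusters, the following is a small Python sketch of Lloyd's KMeans iteration on dense vectors. It is a simplified stand-in, not the streaming Spark implementation used in this chapter. Taking the first k points as initial centroids and running a fixed number of iterations are simplifying assumptions, and the function names and sample points are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    # Assumption: deterministic initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = kmeans(points, k=2)
```

With k held constant, every new article is forced into one of the existing clusters, so an unusual event shows up as a shift in the clusters rather than as a new cluster.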

Summary


This chapter was complex, and the story mutation problem could not be easily solved in the time available for writing it. However, what we discovered is striking, as it opens up a lot of questions. We did not want to draw any conclusions, though, so we stopped our process right after observing the Paris attack disturbance and left that discussion open for our readers. Feel free to download our code base and study any breaking news and its potential impact on what we define as an Equilibrium state. We are very much looking forward to hearing back from you and learning about your findings and different interpretations.

Surprisingly, we did not know anything about the Galaxy Note 7 fiasco before writing this chapter, and without the API created in the first section, the related articles would surely have been lost in the mass. De-duplicating content using Simhash really helped us get a better overview of world news events.

In the...
