
You're reading from Mastering Spark for Data Science

Product type Book

Published in Mar 2017

Publisher Packt

ISBN-13 9781785882142

Pages 560 pages

Edition 1st Edition

Languages

Concepts

Data Science

Authors (4):

Andrew Morgan

Antoine Amend

Matthew Hallett

David George


Table of Contents (22) Chapters

Mastering Spark for Data Science

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

The Big Data Science Ecosystem

Data Acquisition

Input Formats and Schema

Exploratory Data Analysis

Spark for Geographic Analysis

Scraping Link-Based External Data

Building Communities

Building a Recommendation System

News Dictionary and Real-Time Tagging System

Story De-duplication and Mutation

Anomaly Detection on Sentiment Analysis

TrendCalculus

Secure Data

Scalable Algorithms

Chapter 10. Story De-duplication and Mutation

How large is the World Wide Web? Although it is impossible to know its exact size (not to mention the Deep and Dark Web), it was estimated to hold more than a trillion pages in 2008, which, in the data era, is practically the Middle Ages. Almost a decade later, it is safe to assume that the Internet's collective brain has more neurons than the gray matter between our ears. But out of these trillion-plus URLs, how many web pages are truly identical, similar, or covering the same topic?

In this chapter, we will de-duplicate and index the GDELT database into stories. Then, we will track stories over time and examine the links between them, how they may mutate, and whether they could lead to any subsequent event in the near future.

We will cover the following topics:

Understand the concept of Simhash to detect near duplicates
Build an online de-duplication API
Build vectors using TF-IDF and reduce dimensionality using Random Indexing
Build...

Detecting near duplicates

While this chapter is about grouping articles into stories, this first section is all about detecting near duplicates. Before delving into the de-duplication algorithm itself, it is worth introducing the notions of story and de-duplication in the context of news articles. Given two distinct articles (by distinct, we mean two different URLs), we may observe the following scenarios:

The URL of article 1 actually redirects to article 2, or is an extension of the URL provided in article 2 (some additional URL parameters, for instance, or a shortened URL). Both articles have the same content and are considered true duplicates, although their URLs are different.
Both article 1 and article 2 cover the exact same event, but could have been written by two different publishers. They share a lot of content, but are not strictly identical. Based on certain rules explained hereafter, they might be considered near-duplicates.
Both article 1 and article 2 are covering the...
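
To make the idea of a near-duplicate fingerprint concrete, here is a minimal sketch of a generic Simhash in plain Python. It is not the implementation used in this chapter: the word-pair shingles, the MD5-based `hash32` helper, and the 32-bit width are all assumptions chosen for illustration.

```python
import hashlib
import re

BITS = 32  # assumed fingerprint width for this sketch


def shingles(text, size=2):
    """Split text into lowercase word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def hash32(token):
    """Stable 32-bit hash of a token (first 4 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:4], "big")


def simhash(text):
    """Let every shingle vote on every bit: +1 if the bit is set, -1 otherwise."""
    weights = [0] * BITS
    for shingle in shingles(text):
        h = hash32(shingle)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps a 1 wherever the vote is strictly positive.
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts that share most of their shingles tend to produce fingerprints separated by a small Hamming distance, while identical texts always yield a distance of 0. The threshold below which two articles count as near-duplicates is a tuning decision that this sketch does not settle.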

Building stories

Simhash should be used to detect near-duplicate articles only. Extending our search to a 3-bit or 4-bit difference becomes terribly inefficient (detecting up to a 3-bit difference requires 5,488 distinct queries to Cassandra, while detecting up to a 4-bit difference requires 41,448) and seems to bring much more noise than related articles. Should the user want to build larger stories, a typical clustering technique must be applied instead.
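
The query counts quoted above can be reproduced with a little combinatorics. The sketch below assumes a 32-bit fingerprint and one query per candidate fingerprint; this excerpt does not state the width, but 32 bits is the value that yields exactly these figures. The function names are hypothetical.

```python
from itertools import combinations
from math import comb

BITS = 32  # assumed fingerprint width; it reproduces the figures quoted above


def query_count(max_diff, bits=BITS):
    """Number of distinct fingerprints that differ by 1..max_diff bits."""
    return sum(comb(bits, k) for k in range(1, max_diff + 1))


def neighbours(fingerprint, max_diff, bits=BITS):
    """Yield every fingerprint that differs from the input by 1..max_diff bits."""
    for k in range(1, max_diff + 1):
        for positions in combinations(range(bits), k):
            mask = 0
            for p in positions:
                mask |= 1 << p
            yield fingerprint ^ mask


print(query_count(3))  # 32 + 496 + 4960 = 5488
print(query_count(4))  # 5488 + 35960 = 41448
```

Each extra bit of tolerance multiplies the number of lookups several times over, which is why the search is kept to very small bit differences.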

Building term frequency vectors

We will start grouping events into stories using a KMeans algorithm, taking the articles' word frequencies as input vectors. TF-IDF is a simple, efficient, and proven technique for building vectors out of text content. The basic idea is to compute a word frequency that we normalize using the inverse document frequency across the dataset, hence decreasing the weight of common words (such as stop words) while increasing the weight of words specific to a given document. Its implementation is part of the basics...
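
As an illustration of the weighting just described, here is a minimal TF-IDF sketch in plain Python rather than Spark. The smoothed IDF formula, log((N + 1) / (df + 1)), is one common variant and an assumption here, as are the three toy documents.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # Raw term count multiplied by the smoothed IDF (assumed variant).
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, pre-tokenized toy corpus.
docs = [
    "the phone battery caught fire".split(),
    "the phone recall was announced".split(),
    "the election results were announced".split(),
]
vectors = tf_idf(docs)
```

With this corpus, "the" appears in every document and receives a weight of 0, while "battery", which appears in a single document, outweighs "phone", which appears in two.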

Story mutation

We now have enough material to enter the heart of the subject. We were able to detect near-duplicate events and group similar articles within a story. In this section, we will be working in real time (in a Spark Streaming context), listening for news articles, grouping them into stories, and also looking at how these stories may change over time. We appreciate that the number of stories is unknown, as we do not know in advance what events may arise in the coming days. As optimizing KMeans for each batch interval (15 minutes in GDELT) would be neither ideal nor efficient, we decided to treat this constraint not as a limiting factor but as an advantage in the detection of breaking news articles.

The Equilibrium state

If we were to divide the world's news articles into, say, 10 or 15 clusters, and fix that number so that it never changes over time, then training a KMeans model should probably group similar (but not necessarily duplicate) articles into generic stories....
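
To illustrate clustering with a fixed number of clusters, here is a minimal sketch of plain Lloyd's KMeans in Python. It is not the chapter's streaming implementation: the two-dimensional points, the hand-picked initial centroids, and the function names are all hypothetical.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; the number of clusters never changes."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids


# Toy vectors standing in for article vectors (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Because the number of centroids is fixed, any new vector is always assigned to one of the existing clusters through `nearest`, however far from its centroid it may lie.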

Summary

This chapter was really complex, and the story mutation problem could not be easily solved in the time frame allowed for delivering it. However, what we discovered is truly amazing, as it opens up a lot of questions. We did not want to draw any conclusions, though, so we stopped our process right after observing the disturbance caused by the Paris attack and left that discussion open for our readers. Feel free to download our code base and study any breaking news and its potential impact on what we define as an Equilibrium state. We are very much looking forward to hearing back from you and learning about your findings and different interpretations.

Surprisingly, we did not know anything about the Galaxy Note 7 fiasco before writing this chapter, and without the API created in the first section, the related articles would surely have been lost in the mass. De-duplicating content using Simhash really helped us get a better overview of world news events.

In the...