You're reading from Mastering Spark for Data Science

Product type: Book
Published in: Mar 2017
Publisher: Packt
ISBN-13: 9781785882142
Edition: 1st Edition
Authors (4):
Andrew Morgan

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has designed systems for some of its most prestigious players and their global clients, often on large, complex, and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist and the inventor of the TrendCalculus algorithm, which was developed as part of his ongoing research project investigating long-range predictions based on machine learning the patterns found in drifting cultural, geopolitical, and economic trends. He also sits on the Hadoop Summit EU data science selection committee and has spoken at many conferences on a variety of data topics. He enjoys participating in the Data Science and Big Data communities in London, where he lives.

Antoine Amend

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book's theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. After graduating in 2008 with an MSc in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data at the early stages of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin, TX. He placed in the top 12 in both the 2014 and 2015 editions (out of more than 2,000 competitors), where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

Matthew Hallett

Matthew Hallett is a Software Engineer and Computer Scientist with over 15 years of industry experience. He is an expert object-oriented programmer and systems engineer with extensive knowledge of low-level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission-critical environments comprising multi-thousand-node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

David George

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity. Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.

Chapter 7. Building Communities

With more and more people interacting, communicating, exchanging information, or simply sharing a common interest in different topics, most data science use cases can be addressed using graph representations. Although very large graphs were, for a long time, used only by the Internet giants, governments, and national security agencies, it is becoming more commonplace to work with large graphs containing millions of vertices. Hence, the main challenge for a data scientist will not necessarily be to detect communities and find influencers on graphs, but rather to do so in a fully distributed and efficient way in order to overcome the constraint of scale. This chapter progresses through building a graph example, at scale, using the persons we identified with the NLP extraction described in Chapter 6, Scraping Link-Based External Data.

In this chapter, we will cover the following topics:

  • Use Spark to extract content from Elasticsearch, build a Graph of person...

Building a graph of persons


We previously used NLP entity recognition to identify persons in raw HTML text. In this chapter, we move to a lower level by trying to infer relations between these entities and detect the possible communities surrounding them.

Contact chaining

Within the context of news articles, we first need to ask ourselves a fundamental question. What defines a relation between two entities? The most elegant answer would probably be to study words using the Stanford NLP libraries described in Chapter 6, Scraping Link-Based External Data. Given the following input sentence, which is taken from http://www.ibtimes.co.uk/david-bowie-yoko-ono-says-starmans-death-has-left-big-empty-space-1545160:

"Yoko Ono said she and late husband John Lennon shared a close relationship with David Bowie"

We could easily extract the syntactic tree, a structure that linguists use to model how sentences are grammatically built and where each element is reported with its type such as a noun...

Using the Accumulo database


We have seen a method to read our personRdd object from Elasticsearch, and this forms a simple and neat solution for our storage requirements. However, when writing commercial applications, we must always be mindful of security and, at the time of writing, Elasticsearch security is still in development, so it would be useful at this stage to introduce a storage mechanism with native security. This is an important consideration even though we are using GDELT data that is, of course, open source by definition. In a commercial environment, it is very common for datasets to be confidential or commercially sensitive in some way, and clients will often request details of how their data will be secured long before they discuss the data science aspect itself. It is the authors' experience that many a commercial opportunity is lost due to the inability of solution providers to demonstrate a robust and secure data architecture.

Accumulo (http://accumulo.apache.org) is a NoSQL database...

Community detection algorithm


Community detection has become a popular field of research over the past few decades. Sadly, it did not move as fast as the digital world that a true data scientist lives in, with more and more data collected every second. As a result, most of the proposed solutions are simply not suitable for a big data environment.

Although many algorithms claim to offer a new, scalable way of detecting communities, none of them is actually scalable in the sense of distributed algorithms and parallel computing.

Louvain algorithm

The Louvain algorithm is probably the most popular and widely used algorithm for detecting communities on undirected weighted graphs.

Note

For more information about the Louvain algorithm, refer to the publication: Fast unfolding of communities in large networks. Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre. 2008.

The idea is to start with each vertex being the center of its own community. At each step, we look for community neighbors...
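To make the starting point above concrete, here is a minimal single-machine sketch in plain Python. It is not the distributed implementation this chapter builds, and it is not the authors' code. It assumes the usual definition of modularity for undirected weighted graphs and covers only the local-moving step (each vertex starts in its own community and may join a neighboring one); the later step of collapsing communities into super-vertices is left out. The function names `modularity` and `local_move_pass` and the toy graph are hypothetical, and modularity is recomputed from scratch for every trial move for clarity, whereas practical implementations compute the gain incrementally.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: list of (u, v, weight), each undirected edge listed once.
    community: dict mapping every vertex to its community label.
    """
    total_weight = sum(w for _, _, w in edges)  # usually written m
    internal = defaultdict(float)    # edge weight inside each community
    degree_sum = defaultdict(float)  # sum of weighted degrees per community
    for u, v, w in edges:
        degree_sum[community[u]] += w
        degree_sum[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(
        internal[c] / total_weight - (degree_sum[c] / (2 * total_weight)) ** 2
        for c in degree_sum
    )


def local_move_pass(edges, community):
    """One greedy sweep over the vertices.

    Each vertex tries the communities of its neighbors and moves to the one
    giving the highest modularity, if that is strictly better than staying.
    Returns the updated partition and whether any vertex moved.
    """
    neighbors = defaultdict(set)
    for u, v, _ in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    community = dict(community)  # work on a copy
    moved = False
    for vertex in sorted(neighbors):
        best_label = community[vertex]
        best_q = modularity(edges, community)
        for label in sorted({community[n] for n in neighbors[vertex]}):
            trial = dict(community)
            trial[vertex] = label
            q = modularity(edges, trial)
            if q > best_q + 1e-12:  # tolerance guards against float noise
                best_label, best_q = label, q
        if best_label != community[vertex]:
            community[vertex] = best_label
            moved = True
    return community, moved


# Toy graph (hypothetical): two triangles joined by a single bridge c-d.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]

# Start with each vertex as its own community, then sweep until stable.
community = {v: v for u, w, _ in edges for v in (u, w)}
moved = True
while moved:
    community, moved = local_move_pass(edges, community)

print(community, modularity(edges, community))
```

On this toy graph the sweeps settle on the two triangles as separate communities. Every sweep visits the vertices sequentially and reads the latest partition, which is exactly the kind of dependency that makes a naive port to a distributed setting difficult.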

GDELT dataset


In order to validate our implementation, we use the GDELT dataset we analyzed in the previous chapter. We extracted all of the communities and spent some time looking at the person names to see whether or not our community clustering was consistent. The full picture of the communities is reported in Figure 7 and was produced with the Gephi software, where only the top few thousand connections have been imported:

Figure 7: Community detection on January 12

We first observe that most of the communities we detected align with the ones we could eyeball on a force-directed layout, which gives us good confidence in the accuracy of the algorithm.

The Bowie effect

Any well-defined community has been properly identified, and the less obvious ones are those surrounding highly connected vertices such as David Bowie. The name David Bowie was mentioned so heavily in GDELT articles, alongside so many different persons, that on January 12, 2016, it became too large...

Summary


We have discussed and built a real-world implementation of graph communities leveraging the power of a secure and robust architecture. We have outlined the idea that there is no right or wrong solution in the community detection problem space, as it strongly depends on the use case. In a social network context, for example, where vertices are tightly connected together (an edge represents a true connection between two users), the edge weight does not really matter while the triangle approach probably does. In the telecommunication industry, one could be interested in communities based on how frequently a given user A calls a user B, hence turning to a weighted algorithm such as Louvain.

We appreciate that building this community algorithm was far from an easy task, and perhaps stretches the goals of this book, but it involves all of the techniques of graph processing in Spark that make GraphX a fascinating and extensible tool. We introduced the concepts of message passing...
