Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Scala and Spark for Big Data Analytics

You're reading from  Scala and Spark for Big Data Analytics

Product type Book
Published in Jul 2017
Publisher Packt
ISBN-13 9781785280849
Pages 796 pages
Edition 1st Edition
Languages
Concepts
Authors (2):
Md. Rezaul Karim Md. Rezaul Karim
Profile icon Md. Rezaul Karim
Sridhar Alla Sridhar Alla
Profile icon Sridhar Alla
View More author details

Table of Contents (19) Chapters

Preface Introduction to Scala Object-Oriented Scala Functional Programming Concepts Collection APIs Tackle Big Data – Spark Comes to the Party Start Working with Spark – REPL and RDDs Special RDD Operations Introduce a Little Structure - Spark SQL Stream Me Up, Scotty - Spark Streaming Everything is Connected - GraphX Learning Machine Learning - Spark MLlib and Spark ML My Name is Bayes, Naive Bayes Time to Put Some Order - Cluster Your Data with Spark MLlib Text Analytics Using Spark ML Spark Tuning Time to Go to ClusterLand - Deploying Spark on a Cluster Testing and Debugging Spark PySpark and SparkR

Everything is Connected - GraphX

"Technology made large populations possible; large populations now make technology indispensable."

- Joseph Wood Krutch

In this chapter, we'll learn how many real-world problems can be modeled (and resolved) using graphs. We see that Apache Spark comes with its own graph library, and what you learned about RDDs can be used here too (this time as vertex and edge RDDs).
In a nutshell, the following topics will be covered in this chapter:

  • A brief introduction to graph theory
  • GraphX
  • VertexRDD and EdgeRDD
  • Graph operators
  • Pregel API
  • PageRank

A brief introduction to graph theory

To better understand graphs, let's look at Facebook and how you typically use Facebook. Every day you use your smart phone to post messages on your friend's wall or update your status. Your friends are all posting messages and photos and videos of their own.

You have friends, your friends have friends, who have friends, and so on. Facebook has settings that let you make new friends or remove friends from your friend list. Facebook also has permissions, which allow granular control on who sees what and who can communicate with who.

Now, when you consider that there are a billion Facebook users, the friends and friend's friends list for all users gets quite large and complicated. It is hard to even comprehend and manage all the different relationships or friendships.

So, if someone wants to find out if you and another person X...

GraphX

As shown in the preceding section, we can model many real-life use cases as Graphs with a set of vertices and a set of edges linking the vertices. We also wrote simple code trying to implement some basic graph operations and queries such as, Is X a friend of Y ? However, as we explored further, the algorithms only get more complicated along with use cases and also the size of graphs is much much larger than can be handled on one machine.

It is not possible to fit one billion Facebook users along with all their friendship relations into one machine or even a few machines.

What we need to do is to look beyond the one machine and few machines thrown together and rather start considering highly scalable architectures to implement the complex graph algorithms, which can handle the volume of data and complex interconnections of the data elements. We have already seen an introduction...

VertexRDD and EdgeRDD

A VertexRDD contains the set of vertices or nodes in a special data structure and an EdgeRDD contains the set of edges or links between the nodes/vertices again in a special data structure. Both the VertexRDD and the EdgeRDD are based on RDDs and the VertexRDD deals with every single node in the graph while the EdgeRDD contains all links between all nodes. In this section, we will look at how to create VertexRDD and EdgeRDD and then use these objects in building a graph.

VertexRDD

As seen earlier, the VertexRDD is an RDD containing the vertices and their associated attributes. Each element in the RDD represents a vertex or node in the graph. In order to maintain the uniqueness of the vertex, we need to...

Graph operators

Let's start with the operations we can directly perform using Graph object, such as filtering the vertices and edges of the graph to filter out based on some attribute of the object. We will also see an example of mapValues(), which can transform the graph to yield a custom RDD.

First, let's examine the vertices and the edges using the Graph object we created in the previous section and then look at some graph operators.

scala> graph.vertices.collect
res111: Array[(org.apache.spark.graphx.VertexId, User)] = Array((4,User(Liz,Doctor)), (6,User(Beth,Accountant)), (8,User(Mary,Cashier)), (10,User(Ken,Librarian)), (2,User(Mark,Doctor)), (1,User(John,Accountant)), (3,User(Sam,Lawyer)), (7,User(Larry,Engineer)), (9,User(Dan,Doctor)), (5,User(Eric,Accountant)))

scala> graph.edges.collect
res112: Array[org.apache.spark.graphx.Edge[String]] = Array(Edge(1,2...

Pregel API

Graphs are inherently recursive data structures as properties of vertices depend on properties of their neighbors, which in turn depend on properties of their own neighbors. As a consequence, many important graph algorithms iteratively recompute the properties of each vertex until a fixed-point condition is reached. A range of graph-parallel abstractions have been proposed to express these iterative algorithms. GraphX exposes a variant of the Pregel API.

At a high level, the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction constrained to the topology of the graph. The Pregel operator executes in a series of steps in which vertices receive the sum of their inbound messages from the previous super step, compute a new value for the vertex property, and then send messages to neighboring vertices in the next super step. Using Pregel, messages...

PageRank

PageRank is one of the most important algorithms in the graph processing space. Originating at Google, the algorithm named after Larry page, the founder of Google, has evolved into many types of use cases based on the concept of ranking vertices or nodes based on relationships or edges.

Google PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. For more information, you can read the description at https://en.wikipedia.org/wiki/PageRank

Using Google PageRank as an example, you can improve the relative importance of a web page on your company website or maybe your blog by promoting the web page among other popular websites and technical blogs. Using this approach, your blog website...

Summary

In this chapter, we have introduced graph theory using Facebook as an example; Apache Spark's graph processing library GraphX, VertexRDD, and EdgeRDDs; graph operators, aggregateMessages, TriangleCounting, and the Pregel API; and use cases such as the PageRank algorithm. We have also seen the traveling salesman problem and connected components and so on. We have seen how the GraphX API can be used to develop graph processing algorithms at scale.

In Chapter 11, Learning Machine Learning - Spark MLlib and ML, we will explore the exciting world of Apache Spark's Machine Learning library.

lock icon The rest of the chapter is locked
You have been reading a chapter from
Scala and Spark for Big Data Analytics
Published in: Jul 2017 Publisher: Packt ISBN-13: 9781785280849
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}