Everything is Connected - GraphX

"Technology made large populations possible; large populations now make technology indispensable."

- Joseph Wood Krutch

In this chapter, we'll learn how many real-world problems can be modeled (and resolved) using graphs. We see that Apache Spark comes with its own graph library, and what you learned about RDDs can be used here too (this time as vertex and edge RDDs).
In a nutshell, the following topics will be covered in this chapter:

A brief introduction to graph theory
GraphX
VertexRDD and EdgeRDD
Graph operators
Pregel API
PageRank

A brief introduction to graph theory

To better understand graphs, let's look at Facebook and how you typically use Facebook. Every day you use your smart phone to post messages on your friend's wall or update your status. Your friends are all posting messages and photos and videos of their own.

You have friends, your friends have friends, who have friends, and so on. Facebook has settings that let you make new friends or remove friends from your friend list. Facebook also has permissions, which allow granular control on who sees what and who can communicate with who.

Now, when you consider that there are a billion Facebook users, the friends and friend's friends list for all users gets quite large and complicated. It is hard to even comprehend and manage all the different relationships or friendships.

So, if someone wants to find out if you and another person X...

GraphX

As shown in the preceding section, we can model many real-life use cases as Graphs with a set of vertices and a set of edges linking the vertices. We also wrote simple code trying to implement some basic graph operations and queries such as, Is X a friend of Y ? However, as we explored further, the algorithms only get more complicated along with use cases and also the size of graphs is much much larger than can be handled on one machine.

It is not possible to fit one billion Facebook users along with all their friendship relations into one machine or even a few machines.

What we need to do is to look beyond the one machine and few machines thrown together and rather start considering highly scalable architectures to implement the complex graph algorithms, which can handle the volume of data and complex interconnections of the data elements. We have already seen an introduction...

VertexRDD and EdgeRDD

A VertexRDD contains the set of vertices or nodes in a special data structure and an EdgeRDD contains the set of edges or links between the nodes/vertices again in a special data structure. Both the VertexRDD and the EdgeRDD are based on RDDs and the VertexRDD deals with every single node in the graph while the EdgeRDD contains all links between all nodes. In this section, we will look at how to create VertexRDD and EdgeRDD and then use these objects in building a graph.

VertexRDD

As seen earlier, the VertexRDD is an RDD containing the vertices and their associated attributes. Each element in the RDD represents a vertex or node in the graph. In order to maintain the uniqueness of the vertex, we need to...

Graph operators

Let's start with the operations we can directly perform using Graph object, such as filtering the vertices and edges of the graph to filter out based on some attribute of the object. We will also see an example of mapValues(), which can transform the graph to yield a custom RDD.

First, let's examine the vertices and the edges using the Graph object we created in the previous section and then look at some graph operators.

scala> graph.vertices.collect
res111: Array[(org.apache.spark.graphx.VertexId, User)] = Array((4,User(Liz,Doctor)), (6,User(Beth,Accountant)), (8,User(Mary,Cashier)), (10,User(Ken,Librarian)), (2,User(Mark,Doctor)), (1,User(John,Accountant)), (3,User(Sam,Lawyer)), (7,User(Larry,Engineer)), (9,User(Dan,Doctor)), (5,User(Eric,Accountant)))

scala> graph.edges.collect
res112: Array[org.apache.spark.graphx.Edge[String]] = Array(Edge(1,2...

Pregel API

Graphs are inherently recursive data structures as properties of vertices depend on properties of their neighbors, which in turn depend on properties of their own neighbors. As a consequence, many important graph algorithms iteratively recompute the properties of each vertex until a fixed-point condition is reached. A range of graph-parallel abstractions have been proposed to express these iterative algorithms. GraphX exposes a variant of the Pregel API.

At a high level, the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction constrained to the topology of the graph. The Pregel operator executes in a series of steps in which vertices receive the sum of their inbound messages from the previous super step, compute a new value for the vertex property, and then send messages to neighboring vertices in the next super step. Using Pregel, messages...

PageRank

PageRank is one of the most important algorithms in the graph processing space. Originating at Google, the algorithm named after Larry page, the founder of Google, has evolved into many types of use cases based on the concept of ranking vertices or nodes based on relationships or edges.

Google PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. For more information, you can read the description at https://en.wikipedia.org/wiki/PageRank

Using Google PageRank as an example, you can improve the relative importance of a web page on your company website or maybe your blog by promoting the web page among other popular websites and technical blogs. Using this approach, your blog website...

Summary

In this chapter, we have introduced graph theory using Facebook as an example; Apache Spark's graph processing library GraphX, VertexRDD, and EdgeRDDs; graph operators, aggregateMessages, TriangleCounting, and the Pregel API; and use cases such as the PageRank algorithm. We have also seen the traveling salesman problem and connected components and so on. We have seen how the GraphX API can be used to develop graph processing algorithms at scale.

In Chapter 11, Learning Machine Learning - Spark MLlib and ML, we will explore the exciting world of Apache Spark's Machine Learning library.