Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Apache Spark Graph Processing

You're reading from  Apache Spark Graph Processing

Product type Book
Published in Sep 2015
Publisher
ISBN-13 9781784391805
Pages 148 pages
Edition 1st Edition
Languages

Chapter 6. Iterative Graph-Parallel Processing with Pregel

Graphs are a very useful abstraction for solving many practical computing problems. For example, we can search through nearly five billion web pages today, thanks to the PageRank graph algorithm. Apart from the web search, there are other applications, such as social media, for which iterative graph processing is needed. In this chapter, we will learn how to use Pregel, a computational model, which is suitable for this task. Pregel was initially proposed by Google and has also been adopted by Spark as a generic programming interface for iterative graph computations. In this chapter, you will understand the Pregel model of computation. In addition, our learning goal is to clarify both the interface and implementation of the Pregel operator in Spark. After working through the concrete examples, you will be able to formulate your own algorithms with the Pregel interface.

The Pregel computational model


A Pregel program is a sequence of iterations called supersteps, in each of which a vertex can receive inbound messages that are sent by its neighbors in the previous iteration, and modify its attribute and its edges. In addition, each vertex also sends messages to its neighbors by the end of each superstep. By thinking as a vertex, this abstraction makes it simple to reason about parallel graph processing. All we need to think about is the type of message that each vertex should be receiving, the processing that it should do on its inbound messages, and the message that its neighbors need for the next superstep. Luckily, this message-passing approach is flexible enough to express a large class of graph algorithms. More importantly, a graph algorithm can make use of Spark's scalable architecture to process the messages in bulk and in a synchronous manner. This synchronous model of computation makes it easy to express most graph-parallel algorithms.

Example ...

The Pregel API in GraphX


Now, let's formalize the programming interface for the Pregel operator. Here is its definition:

class GraphOps[VD, ED] {
  def pregel[A]
      (initialMsg: A,
       maxIter: Int = Int.MaxValue,
       activeDir: EdgeDirection = EdgeDirection.Out)
      (vprog: (VertexId, VD, A) => VD,
       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
       mergeMsg: (A, A) => A)
    : Graph[VD, ED]
}

The pregel method is invoked on a property graph, and returns a new graph with the same type and structure. While the edges remain intact, the attributes of the vertices may change from one superset to the next one. Pregel takes the following two lists of arguments. The first list contains:

  • An initial message with a user-defined type A—this message is received by each vertex when the algorithm starts

  • A maximum number of iterations

  • The edge direction along which to send messages

Note

A Pregel algorithm terminates when either there are no more messages to be sent, or...

Community detection through label propagation


In the following section, we are going to implement a community detection algorithm using the Pregel interface. Label Propagation Algorithm (LPA) is a simple and fast method for detecting communities within graphs. By construction, the communities obtained by the label propagation process require each node to have at least as many neighbors within its community as it has with each of the other communities.

Let's quickly describe how the LPA works. First, each node is initially given its vertex ID as its label. At the subsequent iterations, each node determines its community, based on the labels of its neighbors. Specifically, the node chooses to join the community to which the maximum number of its neighbors belong to. If there is a tie, one of the majority labels is picked randomly. As we propagate the labels in this way across the graph, most labels will disappear, whereas the remaining ones define the communities. Ideally, this iterative algorithm...

The Pregel implementation of PageRank


We have already seen that GraphX has a PageRank API. In the following, let us see how this famous web search algorithmic can be easily implemented using Pregel. Since we already explained in the previous chapter how PageRank works, we will now simply explain its Pregel implementation:

First of all, we need to initialize the ranking graph with each edge attribute set to 1, divided by the out-degree, and each vertex attribute to set 1.0:

val rankGraph: Graph[(Double, Double), Double] = 
    // Associate the degree with each vertex
    graph.outerJoinVertices(graph.outDegrees) {
        (vid, vdata, deg) => deg.getOrElse(0)
    }.mapTriplets( e => 1.0 / e.srcAttr )
     .mapVertices( (id, attr) => (0.0, 0.0) )

Following the Pregel abstraction, we define the three functions that are needed to implement PageRank in GraphX. First, we define the vertex program as follows:

val resetProb = 0.15
def vProg(id: VertexId, attr: (Double, Double), msgSum: Double...

Summary


In summary, Pregel is a generic and simplified interface for writing custom iterative, and parallel algorithms on large graphs. In this chapter, we have seen how to implement different iterative graph processing using this simple abstraction. In the next chapter, we will see how to use Spark's MLlib and GraphX to solve some machine learning problems with graph data.

lock icon The rest of the chapter is locked
You have been reading a chapter from
Apache Spark Graph Processing
Published in: Sep 2015 Publisher: ISBN-13: 9781784391805
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}