Chapter 6. Iterative Graph-Parallel Processing with Pregel
Graphs are a very useful abstraction for solving many practical computing problems. For example, we can search through nearly five billion web pages today, thanks to the PageRank graph algorithm. Apart from the web search, there are other applications, such as social media, for which iterative graph processing is needed. In this chapter, we will learn how to use
Pregel, a computational model, which is suitable for this task. Pregel was initially proposed by Google and has also been adopted by Spark as a generic programming interface for iterative graph computations. In this chapter, you will understand the Pregel model of computation. In addition, our learning goal is to clarify both the interface and implementation of the Pregel operator in Spark. After working through the concrete examples, you will be able to formulate your own algorithms with the Pregel interface.
The Pregel computational model
A Pregel program is a sequence of iterations called
supersteps, in each of which a vertex can receive inbound messages that are sent by its neighbors in the previous iteration, and modify its attribute and its edges. In addition, each vertex also sends messages to its neighbors by the end of each superstep. By thinking as a vertex, this abstraction makes it simple to reason about parallel graph processing. All we need to think about is the type of message that each vertex should be receiving, the processing that it should do on its inbound messages, and the message that its neighbors need for the next superstep. Luckily, this message-passing approach is flexible enough to express a large class of graph algorithms. More importantly, a graph algorithm can make use of Spark's scalable architecture to process the messages in bulk and in a synchronous manner. This synchronous model of computation makes it easy to express most graph-parallel algorithms.
Now, let's formalize the programming interface for the Pregel operator. Here is its definition:
The pregel
method is invoked on a property graph, and returns a new graph with the same type and structure. While the edges remain intact, the attributes of the vertices may change from one superset to the next one. Pregel takes the following two lists of arguments. The first list contains:
An initial message with a user-defined type A—this message is received by each vertex when the algorithm starts
A maximum number of iterations
The edge direction along which to send messages
Note
A Pregel algorithm terminates when either there are no more messages to be sent, or...
Community detection through label propagation
In the following section, we are going to implement a community detection algorithm using the Pregel interface. Label Propagation Algorithm (LPA) is a simple and fast method for detecting communities within graphs. By construction, the communities obtained by the label propagation process require each node to have at least as many neighbors within its community as it has with each of the other communities.
Let's quickly describe how the LPA works. First, each node is initially given its vertex ID as its label. At the subsequent iterations, each node determines its community, based on the labels of its neighbors. Specifically, the node chooses to join the community to which the maximum number of its neighbors belong to. If there is a tie, one of the majority labels is picked randomly. As we propagate the labels in this way across the graph, most labels will disappear, whereas the remaining ones define the communities. Ideally, this iterative algorithm...
The Pregel implementation of PageRank
We have already seen that GraphX has a PageRank API. In the following, let us see how this famous web search algorithmic can be easily implemented using Pregel. Since we already explained in the previous chapter how PageRank works, we will now simply explain its Pregel implementation:
First of all, we need to initialize the ranking graph with each edge attribute set to 1, divided by the out-degree, and each vertex attribute to set 1.0:
Following the Pregel abstraction, we define the three functions that are needed to implement PageRank in GraphX. First, we define the vertex program as follows:
In summary, Pregel is a generic and simplified interface for writing custom iterative, and parallel algorithms on large graphs. In this chapter, we have seen how to implement different iterative graph processing using this simple abstraction. In the next chapter, we will see how to use Spark's MLlib and GraphX to solve some machine learning problems with graph data.