Apache Spark Graph Processing

Chapter 3. Graph Analysis and Visualization

In this chapter, we will learn how to analyze the characteristics of graphs using visualization tools and graph algorithms. For example, we will use some of the algorithms available in GraphX to see how connected a graph is. In addition, we will compute commonly used metrics, such as triangle counts and clustering coefficients. Furthermore, we will learn, through a concrete example, how the PageRank algorithm can be used to rank the importance of the nodes in a network. Along the way, we will introduce new RDD operations that will prove useful here and in later chapters. Finally, this chapter offers practical tips on building Spark applications that rely on third-party libraries. After doing the activities in this chapter, you will have learned the tools and concepts to:

  • Visualize large-scale graph data

  • Compute the connected components of a network

  • Use the PageRank algorithm to rank node importance in networks

  • Build Spark...
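As a quick preview of two of the GraphX operators mentioned above, the following is a minimal sketch that builds a tiny toy graph and runs connectedComponents and triangleCount on it. The toy edges, the variable name network, and the use of the shell's sc SparkContext are our own illustrative assumptions rather than the book's code:

import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// A tiny toy graph; in this chapter we will instead reuse the graphs built in Chapter 2.
// Edges are written in canonical order (srcId < dstId) and the graph is partitioned,
// which triangleCount expects in Spark 1.x.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1), Edge(4L, 5L, 1)))
val network = Graph.fromEdges(edges, defaultValue = "node").partitionBy(PartitionStrategy.RandomVertexCut)

// Label each vertex with the smallest vertex ID in its connected component,
// then count how many distinct components there are.
val cc = network.connectedComponents()
val numComponents = cc.vertices.map { case (_, comp) => comp }.distinct().count()

// Count the triangles passing through each vertex.
val triangles = network.triangleCount()
triangles.vertices.collect().foreach(println)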

Network datasets


We will be using the same datasets introduced in Chapter 2, Building and Exploring Graphs, including the social ego network, email graph, and food-compound network.

The graph visualization


Spark and GraphX do not provide any built-in functionality for data visualization, since their focus is on data processing. However, a picture is worth more than a thousand numbers when it comes to data analysis. In the following sections, we will build a Spark application for visualizing and analyzing the connectedness of graphs. We will rely on a third-party library called GraphStream for drawing networks, and on BreezeViz for plotting structural properties of graphs, such as the degree distribution. These libraries are not perfect and have their limitations, but they are relatively stable and simple to use, so we will use them to explore the graph examples in this chapter.

Note

Currently, there is still a lack of graph visualization engines and libraries that can draw large-scale networks without requiring a huge amount of computing resources. For example, the popular network analysis software SNAP currently relies on the GraphViz engine to draw networks, but it...

The analysis of network connectedness


Next, we are going to visually explore and analyze the connectedness of the food network. Reload the ingredient and compound datasets using the steps explained in the previous chapter. After you are done, create a GraphStream graph object:

// Import GraphStream's in-memory graph implementation
import org.graphstream.graph.implementations.SingleGraph

// Create a SingleGraph object for GraphStream visualization
val graph: SingleGraph = new SingleGraph("FoodNetwork")
Then, set the ui.stylesheet attribute of the graph. Since the food network is a bipartite graph, it would be nice to display the nodes with two different colors. We do that using a new style sheet. While we are at it, let's also reduce the node size and hide the text attributes:
node {
    size: 5px;
    text-mode: hidden;
    z-index: 1;
    fill-mode: dyn-plain;
    fill-color: "#e7298a", "#43a2ca";
}
edge {
    shape: line;
    fill-color: #fee6ce;
    arrow-size: 2px, 1px;
    z-index: 0;
}
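To make GraphStream actually use these rules, attach them to the graph as its ui.stylesheet attribute. The following is a minimal sketch, assuming the rules above are stored in a String named styleSheet (the variable name is our choice, not the book's); addAttribute is GraphStream's standard way of setting graph-level attributes:

// Attach the style sheet; `styleSheet` is assumed to hold the CSS rules shown above.
graph.addAttribute("ui.stylesheet", styleSheet)

// Optional hints that ask the viewer for higher-quality, anti-aliased rendering.
graph.addAttribute("ui.quality")
graph.addAttribute("ui.antialias")

Because the node fill-mode is dyn-plain with two fill colors, each node can later be given a ui.color attribute of 0 or 1 to select the first or second color, which is how the two sides of the bipartite food network can be told apart.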

Tip

The color value in the style sheet is set in hexadecimal using #. You can choose your favorite colors from...

The network centrality and PageRank


Previously, we used the degree distribution and clustering coefficients of a network to understand how connected it is. In particular, we learned how to find the largest connected components and the nodes with the highest degrees. Then, we visualized the networks and saw which nodes are most likely to act as hubs, since many other nodes are connected to them. In a sense, the degree of a node can be interpreted as a centrality measure that determines how important the node is relative to the rest of the network. In this section, we will introduce a different centrality measure, as well as the PageRank algorithm, which is useful for ranking nodes in graphs.
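To make this concrete, here is a minimal sketch of running PageRank with GraphX. It assumes a GraphX property graph named network is in scope (the name, the tolerance, and the iteration count are our placeholders, not the book's); pageRank and staticPageRank are both part of GraphX's standard API:

// Run PageRank until the per-vertex rank changes by less than the given tolerance.
// The returned graph carries each vertex's PageRank score as its vertex attribute.
val ranks = network.pageRank(0.001)

// Alternatively, run a fixed number of iterations.
val staticRanks = network.staticPageRank(10)

// Print the ten highest-ranked vertices.
ranks.vertices.sortBy(_._2, ascending = false).take(10).foreach(println)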

Note

There exist many other measures of centrality for graphs. For example, betweenness centrality is useful for information flow problems. Given a node, its betweenness is the number of shortest paths from all vertices to all others that pass through...

Scala Build Tool revisited


Previously, we used the Scala console to interact with Spark. If we want to build a standalone application instead, manually managing the third-party library dependencies becomes unwieldy. Remember that we first had to download the JAR files for GraphStream and BreezeViz, as well as those of the libraries they depend on. Then, we had to put them in the lib folder and pass this list of JAR files to the --jars option when submitting the Spark application. This process becomes extremely cumbersome when the application reuses many third-party libraries, which may themselves depend on several other libraries. Fortunately, we can automate it with SBT. Let's see how to manage library dependencies and how to create an uber JAR, or assembly JAR, with SBT. If you already know how to do this, feel free to skip this section and go ahead to the next chapter.
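As a preview, the following is a minimal build.sbt sketch; the project name and the version numbers are illustrative rather than taken from the book, and the Spark artifacts are marked as provided because the cluster supplies them at runtime:

// build.sbt -- a minimal sketch; version numbers are illustrative.
name := "graph-analysis"
version := "0.1"
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.4.1" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.4.1" % "provided",
  "org.graphstream"  %  "gs-core"      % "1.2",
  "org.graphstream"  %  "gs-ui"        % "1.2",
  "org.scalanlp"     %% "breeze-viz"   % "0.11.2"
)

With the sbt-assembly plugin added to project/plugins.sbt, running sbt assembly then bundles the application together with its non-provided dependencies into a single JAR that can be handed to spark-submit.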

Organizing build definitions

SBT offers flexibility and power in defining builds and tracking library...

Summary


In this chapter, we learned about different ways to visualize and analyze graphs in Spark. We studied the connectedness of different networks by looking at their degree distributions, finding their connected components, and calculating their clustering coefficients. In addition, we learned how to visualize graph data using GraphStream. After this, we showed how the PageRank algorithm can be used to rank node importance in different networks. This chapter also showed us how to use SBT to build a Spark program that relies on third-party libraries.

Throughout this chapter, we have also studied how the basic Spark RDD operations can be used to transform, join, and filter collections of graph vertices and edges. In the next chapter, we will learn about the graph-specific and higher-level operations that are used to transform and manipulate the structure of graphs.

