Welcome to the world of Gephi, the open source network graph and analysis tool. If you aren't already familiar with Gephi, this chapter will provide you a brief primer on both Gephi and the many ways in which it can be used to address real-world network analysis. For those who are already familiar with the software, this section will inspire you to take advantage of the many opportunities provided by this powerful tool.

Advances in computing power have helped to revolutionize the science of network graph analysis by putting complex datasets into the hands of thousands of users, and thus leading to the evolution of tools that help practitioners make sense out of previously unmanageably large datasets. Gephi is one of the tools at the forefront of this revolution, with a growing user community that produces some exceptional visualization. Gephi is far more than just a tool for producing attractive graphs; it is also a powerful instrument for exploratory data analysis that enables users to learn more about a given network.

This chapter will cover the following topics:

First, we will examine a range of graph applications that represent some of the many ways we might choose to use Gephi in order to address real-world data

The next section will provide an overview of the network graph vernacular, which will cover the primary topics of connectivity, network structure, and network behaviors

Finally, we will begin to explore how we can use Gephi as the primary toolkit to move from the conceptual realm into a very hands-on, real-world application

Note that this chapter and the remainder of the book are focused solely on the Gephi desktop software, and not the Gephi toolkit. The toolkit provides Java library modules that can be plugged into new Java applications to deliver core Gephi functionality. More information on the toolkit can be found online at http://gephi.github.io/toolkit/.

Let's get started with some potential applications that will begin to illustrate the number of ways in which we can use Gephi to create compelling network analysis and visualization.

One doesn't have to look far to recognize the enormous growth of network graphs as a means to explore and explain networks. **Social Network Analysis** (**SNA**) has certainly been the most visible subset of network graph analysis, with thousands of cases where users have mapped Facebook, Twitter, and LinkedIn peer networks. While this has been, and continues to be, a viable use of the approach, there are many lesser known, but frequently more compelling, datasets with highly interesting networks that are available for our exploration. In the next few sections, we will walk through some of the primary categories where we can access data and use Gephi to construct highly informative graphs by employing definitions laid out in the book *Networks, Crowds, and Markets: Reasoning about a Highly Connected World*, by David Easley and Jon Kleinberg.

Collaboration graphs represent one of the more frequently encountered categories in the world of network analysis. These graphs include networks where individual nodes are connected based on having some sort of collaborative relationship. The nodes might represent individuals or institutions; these graphs often depict collaborative research between universities and their staff within a specific discipline.

Another popular utilization of network analysis has been through the viewing of network graphs based on a variety of communication methods between the actors in a network, often within the confines of a single corporation, organization, or educational institution. These graphs can be constructed using e-mail or phone communications, and will often focus on the frequency of contact, thus exposing the true information flows and power structures within the organization.

Graphs that examine the flow of information across the Web are typical of this category of network analysis. These linkages can reference anything from connections between bloggers, pages on Wikipedia, or among scientific paper citation networks. This is a very popular type of graph, given the accessibility of information via the Web and its various applications.

This category often manifests itself through physical structures but, as David Easley and Jon Kleinberg note, there are underlying economic structures here as well, in the form of companies, regulatory bodies, and other organizations. In these sorts of networks, connections between nodes are likely to refer to a literal physical linkage, as in connections between routers or computers.

Another discipline that has received much attention through the use of networks is in the world of biology, where graphs are used to show relationships between predators and prey, neural networks within the brain, and a number of other science-based scenarios. Other network types are likely to emerge as well, and hybrids of these networks is also a possibility.

The idea here is to help stimulate thought processes about what sort of graphs you might be most interested in creating, and begin working toward their creation using Gephi. Specific examples of each of these network types are provided in the *Appendix*, *Data Sources and Other Web Resources*.

Network graphs are an element of what is termed as graph theory, defined by *Merriam-Webster* (http://www.merriam-webster.com/dictionary/graph%20theory) as:

"Mathematical theory of networks. A graph consists of vertices (also called points or nodes) and edges (lines) connecting certain pairs of vertices"

Stated simply, network graphs are collections of nodes (often called **vertices**) that are connected by edges (sometimes called **connections, links, **or** ties**) to form a graph. Nodes can be thought of as individual elements in a network that might represent persons, places, or objects that collectively define a network.

As you might have guessed, this definition could apply to an unlimited number of datasets, ranging from the obvious (Facebook or Twitter networks) to less explored examples, such as connections between staff, students, and facilities at a University; teammates in a baseball team; or musicians within a specific genre of music. In these cases, the nodes will represent the individual entities (people or structures), while the edges act as the connections that link them to one another. Each of your Twitter followers, for instance, represents individual nodes. They are then connected to you (a node) through a series of connections (edges). There are literally hundreds of thousands of opportunities to apply the principles of graph theory to interesting datasets spanning thousands of fields of interest.

So what can we learn from the principles of graph theory beyond creating a compelling visualization of a complex network? By the way, I will never downplay the visual output aspect of network graphs, as I am a strong believer in the power of aesthetics to enhance the story we are telling through our data. For example, refer to *Beautiful Visualization: Looking at Data through the Eyes of Experts*, edited by *Julie Steele* and *Noah Ilinsky*, *O'Reilly Media*. Still, there is much more to the basics of graph theory than a pretty picture. Let's examine a few of the most prominent principles in the next several sections.

We can learn a great deal about how a network functions by looking at all of the connections within the graph and by understanding how they are structured. Perhaps the graph is loosely structured, with few connections between nodes, or it could be densely connected, or a combination of both with dense clusters interspersed with gaps or structural holes in the network. Why is this important? One example would be to understand the potential for the spread of an infectious disease (known as a **contagion** in network terms) across a large network. A network with many structural holes will not support the rapid spread of a disease, while a densely connected network can facilitate this spread. There are many such cases where network structure is critical to understanding the behavior of elements within the network. The following sections will touch on some of the key concepts employed to help us better comprehend the structure of a network, and how its member nodes interact with one another.

A **path** is quite simply defined as the set of connections required for one node to interact with another node. We can use paths to understand the shortest distance between nodes, or perhaps to determine the shortest route to reach a distinct cluster of nodes. The following figure illustrates an example of a path for Node **A** to reach Node **E**, using the *shortest path*, shown by the bold edges. The shortest path is also known as the
**geodesic path**. Note that the path could also have passed through Node **D** on the way to Node **E**, but this is not the most efficient path, unless the direct connection from **C** to **E** is severed at some point.

In some cases, a path might involve passing through a specific node more than once, but more often, this will not be true, especially if we are attempting to minimize the path distance. If our path does not repeat any nodes, it can be termed a **simple path**, as illustrated in the preceding figure.

A **cycle** is an important variant of a non simple path, where there are a minimum of three edges and the first and last nodes are the same. All other nodes must be distinct; the cycle cannot traverse any of these nodes more than once. Cycles are critical to understand shortest paths through the network. Here is a simple cycle diagram—let's suppose we start at Node **A**:

This cycle is easy to follow, as we would simply move from **A** to **B**, **C**, **D**, **E**, and return to **A** in order to complete the cycle. We could also move in the reverse order, moving in a clockwise direction starting at Node **E**. This would become a little more complex, when there were additional nodes that do not flow around the perimeter as we have shown in the following figure. For example, a cycle could not move through the newly added Node **F**, since it would need to pass through Node **C** a second time to complete the entire path:

**Connectivity** is defined as the degree of connectedness of a graph, and can be measured using several formulas. At its essence, connectivity is a measurement of the robustness of a graph, as defined by the relative number of connections within a network. Networks with low connectivity are inherently fragile, as the removal of a small number of edges serves to weaken the network, and can actually disconnect some members from the components of the graph.

A few of the more common forms of connectivity measures are the **beta**, **alpha**, and **gamma** indexes. The beta index is a simple measure that looks at the number of edges divided by the number of nodes in a graph. Very simple networks will have a score of less than one, while more complex, densely connected graphs will exceed a value of one, and might go much higher in many instances. This is a very simple equation: , where **e** equals the number of edges, and **v** represents the number of nodes.

An alpha index evaluates the number of cycles in a graph relative to the possible number of cycles. At one extreme, a simple tree network would have an alpha value of zero, as there is no way to cycle through the network without repeating nodes. A perfectly connected network would have a score of one, but this is both rare and impractical, as this would indicate an inefficient network.

Finally, a gamma index measures the number of actual or observed links relative to the number of possible links, which gives us a value between zero and one. Scores closer to one indicate a more densely connected graph, although it is unusual to find a network that approaches that level. The gamma index is particularly useful to assess temporal (time-based) changes in a network. There are two equations, one for a planar graph with no crossing edges, and a second for non-planar graphs.

More details on connectivity and its various measures will be provided to you in Chapter 4, *Network Patterns*.

We have thus far focused on the concepts of paths, cycles, and connectivity, all of which help us to understand the interactions within the graph and even provide us with some statistical measures of the composition of the network. Yet these approaches fall short to convey all the information about the overall structure of the graph, such as how influential individual nodes or clusters are within the network. Fortunately, there are many ways in which we can statistically measure the structure of a network. If our graph is limited to a small number of nodes and edges, it is not difficult to see connectivity patterns, node groupings, and the overall topology of the graph, and we might not be terribly concerned with the statistical output.

However, when the network grows to more than a few dozen nodes, simple visual assessment will not provide us with all of the information within the graph, so we need to rely on more sophisticated formulas that provide us with detailed insights into the data and its structure.

The following sections will provide some details on many of the primary measures that tell us more about our graph. Further details are provided in Chapter 6, *Graph Statistics*. There are also many additional sources that provide further details on these, as well as more advanced network concepts, statistics, and theory. If you have not already consulted some of these resources, I would encourage you to do so. A listing of many available sources is provided in the *Bibliography* section of Appendix, *Data Sources and Other Web Resources*.

One of the key constructs within network graph analysis is the idea of centrality, where we make an attempt to understand the relative influence of individual nodes within the network. As one might anticipate, there are several ways we can measure centrality, with each method providing a different definition, and often, very different results. Let's assume in each of the following cases that we are examining a subset of a network, rather than its entirety. Each centrality measure will be measured across an entire network, but we will use the following measures for illustrative purposes. The general principles do not differ.

There are four primary centrality measures to explore, which we will look at in no particular order. The first measure of centrality is **closeness centrality**, a measure of the proximity of a selected node to all other nodes within the graph. A node with strong closeness centrality would typically have very short paths to all other nodes within the network. Note that the result will be a lower average number, as we are talking about how many steps it takes to reach all other nodes. Here is a simple example:

Note that Node **D**, despite having direct connections to just three of the six remaining nodes, has a maximum distance of two to reach any other point in the graph, while all other nodes have paths that might require three or even four steps. The central location of **D** makes traversing the graph very simple. Generally speaking, we would expect this type of node to lie at or near the physical center of a graph, although this is not always the case. In any event, this category of node is very prominent within the graph, and is also likely to have strong **degree centrality**, which we will discuss in a moment.

Another key measure is **betweenness centrality**, which often returns a very different result than the other centrality measures. In this case, we will find nodes that are highly influential in connecting otherwise remote regions of a graph, even though these nodes might have low influence as measured by other centrality measures. These nodes form a **bridge** between parts of the graph and thus play a key role in reducing path distances when traversing the graph. An example of this might be a jazz musician of relatively limited stature who managed to perform with both Duke Ellington in the 1940s and Wynton Marsalis in the 1990s, thus forming a bridge between musicians of different eras.

Here is a simple illustration of betweenness centrality, again using the Node **D**:

In this case, **D** has just two first degree connections yet it plays a pivotal role in the network structure by being the bridge between the **BAF** and **CEG** clusters, which otherwise would be unable to connect.

Another critical centrality measure is **eigenvector centrality**, where the influence of a particular node is defined by the connectedness of its closest neighbors. This can be thought of as the *who you know* type of centrality, wherein an individual node might not be thought of as important on its own, but its relationship to other highly connected nodes indicates a high level of influence. In our modern society, you might view this type of node as being a confidant of a popular celebrity, athlete, actress, and so on (perhaps a gatekeeper for some highly influential neighbors).

Let's examine a basic illustration of eigenvector centrality at work:

In this instance, **D** has only two first degree connections, but is surrounded by a host of very well connected nodes. Node **D** would thus score highly on the eigenvector centrality measure due to the relative importance of its first and second degree neighbors.

Finally, we consider the
**degree centrality** measure mentioned earlier, which examines the number (or proportion) of other nodes linked to a specific node, either through inbound, outbound, or undirected connections. This type of node might act as a sort of hub for information flow—it might not be the source of direct information, but plays a critical role in dispersing this information to others.

The following example illustrates a node with a high degree centrality (once again using Node D):

In this case, **D** has direct connections to five of the six remaining nodes, while no other node has more than three such edges, making **D** a hub within the network. Based on this structure, information is likely to flow though **D**, particularly as nodes communicate across the network (say from **B** to **G**).

To summarize, centrality is an essential measure for how information flows within a network, and should be assessed using a combination of the above measures to achieve a complete understanding of the network. We will go into greater detail on this critical topic in Chapter 4, *Network Patterns*, where a number of more detailed visual examples will be presented, so that we can begin to apply these concepts to real-world graphs.

Graphs can be termed as **connected**, where all nodes are joined through a fully linked network, and **disconnected**, in which there are separate groups of nodes with no relationship between the groups. It is in this latter instance where components take root with multiple groups of nodes standing alone with no linkages to other portions of the graph. Let's look at a hypothetical group of friends' network, first in a connected state, and then in a disconnected state with two components.

First, in the connected state, (as shown in the following figure) all member nodes can reach one another, with a maximum path distance of four Nodes (from **G** to both **C** and **E**). This is a rather loose set of connections yet it remains fully connected:

Next, let us suppose that Node **F** has done something to alienate both nodes **A** and **D**, forcing them to break off their connection to **F**. Now not only is **F** cut off from the remainder of the network, but so is **G**, who was previously connected through **F**. We now have two distinct components to the graph, as shown here:

Many friends' networks exhibit this type of behavior, albeit on a larger scale than we have shown here. One of the more notable examples in the literature comes from *Berman*, *Moody*, and *Stovel* published in 2004 (http://www.soc.duke.edu/~jmoody77/chains.pdf) and shows the romantic connections in a selected high school, with a graph composed of nearly 20 distinct components.

Let's take a brief look at a specific type of component—the **giant component**.

A giant component (http://en.wikipedia.org/wiki/Giant_component) might be thought of as the largest cluster in a network assuming it follows a specific mathematical formulation. For simplicity, we might refer to this as **large **components or **largest connected **components. These might also qualify as giant components, but will not require the same level of qualification. In our prior example where we saw a split in our friends' network, nodes **A** through **E** all remained connected within one cluster, while nodes **F** and **G** formed their own smaller cluster. In this case, the first cluster becomes the large component, by virtue of its larger size relative to the two node cluster represented by **F** and **G**. Now let's consider a case where we have more than two distinct clusters.

Take a look at the following network:

We started with the same two clusters from our disconnected network example, but added nodes **H** through **R**, which have formed a new cluster that is clearly the dominant grouping in the graph. This is now the giant component, as it encompasses 11 nodes, compared to the other clusters with five and two, respectively. In Chapter 4, *Network Patterns*, we will take a look at how giant components form using various connectivity thresholds and assumptions.

Even when graphs are connected within a single component and we no longer have multiple components, clustering still plays a critical role in helping us to understand relationships, information flow, the spread of disease, and other relevant topics. We can assess the degree of clustering through statistics such as the **clustering coefficient**, applied at both global and local (that is, neighborhood) levels. Many clustering applications will be covered in greater detail in Chapter 4, *Network Patterns*, and the statistical measures will be covered in Chapter 6, *Graph Statistics*.

**Homophily** is one of the key concepts that we need to understand when we examine network graphs, and is critical in helping us to assess networks with significant clustering. The term refers to what is often characterized as "birds of a feather stick together", wherein individuals tend to congregate with other like-minded individuals to form tightly knit clusters. Homophily might be driven by gender, race, age, occupation, education level, social status, or some other salient characteristic possessed by individuals within the network. These attributes might act individually, but will often interact with one another to define subgroups within a graph. Here are a few simple examples of groups we might find within a network graph:

Women with a postgraduate degree

Electricians belonging to the same union

Executives serving on overlapping boards of directors

I think you get the idea, and could no doubt come up with many other relevant combinations. Once our graph is created, we can test for homophily and begin to examine its causes by exploring the tightly knit clusters that characterize its presence. In Chapter 4, *Network Patterns*, we will learn more about the critical role of homophily as it relates to the spread of information and innovation, while some of the statistical measures covered in Chapter 6, *Graph Statistics* will help us understand the presence of homophily in a network.

Graph density is a measure of how tightly interconnected a network is calculated by examining the proportion of edges relative to the possible number of connections. A network with a high degree of homophily will tend to have a low density (due to the lack of connections beyond the local clusters), while networks that show a high degree of interaction across the network will have higher density levels. This will depend to some degree on whether we are viewing the entire network or a more localized sample. Two networks with identical numbers of nodes might have very different density levels; even the same network measured at different time intervals is likely to have differing density measures as links are formed or broken over time. A more detailed exploration of this measurement will be included in Chapter 6, *Graph Statistics*.

We have just completed a brief overview on network structure and how it helps us to explain the patterns we see within an already existing network. Now it is time to touch on some of the behaviors that can take place within those structures and how they might develop or fail to develop according to features of the graph.

There are two terms used almost interchangeably to explain the evolution of a process within a network—**contagion** and **diffusion**. We will provide a basic overview in this section and then explore this concept in much greater detail in Chapter 4, *Network Patterns* and Chapter 8, *Dynamic Networks*.

The concept of contagion is typically associated with the spread of disease, but it can be used to describe a variety of phenomena in the marketing and social spaces. At its essence, contagion refers to the ability of something—a disease, an idea, or a book—to spread rapidly based on its network structure. Given the typical verbal association of contagion and the spread of disease, I will use that term in cases where we examine the progression of a disease. In all other contexts, the book will default to the use of diffusion, as that seems to be the predominant term when we look at how ideas propagate through a network.

To understand the potential of diffusion, we need to comprehend both, the structure of the network and the influence of various actors (nodes) within the network. If our subject (idea, book, information, and so on) flows through highly influential nodes within the network, we would then anticipate a rapid spread throughout the graph. To the contrary, if an idea is launched from a distant, poorly connected node, it is highly likely that there will be a very low level of diffusion, and only a small portion of the network will be exposed to the idea.

The following figure shows a hypothetical diffusion of a new product through a small network, using the simple assumption that each node requires 33 percent of its neighbors to adopt the product before it tries the product. Note that whenever this criterion is not met, the diffusion will cease in that part of the network. This figure shows the first four rounds of diffusion, beginning with the purchase of the new product by Node **A**:

In addition to the influence factor just mentioned, the basic structure of a network will help to determine the rapidity with which information spreads. Networks with lots of localized clusters tend to limit or even terminate the flow of information, while densely connected networks with few gaps promote the transmission of information. As noted earlier, we will explore these concepts in greater detail in Chapter 4, *Network Patterns* and Chapter 8, *Dynamic Networks*, with visual examples that illustrate information flows through the existing networks.

Much of what we have considered until this point has been by looking at an existing set of networks without regard for additional growth or expansion of the network. In reality, we know that networks do grow, often very quickly and unpredictably. There exist a number of models in the literature that attempt to predict network growth using a variety of assumptions. These include the Erdos-Renyi random growth model as well as the Barabási–Albert model of preferential attachment, plus a host of variations on these themes.

Barabási offers a concise definition of a random network model (http://barabasilab.neu.edu/networksciencebook/downlPDF.html#):

| ||

--Barabási, L. (2012), Network Science |

Given this definition, we can quickly determine that random graphs could take on a variety of structures, especially as we increase or decrease the *p* value. If *p* is high, say 0.8, then our graph will tend toward a dense structure, while a *p* value of 0.2 would lead to a very sparse graph with few connections between nodes. Random networks have been criticized for being unrealistic in modeling network growth (refer to *Social and Economic Networks* (2010), P. 78, *Matthew Jackson*, which can be retrieved from http://books.google.com), yet they continue to serve as a useful starting point in understanding how networks evolve.

Preferential attachment, on the other hand, has been shown to correlate much more closely, if imperfectly, to real-world situations, such as the structure of connections on the Web. Barabási notes the basic premise behind preferential attachment. The following is retrieved from http://barabasilab.neu.edu/networksciencebook/downlPDF.html#:

| ||

--Barabási, L. (2012), Network Science |

In the preferential attachment model, new nodes are more likely to connect to nodes that have higher degrees, a pattern often referred to as *the rich get richer*. Thus, our network winds up with a number of hubs with many connections, surrounded by a greater number of nodes with fewer edges. Many real-life examples have shown this sort of behavior, from mathematical processes to citation networks and on to the Web.

We will defer further discussion on this subject until Chapter 4, *Network Patterns* and especially in Chapter 8, *Dynamic Networks*, where we will discuss the potential ways a network might grow, how quickly it will evolve, and what it might look like at some future point in time. Gephi provides the capability to perform this type of analysis, which can lead to powerful illustrations for how a network has changed over time.