# Introducing Graphs in the Real World

Social network analysis, fraud detection, modeling the stability of systems (for example, rail and energy grids), and recommendation systems all rely on graphs as the lynchpin underpinning these types of networks. In each of these examples, the relationships between individual people, bank accounts, or other single units are fundamental to describing and modeling the data. Compared with traditional data models, a graph is a natural way to represent groups of interacting elements.

This chapter explains why graphs are important and introduces the fundamentals of what makes up a graph network. We will also look at how to move from traditional data storage strategies, such as **relational databases** (**RDBs**), to working with **graph databases** (**GDBs**). Throughout this book, we will be working with a popular graph database, namely Neo4j. This will be followed by an explanation of how graphs are utilized in the *real world* and then a gentle introduction to the two workhorse Python packages, igraph and NetworkX, which are mature and widely used packages for graph data analysis and modeling.

In this chapter, we’re going to cover the following topics:

- Why should you use graphs?
- The fundamentals of nodes and edges and the properties of a graph
- Comparing RDBs and GDBs
- The use of graphs across various industries
- Introduction to NetworkX and igraph

# Technical requirements

We will be using Jupyter Notebook to run our coding exercises. For this, you will require `python>=3.8.0`, along with the following packages, which will need to be installed in your environment with the `pip install` command:

- `networkx==2.8.8`
- `igraph==0.9.8`

All notebooks, along with the coding exercises, are available at the following GitHub link: https://github.com/PacktPublishing/Graph-Data-Modeling-in-Python/tree/main/CH01.

# Why should you use graphs?

In modern, data-driven solutions and enterprises, graph data structures are becoming more and more common. This is because the relationships between things are becoming as important as, if not more important than, the things themselves, and graphs are a powerful way to understand those relationships. We will demonstrate examples of real-life graphs in our use cases in the following chapters, with detailed instructions on how to build these networks and the core considerations you need to make for the graph design.

Graphs are fundamental to many systems we use every day. Each time you are online and receive a product recommendation, it is likely that a graph solution is powering this recommendation. This is why learning how to work with graph data and leveraging these types of networks is a fast-growing and key skill in data science.

## Composite components of a graph

Networks are a tool to represent complex systems and the complex nature of the connections arising in today’s data. We have already referenced how graphs are powering some of the big powerhouse recommendation systems in action today.

Graph methods tend to fall into four different areas:

- **Movement**: Movement is concerned with how things travel (move) through a network. These types of graphs are the drivers behind routing and GPS solutions and are utilized by the biggest players in finding the optimal path across a road network.
- **Influence**: On social media, this area specifies who the known influencers are and how they propagate this influence across a network.
- **Groups and interactions**: This area involves identifying groups and how actors in the network interact with each other. We will look at an example of how to apply community detection methods to find these communities through the node (the actor involved) and its connections (the edges). Don’t worry if you don’t know what these terms are for now; we will focus on these in the *Fundamentals of nodes, edges, and the properties of a graph* section.
- **Pattern detection**: Pattern detection involves using a graph to find similarities in the network that can be explored. We must look at this from the actor’s (node’s) point of view and find similarities between that actor and other actors in the network. Here, *actor* is taken to mean person, author profile, and so on.

In this section, we have explained the core components of a graph by providing simple working definitions. In the following section, we will delve deeper into these fundamental elements, which make up every graph you will come across in the industry. We will look at nodes, edges, and the various properties of a graph.

# The fundamentals of nodes and edges and the properties of a graph

Graphs, or networks, are particularly powerful data structures for modeling and describing relationships between things, whether these things are people, products, or machines. In a graph, the *things* we mentioned earlier are represented by *nodes* (sometimes known as *vertices*). Nodes are connected by *edges* (sometimes referred to as *relationships*). In a network, an edge represents a relationship between two things, indicating that, somehow, they are linked.

The following sections will look at the structures and types of graphs. First, we will start with undirected graphs before moving on to directed graphs. After that, we will look at node properties, then delve into heterogeneous graphs, and end by looking at schema design considerations.

## Undirected graphs

To illustrate, a simple example is that of a real-life social network. In the following example, Jeremy and Mark are each represented by a node. Jeremy is friends with Mark, and the *friend* relationship is represented by an edge connecting the two nodes. The following diagram shows an **undirected graph**:

Figure 1.1 – Two friend nodes are linked together with a single edge
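As a minimal sketch of the friendship in Figure 1.1, assuming the NetworkX package listed in the technical requirements (it is introduced properly later in this chapter), an undirected relationship can be stored as a single edge that reads the same from either side. The variable name `friends` is ours, chosen for illustration:

```python
import networkx as nx

# An undirected graph: edges have no direction
friends = nx.Graph()
friends.add_edge("Jeremy", "Mark")  # Jeremy and Mark are friends

# The same single edge is visible from both nodes
print(friends.has_edge("Jeremy", "Mark"))  # True
print(friends.has_edge("Mark", "Jeremy"))  # True
print(friends.number_of_edges())           # 1
```

Because the edge is undirected, there is no way to express a one-sided relationship in this structure.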

However, not all social networks are the same, and in some online social media platforms, relationships between users of a social network may not be mutual.

For example, on Twitter, one user can follow another, but this doesn’t mean the inverse must be true. On Twitter, Jeremy may follow Mark, but Mark may not follow Jeremy.

## Directed graphs

Here, a directional edge is used to show that Jeremy follows Mark, while the absence of an edge in the reverse direction shows that Mark does not follow Jeremy in return:

Figure 1.2 – Two friend nodes are linked together with a single edge

This type of graph is known as a **directed graph**. For reference, sometimes, undirected edges like those in *Figure 1.1* are shown as bidirectional edges, pointing to both nodes. This bidirectional representation is equivalent to an undirected edge in most senses and represents a mutual relationship.

When creating a data model with directional edges, naming relationships appropriately becomes important. In our Twitter example, if the edge representing the interaction between Mark and Jeremy is *follows*, then the edge goes from the Jeremy node to the Mark node.

On the other hand, if the edge represents a concept such as *followed by*, then this should be in the other direction – that is, from Mark to Jeremy. This has particularly strong implications for some more complex graph modeling and use cases, which we will cover in *Chapter 2*, *Working with Graph Data Models*.
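The effect of edge direction and naming can be sketched with a NetworkX `DiGraph`, again assuming the package from the technical requirements. The variable names `follows` and `followed_by` are illustrative only:

```python
import networkx as nx

# A directed graph: each edge has a source and a target
follows = nx.DiGraph()
follows.add_edge("Jeremy", "Mark")  # read as: Jeremy follows Mark

print(follows.has_edge("Jeremy", "Mark"))  # True
print(follows.has_edge("Mark", "Jeremy"))  # False

# Reversing every edge gives the "followed by" reading of the same data
followed_by = follows.reverse()
print(list(followed_by.edges()))  # [('Mark', 'Jeremy')]
```

Both graphs hold the same information. What changes is which node is the source of the edge, and that is why the relationship name has to match the direction.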

## Node properties

While nodes and edges (directional or not) are the basic building blocks of a graph, they are often not sufficient to fully describe a dataset. Nodes can have data associated with them that may not be relational, so it would not be expressed as a relationship with another node.

In these cases, to represent data associated with nodes, we can use node properties (sometimes known as node attributes). Similarly, where an edge has additional information associated with it, in addition to representing a relationship, edge properties can be used to hold that data.

The following diagram shows a black line, indicating that Jeremy follows Mark but that Mark does not follow Jeremy – therefore, the black line indicates directionality:

Figure 1.3 – Two friend nodes are linked together with a single edge

In the preceding model, node properties are used to describe the number of followers Mark and Jeremy have, as well as the locations listed in their Twitter bios. This kind of additional node information is particularly important for querying graph data when asking questions that involve filtering.

In our Twitter example, properties would need to be present in the graph if, for example, we wanted to know who followed users with more than 1,000 followers. We will revisit answering graph questions using nodes, edges, and properties in later chapters.
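A filtering question of this kind might look as follows in NetworkX. This is a sketch under assumptions: the follower counts (130 and 2,100) are the ones used later in this chapter, and the `followers` attribute name is our choice:

```python
import networkx as nx

g = nx.DiGraph()
g.add_node("Jeremy", followers=130)
g.add_node("Mark", followers=2100)
g.add_edge("Jeremy", "Mark")  # Jeremy follows Mark

# Step 1: filter nodes on a property
popular = [n for n, props in g.nodes(data=True) if props["followers"] > 1000]

# Step 2: for each popular user, collect the users whose edges point at them
followers_of_popular = {user: list(g.predecessors(user)) for user in popular}
print(followers_of_popular)  # {'Mark': ['Jeremy']}
```

Without the `followers` property on each node, the first step would have nothing to filter on.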

Depending on the dataset, there may be cases where different nodes have different sets of properties. In this case, it is common to have several distinct types of nodes in the same graph.

## Heterogeneous graphs

Node types can also be referred to as layers, or nodes with different labels, though for this book, they will be known simply as types.

The following diagram shows the nodes representing Jeremy and Mark as people, where each node type has different properties, and there are multiple relationship types. Due to this, we can term these multiple relationships as heterogeneous:

Figure 1.4 – Example of a heterogeneous Twitter graph

Now, we have added nodes representing Mark and Jeremy as people, relationships representing their relationship outside of Twitter, and their ownership of their respective accounts. Note that since we have increased the number of node types, we also need new edge types to refer to the different interactions between different types of nodes.

Graphs with multiple node types are known as heterogeneous, multilayer, or multilevel graphs, though going forward we will use the term heterogeneous to refer to graphs with multiple types of nodes. In contrast, graphs with only one node type, as in the previous examples, are referred to as homogeneous graphs.
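One common way to sketch a heterogeneous graph in NetworkX is to record the type of each node and edge in an attribute. The attribute name `type`, the node IDs, and the type labels below (`Person`, `TwitterUser`, `OWNS`, `FRIENDS_WITH`, `FOLLOWS`) are assumptions made for illustration and may not match the labels in Figure 1.4:

```python
from collections import Counter

import networkx as nx

g = nx.DiGraph()

# Two node types, recorded in a "type" attribute
g.add_node("Jeremy", type="Person")
g.add_node("Mark", type="Person")
g.add_node("jeremy_account", type="TwitterUser", followers=130)
g.add_node("mark_account", type="TwitterUser", followers=2100)

# Several edge types, recorded the same way
g.add_edge("Jeremy", "jeremy_account", type="OWNS")
g.add_edge("Mark", "mark_account", type="OWNS")
g.add_edge("Jeremy", "Mark", type="FRIENDS_WITH")  # mutual in reality; stored once here
g.add_edge("jeremy_account", "mark_account", type="FOLLOWS")

node_types = Counter(t for _, t in g.nodes(data="type"))
edge_types = Counter(t for _, _, t in g.edges(data="type"))

print(dict(node_types))     # {'Person': 2, 'TwitterUser': 2}
print(len(node_types) > 1)  # True: more than one node type, so heterogeneous
```

Note that only the `TwitterUser` nodes carry a `followers` property, which illustrates how different node types can have different sets of properties.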

## Schema design

At this point, it is reasonable to ask the question: *What features of a dataset should be nodes, edges, and properties?*

For any given dataset, there are multiple ways to represent data as a graph, and each is more suited to different purposes. Herein lies the trick to good graph modeling: a question or use case-driven schema design.

If we were particularly interested in the locations of Twitter users in our network, then we could move the location node property on the Twitter user nodes to a separate node type and create the `LOCATED_IN` relationship type. This is shown in the following diagram:

Figure 1.5 – The same graph but with the location property moved from a node property to a node type

We could even go one step further to represent the information we know about these locations, adding the country related to each location as a separate, abstracted node.
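The schema change described here, moving a property into its own node type, can be sketched as a small helper. The function name `promote_property_to_node`, the node IDs, and the placeholder value `"Location A"` are hypothetical; only the `LOCATED_IN` relationship name comes from the text:

```python
import networkx as nx

def promote_property_to_node(graph, prop, node_type, edge_type):
    """Return a copy of `graph` in which `prop` becomes its own node type."""
    result = graph.copy()
    for node, props in graph.nodes(data=True):
        if prop not in props:
            continue
        value = props[prop]
        # One shared node per distinct value, for example one node per location
        result.add_node(value, type=node_type)
        result.add_edge(node, value, type=edge_type)
        # Drop the property now that an edge carries the same information
        del result.nodes[node][prop]
    return result

g = nx.DiGraph()
g.add_node("jeremy_account", type="TwitterUser", location="Location A")
g.add_node("mark_account", type="TwitterUser", location="Location A")

h = promote_property_to_node(g, "location", "Location", "LOCATED_IN")
print(sorted(h.predecessors("Location A")))  # ['jeremy_account', 'mark_account']
```

With the location as a node, "who is in the same place as Jeremy?" becomes a short traversal rather than a scan over every node's properties. This sketch assumes that property values never collide with existing node IDs.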

This graph structure models the same data in a different way, which may be more or less suitable or performant for particular use cases. We will explore the effects of schema design on the types of questions that can be asked, and performance, in later chapters.

In the next section, we will compare graph data structures with traditional RDBs. This will expand on why GDBs can be more performant when a problem is naturally modeled as a graph.

# Comparing RDBs and GDBs

**RDBs** have been a standard for data storage and data analysis across most industries for a very long time. Their strength lies in being able to hold multiple tables of different information, where some table fields are found across tables, enabling data linkage.

With this data linkage, complex questions can be asked of data in an RDB. However, there are drawbacks to this relational structure. While RDBs are useful for returning a large number of rows that meet particular criteria, they are not suited to questions involving many chained relationships.

To illustrate this, consider a standard database containing train services and their station stops, alongside a graph that might represent the same information:

Figure 1.6 – Relational data structure of trains and their stops

In an RDB structure, it would not be difficult to retrieve all trains that service a particular stop. On the other hand, returning a series of trains that can be taken between two chosen stations may be a slow operation.

Consider the steps needed in a traditional RDB to find the route between Truro and Glasgow Central in the preceding table. Starting at Truro, we would know the **GW1426** train service stops at Truro, Liskeard, and Plymouth. Knowing that these stations can be reached from Truro, we would then need to find what train services stop at each of these stations to find our route.

Upon finding that Plymouth station is reachable and that a separate service runs to many more stations, we would need to repeat this process over and over until Glasgow Central is reached.

These steps essentially result in a series of computationally costly *join* operations on the same table, where one resulting row would give us the path between our stations of interest.
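To make the repeated joins concrete, here is a rough sketch using Python's built-in `sqlite3` module and a single `stops` table. Only the GW1426 service and its three stops come from the example above; `SERVICE_B`, `SERVICE_C`, and `Station X` are hypothetical stand-ins for the rest of the table, and the helper names are ours:

```python
import sqlite3

rows = [
    ("GW1426", "Truro"), ("GW1426", "Liskeard"), ("GW1426", "Plymouth"),
    ("SERVICE_B", "Plymouth"), ("SERVICE_B", "Station X"),        # hypothetical
    ("SERVICE_C", "Station X"), ("SERVICE_C", "Glasgow Central"),  # hypothetical
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (service TEXT, station TEXT)")
conn.executemany("INSERT INTO stops VALUES (?, ?)", rows)

def expand_one_hop(stations):
    """Self-join the table: all stations sharing a service with `stations`."""
    placeholders = ",".join("?" * len(stations))
    query = (
        "SELECT DISTINCT b.station "
        "FROM stops AS a JOIN stops AS b ON a.service = b.service "
        f"WHERE a.station IN ({placeholders})"
    )
    return {row[0] for row in conn.execute(query, sorted(stations))}

def joins_needed(start, goal):
    """Count the self-joins run before `goal` becomes reachable from `start`."""
    reached, joins = {start}, 0
    while goal not in reached:
        expanded = expand_one_hop(reached)
        joins += 1
        if expanded <= reached:  # nothing new was found, so no route exists
            return None
        reached = expanded
    return joins

print(joins_needed("Truro", "Glasgow Central"))  # 3
```

Each extra change of train costs another join over the whole table, and the set of candidate stations grows with every round.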

## GDBs to the rescue

Using a graph structure to represent this train network puts greater emphasis on relationships between our data points, as illustrated in the following diagram:


Figure 1.7 – Graph data structure of trains and their stops

Starting from Truro station, as in the RDB example, we find the train that services that station. However, when traversing the graph to find a possible route between Truro and Glasgow Central, at each station or train node we are considering fewer data points, and therefore fewer options.

This is in contrast to the RDB example, where repeated table joins are necessary to return a path. In this case, the complexity of the operations required over the graph representation is lower, which equates to a faster, more efficient method. Among many other use cases, those that require some sort of *pathfinding* often benefit from a graph data model.
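The graph version of the same question can be sketched with NetworkX, using station and train nodes as in Figure 1.7. As before, only GW1426 and its three stops come from the chapter's example; `SERVICE_B`, `SERVICE_C`, and `Station X` are hypothetical:

```python
import networkx as nx

services = {
    "GW1426": ["Truro", "Liskeard", "Plymouth"],
    "SERVICE_B": ["Plymouth", "Station X"],         # hypothetical
    "SERVICE_C": ["Station X", "Glasgow Central"],  # hypothetical
}

rail = nx.Graph()
for service, stations in services.items():
    rail.add_node(service, type="Train")
    for station in stations:
        rail.add_node(station, type="Station")
        rail.add_edge(service, station)  # the service stops at this station

# A single traversal returns the alternating stations and trains on the route
route = nx.shortest_path(rail, "Truro", "Glasgow Central")
print(route)
# ['Truro', 'GW1426', 'Plymouth', 'SERVICE_B', 'Station X', 'SERVICE_C', 'Glasgow Central']
```

The traversal only ever looks at the neighbors of the node it is currently visiting, rather than joining the full table at each step.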

In addition to being more suitable for specific types of queries, graphs are typically useful where a flexible, evolving data model is needed. Again, using the example of the train network, imagine that, as the database administrator, you have received a request to add bus transport links to the data model.

With an RDB, a new table would be required, since several bus services would likely serve each train station. In this new table, the names of each station would need to be duplicated from the existing table, to list alongside their related bus services.

Not only does this duplication increase the size of data stored, but it also increases the complexity of the database schema:

Figure 1.8 – Adding a new data type (buses) to the train station graph

Where the train station data is represented with a graph, the new information on buses can be added directly to the existing database as a new node type.

There is no requirement for a new table, and no need to duplicate each station node to represent the required information; the existing station nodes can be directly linked to new **Bus** nodes. This holds for any new data type that would require the addition of a new table in a traditional RDB.

Conversely, where new data would be represented in an equivalent RDB as a new column in an existing table, it may be a good candidate for a node property in the graph, as opposed to a new node type.

Here, an example suitable for being represented as a node property would be a code for each train station, where stations and their codes have a 1-to-1 relationship.
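Both kinds of extension can be sketched in a few lines of NetworkX. The bus service `BUS_1` and the station code `CODE_A` are placeholders invented for this illustration, as is the choice to link the bus directly to the station node:

```python
import networkx as nx

rail = nx.Graph()
rail.add_node("GW1426", type="Train")
for station in ["Truro", "Liskeard", "Plymouth"]:
    rail.add_node(station, type="Station")
    rail.add_edge("GW1426", station)

nodes_before = set(rail.nodes)

# New data type: add a new node type and link it to the existing station node
rail.add_node("BUS_1", type="Bus")
rail.add_edge("BUS_1", "Truro")

# 1-to-1 data: store it as a property on the existing node
rail.nodes["Truro"]["code"] = "CODE_A"

print(set(rail.nodes) - nodes_before)  # {'BUS_1'}
print(rail.nodes["Truro"])             # {'type': 'Station', 'code': 'CODE_A'}
```

No existing node had to be duplicated: the only new node is the bus itself, and the station code lives on the station it describes.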

A comparison, in short, is captured in the following:

- RDBs have a rigid data format and a new table must be added for a new type of data. GDBs are more flexible when it comes to the format of the data and can be extended with new node types.
- Path-based queries in an RDB – for example, how many steps separate two people in a friend network – involve multiple joins and can become extremely slow as the paths grow longer. GDBs traverse paths directly, without join operations, so this kind of information retrieval is more streamlined and typically faster.

In summary, where the use case for a database concerns querying many relationships between objects – that is, paths – or when a flexible data schema is needed, a graph data model is likely to be a good fit to represent your data.

# The use of graphs across various industries

Graph data science is prevalent across a wide array of industries.

The main areas where graphs are being used effectively are as follows:

- **Finance**: To look at fraud detection and portfolio risk.
- **Government**: To aid with intelligence profiling and supply chain analytics.
- **Life sciences**: For looking at patient journeys through a hospital (the transition of a patient through various services), drug response rates, and the transition of infections through a population.
- **Network & IT**: Security networks and user access management (nodes on a network represent each user logging into a network).
- **Telecoms**: Through network optimization and churn prediction.
- **Marketing**: Mainly for customer and market segmentation purposes.
- **Social media analysis**: We work for a company that specializes in platform moderation, online harm protection, and brand defense. By creating graphs to defend against attacks on brands, we can find vulnerable people or moderate the most severe type of content.

Graphs are pervasive in industry for the reasons we have already explored in this chapter. The ability to quickly link nodes and edges, and create relationships between them, is the reason why more problems in data science are being modeled as graph or network science problems. In addition, the underlying data can be queried at a rapid rate. This can be done instead of using traditional database solutions, which, as we have already identified, can be slow for relationship-heavy queries compared to GDBs.

In the next section, we will introduce the two main packages for graph analytics and modeling and show you their basic usage. In the subsequent chapters, we will keep building on why these packages are powerful.

# Introduction to NetworkX and igraph

In this section, we will introduce two Python packages for creating in-memory graphs: NetworkX and igraph.

NetworkX lets you create and manipulate graphs, and study and visualize their structures. Their website (https://networkx.org/) contains details of the major changes to the package and the intended usage of the tool.

igraph contains a suite of useful and practical analysis tools, with the aim being to make these efficient and easy to use, in a reproducible way. What is great about igraph is that it is open source and free, plus it can be used from *R*, *Python*, *Mathematica*, and *C*/*C++*. This is our recommended package for creating large networks, which it can load much more quickly than NetworkX. To read more about igraph, go to https://igraph.org/.

In the following subsections, we will look at the basics of both NetworkX and igraph, with easy-to-follow coding steps. This is the first time you are going to get your hands dirty with graph data modeling.

## NetworkX basics

NetworkX is one of the earliest graph libraries available for Python and is particularly focused on being user-friendly and Pythonic in its style. It also natively includes methods for calculating some classic network analysis measures:

1. To import NetworkX into Python, use the following command:

       import networkx as nx

2. And to create an empty graph, `g`, use the following command:

       g = nx.Graph()

3. Now, we need to add nodes to our graph, which can be done using methods of the `Graph` object belonging to `g`. There are multiple ways to do this, with the simplest being adding one node at a time:

       g.add_node("Jeremy")

4. Alternatively, multiple nodes can be added to the graph at once, like so:

       g.add_nodes_from(["Mark", "Jeremy"])

5. Properties can be added to nodes during creation by passing a node and dictionary tuple to `Graph.add_nodes_from`:

       g.add_nodes_from([("Mark", {"followers": 2100}), ("Jeremy", {"followers": 130})])

6. To add an edge to the graph, we can use the `Graph.add_edge` method, and reference the nodes already present in the graph:

       g.add_edge("Jeremy", "Mark")

   It is worth noting that, in NetworkX, when adding an edge, any nodes specified as part of that edge not already in the graph will be added implicitly.

7. To confirm that our graph now contains nodes and edges, we may want to plot it, using `matplotlib` and `networkx.draw()`. The `with_labels` parameter adds the names of the nodes to the plot:

       import matplotlib.pyplot as plt
       nx.draw(g, with_labels=True)
       plt.show()
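The steps above can be gathered into one short script that inspects the graph without plotting it. The user `Sam` is a hypothetical extra node, added only to demonstrate implicit node creation:

```python
import networkx as nx

g = nx.Graph()
g.add_nodes_from([("Mark", {"followers": 2100}), ("Jeremy", {"followers": 130})])
g.add_edge("Jeremy", "Mark")

# Inspect nodes together with their properties
print(list(g.nodes(data=True)))
# [('Mark', {'followers': 2100}), ('Jeremy', {'followers': 130})]
print(g.number_of_edges())  # 1

# "Sam" is not in the graph yet, so add_edge creates the node implicitly
g.add_edge("Mark", "Sam")
print("Sam" in g)      # True
print(g.nodes["Sam"])  # {} (created without any properties)
```

Implicit creation is convenient, but a misspelled node name will silently create a new, property-less node, so it is worth checking `g.nodes` after loading data.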

This section showed you how you can get up and running with NetworkX in a few lines of Python code. In the next section, we will turn our focus to the popular `igraph` package, which allows us to perform calculations over larger datasets much more quickly than NetworkX.

## igraph basics

NetworkX, while user-friendly, suffers from slow speeds when using larger graphs. This is due to its implementation behind the scenes: it is written largely in pure Python, and only some numerical routines are delegated to optional compiled libraries written in C, C++, and Fortran.

In contrast, the core of igraph is implemented in C, giving the library an advantage when working with large graphs and complex network algorithms. While not as immediately accessible as NetworkX for beginners, igraph is a useful tool to have under your belt when code efficiency is paramount.

Initially, working with igraph is very similar to working with NetworkX. Let’s take a look:

1. To import igraph into Python, use the following command:

       import igraph as ig

2. And to create an empty graph, `g`, use the following command:

       g = ig.Graph()

   In contrast to NetworkX, in igraph, all nodes have a prescribed internal integer ID. The first node that’s added has an ID of 0, with all subsequent nodes assigned increasing integer IDs.

3. Similar to NetworkX, changes can be made to a graph by using the methods of a `Graph` object. Nodes can be added to the graph with the `Graph.add_vertices` method (note that a vertex is another way to refer to a node). Two nodes can be added to the graph with the following code:

       g.add_vertices(2)

4. This will add nodes 0 and 1 to the graph. To name them, we have to assign properties to the nodes. We can do this by accessing the vertices of the `Graph` object. Similar to how you would access elements of a list, each node’s properties can be accessed by using the following notation. Here, we are setting the `name` and `followers` attributes of nodes 0 and 1:

       g.vs[0]["name"] = "Jeremy"
       g.vs[1]["name"] = "Mark"
       g.vs[0]["followers"] = 130
       g.vs[1]["followers"] = 2100

5. Node properties can also be added listwise, where the first list element corresponds to node ID 0, the second to node ID 1, and so on. The following two lines are equivalent to the four lines shown in *step 4*:

       g.vs["name"] = ["Jeremy", "Mark"]
       g.vs["followers"] = [130, 2100]

6. To add an edge, we can use the `Graph.add_edges()` method:

       g.add_edges([(0, 1)])

Here, we are only adding one edge, but additional edges can be added to the list parameter required by `add_edges`. Unlike NetworkX, igraph does not create missing nodes implicitly: an edge can only reference vertex IDs that already exist in the graph. Since igraph assigns nodes sequential IDs, attempting to add the edge pair (1, 3) to a graph with two vertices will fail.

# Summary

In this chapter, we looked at why you should start to think in graphs and why these methods are becoming widely utilized and discussed approaches in various industries. We looked at what a graph is and explained the main areas that graph methods fall into: how things move through a network, influence (who is influencing whom on social media), identifying groups and interactions, and detecting patterns in a network.

Moving on from that, we examined the fundamentals of what makes up a graph. Here, we looked at the fundamental elements of nodes, edges, and properties and delved into the difference between an undirected and directed graph. Additionally, we examined the properties of nodes, looked at heterogeneous graphs, and examined schema design considerations.

This led to how GDBs compare to legacy RDBs and why, in many cases, graphs are much easier and faster to traverse and query, with examples of how graphs can be utilized to find a route between the stops on a train journey and how this can be extended, with ease, to add bus stops as well, as a new data source.

Following this, we looked at how graphs are being deployed across various industries and some use cases for why graphs are important in those industries, such as fraud detection in the finance sector, intelligence profiling in the government sector, patient journeys in hospitals, churn across networks, and customer segmentation in marketing. Graphs truly are becoming ubiquitous across various industries.

We wrapped up this chapter by providing an introduction to the powerhouses of graph analytics and network analysis – igraph and NetworkX. We showed you how, in a few lines of Python code, you can easily start to populate a graph.

In the next chapter, we will look at how to work with and create graph data models. The next chapter will contain many more hands-on examples of how to structure your data using graph data models in Python.