
You're reading from  Graph Data Science with Neo4j

Product type: Book
Published in: Jan 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781804612743
Edition: 1st Edition
Author (1)
Estelle Scifo

Estelle Scifo has over 7 years of experience as a data scientist, after receiving her PhD from the Laboratoire de l'Accélérateur Linéaire, Orsay (affiliated to CERN in Geneva). As a Neo4j certified professional, she uses graph databases on a daily basis and takes full advantage of their features to build efficient machine learning models out of this data. In addition, she is a data science mentor who guides newcomers into the field. Her domain expertise and deep insight into beginners' needs make her an excellent teacher.

Characterizing a Graph Dataset

Two graphs can differ in many ways, depending on their number of nodes or types of edges, for instance. But many more metrics exist to characterize them so that we can get an idea of the graph based on some numbers. Just as the mean value and standard deviation help in comprehending a numeric variable distribution, graph metrics help in understanding the graph topology: is it a highly connected graph? Are there isolated nodes?

In this chapter, we are going to learn about a few metrics for characterizing a graph. We will focus on the degree and degree distribution, which will give us the opportunity to draw our first plot using the NeoDash graph application. We will also use the Neo4j Python driver to extract data from Neo4j into a DataFrame and perform some basic analysis of this data.

In this chapter, we’re going to cover the following main topics:

  • Characterizing a graph from its node and edge properties
  • Computing the graph degree...

Technical requirements

To be able to reproduce the examples provided in this chapter, you’ll need the following tools:

  • Neo4j installed on your computer (see the installation instructions in Chapter 1, Introducing and Installing Neo4j).
  • Python and Jupyter Notebook installed. Their installation is not covered in this book.
  • You’ll also need the following Python packages:
    • matplotlib
    • pandas
    • neo4j
  • An internet connection to download the plugins and the dataset and to use the public API in the last section of this chapter.
  • Any code listed in the book will be available in the associated GitHub repository, https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j, in the corresponding chapter folder.

Characterizing a graph from its node and edge properties

There is not a single type of graph. Each graph has specific characteristics, depending on the process it models. This section describes some of the characteristics you should examine when starting your journey with a new dataset.

Link direction

Links between nodes can be directed (and are then called arcs in graph theory) or undirected (and are called edges).

While graph theory makes the distinction between directed and undirected links in their naming, the graph database vocabulary usually doesn’t, and all links are called edges or relationships, regardless of whether they’re considered directed or not. In a more general way, I’ll stick to the wording used within the Neo4j Graph Data Science Library, which may sound inaccurate to graph theorists.
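To make the distinction concrete, here is a minimal Python sketch (not taken from the book) that stores the same link once as directed and once as undirected. The `build_adjacency` helper and the sample node names are hypothetical and only serve as an illustration.

```python
def build_adjacency(links, directed):
    """Build a node -> set-of-neighbors mapping from (source, target) pairs.

    Hypothetical helper for illustration only.
    """
    adjacency = {}
    for source, target in links:
        # The source always "sees" the target
        adjacency.setdefault(source, set()).add(target)
        # Make sure the target exists as a node, even without outgoing links
        adjacency.setdefault(target, set())
        if not directed:
            # In an undirected graph, the link can be followed both ways
            adjacency[target].add(source)
    return adjacency


links = [("you", "X")]
directed_view = build_adjacency(links, directed=True)
undirected_view = build_adjacency(links, directed=False)

print(directed_view)    # "X" has no outgoing link
print(undirected_view)  # "X" is connected back to "you"
```

With a directed link, only `you` can reach `X`; with an undirected one, each node appears in the other's neighbor set.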

Undirected graphs include the following:

  • Facebook social network: If you are connected to X, X is also connected to you.
  • Co...

Computing the graph degree distribution

After the number of nodes and edges, the node's degree is one of the first metrics to compute when studying a new graph. Its distribution tells us whether the edges are evenly spread across nodes or whether a few nodes monopolize almost all connections, leaving the others nearly disconnected. We will first define the node's degree, and then learn how to compute it with Cypher and draw its distribution using the NeoDash graph application.
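As a rough illustration of what a degree distribution is, the following Python sketch counts how many nodes have each degree value in a small undirected graph. It is independent of Neo4j; the `degree_distribution` function, the node names, and the edge list are assumptions made up for this example.

```python
from collections import Counter


def degree_distribution(nodes, edges):
    """Return {degree: number of nodes with that degree} for an undirected graph."""
    degree = Counter()
    for u, v in edges:
        # Each undirected edge adds one to the degree of both endpoints
        degree[u] += 1
        degree[v] += 1
    # Nodes without any edge have degree 0 (Counter returns 0 for missing keys)
    distribution = Counter(degree[node] for node in nodes)
    return dict(sorted(distribution.items()))


# Hypothetical graph: E is an isolated node
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

distribution = degree_distribution(nodes, edges)
print(distribution)  # {0: 1, 1: 1, 2: 2, 3: 1}
```

Here one node is isolated (degree 0) and one node holds three of the four edges, which is exactly the kind of imbalance the distribution is meant to reveal.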

Definition of a node’s degree

The degree of a node is the number of links connected to this node. For undirected graphs, there is only one degree, since we just count all the edges connected to a given node. For directed graphs, we can compute the node’s degree in three different ways:

  • Incoming degree: We count only the edges pointing toward the node
  • Outgoing degree: We count only the edges pointing away from the node
  • Total degree: We count all edges attached to a node, regardless...
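The three variants can be sketched in plain Python on a small directed edge list. This is only an illustration of the definitions above, not the Cypher approach used in the chapter; the `directed_degrees` function and the sample edges are hypothetical.

```python
from collections import Counter


def directed_degrees(edges):
    """Compute incoming, outgoing, and total degree for each node.

    `edges` is a list of (source, target) pairs.
    """
    incoming = Counter()
    outgoing = Counter()
    for source, target in edges:
        outgoing[source] += 1  # edge leaves the source
        incoming[target] += 1  # edge points toward the target
    nodes = set(incoming) | set(outgoing)
    return {
        node: {
            "in": incoming[node],
            "out": outgoing[node],
            "total": incoming[node] + outgoing[node],
        }
        for node in nodes
    }


# Hypothetical directed graph: A -> B, A -> C, B -> C
degrees = directed_degrees([("A", "B"), ("A", "C"), ("B", "C")])
for node in sorted(degrees):
    print(node, degrees[node])
```

In this example all three nodes have the same total degree (2), yet their incoming and outgoing degrees differ, which shows why direction matters when reading degree values.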

Installing and using the Neo4j Python driver

We can use the Neo4j Python driver to fetch data from Neo4j and analyze it from Python. In this section, we are going to plot the degree distribution using Python visualization packages.

Counting node labels and relationship types in Python

Let’s open the Neo4j_Driver notebook (https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4J/blob/main/Chapter03/notebooks/Neo4j_Driver.ipynb). To install the Neo4j driver, run the following code in the first cell:

!pip install neo4j

Let’s instantiate the driver and fetch our first bit of data from Neo4j:

  1. First, import the required objects:
    from collections import defaultdict
    from neo4j import GraphDatabase
  2. Then, instantiate a driver object, providing it with the connection parameters:
    driver = GraphDatabase.driver(
        "bolt://localhost:7687",
        auth=("neo4j", "<PASSWORD>")
    )
  3. With...
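The remaining steps are cut off in this preview, so the following is not the notebook's code. It is a minimal sketch of one way to count node labels with the driver, assuming the same connection parameters as above. The helper names `count_labels` and `fetch_label_rows` are hypothetical, and the sample rows stand in for real query results so that the counting logic can be tried without a running database.

```python
from collections import defaultdict


def count_labels(rows):
    """Count how many nodes carry each label.

    `rows` is an iterable of dicts with a "labels" key (a list of strings),
    which is the shape produced by the query in fetch_label_rows.
    """
    counts = defaultdict(int)
    for row in rows:
        for label in row["labels"]:
            counts[label] += 1
    return dict(counts)


def fetch_label_rows(uri, user, password):
    """Fetch the labels of every node (assumes a reachable Neo4j instance)."""
    # Imported here so count_labels can be used without the driver installed
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run("MATCH (n) RETURN labels(n) AS labels")
            # Convert each record to a plain dict before the session closes
            return [record.data() for record in result]
    finally:
        driver.close()


# Made-up rows imitating query results; with a database you would call
# fetch_label_rows("bolt://localhost:7687", "neo4j", "<PASSWORD>") instead
sample_rows = [
    {"labels": ["Person"]},
    {"labels": ["Movie"]},
    {"labels": ["Person", "Actor"]},
]
label_counts = count_labels(sample_rows)
print(label_counts)  # {'Person': 2, 'Movie': 1, 'Actor': 1}
```

The counting could also be done directly in Cypher; doing it in Python here simply shows how records returned by the driver can be processed on the client side.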

Learning about other characterizing metrics

The degree is not the only metric that can be computed to characterize a graph. Let’s look at a graph detail page on the Network Repository Project (for instance, https://networkrepository.com/socfb-UVA16.php). It contains data about the number of nodes, edges, degrees, and other metrics, such as the number of triangles and clustering coefficient.

In the rest of this section, we will provide definitions for some of the metrics listed in the preceding Figure 3.11. We will refer to this section in the next few chapters when we use graph-based metrics to build a machine learning model.

Triangle count

A triangle is a set of three nodes that are all connected to each other. In a directed graph, edge orientation needs to be taken into account.

For a given node, n, its triangle count is found by counting the pairs of its neighbors that are themselves connected to each other. Look at the following undirected graph:

...
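The neighbor check described above can be sketched in plain Python for an undirected graph. The graph below is a made-up example, not the one from the book's figure, and `triangle_counts` is a hypothetical helper name.

```python
from itertools import combinations


def triangle_counts(edges):
    """Return {node: number of triangles the node belongs to} (undirected graph)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    counts = {}
    for node, nbrs in neighbors.items():
        # A triangle exists for each pair of neighbors that are linked together
        counts[node] = sum(
            1 for a, b in combinations(sorted(nbrs), 2) if b in neighbors[a]
        )
    return counts


# Hypothetical graph: A, B, and C form a triangle; D hangs off C
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
triangles = triangle_counts(edges)
print(triangles)  # {'A': 1, 'B': 1, 'C': 1, 'D': 0}
```

Each triangle is counted once per node it contains, so summing the per-node counts and dividing by three gives the number of distinct triangles in the graph (one in this example).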

Summary

This chapter taught you some aspects of graph statistics. You now know a few metrics you need to compute when you first start analyzing a new graph, from the number of nodes/edges and the node and edge types to degree-related metrics and distribution.

You also installed the Neo4j Python driver and learned how to extract data from Neo4j to Python and create a DataFrame from data exported from Neo4j.

In the next chapter, we will dig deeper into graph analytics by using unsupervised graph algorithms to learn even more about graph topology. We will learn how to find clusters or communities of nodes in the graph. On the way, we will install and learn about the basic principles of the Neo4j Graph Data Science Library, the plugin we will use intensively in the rest of this book.

Further reading

To better understand some of the concepts that were just approached in this chapter, you can refer to the following resources:

Exercises

Challenge yourself with the following exercises related to the content covered in this chapter:

  1. Can you imagine an example of a tri-partite graph?
  2. Create the RELATED_TO relationship between movies that share at least one person (as actor or director).
  3. Update the Cypher query we used to compute the degree distribution to obtain the normalized degree (divide by the total number of nodes in the graph).
  4. Can you draw the weighted degree distribution (total)?
     Hint: The weighted total degree is the sum of all weights of relationships attached to a given node.
  5. Advanced: Can you write a Cypher query to compute the triangle count for each node?

Here is the code to create the small graph we used as an example in Neo4j:

CREATE (A:Label {id: "A"})
CREATE (B:Label {id: "B"})
CREATE (C:Label {id: "C"})
CREATE (D:Label {id: "D"})
CREATE (E:Label {id: "E"})
CREATE (A)-[:REL]->(B)
CREATE...