You're reading from  Graph Data Science with Neo4j

Product type: Book
Published in: Jan 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781804612743
Edition: 1st Edition
Author: Estelle Scifo

Estelle Scifo has over 7 years of experience as a data scientist, after receiving her PhD from the Laboratoire de l'Accélérateur Linéaire, Orsay (affiliated to CERN in Geneva). As a Neo4j certified professional, she uses graph databases on a daily basis and takes full advantage of their features to build efficient machine learning models out of this data. In addition, she is a data science mentor who guides newcomers into the field. Her domain expertise and deep insight into beginners' needs make her an excellent teacher.

Writing Your Custom Graph Algorithms with the Pregel API in Java

In this final chapter related to creating data science projects on graphs using Neo4j and its plugins, we are going to use an advanced feature of GDS: the Pregel API. This API lets us use the optimized in-memory projected graph to run an algorithm written in Java. GDS takes care of everything else, including parallelism and how to return the result (stream or write back to Neo4j). We will use the PageRank algorithm as an example and learn about its principles before studying a small Python implementation. Then, we will implement it with the Pregel API and test our algorithm with the GDS tools. Finally, we will build the JAR file needed to run our algorithm from Cypher, like any other GDS algorithm.

In this chapter, we’re going to cover the following main topics:

  • Introducing the Pregel API
  • Implementing the PageRank algorithm
  • Testing our code
  • Using our algorithm from Cypher

Technical requirements

To be able to reproduce the examples given in this chapter, you’ll need the following tools:

  • Neo4j 5.x installed on your computer (see the installation instructions in Chapter 1, Introducing and Installing Neo4j).
    • The GDS plugin (version >= 2.2).
  • Java (OpenJDK 11). We will also use the Gradle build tool.
  • It is advised that you use an IDE to manage dependencies and build projects in Java.

Introducing the Pregel API

Neo4j allows you to write plugins: by following its extension API, you can write Java code that can then be used from Cypher through, for instance, the CALL statement. This requires that the JAR file containing your code has been placed in the plugins folder and that your code has been properly annotated so that Neo4j can find the relevant information. That’s how APOC and GDS are implemented. Thanks to the Pregel API, we can extend not only Neo4j but GDS itself, leveraging its main features.

In this section, we will cover these GDS features and the basic principles behind the Pregel API.

GDS’s features

GDS allows users to extend it while taking advantage of many common functionalities, such as the following:

  • In-memory projected graph: We won’t have to write code to create a projected graph – we can directly work on an existing projected graph in the GDS graph catalog.
  • Stream/write/mutate procedures: The execution modes are automatically...
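To make the vertex-centric idea behind Pregel more concrete, here is a toy sketch in Python. It is not the GDS Pregel API: the names `run_supersteps` and `propagate_max`, as well as the dictionary-based graph, are assumptions made purely for illustration. In each superstep, every node reads the messages sent to it during the previous superstep, updates its own value, and optionally sends a message to its out-neighbors. The computation stops when no messages are sent or when the iteration cap is reached.

```python
from collections import defaultdict


def run_supersteps(graph, initial_values, compute, max_iterations=20):
    """Toy vertex-centric loop (illustrative only, not the GDS Pregel API).

    graph: dict mapping every node to the list of its out-neighbors.
    compute: function (node, value, messages, superstep) -> (new_value, message);
             a message that is not None is sent to every out-neighbor.
    """
    values = dict(initial_values)
    inbox = defaultdict(list)  # messages delivered at the start of a superstep
    for superstep in range(max_iterations):
        outbox = defaultdict(list)  # messages to deliver in the next superstep
        for node, neighbors in graph.items():
            new_value, message = compute(node, values[node], inbox[node], superstep)
            values[node] = new_value
            if message is not None:
                for neighbor in neighbors:
                    outbox[neighbor].append(message)
        if not outbox:  # nobody sent anything, so nothing can change anymore
            break
        inbox = outbox
    return values


def propagate_max(node, value, messages, superstep):
    """Adopt the largest value seen so far; forward it only when it changed."""
    best = max([value, *messages])
    changed = superstep == 0 or best > value
    return best, (best if changed else None)


# A three-node cycle: the largest initial value spreads to every node.
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(run_supersteps(cycle, {"a": 3, "b": 1, "c": 2}, propagate_max))
# → {'a': 3, 'b': 3, 'c': 3}
```

The point of the design is that the per-node function only sees its own value and its incoming messages, which is what allows a framework to distribute the node computations across threads.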

Implementing the PageRank algorithm

As an example, we will use the PageRank algorithm. It is a centrality metric developed by Larry Page and Sergey Brin, Google’s co-founders, to rank the search engine’s results.

In this section, we will dig into the algorithm’s mechanisms and work on a simple implementation using Python before implementing the algorithm in Java, leveraging the Pregel API.

The PageRank algorithm

This algorithm is based on the following assumptions:

  1. The more connections you have, the more important you are.
  2. Not all connections share the same weight. For example, a backlink from the New York Times drives more traffic to your website than a backlink from a less popular website. Scores are propagated from neighbors to account for each neighbor’s importance.
  3. At the same time, links from a website with fewer links show more relevance. Imagine that there’s a Wikipedia article linking every single noun to the corresponding...
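The assumptions above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the implementation used later in the chapter: the function name `page_rank`, the damping factor of 0.85, the initial score of 1.0, and the dictionary-based graph are all assumptions. It uses the classic unnormalized update, PR(n) = (1 - d) + d * sum(PR(m) / out_degree(m)), where the sum runs over the nodes m that link to n.

```python
def page_rank(graph, damping=0.85, max_iterations=20):
    """Unnormalized PageRank on a small in-memory graph (illustrative sketch).

    graph: dict mapping every node to the list of nodes it links to.
           Every link target must also appear as a key.
    """
    scores = {node: 1.0 for node in graph}  # assumed initial score
    for _ in range(max_iterations):
        incoming = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            if not targets:
                continue  # dangling node: its score is not propagated in this sketch
            # Assumption 3: a node's score is split across its outgoing links.
            share = scores[node] / len(targets)
            for target in targets:
                # Assumptions 1 and 2: each incoming link adds to the score,
                # in proportion to the importance of the linking node.
                incoming[target] += share
        scores = {
            node: (1 - damping) + damping * incoming[node] for node in graph
        }
    return scores


# Two pages link to "a", which therefore ends up with the highest score.
scores = page_rank({"a": [], "b": ["a"], "c": ["a"]})
print(max(scores, key=scores.get))  # → a
```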

Testing our code

GDS also provides a utility for writing unit tests for our code. We are going to test both implementations, PageRank and PageRankTol, starting with the former.

Test for the PageRank class

Let’s get started and detail the code block by block:

  1. First, we must define our test class, called PageRankTest:
    @GdlExtension
    class PageRankTest {
  2. Then, we must create a graph object from a Cypher string, using NATURAL orientation for the relationship (the default one):
        @GdlGraph(orientation = Orientation.NATURAL)
        private static final String TEST_GRAPH =
                "CREATE" +
                        "  (A:Node)" +
                        "...

Using our algorithm from Cypher

This is the last step of plugin development: we are going to annotate our class so that it is exposed as a procedure that can be called from Cypher, then build the JAR file and test it from Cypher.

Adding annotations

Before generating the JAR file, we need to annotate our PageRank class so that we can configure how it can be used from Neo4j:

@PregelProcedure(name = "gdsbook.pr", modes = {GDSMode.STREAM, GDSMode.MUTATE, GDSMode.WRITE})
public class PageRank implements PregelComputation<PageRank.Config> {
// rest of the code is unchanged ...
}

Here, we specify two important parameters:

  • The procedure’s name: This is the name under which the procedure is known by Cypher. In short, it means that we will be able to write the following in Cypher:
    CALL gdsbook.pr.<mode>(…)
  • The procedure’s modes: Here, we can choose one or more of the GDS modes (stream, mutate, write, and stats) that will be available.
  • Optionally, we can...

Summary

This is the end of this chapter, where you were introduced to the method you can use to extend GDS and take advantage of all the common features we are looking for when dealing with graph analytics: memory and CPU performance. The projected graph and GDS internal management of job batches are easily accessible to us if we write a couple of Java classes.

We also studied the PageRank algorithm and implemented two versions of it: one relying only on the maximum number of iterations as the stopping criterion, and another that also considers the stability of the computed scores compared to the previous iteration, within a certain tolerance. We also learned how to unit test our algorithm by writing a simple test that runs it on a sample graph, which we were able to define by writing a Cypher CREATE statement.
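As a reminder of how the two stopping criteria differ, here is a small Python sketch. It is an illustration only: the helper name `iterate_until_stable`, the default values, and the choice of measuring stability as the largest absolute change of any single score are assumptions, not the chapter's Java code. The loop stops as soon as the scores are stable within the tolerance, and falls back to the iteration cap otherwise.

```python
def iterate_until_stable(step, scores, tolerance=1e-4, max_iterations=20):
    """Apply `step` until no score moves by more than `tolerance`.

    step: function taking the current scores (a dict) and returning new scores.
    Returns the final scores and the number of iterations actually run.
    """
    for iteration in range(1, max_iterations + 1):
        new_scores = step(scores)
        # Assumed stability measure: the largest absolute change of any score.
        max_change = max(
            (abs(new_scores[n] - scores[n]) for n in scores), default=0.0
        )
        scores = new_scores
        if max_change <= tolerance:
            return scores, iteration
    return scores, max_iterations


def halfway_to_one(scores):
    """Toy update rule: every score moves halfway toward 1.0."""
    return {node: (value + 1.0) / 2 for node, value in scores.items()}


# The change halves at every iteration, so the loop stops well before the cap.
final, iterations = iterate_until_stable(halfway_to_one, {"a": 0.0})
print(iterations)  # → 14
```

With a cap lower than the number of iterations needed to reach the tolerance, the function simply returns after `max_iterations`, which corresponds to the first stopping criterion.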

This chapter is also the end of this book! We have come a long way since Chapter 1, Introducing and Installing Neo4j, where we introduced the concept of graphs, and...

Further reading

If you want to learn more about the topics covered in this chapter, I recommend the following resources:

Exercises

Practice with the Pregel API and write an algorithm. If you need some intermediate steps, here are a couple of exercises to help you get started:

  1. Update the Python implementation so that it computes the normalized PageRank given by the following formula:

Here, N is the total number of nodes in the graph.

Warning: Be careful with the score initialization.

  2. Again, using the Python implementation, take into account relationship weights. Hint: Relationship weights enter the outgoing degree part of the formula.
  3. Update the Java implementation to track the PR values at each step. We want to be able to see the evolution of PR at each iteration when calling the algorithm in Cypher.

This means adding a new field to our schema of the double[] type and extending it at each iteration.
