
You're reading from  Graph Data Science with Neo4j

Product type: Book
Published in: Jan 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781804612743
Edition: 1st Edition
Author (1)
Estelle Scifo

Estelle Scifo has over 7 years of experience as a data scientist, after receiving her PhD from the Laboratoire de l'Accélérateur Linéaire, Orsay (affiliated to CERN in Geneva). As a Neo4j certified professional, she uses graph databases on a daily basis and takes full advantage of their features to build efficient machine learning models out of this data. In addition, she is a data science mentor who guides newcomers into the field. Her domain expertise and deep insight into beginners' needs make her an excellent teacher.

Predicting Future Edges

Link prediction (LP) is a key topic in Graph Data Science (GDS), since it is a problem very specific to graphs. While classification can be performed on many kinds of datasets, not only graphs, LP can only be performed if we have links, meaning our data is a graph. Its applications are nonetheless quite wide: from understanding the dynamics of social networks to product recommendations to criminal network analysis.

This chapter is going to give you a short introduction to the LP problem. We will define what observations are and how to build the initial dataset. We will also talk about the metrics that can be used to infer the presence of a hidden or future link and compute them using the GDS library. Finally, we will use a GDS pipeline to build a simple link prediction model, fit it on data stored in Neo4j, and make predictions.

In this chapter, we’re going to cover the following main topics:

  • Introducing the LP problem
  • LP features...

Technical requirements

In order to be able to reproduce the examples given in this chapter, you’ll need the following tools:

  • Neo4j 5.x installed on your computer (see the installation instructions in Chapter 1, Introducing and Installing Neo4j)
    • The GDS plugin (version >= 2.2)
  • A Python environment with Jupyter to run notebooks
  • Any code listed in the book is available in the associated GitHub repository, https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j, in the corresponding chapter folder

Code samples

Unless otherwise indicated, all code snippets in this chapter and the following ones use the GDS Python client. Library import and client initialization are omitted in this chapter for brevity, but a detailed explanation can be found in Chapter 6, Building a Machine Learning Model with Graph Features, in the Introducing the GDS Python client section. Also note that the code in the code bundle provided with the book is fully runnable and...

Introducing the LP problem

Let’s pause for a minute and understand what exactly LP is and how we can formulate this kind of problem using machine learning (ML) vocabulary.

LP examples

To understand what LP is, let’s look at some real-life scenarios where it is used:

  • Social networks: In a social network containing people who have certain relationships with each other, we can try and predict who the next people to meet or collaborate on a project will be. We can think of the following types of relationships, but there are many more:
    • Social media (know, follow)
    • Communication network (phone call)
    • Research paper authors: co-authorship of a research paper (research collaboration)
  • Criminal networks: A criminal network, by nature, is not fully known to the people analyzing it (police authorities). The LP technique helps in identifying unknown links between people and better predicting criminal behavior.
  • Entity resolution: Sometimes...
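In ML vocabulary, LP is commonly framed as binary classification over pairs of nodes: pairs that are linked are positive observations, and a sample of pairs that are not linked serves as negative observations. The following is a minimal sketch of that framing in plain Python, not the book's code; the function name `build_lp_observations`, the `negative_ratio` parameter, and the toy graph are assumptions made for illustration.

```python
import itertools
import random


def build_lp_observations(nodes, edges, negative_ratio=1.0, seed=42):
    """Return a list of ((u, v), label) observations for an undirected graph.

    label is 1 for an existing link and 0 for a sampled non-link.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Store each undirected edge once, with its endpoints in sorted order.
    positives = {tuple(sorted(edge)) for edge in edges}
    # Every unordered pair of distinct nodes that is not linked is a candidate negative.
    candidates = [
        pair
        for pair in itertools.combinations(sorted(nodes), 2)
        if pair not in positives
    ]
    # Sample as many negatives as requested, capped by what is available.
    n_negatives = min(len(candidates), int(round(negative_ratio * len(positives))))
    negatives = rng.sample(candidates, n_negatives)
    return [(pair, 1) for pair in sorted(positives)] + [(pair, 0) for pair in negatives]


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("D", "C")]
for pair, label in build_lp_observations(nodes, edges):
    print(pair, label)
```

Enumerating all non-linked pairs only works for small graphs; on large graphs, negatives are usually drawn by random sampling instead, because the number of candidate pairs grows quadratically with the number of nodes.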

LP features

Here, we’ll describe the characteristics that can be attached to a pair of nodes and used as predictors for an LP model. We’ll start with topological features, which are built by analyzing both nodes’ neighborhoods. Then, we’ll explore how to use each node’s features and combine them into a single feature vector for the pair.
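To make the idea of combining two nodes' features into one vector for the pair more concrete, here is a small sketch of three common combiners: the element-wise (Hadamard) product, the element-wise squared difference, and cosine similarity. The function names and the example vectors are assumptions for illustration and are not taken from the book.

```python
import math


def hadamard(u, v):
    """Element-wise product of two equal-length feature vectors."""
    return [a * b for a, b in zip(u, v)]


def squared_difference(u, v):
    """Element-wise squared difference of two equal-length feature vectors."""
    return [(a - b) ** 2 for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity as a single scalar; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Hypothetical feature vectors for two nodes (for example, embeddings).
node_a = [1.0, 2.0, 0.0]
node_b = [3.0, 4.0, 1.0]

# One possible feature vector for the pair: concatenate the combiner outputs.
pair_features = hadamard(node_a, node_b) + [cosine(node_a, node_b)]
print(pair_features)
```

All three combiners are symmetric in their arguments, which suits undirected links: the pair (A, B) gets the same features as (B, A).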

Topological features

Topological features rely on nodes’ neighborhoods and graph topology to infer new links. We can, for instance, use the following:

  • Common neighbors: Given two nodes, A and B, count the number of common neighbors between A and B. This metric assumes that the more common neighbors A and B have, the more likely they are to be connected.
  • Adamic-Adar: A variation of the common neighbors approach, the Adamic-Adar metric incorporates the fact that nodes with fewer connections give more information than nodes with many links. In a web page linking hundreds of other pages, the relevance of each...
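The two metrics above can be sketched in a few lines of plain Python. This is an illustration on a toy undirected graph, not the GDS implementation: the helper names and the edge list are assumptions. The Adamic-Adar score is taken here as the sum, over common neighbors z, of 1 / log(degree(z)), so that low-degree neighbors contribute more.

```python
import math


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def common_neighbors(adjacency, a, b):
    """Number of nodes linked to both a and b."""
    return len(adjacency.get(a, set()) & adjacency.get(b, set()))


def adamic_adar(adjacency, a, b):
    """Sum of 1 / log(degree) over the common neighbors of a and b.

    Assumes a != b in a graph without self-loops, so every common
    neighbor has degree >= 2 and the logarithm is strictly positive.
    """
    shared = adjacency.get(a, set()) & adjacency.get(b, set())
    return sum(1.0 / math.log(len(adjacency[z])) for z in shared)


# Toy graph: C links only A and B, while D is a hub with four neighbors.
edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("D", "E"), ("D", "F")]
adjacency = build_adjacency(edges)

print(common_neighbors(adjacency, "A", "B"))
print(adamic_adar(adjacency, "A", "B"))
```

In this toy graph, A and B share both C and D, but C (degree 2) contributes more to the Adamic-Adar score than the hub D (degree 4).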

Building an LP pipeline with the GDS

Our task will be to predict the future collaboration of actors and directors, using the homogeneous graph made of Person nodes and KNOWS relationships. We will only use the persons in the main component according to the connected component algorithm, identified by the MainComponent label.

Creating and configuring the pipeline

The process of creating, training, and making predictions with a GDS pipeline is very similar to the node classification case. We will detail the steps in the following subsections.
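As a rough overview of those steps, the sketch below shows how an LP pipeline could be created, configured, and trained with the GDS Python client. It is not the book's code: the pipeline and model names, the choice of FastRP embeddings combined with a Hadamard link feature, and all parameter values are assumptions. The `gds.beta.pipeline.linkPrediction` namespace is the one used around GDS 2.2 and may differ in other versions. The function expects an already initialized client `gds` and a projected graph object in which KNOWS is projected as undirected.

```python
def train_lp_pipeline(gds, graph, pipeline_name="lp-collab-pipe", model_name="lp-collab-model"):
    """Create, configure, and train a link prediction pipeline (illustrative sketch).

    `gds` is an initialized GraphDataScience client and `graph` a projected graph
    object; every name and parameter value below is an assumption.
    """
    # 1. Create an empty link prediction pipeline in the pipeline catalog.
    pipe, _ = gds.beta.pipeline.linkPrediction.create(pipeline_name)

    # 2. Add a node property step: here, FastRP embeddings stored as "embedding".
    pipe.addNodeProperty(
        "fastRP",
        mutateProperty="embedding",
        embeddingDimension=64,
        randomSeed=42,
    )

    # 3. Combine both nodes' embeddings into one feature vector for the pair.
    pipe.addFeature("hadamard", nodeProperties=["embedding"])

    # 4. Configure the train/test split and the negative sampling ratio.
    pipe.configureSplit(
        trainFraction=0.6,
        testFraction=0.2,
        validationFolds=3,
        negativeSamplingRatio=1.0,
    )

    # 5. Add a model candidate with default hyperparameters.
    pipe.addLogisticRegression()

    # 6. Train on the projected graph, predicting KNOWS relationships.
    model, train_result = pipe.train(
        graph,
        modelName=model_name,
        targetRelationshipType="KNOWS",
        randomSeed=42,
    )
    return model, train_result
```

Wrapping the steps in a function that receives the client keeps the sketch independent of connection details; running it still requires a live Neo4j instance with the GDS plugin installed.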

Building the projected graph

First, we are going to create a projected graph, as follows:

projected_graph_object = create_projected_graph(
    gds,
    graph_name="graph-lp-collab",
    node_spec={
        "Person": {
            "label": "...

Summary

In this chapter, you have learned about the LP problem, an ML technique that’s only possible with graph data. It can be used in many contexts to predict future or unknown links between any type of nodes, as long as we have some example or context data. You have learned how to build an LP pipeline with Neo4j’s GDS, which takes care of negative observation sampling, model training, and storage for us.

This chapter is the last one where we will talk about predictions and ML. Overall, we have studied several use cases for ML on graphs, including node classification and future/unknown LP. You have learned how to extract graph-based features or embeddings to feed an ML model in your preferred library (we’ve used scikit-learn). You have also learned that the whole ML pipeline can be managed within Neo4j and its GDS library thanks to built-in pipelines and models.

GDS contains many interesting tools, but it is generally still young compared to other ML tools...

Further reading

If you want to learn more about the topics covered in this chapter, I recommend the following resources:
