You're reading from Graph Data Science with Neo4j

Product type: Book
Published in: Jan 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781804612743
Edition: 1st Edition
Author (1)
Estelle Scifo

Estelle Scifo has over 7 years of experience as a data scientist, after receiving her PhD from the Laboratoire de l'Accélérateur Linéaire, Orsay (affiliated to CERN in Geneva). As a Neo4j certified professional, she uses graph databases on a daily basis and takes full advantage of their features to build efficient machine learning models out of this data. She is also a data science mentor who guides newcomers into the field. Her domain expertise and deep insight into beginners' needs make her an excellent teacher.

Building a GDS Pipeline for Node Classification Model Training

Classifying observations within categories is a classical machine learning (ML) task. As we learned in the preceding chapters, we can use existing ML models such as decision trees to classify a graph’s nodes. The graph structure is used to find extra features, bringing more knowledge into the model. In this chapter, we will discover another key feature of the Neo4j GDS library: pipelines. They let you configure and train an ML model, before using it to make predictions on unseen nodes. You can do all of this from Neo4j, without having to add another library such as scikit-learn to the tech stack.

We will also work on the Netflix dataset we created earlier in this book (the code is available on GitHub if you don’t have it yet). We will try to make predictions by building a node classification pipeline, focusing on the how rather than the why.

In this chapter, we’re going to cover the...

Technical requirements

To reproduce the examples given in this chapter, you’ll need the following tools:

  • Neo4j 5.x installed on your computer (see the installation instructions from Chapter 1, Introducing and Installing Neo4j):
    • The Graph Data Science plugin (version >= 2.2)
  • A Python environment with the following:
    • Jupyter to run the notebooks
    • scikit-learn
  • Any code listed in the book will be available in the associated GitHub repository (https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j) in the corresponding chapter folder
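As a small aid for checking the plugin requirement above, the helper below compares a version string against the 2.2 minimum. It is only a sketch: the function name `meets_minimum` is hypothetical, and it assumes the version string starts with `major.minor`, as in `2.2.6`.

```python
import re
from typing import Tuple


def meets_minimum(version: str, minimum: Tuple[int, int] = (2, 2)) -> bool:
    """Return True if a 'major.minor[.patch]' version string is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so (2, 10) >= (2, 2) is True.
    return (major, minor) >= minimum
```

With a connected GDS Python client, the string to check could come from `gds.version()`.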

Code samples

Unless otherwise indicated, all code snippets in this chapter and the following ones use the GDS Python client. Library import and client initialization are omitted in this chapter for brevity, but a detailed explanation can be found in the Introducing the GDS Python client section of Chapter 6, Building a Machine Learning Model with Graph Features. Also, note that the code in the code...

The GDS pipelines

This section introduces GDS pipelines: we explain the purpose of this feature, illustrate its intended usage, and show the basic usage of the pipeline catalog.

What is a pipeline?

As data scientists, we run data pipelines every day. Any logical flow of actions is, in a sense, a pipeline: when you run your Jupyter notebook, you already have one. Here, however, we refer to explicitly defined workflows made of sequential tasks, such as those we can build with scikit-learn. Let’s take a look at the Pipeline object in this library before focusing on GDS pipelines to understand their similarities and differences.

scikit-learn pipeline

Often, we think about ML as finding the best model for a given problem, but as data professionals, we know that finding the right model is only a small part of the problem. Before we can even think about fitting a model, many preliminary steps are required: from data gathering to feature extraction. Some...

Building and training a pipeline

Similarly to models, in order to add a pipeline to the catalog, we’ll have to train it. Pipeline training requires several steps:

  1. Create and name the pipeline object.
  2. Optionally, compute features from other GDS algorithms (such as graph algorithms, embeddings, or pre-processing).
  3. Define the feature set from the features added in the previous step, and/or any node property included in the projected graph.
  4. Select the ML models to be tested, with their hyperparameters: pipeline training will run all candidates and select the best one.
  5. Finally, train the model.

The following sub-sections detail each of these steps. The supporting notebook is Pipeline_Train_Predict. This can be found in the Chapter08 folder of the code bundle that comes with this book.
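As a rough sketch of how the five steps listed above might map onto the GDS Python client, consider the function below. It is not the chapter's actual configuration: the pipeline and model names, the hyperparameter values, the metric, and the use of `nbMovies` as a projected node property are illustrative assumptions, and the method names follow the client as documented for GDS 2.2, so they may differ in other versions. Here, `gds` stands for a connected client and `G` for a projected graph object.

```python
def train_node_classifier(gds, G, target_property):
    """Sketch of the five training steps; all names and values are illustrative."""
    # 1. Create and name the pipeline object.
    pipe, _ = gds.beta.pipeline.nodeClassification.create("example-pipe")

    # 2. Optionally compute features with another GDS algorithm (here Louvain).
    pipe.addNodeProperty("louvain", mutateProperty="louvain")

    # 3. Define the feature set from computed and/or projected properties.
    #    "nbMovies" is assumed to exist in the projected graph.
    pipe.selectFeatures(["louvain", "nbMovies"])

    # 4. Register candidate models with their hyperparameters.
    pipe.addLogisticRegression(penalty=0.1)
    pipe.addRandomForest(maxDepth=5)

    # 5. Train: the candidates are compared and the best one is kept.
    model, train_result = pipe.train(
        G,
        modelName="example-model",
        targetProperty=target_property,
        metrics=["F1_WEIGHTED"],
    )
    return model, train_result
```

Wrapping the steps in a function that receives the client and the graph keeps the sketch independent of any particular connection or projection.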

Creating the pipeline and choosing the features

In GDS, we can create three kinds of pipelines:

  • Node classification: Each node gets assigned to one target...

Making predictions

In order to make predictions, we are going to use the same projected graph that already contains the test nodes.

With this projected graph, and the model object returned by the pipeline training, we can now predict the class of new nodes:

# Stream predictions for every node carrying one of the target labels
predictions = model.predict_stream(
    projected_graph_object,
    targetNodeLabels=["Test", "Train"],
)

Note that the model object also exposes a predict_mutate function to store the results in the projected graph. This will be useful to us when dealing with embedding features in the last section of this chapter.
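For comparison, a call to predict_mutate might look like the sketch below. The wrapper function and the property name `predictedClass` are hypothetical choices made for illustration; the `mutateProperty` setting names the property under which results are stored in the in-memory projected graph.

```python
def store_predictions(model, G, property_name="predictedClass"):
    """Store predicted classes in the projected graph G (in memory only)."""
    return model.predict_mutate(
        G,
        mutateProperty=property_name,        # property added to the projected graph
        targetNodeLabels=["Test", "Train"],  # same labels as in the stream call
    )
```

Nothing is written back to the database by this call; only the projection is changed.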

In the predict_stream call above, we include both the Test and Train nodes so that the Louvain results are computed properly, using the whole graph. We will filter out the predictions for the Train nodes when we evaluate the model’s performance.

For instance, in order to evaluate our model, we can compute the confusion matrix using our...
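To make the filtering and evaluation idea concrete, here is a minimal pure-Python sketch that drops the predictions for training nodes and counts (true class, predicted class) pairs, which are the cells of a confusion matrix. It is not the evaluation code used in the chapter: the function name, the toy rows, the `truth` mapping, and the set of test node IDs are made up for illustration. Only the `nodeId` and `predictedClass` column names match what the streamed predictions contain.

```python
from collections import Counter


def confusion_counts(predictions, true_classes, test_node_ids):
    """Count (true class, predicted class) pairs for test nodes only.

    predictions: iterable of rows with 'nodeId' and 'predictedClass' keys
    true_classes: dict mapping nodeId -> known class (hypothetical input)
    test_node_ids: set of IDs of nodes with the Test label (hypothetical input)
    """
    counts = Counter()
    for row in predictions:
        node_id = row["nodeId"]
        if node_id not in test_node_ids:
            continue  # drop predictions made for training nodes
        counts[(true_classes[node_id], row["predictedClass"])] += 1
    return counts


# Toy data: node 0 is a training node, nodes 1 to 3 are test nodes.
rows = [
    {"nodeId": 0, "predictedClass": 1},
    {"nodeId": 1, "predictedClass": 1},
    {"nodeId": 2, "predictedClass": 0},
    {"nodeId": 3, "predictedClass": 1},
]
truth = {0: 1, 1: 1, 2: 1, 3: 0}
matrix = confusion_counts(rows, truth, test_node_ids={1, 2, 3})
```

Since predict_stream returns a pandas DataFrame, the rows could be obtained with `predictions.to_dict("records")`.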

Using embedding features

The analysis above is equivalent to the one performed with scikit-learn in Chapter 6, Building a Machine Learning Model with Graph Features, except that here there is no need to add another package for model training: everything is taken care of in GDS.

However, in the preceding chapter, we learned about another way to find node features, by learning them from the graph structure itself: node embeddings. In this section, we will use node embeddings as features for our classification task.
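In pipeline terms, using embeddings as features could amount to adding an embedding algorithm as a node property step and selecting its output as the feature, as in the sketch below. This is an assumption-laden illustration, not the chapter's code: the helper name and the `embedding` property name are hypothetical, and the procedure name is left as a parameter because the name to pass for a given algorithm (possibly with a tier prefix) depends on the GDS version.

```python
def add_embedding_feature(pipe, procedure, dimension=16):
    """Add an embedding step to a pipeline and use its output as the only feature."""
    # Run the embedding algorithm in mutate mode as part of the pipeline.
    pipe.addNodeProperty(
        procedure,
        mutateProperty="embedding",
        embeddingDimension=dimension,
    )
    # Feed the resulting vector property to the classifier.
    pipe.selectFeatures(["embedding"])
    return pipe
```

The embedding dimension is a hyperparameter of the embedding step itself, so it is set here rather than at training time.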

Choosing the graph embedding algorithm to use

In Chapter 7, Automatically Extracting Features with Graph Embeddings for Machine Learning, we talked about two graph embedding algorithms included in GDS: Node2Vec and GraphSAGE. They have some differences, and one of them is the kind of information they tend to encode. While Node2Vec tends to model the node positions in the graph (nodes close to each other in the graph will have close embeddings), GraphSAGE...

Summary

In this chapter, you learned how to use GDS pipelines to simplify the process of training an ML model involving graph-based features. GDS pipelines can be configured to run graph algorithms such as the Louvain algorithm and use the result as a feature in a classification or regression model. These models are part of GDS, so we do not have to explicitly extract data from Neo4j and use another ML library. Everything runs on the projected graph; pipelines and trained models are stored in the pipeline and model catalogs, and a trained model can then be used to make predictions on unseen nodes. This lets us use a single tool to compute graph features and perform ML tasks, including the training and prediction of different models, without explicit data exchange from and to the database.

Additionally, we played with the embedding algorithms included in the GDS, starting to surface their advantages and disadvantages.

In the next chapter, we will use another type of pipeline from the GDS to solve another kind...

Further reading

If you want to learn more about the topics covered in this chapter, I recommend the following readings:

Exercise

  1. Use a Cypher projection to build the projected graph we used in the first section. It must include nodes with the MainTrain label and the nbMovies and isUSCitizen properties, along with relationships of the KNOWS type.
  2. Create the graph represented in the following figure (same as Figure 7.4) in Neo4j. Then, run the Node2Vec algorithm while varying the p and q parameters and try to understand their effect:
Figure 8.4 – An example graph
