Graph databases in general, and Neo4j in particular, have gained increasing interest in the past few years. They provide a natural way of modeling entities and relationships and take into account observation context, which is often crucial to extract the most out of your data. Among the different graph database vendors, Neo4j has become one of the most popular for both data storage and analytics. A lot of tools have been developed by the company itself or the community to make the whole ecosystem consistent and easy to use: from storage to querying, to visualization to graph data science. As you will see through this book, there is a well-integrated application or plugin for each of these topics.
In this chapter, you will get to know what Neo4j is, positioning it in the broad context of databases. We will also introduce the aforementioned plugins that are used for graph data science.
Finally, you will set up your first Neo4j instance locally if you haven’t done so already and run your first Cypher queries to populate the database with some data and retrieve it.
In this chapter, we’re going to cover the following main topics:
To follow this chapter well, you will need access to the following resources:
Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j is and how different it is from the data storage engine you are used to. In this section, we are going to discuss (quickly) the different types of databases you can find today, and why graph databases are so interesting and popular both for developers and data professionals.
Databases make up an important part of computer science. Discussing the evolution and state-of-the-art areas of the different implementations in detail would require several books like this one – fortunately, this is not a requirement to use such systems effectively. However, it is important to be aware of the existing tools related to data storage and how they differ from each other, to be able to choose the right tool for the right task. The fact that, after reading this book, you’ll be able to use graph databases and Neo4j in your data science project doesn’t mean you will have to use it every single time you start a new project, whatever the context is. Sometimes, it won’t be suitable; this introduction will explain why.
A database, in the context of computing, is a system that allows you to store and query data on a computer, phone, or, more generally, any electronic device.
As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:
The preceding list is non-exhaustive; other types of data stores have been created over time. Some have been abandoned, while others appeared only in the past few years, so we’ll need to wait to see how useful they can be. We can mention, for instance, vector databases, such as Weaviate, which store data along with its vector representation to ease searching in the vector space, with many applications in machine learning once a vector representation (embedding) of an observation has been computed.
Graph databases can also be classified as NoSQL databases. They bring another approach to the data storage landscape, especially in the data model phase.
In the previous section, we talked about databases. Before discussing graph databases, let’s introduce the concept of graphs.
A graph is a mathematical object defined by a set of nodes (also called vertices) and a set of edges, each of which connects two nodes.
The following figure shows several examples of graphs, big and small:
Figure 1.1 – Representations of some graphs
As you can see, there’s a Road network (in Europe), a Computer network, and a Social network. But in practice, far more objects can be seen as graphs:
Figure 1.2 – Figure generated with the spacy Python library, which was able to identify the relationships between words in a sentence using NLP techniques
A graph can be seen as a generalization of these static representations, where links can be created with fewer constraints.
Another advantage of graphs is that they can be easily traversed, going from one node to another by following edges. They have been used for representing networks for a long time – road networks or communication infrastructure, for instance. The concept of a path, especially the shortest path in a graph, is a long-studied field. But the analysis of graphs doesn’t stop here – much more information can be extracted from carefully analyzing a network, such as its structure (are there groups of nodes disconnected from the rest of the graph? Are groups of nodes more tied to each other than to other groups?) and node ranking (node importance). We will discuss these algorithms in more detail in Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.
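To make the idea of traversal concrete, here is a minimal Python sketch of a breadth-first search that finds a shortest path (in number of edges) between two nodes. The toy graph, its node names, and the function name shortest_path are assumptions made for this illustration; they are not taken from a dataset used in this book.

```python
from collections import deque

# Toy undirected graph stored as an adjacency list (hypothetical data).
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],  # isolated node, disconnected from the rest of the graph
}


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list of nodes, or None."""
    previous = {start: None}  # predecessor of each visited node
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # goal cannot be reached from start


print(shortest_path(GRAPH, "A", "E"))
print(shortest_path(GRAPH, "A", "F"))
```

The second call returns None because F has no edges, which is a small instance of the "disconnected groups of nodes" question raised above.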
So, we know what a database is and what a graph is. Now comes the natural question: what is a graph database? The answer is quite simple: in a graph database, data is saved into nodes, which can be connected through edges to model relationships between them.
At this stage, you may be wondering: OK, but where can I find graph data? While we are used to CSV or JSON datasets, graph formats are not yet common, which might be confusing for some of you. If you do not have graph data, why would you need a graph database? There are two possible answers to this question, both of which we are going to discuss.
Data scientists know how to find or generate datasets that fit their needs. Randomly generating a variable distribution while following some probabilistic law is one of the first things you’ll learn in a statistics course. Similarly, graph datasets can be randomly generated, following some rules. However, this book is not a graph theory book, so we are not going to dig into these details here. Just be aware that this can be done. Please refer to the references in the Further reading section to learn more.
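As a rough illustration of random graph generation, the following Python sketch links every pair of nodes independently with a fixed probability (the model usually called Erdos-Renyi G(n, p)). The function name random_graph and its parameters are hypothetical and chosen only for this example; real projects would typically rely on a dedicated graph library.

```python
import itertools
import random


def random_graph(n_nodes, edge_probability, seed=None):
    """Generate an undirected random graph.

    Each pair of distinct nodes is connected independently with the
    given probability.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    nodes = list(range(n_nodes))
    edges = [
        (u, v)
        for u, v in itertools.combinations(nodes, 2)
        if rng.random() < edge_probability
    ]
    return nodes, edges


nodes, edges = random_graph(n_nodes=10, edge_probability=0.3, seed=42)
print(len(nodes), "nodes,", len(edges), "edges")
```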
Regarding existing datasets, some of them are very popular and data scientists know about them because they have used them while learning data science and/or because they are the topic of well-known Kaggle competitions. Think, for instance, about the Titanic or house price datasets. Other datasets are also used for model benchmarking, such as the MNIST or ImageNet datasets in computer vision tasks.
The same holds for graph data science, where some datasets are very common for teaching or benchmarking purposes. If you investigate graph theory, you will read about the Zachary’s karate club (ZKC) dataset, which is probably one of the most famous graph datasets out there (side note: there is even a ZKC trophy, which is awarded to the first person at a graph conference who mentions this dataset). The ZKC dataset is very simple (34 nodes; we’ll see how to characterize a graph dataset in Chapter 3, Characterizing a Graph Dataset, and Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset), but bigger and more complex datasets are also available.
There are websites referencing graph datasets, which can be used for benchmarking in a research context or for educational purposes, such as in this book. Two of the most popular ones are the following:
If you browse these websites and start downloading some of the files, you’ll notice the data comes in unfamiliar formats. We’re going to list some of them next.
The datasets we are used to are mainly exchanged as CSV or JSON files. To represent a graph, with nodes on one side and edges on the other, several specific formats have been imagined.
The main data formats that are used to save graph data as text files are the following:
Edge list: This is a text file where each row defines one edge. For instance, a graph with three nodes (A, B, C) and two edges (A-B and C-A) is defined by the following edgelist file:
A B
C A
Matrix Market (with the .mtx extension): This format is an extension of the previous one. It is quite common on the Network Repository website.
Adjacency matrix: The adjacency matrix is an NxN matrix (where N is the number of nodes in the graph) where the ij element is 1 if nodes i and j are connected through an edge and 0 otherwise. The adjacency matrix of the simple graph with three nodes and two edges is a 3x3 matrix, as shown in the following code block, where rows are the source nodes and columns are the target nodes. I have explicitly displayed the row and column names only for convenience, to help you identify what i and j are:
  A B C
A 0 1 0
B 0 0 0
C 1 0 0
Note
The adjacency matrix is one way to vectorize a graph. We’ll come back to this topic in Chapter 7, Automatically Extracting Features with Graph Embeddings for Machine Learning.
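GraphML: This XML-based format can also store node and edge properties. The following example describes the same graph and adds a name property to nodes and a length property to edges:
<?xml version='1.0' encoding='utf-8'?>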
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
>
<!-- DEFINING PROPERTY NAME WITH TYPE AND ID -->
<key attr.name="name" attr.type="string" for="node" id="d1"/>
<key attr.name="length" attr.type="double" for="edge" id="d2"/>
<graph edgedefault="directed">
<!-- DEFINING NODES -->
<node id="A">
<!-- SETTING NODE PROPERTY -->
<data key="d1">"Point A"</data>
</node>
<node id="B">
<data key="d1">"Point B"</data>
</node>
<node id="C">
<data key="d1">"Point C"</data>
</node>
<!-- DEFINING EDGES
with source and target nodes and properties
-->
<edge id="AB" source="A" target="B">
<data key="d2">123.45</data>
</edge>
<edge id="CA" source="C" target="A">
<data key="d2">56.78</data>
</edge>
</graph>
</graphml>
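The GraphML document above can be read with any XML parser. The following Python sketch uses the standard library's xml.etree.ElementTree on a trimmed copy of the same document; the function name parse_graphml and the returned structure (a dictionary of nodes and a list of edges) are assumptions made for this illustration, and only the double type is converted to a number.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the GraphML example (schema attributes and comments removed).
GRAPHML = """<?xml version='1.0' encoding='utf-8'?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key attr.name="name" attr.type="string" for="node" id="d1"/>
  <key attr.name="length" attr.type="double" for="edge" id="d2"/>
  <graph edgedefault="directed">
    <node id="A"><data key="d1">"Point A"</data></node>
    <node id="B"><data key="d1">"Point B"</data></node>
    <node id="C"><data key="d1">"Point C"</data></node>
    <edge id="AB" source="A" target="B"><data key="d2">123.45</data></edge>
    <edge id="CA" source="C" target="A"><data key="d2">56.78</data></edge>
  </graph>
</graphml>"""

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def parse_graphml(text):
    """Return (nodes, edges) read from a GraphML string."""
    root = ET.fromstring(text.encode("utf-8"))
    # Each <key> declares a property: map its id to (name, type).
    keys = {
        key.get("id"): (key.get("attr.name"), key.get("attr.type"))
        for key in root.findall("g:key", NS)
    }

    def properties(element):
        result = {}
        for data in element.findall("g:data", NS):
            name, kind = keys[data.get("key")]
            # Only "double" is converted here; other values stay as text.
            result[name] = float(data.text) if kind == "double" else data.text
        return result

    graph = root.find("g:graph", NS)
    nodes = {n.get("id"): properties(n) for n in graph.findall("g:node", NS)}
    edges = [
        (e.get("source"), e.get("target"), properties(e))
        for e in graph.findall("g:edge", NS)
    ]
    return nodes, edges


nodes, edges = parse_graphml(GRAPHML)
print(nodes)
print(edges)
```

Note that the double quotes around the node names are part of the stored text in this example, so they are kept in the parsed values.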
If you find a dataset already formatted as a graph, it is likely to be using one of the preceding formats. However, most of the time, you will want to use your own data, which is not yet in graph format – it might be stored in the previously described databases or CSV or JSON files. If that is the case, then the next section is for you! There, you will learn how to transform your data into a graph.
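To relate two of the formats listed in this section, here is a minimal Python sketch that reads the three-node edge list and builds the corresponding adjacency matrix. The function name adjacency_matrix and the explicit node_order argument are choices made for this illustration, not part of any format specification.

```python
# The edge list of the three-node example graph: one edge per line.
EDGE_LIST = """A B
C A"""


def adjacency_matrix(edge_list_text, node_order):
    """Build a directed adjacency matrix (list of rows) from edge list text."""
    index = {name: position for position, name in enumerate(node_order)}
    matrix = [[0] * len(node_order) for _ in node_order]
    for line in edge_list_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split()
        # Row = source node, column = target node.
        matrix[index[source]][index[target]] = 1
    return matrix


matrix = adjacency_matrix(EDGE_LIST, ["A", "B", "C"])
for name, row in zip("ABC", matrix):
    print(name, *row)
```

For an undirected graph, one would also set the symmetric entry (target row, source column) to 1.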
The second answer to the main question in this section is: your data is probably a graph, without you being aware of it yet. We will elaborate on this topic in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph), but let me give you a quick overview.
Let’s take the example of an e-commerce website, which has customers (users) and products. As in every e-commerce website, users can place orders to buy some products. In the relational world, the data schema that’s traditionally used for such a scenario is shown on the left-hand side of the following figure:
Figure 1.3 – Modeling e-commerce data as a graph
The relational data model works as follows:
The users table stores, for each user, a unique identifier (id) and a username (apart from the security and personal information required for such a website; you can easily imagine how to add columns to this table).
Note
Please refer to the colored version of the preceding figure, which can be found in the graphics bundle link provided in the Preface, for a better understanding of the correspondence between the two sides of the figure.
In a graph database, all the _id columns are replaced by actual relationships, which are real entities in graph databases, not just conceptual ones like in the relational model. You can also get rid of the order product table, since information specific to a product in a given order, such as the ordered quantity, can be stored directly in the relationship between the order and the product node. The data model is much more natural and easier to document and present to other people on your team.
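The following Python sketch illustrates this transformation on a few in-memory rows. Everything here is hypothetical: the table and column names, the sample rows, and the PLACED and CONTAINS relationship types are assumptions made for the example and are not taken from the figure.

```python
# Hypothetical rows shaped like the relational tables described above.
users = [{"id": 1, "username": "alice"}]
products = [{"id": 10, "name": "keyboard"}, {"id": 11, "name": "mouse"}]
orders = [{"id": 100, "user_id": 1}]
order_product = [
    {"order_id": 100, "product_id": 10, "quantity": 1},
    {"order_id": 100, "product_id": 11, "quantity": 2},
]


def to_graph(users, products, orders, order_product):
    """Turn rows into labeled nodes and typed relationships."""
    nodes = {}
    for label, rows in (("User", users), ("Product", products), ("Order", orders)):
        for row in rows:
            # Foreign keys (*_id columns) are not kept as node properties.
            nodes[(label, row["id"])] = {
                key: value for key, value in row.items() if not key.endswith("_id")
            }

    relationships = []
    # The user_id foreign key becomes a relationship (type name is assumed).
    for order in orders:
        relationships.append(
            (("User", order["user_id"]), "PLACED", ("Order", order["id"]), {})
        )
    # The join table disappears: its quantity column becomes a relationship property.
    for line in order_product:
        relationships.append(
            (
                ("Order", line["order_id"]),
                "CONTAINS",
                ("Product", line["product_id"]),
                {"quantity": line["quantity"]},
            )
        )
    return nodes, relationships


nodes, relationships = to_graph(users, products, orders, order_product)
for relationship in relationships:
    print(relationship)
```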
Now that we have a better understanding of what a graph database is, let’s explore the different implementations out there. Like the other types of databases, there is no single implementation for graph databases, and several projects provide graph database functionalities.
In the next section, we are going to discuss some of the differences between them, and where Neo4j is positioned in this technology landscape.
Even when restricting the scope to graph databases, there are still different ways to envision such data stores:
Resource Description Framework (RDF): Data is stored as triples of the Subject Predicate Object type. This is a complex vocabulary that expresses a relationship of a certain type (the predicate) between a subject and an object; for instance:
Alice(Subject) KNOWS(Predicate) Bob(Object)
Very famous knowledge bases such as DBpedia and Wikidata use the RDF format. We will talk about this a bit more in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph).
Labeled-property graph (LPG): Nodes and relationships have labels (Alice and Bob are nodes with the Person label, and the relationship between them has the KNOWS label) and have properties (people have names; an acquaintance relationship can contain the date when both people first met as a property).
Neo4j is a labeled-property graph database. Even within this family, just as MySQL, PostgreSQL, and Microsoft SQL Server are all relational databases, you will find different vendors offering LPG graph databases. They differ in many aspects:
Query language: Cypher, the query language developed by Neo4j, has been opened up through the openCypher project, allowing other databases to use the same language (see, for instance, RedisGraph or Amazon Neptune). Other vendors have created their own languages (AQL for ArangoDB or GSQL for TigerGraph, for instance). To me, this is a key point to take into account since the learning curve can be very different from one language to another. Cypher has the advantage of being very intuitive, and a few minutes are enough to start writing your own queries.
A note about performance
Almost every vendor claims to be the best one, at least in some aspects. This book won’t create another debate about that. The best option, if performance is crucial for your application, is to test the candidates with a scenario close to your final goal in terms of data volume and the type of queries/analysis.
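The difference between the two data models described in this section can be sketched in a few lines of Python. The Node and Relationship classes below are hypothetical containers written for this illustration; they do not reflect how any database stores its data internally.

```python
from dataclasses import dataclass, field

# RDF view: a fact is a (subject, predicate, object) triple.
triples = [("Alice", "KNOWS", "Bob")]


# Labeled-property graph view: nodes and relationships carry labels
# (or a type) and their own properties.
@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


alice = Node({"Person"}, {"name": "Alice"})
bob = Node({"Person"}, {"name": "Bob"})
# The date both people first met is stored on the relationship itself.
knows = Relationship(alice, "KNOWS", bob, {"since": "2022-12-01"})

# The same fact, read back in Subject Predicate Object form.
lpg_as_triple = (
    knows.start.properties["name"],
    knows.type,
    knows.end.properties["name"],
)
print(lpg_as_triple)
print(knows.properties)
```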
The Neo4j database is already very helpful by itself, but the number of extensions, libraries, and applications related to it makes it one of the most complete solutions. In addition, it has a very active community of members always keen to help each other, which is one more reason to choose it.
The core Neo4j database capabilities can be extended thanks to some plugins. Awesome Procedures on Cypher (APOC), a common Neo4j extension, contains some procedures that can extend the database and Cypher capabilities. We will use it later in this book to load JSON data.
The main plugin we will explore in this book is the Graph Data Science Library. Its predecessor, the Graph Algorithm Library, was first released in 2018 by the Neo4j lab team. It was quickly replaced by the Graph Data Science Library, a fully production-ready plugin, with improved performance. Algorithms are improved and added regularly. Version 2.0, released in 2021, takes graph data science even further, allowing us to train models and build analysis pipelines directly from the library. It also comes with a handy Python client, which is very convenient for including graph algorithms into your usual machine learning processes, whether you use scikit-learn or other machine learning libraries such as TensorFlow or PyTorch.
Besides the plugins, there are also lots of applications out there to help us deal with Neo4j and explore the data it contains. The first application we will use is Neo4j Desktop, which lets us manage several Neo4j databases. Continue reading to learn how to use it. Neo4j Desktop also lets you manage your installed plugins and applications.
Applications installed into Neo4j Desktop are granted access to your active database. While reading this book, you will use the following:
Figure 1.4 – Neo4j Browser
Figure 1.5 – Neo4j Bloom
Figure 1.6 – Neodash
This list of applications is non-exhaustive. You can find out more here: https://install.graphapp.io/.
Good to know
You can create your own graph application to be run within Neo4j Desktop. This is why there are so many diverse applications, some of which are being developed by community members or Neo4j partners.
This section described Neo4j as a database and the various extensions that can be added to it to make it more powerful. Now, it is time to start using it. In the following section, you are going to install Neo4j locally on your computer so that you can run the code examples provided in this book (which you are highly encouraged to do!).
There are several ways to use Neo4j:
For the scope of this book, we will use the Neo4j Desktop option, since this application takes care of many things for us and we do not want to go into server management at this stage.
The easiest way to use Neo4j on your local computer when you are in the experimentation phase is to use the Neo4j Desktop application, which is available on Windows, macOS, and Linux. This user interface lets you create Neo4j databases, which are organized into Projects, manage the installed plugins and applications, and update the DB configuration, among other things.
Installing it is super easy: go to the Neo4j download center and follow the instructions. We recap the steps here, with screenshots to guide you through the process:
Figure 1.7 – Neo4j Download Center
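Save the activation key you are given when downloading the executable; it is a long string that starts like this:
eyJhbGciOiJQUzI1NiIsInR5cCI6IkpXVCJ9.eyJlbWFpbCI6InN0ZWxsYTBvdWhAZ21haWwuY29tIiwibWl4cGFuZWxJZ CI6Imdvb2dsZS1vYXV0a
...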
The following steps depend on your operating system:
For Linux users, here is how to proceed:
Open a terminal and go to the folder where the file was downloaded:
# update path depending on your system
$ cd Downloads/
Store the version of the AppImage file you’ve just downloaded in a variable:
# take the most recent Neo4j Desktop AppImage and keep the text between
# "neo4j-desktop-" and "AppImage" (the final dot is included)
$ DESKTOP_VERSION=$(ls -tr neo4j-desktop*.AppImage | tail -1 | grep -Po "(?<=neo4j-desktop-).*(?=AppImage)")
$ echo ${DESKTOP_VERSION}
If the echo command shows something like 1.4.11-x86_64., you’re good to go. Alternatively, you can identify the pattern yourself and create the variable, like so:
$ DESKTOP_VERSION=1.4.11-x86_64. # include the final dot
Make the file executable with chmod and run the application:
# make file executable:
$ chmod +x neo4j-desktop-${DESKTOP_VERSION}AppImage
# run the application:
$ ./neo4j-desktop-${DESKTOP_VERSION}AppImage
The last command in the preceding code snippet starts the Neo4j Desktop application. The first time you run the application, it will ask you for the activation key you saved when downloading the executable. And that’s it – the application will be running, which means we can start creating Neo4j databases and interact with them.
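If you want to check what the version variable used in the preceding steps is expected to contain, the same extraction can be sketched in Python. The function name desktop_version is hypothetical, and the file name is only the example version quoted above.

```python
import re


def desktop_version(filename):
    """Return the text between 'neo4j-desktop-' and 'AppImage', final dot included."""
    match = re.fullmatch(r"neo4j-desktop-(.+\.)AppImage", filename)
    return match.group(1) if match else None


print(desktop_version("neo4j-desktop-1.4.11-x86_64.AppImage"))
```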
Creating a new database with Neo4j Desktop is quite straightforward:
This process is illustrated in the following screenshot:
Figure 1.8 – Adding a new database with Neo4j Desktop
Note
Save the password in a safe place; you’ll need to provide it to drivers and applications when connecting to this database.
Figure 1.9 – Choosing a name, password, and version for your new database
Figure 1.10 – Starting your newly created database
Note
You can’t have two databases running at the same time. If you start a new database while another is still running, the previous one must be stopped before the new one can be started.
You now have Neo4j Desktop installed and a running instance of Neo4j on your local computer. At this point, you are ready to start playing with graph data. Before moving on, let me introduce Neo4j Aura, which is an alternative way to quickly get started with Neo4j.
Neo4j also has a DB-as-a-service component called Aura. It lets you create a Neo4j database hosted in the cloud (either on Google Cloud Platform or Amazon Web Services, your choice) and is fully managed, so there’s no need to worry about updates anymore. This service is entirely free up to a certain database size (50k nodes and 150k relationships), which is sufficient for experimenting. To create a database in Neo4j Aura, visit https://neo4j.com/cloud/platform/aura-graph-database/.
The following screenshot shows an example of a Neo4j database running in the cloud thanks to the Aura service:
Figure 1.11 – Neo4j Aura dashboard with a free-tier instance
Clicking Explore opens Neo4j Bloom, which we will cover in Chapter 3, Characterizing a Graph Dataset, while clicking Query starts Neo4j Browser in a new tab. You’ll be requested to enter the connection information for your database. The URL can be found in the previous screenshot – the username and password are the ones you set when creating the instance.
In the rest of this book, examples will be provided using a local database managed with the Neo4j Desktop application, but you are free to use whatever technique you prefer. However, note that some minor changes are to be expected if you choose something different, such as directory location or plugin installation method. In the latter case, always refer to the plugin or application documentation to find out the proper instructions.
Now that our first database is ready, it is time to insert some data into it. For this, we will use our first Cypher queries.
Cypher, as we discussed at the beginning of this chapter, is the query language developed by Neo4j. It is also used by other graph database vendors, such as RedisGraph.
First, let’s create some nodes in our newly created database.
To do so, open Neo4j Browser by clicking on the Open button next to your database and selecting Neo4j Browser:
Figure 1.12 – Start the Neo4j Browser application from Neo4j Desktop
From there, you can start writing Cypher queries in the upper text area.
Let’s start by creating some nodes with the following Cypher query:
CREATE (:User {name: "Alice", birthPlace: "Paris"})
CREATE (:User {name: "Bob", birthPlace: "London"})
CREATE (:User {name: "Carol", birthPlace: "London"})
CREATE (:User {name: "Dave", birthPlace: "London"})
CREATE (:User {name: "Eve", birthPlace: "Rome"})
Before running the query, let me detail its syntax:
Figure 1.13 – Anatomy of a node creation Cypher statement
Note that all of these components except for the parentheses are optional. You can create a node with no label and no properties with CREATE (), even if creating an empty record wouldn’t be really useful for data storage purposes.
Tips
You can copy and paste the preceding query and execute it as-is; multiple line queries are allowed by default in Neo4j Browser.
If the upper text area is not large enough, press the Esc key to maximize it.
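To summarize the anatomy of a node pattern (parentheses, then an optional alias, label, and properties), here is a small Python sketch that assembles the text of such a pattern. The helper node_pattern is hypothetical and only handles simple string and number values; it is meant to show how the pieces fit together, and an application talking to a real database would normally pass values as query parameters instead of building query text.

```python
import json


def node_pattern(alias="", label="", properties=None):
    """Assemble the text of a Cypher node pattern from its optional parts."""
    head = alias + (f":{label}" if label else "")
    parts = [head] if head else []
    if properties:
        # json.dumps renders strings with double quotes, as in the queries above.
        body = ", ".join(
            f"{key}: {json.dumps(value)}" for key, value in properties.items()
        )
        parts.append("{" + body + "}")
    return "(" + " ".join(parts) + ")"


# Only the parentheses are mandatory.
print("CREATE " + node_pattern())
print(
    "CREATE "
    + node_pattern(label="User", properties={"name": "Alice", "birthPlace": "Paris"})
)
```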
Now that we’ve created some nodes and since we are dealing with a graph database, it is time to learn how to connect these nodes by creating edges, or, in Neo4j language, relationships.
The following code snippet starts by fetching the start and end nodes (Alice and Bob), then creates a relationship between them. The created relationship is of the KNOWS type and carries one property (the date Alice and Bob met):
MATCH (alice:User {name: "Alice"})
MATCH (bob:User {name: "Bob"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)
We could have also put all our CREATE statements into one big query, for instance, by adding aliases to the created nodes:
CREATE (alice:User {name: "Alice", birthPlace: "Paris"})
CREATE (bob:User {name: "Bob", birthPlace: "London"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)
Note
In Neo4j, relationships are directed, meaning you have to specify a direction when creating them, which we can do thanks to the > symbol. However, Cypher lets you select data regardless of the relationship’s direction. We’ll discuss this when appropriate in the subsequent chapters.
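The difference between following a relationship's direction and ignoring it can be sketched with plain Python data. The list of tuples below is a hypothetical in-memory stand-in for the single KNOWS relationship created earlier, and the two function names are assumptions made for this example.

```python
# One directed relationship: (start, type, end).
RELATIONSHIPS = [("Alice", "KNOWS", "Bob")]


def known_by(person):
    """Follow the direction, as in (person)-[:KNOWS]->(other)."""
    return [
        end
        for start, rel_type, end in RELATIONSHIPS
        if rel_type == "KNOWS" and start == person
    ]


def acquaintances(person):
    """Ignore the direction, as in (person)-[:KNOWS]-(other)."""
    result = []
    for start, rel_type, end in RELATIONSHIPS:
        if rel_type != "KNOWS":
            continue
        if start == person:
            result.append(end)
        elif end == person:
            result.append(start)
    return result


print(known_by("Bob"))       # Bob has no outgoing KNOWS relationship
print(acquaintances("Bob"))  # but he is connected to Alice
```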
Inserting data into the database is one thing, but without the ability to query and retrieve this data, databases would be useless. In the next section, we are going to use Cypher’s powerful pattern matching to read data from Neo4j.
So far, we have put some data in Neo4j and explored it with Neo4j Browser. But unsurprisingly, Cypher also lets you select and return data programmatically. This is what is called pattern matching in the context of graphs.
Let’s analyze such a pattern:
MATCH (usr:User {birthPlace: "London"})
RETURN usr.name, usr.birthPlace
Here, we are selecting nodes with the User label while filtering for nodes with birthPlace equal to London. The RETURN statement asks Neo4j to only return the name and the birthPlace property of the matched nodes. The result of the preceding query, based on the data created earlier, is as follows:
╒══════════╤════════════════╕
│"usr.name"│"usr.birthPlace"│
╞══════════╪════════════════╡
│"Bob"     │"London"        │
├──────────┼────────────────┤
│"Carol"   │"London"        │
├──────────┼────────────────┤
│"Dave"    │"London"        │
└──────────┴────────────────┘
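Conceptually, this query filters nodes on a property value and then projects two properties. The following Python sketch reproduces that logic on a list of dictionaries that mirrors the five users created earlier; the list is only an in-memory stand-in for the database, used here to show the logic rather than how Neo4j executes the query.

```python
# In-memory copy of the five User nodes created earlier.
USERS = [
    {"name": "Alice", "birthPlace": "Paris"},
    {"name": "Bob", "birthPlace": "London"},
    {"name": "Carol", "birthPlace": "London"},
    {"name": "Dave", "birthPlace": "London"},
    {"name": "Eve", "birthPlace": "Rome"},
]

# Filter on birthPlace (the MATCH part), then keep two properties (the RETURN part).
result = [
    (user["name"], user["birthPlace"])
    for user in USERS
    if user["birthPlace"] == "London"
]

for row in result:
    print(row)
```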
This is a simple MATCH statement, but most of the time, you’ll need to traverse the graph somehow to explore relationships. This is where Cypher is very convenient. You can write queries with an easy-to-remember syntax, close to the one you would use when drafting your query on a piece of paper. As an example, let’s find the users known by Alice, and return their names:
MATCH (alice:User {name: "Alice"})-[:KNOWS]->(u:User)
RETURN u.name
The pattern in the MATCH clause of the preceding query is a graph traversal. From the node(s) matching the User label and the name Alice, we are traversing the graph toward another node through a relationship of the KNOWS type. In our toy dataset, there is only one matching node, Bob, since Alice has a single relationship of this type.
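The steps of this traversal (find the start node, follow relationships of a given type, check the label of the node reached) can be spelled out in Python. The dictionaries below and the function name names_known_by are assumptions made for this sketch; they mimic a small part of the toy dataset rather than Neo4j's actual storage.

```python
# Nodes indexed by a hypothetical internal identifier.
NODES = {
    1: {"labels": {"User"}, "name": "Alice"},
    2: {"labels": {"User"}, "name": "Bob"},
    3: {"labels": {"User"}, "name": "Carol"},
}
# Directed relationships: (start id, type, end id).
RELATIONSHIPS = [(1, "KNOWS", 2)]


def names_known_by(name):
    """Mimic MATCH (:User {name: $name})-[:KNOWS]->(u:User) RETURN u.name."""
    # Step 1: find the start node(s) by label and name.
    start_ids = {
        node_id
        for node_id, node in NODES.items()
        if "User" in node["labels"] and node["name"] == name
    }
    # Step 2: follow outgoing KNOWS relationships to User nodes.
    return [
        NODES[end]["name"]
        for start, rel_type, end in RELATIONSHIPS
        if start in start_ids and rel_type == "KNOWS" and "User" in NODES[end]["labels"]
    ]


print(names_known_by("Alice"))
```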
Note
In our example, we are using a single-node label and relationship type. You are encouraged to experiment by adding more data types. For instance, create some nodes with the Product label and relationships of the SELLS/BUYS type between users and products to build more complex queries.
In this chapter, you learned about the specificities of graph databases and started to learn about Neo4j and the tools around it. Now, you know a lot more about the Neo4j ecosystem, including plugins such as APOC and the graph data science (GDS) library and graph applications such as Neo4j Browser and Neodash. You installed Neo4j on your computer and created your first graph database. Finally, you created your first nodes and relationships and built your first Cypher MATCH statement to extract data from Neo4j.
At this point, you are ready for the next chapter, which will teach you how to import data from various data sources into Neo4j, using built-in tools and the common APOC library.
If you want to explore the concepts described in this chapter in more detail, please refer to the following references:
To make sure you fully understand the content described in this chapter, you are encouraged to think about the following exercises before moving on:
MATCH (x:User)
RETURN x.name

MATCH (x:User)
RETURN x

MATCH (:User)
RETURN x.name

MATCH (x:User)-[k:KNOWS]->(y:User)
RETURN x, k, y

MATCH (x:User)-[:KNOWS]-(y)
RETURN x, y