Introducing and Installing Neo4j
Graph databases in general, and Neo4j in particular, have attracted increasing interest in the past few years. They provide a natural way of modeling entities and the relationships between them, and they take the context of each observation into account, which is often crucial for getting the most out of your data. Among the different graph database vendors, Neo4j has become one of the most popular for both data storage and analytics. Many tools have been developed by the company itself or by the community to make the whole ecosystem consistent and easy to use, from storage and querying to visualization and graph data science. As you will see throughout this book, there is a well-integrated application or plugin for each of these topics.
In this chapter, you will learn what Neo4j is and where it sits in the broad landscape of databases. We will also introduce the aforementioned plugins that are used for graph data science.
Finally, you will set up your first Neo4j instance locally, if you haven’t done so already, and run your first Cypher queries to populate the database with some data and retrieve it.
In this chapter, we’re going to cover the following main topics:
- What is a graph database?
- Finding or creating a graph database
- Neo4j in the graph databases landscape
- Setting up Neo4j
- Inserting data into Neo4j with Cypher, the Neo4j query language
- Extracting data from Neo4j with Cypher pattern matching
Technical requirements
To follow along with this chapter, you will need access to the following resources:
- You’ll need a computer that can run Neo4j locally; Windows, macOS, and Linux are all supported. Please refer to the Neo4j website for more details about system requirements: https://neo4j.com/docs/operations-manual/current/installation/requirements/.
- All code listed in the book is available in the associated GitHub repository (https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j), in the corresponding chapter folder.
What is a graph database?
Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j is and how it differs from the data storage engines you may be used to. In this section, we are going to briefly discuss the different types of databases you can find today, and why graph databases are so interesting and popular with both developers and data professionals.
Databases
Databases make up an important part of computer science. Discussing the evolution and the state of the art of the different implementations in detail would require several books like this one; fortunately, this is not a requirement to use such systems effectively. However, it is important to be aware of the existing data storage tools and how they differ from each other, so that you can choose the right tool for the right task. The fact that, after reading this book, you’ll be able to use graph databases and Neo4j in your data science projects doesn’t mean you will have to use them every single time you start a new project, whatever the context is. Sometimes, they won’t be suitable; this introduction will explain why.
A database, in the context of computing, is a system that allows you to store and query data on a computer, phone, or, more generally, any electronic device.
As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:
- Relational databases (SQL) such as MySQL or PostgreSQL. These store data as records in tables whose columns are attributes or fields and whose rows each represent an entity. They have a predefined schema, which defines how data is organized and the type of each field. Relationships between entities in this representation are modeled by foreign keys (requiring unique identifiers). When the relationship is more complex, such as when it carries attributes of its own or when there can be many relationships between the same objects, an intermediate junction (join) table is required.
- NoSQL databases, a family that covers many different types of databases:
- Key-value stores such as Redis or Riak. A key-value (KV) store, as the name suggests, is a simple lookup database where the key is usually a string and the value can be a more complex object. The value can’t be used to filter a query; it can only be retrieved by its key. KV stores are known to be very efficient for caching in a web context, where the key is the page URL and the value is the dynamically generated HTML content of the page. KV stores can also be used to model graphs when building a native graph engine is not an option. You can see KV stores in action in projects such as the following:
- IndraDB: This is a graph database written in Rust that relies on different types of KV stores: https://github.com/indradb/indradb
- Document-oriented databases such as MongoDB or CouchDB. These are useful for storing schema-less documents, usually JSON objects or a derivative. They are much more flexible than relational databases, since each document may have different fields. However, relationships are harder to model, and such databases rely heavily on nested JSON and information duplication instead of joining multiple tables.
The preceding list is not exhaustive: other types of data stores have been created over time. Some have since been abandoned, while others appeared only in the past few years, so we’ll need to wait to see how useful they turn out to be. We can mention, for instance, vector databases, such as Weaviate, which store data together with its vector representation to ease searching in the vector space. They have many applications in machine learning once a vector representation (embedding) of an observation has been computed.
Graph databases can also be classified as NoSQL databases. They bring another approach to the data storage landscape, especially in the data modeling phase.
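To make the data models described in this section a little more concrete, here is a minimal sketch that stores the same toy data in three ways, using only the Python standard library. The table names, keys, and sample values (Alice, Bob, Acme) are made up for illustration, SQLite stands in for a relational database, and plain Python dictionaries stand in for a KV store and a document store; none of this reflects how a particular product works internally.

```python
import sqlite3

# Relational model: entities live in tables, and a relationship that carries
# its own attribute ("since") needs a junction table with two foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE works_for (
        person_id INTEGER NOT NULL REFERENCES person(id),
        company_id INTEGER NOT NULL REFERENCES company(id),
        since INTEGER
    );
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO company VALUES (10, 'Acme');
    INSERT INTO works_for VALUES (1, 10, 2019), (2, 10, 2021);
    """
)
# Answering "who works for Acme?" requires joining the three tables.
colleagues = conn.execute(
    """
    SELECT p.name, w.since
    FROM person AS p
    JOIN works_for AS w ON w.person_id = p.id
    JOIN company AS c ON c.id = w.company_id
    WHERE c.name = 'Acme'
    ORDER BY p.name
    """
).fetchall()
conn.close()

# Key-value model: the value is opaque and can only be fetched by its key.
kv_store = {"/people/alice": "<html><body>Alice</body></html>"}
cached_page = kv_store["/people/alice"]

# Document model: no schema and no join; the employer is nested in each
# document, so the company information is duplicated.
documents = [
    {"name": "Alice", "employer": {"name": "Acme", "since": 2019}},
    {"name": "Bob", "employer": {"name": "Acme", "since": 2021}},
]
acme_staff = sorted(
    doc["name"] for doc in documents if doc["employer"]["name"] == "Acme"
)

print(colleagues)
print(acme_staff)
```

Notice that the relational version needs an extra table and two joins to express a single relationship, while the document version avoids the join at the cost of repeating the company in every document.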
Graph database
In the previous section, we talked about databases. Before discussing graph databases, let’s introduce the concept of graphs.
A graph is a mathematical object defined by the following:
- A set of vertices or nodes (the dots)
- A set of edges (the connections between these dots)
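This definition translates almost directly into code. The following is a minimal sketch, assuming an undirected graph whose node names (A to D) and helper function `neighbors` are purely illustrative:

```python
# A graph is a pair (V, E): a set of vertices and a set of edges.
vertices = {"A", "B", "C", "D"}

# Each undirected edge is an unordered pair of vertices, hence a frozenset.
edges = {
    frozenset(("A", "B")),
    frozenset(("B", "C")),
    frozenset(("C", "A")),
}

# Sanity check: every edge must connect vertices that belong to the graph.
assert all(edge <= vertices for edge in edges)


def neighbors(node, edges):
    """Return the set of nodes that share an edge with `node`."""
    return {
        other
        for edge in edges
        if node in edge
        for other in edge
        if other != node
    }


# The degree of a node is its number of neighbors; "D" is isolated.
degree = {v: len(neighbors(v, edges)) for v in vertices}
print(degree)
```

A directed graph could be sketched the same way by replacing each `frozenset` with an ordered `(source, target)` tuple.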
The following figure shows several examples of graphs, big and small:

Figure 1.1 – Representations of some graphs
As you can see, the figure shows a road network (in Europe), a computer network, and a social network. In practice, far more objects can be seen as graphs:
- Time series: Each observation is connected to the next one
- Images: Each pixel is linked to its eight neighbors (see the bottom-right picture in Figure 1.1)
- Texts: Here, each word is connected to its surrounding words or a more complex mapping, depending on its meaning (see the following figure):

Figure 1.2 – Figure generated with the spaCy Python library, which was able to identify the relationships between words in a sentence using NLP techniques
A graph can be seen as a generalization of these static representations, where links can be created with fewer constraints.
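As a rough illustration of how such regular structures map to graphs, the following sketch builds the edges of a time series and the neighborhood of a pixel. The function names are hypothetical, observations are identified by their index, and pixels by their (row, column) position; dropping out-of-bounds neighbors for border pixels is an assumption of this sketch.

```python
def time_series_edges(n_observations):
    """Connect each observation (identified by its index) to the next one."""
    return [(i, i + 1) for i in range(n_observations - 1)]


def pixel_neighbors(row, col, height, width):
    """Return the neighbors of a pixel in a height x width image.

    Interior pixels have eight neighbors; pixels on the border have fewer,
    because positions outside the image are skipped.
    """
    result = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # a pixel is not its own neighbor
            r, c = row + d_row, col + d_col
            if 0 <= r < height and 0 <= c < width:
                result.append((r, c))
    return result


print(time_series_edges(4))
print(len(pixel_neighbors(1, 1, 3, 3)))  # interior pixel of a 3 x 3 image
print(len(pixel_neighbors(0, 0, 3, 3)))  # corner pixel of a 3 x 3 image
```

In both cases, the set of possible links is fixed by the position of each element. A general graph removes that restriction: any node may be linked to any other.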
Another advantage of graphs is that they can be easily traversed, going from one node to another by following edges. They have been used for representing networks for a long time, with road networks or communication infrastructure as typical examples. Paths, and shortest paths in particular, have long been studied. But the analysis of graphs doesn’t stop here: much more information can be extracted from carefully analyzing a network, such as its structure (are there groups of nodes disconnected from the rest of the graph? Are groups of nodes more tied to each other than to other groups?) and node ranking (node importance). We will discuss these algorithms in more detail in Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.
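To give a feel for what traversing a graph means, here is a minimal sketch of a breadth-first search that finds a path with the fewest edges between two nodes. The `shortest_path` function and the toy `roads` network are illustrative assumptions; the graph is unweighted, so a real road network with distances would call for a weighted algorithm such as Dijkstra's instead.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return a path with the fewest edges from start to goal, or None.

    `adjacency` maps each node to the list of nodes it is connected to.
    """
    previous = {start: None}  # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


# Toy, undirected road network (each road is listed in both directions).
roads = {
    "Paris": ["Brussels", "Lyon"],
    "Brussels": ["Paris", "Amsterdam"],
    "Lyon": ["Paris", "Geneva"],
    "Amsterdam": ["Brussels", "Berlin"],
    "Geneva": ["Lyon"],
    "Berlin": ["Amsterdam"],
    "Reykjavik": [],  # disconnected from the rest of the graph
}

print(shortest_path(roads, "Paris", "Berlin"))
print(shortest_path(roads, "Paris", "Reykjavik"))
```

The second call returns `None` because no sequence of edges leads to the isolated node, which is a small instance of the structural question raised above about groups of nodes disconnected from the rest of the graph.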
So, we know what a database is and what a graph is. Now comes the natural question: what is a graph database? The answer is quite simple: in a graph database, data is stored in nodes, which can be connected through edges to model the relationships between them.
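As a very rough mental model, the idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Node` and `Relationship` classes, the labels, the relationship types, and the sample data are hypothetical, and this in-memory toy says nothing about how Neo4j or any other graph database actually stores data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str  # the kind of entity, for example "Person"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start_id: int
    end_id: int
    rel_type: str  # the kind of connection, for example "WORKS_FOR"
    properties: dict = field(default_factory=dict)


# Data lives in nodes...
nodes = {
    1: Node(1, "Person", {"name": "Alice"}),
    2: Node(2, "Person", {"name": "Bob"}),
    3: Node(3, "Company", {"name": "Acme"}),
}
# ...and edges between nodes model how the entities relate to each other.
relationships = [
    Relationship(1, 3, "WORKS_FOR", {"since": 2019}),
    Relationship(2, 3, "WORKS_FOR", {"since": 2021}),
    Relationship(1, 2, "KNOWS"),
]


def outgoing(node_id, rel_type):
    """Follow the relationships of one type that start at the given node."""
    return [
        nodes[rel.end_id]
        for rel in relationships
        if rel.start_id == node_id and rel.rel_type == rel_type
    ]


alice_employers = [n.properties["name"] for n in outgoing(1, "WORKS_FOR")]
alice_friends = [n.properties["name"] for n in outgoing(1, "KNOWS")]
print(alice_employers, alice_friends)
```

The point of the sketch is that relationships are first-class data: answering a question means following edges from a node rather than joining tables.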
At this stage, you may be wondering: OK, but where can I find graph data? While we are used to CSV or JSON datasets, graph formats are not yet common, which may be confusing at first. If you do not have graph data, why would you need a graph database? There are two possible answers to this question, both of which we are going to discuss.