Graph Data Science with Neo4j

Graph Data Science with Neo4j: Learn how to use Neo4j 5 with Graph Data Science library 2.0 and its Python driver for your project

By Estelle Scifo
Book Jan 2023 288 pages 1st Edition

Product Details


Publication date : Jan 31, 2023
Length 288 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781804612743

Introducing and Installing Neo4j

Graph databases in general, and Neo4j in particular, have gained increasing interest in the past few years. They provide a natural way of modeling entities and relationships and take into account observation context, which is often crucial to extract the most out of your data. Among the different graph database vendors, Neo4j has become one of the most popular for both data storage and analytics. A lot of tools have been developed by the company itself or the community to make the whole ecosystem consistent and easy to use: from storage to querying, to visualization to graph data science. As you will see through this book, there is a well-integrated application or plugin for each of these topics.

In this chapter, you will get to know what Neo4j is, positioning it in the broad context of databases. We will also introduce the aforementioned plugins that are used for graph data science.

Finally, you will set up your first Neo4j instance locally if you haven’t done so already and run your first Cypher queries to populate the database with some data and retrieve it.

In this chapter, we’re going to cover the following main topics:

  • What is a graph database?
  • Finding or creating a graph database
  • Neo4j in the graph databases landscape
  • Setting up Neo4j
  • Inserting data into Neo4j with Cypher, the Neo4j query language
  • Extracting data from Neo4j with Cypher pattern matching

Technical requirements

To follow this chapter well, you will need access to the following resources:

What is a graph database?

Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j is and how different it is from the data storage engine you are used to. In this section, we are going to discuss (quickly) the different types of databases you can find today, and why graph databases are so interesting and popular both for developers and data professionals.

Databases

Databases make up an important part of computer science. Discussing the evolution and state-of-the-art areas of the different implementations in detail would require several books like this one – fortunately, this is not a requirement to use such systems effectively. However, it is important to be aware of the existing tools related to data storage and how they differ from each other, to be able to choose the right tool for the right task. The fact that, after reading this book, you’ll be able to use graph databases and Neo4j in your data science project doesn’t mean you will have to use it every single time you start a new project, whatever the context is. Sometimes, it won’t be suitable; this introduction will explain why.

A database, in the context of computing, is a system that allows you to store and query data on a computer, phone, or, more generally, any electronic device.

As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:

  • Relational databases (SQL) such as MySQL or PostgreSQL. These store data as records in tables whose columns are attributes or fields and whose rows represent each entity. They have a predefined schema, defining how data is organized and the type of each field. Relationships between entities in this representation are modeled by foreign keys (requiring unique identifiers). When the relationship is more complex, such as when attributes are required or when we can have many relationships between the same objects, an intermediate junction (join) table is required.
  • NoSQL databases contain many different types of databases:
    • Key-value stores such as Redis or Riak. A key-value (KV) store, as the name suggests, is a simple lookup database where the key is usually a string, and the value can be a more complex object that can’t be used to filter the query – it can only be retrieved. They are known to be very efficient for caching in a web context, where the key is the page URL and the value is the HTML content of the page, which is dynamically generated. KV stores can also be used to model graphs when building a native graph engine is not an option. You can see KV stores in action in the following projects:
    • Document-oriented databases such as MongoDB or CouchDB. These are useful for storing schema-less documents (usually JSON (or a derivative) objects). They are much more flexible compared to relational databases, since each document may have different fields. However, relationships are harder to model, and such databases rely a lot on nested JSON and information duplication instead of joining multiple tables.
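The remark that a KV store can model a graph can be made concrete with a minimal sketch. It assumes a plain Python dict as a stand-in for a real key-value store, and the `neighbors:` key prefix is a made-up naming convention, not something any particular product prescribes:

```python
# A plain dict stands in for a key-value store (assumption: values can only
# be fetched by key, never filtered on).
kv_store = {}


def add_edge(store, source, target):
    """Record a directed edge in the adjacency list stored under the source key."""
    store.setdefault(f"neighbors:{source}", []).append(target)


def neighbors(store, node):
    """Following outgoing edges is a single key lookup."""
    return store.get(f"neighbors:{node}", [])


add_edge(kv_store, "A", "B")
add_edge(kv_store, "C", "A")

print(neighbors(kv_store, "A"))

# Asking "who points to A?" has no key to look up, so every entry is scanned.
incoming = [
    key.split(":", 1)[1]
    for key, targets in kv_store.items()
    if "A" in targets
]
print(incoming)
```

The second lookup hints at why such storage is called non-native: the structure of the graph is not reflected in how the data is laid out.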

The preceding list is non-exhaustive; other types of data stores have been created over time. Some were later abandoned, while others appeared only in the past few years, so we’ll need to wait to see how useful they can be. We can mention, for instance, vector databases, such as Weaviate, which are used to store data with their vector representations to ease searching in the vector space, with many applications in machine learning once a vector representation (embedding) of an observation has been computed.

Graph databases can also be classified as NoSQL databases. They bring another approach to the data storage landscape, especially in the data model phase.

Graph database

In the previous section, we talked about databases. Before discussing graph databases, let’s introduce the concept of graphs.

A graph is a mathematical object defined by the following:

  • A set of vertices or nodes (the dots)
  • A set of edges (the connections between these dots)

The following figure shows several examples of graphs, big and small:

Figure 1.1 – Representations of some graphs

As you can see, there’s a Road network (in Europe), a Computer network, and a Social network. But in practice, far more objects can be seen as graphs:

  • Time series: Each observation is connected to the next one
  • Images: Each pixel is linked to its eight neighbors (see the bottom-right picture in Figure 1.1)
  • Texts: Here, each word is connected to its surrounding words or a more complex mapping, depending on its meaning (see the following figure):
Figure 1.2 – Figure generated with the spacy Python library, which was able to identify the relationships between words in a sentence using NLP techniques

A graph can be seen as a generalization of these static representations, where links can be created with fewer constraints.
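As a rough illustration of the time series and image cases, the sketch below builds the edges of a time series graph and lists the neighbors of a pixel. The function names are hypothetical, and pixels on the image border simply get fewer than eight neighbors:

```python
def time_series_edges(n_observations):
    """Connect each observation to the next one: 0-1, 1-2, and so on."""
    return [(i, i + 1) for i in range(n_observations - 1)]


def pixel_neighbors(row, col, height, width):
    """Return the (up to eight) neighbors of a pixel inside the image bounds."""
    result = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if (d_row, d_col) == (0, 0):
                continue  # a pixel is not its own neighbor
            r, c = row + d_row, col + d_col
            if 0 <= r < height and 0 <= c < width:
                result.append((r, c))
    return result


print(time_series_edges(4))
print(len(pixel_neighbors(1, 1, 3, 3)))  # center pixel of a 3x3 image
print(len(pixel_neighbors(0, 0, 3, 3)))  # corner pixel of a 3x3 image
```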

Another advantage of graphs is that they can be easily traversed, going from one node to another by following edges. They have been used for representing networks for a long time – road networks or communication infrastructure, for instance. The concept of a path, especially the shortest path in a graph, is a long-studied field. But the analysis of graphs doesn’t stop here – much more information can be extracted from carefully analyzing a network, such as its structure (are there groups of nodes disconnected from the rest of the graph? Are groups of nodes more tied to each other than to other groups?) and node ranking (node importance). We will discuss these algorithms in more detail in Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.
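To give a feel for traversal, here is a minimal breadth-first search that returns one shortest path (in number of edges) between two nodes. It is a generic textbook sketch, not the implementation used by any Neo4j library, and the small road network is made up for the example:

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    previous = {start: None}  # remembers how each node was reached
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical undirected road network, stored as adjacency lists.
roads = {
    "Paris": ["London", "Rome"],
    "London": ["Paris", "Berlin"],
    "Rome": ["Paris", "Berlin"],
    "Berlin": ["London", "Rome"],
    "Oslo": [],
}

print(shortest_path(roads, "Paris", "Berlin"))
print(shortest_path(roads, "Paris", "Oslo"))
```

The second call returns None because Oslo is disconnected from the rest of this toy graph, which is exactly the kind of structural question mentioned above.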

So, we know what a database is and what a graph is. Now comes the natural question: what is a graph database? The answer is quite simple: in a graph database, data is saved into nodes, which can be connected through edges to model relationships between them.

At this stage, you may be wondering: OK, but where can I find graph data? While we are used to CSV or JSON datasets, graph formats are not yet common and may be unfamiliar to some of you. If you do not have graph data, why would you need a graph database? There are two possible answers to this question, both of which we are going to discuss.

Finding or creating a graph database

Data scientists know how to find or generate datasets that fit their needs. Randomly generating a variable distribution while following some probabilistic law is one of the first things you’ll learn in a statistics course. Similarly, graph datasets can be randomly generated, following some rules. However, this book is not a graph theory book, so we are not going to dig into these details here. Just be aware that this can be done. Please refer to the references in the Further reading section to learn more.

Regarding existing datasets, some of them are very popular and data scientists know about them because they have used them while learning data science and/or because they are the topic of well-known Kaggle competitions. Think, for instance, about the Titanic or house price datasets. Other datasets are also used for model benchmarking, such as the MNIST or ImageNet datasets in computer vision tasks.

The same holds for graph data science, where some datasets are very common for teaching or benchmarking purposes. If you investigate graph theory, you will read about the Zachary’s karate club (ZKC) dataset, which is probably one of the most famous graph datasets out there (side note: there is even a ZKC trophy, which is awarded to the first person in a graph conference that mentions this dataset). The ZKC dataset is very simple (34 nodes; we’ll see how to characterize a graph dataset in Chapter 3, Characterizing a Graph Dataset, and Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset), but bigger and more complex datasets are also available.

There are websites referencing graph datasets, which can be used for benchmarking in a research context or educational purpose, such as this book. Two of the most popular ones are the following:

  • The Stanford Network Analysis Project (SNAP) (https://snap.stanford.edu/data/index.html) lists different types of networks in different categories (social networks, citation networks, and so on)
  • The Network Repository Project, via its website at https://networkrepository.com/index.php, provides hundreds of graph datasets from real-world examples, classified into categories (for example, biology, economics, recommendations, road, and so on)

If you browse these websites and start downloading some of the files, you’ll notice the data comes in unfamiliar formats. We’re going to list some of them next.

A note about the graph dataset’s format

The datasets we are used to are mainly exchanged as CSV or JSON files. To represent a graph, with nodes on one side and edges on the other, several specific formats have been devised.

The main data formats that are used to save graph data as text files are the following:

  • Edge list: This is a text file where each row contains an edge definition. For instance, a graph with three nodes (A, B, C) and two edges (A-B and C-A) is defined by the following edgelist file:
    A B
    C A
  • Matrix Market (with the .mtx extension): This format is an extension of the previous one. It is quite frequent on the network repository website.
  • Adjacency matrix: The adjacency matrix is an NxN matrix (where N is the number of nodes in the graph) where the ij element is 1 if there is an edge from node i to node j and 0 otherwise (for an undirected graph, the matrix is symmetric). The adjacency matrix of the simple graph with three nodes and two edges, read here as directed edges from A to B and from C to A, is a 3x3 matrix, as shown in the following code block. I have explicitly displayed the row and column names only for convenience, to help you identify what i and j are:
      A B C
    A 0 1 0
    B 0 0 0
    C 1 0 0
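The link between the edge list and the adjacency matrix can be sketched in a few lines of Python. The snippet assumes each edge list row is read as a directed edge from the first node to the second, and that nodes are ordered alphabetically in the matrix:

```python
edge_list_text = """A B
C A"""

# One (source, target) tuple per non-empty row of the edge list.
edges = [tuple(line.split()) for line in edge_list_text.splitlines() if line.strip()]

# Collect the distinct nodes and give each one a row/column position.
nodes = sorted({node for edge in edges for node in edge})
index = {node: position for position, node in enumerate(nodes)}

# Start from an all-zero NxN matrix and set a 1 for every edge.
matrix = [[0] * len(nodes) for _ in nodes]
for source, target in edges:
    matrix[index[source]][index[target]] = 1

for node, row in zip(nodes, matrix):
    print(node, *row)
```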

Note

The adjacency matrix is one way to vectorize a graph. We’ll come back to this topic in Chapter 7, Automatically Extracting Features with Graph Embeddings for Machine Learning.

  • GraphML: Derived from XML, the GraphML format is much more verbose but lets us define more complex graphs, especially those where nodes and/or edges carry properties. The following example uses the preceding graph but adds a name property to nodes and a length property to edges:
    <?xml version='1.0' encoding='utf-8'?>
    <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd"
    >
        <!-- DEFINING PROPERTY NAME WITH TYPE AND ID -->
        <key attr.name="name" attr.type="string" for="node" id="d1"/>
        <key attr.name="length" attr.type="double" for="edge" id="d2"/>
        <graph edgedefault="directed">
           <!-- DEFINING NODES -->
           <node id="A">
                 <!-- SETTING NODE PROPERTY -->
                <data key="d1">"Point A"</data>
            </node>
            <node id="B">
                <data key="d1">"Point B"</data>
            </node>
            <node id="C">
                <data key="d1">"Point C"</data>
            </node>
            <!-- DEFINING EDGES
            with source and target nodes and properties
        -->
            <edge id="AB" source="A" target="B">
                <data key="d2">123.45</data>
            </edge>
            <edge id="CA" source="C" target="A">
                <data key="d2">56.78</data>
            </edge>
        </graph>
    </graphml>
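A GraphML file like this one can be read with Python’s standard library alone. The following sketch works on a condensed copy of the preceding document (schema attributes and comments removed, property values written without the inner quotes); note that all values come back as strings, so the declared attr.type still has to be applied by the caller:

```python
import xml.etree.ElementTree as ET

GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key attr.name="name" attr.type="string" for="node" id="d1"/>
  <key attr.name="length" attr.type="double" for="edge" id="d2"/>
  <graph edgedefault="directed">
    <node id="A"><data key="d1">Point A</data></node>
    <node id="B"><data key="d1">Point B</data></node>
    <node id="C"><data key="d1">Point C</data></node>
    <edge id="AB" source="A" target="B"><data key="d2">123.45</data></edge>
    <edge id="CA" source="C" target="A"><data key="d2">56.78</data></edge>
  </graph>
</graphml>"""

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(GRAPHML)

# Map key ids (d1, d2) to the property names they declare.
key_names = {key.get("id"): key.get("attr.name") for key in root.findall("g:key", NS)}


def properties(element):
    """Collect the <data> children of a node or edge as a name -> value dict."""
    return {key_names[data.get("key")]: data.text for data in element.findall("g:data", NS)}


nodes = {node.get("id"): properties(node) for node in root.findall("g:graph/g:node", NS)}
edges = [
    (edge.get("source"), edge.get("target"), properties(edge))
    for edge in root.findall("g:graph/g:edge", NS)
]

print(nodes)
print(edges)
```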

If you find a dataset already formatted as a graph, it is likely to be using one of the preceding formats. However, most of the time, you will want to use your own data, which is not yet in graph format – it might be stored in the previously described databases or CSV or JSON files. If that is the case, then the next section is for you! There, you will learn how to transform your data into a graph.

Modeling your data as a graph

The second answer to the main question in this section is: your data is probably a graph, without you being aware of it yet. We will elaborate on this topic in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph), but let me give you a quick overview.

Let’s take the example of an e-commerce website, which has customers (users) and products. As in every e-commerce website, users can place orders to buy some products. In the relational world, the data schema that’s traditionally used to represent such a scenario is represented on the left-hand side of the following screenshot:

Figure 1.3 – Modeling e-commerce data as a graph

The relational data model works as follows:

  • A table is created to store users, with a unique identifier (id) and a username (apart from security and personal information required for such a website, you can easily imagine how to add columns to this table).
  • Another table contains the data about the available products.
  • Each time a customer places an order, a new row is added to an order table, referencing the user by its ID (a foreign key with a one-to-many relationship, where a user can place many orders).
  • To remember which products were part of which orders, a many-to-many relationship is created (an order contains many products and a product is part of many orders). We usually create a relationship table, linking orders to products (the order product table, in our example).

Note

Please refer to the colored version of the preceding figure, which can be found in the graphics bundle link provided in the Preface, for a better understanding of the correspondence between the two sides of the figure.

In a graph database, all the _id columns are replaced by actual relationships, which are real entities with graph databases, not just conceptual ones like in the relational model. You can also get rid of the order product table since information specific to a product in a given order such as the ordered quantity can be stored directly in the relationship between the order and the product node. The data model is much more natural and easier to document and present to other people on your team.
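The transformation described above can be sketched with plain Python data structures. The rows, the PLACED and CONTAINS relationship types, and the tuple layout are all made up for illustration; Chapter 2 covers how to do this with real data:

```python
# Hypothetical rows, one list per relational table.
users = [{"id": 1, "username": "alice"}]
products = [{"id": 10, "name": "book"}, {"id": 11, "name": "pen"}]
orders = [{"id": 100, "user_id": 1}]
order_product = [
    {"order_id": 100, "product_id": 10, "quantity": 1},
    {"order_id": 100, "product_id": 11, "quantity": 3},
]

# Every row of an entity table becomes a node; foreign key columns are dropped.
nodes = {}
for label, rows in (("User", users), ("Product", products), ("Order", orders)):
    for row in rows:
        props = {k: v for k, v in row.items() if not k.endswith("_id")}
        nodes[(label, row["id"])] = props

# Foreign keys and junction table rows become relationships:
# (start node, type, end node, properties).
relationships = []
for order in orders:
    relationships.append((("User", order["user_id"]), "PLACED", ("Order", order["id"]), {}))
for line in order_product:
    relationships.append(
        (
            ("Order", line["order_id"]),
            "CONTAINS",
            ("Product", line["product_id"]),
            {"quantity": line["quantity"]},  # stored on the relationship itself
        )
    )

print(len(nodes), "nodes,", len(relationships), "relationships")
```

The junction table disappears entirely: its only payload, the ordered quantity, ends up as a relationship property.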

Now that we have a better understanding of what a graph database is, let’s explore the different implementations out there. Like the other types of databases, there is no single implementation for graph databases, and several projects provide graph database functionalities.

In the next section, we are going to discuss some of the differences between them, and where Neo4j is positioned in this technology landscape.

Neo4j in the graph databases landscape

Even when restricting the scope to graph databases, there are still different ways to envision such data stores:

  • Resource description framework (RDF): Each record is a triplet of the Subject Predicate Object type. This is a complex vocabulary that expresses a relationship of a certain type (the predicate) between a subject and an object; for instance:
    Alice(Subject) KNOWS(Predicate) Bob(Object)

Very famous knowledge bases such as DBpedia and Wikidata use the RDF format. We will talk about this a bit more in the next chapter (Chapter 2, Using Existing Data to Build a Knowledge Graph).

  • Labeled-property graph (LPG): A labeled-property graph contains nodes and relationships. Both of these entities can be labeled (for instance, Alice and Bob are nodes with the Person label, and the relationship between them has the KNOWS label) and have properties (people have names; an acquaintance relationship can contain the date when both people first met as a property).
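To contrast the two models, the sketch below represents the same fact first as a plain triple and then as labeled nodes and a relationship carrying properties. The Node and Relationship classes are hypothetical helpers written for this example, not part of any Neo4j API:

```python
from dataclasses import dataclass, field

# RDF view: one record is a (subject, predicate, object) triple.
triples = [("Alice", "KNOWS", "Bob")]


# LPG view: nodes and relationships both carry a label/type and properties.
@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)


alice = Node("Person", {"name": "Alice"})
bob = Node("Person", {"name": "Bob"})
knows = Relationship(alice, "KNOWS", bob, {"since": "2022-12-01"})

# The plain triple has exactly three slots, so there is no place in it for
# the date; in the LPG model it sits directly on the relationship.
print(triples[0])
print(knows.type, knows.properties)
```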

Neo4j is a labeled-property graph database. And even there, just as MySQL, PostgreSQL, and Microsoft SQL Server are all relational databases, you will find different vendors offering LPG databases. They differ in many aspects:

  • Whether they use a native graph engine or not: As we discussed earlier, it is possible to use a KV store or even a SQL database to store graph data. In this case, we’re talking about non-native storage engines since the storage does not reflect the graphical nature of the data.
  • The query language: Unlike SQL, the query language to deal with graph data has not yet been standardized, even if there is an ongoing effort being led by the GQL group (see, for instance, https://gql.today/). Neo4j uses Cypher, a declarative query language developed by the company in 2011 and then open-sourced in the openCypher project, allowing other databases to use the same language (see, for instance, RedisGraph or Amazon Neptune). Other vendors have created their own languages (AQL for ArangoDB or GSQL for TigerGraph, for instance). To me, this is a key point to take into account since the learning curve can be very different from one language to another. Cypher has the advantage of being very intuitive, and a few minutes are enough to start writing your own queries.
  • Their (integrated or not) support for graph analytics and data science.

A note about performance

Almost every vendor claims to be the best one, at least in some aspects. This book won’t create another debate about that. The best option, if performance is crucial for your application, is to test the candidates with a scenario close to your final goal in terms of data volume and the type of queries/analysis.

Neo4j ecosystem

The Neo4j database is already very helpful by itself, but the number of extensions, libraries, and applications related to it makes it a very complete solution. In addition, it has a very active community of members always keen to help each other, which is one of the reasons to choose it.

The core Neo4j database capabilities can be extended thanks to some plugins. Awesome Procedures on Cypher (APOC), a common Neo4j extension, contains some procedures that can extend the database and Cypher capabilities. We will use it later in this book to load JSON data.

The main plugin we will explore in this book is the Graph Data Science Library. Its predecessor, the Graph Algorithm Library, was first released in 2018 by the Neo4j lab team. It was quickly replaced by the Graph Data Science Library, a fully production-ready plugin, with improved performance. Algorithms are improved and added regularly. Version 2.0, released in 2022, takes graph data science even further, allowing us to train models and build analysis pipelines directly from the library. It also comes with a handy Python client, which is very convenient for including graph algorithms into your usual machine learning processes, whether you use scikit-learn or other machine learning libraries such as TensorFlow or PyTorch.

Besides the plugins, there are also lots of applications out there to help us deal with Neo4j and explore the data it contains. The first application we will use is Neo4j Desktop, which lets us manage several Neo4j databases. Continue reading to learn how to use it. Neo4j Desktop also lets you manage your installed plugins and applications.

Applications installed into Neo4j Desktop are granted access to your active database. While reading this book, you will use the following:

  • Neo4j Browser: A simple but powerful application that lets you write Cypher queries and visualize the result as a graph, table, or JSON:
Figure 1.4 – Neo4j Browser

  • Neo4j Bloom: A graph visualization application in which you can customize node styles (size, color, and so on) based on their labels and/or properties:
Figure 1.5 – Neo4j Bloom

  • Neodash: This is a dashboard application that allows us to draw plots from the data stored in Neo4j, without having to extract this data into a DataFrame first. Plots can be organized into nice dashboards that can be shared with other users:
Figure 1.6 – Neodash

This list of applications is non-exhaustive. You can find out more here: https://install.graphapp.io/.

Good to know

You can create your own graph application to be run within Neo4j Desktop. This is why there are so many diverse applications, some of which are being developed by community members or Neo4j partners.

This section described Neo4j as a database and the various extensions that can be added to it to make it more powerful. Now, it is time to start using it. In the following section, you are going to install Neo4j locally on your computer so that you can run the code examples provided in this book (which you are highly encouraged to do!).

Setting up Neo4j

There are several ways to use Neo4j:

  • Through short-lived sandboxes in the cloud, which are perfect for experimenting
  • Locally, with Neo4j Desktop
  • Locally, with Neo4j binaries
  • Locally, with Docker
  • In the cloud, with Neo4j Aura (free plan available) or Neo4j AuraDS

For the scope of this book, we will use the Neo4j Desktop option, since this application takes care of many things for us and we do not want to go into server management at this stage.

Downloading and starting Neo4j Desktop

The easiest way to use Neo4j on your local computer when you are in the experimentation phase is to use the Neo4j Desktop application, which is available on Windows, macOS, and Linux. This user interface lets you create Neo4j databases, which are organized into Projects, manage the installed plugins and applications, and update the DB configuration, among other things.

Installing it is super easy: go to the Neo4j download center and follow the instructions. We recap the steps here, with screenshots to guide you through the process:

  1. Visit the Neo4j download center at https://neo4j.com/download-center/. At the time of writing, the website looks like this:
Figure 1.7 – Neo4j Download Center

  2. Click the Download Neo4j Desktop button at the top of the page.
  3. Fill in the form that’s asking for some information about yourself (name, email, company, and so on).
  4. Click Download Desktop.
  5. Save the activation key that is displayed on the next page. It will look something like this (this one won’t work, so don’t copy it!):
    eyJhbGciOiJQUzI1NiIsInR5cCI6IkpXVCJ9.eyJlbWFpbCI6InN0ZWxsYTBvdWhAZ21haWwuY29tIiwibWl4cGFuZWxJZ CI6Imdvb2dsZS1vYXV0a
    ...
    ...

The following steps depend on your operating system:

  • On Windows, locate the installer, double-click on it, and follow the steps provided.
  • On Mac, just click on the downloaded file.
  • On Linux, you’ll have to make the downloaded file executable before running it. More instructions will be provided next.

For Linux users, here is how to proceed:

  1. When the download is over (this can take some time since the file is a few hundred MBs), open a Terminal and go to your download directory:
    # update path depending on your system
    $ cd Downloads/
  2. Then, run the following command, which will extract the version and architecture name from the AppImage file you’ve just downloaded:
    $ DESKTOP_VERSION=$(ls -tr neo4j-desktop*.AppImage | tail -1 | grep -Po "(?<=neo4j-desktop-).*(?=AppImage)")
    $ echo ${DESKTOP_VERSION}
  3. If the preceding echo command shows something like 1.4.11-x86_64., you’re good to go. Alternatively, you can identify the pattern yourself and create the variable, like so:
    $ DESKTOP_VERSION=1.4.11-x86_64.  # include the final dot
  4. Then, you need to make the file executable with chmod and run the application:
    # make file executable:
    $ chmod +x neo4j-desktop-${DESKTOP_VERSION}AppImage
    # run the application:
    $ ./neo4j-desktop-${DESKTOP_VERSION}AppImage

The last command in the preceding code snippet starts the Neo4j Desktop application. The first time you run the application, it will ask you for the activation key you saved when downloading the executable. And that’s it – the application will be running, which means we can start creating Neo4j databases and interact with them.
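If you prefer to check the version extraction outside of the shell, the same idea can be sketched in Python. The desktop_version helper is hypothetical, and the file name below only mirrors the 1.4.11-x86_64. pattern mentioned earlier; adapt it to the file you actually downloaded:

```python
import re


def desktop_version(filename):
    """Return the part between 'neo4j-desktop-' and 'AppImage', final dot included."""
    match = re.fullmatch(r"neo4j-desktop-(.+\.)AppImage", filename)
    return match.group(1) if match else None


print(desktop_version("neo4j-desktop-1.4.11-x86_64.AppImage"))
print(desktop_version("some-other-file.txt"))
```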

Creating our first Neo4j database

Creating a new database with Neo4j Desktop is quite straightforward:

  1. Start the Neo4j Desktop application.
  2. Click on the Add button in the top-right corner of the screen.
  3. Select Local DBMS.

This process is illustrated in the following screenshot:

Figure 1.8 – Adding a new database with Neo4j Desktop

  4. The next step is to choose a name, a password, and the version of your database.

Note

Save the password in a safe place; you’ll need to provide it to drivers and applications when connecting to this database.

  5. It is good practice to always choose the latest available version; Neo4j Desktop takes care of checking which version it is. The following screenshot shows this step:
Figure 1.9 – Choosing a name, password, and version for your new database

  6. Next, just click Create, and wait for the database to be created. If the latest Neo4j version needs to be downloaded, it can take some time, depending on your connection.
  7. Finally, you can start your database by clicking on the Start button that appears when you hover over your new database name, as shown in the following screenshot:
Figure 1.10 – Starting your newly created database

Note

You can’t have two databases running at the same time. If you start a new database while another is still running, the previous one must be stopped before the new one can be started.

You now have Neo4j Desktop installed and a running instance of Neo4j on your local computer. At this point, you are ready to start playing with graph data. Before moving on, let me introduce Neo4j Aura, which is an alternative way to quickly get started with Neo4j.

Creating a database in the cloud – Neo4j Aura

Neo4j also has a DB-as-a-service component called Aura. It lets you create a Neo4j database hosted in the cloud (either on Google Cloud Platform or Amazon Web Services, your choice) and is fully managed – there’s no need to worry about updates anymore. This service is entirely free up to a certain database size (50k nodes and 150k relationships), which makes it sufficient for experimenting with it. To create a database in Neo4j Aura, visit https://neo4j.com/cloud/platform/aura-graph-database/.

The following screenshot shows an example of a Neo4j database running in the cloud thanks to the Aura service:

Figure 1.11 – Neo4j Aura dashboard with a free-tier instance

Clicking Explore opens Neo4j Bloom, which we will cover in Chapter 3, Characterizing a Graph Dataset, while clicking Query starts Neo4j Browser in a new tab. You’ll be requested to enter the connection information for your database. The URL can be found in the previous screenshot – the username and password are the ones you set when creating the instance.

In the rest of this book, examples will be provided using a local database managed with the Neo4j Desktop application, but you are free to use whatever technique you prefer. However, note that some minor changes are to be expected if you choose something different, such as directory location or plugin installation method. In the latter case, always refer to the plugin or application documentation to find out the proper instructions.

Now that our first database is ready, it is time to insert some data into it. For this, we will use our first Cypher queries.

Inserting data into Neo4j with Cypher, the Neo4j query language

Cypher, as we discussed earlier in this chapter, is the query language developed by Neo4j. It is also used by other graph databases, such as RedisGraph.

First, let’s create some nodes in our newly created database.

To do so, open Neo4j Browser by clicking on the Open button next to your database and selecting Neo4j Browser:

Figure 1.12 – Start the Neo4j Browser application from Neo4j Desktop

From there, you can start writing Cypher queries in the upper text area.

Let’s start by creating some nodes with the following Cypher query:

CREATE (:User {name: "Alice", birthPlace: "Paris"})
CREATE (:User {name: "Bob", birthPlace: "London"})
CREATE (:User {name: "Carol", birthPlace: "London"})
CREATE (:User {name: "Dave", birthPlace: "London"})
CREATE (:User {name: "Eve", birthPlace: "Rome"})

Before running the query, let me detail its syntax:

Figure 1.13 – Anatomy of a node creation Cypher statement

Note that all of these components except for the parentheses are optional. You can create a node with no label and no properties with CREATE (), even if creating an empty record wouldn’t be really useful for data storage purposes.

Tips

You can copy and paste the preceding query and execute it as-is; multiple line queries are allowed by default in Neo4j Browser.

If the upper text area is not large enough, press the Esc key to maximize it.

Now that we’ve created some nodes and since we are dealing with a graph database, it is time to learn how to connect these nodes by creating edges, or, in Neo4j language, relationships.

The following code snippet starts by fetching the start and end nodes (Alice and Bob), then creates a relationship between them. The created relationship is of the KNOWS type and carries one property (the date Alice and Bob met):

MATCH (alice:User {name: "Alice"})
MATCH (bob:User {name: "Bob"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)
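
The since property above is stored as a plain string. Cypher also has a native date type, built with the date() function, which supports component access and comparisons. The following sketch only evaluates expressions and does not modify the graph; the since, sinceYear, and beforeNewYear names are arbitrary:

```cypher
// Build a temporal value from an ISO 8601 string
WITH date("2022-12-01") AS since
RETURN since,
       since.year AS sinceYear,
       since < date("2023-01-01") AS beforeNewYear
```

If you prefer temporal values over strings, the same date("2022-12-01") call could be used in place of the string literal when creating the relationship.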

We could also have put all our CREATE statements into one big query, for instance, by binding the created nodes to variables (alice and bob here) and reusing them in the relationship pattern:

CREATE (alice:User {name: "Alice", birthPlace: "Paris"})
CREATE (bob:User {name: "Bob", birthPlace: "London"})
CREATE (alice)-[:KNOWS {since: "2022-12-01"}]->(bob)
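
Keep in mind that CREATE never checks for existing data: running this query on a database that already contains Alice and Bob adds a second copy of each. One possible way to avoid this is the MERGE clause, which matches a pattern if it exists and creates it otherwise. The following is a minimal sketch, assuming that the name property is enough to identify a user:

```cypher
// Reuse the Alice node if it exists, create it otherwise
MERGE (alice:User {name: "Alice"})
  ON CREATE SET alice.birthPlace = "Paris"
// Same for Bob
MERGE (bob:User {name: "Bob"})
  ON CREATE SET bob.birthPlace = "London"
// Create the KNOWS relationship only if it is not already there
MERGE (alice)-[k:KNOWS]->(bob)
  ON CREATE SET k.since = "2022-12-01"
```

This query can be run several times without duplicating nodes or relationships.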

Note

In Neo4j, relationships are directed, meaning you have to specify a direction when creating them, which we do with the arrowhead (>) in the relationship pattern. However, Cypher lets you select data regardless of the relationship’s direction. We’ll discuss this when appropriate in the subsequent chapters.
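
As a brief preview, and assuming the KNOWS relationship from Alice to Bob created earlier, a pattern written without an arrowhead matches the relationship in either direction:

```cypher
// No arrowhead: KNOWS is matched whichever way it points
MATCH (bob:User {name: "Bob"})-[:KNOWS]-(u:User)
RETURN u.name
```

Here, Alice is found starting from Bob, even though the relationship was created from Alice to Bob. With (bob)-[:KNOWS]->(u), nothing would be returned for this dataset.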

Inserting data into the database is one thing, but without the ability to query and retrieve this data, databases would be useless. In the next section, we are going to use Cypher’s powerful pattern matching to read data from Neo4j.

Extracting data from Neo4j with Cypher pattern matching

So far, we have put some data in Neo4j and explored it with Neo4j Browser. But unsurprisingly, Cypher also lets you select and return data programmatically. This is what is called pattern matching in the context of graphs.

Let’s analyze such a pattern:

MATCH (usr:User {birthPlace: "London"})
RETURN usr.name, usr.birthPlace

Here, we are selecting nodes with the User label while filtering for nodes whose birthPlace property is equal to London. The RETURN clause asks Neo4j to only return the name and birthPlace properties of the matched nodes. The result of the preceding query, based on the data created earlier, is as follows:

╒══════════╤════════════════╕
│"usr.name"│"usr.birthPlace"│
╞══════════╪════════════════╡
│"Bob"     │"London"        │
├──────────┼────────────────┤
│"Carol"   │"London"        │
├──────────┼────────────────┤
│"Dave"    │"London"        │
└──────────┴────────────────┘
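
The same filter can also be written with an explicit WHERE clause, which becomes useful for conditions other than equality. The following sketch is equivalent to the query above and additionally adds ORDER BY, since Cypher does not guarantee the order of results without it:

```cypher
MATCH (usr:User)
// Equivalent to the inline {birthPlace: "London"} filter
WHERE usr.birthPlace = "London"
RETURN usr.name, usr.birthPlace
ORDER BY usr.name
```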

This is a simple MATCH statement, but most of the time, you’ll need to traverse the graph somehow to explore relationships. This is where Cypher is very convenient. You can write queries with an easy-to-remember syntax, close to the one you would use when drafting your query on a piece of paper. As an example, let’s find the users known by Alice, and return their names:

MATCH (alice:User {name: "Alice"})-[:KNOWS]->(u:User)
RETURN u.name

The pattern in the MATCH clause of the preceding query, (alice:User {name: "Alice"})-[:KNOWS]->(u:User), is a graph traversal. From the node(s) with the User label and the name Alice, we traverse the graph toward another node through a relationship of the KNOWS type. In our toy dataset, there is only one matching node, Bob, since Alice has a single relationship of this type.
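
Relationships can be bound to a variable too, which gives access to their properties. As a sketch, the following variation returns the since property stored on the KNOWS relationship along with the name of the known user; the k variable and the friend and since column aliases are arbitrary:

```cypher
// k is bound to each traversed KNOWS relationship
MATCH (alice:User {name: "Alice"})-[k:KNOWS]->(u:User)
RETURN u.name AS friend, k.since AS since
```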

Note

In our example, we are using a single-node label and relationship type. You are encouraged to experiment by adding more data types. For instance, create some nodes with the Product label and relationships of the SELLS/BUYS type between users and products to build more complex queries.
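
One possible starting point for this experiment is sketched below. The Notebook product and the choice of Alice as the seller and Bob as the buyer are made up for illustration; adapt them freely. The two statements are separated by a semicolon and can also be run one at a time:

```cypher
// Statement 1: create a product and connect it to two existing users
MATCH (alice:User {name: "Alice"})
MATCH (bob:User {name: "Bob"})
CREATE (p:Product {name: "Notebook"})
CREATE (alice)-[:SELLS]->(p)
CREATE (bob)-[:BUYS]->(p);

// Statement 2: find who buys which product from whom
MATCH (seller:User)-[:SELLS]->(p:Product)<-[:BUYS]-(buyer:User)
RETURN seller.name, p.name, buyer.name
```

The second statement traverses two relationship types in a single pattern, with the Product node in the middle.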

Summary

In this chapter, you learned about the specificities of graph databases and started to learn about Neo4j and the tools around it. You now know a lot more about the Neo4j ecosystem, including plugins such as APOC and the Graph Data Science (GDS) library, and graph applications such as Neo4j Browser and NeoDash. You installed Neo4j on your computer and created your first graph database. Finally, you created your first nodes and relationships and built your first Cypher MATCH statement to extract data from Neo4j.

At this point, you are ready for the next chapter, which will teach you how to import data from various data sources into Neo4j, using built-in tools and the common APOC library.

Further reading

If you want to explore the concepts described in this chapter in more detail, please refer to the following references:

  • Graph Databases, The Definitive Book of Graph Databases, by I. Robinson, J. Webber, and E. Eifrem (O’Reilly). The authors, among whom is Emil Eifrem, the CEO of Neo4j (formerly Neo Technology), explain graph databases and graph data modeling, also covering the internal implementation. Very instructive!
  • Learning Neo4j 3.x - Second Edition, by J. Baton and R. Van Bruggen. Even if written for an older version of Neo4j, most of the concepts it describes are still valid – the newer Neo4j versions have mostly added new features such as clustering for scalability, without breaking changes.
  • The openCypher project (https://opencypher.org/) and GQL specification (https://www.gqlstandards.org/) to learn about graph query languages beyond Cypher.

Exercises

To make sure you fully understand the content described in this chapter, you are encouraged to think about the following exercises before moving on:

  1. Which information do you need to define a graph?
  2. Do you need a graph dataset to start using a graph database?
  3. True or false:
    1. Neo4j can only be started with Neo4j Desktop.
    2. The application to use to create dashboards from Neo4j data is Neo4j Browser.
    3. Graph data science is supported by default by Neo4j.
  4. Are the following Cypher queries valid, and why or why not? What do they do?
    1. MATCH (x:User) RETURN x.name
    2. MATCH (x:User) RETURN x
    3. MATCH (:User) RETURN x.name
    4. MATCH (x:User)-[k:KNOWS]->(y:User) RETURN x, k, y
    5. MATCH (x:User)-[:KNOWS]-(y) RETURN x, y
  5. Create more data (other node labels/relationship types) and queries.

Key benefits

  • Extract meaningful information from graph data with Neo4j's latest version 5
  • Integrate graph algorithms into a regular machine learning pipeline in Python
  • Learn the core principles of the Graph Data Science library to make predictions and create data science pipelines

Description

Neo4j, along with its Graph Data Science (GDS) library, is a complete solution to store, query, and analyze graph data. As graph databases are getting more popular among developers, data scientists are likely to face such databases in their career, making it an indispensable skill to work with graph algorithms for extracting context information and improving the overall model prediction performance. Data scientists working with Python will be able to put their knowledge to work with this practical guide to Neo4j and the GDS library that offers step-by-step explanations of essential concepts and practical instructions for implementing data science techniques on graph data using the latest Neo4j version 5 and its associated libraries. You’ll start by querying Neo4j with Cypher and learn how to characterize graph datasets. As you get the hang of running graph algorithms on graph data stored in Neo4j, you’ll understand the new and advanced capabilities of the GDS library that enable you to make predictions and write data science pipelines. Using the newly released GDS Python driver, you’ll be able to integrate graph algorithms into your ML pipeline. By the end of this book, you’ll be able to take advantage of the relationships in your dataset to improve your current model and make other types of elaborate predictions.

What you will learn

  • Use the Cypher query language to query graph databases such as Neo4j
  • Build graph datasets from your own data and public knowledge graphs
  • Make graph-specific predictions such as link prediction
  • Explore the latest version of Neo4j to build a graph data science pipeline
  • Run a scikit-learn prediction algorithm with graph data
  • Train a predictive embedding algorithm in GDS and manage the model store

Table of Contents

16 Chapters
Preface
Part 1 – Creating Graph Data in Neo4j
Chapter 1: Introducing and Installing Neo4j
Chapter 2: Importing Data into Neo4j to Build a Knowledge Graph
Part 2 – Exploring and Characterizing Graph Data with Neo4j
Chapter 3: Characterizing a Graph Dataset
Chapter 4: Using Graph Algorithms to Characterize a Graph Dataset
Chapter 5: Visualizing Graph Data
Part 3 – Making Predictions on a Graph
Chapter 6: Building a Machine Learning Model with Graph Features
Chapter 7: Automatically Extracting Features with Graph Embeddings for Machine Learning
Chapter 8: Building a GDS Pipeline for Node Classification Model Training
Chapter 9: Predicting Future Edges
Chapter 10: Writing Your Custom Graph Algorithms with the Pregel API in Java
Index
Other Books You May Enjoy
