Home Data Hands-On Graph Analytics with Neo4j

Hands-On Graph Analytics with Neo4j

By Estelle Scifo
books-svg-icon Book
eBook $35.99 $24.99
Print $48.99
Subscription $15.99 $10 p/m for three months
$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
eBook $35.99 $24.99
Print $48.99
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
  1. Free Chapter
    Graph Databases
About this book
Neo4j is a graph database that includes plugins to run complex graph algorithms. The book starts with an introduction to the basics of graph analytics, the Cypher query language, and graph architecture components, and helps you to understand why enterprises have started to adopt graph analytics within their organizations. You’ll find out how to implement Neo4j algorithms and techniques and explore various graph analytics methods to reveal complex relationships in your data. You’ll be able to implement graph analytics catering to different domains such as fraud detection, graph-based search, recommendation systems, social networking, and data management. You’ll also learn how to store data in graph databases and extract valuable insights from it. As you become well-versed with the techniques, you’ll discover graph machine learning in order to address simple to complex challenges using Neo4j. You will also understand how to use graph data in a machine learning model in order to make predictions based on your data. Finally, you’ll get to grips with structuring a web application for production using Neo4j. By the end of this book, you’ll not only be able to harness the power of graphs to handle a broad range of problem areas, but you’ll also have learned how to use Neo4j efficiently to identify complex relationships in your data.
Publication date:
August 2020
Publisher
Packt
Pages
510
ISBN
9781839212611

 
Graph Databases

Graph databases have gained increasing attention in the last few years. Data models built from graphs bring together the simplicity of document-oriented databases and the clarity of SQL tables. Among others, Neo4j is a database that comes with a large ecosystem, including the database, but also tools to build web applications, such as the GRANDstack, and tools to use graph data in a machine learning pipeline, as well as the Graph Data Science Library. This book will discuss those tools, but let's first start from the beginning.

Talking about graph databases means talking about graphs. Even if you do not need to know all the details about graph theory, it’s always a good idea to learn some of the basic concepts underlying the tool you are using. In this chapter, we will start by defining graphs and giving some simple and less simple examples of graphs and their applications. We will then see how to move from the well-known SQL tables to graph data modeling. We’ll conclude by introducing Neo4j and its building blocks, and review some design principles to understand what can and can’t be done with Neo4j.

This chapter will cover the following topics:

  • Graph definition and examples
  • Moving from SQL to graph databases
  • Neo4j: the nodes, relationships, and properties model
  • Understanding graph properties
  • Considerations for graph modeling in Neo4j
 

Graph definition and examples

The question you may ask at this point is "Why should I care about graphs? After all, my company/business/interest is not about graphs or networks of any kind. I know my data model, well arranged into SQL tables or NoSQL documents, and I can retrieve the information I want when I want." This book will teach you how to empower your data by looking at it in a different way. Surprisingly enough, graphs can be used to model a lot of processes, from the more obvious ones such as road networks, to less intuitive use cases such as video games or credit card fraud detection, among many others.

Graph theory

Let's start from the beginning and answer the question "What is a graph?"

A bit of history: the Seven Bridges of Königsberg problem

Graph studies originate back to Leonhard Euler, a prolific Swiss mathematician who lived in the eighteenth century. In 1735, he published a paper proposing a solution to the Seven Bridges of Königsberg problem. The problem is the following:

Given the city whose geography is depicted in the following image, is there a way to walk across each of the seven bridges of the city once and only once, and return to our starting point?

As you can see, this city is crossed by a river that splits the city into two banks, A and B. The river meander additionally creates two islands, C and D, also part of the city. Those two banks and two islands are connected by a total of seven bridges: two bridges between A and C, two other bridges between C and B, one between C and D, one between B and D, and a last one between D and A:

Euler's reasoning (on the right side) was to reduce this complex geography to the most simple drawing, like the one you can see on the right of the previous image, since the route used within each island is not relevant. Each island then becomes a single point, or node, connected to another by one or several links, or edges, representing the bridges.

With this simple visualization, the mathematician was able to solve the initial problem by noting that, if you arrive at an island (vertex) via one bridge, you will need to leave it using another bridge (except for the start and end vertices). In other words, all vertices but two need to be connected to an even number of relationships. This is not the case in the Königsberg graph, since we have the following:

A: 3 connections (to C twice, and to D once)
B: 3 connections (to C twice, and to D once)
C: 5 connections (to A twice, to B twice and to D once)
D: 3 connections (to A once, to C once and to D once)

This kind of path, where each edge is used once and only once, is called a Eulerian cycle and it can be said that a graph has a Eulerian cycle if and only if all of its vertices have even degrees.

The number of connections for a node is called the degree of the node.

Graph definition

This leads us to the mathematical definition of a graph:

A graph G = (V, E) is a pair of:

  • V, a set of nodes or vertices: the islands in our previous example
  • E, a set of edges connecting nodes: the bridges

The Königsberg graph illustrated on the right of the preceding image can then be defined as follows:

V = [A, B, C, D]
E = [
(A, C),
(A, C),
(C, B),
(C, B),
(A, D),
(C, D),
(B, D)
]

Graphs, like many mathematical objects, are well defined. While it can be difficult to find a good visualization for some of those objects, graphs, on the other hand, suffer from the almost infinite number of ways to draw them.

Visualization

Apart from very special cases, there is no single way to draw a graph and visualize it. Indeed, graphs are most often an abstract representation of reality. For instance, all four graphs depicted in the following image represent the exact same set of nodes and edges, so, by definition, the same mathematical graph:

We cannot rely only on our eyes to find patterns within graphs. For instance, looking only at the lower-right plot, it would be impossible to see the pattern that is visible in the upper-right plot. That's where graph algorithms enter into the game, which will be discussed in more detail in Chapter 6, Node Importance, and Chapter 7, Community Detection and Similarity Measures.

Examples of graphs

Now that we have a better idea of what a graph is, it's time to discover some more examples to understand which purposes graphs can be useful for.

Networks

With the graph definition in mind (a set of nodes connected to each other via edges) and the bridges example from the last section, we can easily imagine how all kinds of networks can be seen as graphs, including road networks, computer networks, or even social networks.

Road networks

Road networks are a perfect example of graphs. In such networks, the nodes are the road intersections, and edges are the roads themselves, as you can see in the following image:

This image shows the road network around Central Park in New York City, wherein streets are edges between junctions representing nodes

With road networks, many questions can be answered with graph analysis, such as the following:

  • What is the shortest path between two points (nodes)?
  • How long is this shortest path?
  • Are there alternative routes?
  • How can you visit all nodes within a list in a minimal amount of time?

This last question is especially important for parcel delivery, in order to minimize the number of driven miles to maximize the number of delivered parcels and satisfied customers.

We will go into more detail about this topic in Chapter 4, The Graph Data Science Library and Path Finding.

Computer networks

In a computer network, each computer/router is a node and the cables between them are the edges. The following image illustrates some possible topologies used for a computer network (credit: https://commons.wikimedia.org/wiki/File:NetworkTopologies.svg):

You can now draw the parallel with the graph definition we discovered in the last section. Here again, the graph structure helps in answering some common questions you may ask yourself about your network:

  • How fast will this information be transferred from A to B? This sounds like a shortest path issue.
  • Which of my nodes is the most critical one? By critical, we mean that if this node is not working for some reason, the whole network will be impacted. Not all nodes have the same impact on the network. That's where centrality algorithms come into the game (see Chapter 6, Node Importance).

Social networks

Facebook, LinkedIn, and all of our favorite social networks use graphs to model their users and interactions. In the most basic example of a social graph, nodes represent people, and edges the friendship or professional relationship between them, as illustrated in the following image:

Here again, graphs allow us to see the data from a different perspective. For instance, we have seen this kind of information when looking at someone’s profile on LinkedIn:

In that case, it tells us that the connected user (me) is just two connections away from Clark Kent. In other words, one person in my network is already connected to a person who is connected to Clark Kent. The following image illustrates this more clearly, in terms of degrees of separation:

You've probably heard about the Six Degrees of Separation theory. In 1929, the Hungarian journalist Frigyes Karinthy proposed a theory according to which each person on Earth is at most six connections away from any other person. In other words, if you want to talk to one person, say Barack Obama, a friend of yours has a friend whose friend has a friend... who knows Barack Obama and can introduce you to him. According to Karinthy, this connection chain must contain less than six connections, or seven people in total, including you and Barack Obama.

Given that there are more than 7 billion human beings on Earth, that's a surprisingly small number! With the large databases that are available nowadays, such as the friendship connections from Facebook or email exchanges from Microsoft, researchers have tried to prove the preceding statement. From the Microsoft email database, for instance, it was shown in 2008 that the average degree of separation between 180 billion distinct pairs of people was around 6.6. But this is just an average, and the number of hops to connect two people could go up to 29 with that dataset.

Many other kinds of analyses can be performed over social graphs:

  • Node importance: Again, it might be very useful to have an idea of which nodes (persons) are the most important. However, the definition of importance here will be different than in the case of a computer network, since it is very unlikely that a single person's retirement from social media makes the whole world collapse. However, influencers have a particular interest for marketing experts.
  • Community detection: Also called clustering, is a way to find a group of nodes sharing some characteristics. For instance, finding users who share the same interests, or visit the same places, can be used to recommend products to them.
  • Link prediction: With a graph, you can think of creating intelligent models to predict whether two entities are likely to be connected in the future. Here again, recommendation engines are one possible application of such a tool.
You can find more information about the Facebook graph as an example at https://developers.facebook.com/docs/graph-api.

As you can see, networks of all kinds are very well suited to graph databases. But we can go far beyond that view and imagine all kinds of data as a graph, which will open up a lot of new perspectives.

Your data is also a graph

You may have noticed that in the previous image of a social graph, the edges have names. Indeed, some people are friends, while some others have a father/son relationship. Now, let's imagine we can have any kind of relationship, meaning we can start connecting different kinds of entities. For instance, a person is living in a particular country, so (s)he is connected to that country with a relationship of type LIVES_IN. Are you beginning to see the point? With that kind of reasoning, the world itself is a graph and your business is a subpart of it.

Graphs are about relationships, and the world is connected, meaning there are relationships everywhere. We’ll talk about this in more detail in Chapter 3, Empowering Your Business with Pure Cypher, which is dedicated to knowledge graphs.

Graph databases allow you to model the data in that way: nodes, connected by relationships of some type. Let's see how to migrate data stored in relational databases to graph databases.

 

Moving from SQL to graph databases

Before going into detail about Neo4j, let's first have a look at the existing database models. We'll then focus on the most famous one, the relational model, and learn how to migrate from SQL to a graph database model.

Database models

Think about water: depending on the end goal, you will not use the same container. If you want to drink, you might use a glass of water but if you want to have a bath, you will probably choose another one. The choice of container will be different in each scenario. The problem is similar for data: without a proper container, there is nothing we can do with it and we need a proper container depending on the situation, which will not only store data but also contribute to solve the problem we have. This data container is the database.

Drawing an exhaustive list of database types on the market is not impossible but would go far beyond the scope of this book. However, I want to give some examples of the most popular ones, so that you can see where graph databases stand in the big picture:

  • Relational databases: They are by far the most well known type of database. From SQLite to MySQL or PostgreSQL, they use a common query language, called Structured Query Language (SQL), with some variations among the different implementations. They are well established and allow a clear structure of the data. However, they suffer from performance issues when the data grows and are surprisingly not that good at managing complex relationships, since the relationships require many joins between tables.
  • Document-oriented databases: Document-oriented databases, part of the NoSQL (Not Only SQL) era, have gained increasing interest during the last few years. Contrary to relational databases, they can manage flexible data models and are known for better scaling with a large amount of data. Examples of NoSQL databases include MongoDB and Cassandra, but you can find many more on the market.
  • Key-value stores: Redis, RocksDB, and Amazon DynamoDB are examples of key-value databases. They are very simple and known to be very fast, but are not well suited to store complex data.

Here is how the different databases can be viewed in a figurative representation:

Graph databases try to bring the best of each world into a single place by being quite simple to use, flexible, and very performant when it comes to relationships.

SQL and joins

Let's briefly focus on relational databases. Imagine we want to create a Question and Answer website (similar to the invaluable Stack Overflow). The requirements are the following:

  • Users should be able to log in.
  • Once logged in, users can post questions.
  • Users can post answers to existing questions.
  • Questions need to have tags to better identify which question is relevant for which user.

As developers or data scientists, who are used to SQL, we would then naturally start thinking in terms of tables. Which table(s) should I create to store this data? Well, first we will look for entities that seem to be the core of the business. In this QA website, we can identify the following:

  • Users, with attributes: ID, name, email, password
  • Questions: ID, title, text
  • Answers: ID, text
  • Tags: ID, text

With those entities, we now need to create the relationships between them. To do so, we can use foreign keys. For instance, the question has been asked by a given user, so we can just add a new column to the question table, author_id, referencing the user table. The same applies for the answers: they are written by a given user, so we add an author_id column to the Answer table:

It becomes more complicated for tags, since one question can have multiple tags and a single tag can be assigned to many questions. We are in the many-to-many relationship type, which requires adding a join table, a table that is just there to remember that kind of relationship. This is the case of the QuestionTag table in the preceding diagram; it just holds the relationship between tags and questions.

It's all about relationships

The previous exercise probably looks easy to you, because you have followed lectures or tutorials about SQL and use it pretty often in your daily work. But let's be honest, the first time you face a problem and have to create the data model that will allow you to solve it, you will probably draw something that looks like the diagram in the following image, right?

This is one of the powers of graph databases:

Neo4j, as a graph database, allows you to create vertices, or nodes, and the relationships connecting them. The next section summarizes how the different entities in the Neo4j ecosystem are related to each other and how we can structure our data in a graph.

The whiteboard model is your data model.
 

Neo4j – the nodes, relationships, and properties model

Neo4j is a Java-based highly scalable graph database whose code is publicly available on GitHub at github.com/neo4j/neo4j. This section describes its building blocks and its main cases, but also the cases where it won't perform well because no system is perfect.

Building blocks

As we discussed with graph theory, a graph database such as Neo4j is made of at least two essential building blocks:

  • Nodes
  • Relationships between nodes:
Unlike SQL, Neo4j does not require a fixed and predetermined schema: nodes and relationships can be added on demand.

Let's look at each of of these in detail.

Nodes

In Neo4j, vertices are called nodes. Nodes can be of different types, like Question and Answer were in our former example. In order to differentiate those entities, nodes can have a label. If we continue the parallel with SQL, all nodes with a given label would be in the same table. But the analogy ends here, because nodes can have multiple labels. For instance, Clark Kent is a journalist, but he is also a superhero; the node representing this person can then have the two labels: Journalist and SuperHero.

Labels define the kind of entity the node belongs to. It is also important to store the characteristics of that entity. In Neo4j, this is done by attaching properties to nodes.

Relationships

Like nodes, relationships carry different pieces of information. A relationship between two persons can be of type MARRIED_TO or FRIEND_WITH or many other types of relationship. That's why Neo4j's relationships must have one and only one type.

One of the main powers of Neo4j is that relationships also have properties. For instance, when adding the relationship MARRIED_TO between two persons, we could add the wedding date, place, whether they signed a prenuptial agreement, and so on, as relationship properties.

Be careful: even though we discussed undirected graphs earlier, in Neo4j, all relationships are oriented!

Properties

Properties are saved as key-value pairs where the key is a string capturing the property name. Each value can then be of any of the following types:

  • Number: Integer or Float
  • Text: String
  • Boolean: Boolean
  • Time properties: Date, Time, DateTime, LocalTime, LocalDateTime, or Duration
  • Spatial: Point

Internally, properties are saved as a LinkedList, each element containing a key-value pair. The node or relationship is then linked to the first element of its property list.

Naming conventions: Node labels are written in UpperCamelCase while relationship names use UPPER_CASE with words separated by an underscore. Property names usually use the lowerCamelCase convention.

SQL to Neo4j translator

Here are a few guidelines to be able to easily go from your relational model to a graph model:

SQL world Neo4j world
Table Node label
Row Node
Column Node property
Foreign key Relationship
Join table Relationship
NULL Do not store null values inside properties; just omit the property

Applying those guidelines to the SQL model in the question and answer table diagram, we built up the graph model displayed in the whiteboard model for our simple Q&A website, earlier in the chapter. The full graph model can be seen as follows:

You can use the online tool Arrows, which enables you to draw a graph diagram and export it to Cypher, the Neo4j query language:
http://www.apcjones.com/arrows.

Neo4j use cases

Like any other tool, Neo4j is very good in some situations, but not well suited to others. The basic principle is that Neo4j provides amazing performance for graph traversal. Everything requiring jumping from one node to another is incredibly fast.

On the other hand, Neo4j is probably not the best tool if you want to do the following:

  • Perform full DB scans, for instance, answering the question "What is?"
  • Do full table aggregates
  • Store large documents: the key-values properties list needs to be kept small (let's say no more than around 10 properties)

Some of those pain points can be addressed with a proper graph model. For instance, instead of saving all information as node properties, can we consider moving some of them to another node with a relationship between them? Depending on the kind of requests you are interested in, the graph schema most suited to your application may differ. Before going into the details of graph modeling, we need to stop briefly and and talk about the different kinds of graph properties which will also influence our choice.

 

Understanding graph properties

Graphs are classified into several categories, depending on the properties of the connections between nodes. These categories are important to consider when modeling your data as a graph, and even more so when you want to run algorithms on them, because the behavior of the algorithm might change and create unexpected results in some situations. Let's discover some of those properties with examples.

Directed versus undirected

So far, we've only talked about graphs as a set of connected nodes. Those connections can be different depending on whether you are going from node A to node B or vice versa. For instance, some streets can only be driven in one direction, right? Or, the fact that you are following someone on Twitter (there is a Follows relationship from you to that user) doesn’t mean that user is also following you. That’s the reason why some relationships are directed. On the contrary, a relationship of type married to is naturally undirected, if A is married to B, usually B is also married to A:

In the right-hand side graph, directed, the transition between B and A is not allowed in that direction

In Neo4j, all relationships are directed. However, Cypher allows you to take this direction into account, or not. That will be explained in more detail in Chapter 2, The Cypher Query Language.

Weighted versus unweighted

As well as direction, relationships can also carry more information. Indeed, in a road network for instance, not all streets have the same importance in a routing system. They have different lengths or occupancy during peak hours, meaning the travel time will be very different from one street to another. The way to model this fact with graphs is to assign a weight to each edge:

In the weighted graph, relationships have weights 16 and 4

Algorithms such as shortest path algorithms take this weight into account to compute a shortest weighted path.

This is not only important for road networks. In a computer network, the distance between units may also have its own importance in terms of connection speed. In social networks, distance is often not the most important property to quantify the strength of a relationship, but we can think of other metrics. For instance, how long have those two people been connected? Or when was the last time user A reacted to a post of user B?

Weights can represent anything related to the relationship between two nodes, such as distance, time, or field-dependent metrics such as the interaction strength between two atoms in a molecule.

Cyclic versus acyclic

A cycle is a loop. A graph is cyclic if you can find at least one path starting from one node and going back to that exact same node.

As you can imagine, it's important to be aware of the presence of cycles in a graph, since they can create infinite loops in a graph traversal algorithm, if we do not pay enough attention to it. The following image shows an acyclic graph versus a cyclic graph:

All those properties are not self-exclusive and can be combined. Especially, we can have directed acyclic graphs (DAG), which usually behave nicely.

Dense versus sparse

Graph density is another property that is important to have in mind. It is related to the average number of connections for each node. In the next diagram, the left graph only has three connections for four nodes, whereas the right graph has six connections for the same amount of nodes, meaning the latter is denser than the former:

Graph traversal

But why worry about density? It is important to realize that everything within graph databases is about graph traversal, jumping from one node to another by following an edge. This can become very time-consuming with a very dense graph, especially when traversing the graph in a breadth-first way. The following image shows the graph traversals in which breadth-first search is compared to depth-first search:

Connected versus disconnected

Last but not least is the notion of connectivity or a connected graph. In the next diagram, it is always possible to go from one vertex to another, whatever pair of vertices is considered. This graph is said to be connected. On the other side of the figure, you can see that D is isolated - there is no way to go from D to A, for example. This graph is disconnected, and we can even say that it has two components. We will cover the analysis of this kind of structure in Chapter 7, Community Detection and Similarity Measures. The following image shows connected versus disconnected graphs:

This is a non-exhaustive list of graph properties, but they are the main ones we will have to worry about within our graph database adventure. Some of them are important when creating the whiteboard graph model for our data, as we will discuss in the next section.

 

Considerations for graph modeling in Neo4j

The whiteboard model for our simple Q&A website is the very first approach to this problem. Depending on the kind of questions you want your application to answer, and in order to take full advantage of Neo4j, the schema can be very different.

As a rule of thumb, we consider nodes as entities or objects, while relationships are verbs.

Relationship orientation

As we discussed earlier in the chapter, relationships in Neo4j are oriented. However, with Cypher, we can build queries that are orientation-independent (see Chapter 2, The Cypher Query Language). In that scenario, a relationship that is imperatively bijective should not be stored twice in the database. In short, a friendship relationship should be created only in one direction, as illustrated in the following image:

Node or property?

Another question that arises quite often and early in the choice of graph model is whether a (categorical) characteristic should be stored as a node property, or whether it should have its own node.

Nodes are the following:

  • Explicit: Nodes appear in the graph schema, which makes them more visible than properties.
  • Performant: If we need to find nodes sharing the same characteristics, it will be much faster with a node than by filtering on properties.

On the other hand, properties are the following:

  • Simpler: It's just a property with a value, not a complex node and relationship structure. If you don't need to perform special analysis on the property and just want it to be available when working on the node it belongs to, a property is just fine.
  • Indexable: Neo4j supports indexing for properties. It is used to find the first node in a graph traversal and is very efficient (more details in Chapter 2, The Cypher Query Language).

As you can see, there is no universal answer to this question, and the solution will depend on your use case. It is usually recommended to list them and try to write and test the associated queries in order to make sure you are not falling into a Neo4j pitfall.

For more information and more examples on graph modeling, I encourage you to have a look at Neo4j Graph Data Modeling (see the Further reading section).

 

Summary

In this chapter, we discussed the mathematical definition of a graph and how it is related to graph databases. We then moved on to the specific instance of graph databases we will use throughout this book, Neo4j. We've learned about its building blocks, nodes having labels and properties, and relationships that must have a type and, optionally, some properties as well. We concluded with some important explanations about different types of graphs, weighted, directed, cyclic, or connected, and how this can help in defining the data model best suited to a given use case.

In the next chapter, we are going to look at Cypher, the query language used by Neo4j, and how we can feed the database and retrieve data from it.

 

Further reading

  • If you want to learn more about the mathematics behind graph theory, you can start with Graph Theory with Applications, J.A. Bondy and U.S.R. Murty, Elsevier, whose first edition is available for free on the web, for instance at https://www.freetechbooks.com/graph-theory-with-applications-t559.
  • For more information about choosing the appropriate schema for your graph data, see Neo4j Graph Data Modeling, M. Lal, Packt Publishing.
  • Some references to learn more about degrees of separation:

  • Paper about the Microsoft study: Planetary-Scale Views on an Instant-Messaging Network by Jure Leskovec and Eric Horvitz
  • Facebook research blog post: https://research.fb.com/blog/2016/02/three-and-a-half-degrees-of-separation/
About the Author
  • Estelle Scifo

    Estelle Scifo possesses over 7 years experience as a data scientist, after receiving her PhD from the Laboratoire de lAcclrateur Linaire, Orsay (affiliated to CERN in Geneva). As a Neo4j certified professional, she uses graph databases on a daily basis and takes full advantage of its features to build efficient machine learning models out of this data. In addition, she is also a data science mentor to guide newcomers into the field. Her domain expertise and deep insight into the perspective of the beginners needs make her an excellent teacher.

    Browse publications by this author
Hands-On Graph Analytics with Neo4j
Unlock this book and the full library FREE for 7 days
Start now