Reader small image

You're reading from  Graph Data Science with Neo4j

Product typeBook
Published inJan 2023
Reading LevelIntermediate
PublisherPackt
ISBN-139781804612743
Edition1st Edition
Languages
Tools
Right arrow
Author (1)
Estelle Scifo
Estelle Scifo
author image
Estelle Scifo

Estelle Scifo possesses over 7 years experience as a data scientist, after receiving her PhD from the Laboratoire de lAcclrateur Linaire, Orsay (affiliated to CERN in Geneva). As a Neo4j certified professional, she uses graph databases on a daily basis and takes full advantage of its features to build efficient machine learning models out of this data. In addition, she is also a data science mentor to guide newcomers into the field. Her domain expertise and deep insight into the perspective of the beginners needs make her an excellent teacher.
Read more about Estelle Scifo

Right arrow

What is a graph database?

Before we get our hands dirty and start playing with Neo4j, it is important to understand what Neo4j is and how different it is from the data storage engine you are used to. In this section, we are going to discuss (quickly) the different types of databases you can find today, and why graph databases are so interesting and popular both for developers and data professionals.

Databases

Databases make up an important part of computer science. Discussing the evolution and state-of-the-art areas of the different implementations in detail would require several books like this one – fortunately, this is not a requirement to use such systems effectively. However, it is important to be aware of the existing tools related to data storage and how they differ from each other, to be able to choose the right tool for the right task. The fact that, after reading this book, you’ll be able to use graph databases and Neo4j in your data science project doesn’t mean you will have to use it every single time you start a new project, whatever the context is. Sometimes, it won’t be suitable; this introduction will explain why.

A database, in the context of computing, is a system that allows you to store and query data on a computer, phone, or, more generally, any electronic device.

As developers or data scientists of the 2020s, we have mainly faced two kinds of databases:

  • Relational databases (SQL) such as MySQL or PostgreSQL. These store data as records in tables whose columns are attributes or fields and whose rows represent each entity. They have a predefined schema, defining how data is organized and the type of each field. Relationships between entities in this representation are modeled by foreign keys (requiring unique identifiers). When the relationship is more complex, such as when attributes are required or when we can have many relationships between the same objects, an intermediate junction (join) table is required.
  • NoSQL databases contain many different types of databases:
    • Key-value stores such as Redis or Riak. A key-value (KV) store, as the name suggests, is a simple lookup database where the key is usually a string, and the value can be a more complex object that can’t be used to filter the query – it can only be retrieved. They are known to be very efficient for caching in a web context, where the key is the page URL and the value is the HTML content of the page, which is dynamically generated. KV stores can also be used to model graphs when building a native graph engine is not an option. You can see KV stores in action in the following projects:
    • Document-oriented databases such as MongoDB or CouchDB. These are useful for storing schema-less documents (usually JSON (or a derivative) objects). They are much more flexible compared to relational databases, since each document may have different fields. However, relationships are harder to model, and such databases rely a lot on nested JSON and information duplication instead of joining multiple tables.

The preceding list is non-exhaustive; other types of data stores have been created over time and abandoned or were born in the past years, so we’ll need to wait to see how useful they can be. We can mention, for instance, vector databases, such as Weaviate, which are used to store data with their vector representations to ease searching in the vector space, with many applications in machine learning once a vector representation (embedding) of an observation has been computed.

Graph databases can also be classified as NoSQL databases. They bring another approach to the data storage landscape, especially in the data model phase.

Graph database

In the previous section, we talked about databases. Before discussing graph databases, let’s introduce the concept of graphs.

A graph is a mathematical object defined by the following:

  • A set of vertices or nodes (the dots)
  • A set of edges (the connections between these dots)

The following figure shows several examples of graphs, big and small:

Figure 1.1 – Representations of some graphs

Figure 1.1 – Representations of some graphs

As you can see, there’s a Road network (in Europe), a Computer network, and a Social network. But in practice, far more objects can be seen as graphs:

  • Time series: Each observation is connected to the next one
  • Images: Each pixel is linked to its eight neighbors (see the bottom-right picture in Figure 1.1)
  • Texts: Here, each word is connected to its surrounding words or a more complex mapping, depending on its meaning (see the following figure):
Figure 1.2 – Figure generated with the spacy Python library, which was able to identify the relationships between words in a sentence using NLP techniques

Figure 1.2 – Figure generated with the spacy Python library, which was able to identify the relationships between words in a sentence using NLP techniques

A graph can be seen as a generalization of these static representations, where links can be created with fewer constraints.

Another advantage of graphs is that they can be easily traversed, going from one node to another by following edges. They have been used for representing networks for a long time – road networks or communication infrastructure, for instance. The concept of a path, especially the shortest path in a graph, is a long-studied field. But the analysis of graphs doesn’t stop here – much more information can be extracted from carefully analyzing a network, such as its structure (are there groups of nodes disconnected from the rest of the graph? Are groups of nodes more tied to each other than to other groups?) and node ranking (node importance). We will discuss these algorithms in more detail in Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset.

So, we know what a database is and what a graph is. Now comes the natural question: what is a graph database? The answer is quite simple: in a graph database, data is saved into nodes, which can be connected through edges to model relationships between them.

At this stage, you may be wondering: ok, but where can I find graph data? While we are used to CSV or JSON datasets, graph formats are not yet common and it might be misleading to some of you. If you do not have graph data, why would you need a graph database? There are two possible answers to this question, both of which we are going to discuss.

Previous PageNext Page
You have been reading a chapter from
Graph Data Science with Neo4j
Published in: Jan 2023Publisher: PacktISBN-13: 9781804612743
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime

Author (1)

author image
Estelle Scifo

Estelle Scifo possesses over 7 years experience as a data scientist, after receiving her PhD from the Laboratoire de lAcclrateur Linaire, Orsay (affiliated to CERN in Geneva). As a Neo4j certified professional, she uses graph databases on a daily basis and takes full advantage of its features to build efficient machine learning models out of this data. In addition, she is also a data science mentor to guide newcomers into the field. Her domain expertise and deep insight into the perspective of the beginners needs make her an excellent teacher.
Read more about Estelle Scifo