Graphs are all around us. Each time we access the Internet, the data packets travel across a network of routers, switches, and cables and deliver what we have requested. While representing key concepts/objects in a problem and defining relationships or interactions between the concepts/objects involved, we generally draw bubbles or boxes to denote the objects, and arrows between those objects to represent the interactions or relationships. We use a similar notation while drawing a map to explain routes to others. The beauty of these notations, such as bubbles and arrows, is their expressiveness, a property that is usually lost when we obfuscate the model into records and tables. Graphs allow us to discover information and ease the modeling pain, which eventually makes our life smoother. To be able to use graphs better, we will need to understand a few basic concepts related to a graph database. In this chapter, we will explore the following:
Graphs in mathematics
The property graph model
Reasons for using a graph database
Usage of graphs—some obvious and some not-so-obvious graph problems
Advantages of using Neo4j
We chose Neo4j to explain graph data modeling in this book. However, the modeling concepts discussed here will apply to any graph database.
A few readers might be experienced Neo4j users and if you fall into this category, you might want to skip this chapter. However, if you are new to Neo4j or want a brief refresher, please carry on.
A graph is a mathematical structure of objects in which some pairs of objects are connected by links. The objects are denoted by abstractions called nodes (also known as vertices) and their links are represented by relationships (also known as edges). The relationships might be directed where it makes semantic sense in one particular direction. In cases where the semantics work in both directions, we can safely use undirected relationships to denote the link.
In Figure 1.1, we have three actors or entities, Alice, Bob, and London, which are represented as nodes. The links between them are denoted by relationships. Alice is married to Bob and Bob is married to Alice. Both true, hence we represent Is Married To as an undirected relationship. However, Alice lives in London is represented by a directed relationship, Lives In, from Alice to London. This is because London lives in Alice cannot be true.
In Neo4j, we use a property graph model to represent information. The property graph model is an extension of the graphs from mathematics. The following figure gives an example of how data from Figure 1.1 can be represented in Neo4j:
Labels: These are used to represent the role of the node in our domain. A node can have multiple labels at the same time. Apart from adding more meaning to nodes, labels are also used to add constraints and indices that are local to the particular label. In the preceding figure, :Person and :Location are the two labels that we used. We can add an index or constraint on name for each of these labels, which will result in two separate indices—one for :Location and the other for :Person.
Relationships: These depict directed, semantically relevant connections between two nodes. A relationship in Neo4j will always have a start node, an end node, and a single type. While relationships need to be created with a direction, we can ignore the direction while traversing them. :LIVES_IN and :IS_MARRIED_TO in Figure 1.2 are relationship types.
Properties: These are key-value pairs that contain information about the node or relationship. In the previous figure, name and since are both properties that divulge more information about the node or relationship they are associated with. Neo4j can accept any Java Virtual Machine (JVM) type as a property, including but not limited to, date, string, double, and arrays.
This property graph model allows us to model data as close to the real world as possible.
The resultant model is simpler and more expressive. It also explicitly calls out relationships. In contrast to an RDBMS, which uses foreign keys to imply relationships, having them explicitly defined allows us to retrieve data by traversing relationships to find the information we need. This is a deliberate, practical algorithmic approach that uses the connectedness of data, rather than relying on some index lookups or joins to find the related data. Explicit relationships also make the property graph model a natural fit for most problem domains, as they are interconnected.
As with all database management systems, graph databases have the concept of storage and query engines, which deal with persistence and queries over connected data. The query engine of the database is responsible for running the queries and retrieving or modifying data. The query engine exposes the graph data model through Create, Read, Update, and Delete operations (commonly referred to as CRUD). Storage deals with how the data is stored physically and how it is represented logically when retrieved. Its knowledge can help in choosing a graph database.
Relationships are an important part of any domain model and need to be traversed frequently. In a graph database, the relationships are explicit rather than inferred. Making relationships explicit is achieved either via the query engine working on a non-native graph storage (such as RDBMS, column stores, document stores) or using a native graph storage.
In a graph database relying on non-native graph storage, relationships need to be inferred at runtime. For example, if we want to model a graph in an RDBMS, our processing engine will have to infer the relationships using foreign keys and reify the relationships at runtime. This problem is computationally expensive and is infeasible for traversing multiple relationships because of the recursive joins involved. There are other graph databases in which NoSQL stores such as HDFS, column stores such as Cassandra, or documents are used to store data and expose a Graph API. Though there are no joins in a graph database using NoSQL stores, the database still has to use index lookups. In cases where non-native storage is used, the query engines have to make more computational effort.
Neo4j uses a native graph storage. Each node has a handle to all the outgoing relationships it has and each relationship, in turn, knows its terminal nodes. At runtime, to find neighboring nodes, Neo4j doesn't have to do an index lookup. Instead, neighboring nodes can be identified by looking at the relationships of the current node. This feature is called index-free adjacency. Index-free adjacency is mechanically sympathetic and allows the Neo4j query engine to have a significant performance boost while traversing the graph.
Every morning when we check our Facebook feed, we are welcomed by a stream of updates from friends and news. Using information about how data is connected and matching it with our individual preferences, Facebook builds a stream of activities from our network that are relevant and interest us. LinkedIn does something similar while suggesting jobs within our network. When we fire up Google Maps or some application such as TomTom or Sygic maps and start navigating to a destination, we use the data that represents connections of various intersections within the city, and work out how best to traverse it. While shopping online, products are recommended to us based on how closely they are connected to what we have already bought or similar products that others have bought. We leverage connected data more and more every day without realizing it.
The query performance of a graph database is a few orders of magnitude better than RDBMS or other NoSQL alternatives. As the dataset grows, RDBMS join performance deteriorates because of the ever-increasing size of the join tables. On the other hand, graph traversals are localized to a portion of the graph. So query execution time is proportional to the number of nodes visited, rather than being proportional to the overall amount of data stored. This makes the query performance fairly constant over time even though the data might increase exponentially.
Flexibility and agility are major considerations in today's world where business needs are constantly evolving. Developers need to have a tool that allows them to incrementally think of the model rather than locking down the data model before they start coding. Graph databases allow for addition of relationships, node types, and properties without making any changes to the existing queries. We can connect the model incrementally, thereby allowing for more sophisticated querying. This flexibility also means fewer migrations. Even in case of changes to the data model, migrations are relatively pain free and can be done without taking the database offline for a long time, thus helping teams deliver software faster while concentrating on the domain rather than managing infrastructure and communication.
Lesser ambiguity leads to better models. Since graph databases are schema-less, the schema is dictated by the application and hence is better validated. It allows for better design thinking by developers since there is no ambiguity of the domain model compared to how it is stored in tables.
The design to delivery time is reduced. From a developer's standpoint, one of the best features of a graph database is that it is whiteboard friendly. We can make a data model on a whiteboard and not worry about trying to translate it to a set of tables, which don't necessarily represent the data model as is. This allows the developers to concentrate on development rather than translation, thereby saving time.
While all that has been said might seem like jargon, it boils down to economics. Graph databases make more economic sense when the data is highly connected.
Let's start by citing a few problem statements that are more suited to graph databases.
Routing is a graph problem and much research has been done in that respect. One of the leading delivery services in the world uses a Neo4j-based solution to route packages in real time based on information being collected worldwide.
Social networks are problems suited for graphs since they leverage the connections of users to fetch data and decide on what is accessible and what isn't. Facebook, in particular, uses its graph search and has exposed it to the users to enable them to make better searches. Facebook relies heavily on the graph of people and their friends to curate the feed.
Recommendation is again a graph problem that can be solved using graph databases. While companies such as eBay originally relied on MySQL, they eventually turned to Neo4j.
Search, for example, doesn't come across as a graph problem and is not a very intuitive one. However, Google uses its Knowledge Graph to give you search results based on how well connected a piece of content is to the term being searched. More recently, Facebook has leveraged its social graph to help search become better.
Medical research is another domain where graphs are being used. Medical data is highly interconnected and hence can benefit greatly from the use of graph databases. Companies are now using graph databases for drug discovery and storing medical information.
Storage of ontologies is increasingly being solved using graph databases, which are rapidly finding applications in machine learning and analytics. Companies are also using graph databases in domains such as energy supply and transportation.
Neo4j is a fast, native graph database that satisfies Atomicity, Consistency, Isolation, Durability (ACID) properties. Through usage of transactions, developers can ensure that the failure of a transaction leaves the database's state unchanged ensuring atomicity. Any change to the database doesn't destroy data, ensuring consistency. Data modified by a transaction is isolated from other transactions till it is committed. Since Neo4j is a persistent graph database, the results of a committed transaction can always be retrieved, thus making it durable.
It started off supporting the TinkerPop stack. More information about the TinkerPop stack can be found at http://www.tinkerpop.com.
Neo4j provides numerous modeling and technical affordances, which are valuable when building real-world systems such as:
Neo4j is the most mature graph database and has been in production round the clock since 2003. Neo4j is open source with an enormous community. The Neo4j development team is highly engaged with that community so that the features and bugs are rapidly addressed. Neo4j provides native graph storage that enables its engine to perform native graph processing. From the query language to disks, everything is mechanically sympathetic to the transactional storage and rapid retrieval of graph data.
Cypher is a very expressive query language used to retrieve data from Neo4j. While it is superficially similar to SQL in some respect, Cypher is the only declarative query language that is built ground-up for humane yet performant graph queries and writes. The Neo4j Java API can be used on JVM-based languages as a more imperative and performant method of querying. This gives the best of both worlds by supporting imperative and declarative querying. (Neo4j plans to move away from supporting Gremlin in the long run, and currently Gremlin is supported through a plugin). Neo4j is open source and allows plugins to enhance or add functionalities, and there is a vibrant ecosystem of tooling around the core database.
Any Cypher statement that updates the graph is run within a transaction. If a transaction exists, the newly fired Cypher query will be run in it. If no transaction exists, then the statement will itself be transactional.
The community being fostered is incredible. This is also partly made possible by the project being open source. Neo4j is currently being used in production by companies such as UBS, Cisco, Walmart, eBay, Telenor, HP, Pitney Bowes, Accenture, Lockheed Martin, Glassdoor, and many others.
This book is divided into two sections:
Section 1 (Chapter 2, Modeling Flights and Cities, to Chapter 5, Refactoring the Data Model) is essential to understand graph modeling concepts that you will use in your daily routine. We cover how to model a graph, how to query it, how to evolve a graph database to accommodate changes in the domain, and how to translate a RDBMS data model into a graph design.
Section 2 (Chapter 6, Modeling Communication Chains, to Chapter 8, Recommendations and Analysis of Historical Data) are more reference oriented with models that you might need for optimization or for specialized cases. Topics covered are modeling chains and advantages of modeling chains, modeling access control, and designing recommendation systems based on the data present.
In this chapter, we discussed that graph databases are structures that help represent data as nodes, relationships, and properties; relationships explicitly specify and qualify the connection between two entities; labels add semantic meaning to nodes and allow for addition of indices and constraints; properties add more information to the nodes and relationships. We saw a few use cases in which graphs are used currently.
From the next chapter onward, we will delve into designing a data model and use actual Cypher queries to feed it into Neo4j. The queries used in this book are compatible with Neo4j 2.2.3. They have also been tested with Neo4j 2.3.0-M02.