Getting Started with the Graph Query Language (GQL)

Evolution Towards Graph Databases

In today’s data-driven world, the way we store, manage, and query data has evolved significantly. As businesses and organizations handle more complex, interconnected datasets, traditional database models are being stretched to their limits. Graph databases, in particular, have gained traction due to their ability to model relationships in ways that relational databases cannot. According to a recent report by Gartner, the graph database market is expected to grow at a compound annual growth rate (CAGR) of 28.1%, reflecting its increasing adoption across industries such as finance, healthcare, and social media.

This chapter explores the evolution of database query languages, starting with the early innovations that laid the foundation for modern database systems. We’ll trace the journey from relational databases to NoSQL, and ultimately to the emergence of Graph Query Language (GQL), which promises to redefine how we query and manage complex, interconnected data in the digital age.

Free Benefits with Your Book

Your purchase includes a free PDF copy of this book along with other exclusive benefits. Check the Free Benefits with Your Book section in the Preface to unlock them instantly and maximize your learning experience.

History of database query languages

Before electronic databases, data management was manual. Records were maintained in physical forms such as ledgers, filing cabinets, and card catalogs. Until the mid-20th century, this was the primary approach to data management. This method, while systematic, was labor-intensive and prone to human errors.

Discussions on database technology often begin with the 1950s and 1960s, particularly with the introduction of magnetic tapes and disks. These developments paved the way for navigational data models and, eventually, relational models. While these discussions are valuable, they sometimes overlook deeper historical perspectives.

Before magnetic tapes, punched cards were widely used, particularly for the 1890 U.S. Census. The company behind these tabulating systems later evolved into IBM Corporation, one of the first major technological conglomerates. I vividly recall my father attending college courses on modern computing, where key experiments involved operating IBM punch-card computers—decades before personal computers emerged in the 1980s.

Examining punched card systems reveals a connection to the operation of looms, one of humanity’s earliest sophisticated machines. Looms, which possibly originated in China and spread globally, have been found in various forms, including in remote African and South American villages.

Across the 3,000 to 5,000 years of recorded history, there have been many inventions for aiding memory, sending messages, scheduling, and recording data, ranging from tally sticks to quipu (khipu). While tally sticks were once thought to be a European invention, Marco Polo, after his extensive travels in China, reported that they were widely used there to track daily transactions.

On the other hand, when quipu was first discovered by Spanish colonists, it was believed to be an Inca invention. However, if the colonists had paid more attention to the pronunciation of khipu, they would have noticed that it means recording book in ancient Chinese. This suggests that quipu was a popular method for recording data and information long before written languages were developed.

Why focus on these pre-database inventions? Understanding these historical innovations through a graph-thinking lens helps illustrate how interconnected these concepts are and underscores the importance of recognizing these connections. Embracing this perspective allows us to better understand and master modern technologies, such as graph databases and graph query languages.

Early computer-based data management

The advent of electronic computers marked the beginning of computerized data storage. World Wars I and II drove major advancements in computing technology, notably the German Enigma machine and the Polish and Allied efforts to decipher its encrypted messages, which carried top-secret information from Nazi Germany. When mechanical machines proved inadequate for the required computing power—such as in brute-force decryption—electronic and much more powerful alternatives were invented. Consequently, the earliest electronic computers were developed before the end of World War II.

Early computers such as the ENIAC (1946) and UNIVAC (1951) were used for calculations and data processing. The Bureau of the Census and military and defense departments quickly adopted them to optimize troop deployment and arrange the most cost-effective logistics. These efforts laid the foundation for modern global supply chains, network analytics, and social behavior network studies.

The concept of systematic data management, or databases, became feasible with the rapid advancement of electronic computers and storage media, such as magnetic disks. Initially, most of these computers operated in isolation; the development of computer networks lagged significantly behind telecommunication networks for over a century.

The development of database technologies is centered around how data modeling is conducted, and the general perception is that there have been three phases so far:

  • Phase 1: Navigational data modeling
  • Phase 2: Relational (or SQL) data modeling
  • Phase 3: Not-only-SQL (or post-relational, or GQL) data modeling

Let’s briefly examine these three development phases so that we have a clear understanding of why GQL, and the graph-based way of modeling and processing data, was invented.

Navigational data modeling

Before navigational data modeling (or navigational databases), data on punched cards or magnetic tapes could only be accessed sequentially, which was highly inefficient. To improve speed, systems introduced references, similar to pointers, that allowed users to navigate data more efficiently. This led to the development of two data navigation models:

  • Hierarchical model (or tree-like model)
  • Network model

The hierarchical model was first developed by IBM in the 1960s on top of their mainframe computers, while the network model, though conceptually more comprehensive, was never widely adopted beyond the mainframe era. Both models were quickly displaced by the relational model in the 1970s.

One key reason for this shift was that navigational database programming is intrinsically procedural, focusing on instructing the computer, step by step, on how to reach the desired data record. This approach had two major drawbacks:

  • Strong data dependency
  • Low usability due to programming complexity

Relational data modeling

The relational model, unlike the navigational model, is intrinsically declarative: you tell the system what data to retrieve rather than how to retrieve it, which yields better data independence and program usability.

Another key reason for the shift from navigational databases/models was their limited search capabilities, as data records were stored using linked lists. This limitation led Edgar F. Codd, while working at IBM’s San Jose, California Labs, to invent tables as a replacement for linked lists. His groundbreaking work culminated in the highly influential 1970 paper titled A Relational Model of Data for Large Shared Data Banks. This seminal paper inspired a host of relational databases, including IBM’s System R (1974), UC Berkeley’s INGRES (1974, which spawned several well-known products such as PostgreSQL, Sybase, and Microsoft SQL Server), and Larry Ellison’s Oracle (1977).

Not-only-SQL data modeling

Today, there are approximately 500 known and active database management systems (DBMS) worldwide (as shown in Figure 1.1). While over one-third are relational DBMS, the past two decades have seen a rise in hundreds of non-relational (NoSQL) databases. This growth is driven by increasing data volumes, which have given rise to many big data processing frameworks that utilize both data modeling and processing techniques beyond the relational model. Additionally, evolving business demands have led to more sophisticated architectural designs, requiring more streamlined data processing.

The entry of major players into the database market has further propelled this transformation, with large technology companies spearheading the development of new database systems tailored to handle diverse and increasingly complex data structures. These companies have helped define and redefine database paradigms, providing a foundation for a variety of solutions in different industries.

As the landscape has continued to evolve, OpenAI, among other cutting-edge companies, has contributed to this revolution with diverse database systems to optimize data processing in machine learning models. In OpenAI’s system architecture, a variety of databases (both commercial and open source) are used, including PostgreSQL (RDBMS), Redis (key-value), Elasticsearch (full-text), MongoDB (document), and possibly Rockset (a derivative of the popular KV-library RocksDB, ideal for real-time data analytics). This heterogeneous approach is typical in large-scale, especially highly distributed, data processing environments. Often, multiple types of databases are leveraged to meet diverse data processing needs, reflecting the difficulty—if not impossibility—of a single database type performing all functions optimally.

Figure 1.1: Changes in Database popularity per category (August 2024, DB-Engines)

Despite the wide range of database genres, large language models still struggle with questions requiring “deep knowledge.” Figure 1.2 illustrates how a large language model encounters challenges with queries necessitating extensive traversal.

Figure 1.2: Hallucination with LLM

The question in Figure 1.2 involves finding causal paths (in the simplest case, shortest paths) between different entities. While large language models are trained on extensive datasets, including Wikipedia, they may struggle to calculate and retrieve hidden paths between entities that are not directly connected.

Figure 1.3 demonstrates how Wikipedia articles—represented as nodes (titles or hyperlinks) and their relationships as predicates—can be ingested into the Ultipa graph database. Performing a real-time, six-hop-deep shortest path query yields causal paths that are self-explanatory (a sketch of such a query follows Figure 1.3):

  1. Genghis Khan launched the Mongol invasions of West Asia and Europe.
  2. These invasions triggered the spread of the Black Death.
  3. The last major outbreak of the Black Death was the Great Plague of London.
  4. Isaac Newton fled the plague while attending Trinity College.
Figure 1.3: The shortest paths between entities using a graph database

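To make such a query concrete, here is a minimal sketch in ISO-GQL-style (Cypher-like) syntax rather than Ultipa’s own query language; the Page label, the LINKS_TO edge type, and the title property are hypothetical names chosen for illustration, and exact syntax may vary between implementations:

  // find one shortest chain of up to six LINKS_TO hops between the two articles
  MATCH p = ANY SHORTEST (a:Page {title: 'Genghis Khan'})-[:LINKS_TO]->{1,6}(b:Page {title: 'Isaac Newton'})
  RETURN p

Each returned path corresponds to one causal chain, such as the Genghis Khan-to-Isaac Newton sequence listed above.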

The key takeaway from this section is the importance of looking beyond the surface when addressing complex issues or scenarios. The ability to connect the dots and delve deeper into the underlying details allows for identifying root causes, which in turn fosters better decision-making and a more comprehensive understanding of the world’s intricate dynamics.

The rise of SQL

The introduction of the relational model revolutionized database management by offering a more structured and flexible way to organize and retrieve data. With the relational model as its foundation, SQL emerged as the standard query language, enabling users to interact with relational databases in a more efficient and intuitive manner. This section will explore how SQL’s development, built upon the relational model, became central to modern database systems and continues to influence their evolution today.

The rise of the relational model

Edgar F. Codd’s 1970 paper, A Relational Model of Data for Large Shared Data Banks (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf) laid the foundation for the relational model. Codd proposed a table-based structure for organizing data, introducing the key concepts of relations (tables), columns (attributes), rows (tuples), primary keys, and foreign keys. When compared to the navigational model (hierarchical model), such a structure provided a more intuitive and flexible way to handle data.

While we describe the relational model as more intuitive and flexible, that judgment belongs to the data processing scenarios of the 1970s-1990s. Things have changed gradually but constantly: the relational model has faced growing challenges and criticism with the rise of NoSQL and, eventually, the standardization of GQL. We will expand on the limitations of SQL and the promise of GQL in the last section of this chapter.

The key concepts of the relational model are rather simple: just five components, namely table, schema, key, relationship, and transaction. Let’s break them down one by one.

Table

A table, or relation, in the relational model is a structured collection of related data organized into rows and columns. Each table is defined by its schema, which specifies the structure and constraints of the table. Here’s a breakdown of its components:

  • Table Name: Each table has a unique name that describes its purpose. For instance, a table named Employees would likely store employee-related information.
  • Columns (Attributes): Each table consists of a set of columns, also known as attributes or fields. Columns represent the specific characteristics or properties of the entities described by the table. Each column has a name and a data type, which defines the kind of data it can hold. For example, in an Employees table, columns might include EmployeeID, FirstName, LastName, HireDate, and Department. The data type for each column could be integer, varchar (variable character), date, etc.
  • Rows (Tuples): Rows, or tuples, represent individual records within a table. Each row contains a set of values corresponding to the columns defined in the table. For example, a row in the Employees table might include 101, John, Doe, 2023-06-15, Marketing. Each row is a unique instance of the data described by the table.

Schema

Another key concept tied to tables (sometimes even tied to the entire RDBMS) is the schema. The schema of a table is a blueprint that outlines the table’s structure. It includes the following:

  • Column Definitions: For each column, the schema specifies its name, data type, and any constraints. Constraints might include NOT NULL (indicating that a column cannot have null values), UNIQUE (ensuring all values in a column are distinct), or DEFAULT (providing a default value if none is specified).
  • Primary Key: A primary key is a column or a set of columns that uniquely identifies each row in the table. It ensures that no two rows can have the same value for the primary key columns. This uniqueness constraint is crucial for maintaining data integrity and enabling efficient data retrieval. For example, EmployeeID in the Employees table could serve as the primary key.
  • Foreign Keys: Foreign keys are columns that create relationships between tables. They refer to the primary key of another table, establishing a link between the two tables. This mechanism supports referential integrity, ensuring that relationships between data in different tables are consistent.

Here we need to talk about normalization, which is a process applied to table design to reduce redundancy and improve data integrity. It involves decomposing tables into smaller, related tables and defining relationships between them. The goal is to minimize duplicate data and ensure that each piece of information is stored in only one place.

For example, rather than storing employee department information repeatedly in the Employees table, a separate Departments table can be created, with a foreign key in the Employees table linking to it.
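For contrast, here is a rough sketch, in ISO-GQL-style syntax, of how the same information might be modeled in a property graph, where a WORKS_IN edge plays the role of the foreign key; the labels, edge type, and property names below are illustrative assumptions rather than a prescribed schema, and statement separators are omitted for brevity:

  // create one department and one employee as nodes carrying their own properties
  INSERT (:Department {DeptID: 10, Name: 'Marketing'})
  INSERT (:Employee {EmployeeID: 101, FirstName: 'John', LastName: 'Doe', HireDate: '2023-06-15'})

  // connect them with an edge; the relationship replaces the foreign-key column
  MATCH (e:Employee {EmployeeID: 101}), (d:Department {Name: 'Marketing'})
  INSERT (e)-[:WORKS_IN]->(d)

Because the relationship is stored directly as an edge, no join between the Employees and Departments tables is needed at query time.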

The normalization concept sounds wonderful, but only on the surface. In many large data warehouses, it has produced an unwieldy number of tables. What was once seen as intuitive and flexible in the relational model can become a huge limitation and burden from a data governance perspective.

Relationships

Entity Relationship (ER) modeling is a foundational technique for designing databases using the relational model. Developed by Peter Chen in 1976, ER modeling provides a graphical framework for representing data and its relationships. It is crucial for understanding and organizing the data within a relational database. The keyword of ER modeling is graphical. The core concepts include entities, relationships, and attributes:

  • Entities: In ER modeling, an entity represents a distinct object or concept within the database. For example, in a university database, entities might include Student, Course, and Professor. Each entity is represented as a table in the relational model.
  • Attributes: Attributes describe the properties of entities. For instance, the Student entity might have attributes such as Student_ID, Name, Date_Of_Birth, and Major. Attributes become columns within the corresponding table.
  • Relationships: Relationships in ER modeling illustrate how entities are associated with one another. Relationships represent the connections between entities and are essential for understanding how data is interrelated. For example, a Student might be enrolled in a Course, creating a relationship between these two entities.

The caveat is that relationships come in several types:

  • One-to-One: In this type of relationship, each instance of entity A is associated with exactly one instance of entity B, and vice versa. For example, each Student might have one Student_ID, and each Student_ID corresponds to exactly one student.
  • One-to-Many: This relationship type occurs when a single instance of entity A is associated with multiple instances of entity B, but each instance of entity B is associated with only one instance of entity A. For example, a Professor might teach multiple Courses, but each Course is taught by only one Professor. If we pause here, we can immediately sense a problem with enforcing such a rigid relationship: if a course is to be taught by two or three professors (a rare scenario, but it does happen), the schema and table design would need to change. The more exceptions you can think of, the more redesigns you would face.
  • Many-to-Many: This relationship occurs when multiple instances of entity A can be associated with multiple instances of entity B. For example, a Student can enroll in multiple Courses, and each Course can have multiple Students enrolled. To model many-to-many relationships, a junction table (or associative entity) is used, which holds foreign keys referencing both entities.

ER diagrams offer a clear and structured way to represent entities, their attributes, and the relationships between them:

  • Entities are represented by rectangles
  • Attributes are shown as ovals, each connected to its corresponding entity
  • Relationships are illustrated as diamonds, linking the relevant entities

This visual framework provides a comprehensive way to design database schemas and better understand how different data elements interact within a system.

The ER diagram is essentially the graph data model we will be discussing throughout the book. The only difference between SQL and GQL in terms of ER diagrams is that GQL and graph databases natively organize and represent entities and their relationships, while SQL and RDBMS use ER diagrams only at the metadata (design) level, with the real data records stored in lower-dimensional tables. It’s tempting to conclude that the prevalence of the relational model reflects the limited computing power available at the time it was invented; exponentially greater computing power would eventually demand something more advanced, more intuitive, and more flexible.
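As a small illustration of this point, the Student-Course example above maps directly onto a graph pattern. The ISO-GQL-style sketch below (with hypothetical labels and property names) finds every student enrolled in a given course, and shows that a course taught by two professors is just one more edge rather than a schema redesign:

  // many-to-many enrollment needs no junction table: each enrollment is simply an edge
  MATCH (s:Student)-[:ENROLLED_IN]->(c:Course {Code: 'CS101'})
  RETURN s.Name

  // assigning a second professor to the course is one additional TEACHES edge
  MATCH (p:Professor {Name: 'P. Chen'}), (c:Course {Code: 'CS101'})
  INSERT (p)-[:TEACHES]->(c)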

Transactions

Transactions are a crucial aspect of relational databases, ensuring that operations are performed reliably and consistently. To better understand how these principles work in practice, let’s explore ACID properties.

The ACID properties – Atomicity, Consistency, Isolation, and Durability – define the key attributes of a transaction. Let’s explore them in detail:

  • Atomicity: Atomicity ensures that a transaction is treated as a single, indivisible unit of work. This is crucial for maintaining data integrity, especially in scenarios where multiple operations are performed as part of a single transaction. This means that either all operations within the transaction are completed successfully, or none are applied. If any operation fails, the entire transaction is rolled back, leaving the database in its previous state. It prevents partial updates that could lead to inconsistent data states.
  • Consistency: Consistency ensures that a transaction takes the database from one valid state to another valid state, preserving the integrity constraints defined in the schema. All business rules, data constraints, and relationships must be maintained throughout the transaction. Consistency guarantees that database rules are enforced and that the database remains in a valid state before and after the transaction.
  • Isolation: Isolation ensures that the operations of a transaction are isolated from other concurrent transactions. Even if multiple transactions are executed simultaneously, each transaction operates as if it were the only one interacting with the database. Isolation prevents interference between transactions, avoiding issues such as dirty reads, non-repeatable reads, and phantom reads. It ensures that each transaction’s operations are independent and not affected by others.
  • Durability: Durability guarantees that once a transaction is committed, its changes are permanent and persist even in the event of a system failure or crash. The committed data is stored in non-volatile memory, ensuring its longevity. Durability ensures that completed transactions are preserved and that changes are not lost due to unforeseen failures. This property provides reliability and trustworthiness in the database system.

These attributes are best illustrated by linking them to a real-world system and application ecosystem. Consider any financial institution’s transaction processing system where a transaction involves transferring funds from one account to another: the transaction must ensure that both the debit and credit operations are completed successfully (atomicity), the account balances remain consistent (consistency), other transactions do not see intermediate states (isolation), and the changes persist even if the system fails (durability). These properties are essential for the accuracy and reliability of financial transactions.
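Graph databases that target transactional workloads expose the same guarantees through explicit transaction statements. The sketch below expresses the fund transfer in ISO-GQL-style syntax using START TRANSACTION and COMMIT; the Account label, Balance property, multi-item SET form, and statement separators are illustrative assumptions, not a fixed API:

  START TRANSACTION;
  // debit one account and credit the other as a single, indivisible unit of work
  MATCH (a:Account {AccountID: 'A-001'}), (b:Account {AccountID: 'B-002'})
  SET a.Balance = a.Balance - 100,
      b.Balance = b.Balance + 100;
  // either both updates persist (COMMIT) or neither does (ROLLBACK on failure)
  COMMIT;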

The principles behind ACID were formulated by Jim Gray in the mid-1970s (the acronym itself was coined by Theo Härder and Andreas Reuter in 1983) and laid the foundation for reliable database transaction management. These properties were gradually incorporated into the SQL standard, beginning with SQL-86, and have since remained integral to relational database systems. For nearly five decades, the principles of ACID have been continuously adopted and refined by most relational database vendors, ensuring robust transaction management and data integrity. When comparing relational database management systems (RDBMS) with NoSQL and graph databases, the needs and implementation priorities of ACID properties vary, influencing how these systems handle transaction management and consistency.

Modern RDBMS include robust transaction management mechanisms to handle ACID properties. These systems use techniques such as logging, locking, and recovery to ensure transactions are executed correctly and data integrity is maintained. Managing concurrent transactions is essential for ensuring isolation and consistency. Techniques such as locking (both exclusive and shared) and multi-version concurrency control (MVCC) are used to handle concurrent access to data and prevent conflicts.

Evolution of NoSQL and new query paradigms

Today, big data is ubiquitous, influencing nearly every industry across the globe. As data grows in complexity and scale, traditional relational databases show limitations in addressing these new challenges. Unlike the structured, table-based model of relational databases, the real world is rich, high-dimensional, and interconnected, requiring new approaches to data management. The evolution of big data and NoSQL technologies demonstrates how traditional models struggled to meet the needs of complex, multi-faceted datasets. In this context, graph databases have emerged as a powerful and flexible solution, capable of modeling and querying intricate relationships in ways that were previously difficult to achieve. As industries continue to generate and rely on interconnected data, graph databases are positioning themselves as a transformative force, offering significant advantages in managing and leveraging complex data relationships.

The emergence of NoSQL and big data

The advent of big data marked a significant turning point in data management and analytics. While we often date the onset of the big data era to around 2012, the groundwork for this revolution was laid much earlier. A key milestone was the release of Hadoop by Yahoo! in 2006, which was subsequently donated to the Apache Foundation. Hadoop’s design was heavily inspired by Google’s seminal papers on the Google File System (GFS) and MapReduce.

GFS, introduced in 2003, and MapReduce, which followed in 2004, provided a new way of handling vast amounts of data across distributed systems. These innovations stemmed from the need to process and analyze the enormous volumes of data generated by Google’s search engine. At the core of Google’s search engine technology was PageRank, a graph algorithm for ranking web pages based on their link structures, intentionally named as a pun on the surname of Google co-founder Larry Page. This historical context illustrates that big data technologies have deep roots in graph theory, evolving towards increasingly sophisticated and large-scale systems.

Figure 1.4: From data to big data to fast data and deep data

Examining the trajectory of data processing technologies over the past 50 years reveals a clear evolution through distinct stages:

  1. The Era of Relational Databases (1970s-present): This era is defined by the dominance of relational databases, which organize data into structured tables and use SQL for data manipulation and retrieval.
  2. The Era of Non-Relational Databases and Big Data Frameworks (2000s-present): The rise of NoSQL databases and big data frameworks marked a departure from traditional relational models. These technologies address the limitations of relational databases in handling unstructured data and massive data volumes.
  3. The Post-Relational Database Era (2020s and beyond): Emerging technologies signal a shift towards post-relational databases, including NewSQL and Graph Query Language (GQL). These advancements seek to overcome the constraints of previous models and offer enhanced capabilities for managing complex, interconnected data.

Each of these stages has been accompanied by the development of corresponding query languages:

  1. Relational Database—SQL: First standardized in 1986, SQL became the cornerstone of relational databases, providing a powerful and versatile language for managing structured data.
  2. Non-Relational Database—NoSQL: The NoSQL movement introduced alternative models for data storage and retrieval, focusing on scalability and flexibility. NoSQL databases extend beyond SQL’s capabilities but lack formal standardization.
  3. Post-Relational Database—NewSQL and GQL: NewSQL brings SQL-like functionality to scalable, distributed systems, while GQL, with the first edition released in 2024, is designed to address the needs of graph databases.

These stages reflect an evolution in data characteristics and processing capabilities:

  1. Relational Database—Data, Pre-Big Data Era: Focused on managing structured data with well-defined schemas.
  2. Non-Relational Database—Big Data, Fast Data Era: Emphasized handling large volumes of diverse and rapidly changing data, addressing the 4Vs—volume, variety, velocity, and veracity.
  3. Post-Relational Database—Deep Data, or Graph Data Era: Represented a shift towards understanding and leveraging complex relationships within data, enhancing depth and analysis beyond the 4Vs.

The 4Vs – volume, variety, velocity, and veracity – capture the essence of big data:

  • Volume: The sheer amount of data
  • Variety: The different types and sources of data
  • Velocity: The speed at which data is generated and processed
  • Veracity: The reliability and accuracy of data

As data complexity grows, an additional dimension, depth, becomes crucial. This deep data perspective focuses on uncovering hidden relationships and extracting maximum value from interconnected data.

Understanding deep relationships among data is essential for various business and technological challenges:

  • Business Dimension: Value is embedded within networks of relationships, making it critical to analyze these connections to derive actionable insights.
  • Technology Dimension: Traditional databases struggle with network-based value extraction due to their tabular structure, which limits their ability to quickly identify deep associations between entities.

From 2004 to 2006, as Yahoo! developed Hadoop, other data processing projects emerged from different teams within the company. Yahoo!’s vast server clusters had to process massive volumes of web logs, posing significant processing challenges. Hadoop was designed to utilize low-cost, low-configuration machines in a distributed manner, but it suffered from inefficiencies in data processing speed and analytical depth. Although it excelled at handling large volumes and a wide variety of data types, these limitations eventually led to its donation to the Apache Foundation.

The introduction of Apache Spark in 2014 brought a major shift. Developed at the University of California, Berkeley, Spark addressed many of Hadoop’s performance issues. Its in-memory processing capabilities allowed it to process data up to 100x faster than Hadoop. With components such as GraphX, Spark enabled graph analytics, including algorithms such as PageRank and connected components. However, Spark’s focus remained on batch processing rather than real-time, dynamic data processing, leaving gaps in real-time, deep data analysis.

The ability to process deep data involves extracting insights from multi-layered, multi-dimensional data quickly. Graph databases and GQL, the focus of this book, are designed to address these challenges. By applying graph theory principles, they enable advanced network analysis, offering unique advantages over traditional NoSQL databases and big data frameworks. Their ability to perform real-time, dynamic analysis of complex data relationships makes them well-suited to the evolving demands of data management and analysis.

This book will guide readers through the historical development, current state, and future trends of graph databases, emphasizing their relevance to market needs and technological implementation.

Graphs and graph models

Graphs offer a natural way to model entities and relationships in data, making them an essential tool in various domains. From social networks to recommendation systems, graph-based approaches provide a flexible and efficient means of representing complex structures. Let’s explore the theoretical foundations of graph models and their role in modern data management.

Graph theory and graph data models

Graph database technology is fundamentally rooted in graph theory, which provides both the theoretical and practical foundations for graph computing. In this discussion, we will use the terms graph computing and graph database interchangeably, highlighting that computing plays a more pivotal role than storage in this context. This section explores the evolution of graph theory and its application to graph data modeling.

Graph theory can be traced back nearly 300 years to the groundbreaking work of the Swiss mathematician Leonhard Euler. Widely regarded as one of the greatest mathematicians, Euler laid the groundwork for this discipline with his solution to the Seven Bridges of Königsberg problem. In 1736, he abstracted the city’s physical layout—which included seven bridges and two islands connected to the mainland, forming four distinct land areas—into a graph composed of nodes and edges. His work led to the development of graph theory, which focuses on the study of graphs—structures made up of vertices (nodes) connected by edges (links or relationships).

Figure 1.5: Seven Bridges of Königsberg and graph theory

Euler’s exploration of the Seven Bridges problem involved determining whether it was possible to traverse each bridge exactly once in a single journey. He proved that such a path, known as an Eulerian path, was impossible in this specific configuration. Euler’s criteria for an Eulerian path are still fundamental to graph theory today: a graph can have an Eulerian path if and only if it has exactly zero or two vertices with an odd degree (number of edges connected to a vertex). If all vertices have even degrees, an Eulerian circuit, a special type of path that returns to the starting point, can be found. This early work established essential graph theory concepts that continue to influence modern graph computing.

Graph theory found practical applications beyond Euler’s initial problem. One notable example is the map coloring problem, which arose during the Age of Discovery and the subsequent rise of nation-states. The problem of coloring maps such that no two adjacent regions share the same color was first addressed by mathematicians in the mid-19th century. This problem led to the formulation of the Four-Color Theorem, which states that any map can be colored with no more than four colors such that no two adjacent regions share the same color. The proof of this theorem, completed with the assistance of computer algorithms in 1976, marked a significant milestone in both graph theory and computational methods.

In parallel with these developments, Johann B. Listing’s introduction of topology in 1847, which included concepts such as connectivity and dimensionality, further advanced the field. Sylvester’s work in 1878 formalized the concept of a graph as a collection of vertices connected by edges, introducing terminology that remains central to graph theory.

The systematic study of random graphs by mathematicians Erdős and Rényi in the 1960s laid the groundwork for understanding complex networks. Random graph theory became a fundamental tool for analyzing various types of networks, from social interactions to biological systems.

The advent of the semantic web, proposed by Tim Berners-Lee in the late 1990s, marked a significant application of graph theory to the World Wide Web. The semantic web conceptualizes web resources as nodes in a vast, interconnected graph, promoting the development of standards like the Resource Description Framework (RDF). While RDF did not achieve widespread industry adoption, it paved the way for the growth of knowledge graphs and social graphs, which became integral to major tech companies such as Yahoo!, Google, and Facebook.

Graph databases are now considered a subset of NoSQL databases, providing a contrast to traditional SQL-based relational databases. While SQL databases use tabular structures to model data, graph databases leverage high-dimensional graphs to represent complex relationships more naturally. Graph databases utilize vertices and edges to encode relationships, offering a more intuitive and efficient means of handling interconnected data. This approach contrasts with the tabular, two-dimensional constraints of traditional relational databases, which often struggle with complex, high-dimensional problems.

Graph theory has various applications, including navigation, recommendation engines, and resource scheduling. Despite the theoretical alignment with graph computing, many existing solutions use relational or columnar databases to tackle graph problems. This results in inefficient solutions that fail to leverage the full potential of graph-based methodologies. As knowledge graphs gain traction, the significance of graph databases and computing continues to grow, addressing challenges that traditional databases are ill-equipped to handle.

The evolution of graph theory and its integration into graph computing reflects a broader shift toward leveraging complex, interconnected data structures. Graph databases offer promising solutions to the limitations faced by previous data management approaches, making them a crucial component of modern data infrastructure.

Property graphs and semantic knowledge graphs

The evolution of technology often follows a trajectory marked by phases of innovation, adoption, peak excitement, disillusionment, and eventual maturity. This pattern is evident in the realm of graph database (or graph computing) development, where two primary types of graph models — Property Graphs (PGs) and Semantic Knowledge Graphs (SKGs) — have emerged, each contributing to the field in distinct ways.

Property Graphs

Property graphs, also known as Labeled Property Graphs (LPGs), represent one of the most influential models in graph computing. The concept of PGs revolves around nodes, edges, and properties. Nodes, also referred to as vertices or entities, and edges, the connections or relationships between nodes, can have associated attributes or properties. These attributes might include identifiers, weights, timestamps, and other metadata that provide additional context to the relationships and entities within the graph.
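A minimal ISO-GQL-style sketch illustrates how such properties attach to both nodes and edges; the User label, FOLLOWS edge type, and property names are hypothetical, and exact syntax may differ across implementations:

  // both endpoints and the relationship itself carry properties
  INSERT (:User {UserID: 'u1', Name: 'Alice'})
         -[:FOLLOWS {Since: '2021-08-01', Weight: 0.8}]->
         (:User {UserID: 'u2', Name: 'Bob'})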

LPG is a term popularized by Neo4j, a graph database, where a label is considered a special kind of index that can be associated with either nodes or edges for accelerated data access. Many people use LPG and PG interchangeably.

However, LPGs are actually a subset of property graph databases, as there are multiple ways to implement a graph database’s data model. For instance, Neo4j’s LPG implementation is schema-free, while GQL’s PG design is schematic.

It’s easy to see that, without properties (attribute fields), the expressive power of graphs would be significantly diminished. There is, however, a historical reason why properties were long neglected. In the 1980s and 1990s, social behavior analytics gained traction, eventually leading to the rise of Social Network Services (SNSs). Data analysis in SNSs traditionally focused primarily on the skeleton (or topology) of the data, and properties were not a priority. This was the case for most of the graph-processing frameworks that predate almost all PG databases.

The property graph model has seen a proliferation of implementations, including DGraph, TigerGraph, Memgraph, and Ultipa. These systems differ in architectural choices, service models, and APIs, reflecting the diverse needs and rapid evolution of the graph database market. The dynamic landscape of PG databases illustrates the flexibility and adaptability of this model in addressing a wide range of use cases.

Semantic Knowledge Graphs (SKGs)

In contrast to property graphs, SKGs are built upon principles derived from the Resource Description Framework (RDF) and related standards. SKGs focus on representing knowledge through semantic relationships, enabling more sophisticated querying and reasoning about the data.

The RDF standard, developed by the World Wide Web Consortium in 2004 (v1.0) and updated in 2014 (v1.1), provides a structured framework for describing metadata and relationships in a machine-readable format. RDF’s primary query language, SPARQL, allows for querying complex data structures, but it is often criticized for its verbosity and complexity. RDF’s emphasis on semantic relationships aligns with the goal of creating interoperable and extensible knowledge representations.

Despite its strong academic foundation, RDF and its associated technologies have faced challenges in gaining widespread adoption in the industry. The complexity of RDF and SPARQL has led to a preference for more user-friendly alternatives, such as JSON and simpler query languages. Exactly for this reason, property graph databases and GQL were born and are used by many graph enthusiasts and innovative enterprises who are looking to digitally transform their businesses.

While property graphs excel in data traversals with practical applications and ease of use, SKGs focus on the Natural Language Processing (NLP) aspect of things, offering a richer framework for semantic reasoning and interoperability. The interplay between these two models of graph stores, often in the form of using a PG database for link analysis (path-finding or deep traversals), and using an RDF store for semantic processing, reflects a broader trend toward integrating the strengths of both approaches to address diverse data challenges.

Current and future trends in graph database technology

As graph database technology continues to evolve, several key trends and advancements are shaping its development. These trends reflect the growing complexity of data environments and the increasing demand for powerful, efficient solutions.

This section explores three significant trends: Hybrid Transactional and Analytical Processing (HTAP), handling massive data volumes while maintaining performance, and compliance with the emerging GQL standard.

Hybrid Transactional and Analytical Processing (HTAP)

HTAP represents a transformative approach in the graph database arena. Traditionally, databases were categorized into transactional systems such as Online Transaction Processing (OLTP) and analytical systems such as Online Analytical Processing (OLAP), each optimized for different workloads. Transactional systems focus on managing and recording day-to-day operations, while analytical systems are designed for complex queries and large-scale data analysis.

Many, if not most, graph databases and almost all graph-processing frameworks were originally designed to handle AP-centric (analytical) traffic. This is also true for most NoSQL and big-data frameworks. These AP-centric graph systems tend to ingest bulk data offline and then analyze the static data online, which makes them slow at online data ingestion. If graph database systems are to become the next mainstream database, the most critical requirement is HTAP capability.

HTAP bridges this divide by enabling a single system, usually in the form of a cluster of multiple instances, to handle both transactional and analytical workloads. This integration is crucial for modern applications that require real-time analytics on live or transactional data.

In the context of graph databases, HTAP offers several advantages:

  • Performance Optimization: Advances in HTAP technology include innovations in indexing, query optimization, and in-memory processing. These improvements help maintain high performance levels even as data volumes and query complexities increase.
  • Real-Time Insights: HTAP enables real-time analytics on graph data, allowing organizations to gain immediate insights from ongoing transactions. This capability is particularly valuable in scenarios such as online fraud detection, recommendation engines, operation support and decision-making, and dynamic network analysis.
  • Streamlined Architecture: By consolidating transactional and analytical processing into a single logical system, HTAP reduces the complexity of maintaining separate databases for different purposes. This integration simplifies architecture and improves data consistency across various use cases.

Recent developments in HTAP for graph databases include the adoption of in-memory processing with large-scale parallelization and distributed computing. In-memory processing allows for faster data access and query execution, while distributed computing techniques enable the scaling of HTAP systems to handle large and complex graph data.

There are different approaches to in-memory computing, primarily distinguished by their ability to update datasets in real time. One school of design simply projects data into memory while the underlying data stays unchanged; another supports real-time synchronization of in-memory data with the persistence layer, which requires more sophisticated design and engineering.

Handling large-scale graph data

As data volumes grow exponentially, handling massive, large-scale data without significant performance degradation becomes a critical challenge. Traditional graph databases have often struggled with performance issues, particularly when executing deep traversal queries that require extensive computation.

Modern graph databases address these challenges through several key strategies:

  1. Distributed architecture
  2. Graph partitioning (sharding)
  3. Hardware-aided storage and computing
  4. Graph query/algorithm optimization

Let’s explore them in detail.

Distributed architecture

Distributed graph databases leverage clusters of machines to distribute data and computational workloads. This architecture supports both vertical and horizontal scaling, enabling the system to handle vast amounts of data by adding higher-end hardware components or more nodes to the cluster.

Systems typically evolve from a standalone instance to master-slave or high-availability architecture, then to a distributed-consensus architecture, and eventually to horizontally scalable architecture.

For readers who are interested in scalable graph database design, it is recommended to read books such as The Essential Criteria of Graph Databases by Ricky Sun, published in 2024, which contains dedicated chapters on scalable graph database design.

Graph partitioning (sharding)

Graph partitioning techniques divide a large graph into smaller, more manageable subgraphs (shards). These partitions are placed on individual server instances so they can be processed independently, reducing the computational load on each node. Efficient partitioning strategies minimize inter-node communication and improve overall performance.

The commonly used partitioning/sharding techniques cut the graph either by vertex or by edge. Note that both techniques require extra architectural components (e.g., meta-servers, name-servers, and shard-servers) and introduce data duplication (e.g., 2x or 3x more data points stored to keep the data linkages unbroken).

Hardware-aided storage and computing

Performance bottlenecks can be mitigated through hardware-accelerated storage and computing. In-memory databases reduce latency by storing data in RAM, SSDs offer faster data access than traditional hard drives, and GPUs and FPGAs help offload work from the CPUs. These storage and compute solutions are increasingly integrated into graph databases to enhance performance and scalability.

Optimized graph queries and algorithms

Advances at the hardware and software levels require the matching graph queries and algorithms to be reinvented. Many graph algorithms were originally designed to run sequentially (as single-threaded implementations) and have to be re-engineered to harness the parallel computing power of modern CPUs and distributed environments. The same holds true for many graph queries, such as path-finding, k-hop queries, or simply online data ingestion, all of which can be greatly improved with large-scale, distributed data processing. Queries and algorithms that minimize redundant computations and optimize data access patterns can significantly improve performance during deep traversals.

GQL compliance

GQL is emerging as a major standard in graph database technology, providing a unified query language for graph data. With the first edition of GQL already published, compliance with this standard is becoming a key focus for graph database vendors, as well as traditional RDBMS and NoSQL providers. Compliance helps vendors retain existing customers and attract new ones by ensuring interoperability and standardization across graph data platforms.

Key aspects of GQL compliance include the following:

  • Standardized Query Syntax: GQL offers a standardized syntax for querying graph data, making it easier for developers to write and maintain queries across different graph database systems. This standardization promotes interoperability and reduces the learning curve associated with adopting new graph databases.
  • Advanced Query Capabilities: GQL supports advanced querying capabilities, including pattern matching, traversal, and aggregation. By defining a comprehensive set of features, GQL will enable more sophisticated queries and analyses, enhancing the flexibility and power of graph databases.
  • Interoperability: Compliance with GQL improves integration and interoperability between different graph databases and applications. This is particularly important for organizations that use multiple graph technologies or require data exchange between systems.
  • Industry Adoption: As GQL gains traction, industry adoption is likely to drive further innovation and refinement. Vendors that prioritize GQL compliance will position themselves as leaders in the graph database market, attracting customers seeking standardized and future-proof solutions.

The trends in graph database technology highlight a dynamic and rapidly evolving field. HTAP is revolutionizing how graph databases handle transactional and analytical workloads, enabling real-time insights and streamlined architectures. Addressing the challenges of massive data volumes through distributed architectures, graph partitioning, and optimized algorithms ensures that graph databases can scale efficiently. Finally, GQL compliance is set to standardize and enhance graph querying, fostering greater interoperability and innovation in the industry. As these trends continue to develop, they will shape the future of (graph) database technology, driving advancements and new applications.

Why is GQL the new standard?

GQL stands as a pivotal advancement in the landscape of graph database technology. To understand why GQL represents a new standard, it is essential to explore its origins and evolution. GQL’s journey mirrors broader database technologies, reflecting the ongoing quest for more intuitive, expressive, and powerful methods to query complex data structures that are often beyond the reach of tables or columns.

The genesis of GQL

The origins of GQL trace back to the early days of graph databases. In the late 20th and early 21st centuries, as the digital world grew increasingly complex, traditional relational databases began to show their limitations in handling interconnected data. While SQL-based systems excelled in managing tabular data, they struggled with the flexible and multi-dimensional relationships typical of graph-based data.

During this period, both RDF stores and graph databases gained traction. RDF stores focused on semantics and NLP, while graph databases focused on efficient data traversals. These efforts laid the groundwork for network-traversal-oriented query languages. The need for a standardized query language that could elegantly handle these graph structures became evident.

GQL emerged from the convergence of various graph query languages and best practices, aiming to unify and standardize how we interact with graph data. Its inception was driven by the need to provide a consistent, powerful query language that could serve as a universal tool for graph databases, transcending the limitations of previous, often proprietary query languages.

The GQL standardization by ISO/IEC was officially kickstarted in 2019, and the joint technical committee’s project goal statement articulated why the world needs GQL:

Using graph as a fundamental representation for data modeling is an emerging approach in data management. In this approach, the data set is modeled as a graph, representing each data entity as a vertex (also called a node) of the graph and each relationship between two entities as an edge between corresponding vertices. The graph data model has been drawing attention for its unique advantages.

Firstly, the graph model can be a natural fit for data sets that have hierarchical, complex, or even arbitrary structures. Such structures can be easily encoded into the graph model as edges. This can be more convenient than the relational model, which requires the normalization of the data set into a set of tables with fixed row types.

Secondly, the graph model enables efficient execution of expensive queries or data analytic functions that need to observe multi-hop relationships among data entities, such as reachability queries, shortest or cheapest path queries, or centrality analysis. There are two graph models in current use: the Resource Description Framework (RDF) model and the Property Graph model. The RDF model has been standardized by W3C in a number of specifications. The Property Graph model, on the other hand, has a multitude of implementations in graph databases, graph algorithms, and graph processing facilities. However, a common, standardized query language for property graphs (like SQL for relational database systems) is missing. GQL is proposed to fill this void.

Evolution pathways

The evolution of GQL is a tale of gradual refinement and adaptation. In the early days of graph databases, numerous specialized query languages emerged, each tailored to individual systems. Languages such as Cypher, OpenCypher, Gremlin, GSQL, AQL, nQL, and UQL were each designed to offer easy yet powerful recursive traversal features. In some cases, however, these languages were challenging to learn and read, reflecting little more than their designers’ preferences. In short, they lacked the interoperability required for broader adoption. As the graph database community matured, the call for a standard language grew louder.

The development of GQL can be seen as a response to this call. The language was designed to address several key challenges: providing a unified syntax, ensuring compatibility across different graph database systems, and incorporating advanced features for querying complex graph structures. The transition from initial prototypes to a draft standard involved extensive collaboration within the graph database community, including contributions from academic researchers, industry practitioners, and standardization bodies.

The journey of GQL involved several significant milestones. The initial drafts were informed by existing graph query languages, such as Cypher used in Neo4j, Gremlin from the Apache TinkerPop project, and GSQL (formerly Graph SQL) from TigerGraph, as well as substantial input from Oracle’s PL/SQL and PGQL. These languages provided valuable insights and were instrumental in shaping the foundational aspects of GQL. Moreover, as part of the ongoing development, care was taken to align GQL with the SQL/PGQ graph pattern matching language (GPM or GPML), especially with the publication of the SQL/PGQ standard in 2023. This alignment ensured consistency between the two languages, facilitating better integration across graph-based and relational systems.

As GQL continued to evolve, it incorporated feedback from a broad range of stakeholders, including those working on SQL/PGQ, and underwent rigorous testing to ensure its robustness and effectiveness.

Personal reflections on GQL’s evolution

Reflecting on the evolution of GQL, Ricky Sun finds it remarkable how this language embodies the collective effort of the graph database community. It’s not merely a technical achievement but a testament to the power of collaboration and innovation. GQL represents a convergence of ideas, drawing from the strengths of existing languages while introducing new concepts that address the unique challenges of graph data.

In many ways, GQL reminds him of the early days of SQL. Just as SQL revolutionized relational databases by providing a standardized way to interact with tabular data, GQL has the potential to do the same for graph databases, and perhaps for other database models as well. It is exciting to witness the birth of a new standard that promises to bring clarity and coherence to the field of graph computing and beyond.

In conclusion, the evolution of GQL reflects a broader trend in technology towards standardization and interoperability. It represents a critical step forward in the quest for more effective ways to manage and query complex, interconnected data. As we move forward, GQL will undoubtedly play a central role in shaping the future of graph databases, offering a powerful and unified approach to graph querying that will benefit both practitioners and researchers alike.

Core features and capabilities

GQL addresses the limitations of previous query languages and introduces several core features designed to enhance both usability and performance. This section highlights the pivotal features that make GQL a robust and versatile tool for modern graph databases.

Flexibility and expressiveness

GQL offers a high degree of flexibility and expressiveness, allowing users to construct complex queries with relative ease. The language supports a rich set of operations for traversing and manipulating graph data, including the following:

  • Graph Data Modeling: GQL enables users to represent complex, interconnected data in a way that reflects real-world entities and their interactions. It supports the creation and manipulation of nodes and edges, allowing users to define the structure of their graph database clearly. This includes specifying the types and attributes of nodes and edges, as well as establishing relationships and constraints. By providing a flexible and expressive framework for data modeling, GQL allows users to create schemas that capture the nuances of their data and facilitate efficient querying and analysis. This foundational capability ensures that the graph’s structure aligns with the specific needs of various applications, enhancing the overall effectiveness and scalability of graph-based solutions.
  • Pattern Matching: GQL provides powerful pattern-matching capabilities, enabling users to find specific subgraphs or structures within large datasets. This feature is essential for applications such as fraud detection, social network analysis, and recommendation systems.
  • Path Traversals: One of the standout features of GQL is its support for deep and wide path traversals. Users can specify detailed paths through the graph, including variable-length paths and patterns that span multiple relationships, without incurring significant performance penalties. This sets GQL apart from SQL, whose recursive queries (via recursive common table expressions) have long been criticized as verbose and awkward for graph workloads; GQL makes such traversals fast and easy to express (see the sketch after this list).
  • Subgraph Extraction: GQL allows for the extraction and creation of subgraphs based on specific criteria, facilitating the extraction of relevant portions of a graph for focused analysis or reporting.
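
As a rough illustration of the pattern-matching and traversal features above, consider the sketch below. The Account, Device, and Person labels are invented for the example, and quantifier syntax may differ slightly across implementations.

  // Pattern matching: two different accounts registered on the same device
  MATCH (a:Account)-[:USES]->(d:Device)<-[:USES]-(b:Account)
  WHERE a.id <> b.id
  RETURN a.id, b.id, d.serial

  // Variable-length traversal: people reachable within 1 to 3 KNOWS hops of Alice
  MATCH (me:Person {name: 'Alice'})-[:KNOWS]->{1,3}(friend:Person)
  RETURN DISTINCT friend.name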

Usability and developer experience

User experience and developer productivity are central to GQL’s design. GQL’s syntax is crafted to be intuitive and user-friendly, reducing the learning curve for new users and enhancing productivity for experienced developers. The language balances complexity with clarity, making it accessible for a broad range of applications.

GQL’s core features and capabilities represent a significant advancement in the field of graph query languages. By addressing flexibility, interoperability, and usability, GQL sets a new standard for querying graph databases, positioning itself as a powerful tool for modern data analysis and management.

Note that GQL performance is implementation-specific: the same GQL clause may run at very different speeds on different vendors' platforms. Accuracy and result validation also deserve careful scrutiny, since the processing logic and result structures of GQL queries are inherently more complex than those of tabular SQL.

Advantages of GQL over traditional query languages

GQL represents a significant advancement in querying graph data, addressing the limitations of traditional SQL-based and non-relational query languages. This section explores the key advantages of GQL over traditional query languages, highlighting its impact on graph database management and querying.

Intuitive representation of graph data

Unlike SQL, which is built for tabular data, GQL is tailored for graph-based data structures. SQL’s reliance on JOIN operations makes handling interconnected data cumbersome. In contrast, GQL’s native graph orientation allows for a more intuitive representation of data relationships, using nodes and edges directly. This simplifies the querying process and makes it easier for users to understand and manipulate complex networks of data. The result is more natural and efficient queries that align closely with the underlying data model.
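
For instance, a friend-of-friend lookup that would require self-joining a friendship table twice in SQL can be written as a single traversal in GQL. The sketch below assumes a hypothetical Person/KNOWS graph.

  // Friends of Alice's friends, excluding Alice herself
  // (in SQL this typically means two self-joins plus filters on the join keys)
  MATCH (me:Person {name: 'Alice'})-[:KNOWS]->()-[:KNOWS]->(fof:Person)
  WHERE fof.name <> 'Alice'
  RETURN DISTINCT fof.name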

Simplified and expressive querying

GQL's graph-centric syntax is designed to make queries over relationships and patterns more expressive. Traditional query languages often require elaborate queries with nested sub-queries and multiple JOINs to achieve similar results. GQL streamlines this process by providing concise constructs for traversing nodes and edges, making queries easier to write, understand, and maintain. This expressiveness not only reduces query complexity but also improves readability and debugging.

Enhanced performance for relationship queries

GQL is designed from the ground up for efficient graph traversal. Traditional relational databases can struggle with performance when dealing with complex relationships and deep traversals: SQL queries involving multiple JOINs become inefficient and slow, especially on large datasets. GQL, by contrast, is optimized for handling intricate relationships and deep traversals. Its design allows for efficient pathfinding and pattern matching, which are crucial for applications such as social networks, fraud detection, and recommendation systems. The result is significantly better performance for queries involving complex relationships and connections.

Flexibility in schema design

Flexible schema design is another notable advantage of GQL over SQL. Traditional SQL databases require a rigid schema that must be defined up front and adhered to; changing it can be disruptive and costly. In contrast, GQL supports dynamic schema design, allowing greater flexibility in how data is represented and modified. This is particularly beneficial in graph databases, where the structure of the data may evolve over time. GQL's ability to handle evolving schemas means that users can adapt their data models without being constrained by rigid schema definitions.
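
As a small illustration of this flexibility: in a schema-free or loosely constrained graph, two vertices with the same label can carry different property sets, so a new attribute can be introduced without a table migration. The labels and properties below are hypothetical, and whether such heterogeneity is permitted ultimately depends on how the graph type is defined.

  // Two Person vertices with different property sets; no ALTER TABLE required
  INSERT (:Person {name: 'Carol'}),
         (:Person {name: 'Dave', title: 'Analyst', hireYear: 2024})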

Advanced pattern matching and analysis

Rich pattern-matching features are another area where GQL shines. GQL includes advanced pattern-matching capabilities that are inherently suited for analyzing complex graph structures. Traditional query languages, like SQL, do not natively support pattern matching in the same way and often require additional processing or external tools to achieve similar results. GQL’s pattern-matching features allow users to query for specific graph patterns, sub-graphs, and relationships directly. This capability is invaluable for use cases such as network analysis, fraud detection, and social graph analysis, where understanding and identifying patterns are critical.
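
A typical example is searching for short transfer cycles that may indicate funds being routed back to their origin. The sketch below uses a hypothetical Account/TRANSFER schema; the repeated variable a and the TRAIL path mode (no repeated edges) do the work, though support for specific path modes varies by implementation.

  // Cycles of 2 to 4 transfers that start and end at the same account
  MATCH TRAIL (a:Account)-[:TRANSFER]->{2,4}(a)
  RETURN DISTINCT a.id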

Streamlined integration with graph algorithms

While GQL itself focuses on querying, vendors implementing GQL typically ensure that it works seamlessly with graph algorithms, which are essential for advanced analytics and insights in graph databases. Traditional SQL queries usually operate in isolation from the algorithms used for in-depth graph analysis. GQL's design facilitates integration with graph algorithms, allowing users to execute sophisticated analytical tasks directly within the query environment. This enhances the ability to perform tasks such as centrality analysis, community detection, and shortest-path computation without switching contexts or tools.
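
In practice, this integration often takes the form of a procedure call issued from GQL. The sketch below is purely illustrative: CALL and YIELD belong to the GQL statement repertoire, but the page_rank procedure name, its arguments, and its output columns are hypothetical and vendor-specific.

  // Rank vertices with a vendor-supplied PageRank procedure, keep the top ten
  CALL page_rank('transfer_graph', 20) YIELD node, score
  RETURN node, score
  ORDER BY score DESC
  LIMIT 10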

Future-proofing and standardization

Future-proofing and standardization are other key advantages of GQL. GQL brings a level of consistency and reliability to graph querying that is often lacking in the diverse landscape of traditional query languages. The establishment of GQL as a standardized query language means that it will provide a consistent foundation for graph databases across different platforms and implementations. This standardization helps ensure interoperability and future-proofs the technology, making it easier for organizations to adopt and integrate graph databases into their existing systems.

Enhanced support for real-time applications

Many modern applications require real-time data processing and analysis. Traditional SQL databases may face limitations in handling real-time graph queries due to their tabular nature and the overhead associated with JOIN operations. GQL is designed to support real-time querying and analysis, making it well-suited for applications that demand instantaneous insights, such as live recommendation engines and real-time fraud detection systems.

Better alignment with use cases

GQL’s design is inherently aligned with use cases that involve complex relationships and interconnections. Traditional query languages often require workarounds to address these scenarios, which can lead to inefficiencies and convoluted queries. GQL’s focus on graph-centric use cases ensures that it provides the right tools and features to address the unique challenges of graph data management and querying.

In summary, GQL offers a range of advantages over traditional query languages, particularly when it comes to handling graph-based data. Its intuitive graph-centric design, simplified querying, enhanced performance, and alignment with modern use cases make it a powerful tool for working with complex, interconnected data. As the field of graph databases continues to evolve, GQL’s role as a standard query language will likely become even more significant, driving advancements in data management and analysis.

In the NoSQL landscape, it is interesting to note that graph databases stand out as the only database model supported by industry standards. This chapter has discussed two main variants of the graph data model: Property Graphs (PGs) and Semantic Knowledge Graphs (SKGs). SKGs have long had their own standards, such as W3C's RDF and SPARQL for querying. Until recently, property graphs lacked a unified standard; this changed with the release of GQL, which provides a standard way to query property graph databases.

It is noteworthy that no other NoSQL data model, apart from graph databases, can leverage such comprehensive standards. This is likely no coincidence. Graph databases offer significant advantages over other NoSQL models such as key-value, document, or wide-column databases, particularly in representing and querying complex relationships.

The existence of these standards for graph databases may reflect their growing importance and the need for interoperability in increasingly complex data ecosystems. It also highlights the maturity and evolving nature of graph database technology in handling interconnected data structures.

Summary

In this chapter, we explored how database query languages have evolved, moving from relational and NoSQL models to the emergence of the graph model. We looked at the challenges traditional databases face when handling complex relationships and how graph databases provide a more intuitive way to manage interconnected data. By tracing the history of data management—from early record-keeping methods to the rise of SQL—we set the stage for understanding why GQL has become necessary and how it’s shaping the future of data querying.

Next, we’ll dive into the fundamentals of GQL and graph theory. We’ll cover essential graph terminology, explore the structure of the GQL catalog system, and see how different value types are handled in GQL. The chapter also includes executable GQL code, giving you a chance to try things out in your own environment or with GQL Playground by Ultipa Graph.

Get This Book’s PDF Version and Exclusive Extras

Scan the QR code (or go to packtpub.com/unlock). Search for this book by name, confirm the edition, and then follow the steps on the page.

Note: Keep your invoice handy. Purchases made directly from Packt don’t require an invoice.


Key benefits

  • Go beyond theory and apply key concepts and syntax through interactive tutorials and practical examples via the GQL Playground
  • Leverage advanced features of GQL to manipulate graph data efficiently
  • Explore GQL applications in data analytics and discover how to leverage graph knowledge in real-world scenarios
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Graph Query Language is becoming the go-to standard for graph databases, especially with its support for interconnected analytics and GenAI capabilities. This book comes from a team of industry veterans who know exactly how to break down the fundamental GQL concepts, graph terms, definitions, catalog systems, and everything that matters in actual work. You’ll get to grips with graph data types, value expressions, graph matching patterns, and modifying statements through practical GQL examples. Once you've got the basics down, you’ll tackle advanced GQL topics such as path modes, complex path matching patterns, shortest path queries, composite statements, session and transaction commands, and procedures. You’ll also learn to create extensions and understand the design of graph databases to solve industry issues. The authors cover techniques like property graphs to help you optimize your graph queries and offer insights into the future of GQL and graph technology. By the end of this book, you’ll confidently query and update graph data, run graph algorithms, create visualizations, and apply your learnings to a real-world use case of money flow analysis for assessing bank client behaviors and detecting transaction risks. *Email sign-up and proof of purchase required

Who is this book for?

This book is for graph database developers, DBAs, programmers, data engineers, and analysts who want to learn the new graph database standard GQL. A basic understanding of graph and relational databases, data models, knowledge of SQL basics, and programming will make the content easy to grasp. While it is designed to be accessible even if you don’t have a background in graph theory, familiarity with concepts like nodes, edges, relationships, and the distinction between directed and undirected graphs will enhance your learning experience.

What you will learn

  • Experiment with GQL syntax on GQL Playground, including MATCH, RETURN, INSERT, UPDATE, and DELETE
  • Work with operators, functions, and variables in an organized fashion
  • Become familiar with complex topics such as varying path matching modes, repeated variables, shortest path, procedures, and transactions
  • Enhance execution speed through indexing or caching systems
  • Understand how to manage access control effectively
  • Tackle real-world issues with a case study focused on money transaction analytics

Product Details

Publication date: Aug 22, 2025
Length: 382 pages
Edition: 1st
Language: English
ISBN-13: 9781836204008


Table of Contents

17 Chapters
Evolution Towards Graph Databases
Key Concepts of GQL
Getting Started with GQL
GQL Basics
Exploring Expressions and Operators
Working With GQL Functions
Delve into Advanced Clauses
Configuring Sessions
Graph Transactions
Conformance to the GQL Standard
Beyond GQL
A Case Study – Anti-Fraud
The Evolving Landscape of GQL
Glossary and Resources
Unlock Your Exclusive Benefits
Other Books You May Enjoy
Index
