Agile data modeling with Neo4j

Sumit Gupta

February 2015

In this article by Sumit Gupta, author of the book Neo4j Essentials, we discuss data modeling in Neo4j, which is evolving and flexible enough to adapt to changing business requirements. The model captures new data sources, entities, and their relationships as they naturally occur, so the database adapts easily to change, which in turn makes development more agile and more responsive to changing business requirements.


Data modeling is a multistep process and involves the following steps:

  1. Define requirements or goals in the form of questions that need to be executed and answered by the domain-specific model.
  2. Once we have our goals ready, we can dig deep into the data and identify entities and their associations/relationships.
  3. Now, as we have our graph structure ready, the next step is to form patterns from our initial questions/goals and execute them against the graph model.

This whole process is applied in an iterative and incremental manner, similar to what we do in agile, and has to be repeated whenever we change our goals or add new goals/questions that need to be answered by our graph model.

Let's see in detail how data is organized/structured and implemented in Neo4j to bring in the agility of graph models.

Based on the principles of graph data structure available at http://en.wikipedia.org/wiki/Graph_(abstract_data_type), Neo4j implements the property graph data model at storage level, which is efficient, flexible, adaptive, and capable of effectively storing/representing any kind of data in the form of nodes, properties, and relationships.

Neo4j not only implements the property graph model, but has also evolved the traditional model and added the feature of tagging nodes with labels, which is now referred to as the labeled property graph.

Essentially, in Neo4j, everything needs to be defined in either of the following forms:

  • Nodes: A node is the fundamental unit of a graph, which can also be viewed as an entity, but based on a domain model, it can be something else too.
  • Relationships: These define the connection between two nodes. Relationships also have types, which further differentiate relationships from one another.
  • Properties: Properties are viewed as attributes and do not have their own existence. They are related either to nodes or to relationships. Nodes and relationships can both have their own set of properties.
  • Labels: Labels are used to group nodes into sets. Nodes that carry the same label belong to the same set. These sets can be used to create indexes for faster retrieval, to mark temporary states of certain nodes, and for many other purposes, depending on the domain model.

Let's see how all of these four forms are related to each other and represented within Neo4j.

A graph essentially consists of nodes, which can also have properties. Nodes are linked to other nodes. The link between two nodes is known as a relationship, which also can have properties. Nodes can also have labels, which are used for grouping the nodes.

Let's take up a use case to understand data modeling in Neo4j. John is a male and his age is 24. He is married to a female named Mary whose age is 20. John and Mary got married in 2012.

Now, let's develop the data model for the preceding use case in Neo4j:

  • John and Mary are two different nodes.
  • Marriage is the relationship between John and Mary.
  • Age of John, age of Mary, and the year of their marriage become the properties.
  • Male and Female become the labels.

Easy, simple, flexible, and natural… isn't it?
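The model above can be sketched in Cypher roughly as follows. This is only an illustration: the relationship type MARRIED_TO and the property names Name, Age, and Year are assumptions chosen for this example, since the use case does not prescribe them.

```cypher
// Create both nodes (with labels and properties) and the relationship
// between them in a single statement.
// MARRIED_TO, Name, Age, and Year are illustrative names (assumptions).
CREATE (john:Male {Name: 'John', Age: 24})-[:MARRIED_TO {Year: 2012}]->(mary:Female {Name: 'Mary', Age: 20});

// Turn a question into a pattern: who is John married to, and since when?
MATCH (:Male {Name: 'John'})-[r:MARRIED_TO]-(spouse)
RETURN spouse.Name AS Spouse, r.Year AS MarriedIn;
```

The second statement shows the third modeling step in practice: a question from the requirements becomes a pattern that is executed against the graph.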

The data structure in Neo4j is adaptive and can effectively model anything that is not fixed and evolves over a period of time.

The next step in data modeling is fetching the data from the data model, which is done through traversals. Traversals are another important aspect of graphs: you follow paths within the graph, starting from a given node and then following its relationships to other nodes. Neo4j provides two kinds of traversals: breadth-first (http://en.wikipedia.org/wiki/Breadth-first_search) and depth-first (http://en.wikipedia.org/wiki/Depth-first_search).
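As a rough illustration, a traversal that starts at one node and follows its relationships can be expressed in Cypher with a variable-length pattern. The sketch below assumes a node with a Name property of 'John', as in the use case above; the property name is an assumption. Note that Cypher is declarative, so the query does not itself select breadth-first or depth-first order; that choice is made explicitly when using Neo4j's Java traversal framework.

```cypher
// Start from the node named John and follow relationships of any type,
// in either direction, between one and three hops away.
MATCH (start {Name: 'John'})-[*1..3]-(reached)
RETURN DISTINCT reached;
```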

If you are from the RDBMS world, then you must now be wondering, "What about the schema?" and you will be surprised to know that Neo4j is a schemaless or schema-optional graph database. We do not have to define the schema unless we are at a stage where we want to provide some structure to our data for performance gains. Once performance becomes a focus area, then you can define a schema and create indexes/constraints/rules over data.
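For instance, once performance or data integrity becomes a concern, an index and a uniqueness constraint could be added along the following lines. This sketch uses the Movie and Artist labels from the sample dataset used later in this article, and the statement syntax of the Neo4j 2.x releases current when this article was written; later Neo4j versions use a different syntax for both statements, so check the documentation for your version.

```cypher
// Index movies by title to speed up lookups such as (:Movie {Title: 'Rocky'}).
CREATE INDEX ON :Movie(Title);

// Ensure that no two Artist nodes share the same Name.
CREATE CONSTRAINT ON (a:Artist) ASSERT a.Name IS UNIQUE;
```

Neither statement is required for the queries in this article to work; they only add structure on top of data that can already be stored without a schema.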

Unlike the traditional models where we freeze requirements and then draw our models, Neo4j embraces data modeling in an agile way so that it can be evolved over a period of time and is highly responsive to the dynamic and changing business requirements.

Read-only Cypher queries

In this section, we will discuss one of the most important aspects of Neo4j, that is, read-only Cypher queries.

Read-only Cypher queries are not only the core component of Cypher but also help us explore and leverage various patterns and pattern-matching constructs. A read-only query begins with MATCH, OPTIONAL MATCH, or START, which can be used in conjunction with the WHERE clause, may be followed by WITH, and ends with RETURN. Constructs such as ORDER BY, SKIP, and LIMIT can also be used with WITH and RETURN.

We will discuss the read-only constructs in detail, but before that, let's create a sample dataset that we can use to illustrate the constructs/syntax of read-only Cypher queries.
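To show how these clauses fit together, here is a small sketch of a read-only query. It assumes the movie dataset created in the next section (Artist and Movie nodes connected by ACTED_IN relationships); the column aliases are illustrative.

```cypher
// Count the Rocky movies each artist acted in, most prolific first.
MATCH (a:Artist)-[:ACTED_IN]->(m:Movie)
WHERE m.Title =~ 'Rocky.*'          // filter with a regular expression
WITH a, count(m) AS RockyMovies     // aggregate and pass results on
RETURN a.Name AS Artist, RockyMovies
ORDER BY RockyMovies DESC
SKIP 0
LIMIT 5;
```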

Creating a sample dataset – movie dataset

Let's perform the following steps to clean up our Neo4j database and insert some data which will help us in exploring various constructs of Cypher queries:

  1. Open your Command Prompt or Linux shell and open the Neo4j shell by typing <$NEO4J_HOME>/bin/neo4j-shell.
  2. Execute the following commands on your Neo4j shell for cleaning all the previous data:
    //Delete all relationships between Nodes
    MATCH ()-[r]-() delete r;
    //Delete all Nodes
    MATCH (n) delete n;
  3. Now we will create a sample dataset, which will contain movies, artists, directors, and their associations. Execute the following set of Cypher queries in your Neo4j shell to create the list of movies:
    CREATE (:Movie {Title : 'Rocky', Year : '1976'});
    CREATE (:Movie {Title : 'Rocky II', Year : '1979'});
    CREATE (:Movie {Title : 'Rocky III', Year : '1982'});
    CREATE (:Movie {Title : 'Rocky IV', Year : '1985'});
    CREATE (:Movie {Title : 'Rocky V', Year : '1990'});
    CREATE (:Movie {Title : 'The Expendables', Year : '2010'});
    CREATE (:Movie {Title : 'The Expendables II', Year : '2012'});
    CREATE (:Movie {Title : 'The Karate Kid', Year : '1984'});
    CREATE (:Movie {Title : 'The Karate Kid II', Year : '1986'});
  4. Execute the following set of Cypher queries in your Neo4j shell to create the list of artists:
    CREATE (:Artist {Name : 'Sylvester Stallone', WorkedAs : ["Actor", "Director"]});
    CREATE (:Artist {Name : 'John G. Avildsen', WorkedAs : ["Director"]});
    CREATE (:Artist {Name : 'Ralph Macchio', WorkedAs : ["Actor"]});
    CREATE (:Artist {Name : 'Simon West', WorkedAs : ["Director"]});
  5. Execute the following set of Cypher queries in your Neo4j shell to create the relationships between artists and movies:
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky II"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky III"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky IV"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky V"}) CREATE (artist)-[:ACTED_IN {Role : "Rocky Balboa"}]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "The Expendables"}) CREATE (artist)-[:ACTED_IN {Role : "Barney Ross"}]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "The Expendables II"}) CREATE (artist)-[:ACTED_IN {Role : "Barney Ross"}]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky II"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky III"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "Rocky IV"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "Sylvester Stallone"}), (movie:Movie {Title: "The Expendables"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "John G. Avildsen"}), (movie:Movie {Title: "Rocky"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "John G. Avildsen"}), (movie:Movie {Title: "Rocky V"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "John G. Avildsen"}), (movie:Movie {Title: "The Karate Kid"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "John G. Avildsen"}), (movie:Movie {Title: "The Karate Kid II"}) CREATE (artist)-[:DIRECTED]->(movie);
    Match (artist:Artist {Name : "Ralph Macchio"}), (movie:Movie {Title: "The Karate Kid"}) CREATE (artist)-[:ACTED_IN {Role:"Daniel LaRusso"}]->(movie);
    Match (artist:Artist {Name : "Ralph Macchio"}), (movie:Movie {Title: "The Karate Kid II"}) CREATE (artist)-[:ACTED_IN {Role:"Daniel LaRusso"}]->(movie);
    Match (artist:Artist {Name : "Simon West"}), (movie:Movie {Title: "The Expendables II"}) CREATE (artist)-[:DIRECTED]->(movie);
  6. Next, browse your data through the Neo4j browser. Click on Get some data in the left navigation pane and then execute the query by clicking on the right arrow sign that appears in the extreme right corner, just below the browser navigation bar. The browser then displays the nodes and relationships of the movie dataset as a graph.

Now, let's understand the different pieces of read-only queries and execute those against our movie dataset.
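Before running those queries, it can help to confirm that the dataset loaded as intended. The following sketch counts nodes per label combination and relationships per type; the column aliases are illustrative.

```cypher
// Number of nodes for each combination of labels (for example, Movie and Artist).
MATCH (n) RETURN labels(n) AS Labels, count(n) AS Nodes;

// Number of relationships for each type (ACTED_IN and DIRECTED).
MATCH ()-[r]->() RETURN type(r) AS Relationship, count(r) AS Total;
```

If a count is higher than expected, a CREATE statement was probably executed more than once, and the clean-up step can be repeated before reloading.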

Working with the MATCH clause

MATCH is the most important clause used to fetch data from the database. It accepts a pattern, which defines what to search for and where to start searching. If the latter is not provided, Cypher scans the whole graph, using indexes (if defined) to make the search faster.

Working with nodes

Let's start asking questions of our movie dataset and then form Cypher queries, execute them in <$NEO4J_HOME>/bin/neo4j-shell against the movie dataset, and get the results that answer our questions:

  • How do we get all nodes and their properties?
  • Answer: MATCH (n) RETURN n;
  • Explanation: We are instructing Cypher to scan the complete database and capture all nodes in a variable n and then return the results, which in our case will be printed on our Neo4j shell.
  • How do we get nodes with specific properties or labels?
  • Answer: Match with label, MATCH (n:Artist) RETURN n; or MATCH (n:Movie) RETURN n;
  • Explanation: We are instructing Cypher to capture all nodes that carry the label Artist or Movie. Note that labels are case-sensitive and must match the dataset exactly (Movie, not Movies).
  • Answer: Match with a specific property MATCH (n:Artist {WorkedAs:["Actor"]}) RETURN n;
  • Explanation: We are instructing Cypher to capture all nodes that carry the label Artist and whose WorkedAs property equals ["Actor"]. Because WorkedAs is stored as a collection, its value must be written in square brackets, and the match succeeds only when the stored collection equals the one given. Properties that hold a single value are written without square brackets.

We can also return specific columns (similar to SQL). For example, the preceding statement can also be written as MATCH (n:Artist {WorkedAs:["Actor"]}) RETURN n.Name as Name;. Property names are case-sensitive, so n.Name must be written exactly as it was created.
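Because the property match above compares the whole collection, it does not find artists who worked as an actor and as something else. One way to match on membership instead, sketched here against the same dataset, is the IN operator in a WHERE clause:

```cypher
// Find every artist whose WorkedAs collection contains "Actor",
// regardless of what else the collection holds.
MATCH (n:Artist)
WHERE 'Actor' IN n.WorkedAs
RETURN n.Name AS Name;
```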

Working with relationships

Let's understand the process of defining relationships in the form of Cypher queries in the same way as you did in the previous section while working with nodes:

  • How do we get nodes that are associated or have relationships with other nodes?
  • Answer: MATCH (n)-[r]-(n1) RETURN n,r,n1;
  • Explanation: We are instructing Cypher to scan the complete database and capture all nodes, their relationships, and nodes with which they have relationships in variables n, r, and n1 and then further return/print the results on your Neo4j shell. Also, in the preceding query, we have used - and not -> as we do not care about the direction of relationships that we retrieve.
  • How do we get nodes that are connected by a specific type of relationship, or by a relationship with a specific property?
  • Answer: MATCH (n)-[r:ACTED_IN {Role : "Rocky Balboa"}]->(n1) RETURN n,r,n1;
  • Explanation: We are instructing Cypher to scan the complete database and capture all nodes, their relationships, and nodes, which have a relationship as ACTED_IN and with the property of Role as Rocky Balboa. Also, in the preceding query, we do care about the direction (incoming/outgoing) of a relationship, so we are using ->.

To match multiple relationship types, replace [r:ACTED_IN] with [r:ACTED_IN | DIRECTED]. If the name of a relationship type contains special characters, enclose the name in backticks.
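As a sketch against the movie dataset, the following query matches either relationship type and reports which one connected each artist to the movie. The second statement uses a hypothetical relationship type, ACTED-IN, that does not exist in the dataset; it is shown only to illustrate backtick quoting.

```cypher
// Everyone who acted in or directed Rocky II, with the relationship type.
MATCH (a:Artist)-[r:ACTED_IN|DIRECTED]->(m:Movie {Title: 'Rocky II'})
RETURN a.Name AS Artist, type(r) AS Relationship;

// Hypothetical type name containing a hyphen, quoted with backticks.
MATCH (a)-[r:`ACTED-IN`]->(m)
RETURN a, m;
```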

  • How do we get a coartist?
  • Answer: MATCH (n {Name : "Sylvester Stallone"})-[r]->(x)<-[r1]-(n1) return n.Name as Artist,type(r),x.Title as Movie, type(r1), n1.Name as Artist2;
  • Explanation: We are trying to find all artists that are related to Sylvester Stallone in some manner or the other. Once you run the preceding query, each row of the result shows Sylvester Stallone, a movie he is connected to, another artist connected to the same movie, and the type of each relationship. Also, see the usage of as and type. as is similar to the SQL construct and is used to give a meaningful name to a column in the results, and type is a function that returns the type of a relationship.
  • How do we get the path and number of hops between two nodes?
  • Answer: MATCH p = (:Movie{Title:"The Karate Kid"})-[:DIRECTED*0..4]-(:Movie{Title:"Rocky V"}) return p;
  • Explanation: A path is a sequence of nodes and relationships that connects two nodes. In the preceding statement, we are trying to find all paths between two nodes that are between 0 (minimum) and 4 (maximum) hops away from each other and are connected only through the relationship DIRECTED. You can also find the paths originating from a particular node by rewriting your query as MATCH p = (:Movie{Title:"The Karate Kid"})-[:DIRECTED*0..4]-() return p;.
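To report the number of hops as well as the path itself, the length function can be applied to the path variable. The sketch below extends the query above, and also shows the shortestPath function for cases where only the shortest connection is of interest; the Hops alias is illustrative.

```cypher
// Return each matching path together with its number of hops.
MATCH p = (:Movie {Title: 'The Karate Kid'})-[:DIRECTED*0..4]-(:Movie {Title: 'Rocky V'})
RETURN p, length(p) AS Hops;

// Return only the length of the shortest DIRECTED path, up to four hops.
MATCH p = shortestPath((:Movie {Title: 'The Karate Kid'})-[:DIRECTED*..4]-(:Movie {Title: 'Rocky V'}))
RETURN length(p) AS Hops;
```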

Integration of the BI tool – QlikView

In this section, we will talk about integrating the BI tool QlikView with Neo4j. QlikView is available only on the Windows platform, so this section is only applicable for Windows users.

Neo4j, as an open source database, exposes its core APIs so that developers can write plugins and extend its intrinsic capabilities.

Neo4j JDBC is one such plugin that enables the integration of Neo4j with various BI / visualization and ETL tools such as QlikView, Jaspersoft, Talend, Hive, HBase, and many more.

Let's perform the following steps for integrating Neo4j with QlikView on Windows:

  1. Download, install, and configure the following required software:
    1. Download the Neo4j JDBC driver from https://github.com/neo4j-contrib/neo4j-jdbc as source code. You can either compile the source into a JAR file yourself or download the precompiled JAR directly from http://dist.neo4j.org/neo4j-jdbc/neo4j-jdbc-2.0.1-SNAPSHOT-jar-with-dependencies.jar.
    2. Depending upon your Windows platform (32 bit or 64 bit), download the QlikView Desktop Personal Edition. In this article, we will be using QlikView Desktop Personal Edition 11.20.12577.0 SR8.
    3. Install QlikView and follow the instructions as they appear on your screen. After installation, you will see the QlikView icon in your Windows start menu.
  2. QlikView leverages QlikView JDBC Connector for integration with JDBC data sources, so our next step would be to install QlikView JDBC Connector. Let's perform the following steps to install and configure the QlikView JDBC Connector:
    1. Download QlikView JDBC Connector either from http://www.tiq-solutions.de/display/enghome/ENJDBC or from https://s3-eu-west-1.amazonaws.com/tiq-solutions/JDBCConnector/JDBCConnector_Setup.zip.
    2. Open the downloaded JDBCConnector_Setup.zip file and install the provided connector.
    3. Once the installation is complete, open JDBC Connector from your Windows Start menu and click on Active for a 30-day trial (if you haven't already done so during installation).
    4. Create a new profile of the name Qlikview-Neo4j and make it your default profile.
    5. Open the JVM VM Options tab and provide the location of jvm.dll, which can be located at <$JAVA_HOME>/jre/bin/client/jvm.dll.
    6. Click on Open Log-Folder to check the logs related to the Database connections.

      You can configure the Logging Level and also define the JVM runtime options such as -Xmx and -Xms in the textbox provided for Option in the preceding screenshot.

    7. Browse through the JDBC Driver tab, click on Add Library, and provide the location of your <$NEO4J-JDBC driver>.jar, and add the dependent JAR files.

      Instead of adding individual libraries, we can also add a folder containing the same list of libraries by clicking on the Add Folder option.

      We can also use non JDBC-4 compliant drivers by mentioning the name of the driver class in the Advanced Settings tab. There is no need to do that, however, if you are setting up a configuration profile that uses a JDBC-4 compliant driver.

    8. Open the License Registration tab and request Online Trial license, which will be valid for 30 days. Assuming that you are connected to the Internet, the trial license will be applied immediately.
    9. Save your settings and close QlikView JDBC Connector configuration.
  3. Open <$NEO4J_HOME>\bin\neo4j-shell and execute the following set of Cypher statements one by one to create sample data in your Neo4j database; in the subsequent steps, we will visualize this data in QlikView:
    CREATE (movies1:Movies {Title : 'Rocky', Year : '1976'});
    CREATE (movies2:Movies {Title : 'Rocky II', Year : '1979'});
    CREATE (movies3:Movies {Title : 'Rocky III', Year : '1982'});
    CREATE (movies4:Movies {Title : 'Rocky IV', Year : '1985'});
    CREATE (movies5:Movies {Title : 'Rocky V', Year : '1990'});
    CREATE (movies6:Movies {Title : 'The Expendables', Year : '2010'});
    CREATE (movies7:Movies {Title : 'The Expendables II', Year : '2012'});
    CREATE (movies8:Movies {Title : 'The Karate Kid', Year : '1984'});
    CREATE (movies9:Movies {Title : 'The Karate Kid II', Year : '1986'});
  4. Open the QlikView Desktop Personal Edition and create a new view by navigating to File | New. The Getting Started wizard may appear as we are creating a new view. Close this wizard.
  5. Navigate to File | Edit Script and change your database to JDBCConnector.dll (32).
  6. In the same window, click on Connect and enter "jdbc:neo4j://localhost:7474/" in the "url" box.
  7. Leave the username and password as empty and click on OK. You will see that a CUSTOM CONNECT TO statement is added in the box provided.
  8. Next, insert the Cypher statements that load the movie data into the provided window, just below the CUSTOM CONNECT TO statement.
  9. Save the script and close the Edit Script window.
  10. Now, on your QlikView sheet, execute the script by pressing Ctrl + R on your keyboard.
  11. Next, add a new Table Object on your QlikView sheet, select "MovieTitle" from the provided fields, and click on OK.

And we are done!!!!

You will see the data appearing in the listbox in the newly created Table Object. The data is fetched from the Neo4j database and QlikView is used to render this data.
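For reference, a Cypher statement along the following lines would produce the MovieTitle field selected above. It is a sketch under the assumption that the field is simply the Title property of the Movies nodes created earlier in this section; how the statement is embedded in the QlikView load script (for example, any prefix the connector expects) depends on the connector and is not shown here.

```cypher
// Return the title of every Movies node under the column name MovieTitle,
// which QlikView then exposes as a field.
MATCH (n:Movies)
RETURN n.Title AS MovieTitle;
```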

The same process is used for connecting to other JDBC-compliant BI / visualization / ETL tools such as Jasper, Talend, Hive, HBase, and so on. We just need to define the appropriate JDBC Type-4 drivers in JDBC Connector.

We can also use ODBC-JDBC Bridge provided by EasySoft at http://www.easysoft.com/products/data_access/odbc_jdbc_gateway/index.html. EasySoft provides the ODBC-JDBC Gateway, which facilitates ODBC access from applications such as MS Access, MS Excel, Delphi, and C++ to Java databases. It is a fully functional ODBC 3.5 driver that allows you to access any JDBC data source from any ODBC-compatible application.

Summary

In this article, you learned the basic concepts of data modeling in Neo4j and read-only Cypher queries, and we walked you through the process of BI integration with Neo4j.
