Importing Data into Neo4j to Build a Knowledge Graph

As discussed in the previous chapter, you do not have to fetch a graph dataset specifically to work with a graph database. Many of the datasets we are used to working with as data scientists contain information about the relationships between entities, which can be used to model the dataset as a graph. In this chapter, we will discuss one such example: a Netflix catalog dataset. It contains the movies and series available on the streaming platform, along with some metadata, such as the director and cast.

First, we are going to study this dataset, which is saved as a CSV file, and learn how to import CSV data into Neo4j. In a second exercise, we will use the same dataset stored as a JSON file; this time, we will have to use the Awesome Procedures on Cypher (APOC) Neo4j plugin to parse the data and import it into Neo4j.

Finally, we will learn about the existing public knowledge graphs, one of the biggest ones being Wikidata. We will...

Technical requirements

To be able to reproduce the examples provided in this chapter, you’ll need the following tools:

  • Neo4j installed on your computer (see the installation instructions in the previous chapter).
  • The Neo4j APOC plugin (the installation instructions will be provided later in this chapter, in the Introducing the APOC library to deal with JSON data section).
  • Python and Jupyter Notebook installed. We are not going to cover their installation in this book; however, the instructions are detailed in the code repository associated with this book (see the last bullet if you need such details).
  • An internet connection to download the plugins and the datasets. This will also be required for you to use the public API in the last section of this chapter (Enriching our graph with Wikidata information).
  • Any code listed in this book will be available in the associated GitHub repository, https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j...

Importing CSV data into Neo4j with Cypher

The comma-separated values (CSV) file format is the one most widely used to share data among data scientists. According to a Kaggle dataset that catalogs Kaggle datasets (https://www.kaggle.com/datasets/morriswongch/kaggle-datasets), this format accounts for more than 57% of all datasets in that repository, while JSON files account for less than 10%. It is popular for the following reasons:

  • Its resemblance to the tabular data storage format used by relational databases
  • Its closeness to the machine learning world of vectors and matrices
  • Its readability – you usually just have to read the column names to understand what the data is about (of course, a more detailed description is required to understand how the data was collected, the units of physical quantities, and so on) and there are no hidden fields (compared to JSON, where a key may only appear from the 1,000th record onward, which is hard to know without a proper description or advanced data...
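
The import itself relies on Cypher's LOAD CSV clause. As a minimal sketch only – the file name netflix_titles.csv and the show_id, title, and director columns are assumptions for illustration, and the chapter's actual import is more complete – a first load could look like this:

LOAD CSV WITH HEADERS FROM 'file:///netflix_titles.csv' AS row
MERGE (m:Movie {id: row.show_id})
SET m.title = row.title
// only create the director node and relationship when the field is present
FOREACH (_ IN CASE WHEN row.director IS NOT NULL THEN [1] ELSE [] END |
    MERGE (d:Person {name: row.director})
    MERGE (d)-[:DIRECTED]->(m)
)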

Introducing the APOC library to deal with JSON data

The JavaScript Object Notation (JSON) file format is another data format you have probably used in your data science work. It is used by NoSQL document databases (or an equivalent of it, such as Binary JSON (BSON) in MongoDB). It is also one of the most widely used formats for data serialization and, hence, for sharing data via web interfaces (APIs).

In this section, we will learn how to import JSON data into Neo4j. This format is not supported by Cypher directly, so we will have to rely on the APOC library to load such data. First, let’s have a look at the dataset we are going to use in this section.

Browsing the dataset

The file we are going to use contains the same data we used in the previous section but in a different format. Here is an example record from the JSON file:

{'cast': [{'name': 'Billy Magnussen'},
             ...
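
Once APOC is installed, a record like this one can be loaded with the apoc.load.json procedure. The following is a minimal sketch only – the file name netflix.json, the title field, and the Movie and Person labels are assumptions for illustration:

// loading local files with APOC may require apoc.import.file.enabled=true
CALL apoc.load.json('file:///netflix.json') YIELD value
MERGE (m:Movie {title: value.title})
WITH m, value
UNWIND value.cast AS member
MERGE (p:Person {name: member.name})
MERGE (p)-[:ACTED_IN]->(m)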

Discovering the Wikidata public knowledge graph

Wikidata is a publicly available knowledge graph. It stores data in the RDF format and, like many RDF data stores, uses SPARQL as its query language. Even though SPARQL is not the main topic of this book, we will see a couple of examples by the end of this chapter so that you can perform basic queries.

Wikidata data can be accessed via the following methods:

You are highly encouraged to navigate through Wikidata using your browser to understand the data format. You can, for instance, start from here:

Enriching our graph with Wikidata information

In this section, we are going to use the preceding SPARQL query and the Wikidata query API to retrieve information about each person in our Neo4j graph and add their country of citizenship.

Loading data into Neo4j for one person

Using the previous query, we are going to query the Wikidata SPARQL endpoint with APOC and save the result in Neo4j (a sketch of the resulting call is shown after these steps):

  1. Save the query as a parameter in Neo4j Browser:
    :param query=>apoc.text.urlencode("SELECT ?personLabel ?countryLabel WHERE {?person rdfs:label 'George Clooney'@en ; wdt:P27 ?country . SERVICE wikibase:label {bd:serviceParam wikibase:language 'en' .}}")
  2. Make sure you encode the query since it's going to be used in the query string of an HTTP GET request. You can see what the encoded query looks like by simply running RETURN $query, which prints the following:
    "SELECT+%3FpersonLabel+%3FcountryLabel+WHERE+%7B%3Fperson+rdfs%3Alabel...

Dealing with spatial data in Neo4j

Neo4j has a built-in type for dealing with spatial data, but only for points so far – lines and polygons are not (yet) supported.

Wikidata contains coordinates information for many entities. For instance, each country has a location. We can extract it using the following query:

SELECT ?country ?countryLabel ?lat ?lon
WHERE {
  ?country rdfs:label "India"@en ;
           wdt:P31 wd:Q6256 .
  ?country p:P625 ?coordinate .
  ?coordinate psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLongitude ?lon .
  ?coordinate_node wikibase:geoLatitude ?lat .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
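
Once the latitude and longitude have been retrieved, they can be stored on a node using Neo4j's point type. This is only a sketch under assumed names – the Country label, the name property, and the $lat and $lon parameters are illustrative:

// store the Wikidata coordinates as a WGS-84 point property
MERGE (c:Country {name: 'India'})
SET c.latLong = point({latitude: toFloat($lat), longitude: toFloat($lon)})
RETURN c.latLong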

Going further

Explanations of the new parts of this query are available here: https...

Importing data in the cloud

To import data into Neo4j Aura, the cloud-hosted Neo4j database, we can use the aforementioned method of reading files from an accessible URL. However, Neo4j also provides a frontend application, which deals with CSV files only.

Starting from the Neo4j Aura console, as illustrated in Figure 1.11 in Chapter 1, Introducing and Installing Neo4j, you can click on the Import button. This will open the Data Importer login window, as shown in the following screenshot:

Figure 2.10 – Logging into Neo4j Data Importer

After entering your credentials, you will be redirected to the application. The UI is made up of three parts:

  • The Files manager, on the left: The area where you can drag and drop the files to be imported
  • The graph view (middle panel): This is where you can draw your graph schema, including nodes and relationships
  • Mapping details: In this section, you can define node and relationship properties, and you...

Summary

Importing existing data into a brand-new database is always a concern, and that is what we covered in this chapter. From a flat (non-graph) file, you can identify node labels and the relationship types between them, transforming a flat dataset into a real graph. Whether your data is stored as CSV or JSON, on your local disk or a remote server, or behind an API endpoint, you can now load this data into Neo4j and start exploring your graph. You also learned about the Neo4j Data Importer tool, which is used to import data stored as CSV files into a cloud-hosted Neo4j database (Aura).

You also learned about public knowledge graphs, such as Wikidata, which can be used to extend your knowledge by importing more data about a specific topic.

Finally, you learned how to import your data into the cloud thanks to the Neo4j Data Importer application.

Being able to create a graph dataset is only the beginning, though. Like any other kind of dataset, graph datasets vary greatly from one to another. While...

Further reading

The following resources might help you gain a better understanding of the topics covered in this chapter:

  • Graph data modeling is covered in The Practitioner’s Guide to Graph Data, by D. Gosnell and M. Broecheler (O’Reilly)
  • The SPARQL language and RDF data modeling are described in detail in Semantic Web for the Working Ontologist, by D. Allemang and J. Hendler (Morgan Kaufmann)

Exercises

To make sure you understand the topics covered in this chapter before moving on to the next one, you are encouraged to think about the following:

  1. What is the advantage of a MERGE statement over CREATE?
  2. Can raw Cypher parse JSON data? What tool should you use for that?
  3. Practice! Using the Netflix dataset, set the movie’s genres contained in the listed_in column in the CSV dataset (assume Movies has already been imported).
  4. Practice! Using the Netflix JSON dataset, write a Cypher query to import actors (assume Movies has already been imported).
  5. Knowing that a given user – let’s call her Alice – watched the movie named Confessions of an Invisible Girl, what other Netflix content can we recommend to Alice?
  6. Practice! Refine the SPARQL query we’ve built to make sure the person is an actor or movie director.

Help: You can use Robert Cullen as an example.

The answers are provided at the end of this book.

...