Importing Data into Neo4j to Build a Knowledge Graph

As discussed in the previous chapter, you do not have to fetch a graph dataset specifically to work with a graph database. Many of the datasets we are used to working with as data scientists contain information about the relationships between entities, which can be used to model the dataset as a graph. In this chapter, we will discuss one such example: a Netflix catalog dataset. It contains the movies and series available on the streaming platform, along with some metadata, such as the director and cast.

First, we are going to study this dataset, which is saved as a CSV file, and learn how to import CSV data into Neo4j. In a second exercise, we will use the same dataset stored as a JSON file; this time, we will have to use the Awesome Procedures on Cypher (APOC) Neo4j plugin to parse the data and import it into Neo4j.

Finally, we will learn about the existing public knowledge graphs, one of the biggest ones being Wikidata. We will...

Technical requirements

To be able to reproduce the examples provided in this chapter, you’ll need the following tools:

  • Neo4j installed on your computer (see the installation instructions in the previous chapter).
  • The Neo4j APOC plugin (the installation instructions will be provided later in this chapter, in the Introducing the APOC library to deal with JSON data section).
  • Python and Jupyter Notebook installed. We are not going to cover their installation in this book; however, the instructions are detailed in the code repository associated with this book (see the last bullet if you need such details).
  • An internet connection to download the plugins and the datasets. This will also be required for you to use the public API in the last section of this chapter (Enriching our graph with Wikidata information).
  • Any code listed in this book will be available in the associated GitHub repository, https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j...

Importing CSV data into Neo4j with Cypher

The comma-separated values (CSV) file format is the one most widely used to share data among data scientists. According to a Kaggle dataset that catalogs Kaggle datasets (https://www.kaggle.com/datasets/morriswongch/kaggle-datasets), this format accounts for more than 57% of all datasets in that repository, while JSON files account for less than 10%. It is popular for the following reasons:

  • Its resemblance to the tabular data storage format used by relational databases
  • Its closeness to the machine learning world of vectors and matrices
  • Its readability – you usually just have to read the column names to understand what the data is about (of course, a more detailed description is required to understand how the data was collected, the units of physical quantities, and so on) and there are no hidden fields (compared to JSON, where a key may only appear from the 1,000th record onward, which is hard to know without a proper description or advanced data...
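
The import itself relies on Cypher's LOAD CSV clause. As a minimal sketch only – the file name netflix_titles.csv and the show_id, title, and director columns are assumptions for illustration, and the chapter's actual import is more complete – a first load could look like this:

LOAD CSV WITH HEADERS FROM 'file:///netflix_titles.csv' AS row
MERGE (m:Movie {id: row.show_id})
SET m.title = row.title
// only create the director node and relationship when the field is present
FOREACH (_ IN CASE WHEN row.director IS NOT NULL THEN [1] ELSE [] END |
    MERGE (d:Person {name: row.director})
    MERGE (d)-[:DIRECTED]->(m)
)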

Introducing the APOC library to deal with JSON data

The JavaScript Object Notation (JSON) file format is another data format you have probably used in your data science work. It is used by NoSQL document databases (or an equivalent of it, such as Binary JSON (BSON) in MongoDB). It is also one of the most widely used formats for data serialization and, hence, for sharing data via web interfaces (APIs).

In this section, we will learn how to import JSON data into Neo4j. This format is not supported by Cypher directly, so we will have to rely on the APOC library to load such data. First, let’s have a look at the dataset we are going to use in this section.

Browsing the dataset

The file we are going to use contains the same data we used in the previous section but in a different format. Here is an example record from the JSON file:

{'cast': [{'name': 'Billy Magnussen'},
             ...
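
Once APOC is installed, a record like this one can be loaded with the apoc.load.json procedure. The following is a minimal sketch only – the file name netflix.json, the title field, and the Movie and Person labels are assumptions for illustration:

// loading local files with APOC may require apoc.import.file.enabled=true
CALL apoc.load.json('file:///netflix.json') YIELD value
MERGE (m:Movie {title: value.title})
WITH m, value
UNWIND value.cast AS member
MERGE (p:Person {name: member.name})
MERGE (p)-[:ACTED_IN]->(m)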

Discovering the Wikidata public knowledge graph

Wikidata is a publicly available knowledge graph. It stores data in the RDF format and, like many RDF data stores, uses SPARQL as its query language. Even though SPARQL is not the main topic of this book, we will see a couple of examples by the end of this chapter so that you can perform basic queries.

Wikidata data can be accessed via the following methods:

You are highly encouraged to navigate through Wikidata using your browser to understand the data format. You can, for instance, start from here:

Enriching our graph with Wikidata information

In this section, we are going to use the preceding SPARQL query and the Wikidata query API to retrieve information about each person in our Neo4j graph and add their country of citizenship.

Loading data into Neo4j for one person

Using the previous query, we are going to query the Wikidata SPARQL endpoint with APOC and save the result in Neo4j (a sketch of the resulting call is shown after these steps):

  1. Save the query as a parameter in Neo4j Browser:
    :param query=>apoc.text.urlencode("SELECT ?personLabel ?countryLabel WHERE {?person rdfs:label 'George Clooney'@en ; wdt:P27 ?country . SERVICE wikibase:label {bd:serviceParam wikibase:language 'en' .}}")
  2. Make sure you encode the query since it's going to be used in the query string of an HTTP GET request. You can see what the encoded query looks like by simply running RETURN $query, which prints the following:
    "SELECT+%3FpersonLabel+%3FcountryLabel+WHERE+%7B%3Fperson+rdfs%3Alabel...

Dealing with spatial data in Neo4j

Neo4j has a built-in type for dealing with spatial data, but only for points so far – lines and polygons are not (yet) supported.

Wikidata contains coordinates information for many entities. For instance, each country has a location. We can extract it using the following query:

SELECT ?country ?countryLabel ?lat ?lon
WHERE {
  ?country rdfs:label "India"@en ;
           wdt:P31 wd:Q6256 .
  ?country p:P625 ?coordinate .
  ?coordinate psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLongitude ?lon .
  ?coordinate_node wikibase:geoLatitude ?lat .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
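
Once the latitude and longitude have been retrieved, they can be stored on a node using Neo4j's point type. This is only a sketch under assumed names – the Country label, the name property, and the $lat and $lon parameters are illustrative:

// store the Wikidata coordinates as a WGS-84 point property
MERGE (c:Country {name: 'India'})
SET c.latLong = point({latitude: toFloat($lat), longitude: toFloat($lon)})
RETURN c.latLong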

Going further

Explanations of the new parts of this query are available here: https...

Importing data in the cloud

To import data into Neo4j Aura, the cloud-hosted Neo4j database, we can use the aforementioned method of reading files from an accessible URL. However, Neo4j also provides a frontend application, which deals with CSV files only.

Starting from the Neo4j Aura console, as illustrated in Figure 1.11 in Chapter 1, Introducing and Installing Neo4j, you can click on the Import button. This will open the Data Importer login window, as shown in the following screenshot:

Figure 2.10 – Logging into Neo4j Data Importer

After entering your credentials, you will be redirected to the application. The UI is made up of three parts:

  • The Files manager, on the left: The area where you can drag and drop the files to be imported
  • The graph view (middle panel): This is where you can draw your graph schema, including nodes and relationships
  • Mapping details: In this section, you can define node and relationship properties, and you...

Summary

Importing existing data into a brand-new database is always a concern, and that is what we covered in this chapter. From a flat (non-graph) file, you can identify node labels and the relationship types between them, transforming a flat dataset into a real graph. Whether your data is stored as CSV or JSON, on your local disk or a remote server, or behind an API endpoint, you can now load this data into Neo4j and start exploring your graph. You also learned about the Neo4j Data Importer tool, which is used to import data stored as CSV files into a cloud-hosted Neo4j database (Aura).

You also learned about public knowledge graphs, such as Wikidata, which can be used to extend your knowledge by importing more data about a specific topic.

Finally, you learned how to import your data into the cloud thanks to the Neo4j Data Importer application.

Being able to create a graph dataset is only the beginning, though. Like any other kind of dataset, graph datasets vary greatly from one to another. While...

Further reading

The following resources might help you gain a better understanding of the topics covered in this chapter:

  • Graph data modeling is covered in The Practitioner’s Guide to Graph Data, by D. Gosnell and M. Broecheler (O’Reilly)
  • The SPARQL language and RDF data modeling are described in detail in Semantic Web for the Working Ontologist, by D. Allemang and J. Hendler (Morgan Kaufmann)

Exercises

To make sure you understand the topics covered in this chapter before moving on to the next one, you are encouraged to think about the following:

  1. What is the advantage of a MERGE statement over CREATE?
  2. Can raw Cypher parse JSON data? What tool should you use for that?
  3. Practice! Using the Netflix dataset, set the movie’s genres contained in the listed_in column in the CSV dataset (assume Movies has already been imported).
  4. Practice! Using the Netflix JSON dataset, write a Cypher query to import actors (assume Movies has already been imported).
  5. Knowing that a given user – let’s call her Alice – watched the movie named Confessions of an Invisible Girl, what other Netflix content can we recommend to Alice?
  6. Practice! Refine the SPARQL query we’ve built to make sure the person is an actor or movie director.

Help: You can use Robert Cullen as an example.

The answers are provided at the end of this book.

...