Hierarchical Data Format

Janu Verma

January 10th, 2017

Hierarchical Data Format (HDF) is an open-source file format for storing huge amounts of numerical data. It’s typically used in research applications to distribute and access very large datasets efficiently, without centralizing everything through a database. The HDF5 format is well suited for fast serialization of, and random access to, fairly large datasets in a local/development environment. The Million Song Dataset, for example, is distributed this way.

HDF was developed by the National Center for Supercomputing Applications (NCSA).

Think of HDF as a file system within a file. It lets you organize data hierarchically and manage large amounts of it very efficiently. Every object in an HDF5 file has a name, and the objects are arranged in a POSIX-style hierarchy with / separators, e.g. /path/to/resource

HDF5 has two kinds of objects:

  • Groups
  • Datasets

Groups are folder-like objects which contain datasets and other groups. Datasets contain the actual data in the form of arrays.
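To make this concrete, here is a minimal sketch of creating a small HDF5 file with one group and one dataset using PyTables; the file and node names are invented for illustration:

```python
import numpy as np
import tables

# Create a new HDF5 file (the filename is just an example).
h5file = tables.open_file("example.h5", mode="w", title="Example file")

# A group is a folder-like container; create one under the root '/'.
group1 = h5file.create_group("/", "group1", "First group")

# A dataset holds the actual data as an array.
h5file.create_array(group1, "someList", np.arange(10), "A small array")

# Nodes are addressed by POSIX-style paths.
print(h5file.get_node("/group1/someList"))

h5file.close()
```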

HDF in Python

For my work, I had to study the data stored in HDF5 files. These files are not human readable, and so I had to write some code in Python to access the data. Luckily, there is the PyTables package, which has a framework to parse HDF5 files.

The PyTables package does much more than that. PyTables can be used in any scenario where you need to save and retrieve large amounts of multidimensional data and provide metadata for it. PyTables can also be employed if you need to structure some portions of your cluttered RDBMS. For example, if you have very large tables in your existing relational database, then you can move those tables to PyTables so as to reduce the burden of your existing database while efficiently keeping those huge tables on disk.
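As a rough sketch of that table-oriented use case (the schema, column names, and file name below are all invented), PyTables lets you describe a table, append rows, and query it on disk without loading everything into memory:

```python
import tables

# Describe the table schema (the columns here are made up).
class Reading(tables.IsDescription):
    sensor = tables.StringCol(16)   # 16-byte string column
    value = tables.Float64Col()     # double-precision float column

h5file = tables.open_file("readings.h5", mode="w")
table = h5file.create_table("/", "readings", Reading, "Sensor readings")

# Append a few rows through the table's row accessor.
row = table.row
for name, val in [(b"a", 1.0), (b"b", 2.5)]:
    row["sensor"] = name
    row["value"] = val
    row.append()
table.flush()

# Query on disk with an in-kernel condition string.
big = [r["value"] for r in table.where("value > 2")]
print(big)

h5file.close()
```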

Reading an HDF5 file in Python:

from tables import open_file
h5file = open_file("myHDF5file.h5", "r")

All the nodes in the file:

for node in h5file:
    print(node)

This will print all the nodes in the file, which is of limited use by itself, much like listing every file in a filesystem. The main advantage of a hierarchical format is that you can retrieve the data hierarchically. So the first step would be to look at all the groups (folders):

for group in h5file.walk_groups():
    print(group)

/ (RootGroup) ''
/group1 (Group)
/group2 (Group)

We have 3 groups in this file, the root, group1, and group2. Everything is either a direct or indirect child of the root, as in a tree. Think of the home folder on your computer. Now, we would want to look at the contents of the groups (which will be either subgroups or datasets):

print(h5file.root._v_children)
{'group1': /group1 (Group) ''
  children := ['group2' (Group), 'someList' (Array(40000,))],

 'list2': /list2 (Array(2500,)) ''
  atom := Int8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None, 

 'tags': /tags (Array(2, 19853)) ''
  atom := Int64Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None}

_v_children gives a dictionary of the children of a group, the root in the above example. Now we can see that three children hang from the root node: a group and two arrays. We can also read that group1 has two children of its own: a group and an array.
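Besides inspecting _v_children, PyTables also lets you address nodes directly, either by attribute access ("natural naming") or by a POSIX-style path via get_node. A self-contained sketch (the file and node names are invented):

```python
import tables

# Build a tiny file to demonstrate navigation.
h5file = tables.open_file("nav.h5", "w")
g1 = h5file.create_group("/", "group1")
h5file.create_group(g1, "group2")
h5file.create_array(g1, "someList", list(range(5)))

# Natural naming: child nodes are attributes of their parent group,
# so nested groups chain like attributes.
node = h5file.root.group1.group2
path = node._v_pathname

# get_node takes a POSIX-style path string to the same node.
same = h5file.get_node("/group1/group2")

# _v_children works on any group, not just the root.
children = sorted(h5file.root.group1._v_children)
print(path, children)

h5file.close()
```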

We saw earlier that h5file.walk_groups() is a way to iterate through all the groups of the HDF5 file; this can be used to loop over the groups:

for group in h5file.walk_groups():
    nodes = group._v_children
    namesOfNodes = nodes.keys()
    print(namesOfNodes)

This will print the names of the children for each group. One can do more interesting things using .walk_groups().

A very important procedure one can run on a group is:

for array in h5file.list_nodes(group, classname="Array"):
    array_name = array._v_name
    array_contents = array.read()
    print(array_contents)

This will print the contents of all the arrays that are children of the group. The supported values for classname are 'Group', 'Leaf', 'Table', and 'Array'. Recall that array.read() returns a NumPy array, so all the NumPy attributes and operations, like ndim, shape, etc., work on these objects.
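For example, here is a short self-contained sketch (file and node names invented) showing that what read() hands back is an ordinary NumPy ndarray:

```python
import numpy as np
import tables

# Write a small 2-D array to a fresh file.
h5file = tables.open_file("arrays.h5", "w")
h5file.create_array("/", "matrix", np.arange(12).reshape(3, 4))

# read() returns a regular in-memory NumPy ndarray,
# so the usual attributes and methods apply.
mat = h5file.get_node("/matrix").read()
print(mat.shape, mat.ndim, mat.sum())

h5file.close()
```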

With these operations, you can start exploring an HDF5 file. For more procedures and methods, check out the tutorials on PyTables.

Converting HDF to JSON

I wrote a class to convert the contents of an HDF5 file into a JSON object. The code can be found here. Feel free to use it and comment.

The motivation for this is two-fold:

  • The JSON format provides a very easy tool for data serialization, and it has always been my first choice for serialization/deserialization.
  • JSON is used as the storage schema in many NoSQL databases, e.g. Membase and MongoDB. We can also store JSON in relational databases; in fact, there are claims that PostgreSQL 9.4 is faster than MongoDB at storing JSON documents.

We know that HDF5 files are not human-readable. This class renders them into human-readable data objects consisting of key–value pairs. It creates a JSON file with the same name as the input HDF5 file, with a .json extension. When decoded, the file contains a nested Python dictionary:

HDF5toJSON.py hdf2file.h5

json_data = converter(h5file)
contents = json_data.jsonOutput() 
> 'hdf2file.json'

Recall that every object in an HDF5 file has a name and is arranged in a POSIX-style hierarchy with / separators, e.g. /group1/group2/dataArray. I wanted to maintain the same hierarchy in the JSON file. So, if you want to access the contents of dataArray in the JSON file:

import json

json_file = open('createdJSONfile.json')
for line in json_file:
    record = json.loads(line)
    print(record['/']['group1']['group2']['dataArray'])

The main key is always going to be the root key, '/'.

This class also has methods to access the contents of a group directly without following the hierarchy. If you want to get a list of all the groups in the HDF5 file:

json_data = converter(h5file)
groups = json_data.Groups()
print groups

> ['/', 'group1', 'group2']

One can also directly look at the contents of group1:

json_data = converter(h5file)
contents = json_data.groupContents('group1')
print contents

> {'group2': {'dataArray': [12, 24, 36]}, 'array1': [1, 2, 4, 9]}

Or, if you are interested in the group objects hanging from group1:

json_data = converter(h5file)
groups = json_data.subgroups('group1')

> ['group2']

About the author

Janu Verma is a Researcher at the IBM T.J. Watson Research Center, New York. His research interests are in mathematics, machine learning, information visualization, computational biology, and healthcare analytics. He has held research positions at Cornell University, Kansas State University, the Tata Institute of Fundamental Research, the Indian Institute of Science, and the Indian Statistical Institute. He has written papers for IEEE Vis, KDD, the International Conference on Healthcare Informatics, Computer Graphics and Applications, Nature Genetics, IEEE Sensors Journal, etc. His current focus is on the development of visual analytics systems for prediction and understanding. He advises startups and companies on data science and machine learning in the Delhi-NCR area. Email to schedule a meeting.
