Using Node.js and Hadoop to store distributed data

Harri Siirak

September 25th, 2015

Hadoop is a well-known open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. It's designed with a fundamental assumption that hardware failures can (and will) happen and thus should be automatically handled in software by the framework.

Under the hood it's using HDFS (Hadoop Distributed File System) for the data storage. HDFS can store large files across multiple machines and it achieves reliability by replicating the data across multiple hosts (default replication factor is 3 and can be configured to be higher when needed). Although it's designed for mostly immutable files and may not be suitable for systems requiring concurrent write-operations. Its target usage is not only restricted to MapReduce jobs, but it also can be used for cost effective and reliable data storage.

In the following examples, I am going to give you an overview of how to establish connections to HDFS storage (namenode) and how to perform basic operations on the data. As you can probably guess, I'm using Node.js to build these examples. Node.js is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices. So it's really ideal for what I want to show you next.

Two popular libraries for acccessing HDFS in Node.js are node-hdfs and webhdfs. The first one uses Hadoop's native libhdfs library and protocol to communicate with Hadoop namenode, albeit it seems to be not maintained anymore and doesn't support Stream API. Another one is using WebHDFS, which defines a public HTTP REST API, directly built into Hadoop's core (namenodes and datanodes both) and which permits clients to access Hadoop from multiple languages without installing Hadoop, and supports all HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.

More details about WebHDFS REST API and about its implementation details and response codes/types can be found from here.

At this point I'm assuming that you have Hadoop cluster up and running. There are plenty of good tutorials out there showing how to setup and run Hadoop cluster (single and multi node).

Installing and using the webhdfs library

webhdfs implements most of the REST API calls, albeit it's not yet supporting Hadoop delegation tokens. It's also Stream API compatible what makes its usage pretty straightforward and easy. Detailed examples and use cases for another supported calls can be found from here.

Install webhdfs from npm:

npm install wehbhdfs

Create a new script named webhdfs-client.js:

// Include webhdfs module
var WebHDFS = require('webhdfs');

// Create a new
var hdfs = WebHDFS.createClient({
  user: 'hduser', // Hadoop user
  host: 'localhost', // Namenode host
  port: 50070 // Namenode port
});

module.exports = hdfs;

Here we initialized new webhdfs client with options, including namenode's host and port where we are connecting to. Let's proceed with a more detailed example.

Storing file data in HDFS

Create a new script named webhdfs-write-test.js and add the code below.

// Include created client
var hdfs = require('./webhdfs-client');

// Include fs module for local file system operations
var fs = require('fs');

// Initialize readable stream from local file
// Change this to real path in your file system
var localFileStream = fs.createReadStream('/path/to/local/file');

// Initialize writable stream to HDFS target
var remoteFileStream = hdfs.createWriteStream('/path/to/remote/file');

// Pipe data to HDFS
localFileStream.pipe(remoteFileStream);

// Handle errors
remoteFileStream.on('error', function onError (err) {
  // Do something with the error
});

// Handle finish event
remoteFileStream.on('finish', function onFinish () {
  // Upload is done
});

Basically what we are doing here is that we're initializing readable file stream from a local filesystem and piping its contents seamlessly into remote HDFS target. Optionally webhdfs exposes error and finish.

Reading file data from HDFS

Let's retrieve the data what we just stored in HDFS storage. Create a new script named webhdfs-read-test.js and add code below.

var hdfs = require('./webhdfs-client');
var fs = require('fs');

// Initialize readable stream from HDFS source
var remoteFileStream = hdfs.createReadStream('/path/to/remote/file');

// Variable for storing data
var data = new Buffer();

remoteFileStream.on('error', function onError (err) {
  // Do something with the error
});

remoteFileStream.on('data', function onChunk (chunk) {
  // Concat received data chunk
  data = Buffer.concat([ data, chunk ]);
});

remoteFileStream.on('finish', function onFinish () {
  // Upload is done
  // Print received data
  console.log(data.toString());
});

What's next?

Now when we have data in Hadoop cluster, we can start processing it by spawning some MapReduce jobs, and when it's processed we can retrieve the output data.

In the second part of this article, I'm going to give you an overview of how Node.js can be used as part of MapReduce jobs.

About the author

Harri is a senior Node.js/Javascript developer among a talented team of full-stack developers who specialize in building scalable and secure Node.js based solutions. He can be found on Github at harrisiirak.