Integrating Accumulo into Various Cloud Platforms

by Guðmundur Jón Halldórsson | October 2013 | Open Source

In this article by Guðmundur Jón Halldórsson, the author of Apache Accumulo for Developers, we will learn how to integrate Accumulo into various cloud platforms, both in single-node and pseudo-distributed modes, and then expand it to a multi-node cluster.

The following examples show you how to create an Accumulo cluster on various cloud platforms. The steps needed to set up the cluster are similar across these platforms; the difference lies in the tools and scripts used to accomplish the task of creating the cluster.

These are the topics that will be covered in this article:

  • Amazon EC2
  • Google Cloud Platform
  • Rackspace
  • Windows Azure


Hadoop is supported by many cloud vendors, as the popularity of MapReduce has grown over the past few years. Accumulo is another story; even though its popularity is growing, cloud vendor support hasn't caught up.

Amazon EC2

Amazon has great support for Accumulo, Hadoop, and ZooKeeper. For Hadoop and ZooKeeper, there is a set of libraries called Apache Whirr. Apache Whirr supports Amazon EC2, Rackspace, and many more cloud providers, and uses low-level API libraries. For Accumulo, you have two options: one is to use the Amazon EMR command-line interface, and the other is to create a new virtual machine and then set it up.

Prerequisites for Amazon EC2

Prerequisites needed to complete the setup phase for Amazon EC2 are as follows:

Creating Amazon EC2 Hadoop and ZooKeeper cluster

The following steps are required to create the Amazon EC2 Hadoop and ZooKeeper cluster:

  1. Log in to https://console.aws.amazon.com.
  2. The management console for Amazon Services has a nice graphical overview of all the actions that you can do. In our case, we use the Amazon AWS Console to verify what we have done while setting up the cluster.
  3. From the drop-down menu under your name at the top-right corner, select Security Credentials.
  4. Under Access Keys, you need to create a new root key and download the file containing AWSAccessKeyId and AWSSecretKey.
  5. Normally, you would create an AWS Identity and Access Management (IAM) user with limited permissions, and give only that user the access to the cluster. But in this case, we are creating a demo cluster and will be destroying it after use.
  6. Create a new key by running the following command:
    • For Linux and Windows Cygwin:

      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr

    The RSA key is used later when configuring Whirr. It is not required to copy the key to the ~/.ssh/authorized_keys file, because the key is going to be used from its current location.

  7. Download Whirr and set it up using the following commands:

    cd /usr/local
    sudo wget http://apache.claz.org/whirr/stable/whirr-0.8.2.tar.gz
    sudo tar xzf whirr-0.8.2.tar.gz
    sudo mv whirr-0.8.2 whirr
    sudo chown -R hadoopuser:hadoopgroup whirr

    Download Whirr in the /usr/local folder, unpack it, and rename it to whirr. For Cygwin, don't run the last command in the script.

  8. Set up the credentials for Amazon EC2:
    • For Linux and Cygwin:

      sudo cp /usr/local/whirr/conf/credentials.sample /usr/local/whirr/conf/credentials
      sudo nano /usr/local/whirr/conf/credentials

    • Skip the sudo command in Cygwin. Elevated privileges in Windows are usually acquired by right-clicking on the icon and choosing Run as administrator.
    • Edit the /usr/local/whirr/conf/credentials file and change the following lines:

      PROVIDER=aws-ec2
      IDENTITY=<The value from the variable AWSAccessKeyId>
      CREDENTIAL=<The value from the variable AWSSecretKey>

    • By default, Whirr will look for the credentials file in the home directory; if it's not found there, it will look in /usr/local/whirr/conf. I prefer to use the /usr/local/whirr/conf directory to keep everything in one place.
  9. The first step in simplifying the creation of the cluster is to create a configuration file, which will be named cluster.properties for this example.
    • For Linux:

      sudo nano /usr/local/whirr/conf/cluster.properties

    • For Cygwin:

      nano /usr/local/whirr/conf/cluster.properties

      Add the following lines:

      whirr.cluster-name=demo-cluster
      whirr.instance-templates=1 zookeeper,1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
      whirr.provider=aws-ec2
      whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
      whirr.public-key-file=${sys:user.home}/.ssh/id_rsa_whirr.pub

    This file describes a single cluster with one ZooKeeper node, one Hadoop node running JobTracker and NameNode, and one Hadoop node running DataNode and TaskTracker.

  10. Create our cluster as described in the cluster.properties file:
    • For Linux, switch to the Hadoop user first:

      su - hadoopuser

    • For Linux and Windows Cygwin:

      cd /usr/local/whirr
      bin/whirr launch-cluster --config conf/cluster.properties

    If you get the error message java.io.FileNotFoundException: whirr.log (Permission denied), the current user does not have permission to access the whirr.log file.
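
    One way to resolve this, as a sketch that assumes the installation layout and the hadoopuser account from step 7 (and that whirr.log is created in the directory Whirr is run from), is either to give that user write access to the Whirr directory or to run Whirr from a directory the user can write to:

      # Option 1: make the Whirr installation writable by the user running Whirr
      sudo chown -R hadoopuser:hadoopgroup /usr/local/whirr

      # Option 2: run Whirr from a writable directory so whirr.log is created there
      cd ~
      /usr/local/whirr/bin/whirr launch-cluster --config /usr/local/whirr/conf/cluster.properties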

    After a few seconds, you will see that the script will start to print out the status message and information about what is going to be done, as shown in the following screenshot:

    The result from creating a cluster using Whirr is very detailed and important for troubleshooting and monitoring purposes, as shown in the following screenshot:

    The output from running the script gives very valuable information about the cluster created. Every instance has a role and an external and internal IP address. The ID of every node is in the form <region>/<unique id>.

  11. After creating the cluster, visit https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances to see your new cluster. If the cluster was created in another region, switch to the correct region at the top.

  12. Destroy our cluster as described in the cluster.properties file by running the following commands for Linux and Windows Cygwin:

    cd /usr/local/whirr
    bin/whirr destroy-cluster --config conf/cluster.properties

  13. The directory ~/.whirr/demo-cluster has been created as a direct result of the previous step, and contains information about the cluster just created and three files:
    • hadoop-proxy.sh: Run this script to create a proxy tunnel to be able to connect to the cluster using the SSH tunnel. Use this example to create a proxy auto-config (PAC) file: https://svn.apache.org/repos/asf/whirr/trunk/resources/hadoop-ec2-proxy.pac.
    • hadoop-site.xml: It contains information about the Hadoop cluster.
    • instances: It contains information about each node instance (location, instance, role(s), external IP address, and internal IP address).
  14. All nodes in the preceding example were created in the same security group that allows them to talk to each other.
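
To make use of the files described in step 13, a typical workflow is to start the proxy in one terminal and then point a local Hadoop client at the generated configuration in another. The following is a minimal sketch; the HDFS path listed and the use of Hadoop's generic -conf option are illustrative assumptions:

    # Terminal 1: start the SSH/SOCKS proxy to the cluster and keep it running
    sh ~/.whirr/demo-cluster/hadoop-proxy.sh

    # Terminal 2: query the remote cluster using the configuration Whirr generated
    hadoop fs -conf ~/.whirr/demo-cluster/hadoop-site.xml -ls /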

Setting up Accumulo

The easiest way to set up Accumulo on Amazon is to use the Amazon EMR CLI (command-line interface). There is a single ZooKeeper node up and running that should be used while setting up Accumulo.

  1. Browse to the Amazon S3 console at https://console.aws.amazon.com/s3/home?region=us-east-1#, and create a new bucket with a unique name. For this example, the name demo-accumulo will be used.
  2. To create an instance of Accumulo, we use the following command with the Amazon EMR CLI:

    For Linux and Windows:

    elastic-mapreduce --create --alive --name "Accumulo" \
      --bootstrap-action s3://elasticmapreduce/samples/accumulo/accumulo-install.sh \
      --args "<zookeeper ip address>,Demo-Database,DBPassword" \
      --bootstrap-name "install Accumulo" \
      --enable-debugging --log-uri s3://demo-accumulo/Accumulo-logs/ \
      --instance-type m1.large --instance-count 4 --key-pair <Key Pair Name>

    Locate the key pair name at https://console.aws.amazon.com/ec2/home?region=us-east-1#s=KeyPairs.
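
Once the command returns, it can be useful to confirm that the bootstrap action succeeded and to reach the master node. The following sketch uses the elastic-mapreduce Ruby CLI options as I recall them from that era, so verify them with elastic-mapreduce --help before relying on them:

    # List active job flows and note the job flow ID (j-XXXXXXXXXX)
    elastic-mapreduce --list --active

    # Open an SSH session to the master node of the job flow
    elastic-mapreduce --jobflow <job flow id> --ssh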


Google Cloud Platform

Accumulo's design is based on Google's BigTable design, published in 2006; therefore, you will see a lot of similarities between BigTable and Accumulo with respect to performing a search. Google doesn't have the same support for Hadoop as Amazon, but you can easily perform the same tasks. One of Google's trademarks is the simple user interface of the Google Cloud Console.

Prerequisites for Google Cloud Platform

The prerequisites for Google Cloud Platform are:

  • A valid user to access the Google Cloud Console. Remember billing is required to continue, as shown in the following screenshot:

  • Downloading and installing Python 2.7.x

Creating the project

Everything in Google Cloud Platform revolves around the project. The first task is to create a new project. We are going to name the project AccumuloProject with the Project ID accumulo-project, as shown in the following screenshot. Of course, a more descriptive name would be used in practice:

Installing the Google gcutil tool

Now, you need to install the Google gcutil tool from https://developers.google.com/compute/docs/gcutil/ in order to continue. The gcutil tool has a lot of useful commands, but we are only going to focus on the commands that we need to complete our tasks. In the following examples, gcutil is installed in the /usr/local/gcutil directory.

Configuring credentials

After setting up the gcutil tool, you need to run the following commands:

For Linux:

/usr/local/gcutil/gcutil auth --project=accumulo-project

For Windows (Cygwin):

python /usr/local/gcutil/gcutil.py auth --project=accumulo-project

On running this command, you will be prompted to open the website to get the verification code that you need to enter (or copy from the webpage).

Configuring the project

To simplify the usage of gcutil, we are using the flag --cache_flag_values. This will cause the file ~/.gcutil.flags to be created, and the default project ID will be stored in that file.

For Linux, use the following command:

/usr/local/gcutil/gcutil getproject --project=accumulo-project --cache_flag_values

For Windows Cygwin, use the following command:

python /usr/local/gcutil/gcutil.py getproject --project=accumulo-project --cache_flag_values

After running this command, you should get a report like the following:

Creating the firewall rules

We need to create firewall rules that permit incoming HTTP traffic on ports 50030, 50060, and 50070. The easiest way to accomplish that is through the Google Cloud Console.
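
If you prefer to script this step, gcutil can also create firewall rules. The rule name below is a placeholder and the --allowed syntax is a sketch based on gcutil's documentation of that period, so check gcutil help addfirewall before running it:

/usr/local/gcutil/gcutil addfirewall hadoop-web --project=accumulo-project \
  --description="Hadoop web UIs" \
  --allowed="tcp:50030,tcp:50060,tcp:50070"

For Windows (Cygwin), invoke it through Python as with the other gcutil commands (python /usr/local/gcutil/gcutil.py addfirewall ...).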

Creating the cluster

Google Cloud only supports Debian and CentOS Linux. In this example, the centos-6-v20130731 image is going to be used, but the list of available images changes on a regular basis, and you need to see what images are available before starting.

The boot disk is left unspecified in the commands that follow. You can create a new persistent boot disk and use it (preferred), or use a scratch disk (not recommended). Answer the following question with Y (yes) when asked during the setup process:

Do you want to use a persistent boot disk? [y/n]

Creating the cluster involves four actions:

  • Create and set up a Hadoop NameNode. A new instance is created, and then Hadoop is set up and started as a master.
  • Create and set up a Hadoop DataNode. A new instance is created, and then Hadoop is set up and started as a slave.
  • Create and set up a ZooKeeper node. A new instance is created, and then ZooKeeper is set up and started.
  • Create and set up an Accumulo node. A new instance is created, and then Accumulo is set up and started.

Hadoop

Create two nodes: NameNode and DataNode.

  • Create the Hadoop NameNode:
    • For Linux:

      /usr/local/gcutil/gcutil addinstance hadoop-namenode --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

    • For Windows (Cygwin):

      python /usr/local/gcutil/gcutil.py addinstance hadoop-namenode --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

  • Connect to the newly created Hadoop NameNode:
    • For Linux:

      /usr/local/gcutil/gcutil --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "hadoop-namenode"

    • For Windows:

      python /usr/local/gcutil/gcutil.py --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "hadoop-namenode"

  • Follow the guidelines to set up Hadoop (master).
  • Create the Hadoop DataNode:
    • For Linux:

      /usr/local/gcutil/gcutil addinstance hadoop-datanode --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

    • For Windows (Cygwin):

      python /usr/local/gcutil/gcutil.py addinstance hadoop-datanode --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

  • Connect to the newly created Hadoop DataNode:
    • For Linux:

      /usr/local/gcutil/gcutil --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "hadoop-datanode"

    • For Windows:

      python /usr/local/gcutil/gcutil.py --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "hadoop-datanode"

  • Follow the guidelines to set up Hadoop (slave).

ZooKeeper

Create a single node for ZooKeeper.

  • Create the ZooKeeper node:
    • For Linux:

      /usr/local/gcutil/gcutil addinstance hadoop-zookeeper --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

    • For Windows (Cygwin):

      python /usr/local/gcutil/gcutil.py addinstance hadoop-zookeeper --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

  • Connect to the newly created ZooKeeper node:
    • For Linux:

      /usr/local/gcutil/gcutil --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "hadoop-zookeeper"

    • For Windows:

      python /usr/local/gcutil/gcutil.py --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "hadoop-zookeeper"

  • Follow the guidelines to set up ZooKeeper.

Accumulo

Create a single node for Accumulo, acting as both master and tablet server.

  • Create the Accumulo node:
    • For Linux:

      /usr/local/gcutil/gcutil addinstance accumulo-node1 --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

    • For Windows (Cygwin):

      python /usr/local/gcutil/gcutil.py addinstance accumulo-node1 --machine_type=n1-standard-1 --image=centos-6-v20130731 --zone=europe-west1-b

  • Connect to the newly created Accumulo node:
    • For Linux:

      /usr/local/gcutil/gcutil --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "accumulo-node1"

    • For Windows:

      python /usr/local/gcutil/gcutil.py --service_version="v1beta15" --project="accumulo-project" ssh --zone="europe-west1-b" "accumulo-node1"

  • Follow the guidelines to set up Accumulo.

 

After the setup, you should see the following page in the Google Cloud Console, under All Instances:

Deleting the cluster

After we're done using the cluster, there is no reason to keep it around. By deleting the cluster, we are going to stop all the machines, remove all the data on scratch disks, and finally remove all the machines from the project. Scratch disk space is tied to the life of an instance; when the instance is terminated, all scratch disk data is lost. In real scenarios, store all data on persistent disks.
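
For reference, persistent disks can be created and attached with gcutil as well. The command names (adddisk and attachdisk) and their flags below are recalled from gcutil's documentation of that period, and the disk name is a placeholder, so treat this as a sketch and confirm with gcutil help:

# Create a 200 GB persistent disk in the same zone as the instances
/usr/local/gcutil/gcutil adddisk accumulo-data --size_gb=200 --zone=europe-west1-b

# Attach it to the Accumulo node; it then appears as a block device to format and mount
/usr/local/gcutil/gcutil attachdisk --disk=accumulo-data accumulo-node1 --zone=europe-west1-b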

While deleting an instance, you will be asked whether you want to delete the instance and whether you want to delete the persistent boot disk; answer both of those questions with Y:

  • Accumulo:
    • For Linux:

      /usr/local/gcutil/gcutil deleteinstance "accumulo-node1"
      --zone=europe-west1-b

    • For Windows Cygwin:

      python /usr/local/gcutil/gcutil.py deleteinstance "accumulo-node1"
      --zone=europe-west1-b

  • ZooKeeper:
    • For Linux:

      /usr/local/gcutil/gcutil deleteinstance "hadoop-zookeeper"
      --zone=europe-west1-b

    • For Windows Cygwin:

      python /usr/local/gcutil/gcutil.py deleteinstance "hadoop-zookeeper"
      --zone=europe-west1-b

  • Hadoop DataNode:
    • For Linux:

      /usr/local/gcutil/gcutil deleteinstance "hadoop-datanode"--zone
      =europe-west1-b

    • For Windows Cygwin:

      python /usr/local/gcutil/gcutil.py deleteinstance "hadoop-datanode"
      --zone=europe-west1-b

  • Hadoop NameNode:
    • For Linux:

      /usr/local/gcutil/gcutil deleteinstance "hadoop-namenode"--zone
      =europe-west1-b

    • For Windows Cygwin:

      python /usr/local/gcutil/gcutil deleteinstance "hadoop-namenode"
      --zone=europe-west1-b


Rackspace

Rackspace has great support for Accumulo, Hadoop, and ZooKeeper. For Hadoop and ZooKeeper, there is a set of libraries called Apache Whirr that provides the ability to communicate with a large number of clouds by using low-level API libraries. This is exactly the same as for Amazon EC2. This section will focus on the difference between Amazon EC2 and Rackspace Cloud Services. Follow the Amazon EC2 steps with some minor changes as described in the Configuration section.

Configuration

Edit the /usr/local/whirr/conf/credentials file and change these lines:

PROVIDER=cloudservers-us
IDENTITY=<your login from Rackspace>
CREDENTIAL=<Your API key>

For the /usr/local/whirr/conf/cluster.properties file, you need to change whirr.provider and provide ZooKeeper node, Hadoop NameNode, and Hadoop DataNode:

whirr.cluster-name=demo-cluster
whirr.instance-templates=1 zookeeper,1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=cloudservers-us
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa_whirr.pub

Setting up Accumulo on a Rackspace cluster requires manual steps.

Log in to the Rackspace cloud console and create a new Linux machine using a CentOS image, connect to it, and set up Accumulo manually.

Network

A Rackspace cluster created with Whirr doesn't run behind a firewall. A firewall can be created manually by creating a new network, which is highly recommended to protect the cluster. More information on the topic of isolating a cloud network can be found at http://www.rackspace.com/knowledge_center/article/create-an-isolated-cloud-network.

Windows Azure

On February 1, 2010, Microsoft announced general availability of the Windows Azure cloud platform and infrastructure. Windows Azure supports both Microsoft Windows and Linux server operating systems. The Windows Azure platform itself is closed source, but the client SDKs are open source.

In the previous demonstrations, scripts have been used to create the cluster. But it can be just as easy to use the interface provided by Windows Azure to get the same result, if all that is needed is a small cluster.

Prerequisites

The prerequisites for Windows Azure are:

There are command-line tools for both Linux and Windows. For Windows, install both Windows Azure PowerShell and the cross-platform command-line interface. For Linux, you only need the cross-platform command-line interface.

Creating the cluster

Windows Azure supports Windows Server 2012, OpenSUSE, SUSE Linux Enterprise Server, Ubuntu Server, and CentOS. In our example, we are going to use Ubuntu Server 12.04 LTS, but the list of available images changes on a regular basis, and you need to see what images are available before starting.

In this section, we are going to focus on the user interface for creating a cluster in Windows Azure. Using the command-line interface that is available for both Linux and Windows, it is very easy to accomplish the same task. Windows Azure PowerShell cmdlets need to be configured before use by following the guidelines in the article Get Started with Windows Azure Cmdlets at http://msdn.microsoft.com/en-us/library/windowsazure/jj554332.aspx.
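
As an illustration of the cross-platform CLI route, the following sketch imports the subscription credentials and creates one of the Linux virtual machines. The DNS name, user name, password, location, and image name are placeholders (pick the actual Ubuntu Server 12.04 LTS image name from azure vm image list), and the commands reflect the azure CLI of that period:

# Download and import the publish settings file for your subscription
azure account download
azure account import <path to downloaded .publishsettings file>

# Find the Ubuntu Server 12.04 LTS image name
azure vm image list

# Create the Hadoop NameNode VM with SSH enabled
azure vm create hadoop-namenode <ubuntu-12.04-image-name> azureuser <password> --ssh --location "West Europe"

The remaining nodes (for example, hadoop-datanode, hadoop-zookeeper, and accumulo-node1) can be created the same way with different DNS names.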

Creating a cluster involves four steps:

  • Create and set up Hadoop NameNode. A new instance is created, and then Hadoop is set up and started as a master.
  • Create and set up Hadoop DataNode. A new instance is created, and then Hadoop is set up and started as a slave.
  • Create and set up a ZooKeeper node. A new instance is created, and then ZooKeeper is set up and started.
  • Create and set up an Accumulo node. A new instance is created, and then Accumulo is set up and started.

Hadoop

For Hadoop we are going to create two nodes: NameNode and DataNode.

  • Create the Hadoop NameNode by using the Windows Azure Management console. Create a new Linux virtual machine and give a meaningful name to the Hadoop NameNode. Nodes are available online (no firewall), so you need to pick a unique name for the Hadoop NameNode.

  • Connect to the newly created Hadoop NameNode:
    • For Linux, use SSH
    • For Windows, use PuTTY to connect to the newly created computer
  • Follow the guidelines to set up Hadoop (master).
  • Create the Hadoop DataNode:
    • By using the Windows Azure Management Console, create a new Linux virtual machine and give a meaningful name to the Hadoop DataNode. Because nodes are available online (no firewall), you need to pick a unique name for the Hadoop DataNode.
  • Connect to the newly created Hadoop DataNode:
    • For Linux, use SSH
    • For Windows, use PuTTY to connect to the newly created computer
  • Follow the guidelines to set up Hadoop (slave).

ZooKeeper

For ZooKeeper, we are going to create a single node.

  • Create the ZooKeeper node:
    • By using the Windows Azure Management console, create a new Linux virtual machine and give a meaningful name to the ZooKeeper node. Because nodes are available online (no firewall), you need to pick a unique name for the ZooKeeper node.
  • Connect to the newly created ZooKeeper node:
    • For Linux, use SSH
    • For Windows, use PuTTY to connect to the newly created computer
  • Follow the guidelines to set up ZooKeeper.

Accumulo

For Accumulo, we are going to create a single node.

  • Create the Accumulo node:
    • By using the Windows Azure Management console, create a new Linux virtual machine and give a meaningful name to the Accumulo node. Because nodes are available online (no firewall), you need to pick a unique name for the Accumulo node.
  • Connect to the newly created Accumulo node:
    • For Linux, use SSH
    • For Windows, use PuTTY to connect to the newly created computer
  • Follow the guidelines to set up Accumulo.

After the setup, you should see the following page inside the Windows Azure console under virtual machines:

Deleting the cluster

After we have used our cluster, there is no reason to keep it around. By deleting the cluster, we are going to stop all machines, remove all scratch disk data, and finally remove all machines from the project.

While deleting an instance, you will be asked whether you want to delete the instance and whether you want to delete the persistent boot disk. Answer both questions with Y.

  • Accumulo: In the Windows Azure Management console, select the Accumulo VM and then delete the VM
  • ZooKeeper: In the Windows Azure Management console, select the ZooKeeper VM and then delete the VM
  • Hadoop DataNode: In the Windows Azure Management console, select the Hadoop DataNode VM and then delete the VM
  • Hadoop NameNode: In the Windows Azure Management console, select the Hadoop NameNode VM and then delete the VM
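
The same cleanup can be scripted with the cross-platform CLI. The VM names below are the placeholders used in the earlier sketch, and the -b (delete the underlying disk blob) and -q (skip confirmation) flags are recalled from the azure CLI of that period, so confirm them with azure vm delete --help:

# Delete each VM; -b also removes the underlying VHD blob, -q skips the confirmation prompt
azure vm delete accumulo-node1 -b -q
azure vm delete hadoop-zookeeper -b -q
azure vm delete hadoop-datanode -b -q
azure vm delete hadoop-namenode -b -q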


Summary

This article was about setting up Hadoop, ZooKeeper, and Accumulo on four different cloud platforms. There are differences between those cloud platforms, but the steps required to set up an Accumulo cluster are very similar; the main difference is the scripts used to automate the process. Even for Windows Azure, writing scripts for Windows or Linux is an easy task and is well documented.

About the Author:


Guðmundur Jón Halldórsson

Guðmundur Jón Halldórsson is a Software Engineer who enjoys the challenges of complex problems and pays close attention to detail. He is an annual speaker at the Icelandic Computer Society (SKY, http://www.utmessan.is/).

Guðmundur is a Software Engineer with extensive experience and management skills, and works for Five Degrees (www.fivedegrees.nl), a company that develops and sells high-quality banking software. As a Senior Software Engineer, he is responsible for the development of the company's backend banking system. Guðmundur has a B.Sc. in Computer Science from Reykjavik University.

Guðmundur has worked as a Software Engineer since 1996. He has worked for a large bank in Iceland, an insurance company, and a large gaming company, where he was on the core EVE Online team.

Guðmundur is passionate about whatever he does. He loves to play online chess and Sudoku. And when he has time, he likes to read science fiction and history books.

He maintains a Facebook page to network with his friends and readers, and blogs about the wonders of programming and cloud computing at http://www.gudmundurjon.net/.

