Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Big Data Analytics with Hadoop 3
Big Data Analytics with Hadoop 3

Big Data Analytics with Hadoop 3: Build highly effective analytics solutions to gain valuable insight into your big data

By Sridhar Alla
$43.99
Book May 2018 482 pages 1st Edition
eBook
$35.99 $24.99
Print
$43.99
Subscription
$15.99 Monthly
eBook
$35.99 $24.99
Print
$43.99
Subscription
$15.99 Monthly

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Black & white paperback book shipped to your address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : May 31, 2018
Length 482 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781788628846
Vendor :
Apache
Category :
Concepts :
Table of content icon View table of contents Preview book icon Preview Book

Big Data Analytics with Hadoop 3

Chapter 1. Introduction to Hadoop

This chapter introduces the reader to the world of Hadoop and the core components of Hadoop, namely the Hadoop Distributed File System (HDFS) and MapReduce. We will start by introducing the changes and new features in the Hadoop 3 release. Particularly, we will talk about the new features of HDFS and Yet Another Resource Negotiator (YARN), and changes to client applications. Furthermore, we will also install a Hadoop cluster locally and demonstrate the new features such as erasure coding (EC) and the timeline service. As as quick note, Chapter 10Visualizing Big Data shows you how to create a Hadoop cluster in AWS.

In a nutshell, the following topics will be covered throughout this chapter:

  • HDFS
    • High availability
    • Intra-DataNode balancer
    • EC
    • Port mapping
  • MapReduce
    • Task-level optimization
  • YARN
    • Opportunistic containers
    • Timeline service v.2
    • Docker containerization
  • Other changes
  • Installation of Hadoop 3.1
    • HDFS
    • YARN
    • EC
    • Timeline service v.2

Hadoop Distributed File System


HDFS is a software-based filesystem implemented in Java and it sits on top of the native filesystem. The main concept behind HDFS is that it divides a file into blocks (typically 128 MB) instead of dealing with a file as a whole. This allows many features such as distribution, replication, failure recovery, and more importantly distributed processing of the blocks using multiple machines. Block sizes can be 64 MB, 128 MB, 256 MB, or 512 MB, whatever suits the purpose. For a 1 GB file with 128 MB blocks, there will be 1024 MB/128 MB equal to eight blocks. If you consider a replication factor of three, this makes it 24 blocks. HDFS provides a distributed storage system with fault tolerance and failure recovery. HDFS has two main components: the NameNode and the DataNode. The NameNode contains all the metadata of all content of the filesystem: filenames, file permissions, and the location of each block of each file, and hence it is the most important machine in HDFS. DataNodes connect to the NameNode and store the blocks within HDFS. They rely on the NameNode for all metadata information regarding the content in the filesystem. If the NameNode does not have any information, the DataNode will not be able to serve information to any client who wants to read/write to the HDFS.

It is possible for NameNode and DataNode processes to be run on a single machine; however, generally HDFS clusters are made up of a dedicated server running the NameNode process and thousands of machines running the DataNode process. In order to be able to access the content information stored in the NameNode, it stores the entire metadata structure in memory. It ensures that there is no data loss as a result of machine failures by keeping a track of the replication factor of blocks. Since it is a single point of failure, to reduce the risk of data loss on account of the failure of a NameNode, a secondary NameNode can be used to generate snapshots of the primary NameNode's memory structures.

DataNodes have large storage capacities and, unlike the NameNode, HDFS will continue to operate normally if a DataNode fails. When a DataNode fails, the NameNode automatically takes care of the now diminished replication of all the data blocks in the failed DataNode and makes sure the replication is built back up. Since the NameNode knows all locations of the replicated blocks, any clients connected to the cluster are able to proceed with little to no hiccups.

Note

In order to make sure that each block meets the minimum required replication factor, the NameNode replicates the lost blocks.

The following diagram depicts the mapping of files to blocks in the NameNode, and the storage of blocks and their replicas within the DataNodes:

The NameNode, as shown in the preceding diagram, has been the single point of failure since the beginning of Hadoop.

High availability

The loss of NameNodes can crash the cluster in both Hadoop 1.x as well as Hadoop 2.x. In Hadoop 1.x, there was no easy way to recover, whereas Hadoop 2.x introduced high availability (active-passive setup) to help recover from NameNode failures.

The following diagram shows how high availability works:

In Hadoop 3.x you can have two passive NameNodes along with the active node, as well as five JournalNodesto assist with recovery from catastrophic failures:

  • NameNode machines: The machines on which you run the active and standby NameNodes. They should have equivalent hardware to each other and to what would be used in a non-HA cluster.

  • JournalNode machines: The machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager

Intra-DataNode balancer

HDFS has a way to balance the data blocks across the data nodes, but there is no such balancing inside the same data node with multiple hard disks. Hence, a 12-spindle DataNode can have out of balance physical disks. But why does this matter to performance? Well, by having out of balance disks, the blocks at DataNode level might be the same as other DataNodes but the reads/writes will be skewed because of imbalanced disks. Hence, Hadoop 3.x introduces the intra-node balancer to balance the physical disks inside each data node to reduce the skew of the data. 

This increases the reads and writes performed by any process running on the cluster, such as a mapper or reducer.

Erasure coding

HDFS has been the fundamental component since the inception of Hadoop. In Hadoop 1.x as well as Hadoop 2.x, a typical HDFS installation uses a replication factor of three.

Compared to the default replication factor of three, EC is probably the biggest change in HDFS in years and fundamentally doubles the capacity for many datasets by bringing down the replication factor from 3 to about 1.4. Let's now understand what EC is all about. 

EC is a method of data protection in which data is broken into fragments, expanded, encoded with redundant data pieces, and stored across a set of different locations or storage. If at some point during this process data is lost due to corruption, then it can be reconstructed using the information stored elsewhere. Although EC is more CPU intensive, this greatly reduces the storage needed for the reliable storing of large amounts of data (HDFS). HDFS uses replication to provide reliable storage and this is expensive, typically requiring three copies of data to be stored, thus causing a 200% overhead in storage space.

Port numbers

In Hadoop 3.x, many of the ports for various services have been changed.

Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768–61000). This indicated that at startup, services would sometimes fail to bind to the port with another application due to a conflict.

These conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. 

The changes are listed as follows:

  • NameNode ports: 50470 → 9871, 50070 → 9870, and 8020 → 9820
  • Secondary NameNode ports: 50091 → 9869 and 50090 → 9868
  • DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, and 50075 → 9864

MapReduce framework


An easy way to understand this concept is to imagine that you and your friends want to sort out piles of fruit into boxes. For that, you want to assign each person the task of going through one raw basket of fruit (all mixed up) and separating out the fruit into various boxes. Each person then does the same task of separating the fruit into the various types with this basket of fruit. In the end, you end up with a lot of boxes of fruit from all your friends. Then, you can assign a group to put the same kind of fruit together in a box, weigh the box, and seal the box for shipping. A classic example of showing the MapReduce framework at work is the word count example. The following are the various stages of processing the input data, first splitting the input across multiple worker nodes and then finally generating the output, the word counts:

The MapReduce framework consists of a single ResourceManager and multiple NodeManagers (usually, NodeManagers coexist with the DataNodes of HDFS). 

Task-level native optimization

MapReduce has added support for a native implementation of the map output collector. This new support can result in a performance improvement of about 30% or more, particularly for shuffle-intensive jobs.

The native library will build automatically with Pnative. Users may choose the new collector on a job-by-job basis by setting mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator in their job configuration. 

The basic idea is to be able to add a NativeMapOutputCollector in order to handle key/value pairs emitted by mapper. As a result of this sort, spill, and IFile serialization can all be done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed promising results as follows:

  • sort is about 3-10 times faster than Java (only binary string compare is supported)
  • IFile serialization speed is about three times faster than Java: about 500 MB per second. If CRC32C hardware is used, things can get much faster in the range of 1 GB or higher per second
  • Merge code is not completed yet, so the test uses enough io.sort.mb to prevent mid-spill

YARN


When an application wants to run, the client launches the ApplicationMaster, which then negotiates with the ResourceManager to get resources in the cluster in the form of containers. A container represents CPUs (cores) and memory allocated on a single node to be used to run tasks and processes. Containers are supervised by the NodeManager and scheduled by the ResourceManager.

Examples of containers:

  • One core and 4 GB RAM
  • Two cores and 6 GB RAM
  • Four cores and 20 GB RAM

Some containers are assigned to be mappers and others to be reducers; all this is coordinated by the ApplicationMaster in conjunction with the ResourceManager. This framework is called YARN:

Using YARN, several different applications can request for and execute tasks on containers, sharing the cluster resources pretty well. However, as the size of the clusters grows and the variety of applications and requirements change, the efficiency of the resource utilization is not as good over time.

Opportunistic containers

Opportunistic containers can be transmitted to a NodeManager even if their execution at that particular time cannot begin immediately, unlike YARN containers, which are scheduled in a node if and only if there are unallocated resources.

In these types of scenarios, opportunistic containers will be queued at the NodeManager till the required resources are available for use. The ultimate goal of these containers is to enhance the cluster resource utilization and in turn improve task throughput.

Types of container execution 

There are two types of container, as follows:

  • Guaranteed containers: These containers correspond to the existing YARN containers. They are assigned by the capacity scheduler. They are transmitted to a node if and only if there are resources available to begin their execution immediately. 
  • Opportunistic containers: Unlike guaranteed containers, in this case we cannot guarantee that there will be resources available to begin their execution once they are dispatched to a node. On the contrary, they will be queued at the NodeManager itself until resources become available.

YARN timeline service v.2

The YARN timeline service v.2 addresses the following two major challenges:

  • Enhancing the scalability and reliability of the timeline service
  • Improving usability by introducing flows and aggregation

Enhancing scalability and reliability

Version 2 adopts a more scalable distributed writer architecture and backend storage, as opposed to v.1 which does not scale well beyond small clusters as it used a single instance of writer/reader architecture and backend storage.

Since Apache HBase scales well even to larger clusters and continues to maintain a good read and write response time, v.2 prefers to select it as the primary backend storage.

Usability improvements

Many a time, users are more interested in the information obtained at the level of flows or in logical groups of YARN applications. For this reason, it is more convenient to launch a series of YARN applications to complete a logical workflow.

In order to achieve this, v.2 supports the notion of flows and aggregates metrics at the flow level.

Architecture

YARN Timeline Service v.2 uses a set of collectors (writers) to write data to the back-end storage. The collectors are distributed and co-located with the application masters to which they are dedicated. All data that belong to that application are sent to the application level timeline collectors with the exception of the resource manager timeline collector.

For a given application, the application master can write data for the application to the co-located timeline collectors (which is an NM auxiliary service in this release). In addition, node managers of other nodes that are running the containers for the application also write data to the timeline collector on the node that is running the application master. 

The resource manager also maintains its own timeline collector. It emits only YARN-generic life-cycle events to keep its volume of writes reasonable.

The timeline readers are separate daemons separate from the timeline collectors, and they are dedicated to serving queries via REST API:

The following diagram illustrates the design at a high level:

Other changes


There are other changes coming up in Hadoop 3, which are mainly to make it easier to maintain and operate. Particularly, the command-line tools have been revamped to better suit the needs of operational teams.

Minimum required Java version 

All Hadoop JARs are now compiled to target a runtime version of Java 8. Hence, users that are still using Java 7 or lower must upgrade to Java 8.

Shell script rewrite

The Hadoop shell scripts have been rewritten to fix many long-standing bugs and include some new features. 

Incompatible changes are documented in the release notes. You can find them at https://issues.apache.org/jira/browse/HADOOP-9902.

There are more details available in the documentation at https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/UnixShellGuide.html. The documentation present at https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/UnixShellAPI.html will appeal to power users, as it describes most of the new functionalities, particularly those related to extensibility.

Shaded-client JARs

The new hadoop-client-api and hadoop-client-runtime artifacts have been added, as referred to by https://issues.apache.org/jira/browse/HADOOP-11804. These artifacts shade Hadoop's dependencies into a single JAR. As a result, it avoids leaking Hadoop's dependencies onto the application's classpath.

Hadoop now also supports integration with Microsoft Azure Data Lake and Aliyun Object Storage System as an alternative for Hadoop-compatible filesystems.

Installing Hadoop 3 


In this section, we shall see how to install a single-node Hadoop 3 cluster on your local machine. In order to do this, we will be following the documentation given at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html.

This document gives us a detailed description of how to install and configure a single-node Hadoop setup in order to carry out simple operations using Hadoop MapReduce and the HDFS quickly.

Prerequisites

Java 8 must be installed for Hadoop to be run. If Java 8 does not exist on your machine, then you can download and install Java 8: https://www.java.com/en/download/.

The following will appear on your screen when you open the download link in the browser:

Downloading

Download the Hadoop 3.1 version using the following link: http://apache.spinellicreations.com/hadoop/common/hadoop-3.1.0/.

The following screenshot is the page shown when the download link is opened in the browser:

When you get this page in your browser, simply download the hadoop-3.1.0.tar.gz file to your local machine.

Installation

Perform the following steps to install a single-node Hadoop cluster on your machine:

  1. Extract the downloaded file using the following command:
tar -xvzf hadoop-3.1.0.tar.gz
  1. Once you have extracted the Hadoop binaries, just run the following commands to test the Hadoop binaries and make sure the binaries works on our local machine:
cd hadoop-3.1.0

mkdir input

cp etc/hadoop/*.xml input

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar grep input output 'dfs[a-z.]+'

cat output/*

If everything runs as expected, you will see an output directory showing some output, which shows that the sample command worked.

Note

A typical error at this point will be missing Java. You might want to check and see if you have Java installed on your machine and the JAVA_HOME environment variable set correctly.

Setup password-less ssh

Now check if you can ssh to the localhost without a passphrase by running a simple command, shown as follows:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Setting up the NameNode

Make the following changes to the configuration file etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Make the following changes to the configuration file etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
        <name>dfs.name.dir</name>
        <value><YOURDIRECTORY>/hadoop-3.1.0/dfs/name</value>
    </property>
</configuration>

Starting HDFS

Follow these steps as shown to start HDFS (NameNode and DataNode):

  1. Format the filesystem:
$ ./bin/hdfs namenode -format
  1. Start the NameNode daemon and the DataNode daemon:
$ ./sbin/start-dfs.sh

The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

  1. Browse the web interface for the NameNode; by default it is available at http://localhost:9870/.
  2. Make the HDFS directories required to execute MapReduce jobs:
$ ./bin/hdfs dfs -mkdir /user 
$ ./bin/hdfs dfs -mkdir /user/<username>
  1. When you're done, stop the daemons with the following:
$ ./sbin/stop-dfs.sh
  1. Open a browser to check your local Hadoop, which can be launched in the browser as http://localhost:9870/. The following is what the HDFS installation looks like:
  1. Clicking on the Datanodes tab shows the nodes as shown in the following screenshot:

Figure: Screenshot showing the nodes in the Datanodes tab

  1. Clicking on the logs will show the various logs in your cluster, as shown in the following screenshot:
  1. As shown in the following screenshot, you can also look at the various JVM metrics of your cluster components:
  1. As shown in the following screenshot, you can also check the configuration. This is a good place to look at the entire configuration and all the default settings:
  1. You can also browse the filesystem of your newly installed cluster, as shown in the following screenshot:

Figure: Screenshot showing the Browse Directory and how you can browse the filesystem in you newly installed cluster

At this point, we should all be able to see and use a basic HDFS cluster. But this is just a HDFS filesystem with some directories and files. We also need a job/task scheduling service to actually use the cluster for computational needs rather than just storage.

Setting up the YARN service

In this section, we will set up a YARN service and start the components needed to run and operate a YARN cluster:

  1. Start the ResourceManager daemon and the NodeManager daemon:

$ sbin/start-yarn.sh
  1. Browse the web interface for the ResourceManager; by default it is available at: http://localhost:8088/

  2. Run a MapReduce job

  3. When you're done, stop the daemons with the following:

$ sbin/stop-yarn.sh

The following is the YARN ResourceManager, which you can view by putting the URL http://localhost:8088/ into the browser:

Figure: Screenshot of YARN ResouceManager

The following is a view showing the queues of resources in the cluster, along with any applications running. This is also the place where you can see and monitor the running jobs:

Figure: Screenshot of queues of resources in the cluster

At this time, we should be able to see the running YARN service in our local cluster running Hadoop 3.1.0. Next, we will look at some new features in Hadoop 3.x.

Erasure Coding

EC is a key change in Hadoop 3.x promising a significant improvement in HDFS utilization efficiencies as compared to earlier versions where replication factor of 3 for instance caused immense wastage of precious cluster file system for all kinds of data no matter what the relative importance was to the tasks at hand. 

EC can be setup using policies and assigning the policies to directories in HDFS. For this, HDFS provides an ec subcommand to perform administrative commands related to EC:

hdfs ec [generic options]
    [-setPolicy -path <path> [-policy <policyName>] [-replicate]]
    [-getPolicy -path <path>]
    [-unsetPolicy -path <path>]
    [-listPolicies]
    [-addPolicies -policyFile <file>]
    [-listCodecs]
    [-enablePolicy -policy <policyName>]
    [-disablePolicy -policy <policyName>]
    [-help [cmd ...]]

The following are the details of each command:

  • [-setPolicy -path <path> [-policy <policyName>] [-replicate]]: Sets an EC policy on a directory at the specified path.
    • path: An directory in HDFS. This is a mandatory parameter. Setting a policy only affects newly created files, and does not affect existing files.
    • policyName: The EC policy to be used for files under this directory. This parameter can be omitted if a  dfs.namenode.ec.system.default.policy configuration is set. The EC policy of the path will be set with the default value in configuration.
    • -replicate: Apply the special REPLICATION policy on the directory, force the directory to adopt 3x replication scheme.
    • -replicate and -policy <policyName>: These are optional arguments. They cannot be specified at the same time.
  • [-getPolicy -path <path>]: Get details of the EC policy of a file or directory at the specified path.
  • [-unsetPolicy -path <path>]: Unset an EC policy set by a previous call to setPolicy on a directory. If the directory inherits the EC policy from an ancestor directory, unsetPolicy is a no-op. Unsetting the policy on a directory which doesn't have an explicit policy set will not return an error.
  • [-listPolicies]: Lists all (enabled, disabled and removed) EC policies registered in HDFS. Only the enabled policies are suitable for use with the setPolicy command.
  • [-addPolicies -policyFile <file>]: Add a list of EC policies. Please referetc/hadoop/user_ec_policies.xml.template for the example policy file. The maximum cell size is defined in propertydfs.namenode.ec.policies.max.cellsize with the default value 4 MB. Currently HDFS allows the user to add 64 policies in total, and the added policy ID is in range of 64 to 127. Adding policy will fail if there are already 64 policies added.
  • [-listCodecs]: Get the list of supported EC codecs and coders in system. A coder is an implementation of a codec. A codec can have different implementations, thus different coders. The coders for a codec are listed in a fall back order.
  • [-removePolicy -policy <policyName>]: It removes an EC policy
  • [-enablePolicy -policy <policyName>]: It enables an EC policy
  • [-disablePolicy -policy <policyName>]: It disables an EC policy

By using -listPolicies, you can list all the EC policies currently setup in your cluster along with the state of such policies whether they are ENABLED or DISABLED:

Lets test out EC in our cluster. First we will create directories in the HDFS shown as follows:

./bin/hdfs dfs -mkdir /user/normal
./bin/hdfs dfs -mkdir /user/ec

Once the two directories are created then you can set the policy on any path:

./bin/hdfs ec -setPolicy -path /user/ec -policy RS-6-3-1024k
Set RS-6-3-1024k erasure coding policy on /user/ec

Now copying any content into the /user/ec folder falls into the newly set policy.

Type the command shown as follows to test this:

./bin/hdfs dfs -copyFromLocal ~/Documents/OnlineRetail.csv /user/ec

The following screenshot shows the result of the copying, as expected the system complains as we don't really have a cluster on our local system enough to implement EC. But this should give us an idea of what is needed and how it would look:

Intra-DataNode balancer

While HDFS always had a great feature of balancing the data between the data nodes in the cluster, often this resulted in skewed disks within data nodes. For instance, if you have four disks, two disks might take the bulk of the data and the other two might be under-utilized. Given that physical disks (say 7,200 or 10,000 rpm) are slow to read/write, this kind of skewing of data results in poor performance. Using an intra-node balancer, we can rebalance the data amongst the disks.

Run the command shown in the following example to invoke disk balancing on a DataNode:

./bin/hdfs diskbalancer -plan 10.0.0.103

The following is the output of the disk balancer command:

Installing YARN timeline service v.2

As stated in the YARN timeline service v.2 section, v.2 always selects Apache HBase as the primary backing storage, since Apache HBase scales well even to larger clusters and continues to maintain a good read and write response time.

There are a few steps that need to be performed to prepare the storage for timeline service v.2:

  1. Set up the HBase cluster
  2. Enable the co-processor
  3. Create the schema for timeline service v.2

Each step is explained in more detail in the following sections.

Setting up the HBase cluster

The first step involves picking an Apache HBase cluster to use as the storage cluster. The version of Apache HBase that is supported with the timeline service v.2 is 1.2.6. The 1.0.x versions no longer work with timeline service v.2. Later versions of HBase have not been tested yet with the timeline service.

Simple deployment for HBase

If you are intent on a simple deploy profile for the Apache HBase cluster where the data loading is light but the data needs to persist across node comings and goings, you could consider the Standalone HBase over HDFS deploy mode. 

http://mirror.cogentco.com/pub/apache/hbase/1.2.6/

The following screenshot is the download link to HBase 1.2.6:

Download hbase-1.2.6-bin.tar.gz to your local machine. Then extract the HBase binaries:

tar -xvzf hbase-1.2.6-bin.tar.gz

The following is the content of the extracted HBase:

This is a useful variation on the standalone HBase setup and has all HBase daemons running inside one JVM but rather than persisting to the local filesystem, it persists to an HDFS instance. Writing to HDFS where data is replicated ensures that data is persisted across node comings and goings. To configure this standalone variant, edit your hbasesite.xml setting the hbase.rootdir to point at a directory in your HDFS instance but then set hbase.cluster.distributed to false

The following is the hbase-site.xml with the hdfs port 9000 for the local cluster we have installed mentioned as a property. If you leave this out there wont be a HBase cluster installed.

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>false</value>
    </property>
</configuration>

Next step is to start HBase. We will do this by using start-hbase.sh script:

./bin/start-hbase.sh

The following screenshot shows the HBase cluster we just installed:

The following screenshot shows are more attributes of the HBase cluster setup showing versions, of various components:

Figure: Screenshot of attributes of the HBase cluster setup and the versions of different components

Once you have an Apache HBase cluster ready to use, perform the steps in the following  section.

Enabling the co-processor

In this version, the co-processor is loaded dynamically.

Copy the timeline service .jar to HDFS from where HBase can load it. It is needed for the flowrun table creation in the schema creator. The default HDFS location is /hbase/coprocessor.

For example:

hadoop fs -mkdir /hbase/coprocessor hadoop fs -put hadoop-yarn-server-timelineservice-hbase-3.0.0-alpha1-SNAPSHOT.jar /hbase/coprocessor/hadoop-yarn-server-timelineservice.jar

To place the JAR at a different location on HDFS, there also exists a YARN configuration setting called yarn.timeline-service.hbase.coprocessor.jar.hdfs.location, shown as follows:

<property>
  <name>yarn.timeline-service.hbase.coprocessor.jar.hdfs.location</name>
  <value>/custom/hdfs/path/jarName</value>
</property>

Create the timeline service schema using the schema creator tool. For this to happen, we also need to make sure the JARs are all found correctly:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/Users/sridharalla/hbase-1.2.6/lib/:/Users/sridharalla/hadoop-3.1.0/share/hadoop/yarn/timelineservice/

Once we have the classpath corrected, we can create the HBase schema/tables using a simple command, shown as follows:

./bin/hadoop org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -create -skipExistingTable

The following is the HBase schema created as a result of the preceding command:

Enabling timeline service v.2

The following are the basic configurations to start timeline service v.2:

<property>
  <name>yarn.timeline-service.version</name>
  <value>2.0f</value>
</property>

<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,timeline_collector</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.timeline_collector.class</name>
  <value>org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService</value>
</property>

<property>
  <description> This setting indicates if the yarn system metrics is published by RM and NM by on the timeline service. </description>
  <name>yarn.system-metrics-publisher.enabled</name>
  <value>true</value>
</property>

<property>
  <description>This setting is to indicate if the yarn container events are published by RM to the timeline service or not. This configuration is for ATS V2. </description>
  <name>yarn.rm.system-metrics-publisher.emit-container-events</name>
  <value>true</value>
</property>

Also, add the hbase-site.xml configuration file to the client Hadoop cluster configuration so that it can write data to the Apache HBase cluster you are using, or set yarn.timeline-service.hbase.configuration.file to the file URL pointing to hbase-site.xml for the same purpose of writing the data to HBase, for example:

<property>
  <description>This is an Optional URL to an hbase-site.xml configuration file. It is to be used to connect to the timeline-service hbase cluster. If it is empty or not specified, the HBase configuration will be loaded from the classpath. Else, they will override those from the ones present on the classpath. </description>
  <name>yarn.timeline-service.hbase.configuration.file</name>
  <value>file:/etc/hbase/hbase-site.xml</value>
</property>
Running timeline service v.2

Restart the ResourceManager as well as the node managers to pick up the new configuration. The collectors start within the resource manager and the node managers in an embedded manner.

The timeline service reader is a separate YARN daemon, and it can be started using the following syntax:

$ yarn-daemon.sh start timelinereader
Enabling MapReduce to write to timeline service v.2

To write MapReduce framework data to timeline service v.2, enable the following configuration in mapred-site.xml:

<property>
  <name>mapreduce.job.emit-timeline-data</name>
  <value>true</value>
</property>

The timeline service is still evolving so you should try it out only to test out the features and not in production, and wait for the more widely adopted version, which should be available sometime soon.

Summary


In this chapter, we have discussed the new features in Hadoop 3.x and how it improves the reliability and performance of Hadoop 2.x. We also walked through the installation of a standalone Hadoop cluster on the local machine.

In the next chapter, we will take a peek into the world of big data analytics.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Learn Hadoop 3 to build effective big data analytics solutions on-premise and on cloud
  • Integrate Hadoop with other big data tools such as R, Python, Apache Spark, and Apache Flink
  • Exploit big data using Hadoop 3 with real-world examples

Description

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples. Once you have taken a tour of Hadoop 3’s latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then move on to learning how to integrate Hadoop with the open source tools, such as Python and R, to analyze and visualize data and perform statistical computing on big data. As you get acquainted with all this, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. In addition to this, you will understand how to use Hadoop to build analytics solutions on the cloud and an end-to-end pipeline to perform big data analysis using practical use cases. By the end of this book, you will be well-versed with the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions to perform big data analytics and get insight effortlessly.

What you will learn

Explore the new features of Hadoop 3 along with HDFS, YARN, and MapReduce Get well-versed with the analytical capabilities of Hadoop ecosystem using practical examples Integrate Hadoop with R and Python for more efficient big data processing Learn to use Hadoop with Apache Spark and Apache Flink for real-time data analytics Set up a Hadoop cluster on AWS cloud Perform big data analytics on AWS using Elastic Map Reduce

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Black & white paperback book shipped to your address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Buy Now

Product Details


Publication date : May 31, 2018
Length 482 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781788628846
Vendor :
Apache
Category :
Concepts :

Table of Contents

18 Chapters
Title Page Chevron down icon Chevron up icon
Copyright and Credits Chevron down icon Chevron up icon
Packt Upsell Chevron down icon Chevron up icon
Contributors Chevron down icon Chevron up icon
Preface Chevron down icon Chevron up icon
Introduction to Hadoop Chevron down icon Chevron up icon
Overview of Big Data Analytics Chevron down icon Chevron up icon
Big Data Processing with MapReduce Chevron down icon Chevron up icon
Scientific Computing and Big Data Analysis with Python and Hadoop Chevron down icon Chevron up icon
Statistical Big Data Computing with R and Hadoop Chevron down icon Chevron up icon
Batch Analytics with Apache Spark Chevron down icon Chevron up icon
Real-Time Analytics with Apache Spark Chevron down icon Chevron up icon
Batch Analytics with Apache Flink Chevron down icon Chevron up icon
Stream Processing with Apache Flink Chevron down icon Chevron up icon
Visualizing Big Data Chevron down icon Chevron up icon
Introduction to Cloud Computing Chevron down icon Chevron up icon
Using Amazon Web Services Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Filter icon Filter
Top Reviews
Rating distribution
Empty star icon Empty star icon Empty star icon Empty star icon Empty star icon 0
(0 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 0%

Filter reviews by


No reviews found
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela