Mastering Hadoop 3: Big data processing at scale to unlock unique business insights

By Chanchal Singh, Manish Kumar

Chapter 1. Journey to Hadoop 3

Hadoop has come a long way since its inception. Powered by a community of open source enthusiasts, it has seen three major version releases. The version 1 release saw the light of day six years after the first release of Hadoop. With this release, the Hadoop platform gained the full capability to run MapReduce distributed computing on top of Hadoop Distributed File System (HDFS) distributed storage. It included some of the biggest performance improvements ever made to the platform, along with full support for security. This release also brought a lot of improvements with respect to HBase.

The version 2 release made significant leaps compared to version 1 of Hadoop. It introduced YARN, a sophisticated general-purpose resource manager and job scheduling component. HDFS high availability, HDFS federations, and HDFS snapshots were some other prominent features introduced in version 2 releases.

The latest major release of Hadoop is version 3. This version introduced significant features such as HDFS erasure coding, a new YARN Timeline service (with a new architecture), YARN opportunistic containers and distributed scheduling, support for more than two NameNodes, and an intra-DataNode disk balancer. Apart from major feature additions, version 3 includes performance improvements and bug fixes. As this book is about mastering Hadoop 3, we'll mostly talk about this version.

In this chapter, we will take a look at Hadoop's history and how the Hadoop evolution timeline looks. We will look at the features of Hadoop 3 and get a logical view of the Hadoop ecosystem along with different Hadoop distributions. 

In particular, we will cover the following topics:

  • Hadoop origins
  • Hadoop Timelines
  • Hadoop logical view
  • Moving towards Hadoop 3
  • Hadoop distributions

 

Hadoop origins and Timelines


Hadoop is changing the way people think about data. We need to know what led to the origin of this magical innovation. Who developed Hadoop and why? What problems existed before Hadoop? How has it solved these problems? What challenges were encountered during development? How has Hadoop transformed from version 1 to version 3? Let's walk through the origins of Hadoop and its journey to version 3.

Origins

In 1997, Doug Cutting, a co-founder of Hadoop, started working on Lucene, a full-text search library written entirely in Java. Lucene analyzes text and builds an index on it; an index is simply a mapping from terms to their locations, so it can quickly return all locations that match a particular search pattern. A few years later, Doug made the Lucene project open source; it got a tremendous response from the community and later became an Apache Software Foundation project.

Once Doug realized that he had enough people who could look after Lucene, he started focusing on indexing web pages. Mike Cafarella joined him to develop a product that could index web pages, and they named this project Apache Nutch. Apache Nutch was a subproject of Apache Lucene, as Nutch used the Lucene library to index the content of web pages. With hard work, they made good progress and deployed Nutch on a single machine that was able to index around 100 pages per second.

Scalability is something that people often don't consider while developing the initial versions of applications. This was also true of Doug and Mike, and the number of web pages that could be indexed was limited to 100 million. In order to index more pages, they increased the number of machines. However, adding nodes resulted in operational problems because they did not have an underlying cluster manager to perform operational tasks. They wanted to focus on optimizing and developing a robust Nutch application without worrying about scalability issues.

Doug and Mike wanted a system that had the following features:

  • Fault tolerance: The system should handle the failure of machines automatically, in an isolated manner. This means the failure of one machine should not affect the entire application.
  • Load balancing: If one machine fails, then its work should be distributed automatically to the working machines in a fair manner.
  • No data loss: They also wanted to make sure that, once data is written to disk, it is never lost, even if one or two machines fail.

They started working on a system that could fulfill the aforementioned requirements and spent a few months doing so. However, at the same time, Google published its Google File System paper. When they read it, they found that it solved problems similar to the ones they were trying to solve. They decided to build an implementation based on this research paper and started the development of the Nutch Distributed File System (NDFS), which they completed in 2004.

With the help of the Google File System design, they solved the scalability and fault-tolerance problems that we discussed previously. They used the concepts of blocks and replication to do so. Blocks are created by splitting each file into 64 MB chunks (the size is configurable), and each block is replicated three times by default so that, if a machine holding one block fails, the data can be served from another machine. This implementation helped them solve all the operational problems they had been facing with Apache Nutch. The next section explains the origin of MapReduce.
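Before moving on, the following is a minimal sketch (not taken from the book) of how the block size and replication factor described above are exposed to clients through the Hadoop Java API; the 64 MB value mirrors the early default mentioned in the text, and the file path is an illustrative assumption:

    // Illustrative sketch: configure the block size and replication factor
    // used for files written by this client. Values and the path are assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettingsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.blocksize", "67108864"); // 64 MB blocks
            conf.set("dfs.replication", "3");      // three copies of every block
            FileSystem fs = FileSystem.get(conf);

            // Each file written now gets split into 64 MB blocks, and each block
            // is stored on three different DataNodes, so the loss of one machine
            // does not lose the data.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
                out.writeUTF("hello hdfs");
            }
        }
    }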

MapReduce origin

Doug and Mike started working on an algorithm that could process data stored on NDFS. They wanted a system whose performance could be doubled simply by doubling the number of machines running the program. At the same time, Google published MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html).

The core idea behind the MapReduce model was to provide parallelism, fault tolerance, and data locality. Data locality means that a program is executed where the data is stored, instead of bringing the data to the program. MapReduce was integrated into Nutch in 2005. In 2006, Doug created a new incubating project, Hadoop, which consisted of HDFS (the Hadoop Distributed File System, named after NDFS), MapReduce, and Hadoop Common.
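The following is a minimal word-count sketch (not taken from the book) that illustrates the map and reduce functions of this model using the standard Hadoop Java API; the class names are assumptions made for this example:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map runs where the input block is stored (data locality) and emits (word, 1).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce sums the counts for each word; failed tasks are simply re-executed,
        // which is how the model provides fault tolerance.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }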

At that time, Yahoo! was struggling with its backend search performance. Engineers at Yahoo! already knew the benefits of the Google File System and MapReduce implemented at Google. Yahoo! decided to adopt Hadoop, and it employed Doug to help its engineering team do so. In 2007, a few more companies started contributing to Hadoop, and Yahoo! reported that it was running 1,000-node Hadoop clusters at the same time.

NameNodes and DataNodes have specific roles in managing a cluster. The NameNode is responsible for maintaining metadata information, while DataNodes store the actual data blocks. The MapReduce engine has a JobTracker and TaskTrackers, and its scalability is limited to roughly 4,000 nodes (around 40,000 concurrent tasks) because the overall work of scheduling and tracking is handled by the JobTracker alone. YARN was introduced in Hadoop version 2 to overcome these scalability issues and to separate resource management from job scheduling. It gave Hadoop a new lease of life, and Hadoop became a more robust, faster, and more scalable system.

Timelines

We will talk about MapReduce and HDFS in detail later. Let's go through the evolution of Hadoop, which looks as follows:

2003

  • Research paper for Google File System released

2004

  • Research paper for MapReduce released

2006

  • JIRA, mailing list, and other documents created for Hadoop
  • Hadoop created by moving NDFS and MapReduce out of Nutch
  • Doug Cutting named the project Hadoop, after his son's yellow elephant toy
  • Hadoop 0.1.0 released
  • 1.8 TB of data sorted on 188 nodes in 47.9 hours
  • Three hundred machines deployed at Yahoo! for the Hadoop cluster
  • Cluster size at Yahoo! increased to 600

2007

  • Two clusters of 1,000 machines run by Yahoo!
  • Hadoop released with HBase
  • Apache Pig created at Yahoo!

2008

  • JIRA for YARN opened
  • Twenty companies listed on the Powered by Hadoop page
  • Web index at Yahoo! moved to Hadoop
  • 10,000-core Hadoop cluster used to generate Yahoo!'s production search index
  • First ever Hadoop Summit held
  • Hadoop sets the world record for the terabyte sort benchmark, sorting one terabyte of data in 209 seconds on a 910-node cluster
  • The Yahoo! cluster now has 10 TB loaded into Hadoop every day
  • Cloudera founded as a Hadoop distribution company
  • Google claims to sort 1 terabyte in 68 seconds with its MapReduce implementation

2009

  • 24,000 machines across seven clusters run at Yahoo!
  • Hadoop sets a record for sorting a petabyte of data
  • Yahoo! claims to sort 1 terabyte in 62 seconds
  • Second Hadoop Summit held
  • Hadoop core renamed Hadoop Common
  • MapR founded as a Hadoop distribution company
  • HDFS becomes a separate subproject
  • MapReduce becomes a separate subproject

2010

  • Authentication using Kerberos added to Hadoop
  • Stable version of Apache HBase released
  • Yahoo! runs 4,000 nodes to process 70 PB
  • Facebook runs a 2,300-node cluster to process 40 PB
  • Apache Hive released
  • Apache Pig released

2011

  • Apache ZooKeeper released
  • Contribution of 200,000 lines of code from Facebook, LinkedIn, eBay, and IBM
  • Hadoop wins the top prize at the Media Guardian Innovation Awards
  • Hortonworks started by Rob Bearden and Eric Baldeschwieler, who were core members of the Hadoop team at Yahoo!
  • 42,000 Hadoop nodes at Yahoo!, processing petabytes of data

2012

  • Hadoop community moves to integrate YARN
  • Hadoop Summit at San Jose
  • Hadoop version 1.0 released

2013

  • Yahoo! deploys YARN in production
  • Hadoop Version 2.2 released

2014

  • Apache Spark becomes an Apache top-level project
  • Hadoop Version 2.6 released

 

2015

  • Hadoop 2.7 released

2017

  • Hadoop 2.8 released
  • Hadoop 3.0.0 GA released in December

 

Hadoop 3 has introduced a few more important changes, which we will discuss in upcoming sections in this chapter.

Overview of Hadoop 3 and its features


The first alpha release of Hadoop version 3.0.0 was on 30 August 2016. It was called version 3.0.0-alpha1. This was the first alpha release in a series of planned alphas and betas that ultimately led to 3.0.0 GA. The intention behind this alpha release was to quickly gather and act on feedback from downstream users.

 

With any such release, there are some key drivers that lead to its birth. These key drivers create benefits that ultimately help Hadoop-augmented enterprise applications function better. Before we discuss the features of Hadoop 3, you should understand these driving factors. Some of the driving factors behind the release of Hadoop 3 are as follows:

  • A lot of bug fixes and performance improvements: Hadoop has a growing open source community of developers regularly adding major/minor changes or improvements to the Hadoop trunk repository. These changes were growing day by day and they couldn't be accommodated in minor version releases of 2.x. They had to be accommodated with a major version release. Hence, it was decided to release the majority of these changes committed to the trunk repository with Hadoop 3.
  • Overhead due to the data replication factor: As you may be aware, HDFS has a default replication factor of 3. This helps make things more fault-tolerant, with better data locality and better load balancing of jobs among DataNodes. However, it comes with a storage overhead of around 200%. For infrequently accessed datasets with low I/O activity, these replicated blocks are never touched during normal operations, yet they consume the same amount of storage as frequently accessed data. To mitigate this overhead for infrequently accessed data, Hadoop 3 introduced a major feature called erasure coding, which stores data durably while saving space significantly. For example, with a Reed-Solomon (6, 3) policy, six data blocks need only three parity blocks, an overhead of 50% instead of 200% (a small sketch of enabling an erasure coding policy follows this list).
  • Improving existing YARN Timeline services: YARN Timeline service version 1 has limitations that impact reliability, performance, and scalability. For example, it uses local-disk-based LevelDB storage that cannot scale to a high number of requests. Moreover, the Timeline server is a single point of failure. To mitigate such drawbacks, YARN Timeline server has been re-architected with the Hadoop 3 release.
  • Optimizing the map output collector: It is a well-known fact that correctly written native code executes faster. With that in mind, some optimizations were made in Hadoop 3 that speed up mapper tasks by approximately two to three times. A native implementation of the map output collector has been added, which the Java-based MapReduce framework can use via the Java Native Interface (JNI). This is particularly useful for shuffle-intensive operations.
  • The need for a higher availability factor for the NameNode: Hadoop is a fault-tolerant platform with support for handling multiple DataNode failures. In the case of NameNodes, versions prior to Hadoop 3 support only two NameNodes, one active and one standby. While this is a highly available solution, if the active (or standby) NameNode fails, the cluster falls back to a non-HA mode, which does not accommodate a high number of failures. In Hadoop 3, support for more than one standby NameNode has been introduced (see the configuration sketch after this list).
  • Dependency on the Linux ephemeral port range: Linux ephemeral ports are short-lived ports created by the operating system (OS) when a process requests any available port. The OS assigns the port number from a predefined range and releases the port after the related connection terminates. In version 2 and earlier, many Hadoop services' default ports were in the Linux ephemeral port range, which means that starting these services sometimes failed to bind to a port due to conflicts with other processes. In Hadoop 3, these default ports have been moved out of the ephemeral port range.
  • Disk-level data skew: A DataNode manages multiple disks (or drives). Sometimes, adding or replacing disks leads to significant data skew within a DataNode. To rebalance data among the disks within a DataNode, Hadoop 3 introduced a CLI utility called hdfs diskbalancer.
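The following two sketches (not from the book) illustrate the erasure coding and multi-standby NameNode features referenced above. First, applying a built-in Reed-Solomon policy to a cold-data directory through the Hadoop 3 Java client; the directory path and policy name are illustrative assumptions, and the policy may need to be enabled by an administrator first:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ErasureCodingExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.defaultFS points at an HDFS cluster.
            DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

            // Files written under this directory are stored as 6 data + 3 parity
            // blocks (50% overhead) instead of 3x replication (200% overhead).
            Path coldData = new Path("/data/cold");
            dfs.mkdirs(coldData);
            dfs.setErasureCodingPolicy(coldData, "RS-6-3-1024k");
        }
    }

Second, a minimal sketch of the client-side configuration keys involved when running more than two NameNodes; the nameservice ID mycluster and the host names are assumptions:

    import org.apache.hadoop.conf.Configuration;

    public class MultiStandbyNameNodeConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://mycluster");
            conf.set("dfs.nameservices", "mycluster");
            // Hadoop 3 allows listing more than two NameNode IDs per nameservice.
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2,nn3");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "master1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "master2.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn3", "master3.example.com:8020");
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            return conf;
        }
    }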

Hopefully, by now, you have a clear understanding of why certain features were introduced in Hadoop 3 and what benefits are derived from them. Throughout this book, we will look into these features in detail; our intent in this section was only to give you a high-level overview of the major features of Hadoop 3. In the next section, we will look at the Hadoop logical view.

Hadoop logical view


The Hadoop Logical view can be divided into multiple sections. These sections can be viewed as a logical sequence, with steps starting from Ingress/Egress and ending at Data Storage Medium.

The following diagram shows the Hadoop platform logical view:

We will touch upon the sections shown in the preceding diagram one by one to understand them. When designing any Hadoop application, you should think in terms of these sections and make technology choices according to the use-case problems you are trying to solve. Let's now look at each of these sections:

  • Ingress/egress/processing: Any interaction with the Hadoop platform should be viewed in terms of the following:
    • Ingesting (ingress) data
    • Reading (egress) data
    • Processing already ingested data

These actions can be automated via tools or custom code, or they can be achieved by user actions, such as uploading data to Hadoop or downloading data from it. Sometimes, users trigger actions that result in ingress/egress or the processing of data.

  • Data integration components: For ingress/egress or data processing in Hadoop, you need data integration components. These components are tools, software, or custom code that help integrate the underlying Hadoop data with user views or actions. From the user perspective, these components give end users a unified view of data in Hadoop across different distributed Hadoop folders, in different files and data formats. They provide end users and applications with an entry point for using or manipulating Hadoop data using different data access interfaces and data processing engines. We will explore data access interfaces and processing engines shortly. In a nutshell, tools such as Hue and software (libraries) such as Sqoop, the Java Hadoop client, and the Hive Beeline client are some examples of data integration components (a small Java client sketch is shown after this list).
  • Data access interfaces: Data access interfaces allow you to access underlying Hadoop data using different languages such as SQL or NoSQL, APIs such as REST and Java APIs, or different data formats such as search data formats and streams. Sometimes, the interface that you use to access data in Hadoop is tightly coupled with an underlying data processing engine. For example, if you use Spark SQL, then it is bound to the Spark processing engine. Something similar is true of the search interface, which is bound to search engines such as Solr or Elasticsearch.
  • Data processing engines: Hadoop as a platform provides different processing engines to manipulate the underlying data. These processing engines use system resources differently and offer completely different SLA guarantees. For example, the MapReduce processing engine is more disk-I/O-bound (keeping RAM usage under control) and is suitable for batch-oriented data processing. Spark, an in-memory processing engine, is less disk-I/O-bound and more dependent on RAM; it is more suitable for stream or micro-batch processing. You should choose processing engines for your application based on the type of data sources you are dealing with, along with the SLAs you need to satisfy.
  • Resource management frameworks: Resource management frameworks expose abstract APIs to interact with underlying resource managers for task and job scheduling in Hadoop. These frameworks ensure there is a defined set of steps for submitting jobs in Hadoop using designated resource managers such as YARN or Mesos. They help achieve optimal performance by utilizing the underlying resources systematically. Examples of such frameworks are Tez and Slider. Sometimes, data processing engines use these frameworks to interact with the underlying resource managers, and sometimes they have their own custom libraries to do so.
  • Task and resource management: Task and resource management has one primary goal: sharing a large cluster of machines across different applications running simultaneously. There are two major resource managers in Hadoop: YARN and Mesos. Both are built with the same goal, but they use different scheduling and resource allocation mechanisms for jobs in Hadoop. For example, YARN launches tasks as Unix processes, while Mesos is Linux-container-based.
  • Data input/output: The data input/output layer is primarily responsible for different file formats, compression techniques, and data serialization for Hadoop storage.
  • Data storage medium: HDFS is the primary data storage medium used in Hadoop. It is a Java-based, high-performance distributed filesystem that stores its data on top of the underlying native (for example, UNIX) filesystem. In the next section, we will study Hadoop distributions along with their benefits.
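As a small illustration of the Java Hadoop client mentioned above acting as a data access interface, the following sketch (not from the book) lists a directory and reads a file from HDFS; the paths are illustrative assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Egress: list and read data already stored in Hadoop.
            for (FileStatus status : fs.listStatus(new Path("/data/events"))) {
                System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/events/part-00000"))))) {
                System.out.println(reader.readLine());
            }
        }
    }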

 

 

Hadoop distributions


Hadoop is an open-source project under the Apache Software Foundation, and most components in the Hadoop ecosystem are also open-sourced. Many companies have taken important components and bundled them together to form a complete distribution package that is easier to use and manage. A Hadoop distribution offers the following benefits:

  • Installation: The distribution package provides an easy way to install any component as an RPM-like package on a cluster, and it provides an easy interface for doing so.
  • Packaging: It comes with multiple open source tools that are configured to work well together. Imagine installing and configuring each component separately on a multi-node cluster and then testing whether everything works properly. What if we forget some testing scenarios and the cluster behaves unexpectedly? A Hadoop distribution assures us that we won't face such problems, and it also provides upgrades or installations of new components through its package library.
  • Maintenance: Maintaining a cluster and its components is also a very challenging task, but it is made much simpler by these distribution packages. They provide a nice GUI to monitor the health and status of each component, and we can also change configurations to tune or maintain a component so that it performs well.
  • Support: Most distributions come with 24/7 support. This means that, if you are stuck with any cluster- or distribution-related issue, you don't need to worry much about finding resources to solve the problem. The distribution comes with a support package that assures you of technical support and help as and when needed.

 

 

On-premise distribution

There are many distributions available in the market; we will look at the most widely used distributions:

  • Cloudera: Cloudera is a Hadoop distribution company that was founded in 2008, just when Hadoop started gaining popularity, and it offers the oldest Hadoop distribution available. People at Cloudera are committed to contributing to the open source community, and they have contributed to the building of Hive, Impala, Hadoop, Pig, and other popular open source projects. Cloudera comes with good tools packaged together to provide a good Hadoop experience, and it also provides a nice GUI to manage and monitor clusters, known as Cloudera Manager.
  • Hortonworks: Hortonworks was founded in 2011 and offers the Hortonworks Data Platform (HDP), an open source Hadoop distribution. The Hortonworks distribution is widely used in organizations, and it provides the Apache Ambari GUI-based interface to manage and monitor clusters. Hortonworks contributes to many open source projects such as Apache Tez, Hadoop, YARN, and Hive. Hortonworks has also launched the Hortonworks DataFlow (HDF) platform for data ingestion and storage. The Hortonworks distribution also focuses on the security aspects of Hadoop and has integrated security features such as Ranger, Kerberos, and SSL into the HDP and HDF platforms.
  • MapR: MapR was founded in 2009 and has its own filesystem called MapR-FS, which is quite similar to HDFS but with some additional features built by MapR. It boasts higher performance, includes a nice set of tools to manage and administer a cluster, and does not suffer from a single point of failure. It also offers some useful features, such as mirroring and snapshots.

 

 

Cloud distributions

Cloud services offer cost-effective solutions in terms of infrastructure setup, monitoring, and maintenance. A large number of organizations do prefer moving their Hadoop infrastructure to the cloud. There are a few popular distributions available for the cloud:

  • Amazon's Elastic MapReduce: Amazon had already built a large cloud infrastructure before moving into Hadoop. Amazon provides Elastic MapReduce and many other Hadoop ecosystem tools in its distribution. It offers the S3 filesystem as an alternative to HDFS and a cost-effective setup for Hadoop on the cloud, and it is currently the most actively used Hadoop-on-cloud distribution.
  • Microsoft Azure: Microsoft offers HDInsight as its Hadoop distribution. It also offers a cost-effective solution for Hadoop infrastructure setup, monitoring, and managing cluster resources. Azure claims to provide a fully cloud-based cluster with a 99.9% Service Level Agreement (SLA).

Other big companies have also started providing Hadoop on the cloud, such as Google Cloud Platform, IBM BigInsights, and Cloudera Cloud. You may choose any distribution based on the feasibility and stability of its Hadoop tools and components. Most companies offer a free trial for one year with lots of free credits for organizational use.

Points to remember


We provided a basic introduction to Hadoop and the following are a few points to remember:

  • Doug Cutting, the founder of Hadoop, started the development of Hadoop at Nutch based on Google's research papers on the Google File System and MapReduce.
  • Apache Lucene is a full-text open-source search library initially written by Doug Cutting in Java.
  • Hadoop consists of two important parts, one called the Hadoop Distributed File System and the other called MapReduce.
  • YARN is a resource management framework used to schedule and run applications such as MapReduce and Spark.
  • Hadoop distributions are a complete package of all open source big data tools integrated together to work with each other in an efficient way.

 

Summary


In this chapter, we covered Hadoop's origins and how Hadoop evolved over time with more performance-optimized features and tools. We also covered the logical view of the Hadoop platform in detail and understood its different layers. Hadoop distributions were also covered, to help you understand which distribution you should choose. We described the new features available in Hadoop version 3 and will discuss them in more detail in upcoming chapters.

In the next chapter, we will cover HDFS and walk you through the HDFS architecture and its components in detail. We will go much deeper into the internals of HDFS and its high availability features. We will then look into HDFS read and write operations and how HDFS caching and federation work.




