
You're reading from  Mastering Hadoop 3

Product type: Book
Published in: Feb 2019
Reading level: Expert
Publisher: Packt
ISBN-13: 9781788620444
Edition: 1st Edition
Authors (2):
Chanchal Singh

Chanchal Singh has over half a decade of experience in product development and architecture design. He has worked closely with the leadership teams of various companies, including directors, CTOs, and founding members, to define their technical roadmaps. He is the founder of, and a speaker at, the meetup group Big Data and AI Pune Meetup, Experience Speaks. He is a co-author of the book Building Data Streaming Applications with Apache Kafka. He has a Bachelor's degree in Information Technology from the University of Mumbai and a Master's degree in Computer Application from Amity University. He was also part of the Entrepreneur Cell in IIT Mumbai. His LinkedIn profile can be found under the username Chanchal Singh.

Manish Kumar

Manish Kumar works as Director of Technology and Architecture at VSquare. He has over 13 years' experience in providing technology solutions to complex business problems. He has worked extensively on web application development, IoT, big data, cloud technologies, and blockchain. Manish has co-authored three books: Mastering Hadoop 3, Artificial Intelligence for Big Data, and Building Data Streaming Applications with Apache Kafka.

Overview of Hadoop 3 and its features


The first alpha release of Hadoop 3, version 3.0.0-alpha1, came out on 30 August 2016. It was the first in a series of planned alphas and betas that ultimately led to the 3.0.0 general availability (GA) release. The intention behind this alpha release was to quickly gather and act on feedback from downstream users.

 

As with any major release, a number of key drivers led to its creation. These drivers translate into benefits that ultimately help Hadoop-based enterprise applications function better. Before we discuss the features of Hadoop 3, you should understand these driving factors. Some of the driving factors behind the release of Hadoop 3 are as follows:

  • A lot of bug fixes and performance improvements: Hadoop has a growing open source community of developers regularly adding major/minor changes or improvements to the Hadoop trunk repository. These changes were growing day by day and they couldn't be accommodated in minor version releases of 2.x. They had to be accommodated with a major version release. Hence, it was decided to release the majority of these changes committed to the trunk repository with Hadoop 3.
  • Overhead due to data replication factor: As you may be aware, HDFS has a default replication factor of 3. This makes storage more fault-tolerant, with better data locality and better load balancing of jobs among DataNodes. However, it comes with a storage overhead of around 200%, because every block is stored three times. For infrequently accessed datasets with low I/O activity, the additional replicas are rarely, if ever, read in the course of normal operations, yet they consume the same amount of resources as the original blocks. To reduce this overhead for infrequently accessed data, Hadoop 3 introduced a major feature called erasure coding, which stores data durably while saving a significant amount of space.
  • Improving existing YARN Timeline services: YARN Timeline service version 1 has limitations that impact reliability, performance, and scalability. For example, it uses local-disk-based LevelDB storage that cannot scale to a high number of requests. Moreover, the Timeline server is a single point of failure. To mitigate such drawbacks, YARN Timeline server has been re-architected with the Hadoop 3 release.
  • Optimizing map output collector: It is well known that native code, when written correctly, executes faster. With that in mind, Hadoop 3 adds a native implementation of the map output collector, which the Java-based MapReduce framework uses through the Java Native Interface (JNI). This optimization speeds up mapper tasks by approximately two to three times and is particularly useful for shuffle-intensive operations.
  • The need for a higher availability factor of NameNode: Hadoop is a fault-tolerant platform that can handle multiple DataNode failures. For NameNodes, however, versions prior to Hadoop 3 support only two NameNodes: one active and one standby. While this is a highly available setup, the failure of either the active or the standby NameNode drops the cluster back to non-HA mode, so it cannot tolerate more than one NameNode failure. In Hadoop 3, support for more than one standby NameNode has been introduced.
  • Dependency on Linux ephemeral port range: Linux ephemeral ports are short-lived ports that the operating system (OS) assigns when a process requests any available port. The OS picks the port number from a predefined range and releases the port once the related connection terminates. In version 2 and earlier, the default ports of many Hadoop services fell within the Linux ephemeral port range. As a result, these services sometimes failed to bind to their ports at startup because another process was already using them. In Hadoop 3, these default ports have been moved out of the ephemeral port range.
  • Disk-level data skew: Each DataNode manages multiple disks (or drives). Sometimes, adding or replacing disks leads to significant data skew within a DataNode. To rebalance data among the disks within a DataNode, Hadoop 3 has introduced a CLI utility invoked as hdfs diskbalancer.
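
To make the replication overhead mentioned in the list above concrete, here is a minimal sketch of the arithmetic. The 200% figure follows directly from a replication factor of 3. The Reed-Solomon layout of 6 data cells plus 3 parity cells is an assumption used for illustration (it is not stated in the text above), and the function names are hypothetical.

```python
def storage_overhead_pct(data_units: int, redundancy_units: int) -> float:
    """Extra raw storage as a percentage of the logical data size."""
    return 100.0 * redundancy_units / data_units


def raw_bytes_needed(logical_bytes: float, overhead_pct: float) -> float:
    """Raw capacity consumed to store `logical_bytes` at a given overhead."""
    return logical_bytes * (1 + overhead_pct / 100.0)


# Replication factor 3: one original block plus two extra copies.
REPLICATION_OVERHEAD = storage_overhead_pct(data_units=1, redundancy_units=2)

# Assumed erasure coding layout: 6 data cells protected by 3 parity cells.
ERASURE_CODING_OVERHEAD = storage_overhead_pct(data_units=6, redundancy_units=3)

if __name__ == "__main__":
    logical_tb = 100.0  # hypothetical dataset size in TB
    print(f"3x replication: {REPLICATION_OVERHEAD:.0f}% overhead, "
          f"{raw_bytes_needed(logical_tb, REPLICATION_OVERHEAD):.0f} TB raw")
    print(f"RS(6,3) coding: {ERASURE_CODING_OVERHEAD:.0f}% overhead, "
          f"{raw_bytes_needed(logical_tb, ERASURE_CODING_OVERHEAD):.0f} TB raw")
```

Under these assumptions, a 100 TB dataset needs 300 TB of raw capacity with three-way replication but 150 TB with the assumed erasure coding layout, which is the kind of saving the erasure coding bullet refers to.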
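
The ephemeral port conflict described in the list above can be illustrated with a short check. This is a sketch under stated assumptions: the range 32768 to 60999 is a typical Linux default (the actual range on a host is in /proc/sys/net/ipv4/ip_local_port_range), and the old and new port pairs are illustrative values for a few HDFS services that should be verified against the release notes of your Hadoop version. None of these numbers come from the text above.

```python
# Assumed typical Linux default; read /proc/sys/net/ipv4/ip_local_port_range
# on a real host to get the actual range.
EPHEMERAL_RANGE = (32768, 60999)

# Illustrative (Hadoop 2 default, Hadoop 3 default) port pairs; treat these
# as assumptions and confirm them for your distribution.
DEFAULT_PORTS = {
    "NameNode HTTP": (50070, 9870),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP": (50075, 9864),
}


def in_ephemeral_range(port: int, port_range: tuple = EPHEMERAL_RANGE) -> bool:
    """Return True if the OS could hand `port` out as an ephemeral port."""
    low, high = port_range
    return low <= port <= high


def report() -> None:
    """Print, per service, whether each default port risks a bind conflict."""
    for service, (old_port, new_port) in DEFAULT_PORTS.items():
        print(f"{service}: {old_port} ephemeral={in_ephemeral_range(old_port)}, "
              f"{new_port} ephemeral={in_ephemeral_range(new_port)}")


if __name__ == "__main__":
    report()
```

A port inside the ephemeral range may already be held by an unrelated outgoing connection when the service starts, which is why moving the defaults below the range removes this class of startup failure.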

By now, you should have a clear understanding of why certain features were introduced in Hadoop 3 and what benefits they bring. We will look into these features in detail throughout this book; the intent of this section was only to give you a high-level overview of the major features and the reasons behind them. In the next section, we will look into the Hadoop Logical view.
