Apache Hadoop 3 Quick Start Guide


Product type Book
Published in Oct 2018
Publisher Packt
ISBN-13 9781788999830
Pages 220 pages
Edition 1st Edition
Author Hrishikesh Vijay Karambelkar

Table of Contents (10 chapters)

  • Preface
  • Hadoop 3.0 - Background and Introduction
  • Planning and Setting Up Hadoop Clusters
  • Deep Dive into the Hadoop Distributed File System
  • Developing MapReduce Applications
  • Building Rich YARN Applications
  • Monitoring and Administration of a Hadoop Cluster
  • Demystifying Hadoop Ecosystem Components
  • Advanced Topics in Apache Hadoop
  • Other Books You May Enjoy

Monitoring and Administration of a Hadoop Cluster

Previously, we looked at YARN and gained a deeper understanding of its capabilities. This chapter introduces a process-oriented approach to managing, monitoring, and optimizing your Hadoop cluster. We have already covered part of administration when we set up a single node, a pseudo-distributed node, and a fully fledged distributed Hadoop cluster. We also covered cluster sizing, which is needed as part of the planning activity, and went through some developer and system CLIs in the respective chapters on HDFS, MapReduce, and YARN. Hadoop administration is a vast topic; you will find a lot of books dedicated to it on the market. I will touch on the key points of monitoring, managing, and optimizing your cluster.

We will cover the following topics:

  • Roles and responsibilities of Hadoop...

Roles and responsibilities of Hadoop administrators

Hadoop administration is highly technical work, where professionals need a deep understanding of Hadoop concepts, how Hadoop functions, and how it can be managed. The challenges faced by Hadoop administrators differ from those of similar roles, such as database or network administrators. For example, if you are a DBA, you typically get proactive alerts from the underlying database system when a tablespace threshold is crossed and disk space is no longer available for allocation; you need to act on them, or else operations will fail. In the case of Hadoop, if a job fails on one node due to sizing, the appropriate action is to move the job to another node.

The following are the different responsibilities of a Hadoop administrator:

  • Installation and upgrades of clusters
  • Backup and disaster recovery
  • Application...

Planning your distributed cluster

In this section, we will cover the planning of your distributed cluster. We have already studied cluster sizing, estimation, and the data load aspects of clusters. When you explore the different hardware alternatives, rack servers turn out to be the most suitable option. Although Hadoop claims to support commodity hardware, the nodes still require server-class machines, and you should not consider setting up desktop-class machines. However, unlike high-end databases, Hadoop does not require a high-end server configuration; it can easily work on Intel-based processors, along with standard hard drives. This is where you save on cost.

Reliability is a major aspect to consider while working with any production system. Disk drive reliability is commonly rated using Mean Time Between Failures (MTBF), which varies by disk type. Hadoop is designed to work with hardware...
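
To make the reliability point concrete, the following is a minimal back-of-the-envelope sketch; it is not part of Hadoop itself. It assumes a constant failure rate, so the annualized failure rate is approximated as the number of hours in a year divided by the MTBF. The function names and the sample figures (50 nodes, 12 disks per node, a 1,000,000-hour MTBF) are illustrative assumptions, not values taken from this chapter.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Approximate fraction of drives expected to fail in one year.

    Assumes a constant failure rate, which is a simplification.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return HOURS_PER_YEAR / mtbf_hours


def expected_failures_per_year(nodes: int, disks_per_node: int,
                               mtbf_hours: float) -> float:
    """Expected number of disk failures per year across the whole cluster."""
    total_disks = nodes * disks_per_node
    return total_disks * annualized_failure_rate(mtbf_hours)


if __name__ == "__main__":
    # Hypothetical cluster: 50 nodes, 12 disks each, 1,000,000-hour MTBF.
    print(expected_failures_per_year(50, 12, 1_000_000))
```

With these assumed figures, the estimate works out to a little over five disk failures per year, which illustrates why disk failure should be planned for as a routine event on a large cluster.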

Resource management in Hadoop

As a Hadoop administrator, one of your important activities is to ensure that all of the resources inside the cluster are used optimally. When I refer to a resource, I mean the CPU time, the memory allocated to jobs, the network bandwidth utilization, and the storage space consumed. Administrators can achieve this by balancing the workloads of the jobs that are running in the cluster environment. Once a cluster is set up, it may run different types of jobs, requiring different levels of time- and complexity-based SLAs. Fortunately, Apache Hadoop provides built-in schedulers that allow administrators to prioritize different jobs as per the SLAs defined. So, overall resources can be managed through resource scheduling. All schedulers used in Hadoop use job queues to line up the jobs for prioritization. Among all, the...
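
The idea of dividing cluster resources among job queues can be sketched as follows. This is a simplified toy model, not YARN's actual scheduling algorithm: it only shows how a per-queue percentage could translate into a guaranteed share of cluster memory. The function name, the queue names (`prod`, `adhoc`, `dev`), and the percentages are hypothetical.

```python
from typing import Dict


def queue_guarantees(total_memory_mb: float,
                     capacities: Dict[str, float]) -> Dict[str, float]:
    """Split cluster memory among queues by configured percentage.

    `capacities` maps a queue name to its share in percent; the shares
    must add up to 100. Real schedulers also handle elasticity,
    preemption, and per-user limits, none of which is modeled here.
    """
    if abs(sum(capacities.values()) - 100) > 1e-9:
        raise ValueError("queue capacities must add up to 100 percent")
    return {queue: total_memory_mb * pct / 100
            for queue, pct in capacities.items()}


if __name__ == "__main__":
    # Hypothetical queues on a cluster with 100,000 MB of schedulable memory.
    shares = queue_guarantees(100_000, {"prod": 60, "adhoc": 30, "dev": 10})
    for queue, memory_mb in shares.items():
        print(f"{queue}: {memory_mb:.0f} MB")
```

In this model, jobs with tighter SLAs would be submitted to the queue with the larger guaranteed share, so lower-priority work cannot starve them.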

High availability of Hadoop

We saw the architecture of Apache Hadoop in Chapter 1, Hadoop 3.0 - Background and Introduction. In this section, we will go through the High Availability (HA) feature of Apache Hadoop. HDFS already provides high availability of data through its replication factor. However, in the earlier Apache Hadoop 1.x, the NameNode was a single point of failure because it was the central gateway for accessing data blocks. Similarly, the Resource Manager is responsible for managing resources for MapReduce and YARN applications. We will study both of these points with respect to high availability.

High availability for NameNode

We have understood the challenges faced with Hadoop 1.x, so now let's understand...

Securing Hadoop clusters

Since Apache Hadoop works with large volumes of information, data governance and information security become important. Usually, the cluster is not directly exposed and is used primarily for computation and historical data storage, so the pressure to implement security is lower than for applications running over the web, which demand the highest level of security. However, should there be any need, Hadoop deployments can be made extremely secure. Security in Hadoop covers the following key areas:

  • Data at Rest: How stored data can be encrypted so that no unauthorized party can read it
  • Data in Motion: How data transferred over the wire can be encrypted
  • Secured System Access/APIs
  • Data Confidentiality: How data access is controlled across different users

The good part is, Apache Hadoop ecosystem components...

Performing routine tasks

As a Hadoop administrator, you will have a set of routine activities to carry out. Let's go through some of the most common routine tasks that you would perform in Hadoop administration.

Working with safe mode

When any client performs a write operation on HDFS, the changes get recorded in the edit log. This edit log is flushed at the end of write operations and the information is synced across nodes. Once this operation is complete, the system returns a success flag to the client. This ensures consistency of data and cleaner operation execution. Similarly, the NameNode maintains an fsimage file, which is a data structure that the NameNode uses to keep track of what goes where. This is a checkpoint copy which is...
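
The relationship between the edit log and the fsimage checkpoint can be illustrated with a small conceptual sketch. This is a toy in-memory model under simplifying assumptions, not how HDFS is implemented: the class `ToyNameNode` and its methods are hypothetical, and real HDFS persists both structures to disk and syncs them across nodes.

```python
from typing import Dict, List, Optional, Tuple

EditOp = Tuple[str, str, Optional[List[str]]]


class ToyNameNode:
    """Toy model: an fsimage checkpoint plus an edit log of later changes."""

    def __init__(self) -> None:
        self.fsimage: Dict[str, List[str]] = {}  # path -> block IDs at last checkpoint
        self.edit_log: List[EditOp] = []         # changes recorded since then

    def write(self, path: str, blocks: List[str]) -> bool:
        # Record the change in the edit log, then report success to the client.
        self.edit_log.append(("put", path, list(blocks)))
        return True

    def delete(self, path: str) -> bool:
        self.edit_log.append(("delete", path, None))
        return True

    def namespace(self) -> Dict[str, List[str]]:
        # Current state = checkpoint with the edit log replayed on top of it.
        state = dict(self.fsimage)
        for op, path, blocks in self.edit_log:
            if op == "put" and blocks is not None:
                state[path] = blocks
            else:
                state.pop(path, None)
        return state

    def checkpoint(self) -> None:
        # Fold the edit log into a new fsimage and start a fresh log.
        self.fsimage = self.namespace()
        self.edit_log = []


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.write("/data/a.txt", ["blk_1"])
    nn.write("/data/b.txt", ["blk_2"])
    nn.delete("/data/a.txt")
    print(nn.namespace())
    nn.checkpoint()
    print(len(nn.edit_log))
```

The sketch shows why a checkpoint is useful: after it, the current namespace can be recovered from the fsimage alone, without replaying a long edit log.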

Summary

In this chapter, we have gone through the different activities performed by Hadoop administrators for monitoring and optimizing the Hadoop cluster. We looked at the roles and responsibilities of an administrator, followed by cluster planning. We did a deep dive into key management aspects of the Hadoop cluster, such as resource management through job scheduling with schedulers such as the Fair Scheduler and the Capacity Scheduler. We also looked at ensuring high availability and security for the Apache Hadoop cluster. This was followed by the day-to-day activities of Hadoop administrators, covering adding new nodes, archiving, Hadoop metrics, and so on.

In the next chapter, we will look at Hadoop ecosystem components, which help the business develop big data applications rapidly.
