
You're reading from Hadoop 2.x Administration Cookbook

Product type: Book
Published in: May 2017
Publisher: Packt
ISBN-13: 9781787126732
Edition: 1st Edition
Author (1)

Aman Singh

Gurmukh Singh is a seasoned technology professional with over 14 years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in the big data domain for the last 5 years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo, and is the author of Monitoring Hadoop, published by Packt Publishing.

Chapter 5. Schedulers

In this chapter, we will cover the following recipes:

  • Configuring users and groups

  • Fair Scheduler configuration

  • Fair Scheduler pools

  • Configuring job queues

  • Job queue ACLs

  • Configuring Capacity Scheduler

  • Queuing mappings in Capacity Scheduler

  • YARN and Mapred commands

  • YARN label-based scheduling

  • YARN SLS

Introduction


In this chapter, we will configure YARN schedulers and job queues so that multiple users can use the cluster at the same time and make legitimate use of the resources provided to them. There are two approaches: either set up separate clusters for different business units, or share a single cluster among them.

The first approach is fine if there are a few clusters, but managing a large number of clusters is challenging. A better approach is to build multitenancy clusters, which can support different users with varied use cases.

Note

If you find it difficult to write the scripts or configurations, they are all available in my GitHub repository at https://github.com/netxillon/hadoop.

Configuring users and groups


In our previous recipes, we installed and configured the Hadoop cluster as the user hadoop. In production, however, it is good practice to run jobs as different users, and the Hadoop daemons themselves can be run under different user IDs, so as to have better control and security.

The security aspects will be covered in the security chapter, but it is important to understand user segregation and how to group users per project or business unit.

In this recipe, we will see how to create users and groups for job submission. This recipe does not cover HDFS user permissions or file ACLs, only the permission to submit jobs and the percentage of cluster capacity each user or department can use within an organization.

Getting ready

Before tackling the recipes in this chapter, make sure you have gone through the previous recipes or have at least gone through the steps to install the Hadoop cluster. In addition to this, the user must know the basics of Linux User Management...
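As a sketch of the Linux side of this, the commands below create department groups and users; the group and user names are illustrative assumptions, not names taken from the book:

```bash
# Run as root on the node from which jobs are submitted.
# One group per department (names are illustrative)
groupadd marketing
groupadd sales

# Create users and place each in its department's group
useradd -g marketing user1
useradd -g sales user2

# Each user also needs an HDFS home directory, created by the hadoop superuser
su - hadoop -c "hdfs dfs -mkdir -p /user/user1 && hdfs dfs -chown user1:marketing /user/user1"
su - hadoop -c "hdfs dfs -mkdir -p /user/user2 && hdfs dfs -chown user2:sales /user/user2"
```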

Fair Scheduler configuration


Getting ready

To go through the recipe in this section, we need a Hadoop cluster set up and running. By default, the Apache Hadoop 1.x distribution uses the FIFO scheduler, and Hadoop 2.x uses the Capacity Scheduler. In a cluster with multiple jobs, the FIFO scheduler is a poor choice, as it starves jobs of resources: only the very first job in the queue is executed, and all other jobs have to wait.

To address the preceding issue, there are two commonly used schedulers, the Fair Scheduler and the Capacity Scheduler, which allocate the cluster resources in a fair manner. In this recipe, we will see how to configure the Fair Scheduler. Simply put, the Fair Scheduler shares resources fairly among running jobs, based on the queues and weights assigned.

How to do it...

  1. Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop.

  2. Edit the yarn-site.xml as follows:

    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache...
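The value above is truncated in this extract. For reference, the scheduler class for the Fair Scheduler is shown in full below, together with the property that points YARN at the allocation file; the file path is an assumption, so adjust it to your installation:

```xml
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
    <!-- where the Fair Scheduler allocation file lives; this path is an assumption -->
    <name>yarn.scheduler.fair.allocation.file</name>
    <value>/home/hadoop/hadoop/etc/hadoop/fair-scheduler.xml</value>
</property>
```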

Fair Scheduler pools


In this recipe, we look at configuring the Fair Scheduler with pools instead of queues. This is for backwards compatibility: in Hadoop 1.x, the Fair Scheduler used the term pools, which mean the same thing as queues.

It is recommended to use queues, as that terminology is standard across the board, but for completeness it is good to cover the concept of pools.

Getting ready

To go through this recipe, complete the previous recipe and simply modify the fair-scheduler.xml file to use pools.

How to do it...

  1. Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop.

  2. Edit the allocation file fair-scheduler.xml, as shown in the following screenshot:

  3. Copy the fair-scheduler.xml file to all the nodes in the cluster and restart the YARN daemons.

  4. Check the ResourceManager page to confirm whether the pools are visible or not, as shown in the following screenshot:

  5. Submit a sample job such as wordcount as user hadoop and see the ResourceManager page, as shown in...
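The screenshots are not reproduced in this extract. A minimal fair-scheduler.xml using the legacy pool element might look like the following; the pool names, resources, and weights are illustrative assumptions (in the YARN allocation file, pool is accepted as an alias for queue):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- "pool" is a backwards-compatible alias for "queue" -->
  <pool name="dev">
    <minResources>1024 mb,1 vcores</minResources>
    <weight>1.0</weight>
  </pool>
  <pool name="prod">
    <minResources>2048 mb,2 vcores</minResources>
    <weight>2.0</weight>
  </pool>
</allocations>
```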

Configuring job queues


In this recipe, we will configure job queues and allow users to submit jobs to them. In production, there may be many departments, such as marketing, sales, and finance, sharing cluster resources, and it is important that each has a share proportional to its business priority and funding.

In the previous recipe, although queues were set up, they were not yet used. Queues were created dynamically when a job was submitted with a queue specified; if no queue is specified, the job is submitted to a queue named after the user who submitted it. We will explore this a bit more in this recipe.

Getting ready

To complete the recipe, the user must have a running cluster with HDFS and YARN configured and must have completed the previous two recipes.

How to do it...

  1. Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop.

  2. Edit the fair-scheduler.xml allocation file as shown next. Note that there is no user specified for any queue (within the ...
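The allocation file itself is cut off in this extract. A minimal sketch with named queues and no users assigned to any queue might look like this; the queue names and weights are illustrative assumptions:

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="marketing">
    <weight>2.0</weight>
  </queue>
  <queue name="sales">
    <weight>1.0</weight>
  </queue>
  <queuePlacementPolicy>
    <!-- use the queue named at submission time; otherwise fall back
         to a queue named after the submitting user -->
    <rule name="specified" />
    <rule name="user" />
  </queuePlacementPolicy>
</allocations>
```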

Job queue ACLs


In YARN, whenever a job is submitted, it goes either to a specified queue or to the default queue. This behavior was explored in the previous recipe. In this recipe, we will configure ACLs to block users from submitting jobs to other queues.

Getting ready

In order to get started, you will need a running cluster with HDFS and YARN set up properly, and an understanding of the previous recipe.

Note

This feature is not yet production ready and is scheduled to be a standard feature in Hadoop 2.9.0, but users can still play with it and test it. Users will not see much improvement for small jobs with very few jars or common code.

How to do it...

  1. Connect to the master1.cyrus.com master node and switch to the user hadoop.

  2. Execute the command shown in the following screenshot to see the queue ACLs. By default, you can see that the user d1 has administrative and submit rights to all the queues:

  3. Edit the fair-scheduler.xml allocation file as shown next. Note the users specified for the specific queues (within the...
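As a sketch of what such an allocation file can contain, the fragment below restricts who may submit to each queue; the queue and user names are illustrative. Two points to keep in mind: an ACL takes the form "users groups" (comma-separated lists separated by a space, with a trailing space meaning no groups), and yarn.acl.enable must be set to true in yarn-site.xml. Also, because a submission is allowed if either the queue's ACL or an ancestor's ACL permits it, the root queue must be locked down first:

```xml
<?xml version="1.0"?>
<allocations>
  <queue name="root">
    <!-- lock down root; child queues would otherwise inherit its permissive default -->
    <aclSubmitApps> </aclSubmitApps>
    <aclAdministerApps> </aclAdministerApps>
    <queue name="prod">
      <aclSubmitApps>hadoop </aclSubmitApps>
    </queue>
    <queue name="dev">
      <aclSubmitApps>d1 </aclSubmitApps>
    </queue>
  </queue>
</allocations>
```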

Configuring Capacity Scheduler


Capacity Scheduler is designed mainly for multitenancy, where multiple organizations collectively fund the cluster based on their computing needs. An added benefit is that an organization can access any excess capacity not being used by others, which provides elasticity in a cost-effective manner.

Getting ready

For this recipe, you will again need a running cluster with YARN and HDFS configured in the cluster. Readers are recommended to read the previous recipes in this chapter to understand this recipe better.

In Hadoop 2.x, the Capacity Scheduler is the default and is enabled out of the box, unless explicitly changed, as in the previous recipes where we configured the Fair Scheduler.

How to do it...

  1. Connect to the master1.cyrus.com master node and switch to the user hadoop.

  2. Modify the yarn-site.xml file by changing the following parameter:

    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value...
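The value above is truncated in this extract; the full Capacity Scheduler class name is shown below. Alongside it is a sketch of a capacity-scheduler.xml defining two queues; the queue names and capacity percentages are illustrative assumptions:

```xml
<!-- yarn-site.xml -->
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<!-- capacity-scheduler.xml: two queues under root; capacities must sum to 100 -->
<property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>prod,dev</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>60</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>40</value>
</property>
```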

Queuing mappings in Capacity Scheduler


In this recipe, we will configure which users can submit jobs to each queue and also set rules for various job submissions.

Let's look at another use case: if the user hadoop submits a job, it should go to the prod queue, and if any other user submits a job, it must go to the dev queue. How can we set up something like this?

Getting ready

Make sure that the user has a running cluster with HDFS and YARN configured. It's best to have gone through at least the previous recipe.

How to do it...

  1. Connect to the master1.cyrus.com Namenode and switch to the user hadoop.

  2. Edit the capacity-scheduler.xml allocation file as shown next:

    <property>
        <name>yarn.scheduler.capacity.queue-mappings</name>
        <value>u:d1:dev,g:group1:default,u:hadoop:prod</value>
    </property>
  3. Make the preceding changes and copy the file across all nodes and restart the YARN daemons.

  4. Whenever the d1 user submits a job, it should go to the dev queue and for user...
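Two details about the mapping syntax are worth spelling out. The u:<user>:<queue> form maps a user and g:<group>:<queue> maps a group; the list is evaluated left to right and the first match wins, which is why u:d1:dev still sends d1 to the dev queue even if d1 belongs to group1. If you also want these mappings to take precedence over a queue named explicitly at submission time, enable the override flag:

```xml
<property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>true</value>
</property>
```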

YARN and Mapred commands


In this recipe, we will cover a few of the important commands which can help with the administration of YARN.

Until now, we have restarted the YARN daemons after every configuration change, but this is not always required if the parameters already exist in the configuration files: the queue list and capacities can be updated dynamically at runtime. In this recipe, we will cover a few easy ways to make such changes.

Getting ready

For this recipe, you will again need a running cluster with at least HDFS and YARN configured and have the queue setup as discussed in this chapter. It is recommended that users go through all the previous recipes in this chapter before following this particular recipe.

How to do it...

  1. Connect to the master1.cyrus.com master node and switch to the user hadoop.

  2. The first thing is to list the queues configured using the following command:

    $ mapred queue -list
    
  3. The preceding command will list queues, along with their status as shown...
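Beyond mapred queue -list, a few related commands are sketched below; all of them are standard in Hadoop 2.x, and the queue name dev is an illustrative assumption:

```bash
$ mapred queue -showacls            # operations the current user may perform on each queue
$ mapred queue -info dev -showJobs  # queue details plus the jobs submitted to it
$ yarn queue -status dev            # state, capacity, and current usage of one queue
$ yarn rmadmin -refreshQueues       # re-read the scheduler configuration without a restart
$ yarn application -list            # applications currently running on the cluster
```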

YARN label-based scheduling


In this recipe, we will configure YARN label-based scheduling. A cluster can contain a mixture of nodes with different configurations, some with more memory and CPU than others.

If we want to control which set of nodes a job executes on, we need to assign labels to the nodes. A typical case is a Spark streaming job that you want to run on high-memory nodes. For such a situation, we configure a queue and assign a set of nodes to it, so that any job submitted to that queue executes on the nodes with more memory and cores.

Getting ready

Make sure that you have a running cluster with at least two Datanodes and YARN working properly. Basic knowledge of queues in Hadoop is expected; refer to the previous few recipes in this chapter if needed.

How to do it...

  1. Connect to the master1.cyrus.com master node and switch to the user hadoop...
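As a sketch of the pieces involved: node labels are first enabled in yarn-site.xml, then created and attached to nodes with yarn rmadmin. The hostname dn1.cyrus.com, the label name highmem, and the HDFS store path below are illustrative assumptions:

```bash
# yarn-site.xml (all nodes):
#   yarn.node-labels.enabled           = true
#   yarn.node-labels.fs-store.root-dir = hdfs://master1.cyrus.com:9000/yarn/node-labels

# Create the label and attach it to a high-memory Datanode
$ yarn rmadmin -addToClusterNodeLabels "highmem"
$ yarn rmadmin -replaceLabelsOnNode "dn1.cyrus.com=highmem"

# Verify
$ yarn cluster --list-node-labels
```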

YARN SLS


In this recipe, we will take a look at the YARN simulator, which is useful for testing and determining YARN's scheduling behavior under various load conditions.

The YARN Scheduler Load Simulator (SLS) is a tool that can simulate large-scale YARN clusters and application loads on a single machine, within a single JVM.

Getting ready

For this recipe, you will need a single machine with Hadoop installed. For this, readers can refer to the first chapter, where we have covered a single node cluster setup.

How to do it...

  1. Connect to the master1.cyrus.com single node and switch to the user hadoop.

  2. The SLS is located at $HADOOP_HOME/share/hadoop/tools/sls/.

  3. The SLS runs the simulator using the sls-runner.xml configuration file under $HADOOP_HOME/etc/hadoop.

  4. A sample file is located at $HADOOP_HOME/share/hadoop/tools/sls/sample-conf/sls-runner.xml.

  5. The sls-runner.xml file contains parameters to tune the number of threads, container memory, NodeManager settings, and many other options, as well as the choice of scheduler you want.

  6. The...
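A typical invocation is sketched below, under the assumption that the sample trace file shipped with your distribution carries the name shown; verify the file names against your Hadoop version:

```bash
$ cd $HADOOP_HOME/share/hadoop/tools/sls
$ cp sample-conf/sls-runner.xml $HADOOP_HOME/etc/hadoop/
$ bin/slsrun.sh --input-rumen=sample-data/2jobs2min-rumen-jh.json \
                --output-dir=/tmp/sls-output
```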

