You're reading from Hadoop 2.x Administration Cookbook
In this chapter, we will look at cluster planning and some of the important aspects of cluster utilization.
Although this is a recipe book, it is good to have an understanding on the Hadoop cluster layout, network components, operating system, disk arrangements, and memory. We will try to cover some of the fundamental concepts on cluster planning and a few formulas to estimate the cluster size.
Let's say we are ready with our big data initiative and want to take the plunge into the Hadoop world. The first of the primary concerns is: what size cluster do we need? How many nodes, and with what configuration? What will be the roadmap in terms of the software/application stack, and what will be the initial investment? What hardware should we choose, and should we go with the vanilla Apache Hadoop distribution or with a vendor-specific Hadoop distribution?
There are no straightforward answers to these questions, and no magic formulas. Many times, these decisions are influenced by market statistics, or by an organizational...
In this recipe, we will calculate the disk storage needed for the Hadoop cluster. Once we know what our storage requirement is, we can plan the number of nodes in the cluster and narrow down on the hardware options we have.
The intent of this recipe is not to tune performance, but to plan for capacity. Users are encouraged to read Chapter 9, HBase Administration on optimizing the Hadoop cluster.
To step through the recipe in this section, we need a Hadoop cluster set up and running. We need at least the HDFS configured correctly. It is recommended to complete the first two chapters before starting with this recipe.
Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop.
On the master node, execute the following command:
$ hdfs dfsadmin -report
This command will give you an understanding about how the storage in the cluster is represented. The total cluster storage is a summation of storages from each of the...
In this recipe, we will look at the number of nodes needed in the cluster based upon the storage requirements.
From the initial Disk space calculations recipe, we estimated that we need about 2 PB of storage for our cluster. In this recipe, we will estimate the number of nodes required for running a stable Hadoop cluster.
To step through the recipe, the user needs to have understood the Hadoop cluster daemons and their roles. It is recommended to have a cluster running with healthy HDFS and at least two Datanodes.
Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop.
Execute the command shown here to see the Datanodes available and the disk space on each node:
$ hdfs dfsadmin -report
From the preceding command, we can tell the storage available per node, but we cannot tell the number of disks that make up that storage. Refer to the following screenshot for details:
Log in to a Datanode dn6.cyrus.com and...
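Once the per-node disk layout is known, the node count follows by simple division. A sketch with assumed numbers (2 PB of raw storage required, and Datanodes fitted with 12 x 4 TB disks; both figures are hypothetical):

```shell
# Hypothetical inputs: 2 PB (2048 TB) of raw storage required, and
# Datanodes fitted with 12 x 4 TB data disks each.
raw_tb=2048
disks_per_node=12
disk_tb=4

per_node_tb=$((disks_per_node * disk_tb))              # 48 TB per Datanode
nodes=$(( (raw_tb + per_node_tb - 1) / per_node_tb ))  # round up
echo "Datanodes needed: $nodes"
```

In practice, you would add headroom on top of this for node failures and future growth rather than running at the bare minimum.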
In this recipe, we will look at the memory requirements per node in the cluster, especially looking at the memory on Datanodes as a factor of storage.
A large cluster with many Datanodes is of little use if the nodes do not have sufficient memory to serve requests. A Namenode stores the entire metadata in memory and also has to take care of the block reports sent by the Datanodes in the cluster. The larger the cluster, the larger the block reports, and the more resources the Namenode will require.
The intent of this recipe is not to tune memory settings, but to give an estimate of the memory required per node.
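For the Namenode specifically, a rough estimate can be sketched with the commonly quoted rule of thumb of about 1 GB of heap per million HDFS objects (files plus blocks). The data size and file-to-block ratio below are hypothetical:

```shell
# Hypothetical inputs: 500 TB of logical data in 128 MB blocks,
# with on average one file per two blocks.
data_mb=$((500 * 1024 * 1024))
block_mb=128

blocks=$((data_mb / block_mb))     # ~4.1 million blocks
files=$((blocks / 2))
objects=$((blocks + files))

# Rule of thumb: ~1 GB of Namenode heap per million objects (rounded up).
heap_gb=$(( (objects + 999999) / 1000000 ))
echo "Estimated Namenode heap: ${heap_gb} GB"
```

This is only a sizing heuristic, not a guarantee; many small files inflate the object count (and thus the heap) far faster than the raw data volume suggests.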
To complete this recipe, the user must have completed the Disk space calculations recipe and the Nodes needed in the cluster recipe. For better understanding, the user must have a running cluster with HDFS and YARN configured and must have played around with Chapter 1, Hadoop Architecture and Deployment and Chapter 2, Maintaining...
In this recipe, we will look at how service-level agreements can impact our decision to size the clusters. In an organization, there will often be multitenant clusters, funded by different business units that each ask for a guarantee of their share.
A good thing about YARN is that multiple users can run different jobs such as MapReduce, Hive, Pig, HBase, Spark, and so on. While YARN guarantees what it needs to start a job, it does not control how the job will finish. Users can still step on each other and cause an impact on SLAs.
For this recipe, the users must have completed the Memory requirements and Nodes needed in the cluster recipes. It is good to have a running cluster with HDFS and YARN to run quick commands for reference. It is also good to understand the scheduler recipes covered in Chapter 5, Schedulers.
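One common way to back such a guarantee is with the Capacity Scheduler (covered in Chapter 5, Schedulers), which gives each business unit a dedicated queue with a guaranteed share. The following capacity-scheduler.xml fragment is a sketch; the queue names and percentages are invented for illustration:

```xml
<!-- Hypothetical queues "finance" and "analytics"; percentages are examples. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>finance,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.finance.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>
<property>
  <!-- Cap elasticity so one tenant cannot starve the other. -->
  <name>yarn.scheduler.capacity.root.finance.maximum-capacity</name>
  <value>80</value>
</property>
```

Note that, as the recipe points out, this guarantees resources at container allocation time; it does not control how long a job holds them once running.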
In this recipe, we will be looking at the network design for the Hadoop cluster and what things to consider for planning a Hadoop cluster.
Make sure that the user has a running cluster with HDFS and YARN and has at least two nodes in the cluster.
Connect to the master1.cyrus.com Namenode and switch to the user hadoop.
Execute the following commands to check the link speed and other network options:
$ ethtool eth0
$ iftop
$ netstat -s
Always have a separate network for Hadoop traffic by using VLANs.
Ensure the DNS resolution works for both forward and reverse lookup.
Run a caching-only DNS within the Hadoop network, which caches records for faster resolution.
Consider NIC teaming or binding for better performance.
Use dedicated core switches and rack top switches.
Consider having static IPs per node in the cluster.
Disable IPv6 for all nodes and just use IPv4.
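The forward/reverse DNS point above is easy to verify per node. A small helper sketch (dn6.cyrus.com is this book's example hostname; substitute your own nodes):

```shell
# Check that forward and reverse DNS agree for a host; a mismatch here
# commonly breaks Datanode registration with the Namenode.
check_dns() {
  local host=$1 ip rev
  ip=$(getent hosts "$host" | awk '{print $1; exit}')
  [ -n "$ip" ] || { echo "no forward record for $host"; return 1; }
  rev=$(getent hosts "$ip" | awk '{print $2; exit}')
  [ "$rev" = "$host" ]
}

check_dns localhost && echo "DNS OK for localhost"
# check_dns dn6.cyrus.com   # run for every node in the cluster
```

Running this for every node before starting the daemons catches resolution problems that are otherwise painful to debug from Hadoop logs.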
Increasing the size of the cluster will mean more connections and more data across nodes...
In this recipe, we will estimate the costing for the Hadoop cluster and see what factors to take into account. The exact figures will vary according to the hardware and software choices.
The Hadoop cluster is a combination of servers, network components, power consumption, man hours to maintain it, software license costs, and cooling costs.
In this recipe, there is nothing to execute by logging into the cluster; rather, it is an estimation governed by the factors mentioned below.
Each server in the Hadoop cluster will fall into at least one of three categories: Master nodes, Datanodes, and Edge nodes.
Costing master nodes: Intensive on memory and CPU, but they need less disk space.
Two OS disks in Raid 1 configuration
Two disks for logs and two disks for Namenode metadata
At least two network cards bonded together, with a minimum of 1 Gbps each
RAM 128 GB, higher if the HBase master is co-located on a Namenode
CPU cores per master...
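Putting these factors together, a back-of-the-envelope total can be sketched. Every figure below is an invented placeholder, to be replaced with real vendor quotes:

```shell
# All figures below are hypothetical placeholders (USD).
datanodes=43             # from the node-count estimate
datanode_cost=6000       # per Datanode server
masters=3
master_cost=12000        # per master node server
network_cost=40000       # core + top-of-rack switches, cabling
power_cooling_yr=25000   # per year
admin_yr=120000          # staffing per year
years=3

capex=$((datanodes * datanode_cost + masters * master_cost + network_cost))
opex=$(( (power_cooling_yr + admin_yr) * years ))
echo "Estimated ${years}-year cost: \$$((capex + opex))"
```

Separating capex from opex this way makes it easier to compare on-premise hardware refresh cycles against cloud-style pay-as-you-go pricing.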
In this recipe, we will discuss the hardware and software options to take into account while planning the Hadoop cluster.
There are many vendors for hardware and software, and the options can be overwhelming. Some important things that must be taken into account are as follows:
Run benchmark tests on different hardware, for example from HP, IBM, or Dell, and compare the throughput per unit cost.
What is the roadmap for the hardware you choose? How long will the vendor support it?
Every year, the new hardware will be a better value for compute per unit. What will be the buyback strategy for the old hardware? Will the vendor take back the old hardware and give the new hardware at discounted rates?
Does the hardware have tightly coupled components, which could be difficult to replace in isolation?
What software options does the user have in terms of vendors? Should we go for the HDP, Cloudera, or MapR distribution, or use the Apache Hadoop distribution?
Total cost...