Hadoop Operations and Cluster Management Cookbook

Over 60 recipes showing you how to design, configure, manage, monitor, and tune a Hadoop cluster

Shumin Guo


Book Details

ISBN 13: 9781782165163
Paperback: 368 pages

About This Book

  • Hands-on recipes to configure a Hadoop cluster from bare metal hardware nodes
  • Practical and in-depth explanations of cluster management commands
  • Easy-to-understand recipes for securing and monitoring a Hadoop cluster, along with cluster design considerations
  • Recipes showing you how to tune the performance of a Hadoop cluster
  • Learn how to build a Hadoop cluster in the cloud

Who This Book Is For

If you are a system administrator looking to get a good grounding in setting up and managing a Hadoop cluster, then this book is for you. It is assumed that you already have some experience with the Unix/Linux command line and are familiar with the basics of network communication.

Table of Contents

Chapter 1: Big Data and Hadoop
Introduction
Defining a Big Data problem
Building a Hadoop-based Big Data platform
Choosing from Hadoop alternatives
Chapter 2: Preparing for Hadoop Installation
Introduction
Choosing hardware for cluster nodes
Designing the cluster network
Configuring the cluster administrator machine
Creating the kickstart file and boot media
Installing the Linux operating system
Installing Java and other tools
Configuring SSH
Chapter 3: Configuring a Hadoop Cluster
Introduction
Choosing a Hadoop version
Configuring Hadoop in pseudo-distributed mode
Configuring Hadoop in fully-distributed mode
Validating Hadoop installation
Configuring ZooKeeper
Installing HBase
Installing Hive
Installing Pig
Installing Mahout
Chapter 4: Managing a Hadoop Cluster
Introduction
Managing the HDFS cluster
Configuring SecondaryNameNode
Managing the MapReduce cluster
Managing TaskTracker
Decommissioning DataNode
Replacing a slave node
Managing MapReduce jobs
Checking job history from the web UI
Importing data to HDFS
Manipulating files on HDFS
Configuring the HDFS quota
Configuring CapacityScheduler
Configuring Fair Scheduler
Configuring Hadoop daemon logging
Configuring Hadoop audit logging
Upgrading Hadoop
Chapter 5: Hardening a Hadoop Cluster
Introduction
Configuring service-level authentication
Configuring job authorization with ACL
Securing a Hadoop cluster with Kerberos
Configuring web UI authentication
Recovering from NameNode failure
Configuring NameNode high availability
Configuring HDFS federation
Chapter 6: Monitoring a Hadoop Cluster
Introduction
Monitoring a Hadoop cluster with JMX
Monitoring a Hadoop cluster with Ganglia
Monitoring a Hadoop cluster with Nagios
Monitoring a Hadoop cluster with Ambari
Monitoring a Hadoop cluster with Chukwa
Chapter 7: Tuning a Hadoop Cluster for Best Performance
Introduction
Benchmarking and profiling a Hadoop cluster
Analyzing job history with Rumen
Benchmarking a Hadoop cluster with GridMix
Using Hadoop Vaidya to identify performance problems
Balancing data blocks for a Hadoop cluster
Choosing a proper block size
Using compression for input and output
Configuring speculative execution
Setting proper number of map and reduce slots for the TaskTracker
Tuning the JobTracker configuration
Tuning the TaskTracker configuration
Tuning shuffle, merge, and sort parameters
Configuring memory for a Hadoop cluster
Setting proper number of parallel copies
Tuning JVM parameters
Configuring JVM Reuse
Configuring the reducer initialization time
Chapter 8: Building a Hadoop Cluster with Amazon EC2 and S3
Introduction
Registering with Amazon Web Services (AWS)
Managing AWS security credentials
Preparing a local machine for EC2 connection
Creating an Amazon Machine Image (AMI)
Using S3 to host data
Configuring a Hadoop cluster with the new AMI

What You Will Learn

  • Defining your big data problem
  • Designing and configuring a pseudo-distributed Hadoop cluster
  • Configuring a fully distributed Hadoop cluster and tuning your Hadoop cluster for better performance
  • Managing the HDFS and MapReduce clusters
  • Configuring Hadoop logging, auditing, and job scheduling
  • Hardening the Hadoop cluster with security and access control methods
  • Monitoring a Hadoop cluster with tools such as Chukwa, Ganglia, Nagios, and Ambari
  • Setting up a Hadoop cluster on the Amazon cloud
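
As a taste of the configuration recipes, a pseudo-distributed Hadoop 1.x setup needs only a handful of properties across three files. The sketch below is a minimal illustration; the port numbers are conventional choices, not requirements:

```xml
<!-- conf/core-site.xml: point the default filesystem at a local NameNode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single machine, so one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

With these in place, formatting the NameNode (`hadoop namenode -format`), starting the daemons (`start-all.sh`), and listing the root directory (`hadoop fs -ls /`) confirms the single-node cluster is up.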

In Detail

We are facing an avalanche of data. The unstructured data we gather can contain many insights that could hold the key to business success or failure, and the ability to analyze and process this data with Hadoop is one of the most highly sought-after skills in today's job market. By combining the computing and storage power of a large number of commodity machines, Hadoop solves this large-scale data processing problem in an elegant way.

Hadoop Operations and Cluster Management Cookbook is a practical and hands-on guide for designing and managing a Hadoop cluster. It will help you understand how Hadoop works and guide you through cluster management tasks.

This book explains real-world, big data problems and the features of Hadoop that enable it to handle such problems. It breaks down the mystery of a Hadoop cluster and will guide you through a number of clear, practical recipes that will help you to manage a Hadoop cluster.

We will start by installing and configuring a Hadoop cluster, while explaining hardware selection and networking considerations. We will also cover securing a Hadoop cluster with Kerberos, configuring cluster high availability, and monitoring a cluster. And if you want to know how to build a Hadoop cluster on the Amazon EC2 cloud, then this is the book for you.
