Hadoop Operations and Cluster Management Cookbook

By Shumin Guo
About this book

We are facing an avalanche of data. The unstructured data we gather can contain many insights that hold the key to business success or failure, and the ability to analyze and process this data with Hadoop is one of the most highly sought-after skills in today's job market. By combining the computing and storage power of a large number of commodity machines, Hadoop solves this problem in an elegant way.

Hadoop Operations and Cluster Management Cookbook is a practical and hands-on guide for designing and managing a Hadoop cluster. It will help you understand how Hadoop works and guide you through cluster management tasks.

This book explains real-world Big Data problems and the features of Hadoop that enable it to handle such problems. It breaks down the mystery of a Hadoop cluster and guides you through a number of clear, practical recipes for managing one.

We will start by installing and configuring a Hadoop cluster, while explaining hardware selection and networking considerations. We will also cover securing a Hadoop cluster with Kerberos, configuring cluster high availability, and monitoring a cluster. And if you want to know how to build a Hadoop cluster on the Amazon EC2 cloud, then this is the book for you.

Publication date: July 2013
Publisher: Packt
Pages: 368
ISBN: 9781782165163

 

Chapter 1. Big Data and Hadoop

In this chapter, we will cover:

  • Defining a Big Data problem

  • Building a Hadoop-based Big Data platform

  • Choosing from Hadoop alternatives

 

Introduction


Today, many organizations face the Big Data problem. Managing and processing Big Data poses many challenges for traditional data processing platforms, such as relational database systems. Hadoop was designed to be a distributed and scalable system for dealing with Big Data problems.

The design, implementation, and deployment of a Big Data platform require a clear definition of the Big Data problem by system architects and administrators. A Hadoop-based Big Data platform uses Hadoop as the data storage and processing engine: it deals with the problem by transforming the Big Data input into the expected output. On one hand, the Big Data problem determines how the Big Data platform should be designed, for example, which modules or subsystems should be integrated into the platform. On the other hand, the architectural design of the platform determines its complexity and efficiency.

Different Big Data problems have different properties. A Hadoop-based Big Data platform is capable of dealing with most Big Data problems, but might not be a good fit for others. Because of these and other reasons, we sometimes need to choose from the Hadoop alternatives.

 

Defining a Big Data problem


Generally, Big Data is defined as data of such a large size that it goes beyond the ability of commonly used software tools to collect, manage, and process it within a tolerable elapsed time. More formally, the definition of Big Data should go beyond the size of the data to include other properties. In this recipe, we will outline the properties that define Big Data in a formal way.

Getting ready

Classically, Big Data has the following three important properties: volume, velocity, and variety. In this book, we treat the value property of Big Data as the fourth important property; the value property also explains why the Big Data problem exists.

How to do it…

Defining a Big Data problem involves the following steps:

  1. Estimate the volume of data. The volume should include not only the current data volume, for example in gigabytes or terabytes, but also the expected volume in the future.

    There are two types of data in the real world: static and nonstatic data. The volume of static data, for example national census data and human genomic data, does not change over time, while the volume of nonstatic data, such as streaming log data and social network streams, increases over time.

  2. Estimate the velocity of data. The velocity estimate should include how much data is generated within a certain amount of time, for example during a day. For static data, the velocity is zero.

    The velocity property of Big Data defines the speed at which data is generated. This property not only affects the volume of data, but also determines how fast a data processing system should handle the data. A back-of-envelope sizing sketch combining volume and velocity follows this list.

  3. Identify the data variety. The data variety refers to the different sources of data, such as web click data, social network data, data in relational databases, and so on.

    Variety means that data differs syntactically or semantically. These differences require specifically designed modules to be integrated into the Big Data platform for each data variety. For example, a web crawler is needed for getting data from the Web, and a data translation module is needed to transfer data from relational databases into a nonrelational Big Data platform.

  4. Define the expected value of data.

    The value property of Big Data defines what we can potentially derive from Big Data and how we can use it. For example, frequent item sets can be mined from online click-through data for better marketing and more efficient deployment of advertisements.
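
As a back-of-envelope illustration of steps 1 and 2, the following Java sketch projects how much raw data and HDFS storage a cluster would need over two years. The daily ingest rate, growth rate, and projection window are hypothetical figures chosen for illustration; only the threefold replication reflects the HDFS default.

    // CapacityEstimate.java -- a hypothetical sizing sketch, not taken from the book.
    public class CapacityEstimate {
        public static void main(String[] args) {
            double dailyIngestGB = 50.0; // assumed velocity: ~50 GB of new data per day
            double monthlyGrowth = 0.05; // assumed 5% month-over-month ingest growth
            int replication = 3;         // HDFS default block replication factor

            double totalRawGB = 0.0;
            for (int month = 0; month < 24; month++) { // project the volume over two years
                totalRawGB += dailyIngestGB * 30;      // one month of ingest
                dailyIngestGB *= 1 + monthlyGrowth;    // ingest rate keeps growing
            }
            System.out.printf("Projected raw data:   %.1f TB%n", totalRawGB / 1024);
            System.out.printf("Projected HDFS usage: %.1f TB (x%d replication)%n",
                    totalRawGB * replication / 1024, replication);
        }
    }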

How it works…

A Big Data platform can be described with the IPO (http://en.wikipedia.org/wiki/IPO_Model) model, which includes three components: input, process, and output. For a Big Data problem, the volume, velocity, and variety properties together define the input of the system, and the value property defines the output.

See also

  • The Building a Hadoop-based Big Data platform recipe

 

Building a Hadoop-based Big Data platform


Hadoop was first developed as a Big Data processing system in 2006 at Yahoo!. The idea is based on Google's MapReduce, which Google first described in a paper on its proprietary MapReduce implementation. In the past few years, Hadoop has become a widely used platform and runtime environment for the deployment of Big Data applications. In this recipe, we will outline steps to build a Hadoop-based Big Data platform.

Getting ready

Hadoop was designed to be parallel and resilient. It redefines the way that data is managed and processed by leveraging the power of computing resources composed of commodity hardware, and it can automatically recover from failures.

How to do it…

Use the following steps to build a Hadoop-based Big Data platform:

  1. Design, implement, and deploy data collection or aggregation subsystems. The subsystems should transfer data from different data sources to Hadoop-compatible data storage systems such as HDFS and HBase.

    The subsystems need to be designed based on the input properties of a Big Data problem, including volume, velocity, and variety.

  2. Design, implement, and deploy the Hadoop Big Data processing platform. The platform should consume the Big Data located on HDFS or HBase and produce the expected, valuable output.

  3. Design, implement, and deploy result delivery subsystems. The delivery subsystems should transform the analytical results from a Hadoop-compatible format into a format suitable for end users. For example, we can design web applications that visualize the analytical results using charts, graphs, and other dynamic elements.

How it works…

The architecture of a Hadoop-based Big Data system is built from the layers and components described below.

Although Hadoop borrows its idea from Google's MapReduce, it is more than MapReduce. A typical Hadoop-based Big Data platform includes the Hadoop Distributed File System (HDFS), the parallel computing framework (MapReduce), common utilities, a column-oriented data store (HBase), high-level data management systems (Pig and Hive), a Big Data analytics library (Mahout), a distributed coordination system (ZooKeeper), a workflow management module (Oozie), data transfer modules such as Sqoop, data aggregation modules such as Flume, and data serialization modules such as Avro.

HDFS is the default filesystem of Hadoop. It was designed as a distributed filesystem that provides high-throughput access to application data. Data on HDFS is stored as data blocks. The data blocks are replicated on several computing nodes, and their checksums are computed. In case of a checksum error or a system failure, erroneous or lost data blocks can be recovered from replicas located on other nodes.
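
As a minimal sketch of how a client interacts with HDFS, the following Java fragment writes a file, reads it back, and queries its replication factor through the org.apache.hadoop.fs.FileSystem API. The file path and message are hypothetical; the snippet assumes a Hadoop configuration on the classpath that points at a running cluster.

    // HdfsReadWrite.java -- a minimal HDFS client sketch.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml and friends
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/hello.txt");   // hypothetical path
            FSDataOutputStream out = fs.create(file); // the client writes once;
            out.writeUTF("Hello, HDFS!");             // HDFS replicates the blocks
            out.close();

            FSDataInputStream in = fs.open(file);
            System.out.println(in.readUTF());
            in.close();

            // Each block of the file is kept on this many DataNodes
            System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        }
    }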

MapReduce provides a programming model that transforms complex computations into computations over a set of key-value pairs. It coordinates the processing of tasks on a cluster of nodes by scheduling jobs, monitoring activity, and re-executing failed tasks.

In a typical MapReduce job, multiple map tasks are executed in parallel on slave nodes, generating results buffered on the local machines. Once some or all of the map tasks have finished, the shuffle process begins, which aggregates the map task outputs by sorting and combining key-value pairs based on keys. Then, the shuffled data partitions are copied to the reducer machine(s), most commonly over the network. Next, reduce tasks run on the shuffled data and generate the final (or intermediate, if multiple consecutive MapReduce jobs are pipelined) results. When a job finishes, the final results reside in multiple files, depending on the number of reducers used in the job. This is the anatomy of a typical MapReduce job flow.
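
To make the map, shuffle, and reduce phases concrete, here is the canonical word-count job written against the org.apache.hadoop.mapreduce API of that era; it is a standard illustration rather than code from this book. The mapper emits a (word, 1) pair for every token, the shuffle groups the pairs by word, and the reducer sums each group; with these settings, one output file is produced per reducer.

    // WordCount.java -- the canonical MapReduce example (Hadoop 1.x-era API).
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every token in the input
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: the shuffle has grouped the values by word; sum them
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count"); // Job.getInstance(conf, ...) in Hadoop 2+
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }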

There's more...

HDFS has two types of nodes: NameNode and DataNode. A NameNode keeps track of the filesystem metadata, such as the locations of data blocks. For efficiency reasons, the metadata is kept in the main memory of a master machine. A DataNode holds the physical data blocks and communicates with clients for data reading and writing. In addition, it periodically reports the list of blocks it hosts to the NameNode in the cluster for verification and validation purposes.

The MapReduce framework has two types of nodes: master node and slave node. JobTracker is the daemon on a master node, and TaskTracker is the daemon on a slave node. The master node is the manager of MapReduce jobs. It splits a job into smaller tasks, which the JobTracker assigns to TaskTrackers on slave nodes to run. When a slave node receives a task, its TaskTracker forks a Java process to run it. Meanwhile, the TaskTracker is also responsible for tracking and reporting the progress of individual tasks.

Hadoop Common

Hadoop Common is a collection of components and interfaces that form the foundation of Hadoop-based Big Data platforms. It provides the following components:

  • Distributed filesystem and I/O operation interfaces

  • General parallel computation interfaces

  • Logging

  • Security management

Apache HBase

Apache HBase is an open source, distributed, versioned, and column-oriented data store built on top of Hadoop and HDFS. HBase supports random, real-time access to Big Data and can scale to host very large tables, containing billions of rows and millions of columns. More documentation about HBase can be obtained from http://hbase.apache.org.
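
As a minimal sketch of the random, real-time access HBase provides, the following Java fragment writes and reads a single cell using the classic (pre-1.0) client API. The table name, column family, and values are hypothetical, and the snippet assumes the table has already been created with an 'info' column family.

    // HBasePutGet.java -- a sketch against the classic HBase client API.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutGet {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users"); // hypothetical, pre-existing table

            Put put = new Put(Bytes.toBytes("user-001"));          // row key
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"),  // family:qualifier
                    Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("user-001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
            table.close();
        }
    }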

Apache Mahout

Apache Mahout is an open source scalable machine learning library based on Hadoop. It has a very active community and is still under development. Currently, the library supports four use cases: recommendation mining, clustering, classification, and frequent item set mining. More documentation of Mahout can be obtained from http://mahout.apache.org.

Apache Pig

Apache Pig is a high-level system for expressing Big Data analysis programs. It supports Big Data by compiling Pig statements into a sequence of MapReduce jobs. Pig uses Pig Latin as its programming language, which is extensible and easy to use. More documentation about Pig can be found at http://pig.apache.org.

Apache Hive

Apache Hive is a high-level system for the management and analysis of Big Data stored in Hadoop-based systems. It uses a SQL-like language called HiveQL. Similar to Apache Pig, the Hive runtime engine translates HiveQL statements into a sequence of MapReduce jobs for execution. More information about Hive can be obtained from http://hive.apache.org.

Apache ZooKeeper

Apache ZooKeeper is a centralized coordination service for large scale distributed systems. It maintains the configuration and naming information and provides distributed synchronization and group services for applications in distributed systems. More documentation about ZooKeeper can be obtained from http://zookeeper.apache.org.
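
The following minimal Java sketch shows the coordination primitive ZooKeeper exposes: a small replicated namespace of znodes that distributed processes can create, read, and watch. The connection string, znode path, and payload are hypothetical.

    // ZkDemo.java -- create and read back one znode.
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkDemo {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown(); // session established
                    }
                }
            });
            connected.await(); // block until the session is up

            zk.create("/demo-config", "version-1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }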

Apache Oozie

Apache Oozie is a scalable workflow management and coordination service for Hadoop jobs. It is data aware and coordinates jobs based on their dependencies. In addition, Oozie has been integrated with Hadoop and can support all types of Hadoop jobs. More information about Oozie can be obtained from http://oozie.apache.org.

Apache Sqoop

Apache Sqoop is a tool for moving data between Apache Hadoop and structured data stores such as relational databases. It provides command-line tools to transfer data from relational databases to HDFS and vice versa. More information about Apache Sqoop can be found at http://sqoop.apache.org.

Apache Flume

Apache Flume is a tool for collecting log data in distributed systems. It has a flexible yet robust and fault tolerant architecture that streams data from log servers to Hadoop. More information can be obtained from http://flume.apache.org.

Apache Avro

Apache Avro is a fast, feature-rich data serialization system for Hadoop. The serialized data is coupled with its schema, which facilitates processing it with different programming languages. More information about Apache Avro can be found at http://avro.apache.org.
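
As a small sketch of how Avro couples serialized data with its schema, the following Java fragment declares a record schema inline as JSON and writes one record to an Avro container file; because the file carries the schema, any language binding can read it back. The schema and field values are hypothetical.

    // AvroDemo.java -- schema-coupled serialization with Avro.
    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;

    public class AvroDemo {
        public static void main(String[] args) throws Exception {
            // A hypothetical record schema, declared inline as JSON
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"ClickEvent\",\"fields\":["
                + "{\"name\":\"user\",\"type\":\"string\"},"
                + "{\"name\":\"clicks\",\"type\":\"int\"}]}");

            GenericRecord event = new GenericData.Record(schema);
            event.put("user", "alice");
            event.put("clicks", 42);

            // The container file stores the schema alongside the records
            DatumWriter<GenericRecord> datumWriter =
                    new GenericDatumWriter<GenericRecord>(schema);
            DataFileWriter<GenericRecord> fileWriter =
                    new DataFileWriter<GenericRecord>(datumWriter);
            fileWriter.create(schema, new File("clicks.avro"));
            fileWriter.append(event);
            fileWriter.close();
        }
    }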

 

Choosing from Hadoop alternatives


Although Hadoop has been very successful for most Big Data problems, it is not an optimal choice in every situation. In this recipe, we will introduce a few Hadoop alternatives.

Getting ready

Hadoop has the following drawbacks as a Big Data platform:

  • As open source software, Hadoop can be difficult to configure and manage, mainly due to software instability and a lack of properly maintained documentation and technical support

  • Hadoop is not an optimal choice for real-time, responsive Big Data applications

  • Hadoop is not a good fit for large graph datasets

Because of the preceding drawbacks, as well as other reasons such as special data processing requirements, we sometimes need to make an alternative choice.

Tip

Hadoop is not a good choice for data that is not categorized as Big Data; for example, small datasets, or datasets whose processing requires transactions and synchronization.

How to do it…

We can choose Hadoop alternatives using the following guidelines:

  1. Choose Enterprise Hadoop if there is no qualified Hadoop administrator and there is sufficient budget for deploying a Big Data platform.

  2. Choose Spark or Storm if an application requires real-time data processing.

  3. Choose GraphLab if an application requires handling of large graph datasets.

How it works…

Enterprise Hadoop refers to Hadoop distributions offered by Hadoop-oriented companies. Compared with the community Hadoop releases, Enterprise Hadoop distributions are enterprise ready, easy to configure, and sometimes add new features. In addition, the training and support services provided by these companies make it much easier for organizations to adopt the Hadoop Big Data platform. Famous Hadoop-oriented companies include Cloudera, Hortonworks, MapR, and Hadapt.

  • Cloudera is one of the most famous companies delivering Enterprise Hadoop Big Data solutions. It provides Hadoop consulting, training, and certification services, and it is one of the biggest contributors to the Hadoop codebase. Its Big Data solution uses Cloudera Desktop as the cluster management interface. You can learn more at www.cloudera.com.

  • Hortonworks and MapR both provide feature-rich Hadoop distributions and Hadoop-based Big Data solutions. You can get more details from www.hortonworks.com and www.mapr.com.

  • Hadapt differentiates itself from the other Hadoop-oriented companies by the goal of integrating structured, semi-structured, and unstructured data into a uniform data operation platform. Hadapt unifies SQL and Hadoop and makes it easy to handle different varieties of data. You can learn more at http://hadapt.com/.

  • Spark is a real-time, in-memory Big Data processing platform. It can be up to 40 times faster than Hadoop, so it is ideal for iterative and responsive Big Data applications. Besides, Spark can be integrated with Hadoop, and its Hadoop-compatible storage APIs enable it to access any Hadoop-supported storage system. More information about Spark can be found at http://spark-project.org/.

  • Storm is another famous real-time Big Data processing platform, developed and open sourced by Twitter. For more information, please check http://storm-project.net/.

  • GraphLab is an open source distributed system developed at Carnegie Mellon University. It is targeted at handling sparse iterative graph algorithms. For more information, please visit http://graphlab.org/.

    Tip

    The MapReduce framework parallelizes computation by splitting data across a number of distributed nodes. Some large natural graph data, such as social network data, is hard to partition and thus hard to split for Hadoop's parallel processing. Performance can be severely penalized if Hadoop is used in such cases.

  • Other Hadoop-like implementations include Phoenix (http://mapreduce.stanford.edu/), a shared-memory implementation of the MapReduce data processing framework, and HaLoop (http://code.google.com/p/haloop/), a modified version of Hadoop for iterative data processing.

    Tip

    Phoenix and HaLoop do not have active communities, and they are not recommended for production deployment.

There's more...

As the Big Data problem floods the whole world, many systems have been designed to deal with it. Two such famous systems that do not follow the MapReduce route are the Message Passing Interface (MPI) and HPCC (High-Performance Computing Cluster).

MPI

MPI is a library specification for message passing. Different from Hadoop, MPI was designed for high performance on both massively parallel machines and workstation clusters. However, MPI lacks fault tolerance, and its performance becomes bounded when data grows large. More documentation about MPI can be found at http://www.mpi-forum.org/.

HPCC

HPCC is an open source Big Data platform developed by HPCC Systems, part of LexisNexis Risk Solutions. It achieves high performance by clustering commodity hardware. The system includes configurations for both parallel batch processing and high-performance online query applications that use indexed data files. The HPCC platform contains two cluster processing subsystems: the Data Refinery subsystem, which is responsible for the general processing of massive raw data, and the Data Delivery subsystem, which is responsible for delivering clean data for online queries and analytics. More information about HPCC can be found at http://hpccsystems.com/.

About the Author
  • Shumin Guo

    Shumin Guo is a PhD student in Computer Science at Wright State University in Dayton, OH. His research fields include cloud computing and social computing. He is enthusiastic about open source technologies and has worked as a system administrator, programmer, and researcher at State Street Corp. and LexisNexis.
