Mastering Hadoop

Sandeep Karanth

Go beyond the basics and master the next generation of Hadoop data processing platforms

Book Details

ISBN-13: 9781783983643
Paperback: 374 pages

About This Book

  • Learn how to optimize Hadoop MapReduce, Pig, and Hive
  • Dive into YARN and learn how it can integrate Storm with Hadoop
  • Understand how Hadoop can be deployed on the cloud and gain insights into analytics with Hadoop

Who This Book Is For

Do you want to broaden your Hadoop skill set and take your knowledge to the next level? Do you wish to enhance your knowledge of Hadoop to solve challenging data processing problems? Are your Hadoop jobs, Pig scripts, or Hive queries not running as fast as you would like? Are you looking to understand the benefits of upgrading Hadoop? If the answer to any of these is yes, this book is for you. It assumes novice-level familiarity with Hadoop.

Table of Contents

Chapter 1: Hadoop 2.X
The inception of Hadoop
The evolution of Hadoop
Hadoop 2.X
Hadoop distributions
Chapter 2: Advanced MapReduce
MapReduce input
The RecordReader class
Hadoop's "small files" problem
Filtering inputs
The Map task
The Reduce task
MapReduce output
MapReduce job counters
Handling data joins
Chapter 3: Advanced Pig
Pig versus SQL
Different modes of execution
Complex data types in Pig
Compiling Pig scripts
Development and debugging aids
The advanced Pig operators
User-defined functions
Pig performance optimizations
Best practices
Chapter 4: Advanced Hive
The Hive architecture
Data types
File formats
The data model
Hive query optimizers
Advanced DML
Chapter 5: Serialization and Hadoop I/O
Data serialization in Hadoop
Avro serialization
File formats
Chapter 6: YARN – Bringing Other Paradigms to Hadoop
The YARN architecture
Developing YARN applications
Monitoring YARN
Job scheduling in YARN
YARN commands
Chapter 7: Storm on YARN – Low Latency Processing in Hadoop
Batch processing versus streaming
Apache Storm
Storm on YARN
Chapter 8: Hadoop on the Cloud
Cloud computing characteristics
Hadoop on the cloud
Amazon Elastic MapReduce (EMR)
Chapter 9: HDFS Replacements
HDFS – advantages and drawbacks
Amazon AWS S3
Implementing a filesystem in Hadoop
Implementing an S3 native filesystem in Hadoop
Chapter 10: HDFS Federation
Limitations of the older HDFS architecture
Architecture of HDFS Federation
HDFS high availability
HDFS block placement
Chapter 11: Hadoop Security
The security pillars
Authentication in Hadoop
Authorization in Hadoop
Data confidentiality in Hadoop
Audit logging in Hadoop
Chapter 12: Analytics Using Hadoop
Data analytics workflow
Machine learning
Apache Mahout
Document analysis using Hadoop and Mahout

What You Will Learn

  • Understand the changes involved in moving from Hadoop 1.0 to Hadoop 2.0
  • Customize and optimize MapReduce jobs in Hadoop 2.0
  • Explore Hadoop I/O and different data formats
  • Dive into YARN and Storm and use YARN to integrate Storm with Hadoop
  • Deploy Hadoop on Amazon Elastic MapReduce
  • Discover HDFS replacements and learn about HDFS Federation
  • Get to grips with Hadoop's main security aspects
  • Utilize Mahout and RHadoop for Hadoop analytics

In Detail

Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an all-encompassing platform for programmers with different levels of expertise.

This book explores industry guidelines for optimizing MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. It then dives deep into Hadoop 2.0-specific features such as YARN and HDFS Federation.

This book is a step-by-step guide that focuses on advanced Hadoop concepts and aims to take your Hadoop knowledge and skill set to the next level. The data processing flow dictates the order of the concepts in each chapter, and each chapter is illustrated with code fragments or schematic diagrams.

