Learning Hadoop 2

Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2

Garry Turkington, Gabriele Modena

Book Details

ISBN 13: 9781783285518
Paperback: 382 pages

Book Description

This book introduces you to the world of building data-processing applications with the wide variety of tools supported by Hadoop 2. Starting with the core components of the framework—HDFS and YARN—this book will guide you through how to build applications using a variety of approaches.

You will learn how YARN completely changes the relationship between MapReduce and Hadoop and allows the latter to support more varied processing approaches and a broader array of applications. These include real-time processing with Apache Samza and iterative computation with Apache Spark. Next up, we discuss Apache Pig and the dataflow data model it provides. You will discover how to use Pig to analyze a Twitter dataset.

This book also shows how tools such as Apache Hive, Apache Oozie, Hadoop Streaming, Apache Crunch, and Kite SDK make working with Hadoop easier. The last part of this book discusses the likely future direction of major Hadoop components and how to get involved with the Hadoop community.

Table of Contents

Chapter 1: Introduction
A note on versioning
The background of Hadoop
Components of Hadoop
Hadoop 2 – what's the big deal?
Distributions of Apache Hadoop
A dual approach
AWS – infrastructure on demand from Amazon
Getting started
Running the examples
Data processing with Hadoop
Summary
Chapter 2: Storage
The inner workings of HDFS
Command-line access to the HDFS filesystem
Protecting the filesystem metadata
Apache ZooKeeper – a different type of filesystem
Automatic NameNode failover
HDFS snapshots
Hadoop filesystems
Managing and serializing data
Storing data
Summary
Chapter 3: Processing – MapReduce and Beyond
MapReduce
Java API to MapReduce
Writing MapReduce programs
Walking through a run of a MapReduce job
YARN
YARN in the real world – Computation beyond MapReduce
Summary
Chapter 4: Real-time Computation with Samza
Stream processing with Samza
Summary
Chapter 5: Iterative Computation with Spark
Apache Spark
The Spark ecosystem
Processing data with Apache Spark
Comparing Samza and Spark Streaming
Summary
Chapter 6: Data Analysis with Apache Pig
An overview of Pig
Getting started
Running Pig
Fundamentals of Apache Pig
Programming Pig
Extending Pig (UDFs)
Analyzing the Twitter stream
Summary
Chapter 7: Hadoop and SQL
Why SQL on Hadoop
Prerequisites
Hive architecture
Hive and Amazon Web Services
Extending HiveQL
Programmatic interfaces
Stinger initiative
Impala
Summary
Chapter 8: Data Lifecycle Management
What data lifecycle management is
Building a tweet analysis capability
Challenges of external data
Collecting additional data
Pulling it all together
Summary
Chapter 9: Making Development Easier
Choosing a framework
Hadoop streaming
Kite Data
Apache Crunch
Summary
Chapter 10: Running a Hadoop Cluster
I'm a developer – I don't care about operations!
Cloudera Manager
Ambari – the open source alternative
Operations in the Hadoop 2 world
Sharing resources
Building a physical cluster
Building a cluster on EMR
Cluster tuning
Security
Monitoring
Troubleshooting
Summary
Chapter 11: Where to Go Next
Alternative distributions
Other computational frameworks
Other interesting projects
Other programming abstractions
AWS resources
Sources of information
Summary

What You Will Learn

  • Write distributed applications using the MapReduce framework
  • Go beyond MapReduce and process data in real time with Samza and iteratively with Spark
  • Familiarize yourself with data mining approaches that work with very large datasets
  • Prototype applications on a VM and deploy them to a local cluster or to a cloud infrastructure (Amazon Web Services)
  • Conduct batch and real-time data analysis using SQL-like tools
  • Build data processing flows using Apache Pig and see how it enables the easy incorporation of custom functionality
  • Define and orchestrate complex workflows and pipelines with Apache Oozie
  • Manage your data lifecycle and changes over time
