
Hadoop Beginner's Guide

Garry Turkington

Get your mountain of data under control with Hadoop. This guide requires no prior knowledge of the software or cloud services – just a willingness to learn the basics from this practical step-by-step tutorial.
eBook: $29.99 (RRP $29.99)
Print + eBook: $49.99 (RRP $49.99)


Book Details

ISBN 13: 9781849517300
Paperback: 398 pages

About This Book

  • Learn tools and techniques that let you approach big data with relish and not fear
  • Build a complete infrastructure to handle your needs as your data grows
  • Hands-on examples in each chapter give the big picture while also giving direct experience

Who This Book Is For

This book assumes no existing experience with Hadoop or cloud services. It does assume familiarity with a programming language such as Java or Ruby, and provides the needed background on the other topics.

Table of Contents

Chapter 1: What It's All About
Big data processing
Cloud computing with Amazon Web Services
Summary
Chapter 2: Getting Hadoop Up and Running
Hadoop on a local Ubuntu host
Time for action – checking the prerequisites
Time for action – downloading Hadoop
Time for action – setting up SSH
Time for action – using Hadoop to calculate Pi
Time for action – configuring the pseudo-distributed mode
Time for action – changing the base HDFS directory
Time for action – formatting the NameNode
Time for action – starting Hadoop
Time for action – using HDFS
Time for action – WordCount, the Hello World of MapReduce
Using Elastic MapReduce
Time for action – WordCount on EMR using the management console
Comparison of local versus EMR Hadoop
Summary
Chapter 3: Understanding MapReduce
Key/value pairs
The Hadoop Java API for MapReduce
Writing MapReduce programs
Time for action – setting up the classpath
Time for action – implementing WordCount
Time for action – building a JAR file
Time for action – running WordCount on a local Hadoop cluster
Time for action – running WordCount on EMR
Time for action – WordCount the easy way
Walking through a run of WordCount
Time for action – WordCount with a combiner
Time for action – fixing WordCount to work with a combiner
Hadoop-specific data types
Time for action – using the Writable wrapper classes
Input/output
Summary
Chapter 4: Developing MapReduce Programs
Using languages other than Java with Hadoop
Time for action – implementing WordCount using Streaming
Analyzing a large dataset
Time for action – summarizing the UFO data
Time for action – summarizing the shape data
Time for action – correlating sighting duration to UFO shape
Time for action – performing the shape/time analysis from the command line
Time for action – using ChainMapper for field validation/analysis
Time for action – using the Distributed Cache to improve location output
Counters, status, and other output
Time for action – creating counters, task states, and writing log output
Summary
Chapter 5: Advanced MapReduce Techniques
Simple, advanced, and in-between
Joins
Time for action – reduce-side join using MultipleInputs
Graph algorithms
Time for action – representing the graph
Time for action – creating the source code
Time for action – the first run
Time for action – the second run
Time for action – the third run
Time for action – the fourth and last run
Using language-independent data structures
Time for action – getting and installing Avro
Time for action – defining the schema
Time for action – creating the source Avro data with Ruby
Time for action – consuming the Avro data with Java
Time for action – generating shape summaries in MapReduce
Time for action – examining the output data with Ruby
Time for action – examining the output data with Java
Summary
Chapter 6: When Things Break
Failure
Time for action – killing a DataNode process
Time for action – the replication factor in action
Time for action – intentionally causing missing blocks
Time for action – killing a TaskTracker process
Time for action – killing the JobTracker
Time for action – killing the NameNode process
Time for action – causing task failure
Time for action – handling dirty data by using skip mode
Summary
Chapter 7: Keeping Things Running
A note on EMR
Hadoop configuration properties
Time for action – browsing default properties
Setting up a cluster
Time for action – examining the default rack configuration
Time for action – adding a rack awareness script
Cluster access control
Time for action – demonstrating the default security
Managing the NameNode
Time for action – adding an additional fsimage location
Time for action – swapping to a new NameNode host
Managing HDFS
MapReduce management
Time for action – changing job priorities and killing a job
Scaling
Summary
Chapter 8: A Relational View on Data with Hive
Overview of Hive
Setting up Hive
Time for action – installing Hive
Using Hive
Time for action – creating a table for the UFO data
Time for action – inserting the UFO data
Time for action – validating the table
Time for action – redefining the table with the correct column separator
Time for action – creating a table from an existing file
Time for action – performing a join
Time for action – using views
Time for action – exporting query output
Time for action – making a partitioned UFO sighting table
Time for action – adding a new User Defined Function (UDF)
Hive on Amazon Web Services
Time for action – running UFO analysis on EMR
Summary
Chapter 9: Working with Relational Databases
Common data paths
Setting up MySQL
Time for action – installing and setting up MySQL
Time for action – configuring MySQL to allow remote connections
Time for action – setting up the employee database
Getting data into Hadoop
Time for action – downloading and configuring Sqoop
Time for action – exporting data from MySQL to HDFS
Time for action – exporting data from MySQL into Hive
Time for action – a more selective import
Time for action – using a type mapping
Time for action – importing data from a raw query
Getting data out of Hadoop
Time for action – importing data from Hadoop into MySQL
Time for action – importing Hive data into MySQL
Time for action – fixing the mapping and re-running the export
AWS considerations
Summary
Chapter 10: Data Collection with Flume
A note about AWS
Data data everywhere...
Time for action – getting web server data into Hadoop
Introducing Apache Flume
Time for action – installing and configuring Flume
Time for action – capturing network traffic in a log file
Time for action – logging to the console
Time for action – capturing the output of a command to a flat file
Time for action – capturing a remote file in a local flat file
Time for action – writing network traffic onto HDFS
Time for action – adding timestamps
Time for action – multi-level Flume networks
Time for action – writing to multiple sinks
The bigger picture
Summary
Chapter 11: Where to Go Next
What we did and didn't cover in this book
Upcoming Hadoop changes
Alternative distributions
Other Apache projects
Other programming abstractions
AWS resources
Sources of information
Summary

What You Will Learn

  • The trends that led to Hadoop and cloud services, giving the background to know when to use the technology
  • Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
  • Developing applications to run on Hadoop with examples in Java and Ruby
  • How Amazon Web Services can be used to deliver a hosted Hadoop solution and how this differs from directly managed environments
  • Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
  • How Flume can collect data from multiple sources and deliver it to Hadoop for processing
  • What other projects and tools make up the broader Hadoop ecosystem and where to go next
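The book's examples are written in Java and Ruby. As a rough flavor of the MapReduce model the examples build on (this is an illustrative sketch, not code from the book; the names map_line and reduce_counts are invented here), a minimal word-count map/reduce pair in Ruby might look like this:

```ruby
# Illustrative sketch of the WordCount idea behind Hadoop Streaming.
# A real Streaming job would read lines on STDIN and write
# tab-separated key/value pairs to STDOUT; here we call the
# functions directly to show the data flow.

# Map step: one input line -> list of [word, 1] pairs.
def map_line(line)
  line.downcase.scan(/\w+/).map { |word| [word, 1] }
end

# Reduce step: sum the counts for each word.
def reduce_counts(pairs)
  counts = Hash.new(0)
  pairs.each { |word, n| counts[word] += n }
  counts
end

pairs = map_line("Hadoop hadoop streaming")
counts = reduce_counts(pairs)
puts counts.inspect
```

In a real cluster, Hadoop runs many mapper instances in parallel and groups the emitted pairs by key before the reducers see them; the per-function logic stays this simple.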

In Detail

Data is arriving faster than you can process it, and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop, however, requires a mixture of programming, design, and system administration skills.

"Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real world problems.

Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, maintain the system, and use additional products to integrate with other systems.

While presenting different ways to develop applications to run on Hadoop, the book also covers tools such as Hive, Sqoop, and Flume that show how Hadoop can be integrated with relational databases and log collection.

In addition to examples using Hadoop clusters on Ubuntu, the book covers cloud services such as Amazon EC2 and Elastic MapReduce.
