
Hadoop MapReduce Cookbook

Srinath Perera, Thilina Gunarathne

Learn how to use Hadoop MapReduce to analyze large and complex datasets with this comprehensive cookbook. Over fifty recipes with step-by-step instructions quickly take your Hadoop skills to the next level.
eBook: RRP $29.99
Print + eBook: RRP $49.99


Book Details

ISBN-13: 9781849517287
Paperback: 300 pages

About This Book

  • Learn to process large and complex data sets, starting with simple examples and then diving deep
  • Solve complex big data problems such as classifications, finding relationships, online marketing, and recommendations
  • More than 50 Hadoop MapReduce recipes, presented in a simple and straightforward manner, with step-by-step instructions and real-world examples

Who This Book Is For

If you are a big data enthusiast striving to use Hadoop to solve your problems, this book is for you. Aimed at Java programmers with some knowledge of Hadoop MapReduce, it is also a comprehensive reference for developers and system administrators who want to get up to speed with Hadoop.

Table of Contents

Chapter 1: Getting Hadoop Up and Running in a Cluster
Setting up Hadoop on your machine
Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
Adding the combiner step to the WordCount MapReduce program
Setting up HDFS
Using HDFS monitoring UI
HDFS basic command-line file operations
Setting up Hadoop in a distributed cluster environment
Running the WordCount program in a distributed cluster environment
Using MapReduce monitoring UI
Chapter 2: Advanced HDFS
Benchmarking HDFS
Adding a new DataNode
Decommissioning DataNodes
Using multiple disks/volumes and limiting HDFS disk usage
Setting HDFS block size
Setting the file replication factor
Using HDFS Java API
Using HDFS C API (libhdfs)
Mounting HDFS (Fuse-DFS)
Merging files in HDFS
Chapter 3: Advanced Hadoop MapReduce Administration
Tuning Hadoop configurations for cluster deployments
Running benchmarks to verify the Hadoop installation
Reusing Java VMs to improve the performance
Fault tolerance and speculative execution
Debug scripts – analyzing task failures
Setting failure percentages and skipping bad records
Shared-user Hadoop clusters – using fair and other schedulers
Hadoop security – integrating with Kerberos
Using the Hadoop Tool interface
Chapter 4: Developing Complex Hadoop MapReduce Applications
Choosing appropriate Hadoop data types
Implementing a custom Hadoop Writable data type
Implementing a custom Hadoop key type
Emitting data of different value types from a mapper
Choosing a suitable Hadoop InputFormat for your input data format
Adding support for new input data formats – implementing a custom InputFormat
Formatting the results of MapReduce computations – using Hadoop OutputFormats
Hadoop intermediate (map to reduce) data partitioning
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
Using Hadoop with legacy applications – Hadoop Streaming
Adding dependencies between MapReduce jobs
Hadoop counters for reporting custom metrics
Chapter 5: Hadoop Ecosystem
Installing HBase
Data random access using Java client APIs
Running MapReduce jobs on HBase (table input/output)
Installing Pig
Running your first Pig command
Set operations (join, union) and sorting with Pig
Installing Hive
Running a SQL-style query with Hive
Performing a join with Hive
Installing Mahout
Running K-means with Mahout
Visualizing K-means results
Chapter 6: Analytics
Simple analytics using MapReduce
Performing Group-By using MapReduce
Calculating frequency distributions and sorting using MapReduce
Plotting the Hadoop results using GNU Plot
Calculating histograms using MapReduce
Calculating scatter plots using MapReduce
Parsing a complex dataset with Hadoop
Joining two datasets using MapReduce
Chapter 7: Searching and Indexing
Generating an inverted index using Hadoop MapReduce
Intra-domain web crawling using Apache Nutch
Indexing and searching web documents using Apache Solr
Configuring Apache HBase as the backend data store for Apache Nutch
Deploying Apache HBase on a Hadoop cluster
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
ElasticSearch for indexing and searching
Generating the in-links graph for crawled web pages
Chapter 8: Classifications, Recommendations, and Finding Relationships
Content-based recommendations
Hierarchical clustering
Clustering an Amazon sales dataset
Collaborative filtering-based recommendations
Classification using Naive Bayes Classifier
Assigning advertisements to keywords using the Adwords balance algorithm
Chapter 9: Mass Text Data Processing
Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python
Data de-duplication using Hadoop Streaming
Loading large datasets to an Apache HBase data store using importtsv and bulkload tools
Creating TF and TF-IDF vectors for the text data
Clustering the text data
Topic discovery using Latent Dirichlet Allocation (LDA)
Document classification using Mahout Naive Bayes classifier
Chapter 10: Cloud Deployments: Using Hadoop on Clouds
Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
Saving money by using Amazon EC2 Spot Instances to execute EMR job flows
Executing a Pig script using EMR
Executing a Hive script using EMR
Creating an Amazon EMR job flow using the Command Line Interface
Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR
Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment

What You Will Learn

  • How to install Hadoop MapReduce and HDFS and begin running examples
  • How to configure and administer Hadoop and HDFS securely
  • How to understand the internals of Hadoop and extend it to suit your needs
  • How to use HBase, Hive, Pig, Mahout, and Nutch to get things done easily and efficiently
  • How to use MapReduce to solve many types of analytics problems
  • How to solve complex problems such as classifications, finding relationships, online marketing, and recommendations
  • How to use MapReduce for massive text data processing
  • How to use cloud environments to perform Hadoop computations

In Detail

We are facing an avalanche of data. The unstructured data we gather can contain many insights that might hold the key to business success or failure. Harnessing the ability to analyze and process this data with Hadoop MapReduce is one of the most highly sought-after skills in today's job market.

"Hadoop MapReduce Cookbook" is a one-stop guide to processing large and complex data sets using the Hadoop ecosystem. The book starts with simple examples and then dives deep to solve complex big data use cases.

"Hadoop MapReduce Cookbook" presents more than 50 ready-to-use Hadoop MapReduce recipes in a simple and straightforward manner, with step-by-step instructions and real world examples.

Start by installing Hadoop, then learn to configure, extend, and administer it. From there, write simple examples, learn MapReduce patterns, harness the Hadoop ecosystem, and finally jump to the cloud.

The book deals with many exciting topics such as setting up Hadoop security, and using MapReduce to solve analytics, classification, online marketing, recommendation, and search use cases. You will learn how to harness components of the Hadoop ecosystem, including HBase, Hive, Pig, and Mahout, and then learn how to set up cloud environments to perform Hadoop MapReduce computations.

"Hadoop MapReduce Cookbook" teaches you how to process large and complex data sets using real examples, providing a comprehensive guide to getting things done with Hadoop MapReduce.

