Hadoop MapReduce v2 Cookbook - Second Edition

Explore the Hadoop MapReduce v2 ecosystem to gain insights from very large datasets

Thilina Gunarathne
Book Details

ISBN 13: 9781783285471
Paperback: 322 pages

About This Book

  • Process large and complex datasets using next generation Hadoop
  • Install, configure, and administer MapReduce programs, and learn what’s new in MapReduce v2
  • More than 90 Hadoop MapReduce recipes presented in a simple and straightforward manner, with step-by-step instructions and real-world examples

Who This Book Is For

If you are a Big Data enthusiast who wishes to use Hadoop v2 to solve your problems, this book is for you. It is aimed at Java programmers with little to moderate knowledge of Hadoop MapReduce, and it also serves as a one-stop reference for developers and system admins who want to get up to speed with Hadoop v2 quickly. A basic knowledge of software development using Java and a basic working knowledge of Linux are helpful.

Table of Contents

Chapter 1: Getting Started with Hadoop v2
Introduction
Setting up Hadoop v2 on your local machine
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
Adding a combiner step to the WordCount MapReduce program
Setting up HDFS
Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
Setting up the Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
HDFS command-line file operations
Running the WordCount program in a distributed cluster environment
Benchmarking HDFS using DFSIO
Benchmarking Hadoop MapReduce using TeraSort
Chapter 2: Cloud Deployments – Using Hadoop YARN on Cloud Environments
Introduction
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Saving money using Amazon EC2 Spot Instances to execute EMR job flows
Executing a Pig script using EMR
Executing a Hive script using EMR
Creating an Amazon EMR job flow using the AWS Command Line Interface
Deploying an Apache HBase cluster on Amazon EC2 using EMR
Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
Chapter 3: Hadoop Essentials – Configurations, Unit Tests, and Other APIs
Introduction
Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
Shared user Hadoop clusters – using Fair and Capacity schedulers
Setting classpath precedence to user-provided JARs
Speculative execution of straggling tasks
Unit testing Hadoop MapReduce applications using MRUnit
Integration testing Hadoop MapReduce applications using MiniYarnCluster
Adding a new DataNode
Decommissioning DataNodes
Using multiple disks/volumes and limiting HDFS disk usage
Setting the HDFS block size
Setting the file replication factor
Using the HDFS Java API
Chapter 4: Developing Complex Hadoop MapReduce Applications
Introduction
Choosing appropriate Hadoop data types
Implementing a custom Hadoop Writable data type
Implementing a custom Hadoop key type
Emitting data of different value types from a Mapper
Choosing a suitable Hadoop InputFormat for your input data format
Adding support for new input data formats – implementing a custom InputFormat
Formatting the results of MapReduce computations – using Hadoop OutputFormats
Writing multiple outputs from a MapReduce computation
Hadoop intermediate data partitioning
Secondary sorting – sorting Reduce input values
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
Using Hadoop with legacy applications – Hadoop streaming
Adding dependencies between MapReduce jobs
Hadoop counters to report custom metrics
Chapter 5: Analytics
Introduction
Simple analytics using MapReduce
Performing GROUP BY using MapReduce
Calculating frequency distributions and sorting using MapReduce
Plotting the Hadoop MapReduce results using gnuplot
Calculating histograms using MapReduce
Calculating scatter plots using MapReduce
Parsing a complex dataset with Hadoop
Joining two datasets using MapReduce
Chapter 6: Hadoop Ecosystem – Apache Hive
Introduction
Getting started with Apache Hive
Creating databases and tables using Hive CLI
Simple SQL-style data querying using Apache Hive
Creating and populating Hive tables and views using Hive query results
Utilizing different storage formats in Hive – storing table data using ORC files
Using Hive built-in functions
Hive batch mode – using a query file
Performing a join with Hive
Creating partitioned Hive tables
Writing Hive User-defined Functions (UDF)
HCatalog – performing Java MapReduce computations on data mapped to Hive tables
HCatalog – writing data to Hive tables from Java MapReduce computations
Chapter 7: Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop
Introduction
Getting started with Apache Pig
Joining two datasets using Pig
Accessing Hive table data in Pig using HCatalog
Getting started with Apache HBase
Data random access using Java client APIs
Running MapReduce jobs on HBase
Using Hive to insert data into HBase tables
Getting started with Apache Mahout
Running K-means with Mahout
Importing data to HDFS from a relational database using Apache Sqoop
Exporting data from HDFS to a relational database using Apache Sqoop
Chapter 8: Searching and Indexing
Introduction
Generating an inverted index using Hadoop MapReduce
Intradomain web crawling using Apache Nutch
Indexing and searching web documents using Apache Solr
Configuring Apache HBase as the backend data store for Apache Nutch
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
Elasticsearch for indexing and searching
Generating the in-links graph for crawled web pages
Chapter 9: Classifications, Recommendations, and Finding Relationships
Introduction
Performing content-based recommendations
Classification using the naïve Bayes classifier
Assigning advertisements to keywords using the AdWords balance algorithm
Chapter 10: Mass Text Data Processing
Introduction
Data preprocessing using Hadoop streaming and Python
De-duplicating data using Hadoop streaming
Loading large datasets to an Apache HBase data store – importtsv and bulkload
Creating TF and TF-IDF vectors for the text data
Clustering text data using Apache Mahout
Topic discovery using Latent Dirichlet Allocation (LDA)
Document classification using the Mahout Naive Bayes classifier
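
To give a flavor of the recipe style, the following is a minimal sketch of the WordCount mapper and reducer covered in Chapter 1, written against the standard org.apache.hadoop.mapreduce API. The class names are illustrative assumptions, not the book's source code; note that the same reducer can serve as the combiner from the combiner recipe, since summing counts is associative and commutative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits a (word, 1) pair for every token in each input line.
    public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the counts emitted for each word. It can also be
    // registered as the combiner, because addition is associative.
    class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }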

What You Will Learn

  • Configure and administer Hadoop YARN, MapReduce v2, and HDFS clusters
  • Use Hive, HBase, Pig, Mahout, and Nutch with Hadoop v2 to solve your big data problems easily and effectively
  • Solve large-scale analytics problems using MapReduce-based applications
  • Tackle complex problems such as classifications, finding relationships, online marketing, recommendations, and searching using Hadoop MapReduce and other related projects
  • Perform massive text data processing using Hadoop MapReduce and other related projects
  • Deploy your clusters to cloud environments
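
As a concrete example of the HDFS skills above, here is a minimal sketch of the kind of HDFS Java API usage covered in Chapter 3, using the standard org.apache.hadoop.fs API. The file path is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath;
        // fs.defaultFS determines which cluster this talks to.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (the path is an illustrative assumption).
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeUTF("Hello, HDFS!");
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
          System.out.println(in.readUTF());
        }

        // Inspect the replication factor and block size, which the
        // Chapter 3 recipes show how to tune per file and per cluster.
        System.out.println(fs.getFileStatus(file).getReplication());
        System.out.println(fs.getFileStatus(file).getBlockSize());
      }
    }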

In Detail

Starting with the installation of Hadoop YARN, MapReduce, HDFS, and other Hadoop ecosystem components, this book quickly moves on to many exciting topics, such as MapReduce patterns and using Hadoop to solve analytics, classification, online marketing, recommendation, and data indexing and search problems. You will learn how to take advantage of Hadoop ecosystem projects including Hive, HBase, Pig, Mahout, Nutch, and Giraph, and be introduced to deploying to cloud environments.

Finally, you will be able to apply the knowledge you have gained to your own real-world scenarios to achieve the best-possible results.
