Hadoop Real-World Solutions Cookbook

Ever felt you could use some no-nonsense, practical help when developing applications with Hadoop? Well, you’ve just found it. This real-world solutions cookbook is packed with handy recipes you can apply to your own everyday issues.

Jonathan R. Owens, Jon Lentz, Brian Femiano

Book Details

ISBN 13: 9781849519120
Paperback: 316 pages

About This Book

  • Solutions to common problems when working in the Hadoop environment
  • Recipes for (un)loading data, analytics, and troubleshooting
  • In-depth code examples demonstrating various analytic models, analytic solutions, and common best practices

Who This Book Is For

This book is ideal for developers who wish to have a better understanding of Hadoop application development and associated tools, and for developers who understand Hadoop conceptually but want practical examples of real-world applications.

Table of Contents

Chapter 1: Hadoop Distributed File System – Importing and Exporting Data
Introduction
Importing and exporting data into HDFS using Hadoop shell commands (see the sketch below)
Moving data efficiently between clusters using Distributed Copy
Importing data from MySQL into HDFS using Sqoop
Exporting data from HDFS into MySQL using Sqoop
Configuring Sqoop for Microsoft SQL Server
Exporting data from HDFS into MongoDB
Importing data from MongoDB into HDFS
Exporting data from HDFS into MongoDB using Pig
Using HDFS in a Greenplum external table
Using Flume to load data into HDFS
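
To give a flavor of this chapter, here is a minimal sketch (not the book's code) of the programmatic counterpart to the shell-based import recipe, using Hadoop's FileSystem API; both paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsImport {
        public static void main(String[] args) throws Exception {
            // Picks up fs.default.name from core-site.xml on the classpath
            FileSystem fs = FileSystem.get(new Configuration());
            // Programmatic equivalent of "hadoop fs -put"
            fs.copyFromLocalFile(new Path("/tmp/weblog_entries.txt"),
                                 new Path("/data/weblogs/weblog_entries.txt"));
            fs.close();
        }
    }
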
Chapter 2: HDFS
Introduction
Reading and writing data to HDFS
Compressing data using LZO
Reading and writing data to SequenceFiles
Using Apache Avro to serialize data
Using Apache Thrift to serialize data
Using Protocol Buffers to serialize data
Setting the replication factor for HDFS (see the sketch below)
Setting the block size for HDFS
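
A minimal sketch of the replication-factor recipe, again illustrative rather than the book's code; the path is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Schedule re-replication of an existing file to three copies
            boolean ok = fs.setReplication(
                    new Path("/data/weblogs/weblog_entries.txt"), (short) 3);
            System.out.println("Replication change accepted: " + ok);
            fs.close();
        }
    }

Note that setReplication only affects files that already exist; the dfs.replication property sets the default for new writes.
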
Chapter 3: Extracting and Transforming Data
Introduction
Transforming Apache logs into TSV format using MapReduce (see the sketch below)
Using Apache Pig to filter bot traffic from web server logs
Using Apache Pig to sort web server log data by timestamp
Using Apache Pig to sessionize web server log data
Using Python to extend Apache Pig functionality
Using MapReduce and secondary sort to calculate page views
Using Hive and Python to clean and transform geographical event data
Using Python and Hadoop Streaming to perform a time series analytic
Using MultipleOutputs in MapReduce to name output files
Creating custom Hadoop Writable and InputFormat to read geographical event data
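
The log-to-TSV recipe reduces to a map-only job; in this illustrative sketch the whitespace splitting stands in for real Apache log parsing.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Re-emits each log line with its whitespace-delimited fields joined
    // by tabs; run with zero reducers for a map-only job.
    public class LogToTsvMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text out = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\\s+");
            out.set(String.join("\t", fields));
            context.write(out, NullWritable.get());
        }
    }
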
Chapter 4: Performing Common Tasks Using Hive, Pig, and MapReduce
Introduction
Using Hive to map an external table over weblog data in HDFS
Using Hive to dynamically create tables from the results of a weblog query
Using the Hive string UDFs to concatenate fields in weblog data
Using Hive to intersect weblog IPs and determine the country
Generating n-grams over news archives using MapReduce
Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives (see the sketch below)
Using Pig to load a table and perform a SELECT operation with GROUP BY
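
A sketch of the distributed-cache keyword match, assuming the Hadoop 2 Job API; the driver would ship the file with job.addCacheFile(new URI("/data/keywords.txt#keywords")), and the path and symlink name are placeholders.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits only the lines that contain at least one cached keyword.
    public class KeywordFilterMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Set<String> keywords = new HashSet<String>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached file is symlinked as "keywords" in the task's
            // working directory, per the URI fragment set in the driver
            BufferedReader in = new BufferedReader(new FileReader("keywords"));
            String line;
            while ((line = in.readLine()) != null) {
                keywords.add(line.trim());
            }
            in.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String k : keywords) {
                if (value.toString().contains(k)) {
                    context.write(value, NullWritable.get());
                    break;
                }
            }
        }
    }
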
Chapter 5: Advanced Joins
Introduction
Joining data in the Mapper using MapReduce
Joining data using Apache Pig replicated join
Joining sorted data using Apache Pig merge join
Joining skewed data using Apache Pig skewed join
Using a map-side join in Apache Hive to analyze geographical events
Using optimized full outer joins in Apache Hive to analyze geographical events
Joining data using an external key-value store (Redis) (see the sketch below)
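
A sketch of the Redis-backed join, assuming the Jedis client and tab-delimited input with the join key in the first field; the host, field positions, and class names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import redis.clients.jedis.Jedis;

    // Enriches each record with the value stored in Redis under its join
    // key, i.e. a map-side join against an external key-value store.
    public class RedisJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Jedis jedis;

        @Override
        protected void setup(Context context) {
            jedis = new Jedis("localhost"); // Redis host is a placeholder
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            String sideData = jedis.get(fields[0]); // look up the join key
            if (sideData != null) {
                context.write(new Text(fields[0]),
                              new Text(value.toString() + "\t" + sideData));
            }
        }

        @Override
        protected void cleanup(Context context) {
            jedis.close();
        }
    }
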
Chapter 6: Big Data Analysis
Introduction
Counting distinct IPs in weblog data using MapReduce and Combiners (see the sketch below)
Using Hive date UDFs to transform and sort event dates from geographic event data
Using Hive to build a per-month report of fatalities over geographic event data
Implementing a custom UDF in Hive to help validate source reliability over geographic event data
Marking the longest period of non-violence using Hive MAP/REDUCE operators and Python
Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
Trimming outliers from the Audioscrobbler dataset using Pig and datafu
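
The distinct-count pattern behind the first recipe in this chapter, sketched under the assumption that the IP is the first tab-delimited field and the job runs with job.setNumReduceTasks(1).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DistinctIp {

        // Key on the IP so duplicates collapse during the shuffle
        public static class IpMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            private final Text ip = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                ip.set(value.toString().split("\t")[0]);
                context.write(ip, NullWritable.get());
            }
        }

        // Combiner: emit each IP once per map task to cut shuffle volume
        public static class DedupCombiner
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text key, Iterable<NullWritable> values,
                    Context context) throws IOException, InterruptedException {
                context.write(key, NullWritable.get());
            }
        }

        // Single reducer: each reduce() call sees one distinct IP
        public static class CountReducer
                extends Reducer<Text, NullWritable, Text, IntWritable> {
            private int distinct = 0;

            @Override
            protected void reduce(Text key, Iterable<NullWritable> values,
                    Context context) {
                distinct++;
            }

            @Override
            protected void cleanup(Context context)
                    throws IOException, InterruptedException {
                context.write(new Text("distinct_ips"), new IntWritable(distinct));
            }
        }
    }
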
Chapter 7: Advanced Big Data Analysis
Introduction
PageRank with Apache Giraph
Single-source shortest-path with Apache Giraph (see the sketch below)
Using Apache Giraph to perform a distributed breadth-first search
Collaborative filtering with Apache Mahout
Clustering with Apache Mahout
Sentiment classification with Apache Mahout
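
For the shortest-path recipe, a vertex-centric sketch modeled on Giraph's bundled SimpleShortestPathsComputation example (Giraph 1.x API), not the book's code; the hard-coded source vertex is an assumption.

    import java.io.IOException;
    import org.apache.giraph.edge.Edge;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    public class ShortestPathsComputation extends
            BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
        private static final long SOURCE_ID = 1; // source vertex is an assumption

        @Override
        public void compute(
                Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                Iterable<DoubleWritable> messages) throws IOException {
            if (getSuperstep() == 0) {
                vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
            }
            // Shortest distance seen so far for this vertex
            double minDist = vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
            for (DoubleWritable message : messages) {
                minDist = Math.min(minDist, message.get());
            }
            if (minDist < vertex.getValue().get()) {
                vertex.setValue(new DoubleWritable(minDist));
                // Propagate the improved distance to all neighbors
                for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                    sendMessage(edge.getTargetVertexId(),
                            new DoubleWritable(minDist + edge.getValue().get()));
                }
            }
            vertex.voteToHalt();
        }
    }
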
Chapter 8: Debugging
Introduction
Using Counters in a MapReduce job to track bad records (see the sketch below)
Developing and testing MapReduce jobs with MRUnit
Developing and testing MapReduce jobs running in local mode
Enabling MapReduce jobs to skip bad records
Using Counters in a streaming job
Updating task status messages to display debugging information
Using illustrate to debug Pig jobs
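
The bad-record counter pattern in miniature; the enum name and the expected column count are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Malformed lines bump a custom counter instead of failing the job.
    public class ValidatingMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        public enum RecordQuality { GOOD, MALFORMED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length != 9) { // expected column count is an assumption
                context.getCounter(RecordQuality.MALFORMED).increment(1);
                return; // skip the record but keep the job running
            }
            context.getCounter(RecordQuality.GOOD).increment(1);
            context.write(value, NullWritable.get());
        }
    }

Counter totals show up in the job's web UI and client output, making this a cheap first line of defense before resorting to skip mode.
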
Chapter 9: System Administration
Introduction
Starting Hadoop in pseudo-distributed mode
Starting Hadoop in distributed mode
Adding new nodes to an existing cluster
Safely decommissioning nodes
Recovering from a NameNode failure
Monitoring cluster health using Ganglia
Tuning MapReduce job parameters (see the sketch below)
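
A driver-side sketch of job tuning using the classic Hadoop 1.x property names; the values shown are placeholders, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TunedDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapred.child.java.opts", "-Xmx1024m"); // per-task JVM heap
            conf.setInt("io.sort.mb", 200);                  // map-side sort buffer
            conf.setInt("mapred.reduce.tasks", 8);           // reducer parallelism
            Job job = new Job(conf, "tuned-job");
            // ... set mapper, reducer, and input/output paths as usual
        }
    }
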
Chapter 10: Persistence Using Apache Accumulo
Introduction
Designing a row key to store geographic events in Accumulo (see the sketch below)
Using MapReduce to bulk import geographic event data into Accumulo
Setting a custom field constraint for inputting geographic event data in Accumulo
Limiting query results using the regex filtering iterator
Counting fatalities for different versions of the same key using SumCombiner
Enforcing cell-level security on scans using Accumulo
Aggregating sources in Accumulo using MapReduce
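
Finally, a sketch of writing a single geographic event to Accumulo with the 1.4-era client API; the instance name, ZooKeeper quorum, credentials, table name, and row-key scheme are all placeholders.

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class AccumuloEventWrite {
        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("accumulo", "zkhost:2181")
                    .getConnector("user", "pass".getBytes());
            // maxMemory (bytes), maxLatency (ms), write threads
            BatchWriter writer = conn.createBatchWriter("acled", 1000000L, 60000L, 2);
            // A date-prefixed row key keeps events for the same day and
            // region contiguous; the exact scheme here is an assumption
            Mutation m = new Mutation(new Text("20130101_NIGERIA"));
            m.put(new Text("atr"), new Text("fatalities"), new Value("12".getBytes()));
            writer.addMutation(m);
            writer.close();
        }
    }
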

What You Will Learn

  • Data ETL, compression, serialization, and import/export
  • Simple and advanced aggregate analysis
  • Graph analysis
  • Machine learning
  • Troubleshooting and debugging
  • Scalable persistence
  • Cluster administration and configuration

In Detail

This book helps developers become more comfortable with, and proficient at, solving problems in the Hadoop space. Readers will become familiar with a wide variety of Hadoop-related tools and best practices for implementation.

Hadoop Real-World Solutions Cookbook will teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.

Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, then solve, technical challenges, and the recipes can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. The book covers (un)loading data to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce jobs, and columnar storage and retrieval of structured data using Apache Accumulo.

Hadoop Real-World Solutions Cookbook will give readers the examples they need to apply Hadoop technology to their own problems.
