Hadoop Real-World Solutions Cookbook


Overview
  • Solutions to common problems when working in the Hadoop environment
  • Recipes for (un)loading data, analytics, and troubleshooting
  • In-depth code examples demonstrating various analytic models, analytic solutions, and common best practices

Book Details

Language : English
Paperback : 316 pages [235mm x 191mm]
Release Date : February 2013
ISBN : 1849519129
ISBN 13 : 9781849519120
Author(s) : Jonathan R. Owens, Brian Femiano, Jon Lentz
Topics and Technologies : All Books, Big Data and Business Intelligence, Data, Cloud, Cookbooks, Open Source

Table of Contents

Preface
Chapter 1: Hadoop Distributed File System – Importing and Exporting Data
Chapter 2: HDFS
Chapter 3: Extracting and Transforming Data
Chapter 4: Performing Common Tasks Using Hive, Pig, and MapReduce
Chapter 5: Advanced Joins
Chapter 6: Big Data Analysis
Chapter 7: Advanced Big Data Analysis
Chapter 8: Debugging
Chapter 9: System Administration
Chapter 10: Persistence Using Apache Accumulo
Index
  • Chapter 1: Hadoop Distributed File System – Importing and Exporting Data
    • Introduction
    • Importing and exporting data into HDFS using Hadoop shell commands
    • Moving data efficiently between clusters using Distributed Copy
    • Importing data from MySQL into HDFS using Sqoop
    • Exporting data from HDFS into MySQL using Sqoop
    • Configuring Sqoop for Microsoft SQL Server
    • Exporting data from HDFS into MongoDB
    • Importing data from MongoDB into HDFS
    • Exporting data from HDFS into MongoDB using Pig
    • Using HDFS in a Greenplum external table
    • Using Flume to load data into HDFS
  • Chapter 2: HDFS
    • Introduction
    • Reading and writing data to HDFS
    • Compressing data using LZO
    • Reading and writing data to SequenceFiles
    • Using Apache Avro to serialize data
    • Using Apache Thrift to serialize data
    • Using Protocol Buffers to serialize data
    • Setting the replication factor for HDFS
    • Setting the block size for HDFS
  • Chapter 3: Extracting and Transforming Data
    • Introduction
    • Transforming Apache logs into TSV format using MapReduce
    • Using Apache Pig to filter bot traffic from web server logs
    • Using Apache Pig to sort web server log data by timestamp
    • Using Apache Pig to sessionize web server log data
    • Using Python to extend Apache Pig functionality
    • Using MapReduce and secondary sort to calculate page views
    • Using Hive and Python to clean and transform geographical event data
    • Using Python and Hadoop Streaming to perform a time series analytic
    • Using MultipleOutputs in MapReduce to name output files
    • Creating custom Hadoop Writable and InputFormat to read geographical event data
  • Chapter 4: Performing Common Tasks Using Hive, Pig, and MapReduce
    • Introduction
    • Using Hive to map an external table over weblog data in HDFS
    • Using Hive to dynamically create tables from the results of a weblog query
    • Using the Hive string UDFs to concatenate fields in weblog data
    • Using Hive to intersect weblog IPs and determine the country
    • Generating n-grams over news archives using MapReduce
    • Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives
    • Using Pig to load a table and perform a SELECT operation with GROUP BY
  • Chapter 5: Advanced Joins
    • Introduction
    • Joining data in the Mapper using MapReduce
    • Joining data using Apache Pig replicated join
    • Joining sorted data using Apache Pig merge join
    • Joining skewed data using Apache Pig skewed join
    • Using a map-side join in Apache Hive to analyze geographical events
    • Using optimized full outer joins in Apache Hive to analyze geographical events
    • Joining data using an external key-value store (Redis)
  • Chapter 6: Big Data Analysis
    • Introduction
    • Counting distinct IPs in weblog data using MapReduce and Combiners
    • Using Hive date UDFs to transform and sort event dates from geographic event data
    • Using Hive to build a per-month report of fatalities over geographic event data
    • Implementing a custom UDF in Hive to help validate source reliability over geographic event data
    • Marking the longest period of non-violence using Hive MAP/REDUCE operators and Python
    • Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
    • Trimming outliers from the Audioscrobbler dataset using Pig and datafu
  • Chapter 7: Advanced Big Data Analysis
    • Introduction
    • PageRank with Apache Giraph
    • Single-source shortest-path with Apache Giraph
    • Using Apache Giraph to perform a distributed breadth-first search
    • Collaborative filtering with Apache Mahout
    • Clustering with Apache Mahout
    • Sentiment classification with Apache Mahout
  • Chapter 8: Debugging
    • Introduction
    • Using Counters in a MapReduce job to track bad records
    • Developing and testing MapReduce jobs with MRUnit
    • Developing and testing MapReduce jobs running in local mode
    • Enabling MapReduce jobs to skip bad records
    • Using Counters in a streaming job
    • Updating task status messages to display debugging information
    • Using illustrate to debug Pig jobs
  • Chapter 9: System Administration
    • Introduction
    • Starting Hadoop in pseudo-distributed mode
    • Starting Hadoop in distributed mode
    • Adding new nodes to an existing cluster
    • Safely decommissioning nodes
    • Recovering from a NameNode failure
    • Monitoring cluster health using Ganglia
    • Tuning MapReduce job parameters
  • Chapter 10: Persistence Using Apache Accumulo
    • Introduction
    • Designing a row key to store geographic events in Accumulo
    • Using MapReduce to bulk import geographic event data into Accumulo
    • Setting a custom field constraint for inputting geographic event data in Accumulo
    • Limiting query results using the regex filtering iterator
    • Counting fatalities for different versions of the same key using SumCombiner
    • Enforcing cell-level security on scans using Accumulo
    • Aggregating sources in Accumulo using MapReduce

Jonathan R. Owens

Jonathan R. Owens has a background in Java and C++, and has worked in both the private and public sectors as a software engineer. Most recently, he has been working with Hadoop and related distributed processing technologies. He currently works for comScore, Inc., a widely regarded digital measurement and analytics company. At comScore, he is a member of the core processing team, which uses Hadoop and other custom distributed systems to aggregate, analyze, and manage over 40 billion transactions per day.

Brian Femiano

Brian Femiano has a B.S. in Computer Science and has been programming professionally for over six years, the last two of which have been spent building advanced analytics and big data capabilities using Apache Hadoop. He has worked in the commercial sector in the past, but the majority of his experience comes from the government contracting space. He currently works for Potomac Fusion in the DC/Virginia area, where they develop scalable algorithms to study and enhance some of the most advanced and complex datasets in the government space. Within Potomac Fusion, he has taught courses and training sessions on Apache Hadoop and related cloud-scale technologies.

Jon Lentz

Jon Lentz is a software engineer on the core processing team at comScore, an online audience measurement and analytics company. He prefers to do most of his coding in Pig. Before working at comScore, he wrote software to optimize supply chains and to allocate fixed-income securities.

Code Downloads

Download the code and support files for this book.


Submit Errata

If you have found any errors not already listed, please let us know by completing our errata submission form. Our editors will verify them and add them to the errata list. Thank you.


Errata

- 2 submitted: last submission 18 Mar 2014

Errata type: Technical | Page number: 34

On page 34, in step 6, the Pig JAR file should also be copied, along with the other components, to $HADOOP_HOME/lib on each node.
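For a single node, the corrected step amounts to dropping the Pig JAR into Hadoop's library directory. A minimal Python sketch of that copy follows; the `pig-*.jar` name pattern and the directory layout are assumptions, not details from the book:

```python
import glob
import os
import shutil

def install_pig_jar(pig_home, hadoop_home):
    """Copy the Pig JAR(s) from pig_home into $HADOOP_HOME/lib on this
    node so MapReduce tasks can load the Pig classes at runtime."""
    jars = glob.glob(os.path.join(pig_home, "pig-*.jar"))
    if not jars:
        raise FileNotFoundError("no Pig JAR found under %s" % pig_home)
    dest = os.path.join(hadoop_home, "lib")
    os.makedirs(dest, exist_ok=True)
    for jar in jars:
        shutil.copy(jar, dest)
    # Return the destination paths so the caller can verify/log the copy
    return [os.path.join(dest, os.path.basename(j)) for j in jars]
```

Per the erratum, the copy must happen on every node, for example by repeating it over the cluster's host list with scp or a configuration management tool.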

The weblog_entries.txt file is located in the folder named "code" in the code files for Chapter 4.

Sample chapters

You can view sample chapters and the preface of this title on PacktLib, or download sample chapters in PDF format.


What you will learn from this book

  • Data ETL, compression, serialization, and import/export
  • Simple and advanced aggregate analysis
  • Graph analysis
  • Machine learning
  • Troubleshooting and debugging
  • Scalable persistence
  • Cluster administration and configuration

In Detail

This book helps developers become more comfortable with, and proficient at, solving problems in the Hadoop space. Readers will become familiar with a wide variety of Hadoop-related tools and best practices for implementation.

Hadoop Real-World Solutions Cookbook will teach readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.

Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, and then solve, technical challenges; the recipes can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. The book covers (un)loading to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.

Hadoop Real-World Solutions Cookbook will give readers the examples they need to apply Hadoop technology to their own problems.
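To give a flavor of the recipe style (this sketch is not from the book), here is a minimal Hadoop Streaming-style mapper in Python for one of the listed tasks, counting distinct IPs in weblog data; it assumes the client IP is the first whitespace-separated field of each log line:

```python
def map_ips(lines):
    """Emit one tab-separated (ip, 1) pair per weblog line, the
    key/value format Hadoop Streaming expects on stdout. Assumes the
    client IP is the first whitespace-separated field of each line."""
    pairs = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            pairs.append("%s\t1" % fields[0])
    return pairs

# In a real streaming job the script would run as:
#   import sys
#   for pair in map_ips(sys.stdin):
#       print(pair)
# with Hadoop piping input splits to stdin and shuffling stdout to the
# reducers, invoked roughly as:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer ... \
#       -input weblogs -output ip_counts
```

A reducer (or a combiner, as in the Chapter 6 recipe) would then collapse the repeated IP keys to produce the distinct count.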

Approach

Cookbook recipes demonstrate Hadoop in action and then explain the concepts behind the code.

Who this book is for

This book is ideal for developers who wish to have a better understanding of Hadoop application development and associated tools, and for developers who understand Hadoop conceptually but want practical examples of real-world applications.
