Hadoop MapReduce Cookbook


Overview
  • Learn to process large and complex data sets, starting simply, then diving in deep
  • Solve complex big data problems such as classifications, finding relationships, online marketing and recommendations
  • More than 50 Hadoop MapReduce recipes, presented in a simple and straightforward manner, with step-by-step instructions and real world examples

Book Details

Language : English
Paperback : 300 pages [ 235mm x 191mm ]
Release Date : January 2013
ISBN : 1849517282
ISBN 13 : 9781849517287
Author(s) : Srinath Perera, Thilina Gunarathne
Topics and Technologies : All Books, Big Data and Business Intelligence, Data, Cloud, Cookbooks, Open Source

Table of Contents

Preface
Chapter 1: Getting Hadoop Up and Running in a Cluster
Chapter 2: Advanced HDFS
Chapter 3: Advanced Hadoop MapReduce Administration
Chapter 4: Developing Complex Hadoop MapReduce Applications
Chapter 5: Hadoop Ecosystem
Chapter 6: Analytics
Chapter 7: Searching and Indexing
Chapter 8: Classifications, Recommendations, and Finding Relationships
Chapter 9: Mass Text Data Processing
Chapter 10: Cloud Deployments: Using Hadoop on Clouds
Index
  • Chapter 1: Getting Hadoop Up and Running in a Cluster
    • Introduction
    • Setting up Hadoop on your machine
    • Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
    • Adding the combiner step to the WordCount MapReduce program
    • Setting up HDFS
    • Using HDFS monitoring UI
    • HDFS basic command-line file operations
    • Setting Hadoop in a distributed cluster environment
    • Running the WordCount program in a distributed cluster environment
    • Using MapReduce monitoring UI
  • Chapter 2: Advanced HDFS
    • Introduction
    • Benchmarking HDFS
    • Adding a new DataNode
    • Decommissioning DataNodes
    • Using multiple disks/volumes and limiting HDFS disk usage
    • Setting HDFS block size
    • Setting the file replication factor
    • Using HDFS Java API
    • Using HDFS C API (libhdfs)
    • Mounting HDFS (Fuse-DFS)
    • Merging files in HDFS
  • Chapter 3: Advanced Hadoop MapReduce Administration
    • Introduction
    • Tuning Hadoop configurations for cluster deployments
    • Running benchmarks to verify the Hadoop installation
    • Reusing Java VMs to improve the performance
    • Fault tolerance and speculative execution
    • Debug scripts – analyzing task failures
    • Setting failure percentages and skipping bad records
    • Shared-user Hadoop clusters – using fair and other schedulers
    • Hadoop security – integrating with Kerberos
    • Using the Hadoop Tool interface
  • Chapter 4: Developing Complex Hadoop MapReduce Applications
    • Introduction
    • Choosing appropriate Hadoop data types
    • Implementing a custom Hadoop Writable data type
    • Implementing a custom Hadoop key type
    • Emitting data of different value types from a mapper
    • Choosing a suitable Hadoop InputFormat for your input data format
    • Adding support for new input data formats – implementing a custom InputFormat
    • Formatting the results of MapReduce computations – using Hadoop OutputFormats
    • Hadoop intermediate (map to reduce) data partitioning
    • Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
    • Using Hadoop with legacy applications – Hadoop Streaming
    • Adding dependencies between MapReduce jobs
    • Hadoop counters for reporting custom metrics
  • Chapter 5: Hadoop Ecosystem
    • Introduction
    • Installing HBase
    • Data random access using Java client APIs
    • Running MapReduce jobs on HBase (table input/output)
    • Installing Pig
    • Running your first Pig command
    • Set operations (join, union) and sorting with Pig
    • Installing Hive
    • Running a SQL-style query with Hive
    • Performing a join with Hive
    • Installing Mahout
    • Running K-means with Mahout
    • Visualizing K-means results
  • Chapter 6: Analytics
    • Introduction
    • Simple analytics using MapReduce
    • Performing Group-By using MapReduce
    • Calculating frequency distributions and sorting using MapReduce
    • Plotting the Hadoop results using GNU Plot
    • Calculating histograms using MapReduce
    • Calculating scatter plots using MapReduce
    • Parsing a complex dataset with Hadoop
    • Joining two datasets using MapReduce
  • Chapter 7: Searching and Indexing
    • Introduction
    • Generating an inverted index using Hadoop MapReduce
    • Intra-domain web crawling using Apache Nutch
    • Indexing and searching web documents using Apache Solr
    • Configuring Apache HBase as the backend data store for Apache Nutch
    • Deploying Apache HBase on a Hadoop cluster
    • Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
    • ElasticSearch for indexing and searching
    • Generating the in-links graph for crawled web pages
  • Chapter 9: Mass Text Data Processing
    • Introduction
    • Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python
    • Data de-duplication using Hadoop Streaming
    • Loading large datasets to an Apache HBase data store using importtsv and bulkload tools
    • Creating TF and TF-IDF vectors for the text data
    • Clustering the text data
    • Topic discovery using Latent Dirichlet Allocation (LDA)
    • Document classification using Mahout Naive Bayes classifier
  • Chapter 10: Cloud Deployments: Using Hadoop on Clouds
    • Introduction
    • Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
    • Saving money by using Amazon EC2 Spot Instances to execute EMR job flows
    • Executing a Pig script using EMR
    • Executing a Hive script using EMR
    • Creating an Amazon EMR job flow using the Command Line Interface
    • Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR
    • Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs
    • Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
    • Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment

Srinath Perera

Srinath Perera is a senior software architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as a visiting faculty member at the Department of Computer Science and Engineering, University of Moratuwa. He is a co-founder of the Apache Axis2 open source project, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. He is also a committer on the Apache open source projects Axis, Axis2, and Geronimo. He received his Ph.D. and M.Sc. in Computer Science from Indiana University, Bloomington, USA, and his Bachelor of Science in Computer Science and Engineering from the University of Moratuwa, Sri Lanka. He has authored many technical and peer-reviewed research articles; more details can be found on his website. He is a frequent speaker at technical venues and has worked with large-scale distributed systems for many years, working closely with Big Data technologies such as Hadoop and Cassandra on a daily basis. He also teaches a parallel programming graduate class at the University of Moratuwa, which is primarily based on Hadoop.

Thilina Gunarathne

Thilina Gunarathne is a Ph.D. candidate at the School of Informatics and Computing, Indiana University. He has extensive experience in using Apache Hadoop and related technologies for large-scale data-intensive computations. His current work focuses on developing technologies to perform scalable and efficient large-scale data-intensive computations in cloud environments. Thilina has published many articles and peer-reviewed research papers in the areas of distributed and parallel computing, including several papers on extending the MapReduce model to perform efficient data mining and data analytics computations on clouds. He is a regular presenter in both academic and industry settings. Thilina has contributed to several open source projects at the Apache Software Foundation as a committer and a PMC member since 2005. Before starting his graduate studies, Thilina worked as a Senior Software Engineer at WSO2 Inc., focusing on open source middleware development. He received his B.Sc. in Computer Science and Engineering from the University of Moratuwa, Sri Lanka, in 2006 and his M.Sc. in Computer Science from Indiana University, Bloomington, in 2009. Thilina expects to receive his doctorate in the field of distributed and parallel computing in 2013.

Code Downloads

Download the code and support files for this book.


Submit Errata

Please let us know if you have found any errors not already listed here by completing our errata submission form. Our editors will check them and add them to the list. Thank you.


Errata

- 1 submitted: last submission 03 Apr 2014

Errata Type: Code         Errata Page: 91

The code is:

value = new LogWritable(userIP, timestamp, request, status, bytes);

It should be:

value = new LogWritable(userIP, timestamp, request, bytes, status);

Note that the same change must be made in the code file provided with this book (Hadoop MapReduce Cookbook\7287OS_Code\chapter4\src\demo\LogFileRecordReader.java).
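The erratum is purely an argument-order fix. As a minimal sketch of why it matters (the real LogWritable class ships with the book's code; the field names and types below are assumptions inferred from the constructor call, not from the actual class), swapping two same-typed parameters compiles cleanly but silently stores the HTTP status code as the byte count and vice versa:

```java
// Hypothetical sketch of a LogWritable-style holder; field names are
// inferred from the errata's constructor call, not from the book's class.
public class LogWritable {
    private final String userIP;
    private final String timestamp;
    private final String request;
    private final long bytes;  // response size in bytes
    private final int status;  // HTTP status code

    // Corrected order per the errata: (..., request, bytes, status)
    public LogWritable(String userIP, String timestamp, String request,
                       long bytes, int status) {
        this.userIP = userIP;
        this.timestamp = timestamp;
        this.request = request;
        this.bytes = bytes;
        this.status = status;
    }

    public long getBytes() { return bytes; }
    public int getStatus() { return status; }

    public static void main(String[] args) {
        // Passing (status, bytes) instead of (bytes, status) would compile
        // but record a 200-status, 1024-byte response as status 1024 --
        // exactly the bug the errata corrects.
        LogWritable v = new LogWritable("10.0.0.1", "01/Jan/2013:00:00:00",
                                        "GET /index.html", 1024, 200);
        System.out.println(v.getStatus() + " " + v.getBytes());
    }
}
```

Because both swapped arguments are numeric, the compiler cannot catch the mistake; only the stored values reveal it, which is why the errata also asks you to fix the accompanying code file.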

Sample chapters

You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.

                      Frequently bought together

                      Hadoop MapReduce Cookbook +    Alfresco 3 Records Management =
                      50% Off
                      the second eBook
                      Price for both: £30.75

                      Buy both these recommended eBooks together and get 50% off the cheapest eBook.

What you will learn from this book

• How to install Hadoop MapReduce and HDFS to begin running examples
• How to configure and administer Hadoop and HDFS securely
• How Hadoop works internally, and how to extend it to suit your needs
• How to use HBase, Hive, Pig, Mahout, and Nutch to get things done easily and efficiently
• How to use MapReduce to solve many types of analytics problems
• How to solve complex problems such as classifications, finding relationships, online marketing, and recommendations
• How to use MapReduce for massive text data processing
• How to use cloud environments to perform Hadoop computations

In Detail

We are facing an avalanche of data. The unstructured data we gather can contain many insights that might hold the key to business success or failure. Harnessing the ability to analyze and process this data with Hadoop MapReduce is one of the most highly sought-after skills in today's job market.

"Hadoop MapReduce Cookbook" is a one-stop guide to processing large and complex data sets using the Hadoop ecosystem. The book introduces you to simple examples and then dives deep to solve in-depth big data use cases.

"Hadoop MapReduce Cookbook" presents more than 50 ready-to-use Hadoop MapReduce recipes in a simple and straightforward manner, with step-by-step instructions and real-world examples.

Start by learning how to install, configure, extend, and administer Hadoop. Then write simple examples, learn MapReduce patterns, harness the Hadoop landscape, and finally jump to the cloud.

The book covers many exciting topics, such as setting up Hadoop security and using MapReduce to solve analytics, classification, online marketing, recommendation, and search use cases. You will learn how to harness components of the Hadoop ecosystem, including HBase, Hadoop, Pig, and Mahout, and then learn how to set up cloud environments to perform Hadoop MapReduce computations.

"Hadoop MapReduce Cookbook" teaches you how to process large and complex data sets using real examples, providing a comprehensive guide to getting things done with Hadoop MapReduce.

Approach

Individual, self-contained code recipes. Solve specific problems using individual recipes, or work through the book to develop your capabilities.

Who this book is for

If you are a big data enthusiast striving to use Hadoop to solve your problems, this book is for you. Aimed at Java programmers with some knowledge of Hadoop MapReduce, it is also a comprehensive reference for developers and system administrators who want to get up to speed with Hadoop.
