Hadoop Beginner's Guide


eBook: $29.99 (PDF, PacktLib, ePub, and Mobi formats); currently $25.49 (save 15%)
Print + free eBook + free PacktLib access (print cover price $49.99): $79.98; currently $49.99 (save 37%)
Free shipping to the UK, US, Europe, and selected countries in Asia.
Overview
  • Learn tools and techniques that let you approach big data with relish and not fear
  • Shows how to build a complete infrastructure to handle your needs as your data grows
  • Hands-on examples in each chapter give the big picture while also giving direct experience

Book Details

Language : English
Paperback : 398 pages [ 235mm x 191mm ]
Release Date : February 2013
ISBN : 1849517304
ISBN 13 : 9781849517300
Author(s) : Garry Turkington
Topics and Technologies : All Books, Big Data and Business Intelligence, Data, Beginner's Guides, Cloud, Open Source

Table of Contents

Preface
Chapter 1: What It's All About
Chapter 2: Getting Hadoop Up and Running
Chapter 3: Understanding MapReduce
Chapter 4: Developing MapReduce Programs
Chapter 5: Advanced MapReduce Techniques
Chapter 6: When Things Break
Chapter 7: Keeping Things Running
Chapter 8: A Relational View on Data with Hive
Chapter 9: Working with Relational Databases
Chapter 10: Data Collection with Flume
Chapter 11: Where to Go Next
Appendix: Pop Quiz Answers
Index
  • Chapter 1: What It's All About
    • Big data processing
      • The value of data
      • Historically for the few and not the many
        • Classic data processing systems
        • Limiting factors
      • A different approach
        • All roads lead to scale-out
        • Share nothing
        • Expect failure
        • Smart software, dumb hardware
        • Move processing, not data
        • Build applications, not infrastructure
      • Hadoop
        • Thanks, Google
        • Thanks, Doug
        • Thanks, Yahoo
        • Parts of Hadoop
        • Common building blocks
        • HDFS
        • MapReduce
        • Better together
        • Common architecture
        • What it is and isn't good for
    • Cloud computing with Amazon Web Services
      • Too many clouds
      • A third way
      • Different types of costs
      • AWS – infrastructure on demand from Amazon
        • Elastic Compute Cloud (EC2)
        • Simple Storage Service (S3)
        • Elastic MapReduce (EMR)
      • What this book covers
        • A dual approach
    • Summary
    • Chapter 2: Getting Hadoop Up and Running
      • Hadoop on a local Ubuntu host
        • Other operating systems
      • Time for action – checking the prerequisites
        • Setting up Hadoop
          • A note on versions
      • Time for action – downloading Hadoop
      • Time for action – setting up SSH
        • Configuring and running Hadoop
      • Time for action – using Hadoop to calculate Pi
        • Three modes
      • Time for action – configuring the pseudo-distributed mode
        • Configuring the base directory and formatting the filesystem
      • Time for action – changing the base HDFS directory
      • Time for action – formatting the NameNode
        • Starting and using Hadoop
      • Time for action – starting Hadoop
      • Time for action – using HDFS
      • Time for action – WordCount, the Hello World of MapReduce
        • Monitoring Hadoop from the browser
          • The HDFS web UI
      • Using Elastic MapReduce
        • Setting up an account on Amazon Web Services
          • Creating an AWS account
          • Signing up for the necessary services
      • Time for action – WordCount in EMR using the management console
        • Other ways of using EMR
          • AWS credentials
          • The EMR command-line tools
        • The AWS ecosystem
      • Comparison of local versus EMR Hadoop
      • Summary
      • Chapter 3: Understanding MapReduce
        • Key/value pairs
          • What it means
          • Why key/value data?
            • Some real-world examples
          • MapReduce as a series of key/value transformations
        • The Hadoop Java API for MapReduce
          • The 0.20 MapReduce Java API
            • The Mapper class
            • The Reducer class
            • The Driver class
        • Writing MapReduce programs
        • Time for action – setting up the classpath
        • Time for action – implementing WordCount
        • Time for action – building a JAR file
        • Time for action – running WordCount on a local Hadoop cluster
        • Time for action – running WordCount on EMR
          • The pre-0.20 Java MapReduce API
          • Hadoop-provided mapper and reducer implementations
        • Time for action – WordCount the easy way
        • Walking through a run of WordCount
          • Startup
          • Splitting the input
          • Task assignment
          • Task startup
          • Ongoing JobTracker monitoring
          • Mapper input
          • Mapper execution
          • Mapper output and reduce input
          • Partitioning
          • The optional partition function
          • Reducer input
          • Reducer execution
          • Reducer output
          • Shutdown
          • That's all there is to it!
          • Apart from the combiner…maybe
            • Why have a combiner?
        • Time for action – WordCount with a combiner
          • When you can use the reducer as the combiner
      • Time for action – fixing WordCount to work with a combiner
        • Reuse is your friend
      • Hadoop-specific data types
        • The Writable and WritableComparable interfaces
        • Introducing the wrapper classes
          • Primitive wrapper classes
          • Array wrapper classes
          • Map wrapper classes
      • Time for action – using the Writable wrapper classes
        • Other wrapper classes
        • Making your own
      • Input/output
        • Files, splits, and records
        • InputFormat and RecordReader
        • Hadoop-provided InputFormat
        • Hadoop-provided RecordReader
        • Output formats and RecordWriter
        • Hadoop-provided OutputFormat
        • Don't forget Sequence files
      • Summary
        • Chapter 4: Developing MapReduce Programs
          • Using languages other than Java with Hadoop
            • How Hadoop Streaming works
            • Why to use Hadoop Streaming
          • Time for action – WordCount using Streaming
            • Differences in jobs when using Streaming
          • Analyzing a large dataset
            • Getting the UFO sighting dataset
            • Getting a feel for the dataset
          • Time for action – summarizing the UFO data
            • Examining UFO shapes
        • Time for action – summarizing the shape data
        • Time for action – correlating sighting duration to UFO shape
          • Using Streaming scripts outside Hadoop
        • Time for action – performing the shape/time analysis from the command line
          • Java shape and location analysis
        • Time for action – using ChainMapper for field validation/analysis
          • Too many abbreviations
          • Using the Distributed Cache
        • Time for action – using the Distributed Cache to improve location output
        • Counters, status, and other output
        • Time for action – creating counters, task states, and writing log output
          • Too much information!
        • Summary
          • Chapter 5: Advanced MapReduce Techniques
            • Simple, advanced, and in-between
            • Joins
              • When this is a bad idea
              • Map-side versus reduce-side joins
              • Matching account and sales information
            • Time for action – reduce-side joins using MultipleInputs
              • DataJoinMapper and TaggedMapperOutput
            • Implementing map-side joins
              • Using the Distributed Cache
              • Pruning data to fit in the cache
              • Using a data representation instead of raw data
              • Using multiple mappers
            • To join or not to join...
          • Graph algorithms
            • Graph 101
            • Graphs and MapReduce – a match made somewhere
            • Representing a graph
          • Time for action – representing the graph
            • Overview of the algorithm
              • The mapper
              • The reducer
              • Iterative application
          • Time for action – creating the source code
          • Time for action – the first run
          • Time for action – the second run
          • Time for action – the third run
          • Time for action – the fourth and last run
            • Running multiple jobs
            • Final thoughts on graphs
          • Using language-independent data structures
            • Candidate technologies
            • Introducing Avro
          • Time for action – getting and installing Avro
            • Avro and schemas
          • Time for action – defining the schema
          • Time for action – creating the source Avro data with Ruby
          • Time for action – consuming the Avro data with Java
            • Using Avro within MapReduce
          • Time for action – generating shape summaries in MapReduce
          • Time for action – examining the output data with Ruby
          • Time for action – examining the output data with Java
            • Going further with Avro
          • Summary
            • Chapter 6: When Things Break
              • Failure
                • Embrace failure
                • Or at least don't fear it
                • Don't try this at home
                • Types of failure
                • Hadoop node failure
                  • The dfsadmin command
                  • Cluster setup, test files, and block sizes
                  • Fault tolerance and Elastic MapReduce
              • Time for action – killing a DataNode process
                • NameNode and DataNode communication
            • Time for action – the replication factor in action
            • Time for action – intentionally causing missing blocks
              • When data may be lost
              • Block corruption
            • Time for action – killing a TaskTracker process
              • Comparing the DataNode and TaskTracker failures
              • Permanent failure
            • Killing the cluster masters
            • Time for action – killing the JobTracker
              • Starting a replacement JobTracker
            • Time for action – killing the NameNode process
              • Starting a replacement NameNode
              • The role of the NameNode in more detail
              • File systems, files, blocks, and nodes
              • The single most important piece of data in the cluster – fsimage
              • DataNode startup
              • Safe mode
              • SecondaryNameNode
              • So what to do when the NameNode process has a critical failure?
              • BackupNode/CheckpointNode and NameNode HA
              • Hardware failure
              • Host failure
              • Host corruption
              • The risk of correlated failures
            • Task failure due to software
              • Failure of slow running tasks
            • Time for action – causing task failure
              • Hadoop's handling of slow-running tasks
              • Speculative execution
              • Hadoop's handling of failing tasks
            • Task failure due to data
              • Handling dirty data through code
              • Using Hadoop's skip mode
            • Time for action – handling dirty data by using skip mode
              • To skip or not to skip...
            • Summary
              • Chapter 7: Keeping Things Running
                • A note on EMR
                • Hadoop configuration properties
                  • Default values
                • Time for action – browsing default properties
                  • Additional property elements
                  • Default storage location
                  • Where to set properties
                • Setting up a cluster
                  • How many hosts?
                    • Calculating usable space on a node
                    • Location of the master nodes
                    • Sizing hardware
                    • Processor / memory / storage ratio
                    • EMR as a prototyping platform
                  • Special node requirements
                  • Storage types
                    • Commodity versus enterprise class storage
                    • Single disk versus RAID
                    • Finding the balance
                    • Network storage
                  • Hadoop networking configuration
                    • How blocks are placed
                    • Rack awareness
                • Time for action – examining the default rack configuration
                • Time for action – adding a rack awareness script
                  • What is commodity hardware anyway?
                • Cluster access control
                  • The Hadoop security model
                • Time for action – demonstrating the default security
                  • User identity
                  • More granular access control
                • Working around the security model via physical access control
              • Managing the NameNode
                • Configuring multiple locations for the fsimage class
              • Time for action – adding an additional fsimage location
                • Where to write the fsimage copies
              • Swapping to another NameNode host
                • Having things ready before disaster strikes
              • Time for action – swapping to a new NameNode host
                • Don't celebrate quite yet!
                • What about MapReduce?
              • Managing HDFS
                • Where to write data
                • Using balancer
                  • When to rebalance
              • MapReduce management
                • Command line job management
                • Job priorities and scheduling
              • Time for action – changing job priorities and killing a job
                • Alternative schedulers
                  • Capacity Scheduler
                  • Fair Scheduler
                  • Enabling alternative schedulers
                  • When to use alternative schedulers
              • Scaling
                • Adding capacity to a local Hadoop cluster
                • Adding capacity to an EMR job flow
                  • Expanding a running job flow
              • Summary
                • Chapter 8: A Relational View on Data with Hive
                  • Overview of Hive
                    • Why use Hive?
                    • Thanks, Facebook!
                  • Setting up Hive
                    • Prerequisites
                    • Getting Hive
                  • Time for action – installing Hive
                  • Using Hive
                  • Time for action – creating a table for the UFO data
                  • Time for action – inserting the UFO data
                    • Validating the data
                  • Time for action – validating the table
                  • Time for action – redefining the table with the correct column separator
                    • Hive tables – real or not?
                  • Time for action – creating a table from an existing file
                  • Time for action – performing a join
                    • Hive and SQL views
                  • Time for action – using views
                    • Handling dirty data in Hive
                  • Time for action – exporting query output
                    • Partitioning the table
                  • Time for action – making a partitioned UFO sighting table
                    • Bucketing, clustering, and sorting... oh my!
                    • User Defined Function
                  • Time for action – adding a new User Defined Function (UDF)
                    • To preprocess or not to preprocess...
                    • Hive versus Pig
                    • What we didn't cover
                  • Hive on Amazon Web Services
                  • Time for action – running UFO analysis on EMR
                    • Using interactive job flows for development
                    • Integration with other AWS products
                  • Summary
                  • Chapter 9: Working with Relational Databases
                    • Common data paths
                      • Hadoop as an archive store
                      • Hadoop as a preprocessing step
                      • Hadoop as a data input tool
                      • The serpent eats its own tail
                    • Setting up MySQL
                    • Time for action – installing and setting up MySQL
                      • Did it have to be so hard?
                    • Time for action – configuring MySQL to allow remote connections
                      • Don't do this in production!
                    • Time for action – setting up the employee database
                      • Be careful with data file access rights
                    • Getting data into Hadoop
                      • Using MySQL tools and manual import
                      • Accessing the database from the mapper
                      • A better way – introducing Sqoop
                    • Time for action – downloading and configuring Sqoop
                      • Sqoop and Hadoop versions
                      • Sqoop and HDFS
                  • Time for action – exporting data from MySQL to HDFS
                    • Sqoop's architecture
                  • Importing data into Hive using Sqoop
                  • Time for action – exporting data from MySQL into Hive
                  • Time for action – a more selective import
                    • Datatype issues
                  • Time for action – using a type mapping
                  • Time for action – importing data from a raw query
                    • Sqoop and Hive partitions
                    • Field and line terminators
                  • Getting data out of Hadoop
                    • Writing data from within the reducer
                    • Writing SQL import files from the reducer
                    • A better way – Sqoop again
                  • Time for action – importing data from Hadoop into MySQL
                    • Differences between Sqoop imports and exports
                    • Inserts versus updates
                    • Sqoop and Hive exports
                  • Time for action – importing Hive data into MySQL
                  • Time for action – fixing the mapping and re-running the export
                    • Other Sqoop features
                  • AWS considerations
                    • Considering RDS
                  • Summary
                    • Chapter 10: Data Collection with Flume
                      • A note about AWS
                      • Data data everywhere
                        • Types of data
                        • Getting network traffic into Hadoop
                      • Time for action – getting web server data into Hadoop
                        • Getting files into Hadoop
                        • Hidden issues
                          • Keeping network data on the network
                          • Hadoop dependencies
                          • Reliability
                          • Re-creating the wheel
                          • A common framework approach
                      • Introducing Apache Flume
                        • A note on versioning
                      • Time for action – installing and configuring Flume
                        • Using Flume to capture network data
                      • Time for action – capturing network traffic to a log file
                      • Time for action – logging to the console
                        • Writing network data to log files
                      • Time for action – capturing the output of a command in a flat file
                        • Logs versus files
                    • Time for action – capturing a remote file in a local flat file
                      • Sources, sinks, and channels
                        • Sources
                        • Sinks
                        • Channels
                        • Or roll your own
                      • Understanding the Flume configuration files
                      • It's all about events
                    • Time for action – writing network traffic onto HDFS
                    • Time for action – adding timestamps
                      • To Sqoop or to Flume...
                    • Time for action – multi level Flume networks
                    • Time for action – writing to multiple sinks
                      • Selectors replicating and multiplexing
                      • Handling sink failure
                      • Next, the world
                    • The bigger picture
                      • Data lifecycle
                      • Staging data
                      • Scheduling
                    • Summary
                      • Chapter 11: Where to Go Next
                        • What we did and didn't cover in this book
                        • Upcoming Hadoop changes
                        • Alternative distributions
                          • Why alternative distributions?
                            • Bundling
                            • Free and commercial extensions
                            • Choosing a distribution
                        • Other Apache projects
                          • HBase
                          • Oozie
                          • Whir
                          • Mahout
                          • MRUnit
                        • Other programming abstractions
                          • Pig
                          • Cascading
                        • AWS resources
                          • HBase on EMR
                          • SimpleDB
                          • DynamoDB
                        • Sources of information
                          • Source code
                          • Mailing lists and forums
                          • LinkedIn groups
                          • HUGs
                          • Conferences
                        • Summary

                          Garry Turkington

                          Garry Turkington has 14 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current roles as VP Data Engineering at Improve Digital and the company's lead architect, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital he spent time at Amazon UK, where he led several software development teams building systems that process the Amazon catalog data for every item worldwide. Prior to this he spent a decade in various government positions in both the UK and USA. He has BSc and PhD degrees in computer science from Queen's University Belfast in Northern Ireland and an MEng in Systems Engineering from Stevens Institute of Technology in the USA.

                          Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.
                          - Slashdot review by Si Dunn

                           

                          Code Downloads

                          Download the code and support files for this book.


                          Submit Errata

                          Please let us know if you have found any errors not already on this list by completing our errata submission form. Our editors will check them and add them to the list. Thank you.


                          Errata

                          - 13 submitted: last submission 06 Dec 2013

                          Errata type: code | Page number: 38 | Errata date: 09 April 2013

                          This code snippet

                          $ hadoop -mkdir /user
                          $ hadoop -mkdir /user/hadoop
                          $ hadoop fs -ls /user

                          Should be

                          $ hadoop dfs -mkdir /user
                          $ hadoop dfs -mkdir /user/hadoop
                          $ hadoop dfs -ls /user

                          Errata type: others | Page number: 43 | Errata date: 09 April 2013

                          The port number on the screenshot should be 50030

                          Errata type: others | Page number: 44 | Errata date: 09 April 2013

                          The port number on the screenshot should be 50070

                          Errata type: code | Page number: 38 | Errata date: 13 May 2013

                          This line

                          drwxr-xr-x - hadoop supergroup 0 2012-10-26 23:09 /user/Hadoop

                          Should be

                          drwxr-xr-x - hadoop supergroup 0 2012-10-26 23:09 /user/hadoop

                          Errata type: technical | Page number: 38 | Errata date: 16 May 2013

                          This line "Don't worry if your output looks a little different." should be appended to the second paragraph of "what just happened?" section, as follows:

                          "After starting these components, we use the JDK's jps utility to see which Java processes are running,
                           and, as the output looks good, we then use Hadoop's dfs utility to list the root of the HDFS filesystem.
                           Don't worry if your output looks a little different."

                           

                          Errata type: code | Page number: 96 | Errata date: 13 June 2013

                          This command in step 5

                          $ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar

                          Should be

                          $ hadoop jar hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar

                          Errata type: code | Page number: 40 | Errata date: 13 June 2013

                          This command in step 2

                          $ Hadoop Hadoop/hadoop-examples-1.0.4.jar wordcount data out

                          Should be

                          $ hadoop jar Hadoop/hadoop-examples-1.0.4.jar wordcount data out

                          Errata type: code | Page number: 95 | Errata date: July 15, 2013

                          This line

                          #/bin/env ruby

                          should be

                          #!/usr/bin/env ruby

                          Errata type: code | Page number: 99 | Errata date: July 15, 2013

                          This line

                          #!/usr/bin/env ruby
                          while line = gets
                          puts "total\t1"
                          parts = line.split("\t")
                          puts "badline\t1" if parts.size != 6
                          puts "sighted\t1" if !parts[0].empty?
                          puts "recorded\t1" if !parts[1].empty?
                          puts "location\t1" if !parts[2].empty?
                          puts "shape\t1" if !parts[3].empty?
                          puts "duration\t1" if !parts[4].empty?
                          puts "description\t1" if !parts[5].empty?
                          end

                          Should be:

                          #!/usr/bin/env ruby
                          while line = gets
                              puts "total\t1"
                              parts = line.split("\t")
                              if parts.size != 6
                                  puts "badline\t1"
                              else
                                  puts "sighted\t1" if !parts[0].empty?
                                  puts "recorded\t1" if !parts[1].empty?
                                  puts "location\t1" if !parts[2].empty?
                                  puts "shape\t1" if !parts[3].empty?
                                  puts "duration\t1" if !parts[4].empty?
                                  puts "description\t1" if !parts[5].empty?
                              end
                          end

                          Errata type: code | Page number: 41 | Errata date: September 12, 2013

                          This line

                          $ hadoop fs -cat out/part-0-00000

                          should be

                          $ hadoop fs -cat out/part-r-00000

                           

                          Errata type: technical | Page number: 41 and 42 | Errata date: October 23, 2013

                          The dfs.default.name variable holds the location of the NameNode and is required by both HDFS and MapReduce components, which explains why it's in core-site.xml and not hdfs-site.xml.

                          and

                          The mapred.job.tracker variable holds the location of the JobTracker just like dfs.default.name holds the location of the NameNode. Because only MapReduce components need know this location, it is in mapred-site.xml.

                          should be

                          The fs.default.name variable holds the location of the NameNode and is required by both HDFS and MapReduce components, which explains why it's in core-site.xml and not hdfs-site.xml.

                          and

                          The mapred.job.tracker variable holds the location of the JobTracker just like fs.default.name holds the location of the NameNode. Because only MapReduce components need know this location, it is in mapred-site.xml.
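
                          For reference, a minimal pseudo-distributed setup along these lines might contain property elements like the following (each inside the file's <configuration> element); the localhost addresses are illustrative values, not text quoted from the book:

                          <!-- core-site.xml: read by both HDFS and MapReduce components -->
                          <property>
                            <name>fs.default.name</name>
                            <value>hdfs://localhost:9000</value>
                          </property>

                          <!-- mapred-site.xml: read only by MapReduce components -->
                          <property>
                            <name>mapred.job.tracker</name>
                            <value>localhost:9001</value>
                          </property>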

                          One of our readers suggested the following change on Page 95 of this book:

                          The code contained in wcmapper.rb:
                          
                          while line = gets
                                  words = line.split("\t")
                                  words.each{ |word| puts word.strip+"\t1"}
                          end
                          
                          should be:
                          
                          while line = gets
                                  words = line.split
                                  words.each{ |word| puts word.strip+"\t1"}
                          end
                          
                          This causes the line to be split on whitespace (the default for split) rather than on a tab (\t). Without this change the script does not work as intended, outputting the whole line rather than individual words.
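
                          As a quick illustration of the difference (not from the book; the sample line is made up):

                          # Hypothetical space-separated input line
                          line = "blue light hovering"
                          p line.split("\t")   # => ["blue light hovering"]  (no tab, so the whole line is one "word")
                          p line.split         # => ["blue", "light", "hovering"]  (default: split on whitespace)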

                          This is what the author has to say about this:

                          It really depends on the format of the input data: if the input data contains words separated by spaces,
                          then the reader is correct. On the other hand, if the words are separated by tabs ('\t'), the original code
                          should work without problems. So, had there been a description of the data in the file "test.txt" on
                          line 1 of page 4, we would be able to give a definite answer to this question.



                          One of our readers suggested the following change on Page 103/104 of this book:

                          The code in shapetimemapper.rb:
                          
                          pattern = Regexp.new /\d* ?((min)|(sec))/
                          while line = gets
                            parts = line.split("\t")
                            if parts.size == 6
                              shape = parts[3].strip
                              duration = parts[4].strip.downcase
                              if !shape.empty? && !duration.empty?
                                match = pattern.match(duration)
                                time = /\d*/.match(match[0])[0]
                                unit = match[1]
                                time = Integer(time)
                                time = time * 60 if unit == "min"
                                puts shape+"\t"+time.to_s
                              end
                            end
                          end
                          
                          needs to be amended thus:
                          
                          pattern = Regexp.new /\d* ?((min)|(sec)|(hr)|(hour))/
                          while line = gets
                            parts = line.split("\t")
                            if parts.size == 6
                              shape = parts[3].strip
                              duration = parts[4].strip.downcase
                              if !shape.empty? && !duration.empty?
                                match = pattern.match(duration)
                                if !match.nil?
                                  time = /\d*/.match(match[0])[0]
                                  if !time.empty?
                                    unit = match[1]
                                    begin
                                      time = Integer(time)
                                    rescue
                                    else
                                      time = time * 60 if unit == "min"
                                      time = time * 60 * 60 if (unit == "hr" ||
                                          unit == "hour")
                                      puts shape+"\t"+time.to_s
                                    end
                                  end
                                end
                              end
                            end
                          end
                          
                          This deals with the various errors raised by inconsistent field contents in  the source data
                          and adds checks for durations expressed in hours.

                          This is what the author has to say about this:

                          The reader is correct in a sense, in that he/she considers the irregularities of the time descriptions,
                          but the suggestion is not complete either. To capture all the varieties of time descriptions,
                          I suggest the reader look at the ufo_awesome.tsv file.
                          For example, I use the following command on my Linux machine to list the varieties of
                          duration descriptions:


                          awk 'BEGIN {FS="\t"} {print $5}' ufo_awesome.tsv | less

                          We can get the following representative duration descriptions:
                          10 seconds (?)
                          1-2 min.
                          0.5 sec.
                          3 nights
                          90 min.
                          10 sec.
                          3 hrs.
                          1-2 sec.
                          30 MIN.
                          a few seconds
                          60-90 seconds
                          Hours
                          3 - 4 minutes
                          several minutes
                          5 min 30 sec
                          10:00
                          about 5 min.
                          5 minutes or less


                          The original code can be considered sufficient for instructional purposes,
                          as it covers the majority of the data records, and I think simplicity is
                          important for readers to understand the code! On the other hand,
                          if the reader wants to cover more of the data records,
                          he/she will definitely need to put more effort into data cleaning and regularization.
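
                          Purely to illustrate the kind of cleaning the author is describing, here is a rough Ruby sketch (not from the book) that normalizes a few of the duration formats listed above; the helper name and the unit table are assumptions made for this example:

                          # Hypothetical helper: convert a handful of duration formats to seconds.
                          UNIT_SECONDS = {
                            "sec" => 1, "second" => 1,
                            "min" => 60, "minute" => 60,
                            "hr" => 3600, "hour" => 3600
                          }

                          def duration_in_seconds(text)
                            match = /(\d+(?:\.\d+)?)[^a-z]*(sec|second|min|minute|hr|hour)/.match(text.downcase)
                            return nil if match.nil?            # entries such as "Hours" or "10:00" are skipped
                            (match[1].to_f * UNIT_SECONDS[match[2]]).round
                          end

                          p duration_in_seconds("90 min.")        # => 5400
                          p duration_in_seconds("3 hrs.")         # => 10800
                          p duration_in_seconds("a few seconds")  # => nil (no leading number)

                          Even this only covers some of the representative records; entries like "3 nights" or "5 minutes or less" would still need further rules.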


                          
                          

                          Sample chapters

                          You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.


                          What you will learn from this book

                          • The trends that led to Hadoop and cloud services, giving the background to know when to use the technology
                          • Best practices for setup and configuration of Hadoop clusters, tailoring the system to the problem at hand
                          • Developing applications to run on Hadoop with examples in Java and Ruby
                          • How Amazon Web Services can be used to deliver a hosted Hadoop solution and how this differs from directly-managed environments
                          • Integration with relational databases, using Hive for SQL queries and Sqoop for data transfer
                          • How Flume can collect data from multiple sources and deliver it to Hadoop for processing
                          • What other projects and tools make up the broader Hadoop ecosystem and where to go next

                          In Detail

                          Data is arriving faster than you can process it and the overall volumes keep growing at a rate that keeps you awake at night. Hadoop can help you tame the data beast. Effective use of Hadoop however requires a mixture of programming, design, and system administration skills.

                          "Hadoop Beginner's Guide" removes the mystery from Hadoop, presenting Hadoop and related technologies with a focus on building working systems and getting the job done, using cloud services to do so when it makes sense. From basic concepts and initial setup through developing applications and keeping the system running as the data grows, the book gives the understanding needed to effectively use Hadoop to solve real world problems.

                          Starting with the basics of installing and configuring Hadoop, the book explains how to develop applications, how to maintain the system, and how to use additional products to integrate with other systems.

                          While covering different ways to develop applications that run on Hadoop, the book also introduces tools such as Hive, Sqoop, and Flume, showing how Hadoop can be integrated with relational databases and log collection.

                          In addition to examples on local Hadoop clusters on Ubuntu, the book covers the use of cloud services such as Amazon EC2 and Elastic MapReduce.

                          Approach

                          As a Packt Beginner's Guide, the book is packed with clear step-by-step instructions for performing the most useful tasks, getting you up and running quickly, and learning by doing.

                          Who this book is for

                          This book assumes no existing experience with Hadoop or cloud services. It assumes you have familiarity with a programming language such as Java or Ruby but gives you the needed background on the other topics.
