Optimizing Hadoop for MapReduce

Product type Book
Published in Feb 2014
ISBN-13 9781783285655
Pages 120 pages
Edition 1st Edition
Author (1):
Khaled Tannir

Index

A

  • Ambari
    • about / Performance monitoring tools
  • Apache Ambari
    • URL / Using Apache Ambari to monitor Hadoop
    • using, for Hadoop monitoring / Using Apache Ambari to monitor Hadoop
  • Associative property
    • about / Using Combiners
    • URL / Using Combiners

B

  • best practices, Hadoop / Hadoop best practices and recommendations, Hadoop tuning recommendations, Using a MapReduce template class code
  • BIOS tuning
    • checklists / The Bios tuning checklist
  • Blacklisted node / Checking the Hadoop cluster node's health

C

  • Cacti / Checking the Hadoop cluster node's health
  • Chukwa
    • about / Performance monitoring tools
    • using, for Hadoop monitoring / Using Chukwa to monitor Hadoop
  • collect() method / Using Combiners
  • Combiner
    • about / Using Combiners
    • using / Using Combiners
    • Commutative property / Using Combiners
    • Associative property / Using Combiners
    • implementing / Using Combiners
    • Hadoop counters, focusing on / Using Combiners
  • Combiner function / The MapReduce model
  • Commutative property
    • about / Using Combiners
    • URL / Using Combiners
  • Completely Fair Queuing (CFQ) / OS configuration recommendations
  • Compress Input phase / Using compression
  • compression
    • using / Using compression, Using appropriate Writable types
  • Compress Mapper output phase / Using compression
  • Compress Reducer output phase / Using compression
  • Context Switch / Checking for CPU contention
  • core-site.xml configuration file
    • about / The core-site.xml configuration file
    • fs.default.name variable / The core-site.xml configuration file
    • hadoop.tmp.dir variable / The core-site.xml configuration file
    • fs.checkpoint.dir variable / The core-site.xml configuration file
    • io.file.buffer.size variable / The core-site.xml configuration file
  • CPU-related parameters
    • mapred.tasktracker.map.tasks.maximum variable / The CPU-related parameters
    • mapred.tasktracker.reduce.tasks.maximum variable / The CPU-related parameters
  • CPU bottlenecks
    • identifying / Identifying CPU bottlenecks

D

  • data locality / Sizing your Hadoop cluster
  • DataNodes / An overview of Hadoop MapReduce
  • dfs.access.time.precision variable / The hdfs-site.xml configuration file
  • dfs.balance.bandwidthPerSec variable / The hdfs-site.xml configuration file
  • dfs.block.size variable / The hdfs-site.xml configuration file
  • dfs.data.dir (hdfs-site.xml) variable / The disk I/O related parameters
  • dfs.data.dir variable / The disk I/O related parameters
  • dfs.datanode.du.reserved variable / The hdfs-site.xml configuration file
  • dfs.datanode.handler.count variable / The hdfs-site.xml configuration file
  • dfs.name.dir variable / The hdfs-site.xml configuration file
  • dfs.name.edits.dir variable / The hdfs-site.xml configuration file
  • dfs.namenode.handler.count parameter / The hdfs-site.xml configuration file
  • dfs.replication.considerLoad variable / The hdfs-site.xml configuration file
  • dfs.replication variable / The hdfs-site.xml configuration file
  • disk I/O-related parameters
    • about / The disk I/O related parameters
    • mapred.compress.map.output variable / The disk I/O related parameters
    • mapred.output.compress parameter / The disk I/O related parameters
    • mapred.map.output.compression.codec parameter / The disk I/O related parameters
    • mapred.local.dir variable / The disk I/O related parameters
    • dfs.data.dir (hdfs-site.xml) variable / The disk I/O related parameters
    • mapred.child.java.opts variable / The memory-related parameters
    • mapred.child.ulimit variable / The memory-related parameters
    • io.sort.mb variable / The memory-related parameters
    • io.sort.factor / The memory-related parameters
    • mapred.job.reduce.input.buffer.percent variable / The memory-related parameters

E

  • Excluded node / Checking the Hadoop cluster node's health

F

  • Fetch phase / Enhancing map tasks
  • FileInputFormat class / Using appropriate Writable types
  • FileOutputFormat class / Using a MapReduce template class code
  • fs.checkpoint.dir variable / The core-site.xml configuration file
  • fs.default.name variable / The core-site.xml configuration file

G

  • Ganglia
    • URL / Using Ganglia to monitor Hadoop
    • using, for Hadoop monitoring / Using Ganglia to monitor Hadoop
    / Checking the Hadoop cluster node's health
  • Ganglia Collector / Using Ganglia to monitor Hadoop
  • garbage collector (GC) / Hadoop tuning recommendations
  • GFS / An overview of Hadoop MapReduce
  • Graylisted node / Checking the Hadoop cluster node's health
  • Graylisted nodes / Checking the Hadoop cluster node's health
  • gzip codec / Using compression

H

  • Hadoop
    • JobTracker / Sizing your Hadoop cluster
    • TaskTracker / Sizing your Hadoop cluster
    • best practices / Hadoop best practices and recommendations
    • deploying / Deploying Hadoop
    • tuning recommendations, checklists / Hadoop tuning recommendations
    • I/O tuning recommendations / Hadoop tuning recommendations
    • minimal configuration checklist / Hadoop tuning recommendations
  • hadoop.tmp.dir variable / The core-site.xml configuration file
  • Hadoop cluster / An overview of Hadoop MapReduce
    • configuring / Creating a performance baseline
    • weakness, identifying / Identifying cluster weakness
    • node’s health, checking / Checking the Hadoop cluster node's health
    • input data size, checking / Checking the input data size
    • massive I/O, checking / Checking massive I/O and network traffic
    • network traffic, checking / Checking massive I/O and network traffic
    • insufficient concurrent tasks, checking / Checking for insufficient concurrent tasks
    • CPU contention, checking / Checking for CPU contention
    • sizing / Sizing your Hadoop cluster
    • configuring correctly / Configuring your cluster correctly
    • checklists / The Hadoop cluster checklist
  • Hadoop cluster weakness
    • scenarios / Checking the Hadoop cluster node's health, Checking the input data size, Checking massive I/O and network traffic, Checking for CPU contention
  • Hadoop Distributed File System (HDFS) / Enhancing map tasks
  • Hadoop job
    • components / Factors affecting the performance of MapReduce
  • Hadoop MapReduce
    • overview / An overview of Hadoop MapReduce
    • internals / Hadoop MapReduce internals
    • performance, affecting factors / Factors affecting the performance of MapReduce
    • metrics / Hadoop MapReduce metrics
  • Hadoop parameter
    • dfs.replication / Creating a performance baseline
    • dfs.block.size / Creating a performance baseline
    • dfs.namenode.handler.count / Creating a performance baseline
    • dfs.datanode.handler.count / Creating a performance baseline
    • io.sort.factor / Creating a performance baseline
    • io.sort.mb / Creating a performance baseline
    • mapred.tasktracker.map.tasks.maximum / Creating a performance baseline
    • mapred.map.tasks / Creating a performance baseline
    • mapred.reduce.tasks / Creating a performance baseline
    • mapred.tasktracker.reduce.tasks.maximum / Creating a performance baseline
    • mapred.reduce.parallel.copies / Creating a performance baseline
    • mapred.job.reduce.input.buffer.percent / Creating a performance baseline
    • mapred.child.java.opts / Creating a performance baseline
  • Hadoop parameters
    • investigating / Investigating the Hadoop parameters
    • mapred-site.xml configuration file / The mapred-site.xml configuration file
  • HaLoop / Optimizing mappers and reducers code
  • HDFS
    • about / An overview of Hadoop MapReduce
    • NameNode / An overview of Hadoop MapReduce
    • DataNodes / An overview of Hadoop MapReduce
  • hdfs-site.xml configuration file
    • about / The hdfs-site.xml configuration file
    • dfs.access.time.precision variable / The hdfs-site.xml configuration file
    • dfs.balance.bandwidthPerSec variable / The hdfs-site.xml configuration file
    • dfs.block.size variable / The hdfs-site.xml configuration file
    • dfs.data.dir variable / The hdfs-site.xml configuration file
    • dfs.datanode.du.reserved variable / The hdfs-site.xml configuration file
    • dfs.datanode.handler.count variable / The hdfs-site.xml configuration file
    • dfs.max.objects parameter / The hdfs-site.xml configuration file
    • dfs.name.dir variable / The hdfs-site.xml configuration file
    • dfs.name.edits.dir variable / The hdfs-site.xml configuration file
    • dfs.replication variable / The hdfs-site.xml configuration file
    • dfs.replication.considerLoad variable / The hdfs-site.xml configuration file
  • High Performance Computing (HPC) / Using Nagios to monitor Hadoop
  • Advanced Host Controller Interface (AHCI) option / The Bios tuning checklist

I

  • I/O mode
    • Direct I/O / Factors affecting the performance of MapReduce
    • Streaming I/O / Factors affecting the performance of MapReduce
  • immutable objects / Factors affecting the performance of MapReduce
  • InputFormat class / Using appropriate Writable types
  • io.compression.codec parameter / Using compression
  • io.file.buffer.size variable / The core-site.xml configuration file
  • io.sort.factor parameter / Reducing spilled records during the Map phase, Improving Reduce execution phase
  • io.sort.factor variable / The memory-related parameters
  • io.sort.mb parameter / Reducing spilled records during the Map phase
  • io.sort.mb variable / The memory-related parameters
  • io.sort.record.percent parameter / Reducing spilled records during the Map phase
  • io.sort.spill.percent parameter / Reducing spilled records during the Map phase

J

  • JobTracker / An overview of Hadoop MapReduce
  • JVM (Java Virtual Machine) / Investigating the Hadoop parameters

L

  • Lempel-Ziv-Oberhumer (LZO) / Using compression
  • Logical Volume Management (LVM) / OS configuration recommendations

M

  • main() method / Using a MapReduce template class code
  • map() function / Using a MapReduce template class code
  • map-side bottlenecks
    • identifying / Enhancing map tasks
  • map function / The MapReduce model
  • mappers
    • optimizing / Optimizing mappers and reducers code
  • Map phase
    • profiling / Enhancing map tasks
  • mapred-site.xml configuration file
    • CPU-related parameters / The CPU-related parameters
    • CPU-related variable / The CPU-related parameters
    • disk I/O-related parameters / The disk I/O related parameters
    • memory-related parameters / The memory-related parameters
    • network-related parameters / The network-related parameters
  • mapred.child.java.opts parameter / Tuning map and reduce parameters, Reusing types smartly
  • mapred.child.ulimit variable / The memory-related parameters
  • mapred.compress.map.output variable / The disk I/O related parameters
  • mapred.job.reduce.input.buffer.percent parameter / Improving Reduce execution phase
  • mapred.job.reduce.input.buffer.percent variable / The memory-related parameters
  • mapred.job.reuse.jvm.num.tasks parameter / Reusing types smartly
  • mapred.job.shuffle.input.buffer.percent parameter / Improving Reduce execution phase
  • mapred.local.dir variable / The disk I/O related parameters
  • mapred.map.output.compression.codec variable / The disk I/O related parameters
  • mapred.output.compress variable / The disk I/O related parameters
  • mapred.reduce.parallel.copies parameter / Improving Reduce execution phase
  • mapred.reduce.parallel.copies variable / The network-related parameters
  • mapred.tasktracker.map.tasks.maximum variable / The CPU-related parameters
  • mapred.tasktracker.reduce.tasks.maximum variable / The CPU-related parameters
  • MapReduce
    • job performance / Factors affecting the performance of MapReduce
  • mapreduce.map.output.compress.codec parameter / Using compression
  • mapreduce.map.output.compress parameter / Using compression
  • mapreduce.output.fileoutputformat.compress.codec parameter / Using compression
  • mapreduce.output.fileoutputformat.compress.type parameter / Using compression
  • mapreduce.output.fileoutputformat.compress parameter / Using compression
  • MapReduce job
    • launching / Hadoop MapReduce internals
    • optimizing / Optimizing mappers and reducers code
  • MapReduce job phase
    • Compress Input / Using compression
    • Compress Mapper output / Using compression
    • Compress Reducer output / Using compression
  • MapReduce model
    • about / The MapReduce model
    • map function / The MapReduce model
    • reduce function / The MapReduce model
    • using / The MapReduce model
    • phases / The MapReduce model
    • design / The MapReduce model
    • diagram / The MapReduce model
  • MapReduceTemplate class / Using a MapReduce template class code
  • MapReduce template class code
    • using / Using a MapReduce template class code
  • map tasks
    • execution sequence / Enhancing map tasks
  • map tasks (mappers) / Hadoop MapReduce internals
  • Map tasks performance
    • enhancing / Enhancing map tasks, Input data and block size impact, Dealing with small and unsplittable files, Reducing spilled records during the Map phase, Calculating map tasks' throughput
  • Map tasks performance, enhancing
    • input data / Input data and block size impact
    • block size / Input data and block size impact
    • small files, dealing with / Dealing with small and unsplittable files
    • small files, packing options / Dealing with small and unsplittable files
    • spilled records, reducing / Reducing spilled records during the Map phase
    • map tasks’ throughput, calculating / Calculating map tasks' throughput
  • master nodes / Configuring your cluster correctly
  • merge-sort algorithm / Factors affecting the performance of MapReduce
  • Merge phase / Enhancing map tasks
  • Metadata size (MS) / Reducing spilled records during the Map phase
  • mutable objects / Factors affecting the performance of MapReduce

N

  • Nagios
    • about / Performance monitoring tools
    • URL / Using Nagios to monitor Hadoop
    • using, for Hadoop monitoring / Using Nagios to monitor Hadoop
    • using, for monitoring perspectives / Using Nagios to monitor Hadoop
    • benefits / Using Nagios to monitor Hadoop
    / Checking the Hadoop cluster node's health
  • NameNode / An overview of Hadoop MapReduce
  • Native Command Queuing mode (NCQ) / The Bios tuning checklist
  • network-related parameters
    • mapred.reduce.parallel.copies variable / The network-related parameters
    • topology.script.file.name (core-site.xml) variable / The network-related parameters
  • network bandwidth bottlenecks
    • identifying / Identifying network bandwidth bottlenecks
  • nodiratime / OS configuration recommendations

O

  • off-machine level / The core-site.xml configuration file
  • on-machine level / The core-site.xml configuration file
  • OS configuration
    • recommendations / OS configuration recommendations

P

  • performance
    • monitoring, tools / Performance monitoring tools
  • performance affecting factors, Hadoop MapReduce
    • I/O mode / Factors affecting the performance of MapReduce
    • input data parsing / Factors affecting the performance of MapReduce
  • performance baseline
    • creating / Creating a performance baseline
    • TeraGen modules / Creating a performance baseline
    • TeraSort modules / Creating a performance baseline
    • TeraValidate modules / Creating a performance baseline
  • performance monitoring, tools
    • Chukwa / Using Chukwa to monitor Hadoop
    • Ganglia / Using Ganglia to monitor Hadoop
    • Nagios / Using Nagios to monitor Hadoop
    • Apache Ambari / Using Apache Ambari to monitor Hadoop
  • performance tuning
    • goal / Performance tuning
    • categories / Performance tuning
    • of Hadoop MapReduce job / Performance tuning
    • steps / Performance tuning
    • diagram / Performance tuning
  • pseudo formula / Configuring your cluster correctly

R

  • rack awareness concept / The network-related parameters
  • RAM bottlenecks
    • identifying / Identifying RAM bottlenecks
  • read-only default configuration
    • about / Investigating the Hadoop parameters
  • readInt() method / Using appropriate Writable types
  • Read phase / Enhancing map tasks
  • Record length (RL) / Reducing spilled records during the Map phase
  • records / Optimizing mappers and reducers code
  • reduce() function / Using a MapReduce template class code
  • reduce function / The MapReduce model
  • Reduce phase
    • enhancing, parameters / Improving Reduce execution phase
  • Reducer function / Using a MapReduce template class code
  • reducers code
    • optimizing / Optimizing mappers and reducers code
  • Reduce tasks
    • enhancing / Enhancing reduce tasks, Calculating reduce tasks' throughput, Improving Reduce execution phase
    • phases / Enhancing reduce tasks
    • Shuffle phase / Enhancing reduce tasks
    • Reduce phase / Enhancing reduce tasks
    • Write phase / Enhancing reduce tasks
  • reduce tasks (reducers) / Hadoop MapReduce internals
  • Reduce tasks, enhancing
    • Shuffle phase, profiling / Enhancing reduce tasks
    • Reduce phase / Enhancing reduce tasks
    • Write phase, profiling / Enhancing reduce tasks
    • reduce task throughput, calculating / Calculating reduce tasks' throughput
    • Reduce execution phase, improving / Improving Reduce execution phase
    • reduce parameters, tuning / Tuning map and reduce parameters
    • map parameters, tuning / Tuning map and reduce parameters
  • resource bottlenecks
    • identifying / Identifying resource bottlenecks
    • RAM bottlenecks, identifying / Identifying RAM bottlenecks
    • CPU bottlenecks, identifying / Identifying CPU bottlenecks
    • storage bottlenecks, identifying / Identifying storage bottlenecks
    • network bandwidth bottlenecks, identifying / Identifying network bandwidth bottlenecks
  • run() method / Using a MapReduce template class code

S

  • Secure Copy Protocol (SCP) / Deploying Hadoop
  • Secure Shell (SSH) / Deploying Hadoop
  • site-specific configuration
    • about / Investigating the Hadoop parameters
  • SkipBadRecords class / Optimizing mappers and reducers code
  • small files
    • packing, alternatives / Dealing with small and unsplittable files
  • Spilled Records size (RS) / Reducing spilled records during the Map phase
  • Spill phase / Enhancing map tasks
  • split file / Dealing with small and unsplittable files
  • storage bottlenecks
    • identifying / Identifying storage bottlenecks
  • system performance
    • analyzing / Identifying network bandwidth bottlenecks

T

  • TaskTracker / An overview of Hadoop MapReduce
  • tasktracker.http.threads parameter / Reducing spilled records during the Map phase
  • TeraGen modules / Creating a performance baseline
  • TeraSort modules / Creating a performance baseline
  • TeraValidate modules / Creating a performance baseline
  • TestDFSIO benchmark tool
    • using / Identifying storage bottlenecks
    • output log / Identifying storage bottlenecks
  • topology.script.file.name (core-site.xml) variable / The network-related parameters
  • Twister / Optimizing mappers and reducers code
  • types
    • reusing / Reusing types smartly

V

  • vmstat tool / Checking for CPU contention

W

  • Writable class / Using appropriate Writable types
  • WritableComparable class / Using appropriate Writable types
  • WritableComparator class / Using appropriate Writable types
  • Writable type object
    • writing / Using appropriate Writable types