You're reading from Optimizing Hadoop for MapReduce

Product type Book

Published in Feb 2014

ISBN-13 9781783285655

Pages 120 pages

Edition 1st Edition

Concepts

Data Processing

Author (1):

Khaled Tannir

Table of Contents (15) Chapters

Optimizing Hadoop for MapReduce

Credits

About the Author

Acknowledgments

About the Reviewers

www.PacktPub.com

Preface

1. Understanding Hadoop MapReduce

2. An Overview of the Hadoop Parameters

3. Detecting System Bottlenecks

4. Identifying Resource Weaknesses

5. Enhancing Map and Reduce Tasks

6. Optimizing MapReduce Tasks

7. Best Practices and Recommendations

Index

A

Ambari
- about / Performance monitoring tools
Apache Ambari
- URL / Using Apache Ambari to monitor Hadoop
- using, for Hadoop monitoring / Using Apache Ambari to monitor Hadoop
Associative property
- about / Using Combiners
- URL / Using Combiners

B

best practices, Hadoop / Hadoop best practices and recommendations, Hadoop tuning recommendations, Using a MapReduce template class code
BIOS tuning
- checklists / The Bios tuning checklist
Blacklisted node / Checking the Hadoop cluster node's health

C

Cacti / Checking the Hadoop cluster node's health
Chukwa
- about / Performance monitoring tools
- using, for Hadoop monitoring / Using Chukwa to monitor Hadoop
collect() method / Using Combiners
Combiner
- about / Using Combiners
- using / Using Combiners
- Commutative property / Using Combiners
- Associative property / Using Combiners
- implementing / Using Combiners
- Hadoop counters, focusing on / Using Combiners
Combiner function / The MapReduce model
Commutative property
- about / Using Combiners
- URL / Using Combiners
Completely Fair Queuing (CFQ) / OS configuration recommendations
Compress Input phase / Using compression
compression
- using / Using compression, Using appropriate Writable types
Compress Mapper output phase / Using compression
Compress Reducer output phase / Using compression
Context Switch / Checking for CPU contention
core-site.xml configuration file
- about / The core-site.xml configuration file
- fs.default.name variable / The core-site.xml configuration file
- hadoop.tmp.dir variable / The core-site.xml configuration file
- fs.checkpoint.dir variable / The core-site.xml configuration file
- io.file.buffer.size variable / The core-site.xml configuration file
CPU-related parameters
- mapred.tasktracker.map.tasks.maximum variable / The CPU-related parameters
- mapred.tasktracker.reduce.tasks.maximum variable / The CPU-related parameters
CPU bottlenecks
- identifying / Identifying CPU bottlenecks

D

data locality / Sizing your Hadoop cluster
DataNodes / An overview of Hadoop MapReduce
dfs.access.time.precision variable / The hdfs-site.xml configuration file
dfs.balance.bandwidthPerSec variable / The hdfs-site.xml configuration file
dfs.block.size variable / The hdfs-site.xml configuration file
dfs.data.dir (hdfs-site.xml) variable / The disk I/O related parameters
dfs.data.dir variable / The disk I/O related parameters
dfs.datanode.du.reserved variable / The hdfs-site.xml configuration file
dfs.datanode.handler.count variable / The hdfs-site.xml configuration file
dfs.name.dir variable / The hdfs-site.xml configuration file
dfs.name.edits.dir variable / The hdfs-site.xml configuration file
dfs.namenode.handler.count parameter / The hdfs-site.xml configuration file
dfs.replication.considerLoad variable / The hdfs-site.xml configuration file
dfs.replication variable / The hdfs-site.xml configuration file
disk I/O-related parameters
- about / The disk I/O related parameters
- mapred.compress.map.output variable / The disk I/O related parameters
- mapred.output.compress parameter / The disk I/O related parameters
- mapred.map.output.compression.codec parameter / The disk I/O related parameters
- mapred.local.dir variable / The disk I/O related parameters
- dfs.data.dir (hdfs-site.xml) variable / The disk I/O related parameters
- mapred.child.java.opts variable / The memory-related parameters
- mapred.child.ulimit variable / The memory-related parameters
- io.sort.mb variable / The memory-related parameters
- io.sort.factor / The memory-related parameters
- mapred.job.reduce.input.buffer.percent variable / The memory-related parameters

E

Excluded node / Checking the Hadoop cluster node's health

F

Fetch phase / Enhancing map tasks
FileInputFormat class / Using appropriate Writable types
FileOutputFormat class / Using a MapReduce template class code
fs.checkpoint.dir variable / The core-site.xml configuration file
fs.default.name variable / The core-site.xml configuration file

G

Ganglia / Checking the Hadoop cluster node's health
- URL / Using Ganglia to monitor Hadoop
- using, for Hadoop monitoring / Using Ganglia to monitor Hadoop
Ganglia Collector / Using Ganglia to monitor Hadoop
garbage collector (GC) / Hadoop tuning recommendations
GFS / An overview of Hadoop MapReduce
Graylisted node / Checking the Hadoop cluster node's health
Graylisted nodes / Checking the Hadoop cluster node's health
gzip codec / Using compression

H

Hadoop
- JobTracker / Sizing your Hadoop cluster
- TaskTracker / Sizing your Hadoop cluster
- best practices / Hadoop best practices and recommendations
- deploying / Deploying Hadoop
- tuning recommendations, checklists / Hadoop tuning recommendations
- I/O tuning recommendations / Hadoop tuning recommendations
- minimal configuration checklist / Hadoop tuning recommendations
hadoop.tmp.dir variable / The core-site.xml configuration file
Hadoop cluster / An overview of Hadoop MapReduce
- configuring / Creating a performance baseline
- weakness, identifying / Identifying cluster weakness
- nodes health, checking / Checking the Hadoop cluster node's health
- input data size, checking / Checking the input data size
- massive I/O, checking / Checking massive I/O and network traffic
- network traffic, checking / Checking massive I/O and network traffic
- insufficient concurrent tasks, checking / Checking for insufficient concurrent tasks
- CPU contention, checking / Checking for CPU contention
- sizing / Sizing your Hadoop cluster
- configuring correctly / Configuring your cluster correctly
- checklists / The Hadoop cluster checklist
Hadoop cluster weakness
- scenarios / Checking the Hadoop cluster node's health, Checking the input data size, Checking massive I/O and network traffic, Checking for CPU contention
Hadoop Distributed File System (HDFS) / Enhancing map tasks
Hadoop job
- components / Factors affecting the performance of MapReduce
Hadoop MapReduce
- overview / An overview of Hadoop MapReduce
- internals / Hadoop MapReduce internals
- performance, affecting factors / Factors affecting the performance of MapReduce
- metrics / Hadoop MapReduce metrics
Hadoop parameter
- dfs.replication / Creating a performance baseline
- dfs.block.size / Creating a performance baseline
- dfs.namenode.handler.count / Creating a performance baseline
- dfs.datanode.handler.count / Creating a performance baseline
- io.sort.factor / Creating a performance baseline
- io.sort.mb / Creating a performance baseline
- mapred.tasktracker.map.tasks.maximum / Creating a performance baseline
- mapred.map.tasks / Creating a performance baseline
- mapred.reduce.tasks / Creating a performance baseline
- mapred.tasktracker.reduce.tasks.maximum / Creating a performance baseline
- mapred.reduce.parallel.copies / Creating a performance baseline
- mapred.job.reduce.input.buffer.percent / Creating a performance baseline
- mapred.child.java.opts / Creating a performance baseline
Hadoop parameters
- investigating / Investigating the Hadoop parameters
- mapred-site.xml configuration file / The mapred-site.xml configuration file
HaLoop / Optimizing mappers and reducers code
HDFS
- about / An overview of Hadoop MapReduce
- NameNode / An overview of Hadoop MapReduce
- DataNodes / An overview of Hadoop MapReduce
hdfs-site.xml configuration file
- about / The hdfs-site.xml configuration file
- dfs.access.time.precision variable / The hdfs-site.xml configuration file
- dfs.balance.bandwidthPerSec variable / The hdfs-site.xml configuration file
- dfs.block.size variable / The hdfs-site.xml configuration file
- dfs.data.dir variable / The hdfs-site.xml configuration file
- dfs.datanode.du.reserved variable / The hdfs-site.xml configuration file
- dfs.datanode.handler.count variable / The hdfs-site.xml configuration file
- dfs.max.objects parameter / The hdfs-site.xml configuration file
- dfs.name.dir variable / The hdfs-site.xml configuration file
- dfs.name.edits.dir variable / The hdfs-site.xml configuration file
- dfs.replication variable / The hdfs-site.xml configuration file
- dfs.replication.considerLoad variable / The hdfs-site.xml configuration file
High Performance Computing (HPC) / Using Nagios to monitor Hadoop
Host Controller Interface, Advanced (AHCI) option / The Bios tuning checklist

I

I/O mode
- Direct I/O / Factors affecting the performance of MapReduce
- Streaming I/O / Factors affecting the performance of MapReduce
immutable objects / Factors affecting the performance of MapReduce
InputFormat class / Using appropriate Writable types
io.compression.codecs parameter / Using compression
io.file.buffer.size variable / The core-site.xml configuration file
io.sort.factor parameter / Reducing spilled records during the Map phase, Improving Reduce execution phase
io.sort.factor variable / The memory-related parameters
io.sort.mb parameter / Reducing spilled records during the Map phase
io.sort.mb variable / The memory-related parameters
io.sort.record.percent parameter / Reducing spilled records during the Map phase
io.sort.spill.percent parameter / Reducing spilled records during the Map phase

J

JobTracker / An overview of Hadoop MapReduce
JVM (Java Virtual Machine) / Investigating the Hadoop parameters

L

Lempel-Ziv-Oberhumer (LZO) / Using compression
Logical Volume Management (LVM) / OS configuration recommendations

M

main() method / Using a MapReduce template class code
map() function / Using a MapReduce template class code
map-side bottlenecks
- identifying / Enhancing map tasks
map function / The MapReduce model
mappers
- optimizing / Optimizing mappers and reducers code
Map phase
- profiling / Enhancing map tasks
mapred-site.xml configuration file
- CPU-related parameters / The CPU-related parameters
- CPU-related variable / The CPU-related parameters
- disk I/O-related parameters / The disk I/O related parameters
- memory-related parameters / The memory-related parameters
- network-related parameters / The network-related parameters
mapred.child.java.opts parameter / Tuning map and reduce parameters, Reusing types smartly
mapred.child.ulimit variable / The memory-related parameters
mapred.compress.map.output variable / The disk I/O related parameters
mapred.job.reduce.input.buffer.percent parameter / Improving Reduce execution phase
mapred.job.reduce.input.buffer.percent variable / The memory-related parameters
mapred.job.reuse.jvm.num.tasks parameter / Reusing types smartly
mapred.job.shuffle.input.buffer.percent parameter / Improving Reduce execution phase
mapred.local.dir variable / The disk I/O related parameters
mapred.map.output.compression.codec variable / The disk I/O related parameters
mapred.output.compress variable / The disk I/O related parameters
mapred.reduce.parallel.copies parameter / Improving Reduce execution phase
mapred.reduce.parallel.copies variable / The network-related parameters
mapred.tasktracker.map.tasks.maximum variable / The CPU-related parameters
mapred.tasktracker.reduce.tasks.maximum variable / The CPU-related parameters
MapReduce
- job performance / Factors affecting the performance of MapReduce
mapreduce.map.output.compress.codec parameter / Using compression
mapreduce.map.output.compress parameter / Using compression
mapreduce.output.fileoutputformat.compress.codec parameter / Using compression
mapreduce.output.fileoutputformat.compress.type parameter / Using compression
mapreduce.output.fileoutputformat.compress parameter / Using compression
MapReduce job
- launching / Hadoop MapReduce internals
- optimizing / Optimizing mappers and reducers code
MapReduce job phase
- Compress Input / Using compression
- Compress Mapper output / Using compression
- Compress Reducer output / Using compression
MapReduce model
- about / The MapReduce model
- map function / The MapReduce model
- reduce function / The MapReduce model
- using / The MapReduce model
- phases / The MapReduce model
- design / The MapReduce model
- diagram / The MapReduce model
MapReduceTemplate class / Using a MapReduce template class code
MapReduce template class code
- using / Using a MapReduce template class code
map tasks
- execution sequence / Enhancing map tasks
map tasks (mappers) / Hadoop MapReduce internals
Map tasks performance
- enhancing / Enhancing map tasks, Input data and block size impact, Dealing with small and unsplittable files, Reducing spilled records during the Map phase, Calculating map tasks' throughput
Map tasks performance, enhancing
- input data / Input data and block size impact
- block size / Input data and block size impact
- small files, dealing with / Dealing with small and unsplittable files
- small files, packing options / Dealing with small and unsplittable files
- spilled records, reducing / Reducing spilled records during the Map phase
- map tasks' throughput, calculating / Calculating map tasks' throughput
master nodes / Configuring your cluster correctly
merge-sort algorithm / Factors affecting the performance of MapReduce
Merge phase / Enhancing map tasks
Metadata size (MS) / Reducing spilled records during the Map phase
mutable objects / Factors affecting the performance of MapReduce

N

Nagios / Checking the Hadoop cluster node's health
- about / Performance monitoring tools
- URL / Using Nagios to monitor Hadoop
- using, for Hadoop monitoring / Using Nagios to monitor Hadoop
- using, for monitoring perspectives / Using Nagios to monitor Hadoop
- benefits / Using Nagios to monitor Hadoop
NameNode / An overview of Hadoop MapReduce
Native Command Queuing mode (NCQ) / The Bios tuning checklist
network-related parameters
- mapred.reduce.parallel.copies variable / The network-related parameters
- topology.script.file.name (core-site.xml) variable / The network-related parameters
network bandwidth bottlenecks
- identifying / Identifying network bandwidth bottlenecks
nodiratime / OS configuration recommendations

O

off-machine level / The core-site.xml configuration file
on-machine level / The core-site.xml configuration file
OS configuration
- recommendations / OS configuration recommendations

P

performance
- monitoring, tools / Performance monitoring tools
performance affecting factors, Hadoop MapReduce
- I/O mode / Factors affecting the performance of MapReduce
- input data parsing / Factors affecting the performance of MapReduce
performance baseline
- creating / Creating a performance baseline
- TeraGen modules / Creating a performance baseline
- TeraSort modules / Creating a performance baseline
- TeraValidate modules / Creating a performance baseline
performance monitoring, tools
- Chukwa / Using Chukwa to monitor Hadoop
- Ganglia / Using Ganglia to monitor Hadoop
- Nagios / Using Nagios to monitor Hadoop
- Apache Ambari / Using Apache Ambari to monitor Hadoop
performance tuning
- goal / Performance tuning
- categories / Performance tuning
- of Hadoop MapReduce job / Performance tuning
- steps / Performance tuning
- diagram / Performance tuning
pseudo formula / Configuring your cluster correctly

R

rack awareness concept / The network-related parameters
RAM bottlenecks
- identifying / Identifying RAM bottlenecks
read-only default configuration
- about / Investigating the Hadoop parameters
readInt() method / Using appropriate Writable types
Read phase / Enhancing map tasks
Record length (RL) / Reducing spilled records during the Map phase
records / Optimizing mappers and reducers code
reduce() function / Using a MapReduce template class code
reduce function / The MapReduce model
Reduce phase
- enhancing, parameters / Improving Reduce execution phase
Reducer function / Using a MapReduce template class code
reducers code
- optimizing / Optimizing mappers and reducers code
Reduce tasks
- enhancing / Enhancing reduce tasks, Calculating reduce tasks' throughput, Improving Reduce execution phase
- phases / Enhancing reduce tasks
- Shuffle phase / Enhancing reduce tasks
- Reduce phase / Enhancing reduce tasks
- Write phase / Enhancing reduce tasks
reduce tasks (reducers) / Hadoop MapReduce internals
Reduce tasks, enhancing
- Shuffle phase, profiling / Enhancing reduce tasks
- Reduce phase / Enhancing reduce tasks
- Write phase, profiling / Enhancing reduce tasks
- reduce task throughput, calculating / Calculating reduce tasks' throughput
- Reduce execution phase, improving / Improving Reduce execution phase
- reduce parameters, tuning / Tuning map and reduce parameters
- map parameters, tuning / Tuning map and reduce parameters
resource bottlenecks
- identifying / Identifying resource bottlenecks
- RAM bottlenecks, identifying / Identifying RAM bottlenecks
- CPU bottlenecks, identifying / Identifying CPU bottlenecks
- storage bottlenecks, identifying / Identifying storage bottlenecks
- network bandwidth bottlenecks, identifying / Identifying network bandwidth bottlenecks
run() method / Using a MapReduce template class code

S

Secure Copy Protocol (SCP) / Deploying Hadoop
Secure Shell (SSH) / Deploying Hadoop
site-specific configuration
- about / Investigating the Hadoop parameters
SkipBadRecords class / Optimizing mappers and reducers code
small files
- packing, alternatives / Dealing with small and unsplittable files
Spilled Records size (RS) / Reducing spilled records during the Map phase
Spill phase / Enhancing map tasks
split file / Dealing with small and unsplittable files
storage bottlenecks
- identifying / Identifying storage bottlenecks
system performance
- analyzing / Identifying network bandwidth bottlenecks

T

TaskTracker / An overview of Hadoop MapReduce
tasktracker.http.threads parameter / Reducing spilled records during the Map phase
TeraGen modules / Creating a performance baseline
TeraSort modules / Creating a performance baseline
TeraValidate modules / Creating a performance baseline
TestDFSIO benchmark tool
- using / Identifying storage bottlenecks
- output log / Identifying storage bottlenecks
topology.script.file.name (core-site.xml) variable / The network-related parameters
Twister / Optimizing mappers and reducers code
types
- reusing / Reusing types smartly

V

vmstat tool / Checking for CPU contention

W

Writable class / Using appropriate Writable types
WritableComparable class / Using appropriate Writable types
WritableComparator class / Using appropriate Writable types
Writable type object
- writing / Using appropriate Writable types
