Reader small image

You're reading from  Optimizing Hadoop for MapReduce

Product typeBook
Published inFeb 2014
Publisher
ISBN-139781783285655
Edition1st Edition
Tools
Right arrow
Author (1)
Khaled Tannir
Khaled Tannir
author image
Khaled Tannir

Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair Zx81 and later with Commodore home computer products (Vic 20, Commodore 64, Commodore 128D, and Amiga 500). He has a Bachelor's degree in Electronics, a Master's degree in System Information Architectures, in which he graduated with a professional thesis, and completed his education with a Master of Research degree. He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada. With significant experience in Microsoft .Net, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline applications design, system conversions, and multilingual applications in both domains: Internet and Desktops. He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle-east. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, RaspBerry Pi, and .Net Gadgeteer, and some smartphone devices based on Windows Phone, Android, and iOS operating systems. In 2012, he contributed to the EGC 2012 (International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on "how to optimize data distribution in a cloud computing environment". This work aims to define an approach to optimize the use of data mining algorithms such as k-means and Apriori in a cloud computing environment. He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing. He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies. He enjoys taking landscape and night time photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.Net Gadgeteer, and of course, spending time with his wife and family. You can reach him at contact@khaledtannir.net.
Read more about Khaled Tannir

Right arrow

Creating a performance baseline


Let's begin by creating a performance baseline for our system. When creating a baseline, you should keep the Hadoop default configuration settings and use the TeraSort benchmark tool, which is a part of the example JAR files provided with the Hadoop distribution package. TeraSort is accepted as an industry standard benchmark to compare the performance of Hadoop. This benchmark tool tries to use the entire Hadoop cluster to sort 1 TB of data as quickly as possible and is divided into three main modules:

  • TeraGen: This module is used to generate a file of the desired size as an input that usually ranges between 500 GB up to 3 TB. Once the input data is generated by TeraGen, it can be used in all the runs with the same file size.

  • TeraSort: This module will sort the input file across the Hadoop cluster. TeraSort stresses the Hadoop cluster on both layers, MapReduce and HDFS, in terms of computation, network bandwidth, and I/O response. Once sorted, a single reduced...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Optimizing Hadoop for MapReduce
Published in: Feb 2014Publisher: ISBN-13: 9781783285655

Author (1)

author image
Khaled Tannir

Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair Zx81 and later with Commodore home computer products (Vic 20, Commodore 64, Commodore 128D, and Amiga 500). He has a Bachelor's degree in Electronics, a Master's degree in System Information Architectures, in which he graduated with a professional thesis, and completed his education with a Master of Research degree. He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada. With significant experience in Microsoft .Net, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline applications design, system conversions, and multilingual applications in both domains: Internet and Desktops. He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle-east. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, RaspBerry Pi, and .Net Gadgeteer, and some smartphone devices based on Windows Phone, Android, and iOS operating systems. In 2012, he contributed to the EGC 2012 (International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on "how to optimize data distribution in a cloud computing environment". This work aims to define an approach to optimize the use of data mining algorithms such as k-means and Apriori in a cloud computing environment. He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing. He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies. He enjoys taking landscape and night time photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.Net Gadgeteer, and of course, spending time with his wife and family. You can reach him at contact@khaledtannir.net.
Read more about Khaled Tannir