Learning Big Data with Amazon Elastic MapReduce

Amarkant Singh, Vijay Rayapati

Easily learn, build, and execute real-world Big Data solutions using Hadoop and AWS EMR
eBook: $26.99 (RRP $26.99)
Print + eBook: $44.99 (RRP $44.99)

Book Details

ISBN 13: 9781782173434
Paperback: 242 pages

About This Book

  • Learn how to solve big data problems using Apache Hadoop
  • Use Amazon Elastic MapReduce to create and maintain cluster infrastructure for big data analytics
  • A step-by-step guide exploring the vast set of services provided by Amazon on the cloud

Who This Book Is For

This book is aimed at developers and system administrators who want to learn about Big Data analysis using Amazon Elastic MapReduce. Basic Java programming knowledge is required. You should be comfortable with using command-line tools. Prior knowledge of AWS, API, and CLI tools is not assumed. Also, no exposure to Hadoop and MapReduce is expected.

Table of Contents

Chapter 1: Amazon Web Services
What is Amazon Web Services?
Structure and Design
Services provided by AWS
Creating an account on AWS
Launching the AWS management console
Getting started with Amazon EC2
Getting started with Amazon S3
Summary
Chapter 2: MapReduce
The map function
The reduce function
What is MapReduce?
Data life cycle in the MapReduce framework
Real-world examples and use cases of MapReduce
Software distributions built on the MapReduce framework
Summary
Chapter 3: Apache Hadoop
What is Apache Hadoop?
Hadoop modules
Hadoop Distributed File System
Apache Hadoop MapReduce
Apache Hadoop as a platform
Summary
Chapter 4: Amazon EMR – Hadoop on Amazon Web Services
What is AWS EMR?
The EMR architecture
EMR use cases
Summary
Chapter 5: Programming Hadoop on Amazon EMR
Hello World in Hadoop
Mapper implementation
Reducer implementation
Driver implementation
Summary
Chapter 6: Executing Hadoop Jobs on an Amazon EMR Cluster
Creating an EC2 key pair
Creating an S3 bucket for input data and JAR
How to launch an EMR cluster
Viewing results
Summary
Chapter 7: Amazon EMR – Cluster Management
EMR cluster management – different methods
EMR bootstrap actions
EMR cluster monitoring and troubleshooting
EMR best practices
Summary
Chapter 8: Amazon EMR – Command-line Interface Client
EMR – CLI client installation
Launching and monitoring an EMR cluster using CLI
Summary
Chapter 9: Hadoop Streaming and Advanced Hadoop Customizations
Hadoop streaming
Adding streaming Job Step on EMR
Advanced Hadoop customizations
Emitting results to multiple outputs
Summary
Chapter 10: Use Case – Analyzing CloudFront Logs Using Amazon EMR
Use case definition
The solution architecture
Creating the Hadoop Job Step
Output ingestion to a data store
Using a visualization tool – Tableau Desktop
Summary

What You Will Learn

  • Create and access your account on AWS and learn about its various services
  • Launch a machine on the cloud infrastructure of AWS, get login credentials, and communicate with that machine
  • Learn about the logical dataflow of MapReduce and how it uses distributed computing effectively
  • Understand the benefits of EMR over a local Hadoop cluster
  • Discover the best practices that should be kept in mind while planning and executing a cluster/job on EMR
  • Launch a cluster on Amazon EMR, submit the Hello World wordcount job for processing, and download and view the results
  • Execute jobs on EMR using the two primary methods provided by EMR
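
The MapReduce dataflow and the wordcount job mentioned in the list above can be sketched in a few lines. This is a minimal in-memory illustration, not code from the book and not the Hadoop API: the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read its input from S3 or HDFS.
    print(word_count(["hello world", "hello EMR"]))
```

On a real cluster the framework performs the shuffle and spreads the map and reduce calls across many machines; the sketch only shows the order in which data moves through the three phases.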

In Detail

Amazon Elastic MapReduce is a web service used to process and store vast amounts of data, and it is one of the largest Hadoop operators in the world. As businesses generate and collect more data, and as cost-effective cloud-based solutions for distributed computing arrive, it has become far more feasible to crunch large amounts of data and get deep insights within a short span of time.

This book will get you started with AWS so that you can quickly create your own account and explore the services provided, many of which you might be delighted to use. This book covers the architectural details of the MapReduce framework, Apache Hadoop, various job models on EMR, how to manage clusters on EMR, and the command-line tools available with EMR. Each chapter builds on the knowledge of the previous one, leading to the final chapter where you will learn about solving a real-world use case using Apache Hadoop and EMR. This book will, therefore, get you up and running with major Big Data technologies quickly and efficiently.
