Learning Hadoop 2

More Information
  • Write distributed applications using the MapReduce framework
  • Go beyond MapReduce and process data in real time with Samza and iteratively with Spark
  • Familiarize yourself with data mining approaches that work with very large datasets
  • Prototype applications on a VM and deploy them to a local cluster or to a cloud infrastructure (Amazon Web Services)
  • Conduct batch and real-time data analysis using SQL-like tools
  • Build data processing flows using Apache Pig and see how it enables the easy incorporation of custom functionality
  • Define and orchestrate complex workflows and pipelines with Apache Oozie
  • Manage your data lifecycle and changes over time

This book introduces you to the world of building data-processing applications with the wide variety of tools supported by Hadoop 2. Starting with the core components of the framework, HDFS and YARN, it guides you through building applications using a variety of approaches.

You will learn how YARN completely changes the relationship between MapReduce and Hadoop and allows the latter to support more varied processing approaches and a broader array of applications. These include real-time processing with Apache Samza and iterative computation with Apache Spark. Next up, we discuss Apache Pig and the dataflow data model it provides. You will discover how to use Pig to analyze a Twitter dataset.

This book also shows how tools such as Apache Hive, Apache Oozie, Hadoop Streaming, Apache Crunch, and Kite SDK can make your work easier. The last part of the book discusses the likely future direction of major Hadoop components and how to get involved with the Hadoop community.

  • Construct state-of-the-art applications using higher-level interfaces and tools beyond the traditional MapReduce approach
  • Use the unique features of Hadoop 2 to model and analyze Twitter’s global stream of user-generated data
  • Develop a prototype on a local cluster and deploy to the cloud (Amazon Web Services)

Page Count 382
Course Length 11 hours 27 minutes
ISBN 9781783285518
Date Of Publication 13 Feb 2015


Garry Turkington

Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.

Gabriele Modena

Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry, where he did research in machine learning and artificial intelligence.

He holds a BSc degree in Computer Science from the University of Trento, Italy, and a Research MSc degree in Artificial Intelligence: Learning Systems from the University of Amsterdam in the Netherlands.