Chapter 4. Big Data With Hadoop

Hadoop has become the de facto standard in the world of big data, especially over the past three to four years. Hadoop started as a subproject of Apache Nutch in 2006 and introduced two key capabilities, a distributed filesystem and a distributed computing model known as MapReduce, that caught on rapidly in the open source community. Today, thousands of products have been built on Hadoop's core features, and it has evolved into a vast ecosystem of more than 150 major related projects. Arguably, Hadoop was one of the primary catalysts that started the big data and analytics industry.

In this chapter, we will discuss the background and core concepts of Hadoop and the components of the Hadoop platform, and delve deeper into the major products in the Hadoop ecosystem. We will learn about the core concepts of distributed filesystems and distributed processing, as well as optimizations that improve the performance of Hadoop...

The fundamentals of Hadoop


In 2006, Doug Cutting, the creator of Hadoop, was working at Yahoo!. He was actively engaged in an open source project called Nutch that involved the development of a large-scale web crawler. At a high level, a web crawler is essentially software that browses and indexes web pages on the internet, generally in an automatic manner. Intuitively, this involves efficient management of, and computation across, large volumes of data. In late January of 2006, Doug formally announced the start of Hadoop. The first line of the request, still available on the internet at https://issues.apache.org/jira/browse/INFRA-700, was "The Lucene PMC has voted to split part of Nutch into a new subproject named Hadoop." And thus, Hadoop was born.

At the outset, Hadoop had two core components: the Hadoop Distributed File System (HDFS) and MapReduce. This was the first iteration of Hadoop, now also known as Hadoop 1. Later, in 2012, a third component was added, known as YARN (Yet Another Resource...
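To make the distributed filesystem component a little more concrete, the following is a minimal sketch, not taken from the book, of a client program interacting with HDFS through Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:8020) and the file paths shown are hypothetical placeholders for whatever your cluster is configured with.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; the address below is a placeholder
        // for whatever fs.defaultFS is set to in your cluster configuration.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (both paths are hypothetical).
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/cloudera/input/sample.txt"));

        // List the HDFS directory to confirm the upload.
        for (FileStatus status : fs.listStatus(new Path("/user/cloudera/input"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }

        fs.close();
    }
}

The point of the sketch is that client code addresses HDFS as a single logical filesystem; the splitting of files into blocks and their replication across DataNodes happens transparently behind this API.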

The Hadoop ecosystem


This section could just as well be titled the Apache ecosystem. Hadoop, like all the other projects discussed in this section, is an Apache project. Apache is used loosely as shorthand for the open source projects supported by the Apache Software Foundation. The foundation has its roots in the development of the Apache HTTP Server in the mid-1990s, and today it is a collaborative global initiative made up entirely of volunteers who release open source software to the global technical community.

Hadoop started out as, and still is, one of the projects in the Apache ecosystem. Due to its popularity, many other Apache projects have become linked, directly or indirectly, to Hadoop because they support key functionality in the Hadoop environment. That said, it is important to bear in mind that most of these projects can exist as independent products that function without a Hadoop environment. Whether it would provide...

Hands-on with CDH


In this section, we will use the CDH QuickStart VM to work through some of the topics discussed in this chapter. The exercises do not need to be performed in order, and none of them depends on the completion of any other.

We will complete the following exercises in this section:

  • WordCount using Hadoop MapReduce
  • Working with the HDFS
  • Downloading and querying data with Apache Hive

WordCount using Hadoop MapReduce

In this exercise, we will count the number of occurrences of each word in one of the longest novels ever written. For the exercise, we have selected Artamène ou le Grand Cyrus, written by Georges and/or Madeleine de Scudéry between 1649 and 1653. The book is considered the second longest novel ever written, per the related list on Wikipedia (https://en.wikipedia.org/wiki/List_of_longest_novels). The novel spans 13,905 pages across 10 volumes and has close to two million...
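The excerpt is cut off here, but to make the exercise concrete, the sketch below shows the canonical WordCount job in Java, closely following the standard Hadoop MapReduce tutorial rather than the book's own listing. The input and output HDFS paths are supplied as command-line arguments; the example paths in the comments are assumptions, not values from the book.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/cloudera/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/cloudera/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, a job like this would typically be launched from the QuickStart VM with hadoop jar wordcount.jar WordCount <input path> <output path>, with the novel's text uploaded to the input path in HDFS beforehand.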

Summary


This chapter provided a technical overview of Hadoop. We discussed the core components and concepts that are fundamental to Hadoop, such as MapReduce and HDFS, and looked at the technical challenges and considerations of using Hadoop. While Hadoop may appear simple in concept, the inner workings and formal administration of a Hadoop architecture can be fairly complex, and this chapter highlighted a few of those challenges.

We concluded with a hands-on exercise on Hadoop using the Cloudera Distribution. For this tutorial, we used the CDH Virtual Machine downloaded earlier from Cloudera's website.

In the next chapter, we will look at NoSQL, an alternative or complementary solution to Hadoop depending upon your individual and/or organizational needs. While Hadoop offers a far richer set of capabilities, if your intended use cases can be met with NoSQL solutions alone, the latter may be the easier choice in terms of the effort required.
