Chapter 4. Big Data With Hadoop

Hadoop has become the de facto standard in the world of big data, especially over the past three to four years. Hadoop started as a subproject of Apache Nutch in 2006 and introduced two key capabilities, a distributed filesystem and a distributed computing model known as MapReduce, that caught on rapidly in the open source community. Today, thousands of products have been built on Hadoop's core features, and it has evolved into a vast ecosystem of more than 150 major related projects. Arguably, Hadoop was one of the primary catalysts that started the big data and analytics industry.

In this chapter, we will discuss the background and core concepts of Hadoop and the components of the Hadoop platform, and delve deeper into the major products in the Hadoop ecosystem. We will learn about the core concepts of distributed filesystems and distributed processing, as well as optimizations that improve the performance of Hadoop...

The fundamentals of Hadoop


In 2006, Doug Cutting, the creator of Hadoop, was working at Yahoo!. He was actively engaged in an open source project called Nutch that involved the development of a large-scale web crawler. At a high level, a web crawler is essentially software that browses and indexes web pages on the internet, generally in an automatic manner. Intuitively, this involves efficient management of, and computation across, large volumes of data. In late January of 2006, Doug formally announced the start of Hadoop. The first line of the request, still available on the internet at https://issues.apache.org/jira/browse/INFRA-700, was "The Lucene PMC has voted to split part of Nutch into a new subproject named Hadoop." And thus, Hadoop was born.

At the outset, Hadoop had two core components: the Hadoop Distributed File System (HDFS) and MapReduce. This was the first iteration of Hadoop, now also known as Hadoop 1. Later, in 2012, a third component was added, known as YARN (Yet Another Resource...
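To make the distributed filesystem component a little more concrete, the following is a minimal sketch, not taken from the book, of a client program interacting with HDFS through Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:8020) and the file paths shown are hypothetical placeholders for whatever your cluster is configured with.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; the address below is a placeholder
        // for whatever fs.defaultFS is set to in your cluster configuration.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (both paths are hypothetical).
        fs.copyFromLocalFile(new Path("/tmp/sample.txt"),
                             new Path("/user/cloudera/input/sample.txt"));

        // List the HDFS directory to confirm the upload.
        for (FileStatus status : fs.listStatus(new Path("/user/cloudera/input"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }

        fs.close();
    }
}

The point of the sketch is that client code addresses HDFS as a single logical filesystem; the splitting of files into blocks and their replication across DataNodes happens transparently behind this API.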

The Hadoop ecosystem


This section could just as well be titled the Apache ecosystem. Hadoop, like all the other projects discussed in this section, is an Apache project. Apache is used loosely as shorthand for the open source projects supported by the Apache Software Foundation. The foundation has its roots in the development of the Apache HTTP Server in the mid-1990s, and today it is a collaborative global initiative made up entirely of volunteers who release open source software to the global technical community.

Hadoop started out as, and still is, one of the projects in the Apache ecosystem. Due to its popularity, many other Apache projects have become linked, directly or indirectly, to Hadoop because they support key functionality in the Hadoop environment. That said, it is important to bear in mind that most of these projects can exist as independent products that function without a Hadoop environment. Whether it would provide...

Hands-on with CDH


In this section, we will use the CDH QuickStart VM to work through some of the topics discussed in this chapter. The exercises do not need to be performed in order, and none of them depends on the completion of any other.

We will complete the following exercises in this section:

  • WordCount using Hadoop MapReduce
  • Working with the HDFS
  • Downloading and querying data with Apache Hive

WordCount using Hadoop MapReduce

In this exercise, we will count the number of occurrences of each word in one of the longest novels ever written. For the exercise, we have selected Artamène ou le Grand Cyrus, written by Georges and/or Madeleine de Scudéry between 1649 and 1653. The book is considered the second longest novel ever written, per the related list on Wikipedia (https://en.wikipedia.org/wiki/List_of_longest_novels). The novel spans 13,905 pages across 10 volumes and has close to two million...
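The excerpt is cut off here, but to make the exercise concrete, the sketch below shows the canonical WordCount job in Java, closely following the standard Hadoop MapReduce tutorial rather than the book's own listing. The input and output HDFS paths are supplied as command-line arguments; the example paths in the comments are assumptions, not values from the book.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/cloudera/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/cloudera/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, a job like this would typically be launched from the QuickStart VM with hadoop jar wordcount.jar WordCount <input path> <output path>, with the novel's text uploaded to the input path in HDFS beforehand.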

Summary


This chapter provided a technical overview of Hadoop. We discussed the core components and concepts that are fundamental to Hadoop, such as MapReduce and HDFS, and looked at the technical challenges and considerations of using Hadoop. While Hadoop may appear simple in concept, the inner workings and formal administration of a Hadoop architecture can be fairly complex, and this chapter highlighted a few of those challenges.

We concluded with a hands-on exercise on Hadoop using the Cloudera Distribution. For this tutorial, we used the CDH Virtual Machine downloaded earlier from Cloudera's website.

In the next chapter, we will look at NoSQL, an alternative or complementary solution to Hadoop depending upon your individual and/or organizational needs. While Hadoop offers a far richer set of capabilities, if your intended use cases can be met with NoSQL solutions alone, the latter may be the easier choice in terms of the effort required.
