You're reading from  Microsoft SQL Server 2012 with Hadoop

Product type: Book
Published in: Aug 2013
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781782177982
Edition: 1st Edition
Author (1)
Debarchan Sarkar

Debarchan Sarkar is a Microsoft Data Platform engineer. He specializes in the Microsoft SQL Server Business Intelligence stack. Debarchan is a subject matter expert in SQL Server Integration Services and delves deep into the open source world, specifically the Apache Hadoop framework. He is currently working on a technology called HDInsight, which is Microsoft's distribution of Hadoop on Windows. He has authored various books on SQL Server and Big Data, including Microsoft SQL Server 2012 with Hadoop, Packt Publishing, and Pro Microsoft HDInsight: Hadoop on Windows, Apress. His Twitter handle is @debarchans.

The Apache Hadoop framework


Hadoop is an open source software framework, available through the Apache open source community, that supports data-intensive distributed applications. It consists of a distributed file system, the Hadoop Distributed File System (HDFS), and an approach to distributed processing and analysis called MapReduce. It is written in Java and based on the Linux/Unix platform.

It is now used extensively to process streams of data that go well beyond even the largest enterprise datasets in size. Whether the data comes from sensors, clickstreams, social media, location-based services, or other sources that generate and collect it in large volumes, Hadoop is often used to process and analyze it. The real strength of Hadoop is its ability to move the processing or computing logic to the data where it resides. Traditional systems, by contrast, focus on a single scaled-up server: they move the data to that central processing unit and process it there. That model does not cope with the volume, velocity, and variety of data that present-day industry wants to mine for business intelligence. Hence, Hadoop, with its fault-tolerant and reliable file system and its highly optimized distributed computing model, is one of the leaders in the Big Data world.

The core of Hadoop is its storage system and its distributed computing model:

HDFS

The Hadoop Distributed File System is a program-level abstraction on top of the host OS file system. It is responsible for storing data on the cluster. Data is split into blocks and distributed across multiple nodes in the cluster.
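As a rough illustration of "split into blocks and distributed across multiple nodes", the following Python sketch cuts a byte string into fixed-size blocks and assigns them to nodes. The block size, the node names, and the round-robin assignment are assumptions made for the example; real HDFS blocks are far larger, and HDFS uses its own placement logic.

```python
# Illustrative only: the tiny block size, node names, and round-robin
# policy are hypothetical and are not how HDFS itself places blocks.

def split_into_blocks(data, block_size):
    """Cut the data into consecutive fixed-size blocks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def distribute(blocks, nodes):
    """Assign block indexes to nodes in round-robin order."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


file_contents = b"0123456789" * 10  # a 100-byte stand-in for a file
blocks = split_into_blocks(file_contents, block_size=16)
placement = distribute(blocks, ["node-a", "node-b", "node-c"])

print(len(blocks))   # number of blocks the file was cut into
print(placement)     # which block indexes each node holds
```

Joining the blocks back together in index order reproduces the original data, which is why the file system only needs to remember which node holds which block.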

MapReduce

MapReduce is a programming model for processing large datasets using distributed computing on clusters of computers. MapReduce consists of two phases: Map, in which the data is divided across a large number of separate processing units that each process their share, and Reduce, in which the results produced by these individual processes are combined into a unified result set. Between Map and Reduce, a shuffle and sort step occurs.

A Hadoop cluster, once successfully configured, has the following basic components:

NameNode

This is also called the Head Node or Master Node of the cluster. Its primary role is to hold the HDFS metadata: as data is distributed across the nodes, it keeps track of where each HDFS data block is stored.

Note

The NameNode is the single point of failure in a Hadoop cluster.

Secondary NameNode

This is an optional node that you can have in your cluster to back up the NameNode if it goes down. If a secondary NameNode is configured, it keeps a periodic snapshot of the NameNode metadata to serve as a backup when needed. However, there is no automated way to fail over to the secondary NameNode; if the primary NameNode goes down, manual intervention is needed. This means that your cluster will experience downtime whenever the NameNode goes down.

DataNode

These are the systems across the cluster that store the actual HDFS data blocks. The data blocks are replicated on multiple nodes to provide fault tolerance and high availability.
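To see why replication gives fault tolerance, consider the following minimal Python sketch. It maps each block to several DataNodes and then checks which blocks are still readable after some nodes fail. The node names, block identifiers, and rotating placement policy are hypothetical; the replication factor of 3 matches the usual HDFS default, but it is just a parameter here.

```python
# Illustrative only: names and the rotating placement policy are made up.

def place_replicas(block_ids, nodes, replication=3):
    """Map each block to `replication` distinct nodes using a rotating offset."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    block_map = {}
    for i, block_id in enumerate(block_ids):
        block_map[block_id] = [
            nodes[(i + r) % len(nodes)] for r in range(replication)
        ]
    return block_map


def readable_blocks(block_map, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block_id
        for block_id, holders in block_map.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["dn1", "dn2", "dn3", "dn4"]
block_map = place_replicas(["blk_0", "blk_1", "blk_2", "blk_3"], nodes)
survivors = readable_blocks(block_map, failed_nodes={"dn1", "dn2"})

print(block_map["blk_0"])  # the nodes holding replicas of the first block
print(survivors)           # blocks still readable with dn1 and dn2 down
```

With three replicas spread over four nodes, losing any two nodes still leaves at least one copy of every block. The dictionary plays the role of the block-tracking metadata that the NameNode keeps.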

JobTracker

This is a service, running on the NameNode, that manages MapReduce jobs and distributes individual tasks.

TaskTracker

This is a service, running on the DataNodes, that instantiates and monitors the individual Map and Reduce tasks that are submitted.
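The Map and Reduce tasks that the TaskTrackers run follow the MapReduce model: Map, then shuffle and sort, then Reduce. As a minimal single-process sketch of those three steps, here is a word count in plain Python. It does not use Hadoop's API; the function names and input lines are invented for the illustration, and on a real cluster the map and reduce calls would run as separate tasks on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


lines = [
    "big data big clusters",
    "data moves to compute",
    "compute moves to data",
]

# Each line could be mapped independently, which is what makes the
# Map phase easy to spread across many machines.
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))

print(word_counts)
```

The shuffle and sort step guarantees that every value for a given key reaches the same reduce call, so each reducer can compute its result without seeing the rest of the data.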

The following figure shows you the core components of the Apache Hadoop framework:

Additionally, there are a number of supporting projects for Hadoop, each with its own purpose: for example, feeding input data into a Hadoop system, or providing a data warehousing system for ad hoc queries on top of Hadoop. The following are a few worth mentioning:

Hive

Hive is a supporting project for the main Apache Hadoop project and is an abstraction on top of MapReduce, which allows users to query the data without developing MapReduce applications. It provides the user with a SQL-like query language called Hive Query Language (HQL) to fetch data from the Hive store. This makes it easier for people with SQL skills to adapt to the Hadoop environment quickly.
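To give a feel for what "SQL-like" means in practice, the sketch below runs a grouped count as a single declarative query. It uses Python's built-in sqlite3 module purely as a stand-in, since no Hive installation is assumed; the table name, column names, and rows are invented. A SELECT ... GROUP BY query of this shape is the kind of statement HQL accepts, although HQL differs from SQLite's dialect in other respects.

```python
import sqlite3

# SQLite stands in for the Hive store here; the table and its rows are
# hypothetical and exist only to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 1), ("home", 2), ("cart", 1), ("home", 3), ("cart", 2)],
)

# A declarative query: the user states the result wanted and writes
# no map or reduce functions by hand.
query = """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC
"""
views_per_page = conn.execute(query).fetchall()
conn.close()

print(views_per_page)
```

In Hive, a query like this is translated into MapReduce jobs behind the scenes, which is what spares SQL users from writing those jobs themselves.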

Pig

Pig is an alternative abstraction on MapReduce, which uses a dataflow scripting language called PigLatin. This is favored by programmers who already have scripting skills. You can run PigLatin statements interactively in a command-line Pig shell named Grunt. You can also combine a sequence of PigLatin statements in a script, which can then be executed as a unit. The Pig interpreter uses these PigLatin statements to generate MapReduce jobs, which are executed on the HDFS data.

Flume

Flume is another open source project on top of Hadoop; it provides a mechanism for ingesting data into HDFS as the data is generated.

Sqoop

Sqoop provides a way to transfer data in both directions between relational database tables (for example, in SQL Server) and HDFS.

Oozie

Oozie allows the creation of workflows of MapReduce jobs. The concept will be familiar to developers who have worked on Windows Workflow Foundation and Windows Communication Foundation based solutions.

HBase

HBase is the Hadoop database, a NoSQL database. It is another abstraction on top of Hadoop, which provides a near real-time query mechanism for HDFS data.

Mahout

Mahout is a machine-learning library that contains algorithms for clustering and classification. One major focus of machine-learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.

The following figure gives you a 1,000-foot view of Apache Hadoop and the various supporting projects that form this ecosystem:

We will be exploring some of these components in the subsequent chapters of this book, but for a complete reference, please visit the Apache website http://hadoop.apache.org/.

Setting up this ecosystem along with the required supporting projects can be far from trivial. In fact, the only drawback of this implementation is the effort needed to set up and administer a Hadoop cluster. This is the reason many vendors are coming up with their own distributions of Hadoop, bundled and distributed as data processing platforms. Using these distributions, enterprises can set up Hadoop clusters in minutes through simplified, user-friendly cluster deployment wizards, and use the various dashboards for monitoring and instrumentation. Some of the present-day distributions that are quickly gaining popularity are CDH4 from Cloudera, Hortonworks Data Platform, and Microsoft HDInsight. These distributions are outside the scope of this book and won't be covered; please visit the respective websites for detailed information about them.
