
You're reading from  Modern Big Data Processing with Hadoop

Product type: Book
Published in: Mar 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787122765
Edition: 1st
Authors (3):

V Naresh Kumar

Naresh has more than a decade of professional experience in designing, implementing, and running very large-scale internet applications at Fortune 500 companies. He is a full-stack architect with hands-on experience in domains such as e-commerce, web hosting, healthcare, big data and analytics, data streaming, advertising, and databases. He believes in open source and contributes to it actively. He keeps himself up to date with emerging technologies, from Linux systems internals to frontend technologies. He studied at BITS Pilani, Rajasthan, earning a dual degree in computer science and economics.

Manoj R Patil

Manoj R Patil is the Chief Architect in Big Data at Compassites Software Solutions Pvt. Ltd., where he oversees the overall platform architecture for big data solutions and also contributes hands-on to some assignments. He has been working in the IT industry for the last 15 years. He started as a programmer and, along the way, acquired skills in architecting and designing solutions, managing projects with each stakeholder's interest in mind, and deploying and maintaining solutions on cloud infrastructure. He has been working on the Pentaho stack for the last 5 years, providing solutions both for employers and as a freelancer. Manoj has extensive experience in Java EE, MySQL, various frameworks, and business intelligence, and is keen to pursue his interest in predictive analytics. He was also associated with TalentBeat, Inc. and Persistent Systems, where he implemented interesting solutions in logistics, data masking, and data-intensive life sciences.

Prashant Shindgikar

Prashant Shindgikar is an accomplished big data architect with over 20 years of experience in data analytics. He specializes in data innovation and resolving data challenges for major retail brands. He is a hands-on architect with an innovative approach to solving data problems. He provides thought leadership and pursues strategies for engaging senior executives on innovation in data processing and analytics. He currently works for a large US-based retail company.


Large-Scale Data Processing Frameworks

As the volume and complexity of data sources increase, deriving value from data becomes increasingly difficult. Since its inception, Hadoop has provided a massively scalable filesystem, HDFS, and has adopted the MapReduce concept from functional programming to tackle large-scale data processing challenges. As technology constantly evolves to overcome the challenges posed by data mining, enterprises are also finding ways to embrace these changes and stay ahead.

In this chapter, we will focus on these data processing solutions:

  • MapReduce
  • Apache Spark
  • Spark SQL
  • Spark Streaming

MapReduce

MapReduce is a concept borrowed from functional programming. Data processing is broken down into a map phase, where data preparation occurs, and a reduce phase, where the actual results are computed. MapReduce has played such an important role because of the massive parallelism it makes possible when data is sharded across multiple distributed servers; without that parallelism, MapReduce offers little performance benefit.

Let's take up a simple example to understand how MapReduce works in functional programming:

  • The input data is processed using a mapper function of our choice
  • The output from the mapper function should be in a state that is consumable by the reduce function
  • The output from the mapper function is fed to the reduce function to generate the necessary results

Let's understand these steps using a simple program. This program uses the following text...
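Since the sample text itself is elided above, the three steps can be sketched in plain Python with a short stand-in input (the variable names and sample lines below are illustrative, not taken from the book):

```python
from functools import reduce

def mapper(line):
    # Map phase: prepare the data by emitting a (word, 1) pair per word
    return [(word.lower(), 1) for word in line.split()]

def reducer(acc, pairs):
    # Reduce phase: compute the actual result by summing counts per word
    for word, count in pairs:
        acc[word] = acc.get(word, 0) + count
    return acc

lines = ["Hadoop MapReduce", "Apache Hadoop"]   # stand-in input text
mapped = [mapper(line) for line in lines]       # each line is mapped independently
counts = reduce(reducer, mapped, {})            # mapper output is folded into the result
print(counts)   # {'hadoop': 2, 'mapreduce': 1, 'apache': 1}
```

Note how the mapper's output, a list of key-value pairs, is exactly the shape the reducer consumes; this handoff contract is what the second bullet point above refers to.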

Hadoop MapReduce

Hadoop MapReduce is a framework that makes it easy for us to run MapReduce operations on very large, distributed datasets. One of Hadoop's advantages is its distributed filesystem, which is rack-aware and scalable. The Hadoop job scheduler is intelligent enough to make sure that the computation happens on the nodes where the data is located, which is a very important property, as it reduces the amount of network I/O.

Let's see how the framework makes it easier to run massively parallel computations with the help of this diagram:

This diagram looks a bit more complicated than the previous one, but most of the work is done for us by the Hadoop MapReduce framework itself. We still write the code for mapping and reducing our input data.

Let's see in detail what happens when we process our data with the Hadoop MapReduce framework from the preceding...
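Hadoop's native MapReduce API is Java, but the same division of labour can be sketched in Python in the style of Hadoop Streaming, where the mapper and reducer are plain programs exchanging tab-separated records. In this sketch the sorted() call stands in for the shuffle-and-sort step the framework performs between the two phases, and the sample input is illustrative:

```python
from itertools import groupby

def streaming_mapper(lines):
    # Emit one "word<TAB>1" record per word, as a streaming mapper would on stdin
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def streaming_reducer(records):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so all records for a given word arrive contiguously
    for word, group in groupby(records, key=lambda r: r.split("\t")[0]):
        total = sum(int(r.split("\t")[1]) for r in group)
        yield f"{word}\t{total}"

# sorted() simulates the shuffle/sort the framework performs between phases
mapped = sorted(streaming_mapper(["Hadoop MapReduce", "Apache Hadoop"]))
result = list(streaming_reducer(mapped))
print(result)   # ['apache\t1', 'hadoop\t2', 'mapreduce\t1']
```

In a real cluster, many mapper and reducer instances run in parallel on the nodes holding the data blocks; the per-key grouping guarantee is what lets each reducer work independently.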

Apache Spark 2

Apache Spark is a general-purpose cluster computing system that is very well suited to large-scale data processing. It can perform up to 100 times faster than Hadoop MapReduce when running completely in memory, and up to 10 times faster when running entirely from disk. It has a sophisticated directed acyclic graph (DAG) execution engine that supports an acyclic data flow model.

Apache Spark has first-class support for writing programs in the Java, Scala, Python, and R programming languages, catering to a wide audience. It offers more than 80 different operators for building parallel applications without worrying about the underlying infrastructure.

Apache Spark has a library for structured query processing, known as Spark SQL, which supports writing queries in programs using ANSI SQL. It also has support for processing streaming data, which is very much needed in today's real-time data processing...

Summary

In this chapter, you looked at the basic concepts of large-scale data processing frameworks. You also learned that one of Spark's powerful features is the ability to build applications that process real-time streaming data and produce real-time results.

In the next few chapters, we will discuss how to build real-time data search pipelines with Elasticsearch stack.

