You're reading from Apache Spark 2.x for Java Developers

Product type Book

Published in Jul 2017

Publisher Packt

ISBN-13 9781787126497

Pages 350 pages

Edition 1st Edition

Languages

Java

Concepts

Data Streaming

Authors (2):

Sourav Gulati

Sumit Kumar

Table of Contents (19) Chapters

Title Page

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

1. Introduction to Spark

2. Revisiting Java

3. Let Us Spark

4. Understanding the Spark Programming Model

5. Working with Data and Storage

6. Spark on Cluster

7. Spark Programming Model - Advanced

8. Working with Spark SQL

9. Near Real-Time Processing with Spark Streaming

10. Machine Learning Analytics with Spark MLlib

11. Learning Spark GraphX

Chapter 1. Introduction to Spark

"We call this the problem of big data."

Arguably, the first time big data was discussed in the sense we know today was in July 1997. Michael Cox and David Ellsworth, researchers at NASA, described the problem they faced when processing very large amounts of data with the traditional computers of that time. In the early 2000s, LexisNexis designed a proprietary system, which later became the High-Performance Computing Cluster (HPCC), to address the growing need to process data on a cluster. It was open sourced in 2011.

It was the era of the dot-coms, and Google was testing the limits of the internet by crawling and indexing all of it. Given the rate at which the internet was expanding, Google knew it would be difficult, if not impossible, to scale vertically to process data of that size. Distributed computing, though still in its infancy, caught Google's attention. The company developed not only a distributed, fault-tolerant filesystem, the Google File System (GFS), but also a distributed processing engine called MapReduce. In 2003, Google published the white paper The Google File System by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, and in 2004 it followed with another white paper, MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat.

Around the same time, Doug Cutting, an open source contributor, was looking for ways to build an open source search engine and, like Google, was struggling to process data at internet scale. By 1999, Doug Cutting had developed Lucene, a Java library for text and web search, among other things. Nutch, an open source web crawler and data indexer that he built with Mike Cafarella, was not scaling well. As luck would have it, Google's white papers caught Doug Cutting's attention. He began working on similar concepts, calling them the Nutch Distributed File System (NDFS) and Nutch MapReduce. By 2005, he had scaled Nutch on this distributed platform so that it could index several billion pages rather than 100 million.

However, it was not just Doug Cutting who was interested: Yahoo! also wanted to develop the MapReduce computing framework to meet its own processing needs. It was at Yahoo! that Doug Cutting refactored the distributed computing framework out of Nutch and named it Hadoop, after his child's toy elephant. By 2008, Yahoo! was using Hadoop in its production cluster to build its search index and metadata, called the web map. Although it was a direct competitor to Google, Yahoo! made one distinct strategic choice while co-developing Hadoop: it open sourced the project. And the rest, as we know, is history!