Search icon
Subscription
0
Cart icon
Close icon
You have no products in your basket yet
Save more on your purchases!
Savings automatically calculated. No voucher code required
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Apache Spark 2.x for Java Developers

You're reading from  Apache Spark 2.x for Java Developers

Product type Book
Published in Jul 2017
Publisher Packt
ISBN-13 9781787126497
Pages 350 pages
Edition 1st Edition
Languages
Authors (2):
Sourav Gulati Sourav Gulati
Profile icon Sourav Gulati
Sumit Kumar Sumit Kumar
Profile icon Sumit Kumar
View More author details

Table of Contents (19) Chapters

Title Page
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
1. Introduction to Spark 2. Revisiting Java 3. Let Us Spark 4. Understanding the Spark Programming Model 5. Working with Data and Storage 6. Spark on Cluster 7. Spark Programming Model - Advanced 8. Working with Spark SQL 9. Near Real-Time Processing with Spark Streaming 10. Machine Learning Analytics with Spark MLlib 11. Learning Spark GraphX

Chapter 1. Introduction to Spark

"We call this the problem of big data."

Arguably, the first time big data was being talked about in a context we know now was in July, 1997. MichaelCox and DavidEllsworth, scientists/researchers from NASA, described the problem they faced when processing humongous amounts of data with the traditional computers of that time. In the early 2000s, Lexis Nexis designed a proprietary system, which later went on to become the High-PerformanceComputingCluster (HPCC), to address the growing need of processing data on a cluster. It was later open sourced in 2011.

It was an era of dot coms and Google was challenging the limits of the internet by crawling and indexing the entire internet. With the rate at which the internet was expanding, Google knew it would be difficult if not impossible to scale vertically to process data of that size. Distributed computing, though still in its infancy, caught Google's attention. They not only developed a distributed fault tolerant filesystem, Google File System (GFS), but also a distributed processing engine/system called MapReduce. It was then in 2003-2004 that Google released the white paper titled The Google File System by SanjayGhemawat, HowardGobioff, and Shun-TakLeung, and shortly thereafter they released another white paper titled MapReduce: Simplified Data Processing on Large Clusters by JeffreyDean and SanjayGhemawat.

Doug Cutting, an open source contributor, around the same time was looking for ways to make an open source search engine and like Google was failing to process the data at the internet scale. By 1999, Doug Cutting had developed Lucene, a Java library with the capability of text/web searching among other things. Nutch, an open source web crawler and data indexer built by Doug Cutting along with Mike Cafarella, was not scaling well. As luck would have it, Google's white paper caught Doug Cutting's attention. He began working on similar concepts calling them Nutch Distributed File System (NDFS) and Nutch MapReduce. By 2005, he was able to scale Nutch, which could index from 100 million pages to multi-billion pages using the distributed platform.

However, it wasn't just Doug Cutting but Yahoo! too who became interested in the development of the MapReduce computing framework to serve its processing capabilities. It is here that Doug Cutting refactored the distributed computing framework of Nutch and named it after his kid's elephant toy, Hadoop. By 2008, Yahoo! was using Hadoop in its production cluster to build its search index and metadata called web map. Despite being a direct competitor to Google, one distinct strategic difference that Yahoo! took while co-developing Hadoop was the nature in which the project was to be developed: they open sourced it. And the rest, as we know is history!

In this chapter, we will cover the following topics:

  • What is big data?
  • Why Apache Spark?
  • RDD the first citizen of Spark
  • Spark ecosystem -- Spark SQL, Spark Streaming, Milb, Graphx
  • What's new in Spark 2.X?
You have been reading a chapter from
Apache Spark 2.x for Java Developers
Published in: Jul 2017 Publisher: Packt ISBN-13: 9781787126497
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}