Chapter 7. Spark 2.0 Concepts

Now that you have seen the fundamental underpinnings of Spark, let's take a broader look at the architecture, context, and ecosystem in which Spark operates. This is a catch-all chapter that covers a diverse set of essential topics to help you understand Spark as a whole. Once you have gone through it, you will understand who is using Spark, and how and where it is being used. This chapter will cover the following topics:

  • The Datasets accompanying this book and the IDEs for data wrangling

  • A quick description of a data scientist's expectations of Spark

  • The Data Lake architecture and the position of Spark

  • The evolution and progression of Spark Architecture to 2.0

  • The Parquet data storage mechanism

So, with a good fundamental knowledge of the Spark framework, let's focus on these three topics: data scientist DevOps, data wrangling, and, of course, the mechanisms in Apache Spark, including DataFrames, machine learning, and working with big...

Code and Datasets for the rest of the book


The first order of business is to look at the code and Datasets that we will be using for the rest of the chapters.

Code

It is time for you to experiment with Spark APIs and wrangle with data. We have been using the Scala and Python shells in this book, and you can continue to do so. You should also explore using an iPython notebook, which is an excellent way for data engineers and data scientists to experiment with data. The iPython notebooks and their Datasets are available at https://github.com/xsankar/fdps-v3. You'll have to download some of the data yourself because of restrictions on distributing it; we have provided the appropriate URL wherever such a download is needed.
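
For example, once you have cloned the repository, a quick way to sanity-check your setup from the Scala shell is to load a file into a DataFrame. This is a minimal sketch; the file path below is a hypothetical placeholder for whichever Dataset you have downloaded:

    // Minimal sketch: load a CSV Dataset into a DataFrame from the Scala shell.
    // The path below is a hypothetical placeholder; point it at any file you
    // have cloned or downloaded from https://github.com/xsankar/fdps-v3.
    val df = spark.read
      .option("header", "true")      // first row holds the column names
      .option("inferSchema", "true") // let Spark guess the column types
      .csv("fdps-v3/data/some-dataset.csv")

    df.printSchema() // inspect the inferred schema
    df.show(5)       // peek at the first few rows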

IDE

For this book, we will use scala-shell and pyspark. The Zeppelin IDE is another fine choice. Python is a better language for data scientists and has a tradition of strong scientific libraries. For those of you who prefer Scala, it is not that hard to map Python...

The data scientist and Spark features


One of the interesting questions relevant to this book is, "What do data scientists want?" It is a question that is being discussed and debated in many blogs. A short answer is as follows:

  • The ability to explore, model, and reason about data at scale, because many of their algorithms get asymptotically better with more data; a small Dataset sample is therefore not enough for exploring different algorithms

  • The ability to deploy without a lot of impedance

  • The facility to evolve models once they are in production and the real world is using them

In short, all we ask for is the shortest path from the lab to the factory, enabling a data scientist DevOps person! The following screenshot (combining talks from Josh Wills and Ian Buss), which displays The Sense & Sensibility of a Data Scientist DevOps, succinctly shows the value of Apache Spark to a data scientist by addressing three points:

Who is this data scientist DevOps person?

Of course, we really do not want to start...

Spark v2.0 and beyond


Spark v2.0 and beyond has been the catalyst for a renaissance in data science! Datasets, DataFrames, ML pipelines, and new and improved algorithms in MLlib have paved the way for data wrangling at scale. I think version 2.0 marks the point where Spark turned into a mature framework, one that can handle huge workloads in terms of both the number of machines and the volume of data. The community update at the Spark Summit 2015 in San Francisco included a slide that showed the power of Spark:

  • The largest cluster: 8,000 nodes (Tencent)

  • The largest single job: 1 petabyte and more (Alibaba and Tencent)

  • The longest-running job: 1 petabyte and more for a week (Alibaba)

  • The top streaming intake: 1 terabyte/hour (Janelia Farm)

  • The largest shuffle: 1 petabyte during the sort benchmark (Databricks)

  • Netflix uses Spark for ad-hoc query and experimentation; they have 1,500 and more Spark nodes with 100 terabytes of memory, chugging through 15 petabytes and more of S3 data and 7 petabytes of Parquet

  • Tencent...

Apache Spark - evolution


It is interesting to trace the evolution of Apache Spark from an abstract perspective. Spark started out as a fast engine for big data processing: fast to run the code and fast to write the code as well. The original value proposition for Spark was that it offered faster in-memory computation graphs, compatibility with the Hadoop ecosystem, and interesting and very usable APIs in Scala, Java, and Python. RDDs ruled the world. The focus was on iterative and interactive apps that operated on data multiple times, which was not a good use case for Hadoop.
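
As a small illustration of that original RDD-centric model, the following sketch (with made-up numbers and a trivial filter, purely for illustration) caches an RDD in memory and reuses it across several passes, which is exactly the pattern that was awkward in plain Hadoop MapReduce:

    // Sketch of the classic RDD model: cache once, iterate many times.
    // sc is the SparkContext that the shell provides.
    val numbers = sc.parallelize(1 to 1000000)
    val evens = numbers.filter(_ % 2 == 0).cache() // keep the filtered data in memory

    // Each action below reuses the cached RDD instead of recomputing the filter.
    val count = evens.count()
    val total = evens.sum()
    println(s"count = $count, sum = $total")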

The evolution didn't stop there. As Matei pointed out in his talk at MIT, users wanted more, and the Spark programming model evolved to include the following functionalities:

  • More complex, multi-pass analytics (for example, ML pipelines and graph; a small pipeline sketch follows this list)

  • More interactive ad-hoc queries

  • More real-time stream processing

  • More parallel machine learning algorithms beyond the basic RDDs

  • More types of data sources as input and output

  • More integration...
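
To make the multi-pass analytics point concrete, here is a minimal, hypothetical ML pipeline sketch in the spark.ml style; the tiny training set and the column names are invented solely for illustration:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Invented two-row training set, just to make the sketch self-contained.
    val training = spark.createDataFrame(Seq(
      ("spark is fast", 1.0),
      ("hadoop map reduce", 0.0)
    )).toDF("text", "label")

    // Three stages chained into one reusable, multi-pass workflow.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)      // a single call runs every stage
    model.transform(training).select("text", "prediction").show()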

Apache Spark - the full stack


With all of this background information behind us, let's take a quick look at the full Spark stack, shown in the following diagram. It used to be a lot simpler, which shows how the Spark ecosystem is continually evolving:

The Spark stack currently includes the following features:

  • It provides the Spark SQL feature, which uses SQL for data manipulation while maintaining the underlying Spark computations. It also provides a vital interface by exposing Datasets to external systems through JDBC/ODBC, which is arguably the best value of Spark SQL (a short sketch of the SQL and DataFrame interfaces follows this list).

  • It offers advanced analytics, which is still evolving; look out for features such as parameter servers and neural networks in later versions of Spark.

  • It provides the Dataset/DataFrame API, of course. This is one of the parts we are focusing on in this book, and we will see more of it in the following chapters.

  • The Catalyst optimizer is an interesting beast. It is the proverbial software layer that separates a declarative API/interface...
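
As a quick, hedged sketch of what that looks like in practice, the following lines express the same aggregation once through Spark SQL and once through the DataFrame API; the flights file and its origin column are assumptions for illustration only:

    // The input file and column names here are illustrative assumptions.
    val flights = spark.read.option("header", "true").csv("flights.csv")

    flights.createOrReplaceTempView("flights") // expose the DataFrame to SQL
    val bySql = spark.sql(
      "SELECT origin, count(*) AS cnt FROM flights GROUP BY origin")

    val byApi = flights.groupBy("origin").count() // the same query, DataFrame style

    // Both paths go through the Catalyst optimizer and yield equivalent plans.
    bySql.show(5)
    byApi.show(5)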

The art of a big data store - Parquet


For an efficient and performant computing stack, we also need an equally optimal storage mechanism. Parquet fits the bill and can be considered as a best practice. The pattern uses the HDFS file system, curates the Datasets, and stores them in the Parquet format.

Parquet is a very efficient columnar format for data storage, initially developed by contributors from Twitter, Cloudera, Criteo, the Berkeley AMPLab, LinkedIn, and Stripe. The Google Dremel paper (Dremel, 2010) inspired the basic algorithms and design of Parquet. It is now a top-level Apache project, Apache Parquet, and is the default format for read and write operations in Spark DataFrames. Almost all big data products, from MPP databases to query engines to visualization tools, interface natively with Parquet. Let's take a quick look at the science of data storage in the big data domain, what we need from a store, and the capabilities of Parquet.
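
As a minimal sketch of that pattern (the input file, the column names, and the output path are assumptions for illustration), writing and reading Parquet from a DataFrame looks like this:

    // Write a DataFrame out as Parquet, then read it back.
    val cars = spark.read.option("header", "true").csv("cars.csv")

    cars.write.mode("overwrite").parquet("cars.parquet") // columnar, compressed on disk

    val carsBack = spark.read.parquet("cars.parquet")    // the schema travels with the data
    carsBack.printSchema()

    // Column projection: only the columns referenced here are read from disk.
    carsBack.select("make", "mpg").show(5)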

Column projection and data partition

Column...

Summary


This is an interesting chapter in which we discussed the broader picture of where Spark fits into the big data and analytics ecosystem. First, we looked at the Datasets that accompany this book, as well as some interesting IDEs. We then discussed the role of data scientists and what they expect from a Spark stack, which led our discussion to the Spark-based Data Lake architecture and then the Spark stack itself. We also looked at Parquet as an efficient storage format.
