Spark 2.0 architecture


Apache Spark 2.0 is the most recent major release of the Apache Spark project, based on the key learnings from the last two years of development of the platform:

Source: Apache Spark 2.0: Faster, Easier, and Smarter http://bit.ly/2ap7qd5

The three overriding themes of the Apache Spark 2.0 release are performance enhancements (via Tungsten Phase 2), the introduction of Structured Streaming, and the unification of Datasets and DataFrames. We will describe Datasets here because they are part of Spark 2.0, even though they are currently only available in Scala and Java.

Note

Refer to the following presentations by key Spark committers for more information about Apache Spark 2.0:

Reynold Xin's Apache Spark 2.0: Faster, Easier, and Smarter webinar http://bit.ly/2ap7qd5

Michael Armbrust's Structuring Spark: DataFrames, Datasets, and Streaming http://bit.ly/2ap7qd5

Tathagata Das' A Deep Dive into Spark Streaming http://bit.ly/2aHt1w0

Joseph Bradley's Apache Spark MLlib 2.0 Preview: Data Science and Production http://bit.ly/2aHrOVN

Unifying Datasets and DataFrames

In the previous section, we stated that Datasets (at the time of writing this book) are only available in Scala or Java. However, we are providing the following context to better understand the direction of Spark 2.0.

Datasets were introduced in 2015 as part of the Apache Spark 1.6 release. The goal of Datasets was to provide a type-safe programming interface. This allowed developers to work with semi-structured data (like JSON or key-value pairs) with compile-time type safety (that is, production applications can be checked for errors before they run). Part of the reason why Python does not implement a Dataset API is that Python is not a type-safe language.

Just as important, the Dataset API contains high-level, domain-specific language operations such as sum(), avg(), join(), and group(). This latter trait means that you have the flexibility of traditional Spark RDDs, but the code is also easier to express, read, and write. Similar to DataFrames, Datasets can take advantage of Spark's Catalyst optimizer by exposing expressions and data fields to a query planner and making use of Tungsten's fast in-memory encoding.
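
Although the Dataset API itself is not exposed in Python, the same domain-specific operations are available on PySpark DataFrames. The following is a minimal sketch, assuming a SparkSession named spark (introduced later in this chapter) and a hypothetical flights.json file with origin and delay fields:

from pyspark.sql import functions as F

# Hypothetical input: one JSON record per flight with origin and delay
flights = spark.read.json('flights.json')

# Domain-specific operations analogous to the Dataset API's
# group()/avg()/sum(): average and total delay per origin airport
delays = (flights
          .groupBy('origin')
          .agg(F.avg('delay').alias('avg_delay'),
               F.sum('delay').alias('total_delay')))

delays.show()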

The history of the Spark APIs is depicted in the following diagram, which notes the progression from RDD to DataFrame to Dataset:

Source: From Webinar Apache Spark 1.5: What is the difference between a DataFrame and a RDD? http://bit.ly/29JPJSA

The unification of the DataFrame and Dataset APIs has the potential to break backward compatibility. This was one of the main reasons Apache Spark 2.0 was a major release (as opposed to a 1.x minor release, which would have minimized any breaking changes). As you can see from the following diagram, DataFrame and Dataset both belong to the new Dataset API introduced as part of Apache Spark 2.0:

Source: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets http://bit.ly/2accSNA

As noted previously, the Dataset API provides a type-safe, object-oriented programming interface. Datasets can take advantage of the Catalyst optimizer by exposing expressions and data fields to the query planner, and of Project Tungsten's fast in-memory encoding. But with DataFrame and Dataset now unified as part of Apache Spark 2.0, DataFrame is now an alias for the untyped Dataset API. More specifically:

DataFrame = Dataset[Row]

Introducing SparkSession

In the past, you would potentially work with SparkConf, SparkContext, SQLContext, and HiveContext to handle configuration, the Spark context, the SQL context, and the Hive context, respectively, when executing your various Spark queries. The SparkSession is essentially the combination of these contexts, including StreamingContext.

For example, instead of writing:

df = sqlContext.read \
    .format('json').load('py/test/sql/people.json')

now you can write:

df = spark.read.format('json').load('py/test/sql/people.json')

or:

df = spark.read.json('py/test/sql/people.json')

The SparkSession is now the entry point for reading data, working with metadata, configuring the session, and managing the cluster resources.
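
For instance, a minimal sketch of creating (or retrieving) a SparkSession with the builder pattern follows; the application name and configuration value below are only placeholders:

from pyspark.sql import SparkSession

# Create a new SparkSession, or retrieve the existing one
spark = SparkSession.builder \
    .appName('learning-pyspark') \
    .getOrCreate()

# Session-scoped configuration and metadata are accessed via the session
spark.conf.set('spark.sql.shuffle.partitions', '8')
print(spark.catalog.listTables())

Note that in interactive shells such as pyspark, a SparkSession named spark is already created for you.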

Tungsten phase 2

The fundamental observation about the computer hardware landscape when the project started was that, while there had been improvements in price per performance for RAM, disk, and (to an extent) network interfaces, the price per performance advancements for CPUs were not the same. Though hardware manufacturers could put more cores in each socket (that is, improve performance through parallelization), there were no significant improvements in actual core speed.

Project Tungsten was introduced in 2015 to make significant changes to the Spark engine with the focus on improving performance. The first phase of these improvements focused on the following facets:

  • Memory Management and Binary Processing: Leveraging application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection

  • Cache-aware computation: Algorithms and data structures to exploit memory hierarchy

  • Code generation: Using code generation to exploit modern compilers and CPUs

The following diagram shows the updated Catalyst engine to denote the inclusion of Datasets. As you can see on the right of the diagram (to the right of the Cost Model), Code Generation is used against the selected physical plans to generate the underlying RDDs:

Source: Structuring Spark: DataFrames, Datasets, and Streaming http://bit.ly/2cJ508x

As part of Tungsten Phase 2, there is the push into whole-stage code generation. That is, the Spark engine will now generate bytecode at compile time for the entire Spark stage instead of just for specific jobs or tasks. The primary facets surrounding these improvements include the following (a short example follows the list):

  • No virtual function dispatches: This reduces multiple CPU calls that can have a profound impact on performance when dispatching billions of times

  • Intermediate data in memory vs CPU registers: Tungsten Phase 2 places intermediate data into CPU registers, which reduces the number of cycles required to obtain the data by an order of magnitude compared with reading it from memory

  • Loop unrolling and SIMD: Optimize Apache Spark's execution engine to take advantage of modern compilers and CPUs' ability to efficiently compile and execute simple for loops (as opposed to complex function call graphs)
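
You can observe these plans from PySpark by asking a query for its physical plan; this is a minimal sketch, and the exact output varies by Spark version (operators fused into a single generated stage are prefixed with an asterisk):

# Inspect the physical plan of a simple aggregation; whole-stage
# code-generated operators appear with a '*' prefix in the plan
df = spark.range(0, 1000000)
df.groupBy((df.id % 10).alias('bucket')).count().explain()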

For a more in-depth review of Project Tungsten, please refer to:

Structured Streaming

As Reynold Xin put it during Spark Summit East 2016:

"The simplest way to perform streaming analytics is not having to reason about streaming."

This is the underlying foundation for building Structured Streaming. While streaming is powerful, one of the key issues is that it can be difficult to build and maintain. While companies such as Uber, Netflix, and Pinterest have Spark Streaming applications running in production, they also have dedicated teams to ensure the systems are highly available.

Note

For a high-level overview of Spark Streaming, please review Spark Streaming: What Is It and Who's Using It? http://bit.ly/1Qb10f6

As implied previously, there are many things that can go wrong when operating Spark Streaming (and any streaming system, for that matter), including (but not limited to) late events, partial outputs to the final data source, state recovery on failure, and/or distributed reads/writes:

Source: A Deep Dive into Structured Streaming http://bit.ly/2aHt1w0

Therefore, to simplify Spark Streaming, there is now a single API that addresses both batch and streaming within the Apache Spark 2.0 release. More succinctly, the high-level streaming API is now built on top of the Apache Spark SQL engine. It runs the same queries as you would with Datasets/DataFrames, providing you with all the performance and optimization benefits as well as features such as event time, windowing, sessions, sources, and sinks.
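
To illustrate, here is a minimal sketch of a Structured Streaming query expressed with the same DataFrame API; the input directory, schema, and console sink are hypothetical choices for demonstration only:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for JSON event files arriving in /tmp/events
schema = StructType([
    StructField('user', StringType()),
    StructField('action', StringType()),
    StructField('timestamp', LongType())])

# readStream/writeStream mirror the familiar batch read/write API
events = spark.readStream.schema(schema).json('/tmp/events')
counts = events.groupBy('action').count()

query = (counts.writeStream
         .outputMode('complete')
         .format('console')
         .start())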

Continuous applications

Altogether, Apache Spark 2.0 not only unified DataFrames and Datasets, but also unified streaming, interactive, and batch queries. This opens up a whole new set of use cases, including the ability to aggregate data in a stream and then serve it using traditional JDBC/ODBC, to change queries at run time, and/or to build and apply ML models across many scenarios with a variety of latency requirements:

Together, you can now build end-to-end continuous applications, in which you can issue the same queries to batch processing as to real-time data, perform ETL, generate reports, and update or track specific data in the stream.
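
As a sketch of this idea, the streaming aggregation from the previous example could be written to an in-memory sink and then queried with ordinary Spark SQL while the stream keeps updating; the table name action_counts is only a placeholder:

# `counts` is the streaming aggregation from the earlier sketch
query = (counts.writeStream
         .outputMode('complete')
         .format('memory')
         .queryName('action_counts')
         .start())

# The same SQL works whether the underlying data is batch or streaming
spark.sql('SELECT action, `count` FROM action_counts ORDER BY `count` DESC').show()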

Note

For more information on continuous applications, please refer to Matei Zaharia's blog post Continuous Applications: Evolving Streaming in Apache Spark 2.0 - A foundation for end-to-end real-time applications http://bit.ly/2aJaSOr.
