Interestingly, as I was writing this chapter, Michael Armbrust from Databricks wrote a blog post about the Data Sources API and presented an architecture diagram; this is what inspired me to create the following diagram:
The bottom layer is a flexible data access layer (and store) that works with multiple formats, usually over a distributed filesystem such as HDFS. The computation layer is where we leverage Spark's distributed, at-scale processing engine, including for streaming data; this layer usually operates on RDDs. The Dataset/DataFrame layer provides the API layer. Spark SQL then overlays the Dataset/DataFrame layer and provides data access for applications, dashboards, BI tools, and so forth. There is a huge amount of SQL knowledge among people in roles ranging from data analysts and programmers to data engineers, who have developed interesting SQL queries over their data. Spark needs to leverage this knowledge of SQL queries...