Datasets - a quick introduction
A Spark Dataset is a distributed collection of strongly typed objects organized into named columns, akin to a spreadsheet or a relational database table. RDDs have always been Spark's basic building blocks, and they still are. But RDDs deal with opaque objects: we might know what the objects are, but the framework doesn't, so compile-time type checking and semantic query optimization are not possible with RDDs. Then came DataFrames, which added schemas (we can associate a schema with an RDD) along with SQL and SQL-like capabilities.
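A minimal sketch of this difference, assuming Spark is on the classpath (the names and sample rows are illustrative): the same pairs look like opaque objects as an RDD, but gain named, typed columns once converted to a DataFrame.

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("rdd-vs-df").getOrCreate()
    import spark.implicits._

    // An RDD of tuples: the framework sees only opaque objects, no column names.
    val rdd = spark.sparkContext.parallelize(Seq(("Alice", 34), ("Bob", 45)))

    // The same data as a DataFrame: now Spark knows the schema and can optimize queries.
    val df = rdd.toDF("name", "age")
    df.printSchema()
    df.filter($"age" > 40).show()  // a SQL-like query, resolved against the schema

    spark.stop()
  }
}
```

Note that the DataFrame query is still only checked at runtime: a typo like `$"agee"` compiles fine and fails when executed, which is the gap Datasets close.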
Spark 1.6 introduced Datasets, and Spark 2.0.0 unified them with DataFrames (a DataFrame is now simply a Dataset of Row). Datasets keep all the original DataFrame APIs and add compile-time type checking (in Scala and Java), making our interfaces richer and more robust. So now we have three mechanisms:
Our preferred mechanism is the semantic-rich Datasets
Our second option is the use of DataFrames, which are untyped views of a Dataset (a Dataset of Row)
For low-level operations, we'll use RDDs as the underlying basic distributed objects
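The three levels above can be sketched in a few lines; this assumes Spark is available, and the `Person` case class and sample data are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// An illustrative domain type; the case class gives the Dataset its schema and its types.
case class Person(name: String, age: Int)

object ThreeMechanisms {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("three-levels").getOrCreate()
    import spark.implicits._

    // 1. Dataset: typed. The compiler checks that p.age exists and is an Int.
    val ds: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
    val adults: Dataset[Person] = ds.filter(p => p.age > 40)

    // 2. DataFrame: the same data viewed untyped, i.e. Dataset[Row].
    val df: DataFrame = ds.toDF()

    // 3. RDD: drop down to the underlying distributed objects for low-level work.
    val names = ds.rdd.map(_.name).collect()
    println(names.mkString(", "))

    spark.stop()
  }
}
```

Moving between the levels is cheap in this direction: `.toDF()` forgets the element type, and `.rdd` exposes the underlying distributed collection, while `as[Person]` would recover the typed view from a DataFrame with a matching schema.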
In short, we should always use the Dataset APIs and abstractions. RDDs...