Curating data in stages for analytics
Raw data has to be wrangled and transformed before it is consumable and ready for analytics. Each data persona may look at different aspects or features of the data, but there is no reason for each of them to rerun the same cleansing functions. If they did, each would end up with its own copy of the data and spend unnecessary processing cycles, which is both time-consuming and expensive. This is where a good data catalog and design blueprints help: they maintain discipline, make reusable components discoverable, and prevent redundant work. We have already looked at the medallion architecture; its bronze, silver, and gold zones are where data is forged and made usable.
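To make the idea of cleansing once and reusing the result concrete, here is a minimal sketch of the three zones. It uses plain Python lists and dictionaries as a stand-in for Spark tables, and the record layout (order_id, region, amount) and the function names to_silver and to_gold are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Bronze: hypothetical raw records, kept exactly as they landed from the source.
bronze = [
    {"order_id": "1", "region": " east ", "amount": "10.50"},
    {"order_id": "2", "region": "WEST", "amount": "7.25"},
    {"order_id": "2", "region": "WEST", "amount": "7.25"},  # duplicate record
    {"order_id": "3", "region": "east", "amount": "n/a"},   # unparseable amount
    {"order_id": "4", "region": "East", "amount": "4.50"},
]


def to_silver(rows):
    """Cleanse once: cast types, normalise casing, drop bad rows and duplicates."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        order_id = int(row["order_id"])
        if order_id in seen_ids:
            continue  # skip duplicates of an order we already kept
        seen_ids.add(order_id)
        cleaned.append(
            {
                "order_id": order_id,
                "region": row["region"].strip().lower(),
                "amount": amount,
            }
        )
    return cleaned


def to_gold(rows):
    """Aggregate the cleansed rows into a consumer-ready view: total per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)  # shared, cleansed layer every persona can reuse
gold = to_gold(silver)      # one of possibly many business-level views
```

Every persona reads from the same silver output instead of repeating the cleansing logic, and each gold view is a cheap aggregation on top of it. In a real pipeline each zone would typically be persisted as its own table rather than held in memory.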
RDD, DataFrames, and datasets
This is a good time to refresh the concepts around RDDs, DataFrames, and datasets. RDD stands for Resilient Distributed Dataset and is the original low-level construct of Spark. RDDs have to be optimized by the developer at each stage, and they cannot infer a schema. DataFrames...