The end-to-end pipeline
Before we write any code, we need to consider the following:
- Data is loaded daily, and our process must run after that load completes. Do we schedule our processing, or trigger it when the load finishes?
- We will need a way to specify the date we’ll be processing
- We need to consider the possibility of having to reprocess a date if there is an issue with the original dataset
- Do we need to write a mechanism to process more than one day?
- Do we need to write a mechanism to reprocess the whole dataset?
- What data quality rules should be put in place?
- How and where are we going to transform our data?
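Several of the points above (specifying a processing date, reprocessing a single day, and backfilling a range) boil down to turning a date request into the list of days the job should run over. As a minimal sketch, assuming a hypothetical `datesToProcess` helper (not part of the application described here), this could look like:

```scala
import java.time.LocalDate

object ProcessingDates {
  // Hypothetical helper: expand a processing request into the dates to run.
  // A single-day run (or a reprocess of one day) passes the same date twice;
  // a backfill of the whole dataset passes its first and last dates.
  def datesToProcess(start: LocalDate, end: LocalDate): Seq[LocalDate] = {
    require(!end.isBefore(start), s"end date $end is before start date $start")
    Iterator.iterate(start)(_.plusDays(1)).takeWhile(!_.isAfter(end)).toSeq
  }
}
```

For example, `datesToProcess(LocalDate.parse("2023-01-05"), LocalDate.parse("2023-01-05"))` yields one day for a routine daily run, while a wider range drives a backfill loop. The start and end dates themselves would come in as application arguments, which keeps the scheduling question separate from the processing logic.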
Here’s a high-level overview of how we can structure our Spark/Scala application to meet these requirements: