Understanding data pipelines
A data pipeline is a mechanism that handles three kinds of operations on data: extraction, transformation, and loading. Extraction is the process of obtaining raw data, that is, data that has not been processed previously, from a source. Imagine extracting images from a camera, recordings from sensors in the wild, or text from comments on a website.
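As a minimal sketch, extraction of the website-comments example might look like the following. The directory layout, the newline-delimited JSON format, and the field names are assumptions for illustration, not a fixed API; a real source could just as well be a camera feed or a sensor stream.

```python
import json
from pathlib import Path
from typing import Iterator

def extract_comments(raw_dir: str) -> Iterator[dict]:
    """Yield raw comment records from newline-delimited JSON dumps.

    The *.jsonl layout under `raw_dir` is a hypothetical source chosen
    for this sketch; extraction only reads, it does not reshape data.
    """
    for path in Path(raw_dir).glob("*.jsonl"):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():  # skip blank lines in the raw dump
                    yield json.loads(line)
```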
The transformation process takes data points and applies functions to clean (filter), enrich, validate, change, or project the original raw data so that business applications can process it. Transformations can run sequentially or in parallel, and may involve long, complex computations or small incremental steps. Most importantly, transformations must handle edge cases and exceptions consistently, because raw data is usually messy.
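Continuing the comments example, a transformation sketch under the same assumed field names might clean, validate, and enrich each record, returning None for records that fail validation so that the messy cases are handled in one consistent place:

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator

def transform_comment(raw: dict) -> dict | None:
    """Clean, validate, and enrich one raw comment record.

    Field names ("text", "created") are illustrative assumptions.
    """
    text = (raw.get("text") or "").strip()
    if not text:  # edge case: missing or empty comment body
        return None
    try:
        ts = datetime.fromtimestamp(float(raw["created"]), tz=timezone.utc)
    except (KeyError, TypeError, ValueError):
        return None  # edge case: missing or malformed timestamp
    return {
        "text": text,
        "created_at": ts.isoformat(),      # normalize to ISO 8601
        "word_count": len(text.split()),   # enrichment: derived field
    }

def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Apply the transformation, silently dropping invalid records."""
    return (t for r in records if (t := transform_comment(r)) is not None)
```

Returning None rather than raising keeps the edge-case policy in a single function, so every invalid record is treated the same way regardless of where it appears in the stream.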
Load operations, the third type of operation common in data pipelines, involve a series of steps to connect to a destination system, such as a database, a data warehouse, or a file store, and persist the transformed data there.
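A load sketch that closes the loop might persist the transformed records into SQLite, chosen here only because it ships with Python; a production pipeline would typically target a warehouse and add retries and idempotency:

```python
import sqlite3
from typing import Iterable

def load_comments(records: Iterable[dict], db_path: str = "comments.db") -> int:
    """Persist transformed comment records into a SQLite table.

    The table schema mirrors the illustrative fields produced by the
    transformation sketch above. Returns the number of rows written.
    """
    with sqlite3.connect(db_path) as conn:  # commits on success
        conn.execute(
            """CREATE TABLE IF NOT EXISTS comments (
                   text TEXT NOT NULL,
                   created_at TEXT NOT NULL,
                   word_count INTEGER NOT NULL
               )"""
        )
        rows = [(r["text"], r["created_at"], r["word_count"]) for r in records]
        conn.executemany(
            "INSERT INTO comments (text, created_at, word_count) VALUES (?, ?, ?)",
            rows,
        )
        return len(rows)
```

Chained together, the three sketches form a complete, if minimal, pipeline: `load_comments(transform(extract_comments("raw/")))`.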