Scaling data pipelines in cloud environments
The amount of data that a pipeline implemented as a single Python script, like the one presented in the previous sections, can handle is usually limited by the available memory (RAM). In addition, this design relies entirely on a single computer to execute the entire pipeline.
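One common way to work around the memory limit, while still keeping everything in a single script, is to stream the input instead of loading it all at once. The sketch below is a hypothetical illustration (the file layout and the `value` column are assumptions, not taken from the previous sections): it aggregates a CSV file row by row, so memory use stays roughly constant regardless of file size.

```python
import csv

def aggregate_streaming(path):
    """Sum a 'value' column without loading the whole file into RAM.

    The CSV is read one row at a time, so only the current row and the
    running total are kept in memory. The column name 'value' is a
    hypothetical example.
    """
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["value"])
    return total
```

This keeps the single-script simplicity, but it only postpones the problem: the script is still bound to one machine's CPU and disk, which motivates the distributed alternatives discussed next.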
This type of solution emphasizes simplicity – a single Python script running the whole data pipeline on a single computer – and there are circumstances where that is preferable to an architecture in which execution is distributed.
In many business scenarios involving big-data sources or massive concurrent data streams, you may need to scale even further, and an architecture featuring a data pipeline orchestrator can be a viable alternative. Several frameworks and control systems exist in the market to define, implement, deploy, and observe pipelines...
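At their core, such orchestrators model a pipeline as a directed acyclic graph (DAG) of tasks and execute each task only after its upstream dependencies have completed. The sketch below illustrates that idea with the standard library's `graphlib` module; the task names and dependency structure are hypothetical examples, not a real orchestrator's API.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order, mimicking what an orchestrator does.

    tasks: dict mapping task name -> zero-argument callable.
    dependencies: dict mapping task name -> set of upstream task names.
    Both structures are illustrative; a real orchestrator adds scheduling,
    retries, distribution across workers, and monitoring on top of this.
    """
    # static_order() yields task names so that every task appears
    # after all of its dependencies.
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = tasks[name]()
    return results

# Hypothetical three-stage pipeline: extract -> transform -> load.
tasks = {
    "extract": lambda: "raw data",
    "transform": lambda: "clean data",
    "load": lambda: "loaded",
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}
```

Production orchestrators expose essentially this abstraction, while also distributing task execution across many machines, which removes the single-computer bottleneck described above.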