Learn Python by Building Data Science Applications

Data Pipelines with Luigi

Until now, we have been writing code as separate notebooks and scripts. In the previous chapter, we learned how to group those scripts into a package so that it can be distributed and tested properly. In many cases, however, we need to execute certain tasks on a strict schedule. Often, we need to process data regularly: pull analytics, collect information from external sources, or re-train an ML model. All of this is prone to errors: tasks may depend on one another, and some shouldn't run before others have finished. It is important that tasks are easy to orchestrate, monitor, and re-run.

In this chapter, we will learn to build and orchestrate our own data pipelines. Building good pipelines is an important skill that can save tons of time and stress for anyone who masters it.

In particular, we will cover the following topics...

Technical requirements

For this chapter, we'll use a package called luigi. The last few tasks will also require two more packages: boto3 and sqlalchemy. We will also use the wikiwwii package we built in Chapter 15, Packaging and Testing with Poetry and PyTest. You can build it yourself by following that chapter or install it by running this:

pip install git+https://github.com/Casyfill/wikiwwii.git

All of the code is in the repository, in the Chapter16 folder (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications).
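If you don't have the remaining dependencies yet, all three are available from PyPI and can be installed the usual way:

pip install luigi boto3 sqlalchemy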

Introducing the ETL pipeline

Data pipelines are important and ubiquitous. Even organizations with a small online presence run their own data jobs: thousands of research facilities, meteorological centers, observatories, hospitals, military bases, and banks all run internal data processing.

Another name for data pipelines is ETL, which stands for Extract, Transform, and Load: the three conceptual pieces of each pipeline. At first glance, the task may sound trivial. Most of our notebooks are, in a way, ETL jobs: we load some data, work with it, and then store it somewhere. However, building and maintaining a good pipeline requires a thorough and consistent approach. Processes should be reliable, easy to re-run, and reusable. A particular task shouldn't run more than once, nor should it run if its dependencies are not satisfied (say, if other tasks haven't finished yet).
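To make those three pieces concrete, here is a minimal, dependency-free sketch of an ETL job as three plain functions; the file names and fields are purely illustrative, not from the book's code:

# a minimal ETL sketch; file names and fields are purely illustrative
import json
import csv

def extract(path):
    """Pull raw records from a source (here, a JSON file)."""
    with open(path) as f:
        return json.load(f)

def transform(records):
    """Normalize the fields we care about."""
    return [{"name": r["name"].strip(), "value": float(r["value"])}
            for r in records]

def load(rows, path):
    """Store the result somewhere durable (here, a CSV file)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "value"])
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("raw.json")), "clean.csv")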

It...

Building our first task in Luigi

Luckily, luigi allows us to start small. We'll start by building a task that pulls all of the links on the battles, using the code from our wikiwwii package. First, we will import all we need in a separate file, luigi_fronts.py:

# luigi_fronts.py
from pathlib import Path
import json

import luigi
from wikiwwii.collect.battles import collect_fronts
URL = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
folder = Path(__file__).parents[1] / 'data'

Here, we declared a link for the battles, imported our collect_fronts function, and specified a relative folder to store the data in. Now, let's write the task itself. In the following, we'll create a task class, define the URL as a luigi parameter with a default value (more on that later), and add (or, rather, override) two methods—output, which returns a...
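The excerpt truncates here; for orientation, a minimal version of such a task could look like the following sketch. The class name, the output filename, and the exact collect_fronts call are our assumptions, not the book's exact code:

# luigi_fronts.py (continued) - a hedged sketch, not the book's exact code
class ScrapeFronts(luigi.Task):
    # the URL as a luigi parameter with a default value
    url = luigi.Parameter(default=URL)

    def output(self):
        # target file; luigi checks its existence to decide if the task is done
        return luigi.LocalTarget(str(folder / 'fronts.json'))

    def run(self):
        # collect_fronts is assumed to return JSON-serializable data
        data = collect_fronts(self.url)
        with self.output().open('w') as f:
            json.dump(data, f)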

Understanding time-based tasks

Pipelines are especially useful for scheduled data collection, for example, downloading new data every night.

Say we want to collect new data on 311 calls in NYC for the previous day, every morning. First, let's write the pulling function itself. The code is fairly trivial. You can take a look at the API documentation of Socrata (the data-sharing platform New York uses) at https://dev.socrata.com/consumers/getting-started.html. The only tricky part is that the dataset can be large, and Socrata won't give us more than 50,000 rows at once. Hence, if the length of the response is exactly 50,000 rows, the data was most likely capped, and we'll need to make another pull with an offset, over and over, until the number of rows is smaller. The resource argument represents a unique ID of the dataset—you can obtain it from...
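A sketch of such a pulling function follows; it assumes Socrata's standard $where/$limit/$offset query parameters and the created_date column of the NYC 311 dataset, and the function name and recursive paging are ours:

# sketch of a paginated Socrata pull; function name and structure are ours
import requests

def get_data(resource, date, offset=0):
    """Pull one day of records from a Socrata resource, 50,000 rows at a time."""
    url = f"https://data.cityofnewyork.us/resource/{resource}.json"
    params = {
        "$where": f"created_date between '{date}T00:00:00' and '{date}T23:59:59'",
        "$limit": 50_000,
        "$offset": offset,
    }
    r = requests.get(url, params=params)
    r.raise_for_status()
    data = r.json()
    if len(data) == 50_000:  # response was likely capped; pull the next page
        data.extend(get_data(resource, date, offset + 50_000))
    return data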

Exploring the different output formats

In the code of the Scheduling with cron section, we used local targets, writing to the filesystem of our computer. In a real-world scenario, that will rarely suffice: you'll probably be writing either to a database or to a file stored in the cloud. In fact, we highly encourage you to write tasks to the cloud (for example, to S3 buckets) from the get-go, if there is no reason not to. Luigi supports FTP, S3, Azure Blobs, Google Cloud, Spark, MongoDB, SQL databases, and many more. The only remaining work is to create those resources and set up credentials to access them. The best part is that, for many of them, the interface is very similar, so it is easy to swap targets for existing tasks by changing only a few lines of code, as the following sketch shows.
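For example, moving a task's output from the local filesystem to S3 often comes down to changing a single method; the task name, bucket, and key below are placeholders:

# swapping a local target for S3 usually only touches output();
# the bucket name and key are placeholders
import luigi
from luigi.contrib.s3 import S3Target

class DailyReport(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # before: return luigi.LocalTarget(f"data/{self.date:%Y-%m-%d}.csv")
        return S3Target(f"s3://my-bucket/data/{self.date:%Y-%m-%d}.csv")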

Writing to an S3 bucket...

Expanding Luigi with custom template classes

In the previous section, we used the CopyToTable class as the template instead of luigi.Task. In fact, this is a good pattern to use! If there is any custom configuration or code you can use from one task to another, feel free to create a custom task class of your own. For example, in our practice, we use a custom S3Task class, similar to the one that follows:

import luigi
from luigi.contrib.s3 import S3Client, S3Target
import pandas as pd
from io import StringIO, BytesIO


class S3Task(luigi.Task):
    client = S3Client()  # one shared client for all inheriting tasks

    def _upload_csv(self, df, path):
        # serialize a dataframe to CSV in memory and push it to S3
        content = df.to_csv(float_format="%.3f", index=None)
        self.client.put_string(
            content=content, destination_s3_path=path,
            ContentType="text/csv"
        )

    def _upload_binary(self, df, path):
        format_ = path.split(".")[-1]
        ...
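A concrete task can then inherit from the template and reuse its helpers. A hypothetical subclass might look like this; the task name, bucket, and data are made up:

# hypothetical subclass of the template; names, bucket, and data are made up
class SalesReport(S3Task):
    date = luigi.DateParameter()

    def output(self):
        return S3Target(f"s3://my-bucket/reports/{self.date:%Y-%m-%d}.csv")

    def run(self):
        df = pd.DataFrame({"sales": [100.0, 250.5]})  # stand-in for real data
        self._upload_csv(df, self.output().path)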

Summary

In this chapter, we learned how to form our code into production-level data pipelines that can be scheduled and re-run on demand. Building good pipelines is an important skill: it keeps your data up to date and lets you focus on the business logic (for example, parsing the information), rather than running and re-running pipeline scripts or reinventing the wheel with a homegrown solution. A reliable, robust pipeline is also a good way to deploy and schedule your code as a deliverable. In the later part of this chapter, we learned about the different output formats and custom template classes in luigi.

In the next chapter, we'll build on top of the pipeline we set up. We will use the data we collected to build a couple of interactive dashboards, allowing us to monitor the process and analyze ongoing trends in the data.

Questions

  1. What are the benefits of writing tasks rather than using simple scripts?
  2. What is the base element of Luigi jobs?
  3. How are DAGs defined in Luigi? What are the benefits of that architecture?
  4. How can we parametrize a task?
  5. What is the best way to run time-based tasks in bulk?
  6. How can we schedule a job with Luigi?

Further reading
