Learn Python by Building Data Science Applications

Data Pipelines with Luigi

Until now, we have been writing code as separate notebooks and scripts. In the previous chapter, we learned how to group those scripts into a package so that it can be distributed and tested properly. In many cases, however, we need to execute certain tasks on a strict schedule. Often, we need to process data regularly: pull analytics, collect information from external sources, or re-train an ML model. All of this is prone to errors: tasks may depend on one another, and some shouldn't run before others have finished. It is important that tasks are easy to orchestrate, monitor, and re-run.

In this chapter, we will learn to build and orchestrate our own data pipelines. Building good pipelines is an important skill that can save tons of time and stress for anyone who masters it.

In particular, we will cover the following topics...

Technical requirements

For this chapter, we'll use a package called luigi. The last few tasks will also require two more packages: boto3 and sqlalchemy. We will also use the wikiwwii package we built in Chapter 15, Packaging and Testing with Poetry and PyTest. You can build it yourself by following that chapter or install it by running this:

pip install git+https://github.com/Casyfill/wikiwwii.git

All of the code is in the repository, in the Chapter16 folder (https://github.com/PacktPublishing/Learn-Python-by-Building-Data-Science-Applications).
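If you don't have the remaining dependencies yet, all three are available from PyPI and can be installed the usual way:

pip install luigi boto3 sqlalchemy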

Introducing the ETL pipeline

Data pipelines are important and ubiquitous. Even organizations with a small online presence run their own data jobs: thousands of research facilities, meteorological centers, observatories, hospitals, military bases, and banks all run internal data processing.

Another name for data pipelines is ETL, which stands for Extract, Transform, and Load: the three conceptual pieces of each pipeline. At first glance, the task may sound trivial. Most of our notebooks are, in a way, ETL jobs: we load some data, work with it, and then store it somewhere. However, building and maintaining a good pipeline requires a thorough and consistent approach. Processes should be reliable, easy to re-run, and reusable. A particular task shouldn't run more than once, nor should it run if its dependencies are not satisfied (say, if other tasks haven't finished yet).
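To make those three pieces concrete, here is a minimal, dependency-free sketch of an ETL job as three plain functions; the file names and fields are purely illustrative, not from the book's code:

# a minimal ETL sketch; file names and fields are purely illustrative
import json
import csv

def extract(path):
    """Pull raw records from a source (here, a JSON file)."""
    with open(path) as f:
        return json.load(f)

def transform(records):
    """Normalize the fields we care about."""
    return [{"name": r["name"].strip(), "value": float(r["value"])}
            for r in records]

def load(rows, path):
    """Store the result somewhere durable (here, a CSV file)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "value"])
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("raw.json")), "clean.csv")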

It...

Building our first task in Luigi

Luckily, luigi allows us to start small. We'll start by building a task that pulls all of the links on the battles, using the code from our wikiwwii package. First, we will import all we need in a separate file, luigi_fronts.py:

# luigi_fronts.py
from pathlib import Path
import json

import luigi
from wikiwwii.collect.battles import collect_fronts
URL = 'https://en.wikipedia.org/wiki/List_of_World_War_II_battles'
folder = Path(__file__).parents[1] / 'data'

Here, we declared a link for the battles, imported our collect_fronts function, and specified a relative folder to store the data in. Now, let's write the task itself. In the following, we'll create a task class, define the URL as a luigi parameter with a default value (more on that later), and add (or, rather, override) two methods—output, which returns a...
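The excerpt truncates here; for orientation, a minimal version of such a task could look like the following sketch. The class name, the output filename, and the exact collect_fronts call are our assumptions, not the book's exact code:

# luigi_fronts.py (continued) - a hedged sketch, not the book's exact code
class ScrapeFronts(luigi.Task):
    # the URL as a luigi parameter with a default value
    url = luigi.Parameter(default=URL)

    def output(self):
        # target file; luigi checks its existence to decide if the task is done
        return luigi.LocalTarget(str(folder / 'fronts.json'))

    def run(self):
        # collect_fronts is assumed to return JSON-serializable data
        data = collect_fronts(self.url)
        with self.output().open('w') as f:
            json.dump(data, f)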

Understanding time-based tasks

Pipelines are especially useful for scheduled data collection, for example, downloading new data every night.

Say we want to collect new data on 311 calls in NYC for the previous day, every morning. First, let's write the pulling function itself. The code is fairly trivial. You can take a look at the API documentation of Socrata (the data-sharing platform New York uses) at https://dev.socrata.com/consumers/getting-started.html. The only tricky part is that the dataset can be large, and Socrata won't give us more than 50,000 rows at once. Hence, if the length of the response is exactly 50,000 rows, the data was most likely capped, and we'll need to make another pull with an offset, over and over, until the number of rows is smaller. The resource argument represents a unique ID of the dataset—you can obtain it from...
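A sketch of such a pulling function follows; it assumes Socrata's standard $where/$limit/$offset query parameters and the created_date column of the NYC 311 dataset, and the function name and recursive paging are ours:

# sketch of a paginated Socrata pull; function name and structure are ours
import requests

def get_data(resource, date, offset=0):
    """Pull one day of records from a Socrata resource, 50,000 rows at a time."""
    url = f"https://data.cityofnewyork.us/resource/{resource}.json"
    params = {
        "$where": f"created_date between '{date}T00:00:00' and '{date}T23:59:59'",
        "$limit": 50_000,
        "$offset": offset,
    }
    r = requests.get(url, params=params)
    r.raise_for_status()
    data = r.json()
    if len(data) == 50_000:  # response was likely capped; pull the next page
        data.extend(get_data(resource, date, offset + 50_000))
    return data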

Exploring the different output formats

In the code of the Scheduling with cron section, we used local targets, writing to the filesystem of our computer. In a real-world scenario, that will rarely suffice: you'll probably be writing either to a database or to a file stored in the cloud. In fact, we highly encourage you to write tasks to the cloud (for example, to S3 buckets) from the get-go, if there is no reason not to. Luigi supports FTP, S3, Azure Blobs, Google Cloud, Spark, MongoDB, SQL databases, and many more. The only remaining work is to create those resources and set up credentials to access them. The best part is that, for many of them, the interface is very similar, so it is easy to swap targets for existing tasks by changing only a few lines of code, as the following sketch shows.
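For example, moving a task's output from the local filesystem to S3 often comes down to changing a single method; the task name, bucket, and key below are placeholders:

# swapping a local target for S3 usually only touches output();
# the bucket name and key are placeholders
import luigi
from luigi.contrib.s3 import S3Target

class DailyReport(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # before: return luigi.LocalTarget(f"data/{self.date:%Y-%m-%d}.csv")
        return S3Target(f"s3://my-bucket/data/{self.date:%Y-%m-%d}.csv")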

Writing to an S3 bucket...

Expanding Luigi with custom template classes

In the previous section, we used the CopyToTable class as the template instead of luigi.Task. In fact, this is a good pattern to use! If there is any custom configuration or code you can use from one task to another, feel free to create a custom task class of your own. For example, in our practice, we use a custom S3Task class, similar to the one that follows:

import luigi
from luigi.contrib.s3 import S3Client, S3Target
import pandas as pd
from io import StringIO, BytesIO


class S3Task(luigi.Task):
    client = S3Client()  # one shared client for all inheriting tasks

    def _upload_csv(self, df, path):
        # serialize a dataframe to CSV in memory and push it to S3
        content = df.to_csv(float_format="%.3f", index=None)
        self.client.put_string(
            content=content, destination_s3_path=path,
            ContentType="text/csv"
        )

    def _upload_binary(self, df, path):
        format_ = path.split(".")[-1]
        ...
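A concrete task can then inherit from the template and reuse its helpers. A hypothetical subclass might look like this; the task name, bucket, and data are made up:

# hypothetical subclass of the template; names, bucket, and data are made up
class SalesReport(S3Task):
    date = luigi.DateParameter()

    def output(self):
        return S3Target(f"s3://my-bucket/reports/{self.date:%Y-%m-%d}.csv")

    def run(self):
        df = pd.DataFrame({"sales": [100.0, 250.5]})  # stand-in for real data
        self._upload_csv(df, self.output().path)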

Summary

In this chapter, we learned how to form our code into production-level data pipelines that can be scheduled and re-run on demand. Building good pipelines is an important skill: it keeps your data up to date and lets you focus on the business logic (for example, parsing the information), rather than running and re-running pipeline scripts or reinventing the wheel with a homegrown solution. A reliable, robust pipeline is also a good way to deploy and schedule your code as a deliverable. In the later part of this chapter, we learned about the different output formats and custom template classes in luigi.

In the next chapter, we'll build on top of the pipeline we set up. We will use the data we collected to build a couple of interactive dashboards, allowing us to monitor the process and analyze ongoing trends in the data.

Questions

  1. What are the benefits of writing tasks rather than using simple scripts?
  2. What is the base element of Luigi jobs?
  3. How are DAGs defined in Luigi? What are the benefits of that architecture?
  4. How can we parametrize a task?
  5. What is the best way to run time-based tasks in bulk?
  6. How can we schedule a job with Luigi?

Further reading
