Python Data Mining Quick Start Guide

Advanced Topics - Building a Data Processing Pipeline and Deploying It

This chapter will cover the strategy of building a data analysis pipeline and deploying it to run in production on future, incoming data. It will also cover persistent model storage, which is required in order to distribute a model for deployment. This chapter will then cover the specific consequences that Python's interpreted nature has for deployment.

The following topics will be covered in this chapter:

  • Pipelining your analysis
  • Storing a model for deployment
  • Loading a deployed model
  • Python-specific deployment concerns

Pipelining your analysis

A pipelined analysis is a series of steps stored as a single function or object. Beyond providing a framework for your analysis, the most important reason for pipelining becomes apparent when you examine what is required to reproduce your workflow or apply it to new data. Now that you've seen a collection of data mining methods, it's a good time to acknowledge some facts:

  • Most analysis workflows have multiple steps (cleaning, scaling, transforming, clustering, and so on)
  • In order to reproduce the workflow, all of the steps must be performed in exactly the right order
  • Failure to reproduce the steps exactly can result in bad information, often failing silently
  • Humans make mistakes, so we need to guard against those mistakes

The perfect tool for guarding against mistakes is to build a pipeline, test it locally, and deploy the entire pipeline...
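To make this concrete, here is a minimal sketch of such a pipeline built with scikit-learn's Pipeline object. The particular steps (a standard scaler followed by a logistic regression classifier) and the iris dataset are illustrative assumptions, not the book's own end-to-end example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Each step is a (name, estimator) tuple; fit() runs them in order
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(max_iter=200)),
    ])
    pipe.fit(X_train, y_train)

    # predict()/score() replay the same steps, in the same order, on new data
    print('test accuracy: {:.3f}'.format(pipe.score(X_test, y_test)))

Because every transformation lives inside the single pipe object, there is no way to accidentally skip the scaling step or apply the steps out of order when scoring new data.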

Deploying the model

Often, in a production environment, deployment is the step where you release your model into the wild and let it run on unforeseen data. However, data mining also produces many local analysis workflows that don't necessarily need to be deployed, but do need to be stored and reloaded later in order to reproduce the analysis. Both of these use cases require what is called model persistence; that is, the model needs to be stored and loaded for later use. Python is an object-oriented language, and, appropriately, scikit-learn uses objects for most of its analysis routines. Storing an object is not as simple as storing a text file full of strings; it requires a process called serialization to store the object in a reliable, error-free manner. One of the most popular serialization tools is pickle, a Python core library. It's what we will...
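As a rough sketch of what persistence looks like with pickle, the following assumes the fitted pipe object from the earlier pipeline example; the model.pkl filename is arbitrary:

    import pickle

    # Serialize (persist) the fitted pipeline object to disk
    with open('model.pkl', 'wb') as f:
        pickle.dump(pipe, f)

    # Later -- possibly in a separate deployment environment -- restore it
    with open('model.pkl', 'rb') as f:
        restored = pickle.load(f)

    # The restored object carries the whole workflow, scaling included
    print(restored.predict(X_test[:5]))

Note that you are persisting the entire pipeline, not just the final classifier, so the deployed object reproduces every preprocessing step automatically.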

Python-specific deployment concerns

Python is not a compiled language; it is interpreted at execution time. It is important to remember that, when you follow the steps in this chapter, you are not pickling an executable program. You are simply pickling an object. At load time, the environment must be compatible with the contents of that object. In practice, that usually means matching library versions, as libraries change over time. Also, the default serialization protocol used by pickle in Python 3 cannot be read by Python 2, so you will have to specify an older protocol explicitly if you are moving pickled objects between Python versions.
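For example, if a Python 2 environment must load the file, you can pin the protocol explicitly (again assuming the pipe object from the earlier sketch; protocol 2 is the highest version Python 2 understands):

    import pickle

    # Inspect what this interpreter would use by default
    print('default protocol here:', pickle.DEFAULT_PROTOCOL)

    # Pin protocol 2 so a Python 2 interpreter can unpickle the file
    with open('model_py2.pkl', 'wb') as f:
        pickle.dump(pipe, f, protocol=2)

Keep in mind that protocol compatibility is necessary but not sufficient; as noted above, the library versions in the loading environment must also match.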

Lastly, a pickled object is similar to a ZIP file in that anyone can bundle anything inside it, and you will not know what it contains until you unpickle/unzip it. In fact, unpickling can execute arbitrary code, so never load a pickle from a source you do not trust. Security should always be a concern with any file type that is not transparent.

You should read the main pickle doc page for descriptions of compatibility...

Summary

This chapter covered a strategy for pipelining and deployment using scikit-learn's built-in methods. It also introduced the pickle module for model persistence and storage, as well as Python-specific concerns at deployment time. I encourage you to return to the code from Chapter 2, Basic Terminology and Our End-to-End Example, and build the entire end-to-end example data mining workflow as a scikit-learn pipeline.

There's no substitute for practice, so grab some freely available data sets and solve as many real-world problems as you can find. Try your hand at a few analytics competitions and share your code with a friend for review and discussion. Identify the concepts that are toughest for you, and then hunt down explanations from other instructors or authors to get a different viewpoint on the topic. Don't let yourself off the hook until you fully understand the...
