Python Data Mining Quick Start Guide

Advanced Topics - Building a Data Processing Pipeline and Deploying It

This chapter will cover the strategy of building a data analysis pipeline and deploying it to run in production on future, incoming data. It will also cover persistent model storage, which is required in order to distribute a model for deployment. This chapter will then cover the specific consequences that Python's interpreted nature has for deployment.

The following topics will be covered in this chapter:

  • Pipelining your analysis
  • Storing a model for deployment
  • Loading a deployed model
  • Python-specific deployment concerns

Pipelining your analysis

A pipelined analysis is a series of steps stored as a single function or object. Beyond providing a framework for your analysis, the most important reason for pipelining becomes apparent when you examine what is required to reproduce your workflow or apply it to new data. Now that you've seen a collection of data mining methods, it's a good time to acknowledge some facts:

  • Most analysis workflows have multiple steps (cleaning, scaling, transforming, clustering, and so on)
  • In order to reproduce the workflow, all of the steps must be performed in exactly the right order
  • Failure to reproduce the steps exactly can result in bad information, often failing silently
  • Humans make mistakes, so we need to guard against those mistakes

The perfect tool for guarding against mistakes is to build a pipeline, test it locally, and deploy the entire pipeline...
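To make this concrete, here is a minimal sketch of such a pipeline built with scikit-learn's Pipeline object. The particular steps (a standard scaler followed by a logistic regression classifier) and the iris dataset are illustrative assumptions, not the book's own end-to-end example:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Each step is a (name, estimator) tuple; fit() runs them in order
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(max_iter=200)),
    ])
    pipe.fit(X_train, y_train)

    # predict()/score() replay the same steps, in the same order, on new data
    print('test accuracy: {:.3f}'.format(pipe.score(X_test, y_test)))

Because every transformation lives inside the single pipe object, there is no way to accidentally skip the scaling step or apply the steps out of order when scoring new data.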

Deploying the model

Often, in a production environment, deployment is the step where you release your model into the wild and let it run on unforeseen data. However, data mining also produces many local analysis workflows that don't necessarily need to be deployed, but do need to be stored and reloaded later in order to reproduce the analysis. Both of these use cases require what is called model persistence; that is, the model needs to be stored and loaded for later use. Python is an object-oriented language, and, appropriately, scikit-learn uses objects for most of its analysis routines. Storing an object is not as simple as storing a text file full of strings; it requires a process called serialization to store the object in a reliable, error-free manner. One of the most popular serialization tools is pickle, a Python core library. It's what we will...
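As a rough sketch of what persistence looks like with pickle, the following assumes the fitted pipe object from the earlier pipeline example; the model.pkl filename is arbitrary:

    import pickle

    # Serialize (persist) the fitted pipeline object to disk
    with open('model.pkl', 'wb') as f:
        pickle.dump(pipe, f)

    # Later -- possibly in a separate deployment environment -- restore it
    with open('model.pkl', 'rb') as f:
        restored = pickle.load(f)

    # The restored object carries the whole workflow, scaling included
    print(restored.predict(X_test[:5]))

Note that you are persisting the entire pipeline, not just the final classifier, so the deployed object reproduces every preprocessing step automatically.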

Python-specific deployment concerns

Python is not a compiled language; it is interpreted at execution time. It is important to remember that, when you follow the steps in this chapter, you are not pickling an executable program. You are simply pickling an object. At load time, the environment must be compatible with the contents of that object. In practice, that usually means matching library versions, as libraries change over time. Also, the default serialization protocol used by pickle in Python 3 cannot be read by Python 2, so you will have to specify an older protocol explicitly if you are moving pickled objects between Python versions.
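For example, if a Python 2 environment must load the file, you can pin the protocol explicitly (again assuming the pipe object from the earlier sketch; protocol 2 is the highest version Python 2 understands):

    import pickle

    # Inspect what this interpreter would use by default
    print('default protocol here:', pickle.DEFAULT_PROTOCOL)

    # Pin protocol 2 so a Python 2 interpreter can unpickle the file
    with open('model_py2.pkl', 'wb') as f:
        pickle.dump(pipe, f, protocol=2)

Keep in mind that protocol compatibility is necessary but not sufficient; as noted above, the library versions in the loading environment must also match.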

Lastly, a pickled object is similar to a ZIP file in that anyone can bundle anything inside it, and you will not know what it contains until you unpickle/unzip it. In fact, unpickling can execute arbitrary code, so never load a pickle from a source you do not trust. Security should always be a concern with any file type that is not transparent.

You should read the main pickle doc page for descriptions of compatibility...

Summary

This chapter covered a strategy for pipelining and deployment using scikit-learn's built-in methods. It also introduced the pickle module for model persistence and storage, as well as Python-specific concerns at deployment time. I encourage you to return to the code from Chapter 2, Basic Terminology and Our End-to-End Example, and build the entire end-to-end example data mining workflow as a scikit-learn pipeline.

There's no substitute for practice, so grab some freely available data sets and solve as many real-world problems as you can find. Try your hand at a few analytics competitions and share your code with a friend for review and discussion. Identify the concepts that are toughest for you, and then hunt down explanations from other instructors or authors to get a different viewpoint on the topic. Don't let yourself off the hook until you fully understand the...
