You're reading from  Learning Predictive Analytics with Python

Product type: Book
Published in: Feb 2016
Reading Level: Intermediate
ISBN-13: 9781783983261
Edition: 1st Edition
Authors (2):
Ashish Kumar

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. His core areas of expertise include Natural Language Processing, IoT Analytics, R Shiny product development, and Ensemble ML methods. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

Chapter 9. Best Practices for Predictive Modelling

As we have seen in the chapters on modelling techniques, a predictive model is nothing but a set of mathematical equations derived using a few lines of code. In essence, this code, together with a slide deck highlighting the high-level results from the model, constitutes a project. However, the users of our solution are more interested in solving the problem they face in the business context. It is the responsibility of the analyst or the data scientist to offer the solution in a way that is user-friendly and maximizes output or insights.

Some general guidelines help achieve the best results in a predictive modelling project. Because predictive modelling combines computer science techniques, algorithms, statistics, and an understanding of the business context, its best practices are the sum of the best practices in each of these fields.

In this chapter, we...

Best practices for coding


When one uses Python for predictive modelling, one needs to write small snippets of code. To get the most out of these snippets and to keep the work reproducible, one should know and aspire to follow the best practices in coding. Some of these best practices are as follows.

Commenting the code

There is a tradeoff between the elegance and the understandability of a code snippet. As a snippet becomes more elegant, it becomes harder for a new user (someone other than its author) to understand. Some users are interested only in the end results, but most like to understand what is going on under the hood and want a good understanding of the code.

For the code snippet to be understandable by a new person or the user of the code, it is common practice to comment the important lines, if not all the lines, and to write headings for the major chunks of the code. Some of the properties of a comment...
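As a rough illustration of this commenting style, the sketch below puts a heading comment above each major chunk and a short comment on each important line. The function name `summarize_sales` and the sample figures are hypothetical, invented only for this example.

```python
# ------------------------------------------------------------
# Section 1: Imports
# ------------------------------------------------------------
from statistics import mean, median


# ------------------------------------------------------------
# Section 2: Helper function (hypothetical, for illustration only)
# ------------------------------------------------------------
def summarize_sales(sales):
    """Return basic summary statistics for a list of sales figures."""
    # Guard against empty input so the caller gets a clear error
    if not sales:
        raise ValueError("sales must contain at least one value")

    # The mean is sensitive to outliers, so the median is reported too
    return {
        "count": len(sales),
        "mean": mean(sales),
        "median": median(sales),
    }


# ------------------------------------------------------------
# Section 3: Usage with made-up numbers
# ------------------------------------------------------------
monthly_sales = [120, 135, 150, 110, 500]  # 500 is a deliberate outlier
summary = summarize_sales(monthly_sales)
print(summary)
```

The section headings let a reader who only wants the end result jump straight to the usage chunk, while the line comments explain why each step exists rather than restating what the code does.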

Best practices for data handling


Data cleaning and manipulation constitute the foundation of any analytics project. To ensure that this important step is executed efficiently, the following best practices should be followed:

  • After importing the dataset, one should ensure that the dataset (all the variables and rows) has been read correctly. This means reading all the variables in their correct or required format. Sometimes, due to limitations in the data or the IDE, some variables are read incorrectly and need to be converted to the correct format.

  • For example, if a variable holds a numerical ID (say, 10 digits long), it will often be read and displayed in scientific notation. This is wrong, because an ID is an identifier rather than a quantity. Sometimes, a variable containing long strings is truncated. These issues should be taken care of before performing any operation on the data.

  • After every data manipulation step such as transposing...
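The import checks described above might look like the following minimal sketch with pandas. The column names and values are made up for illustration, and the inline CSV stands in for a real file. Here the missing ID in the second row is what pushes the default import to a floating-point type.

```python
import io

import pandas as pd

# Made-up CSV data; the second row has a missing customer ID
raw_csv = io.StringIO(
    "customer_id,city,spend\n"
    "9876543210,Chennai,120.5\n"
    ",Delhi,80.0\n"
    "1234567890,Mumbai,95.25\n"
)

# Default import: the missing value forces the ID column to float64,
# so IDs may be shown in scientific notation or with a trailing ".0"
df_default = pd.read_csv(raw_csv)
print(df_default.dtypes)

# Re-read, forcing the ID to be text so it is stored exactly as written
raw_csv.seek(0)
df = pd.read_csv(raw_csv, dtype={"customer_id": str})

# Checks after import: dimensions, column types, and the first rows
print(df.shape)
print(df.dtypes)
print(df.head())
```

Repeating the `shape`, `dtypes`, and `head()` checks after each manipulation step is a cheap way to catch rows or columns that were silently dropped or converted.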

Best practices for algorithms


The choice of which algorithm to deploy to answer a business question depends on a variety of parameters, and there is no single right answer. It generally depends on the nature of the predictor and output variables and on the overarching nature of the business problem at hand: whether it is a numerical prediction, a classification, or an aggregation problem. Based on these preliminary criteria, one can shortlist a few existing methods to apply to the dataset.

Each method will have its own pros and cons, and the final decision should be taken keeping in mind the business context. The decision for the best-suited algorithm is usually taken based on the following two requirements:

  • Sometimes, the user of the result is interested only in the accuracy of the results. In such cases, the algorithm is chosen on the basis of accuracy: all the qualifying models are run and the one with the maximum accuracy is finalized.

  • At other times...
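The accuracy-driven selection described in the first point can be sketched as follows. The labels, model names, and predictions are hypothetical. In a real project the predictions would come from fitted models scored on held-out data rather than being typed in by hand.

```python
# Hypothetical true labels and predictions from three candidate models
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
candidate_predictions = {
    "logistic_regression": [1, 0, 1, 0, 0, 0, 1, 0],
    "decision_tree": [1, 0, 1, 1, 0, 1, 0, 0],
    "naive_baseline": [0, 0, 0, 0, 0, 0, 0, 0],
}


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Score every qualifying model, then keep the one with the highest accuracy
scores = {
    name: accuracy(y_true, preds)
    for name, preds in candidate_predictions.items()
}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Selected:", best_model)
```

Including a naive baseline among the candidates is a design choice of this sketch. It shows how much of each model's accuracy comes from the model itself and how much from the class balance.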

Best practices for statistics


Statistics are an integral part of any predictive modelling assignment. Statistics are important because they help us gauge the efficiency of a model. Each predictive model generates a set of statistics, which suggests how good the model is and how the model can be fine-tuned to perform better. The following is a summary of the most widely reported statistics and their desired values for the predictive models described in this book:

| Algorithm | Statistics/parameters | Desired values |
| --- | --- | --- |
| Linear regression | R², p-values, F-statistic, and Adj. R² | High Adj. R², high F-statistic, and low p-values |
| Logistic regression | Sensitivity, specificity, Area Under the Curve (AUC), and KS statistic | High AUC (proximity to 1) |
| Clustering | Intra-cluster distance and silhouette coefficient | Low intra-cluster distance and high silhouette coefficient (proximity to 1) |
| Decision trees (classification) | AUC and KS statistic | High AUC (proximity to 1) |
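Several of the statistics listed above can be computed directly from model output. The sketch below uses the standard definitions of sensitivity, specificity, and adjusted R². The function names and the sample inputs are hypothetical and chosen only for illustration.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # true positives
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # true negatives
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false positives
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # false negatives
    sensitivity = tp / (tp + fn)  # share of actual positives caught
    specificity = tn / (tn + fp)  # share of actual negatives caught
    return sensitivity, specificity


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjust R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Made-up classification results: 4 positives and 6 negatives
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"Sensitivity: {sens:.3f}, specificity: {spec:.3f}")

# Made-up regression fit: R-squared of 0.80, 50 rows, 5 predictors
adj = adjusted_r2(0.80, n_obs=50, n_predictors=5)
print(f"Adjusted R-squared: {adj:.3f}")
```

Adjusted R² is never larger than R² when the model has at least one predictor, and the gap widens as predictors are added without a matching gain in fit.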

While reporting...

Best practices for business contexts


This is the meatiest part of the report created for a predictive modeling project. Some users of the report will navigate directly to this section as they are primarily interested in the overall effect of the project. Thus, it is imperative to mention the highlights and most important findings of the project in this section. This is different from reporting the statistics, which is in a way the raw output of the predictive model. In this section, we will focus on the following:

  • Findings and insights of the analyses

  • Major problems identified

  • Major results from the model

  • The accuracy or efficiency of the model

  • Action steps for the user to solve the business problem, and so on

If it is a customer segmentation problem, mention the names and characteristics of the segments identified along with the statistical summary for each segment. Recommend a plan to maximize sales and revenue (or whatever the business objective might be) for each of the segments.

If it is a...

Summary


What are the do's and don'ts of a predictive modelling project? This chapter dealt with this pressing question and listed a number of best practices to make a predictive modelling project successful. The following are the important points:

  • Code should be well-commented, modular, version-controlled, generalized, and free of hard-coded values.

  • Data should be observed carefully after every import and manipulation in order to check for any errors that might creep in while performing these operations.

  • The choice of the algorithm is guided by the nature of the predictor and outcome variable. The ultimate selection of the algorithm depends upon whether the user prioritizes accuracy or the understandability of the algorithm.

  • While reporting the results of a predictive model, the optimum values of the important statistics and their relevance should be clearly stated.

  • Main business questions should be clearly answered. Major findings should be reported clearly. Some actionable recommendations...
