You're reading from Practical Predictive Analytics

Product type: Book

Published in: Jun 2017

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781785886188

Edition: 1st Edition

Tools: Splunk

Concepts: Predictive Analytics

Author: Ralph Winters

Chapter 2. The Modeling Process

Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.

-George Edward Pelham Box

Today, we are at a juncture in which many different types of skill sets are needed to participate in predictive analytics projects. Once, this was the pure domain of statisticians, programmers, and business analysts. Now, the roles have expanded to include visualization experts, data storage experts, and other types of specialists. Yet many practitioners are unfamiliar with how predictive analytics projects can be structured. Several factors contribute to this lack of structure. Often there is a lack of understanding of the critical parts of a business problem, and a model is developed much too early. Alternatively, a formal methodology may be put off to the future, in favor of a quick solution.

In this chapter, we will start by discussing the advantages of using structured analytics methodologies. Methodologies...

Advantages of a structured approach

Analytic projects have many components, and that is where a structured methodology can help. Many benefits can be gained when structure is placed upon discovery and analysis, rather than only on pure model building. The discovery and insight gained will often be used well beyond the original intent of the problem.

We assume that the quick-thinking "hare brain" will beat out the slower intuition of the "tortoise mind." However, now research in cognitive science is changing this understanding of the human mind. It suggests that patience and confusion--rather than rigor and certainty--are the essential precursors of wisdom.

-Guy Claxton

Ways in which structured methodologies can help

Here are several points to bear in mind concerning the advantages of structured methodologies:

Data is coming at us fast and furious. We need to keep track of the many data sources, evaluate which ones are the best ones to use at any given time and continually monitor...

Analytic process methodologies

Several analytic process methodologies are currently practiced; however, I will discuss only two longstanding ones, CRISP-DM and SEMMA, which can help you organize your journey from problem definition to insight.

CRISP-DM and SEMMA

Cross-Industry Standard Process for Data Mining (CRISP-DM) and Sample, Explore, Modify, Model, and Assess (SEMMA) are two standard data mining methodologies that have been used for many years and describe a general approach to implementing analytical projects. There is a good deal of overlap between the two, even though the names for each step are different. All of the listed steps are important to the success of a predictive analytics project. However, it is not necessary to follow these steps in exact order. The concepts are best read as a summary of best practices. It helps to be aware of the importance of each of these steps...

An analytics methodology outline: specific steps

This section will look at each of the analytics methodology steps individually. I will use CRISP-DM as the template, because it covers model deployment, and we have already mentioned the benefits of sampling (which is the first step in SEMMA).

Step 1: business understanding

Many predictive modelers assume that the actual modeling phase is where the most insightful model development takes place. However, much of the groundwork and insight can be discovered early on, and a good understanding of business objectives can avoid pitfalls later on.

Communicating business goals: the feedback loop

I must admit, business people and technical people could be better at communicating with each other. How business goals are communicated can run the gamut. It can be anything from a business partner stating, "Tell me how sales need to be increased" to "Tell me something I don't know."

So, it really starts with understanding what the specific business objectives are...

Step 2: data understanding

Once an objective is established and data sources have been identified, you can begin looking at the data in order to understand how each data element behaves individually, as well as how it interacts in combination with other variables. But even before you start looking at the values of variables, it is important to understand the different levels of measurement and the kinds of analyses you can perform with them.

Levels of measurement

Levels of measurement is a system for classifying data into four categories: nominal, ordinal, interval, and ratio. It is an important aspect of a project's or study's metadata.

Levels of measurement are important in the world of predictive analytics, since they will often dictate which algorithms or techniques can be applied. For example, k-means clustering does not work directly with nominal data, because it relies on distances between numeric values, and logistic regression cannot use ratio data as its target...
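As a rough illustration of how the level of measurement constrains analysis, the sketch below lists, for each level, the summary statistics that are commonly treated as meaningful for it, with each level inheriting those of the levels below. The names `LEVELS`, `NEW_AT_LEVEL`, and `permissible_statistics` are hypothetical, and the lists are a simplified teaching aid rather than an exhaustive rule.

```python
# Levels of measurement, ordered from least to most informative.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that first become meaningful at each level (simplified).
NEW_AT_LEVEL = {
    "nominal": ["mode", "frequency counts"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "standard deviation"],
    "ratio": ["geometric mean", "coefficient of variation"],
}


def permissible_statistics(level):
    """Return the statistics meaningful at `level`, including those
    inherited from every lower level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level of measurement: {level!r}")
    allowed = []
    for name in LEVELS[: LEVELS.index(level) + 1]:
        allowed.extend(NEW_AT_LEVEL[name])
    return allowed


if __name__ == "__main__":
    for level in LEVELS:
        print(f"{level:>8}: {', '.join(permissible_statistics(level))}")
```

A check like this can be run against a data dictionary before modeling, for example to flag a nominal column that is about to be averaged.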

Step 3: data preparation

As was mentioned in Chapter 1, Getting Started with Predictive Analysis, one purpose of data preparation is to prepare an input modeling file that can go directly into an algorithm. In theory, the input file will encompass all of the knowledge gained in steps 1 and 2. Ideally, this file will consist of a target variable, all meaningful predictor variables, other identification variables to aid in the modeling process, and any additional variables created from the raw data sources. Data preparation, like the previous steps, is an iterative process. Here are some typical steps you might follow when preparing the data:

Identify the data sources: These are the critical data inputs that you will need to read in and manipulate. They can be sourced from various data formats such as CSV files, databases, or XML or JSON files. They can be in structured format or unstructured format.
Identify the expected input: Read in some test...
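To make the idea of an input modeling file concrete, here is a minimal sketch that joins predictors from a CSV source to a target from a JSON source. Everything in it is an assumption made for illustration: the data, the field names (`customer_id`, `age`, `region`, `churned`), and the helper `build_modeling_rows` are hypothetical.

```python
import csv
import io
import json

# Hypothetical raw sources; in practice these would be files or database extracts.
CUSTOMERS_CSV = """customer_id,age,region
1,34,east
2,51,west
3,27,east
"""

OUTCOMES_JSON = '[{"customer_id": 1, "churned": 0}, {"customer_id": 2, "churned": 1}]'


def build_modeling_rows(csv_text, json_text):
    """Join predictors (CSV) to the target (JSON) on customer_id.

    Rows without a known target are dropped, since they cannot be
    used for training.
    """
    targets = {rec["customer_id"]: rec["churned"] for rec in json.loads(json_text)}
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        cid = int(rec["customer_id"])
        if cid not in targets:
            continue
        rows.append({
            "customer_id": cid,       # identification variable
            "age": int(rec["age"]),   # predictor
            "region": rec["region"],  # predictor
            "churned": targets[cid],  # target variable
        })
    return rows


if __name__ == "__main__":
    for row in build_modeling_rows(CUSTOMERS_CSV, OUTCOMES_JSON):
        print(row)
```

Each output row carries an identifier, the predictors, and the target, which is the shape described above for a file that can be passed to an algorithm.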

Step 4: modeling

In the modeling stage, you will pick an appropriate predictive modeling technique that fits your problem and apply it to your data. There are several factors which influence the selection of a model:

Who will use the model?
How will the model be used?
What are the assumptions of the model?
How much data do I have?
How many variables do I need to use?
What is the accuracy level needed by the model?
Am I willing to trade some accuracy for interpretability?

Particularly related to the last point is the concept of bias and variance.

Bias is related to the ability of a model to approximate the data. Low bias algorithms are able to fit the data with little error. While this may always seem to be an advantage, it can result in a complex model which is unstable and difficult to explain. On the other hand, a high bias model is relatively simple to explain (like linear regression), but may sacrifice some accuracy for explainability and stability. You will usually start by looking at...
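The trade-off can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not a method from this chapter: the data are simulated as y = 2x plus Gaussian noise, and a simple least-squares line is compared with a very flexible 1-nearest-neighbor model. All function names are hypothetical. Because the simulated relationship is itself linear, the simple model loses nothing here; the point is only that fitting the training data with almost no error does not guarantee a stable model.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_nearest_neighbor(xs, ys):
    """1-nearest-neighbor: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Simulated data: y = 2x + Gaussian noise (an assumption for illustration)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(42)
train_x, train_y = simulate(200, rng)
test_x, test_y = simulate(200, rng)

line = fit_line(train_x, train_y)
knn = fit_nearest_neighbor(train_x, train_y)

# (training error, test error) for each model
results = {
    "line": (mse(line, train_x, train_y), mse(line, test_x, test_y)),
    "1-NN": (mse(knn, train_x, train_y), mse(knn, test_x, test_y)),
}

if __name__ == "__main__":
    for name, (train_err, test_err) in results.items():
        print(f"{name:>5}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

The nearest-neighbor model reproduces the training data exactly, yet it does worse than the line on fresh data because it has also memorized the noise.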

Step 5: evaluation

Model evaluation deals with how accurate or useful the model you have just developed is, or will be in the future. Model evaluation can take different forms. Some are more subjective and domain oriented, such as placing the model under the scrutiny of experts in your field, and some are more technically oriented. There are many metrics and procedures available to assess a model. At the basic level, you have many statistics (known by acronyms such as AIC, BIC, and AUC) which purport to convey the goodness of a model in a single metric. However, these metrics by themselves are unable to convey the purpose and application of a predictive model to a larger audience, and they are often in conflict with one another. Some context is needed. Some would argue that one could also develop a perfectly good predictive model and then be unable to convey its purpose and application to a larger audience. In my opinion, that is a bad model, regardless of how well an evaluation metric fits...
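For readers who have not met these acronyms, the sketch below computes them from their standard definitions: AIC = 2k - 2 ln(L), BIC = k ln(n) - 2 ln(L), and AUC as the share of positive/negative pairs that are ranked correctly. The two "fits" compared at the end are hypothetical numbers chosen for illustration, not results from any real model.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def auc(labels, scores):
    """Area under the ROC curve as the share of (positive, negative) pairs
    in which the positive case gets the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical fits on the same 100 observations.
small = {"log_likelihood": -100.0, "n_params": 3}
large = {"log_likelihood": -96.0, "n_params": 6}
N_OBS = 100

if __name__ == "__main__":
    for name, m in (("small", small), ("large", large)):
        print(name,
              round(aic(m["log_likelihood"], m["n_params"]), 2),
              round(bic(m["log_likelihood"], m["n_params"], N_OBS), 2))
    print("AUC:", auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With these made-up numbers, AIC favors the larger model while BIC favors the smaller one, which is one concrete way in which single-number metrics can disagree.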

Step 6: deployment

Deployment of a model is the process by which you put your models into a real-world production setting. This can depend on many factors, such as the environment in which the model was developed, the algorithm that was chosen, the assumptions about the data that were made when the model was developed, and of course, the skill level of the developer. Often a model is unable to scale up to the demands of a production environment, and knowing your likely production environment in advance will dictate which problems or techniques are feasible.

Model scoring

Model scoring makes the model actionable. If you develop a model and you are unable to apply the results to new data, then you will be unable to do any prediction on an ongoing basis. Scoring new data often involves exporting the outputs of the development model to a real-time scoring engine, which is often written in Java or C++. How that is performed varies vastly depending upon the modeling technique. Sometimes the scoring is performed separately...
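As a minimal sketch of what scoring can look like, assume the development model is a logistic regression whose intercept and coefficients have been exported as plain numbers. The coefficient values, the feature names (`age`, `prior_purchases`), and the `score` function below are all hypothetical; a production scoring engine would typically add input validation and handling for missing values.

```python
import math

# Hypothetical coefficients exported from a fitted logistic regression.
EXPORTED_MODEL = {
    "intercept": -1.5,
    "coefficients": {"age": 0.03, "prior_purchases": 0.4},
}


def score(record, model=EXPORTED_MODEL):
    """Return the predicted probability for one new record."""
    linear = model["intercept"] + sum(
        weight * record[name] for name, weight in model["coefficients"].items()
    )
    # The logistic function maps the linear score to a probability.
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    new_records = [
        {"age": 25, "prior_purchases": 0},
        {"age": 60, "prior_purchases": 5},
    ]
    for rec in new_records:
        print(rec, round(score(rec), 3))
```

Because the scoring step only needs the exported numbers and a few arithmetic operations, the same logic can be reimplemented in whatever language the production environment uses.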

References

You can refer to the following articles:

Determine the Root Cause: 5 Whys. Retrieved from https://www.isixsigma.com/tools-templates/cause-effect/determine-root-cause-5-whys/
Gettysburg Address, (1863, November 19). Retrieved from Abraham Lincoln Online: http://www.abrahamlincolnonline.org/lincoln/speeches/gettysburg.html
Stevens, S. (1946). On the Theory of Scales of Measurement. Science
Wikipedia. Bias-variance tradeoff. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

Notes

Random Forests (tm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.

Summary

In this chapter, we learned about the various structured approaches to predictive analytics and how implementing an analytics project in a methodical way can improve its chances of success through collaboration and communication. We went through the various steps of the CRISP-DM methodology and demonstrated tools that you could use to help you progress along these steps.

We discussed the benefits of sampling and how it could speed up your project. SQL was demonstrated to illustrate basic charts and plots, so that you can begin to develop insight even before you create a first model. We showed that data simulation could also be used at the data understanding phase as a preliminary modeling tool to do "what ifing", even before actual company data is obtained.

We learned about the various types of data that you will encounter, and showed some examples of independent and dependent variables and the importance of doing preliminary 1-way and 2-way variable analysis as a precursor...

Author (1)

Ralph Winters

Ralph Winters started his career as a database researcher for a music performing rights organization (he composed as well!), and then branched out into healthcare survey research, finally landing in the analytics and information technology world. He has provided his statistical and analytics expertise to many large Fortune 500 companies in the financial, direct marketing, insurance, healthcare, and pharmaceutical industries. He has worked on many diverse types of predictive analytics projects involving customer retention, anti-money laundering, voice of the customer text mining analytics, and health care risk and customer choice models. He is currently a data architect for a healthcare services company, working in the data and advanced analytics group. He enjoys working collaboratively with a smart team of business analysts, technologists, and actuaries, as well as with other data scientists. Ralph considers himself a practical person. In addition to authoring Practical Predictive Analytics for Packt Publishing, he has also contributed two tutorials illustrating the use of predictive analytics in medicine and healthcare in Practical Predictive Analytics and Decisioning Systems for Medicine (Miner et al., Elsevier, September 2014), and also presented Practical Text Mining with SQL using Relational Databases at the 2013 11th Annual Text and Social Analytics Summit in Cambridge, MA. Ralph resides in New Jersey with his loving wife Katherine, amazing daughters Claire and Anna, and his four-legged friends, Bubba and Phoebe, who can be unpredictable. Ralph's web site can be found at ralphwinters.com