IBM SPSS Modeler Essentials

By Jesus Salcedo, Keith McCormick

About this book

IBM SPSS Modeler lets you apply predictive analytics quickly and efficiently and gain insights from your data. With almost 25 years of history, Modeler is the most established and comprehensive data mining workbench available. Since it is popular in corporate settings, widely available in university settings, and highly compatible with all the latest technologies, it is the perfect way to start your data science and machine learning journey.

This book takes a detailed, step-by-step approach to introducing data mining using the de facto standard process, CRISP-DM, and Modeler’s easy-to-learn “visual programming” style. You will learn how to read data into Modeler, assess data quality, prepare your data for modeling, find interesting patterns and relationships within your data, and export your predictions. Using a single case study throughout, this intentionally short and focused book sticks to the essentials. The authors have drawn upon their decades of teaching thousands of new users to choose those aspects of Modeler that you should learn first, so that you get off to a good start using proven best practices.

This book provides an overview of various popular data modeling techniques and presents a detailed case study of how to use CHAID, a decision tree model. Assessing a model’s performance is as important as building it; this book will also show you how to do that. Finally, you will see how you can score new data and export your predictions. By the end of this book, you will have a firm understanding of the basics of data mining and how to effectively use Modeler to build predictive models.

Publication date: December 2017
Publisher: Packt
Pages: 238
ISBN: 9781788291118

 

Chapter 1. Introduction to Data Mining and Predictive Analytics

IBM SPSS Modeler is an interactive data mining workbench composed of multiple tools and technologies to support the entire data mining process. In this first chapter, you will be introduced to the concept of data mining; to CRISP-DM, which is a recipe for doing data mining the right way; and to a case study outlining the data mining process. The chapter topics are as follows:

  • Introduction to data mining
  • CRISP-DM overview
  • The data mining process (as a case study)
 

Introduction to data mining


In this chapter, we will place IBM SPSS Modeler and its use in a broader context. Modeler was developed as a tool to perform data mining. Although the phrase predictive analytics is more common now, when Modeler was first developed in the 1990s, this type of analytics was almost universally called data mining. The use of the phrase data mining has evolved a bit since then to emphasize the exploratory aspect, especially in the context of big data and sometimes with a particular emphasis on the mining of private data that has been collected. This will not be our use of the term. Data mining can be defined in the following way:

Data mining is the search of data, accumulated during the normal course of doing business, in order to find and confirm the existence of previously unknown relationships that can produce positive and verifiable outcomes through the deployment of predictive models when applied to new data.

Several points are worth emphasizing:

  • The data is not new
  • The data that can solve the problem was not collected solely to perform data mining
  • The data miner is not testing known relationships (neither hypotheses nor hunches) against the data
  • The patterns must be verifiable
  • The resulting models must be capable of something useful
  • The resulting models must actually work when deployed on new data

In the late 1990s, a process called the Cross Industry Standard Process for Data Mining (CRISP-DM) was developed. We will be drawing heavily from that tradition in this chapter, and CRISP-DM can be a powerful way to organize your work in Modeler. It is our use of this process in organizing this book's material that prompts us to use the term data mining. It is worth noting that the team that first developed Modeler, originally called Clementine, and the team that wrote CRISP-DM have some members in common.

 

CRISP-DM overview


CRISP-DM is considered to be the de facto standard for conducting a data mining project. Starting with the Business Understanding phase and ending with the Deployment phase, this six-phase process has a total of 24 tasks. Do not settle for knowing only the six high-level phases; it is well worth the effort to familiarize yourself with all 24 tasks. The diagram shown next illustrates the six phases of the CRISP-DM process model and the following pages will discuss each of these phases:

Business Understanding

The Business Understanding phase is focused on good problem definition and ensuring that you are solving the business's problem. You must begin from a business perspective and business knowledge, and proceed by converting this knowledge into a data mining problem definition. You will not be performing the actual Business Understanding in Modeler, as such, but Modeler allows you to organize supporting material such as Word documents and PowerPoint presentations as part of a Modeler project file. You don't need to organize this material in a project file, but you do need to remember to do a proper job at this phase. For more detailed information on each task within a phase, refer to the CRISP-DM document itself. It is free and readily available on the internet.

The four tasks in this phase are:

  • Determine business objectives
  • Assess situation
  • Determine data mining goals
  • Produce project plan

Data Understanding

Modeler has numerous resources for exploring your data in preparation for the other phases. We will demonstrate a number of these in Chapter 3, Importing Data into Modeler; Chapter 4, Data Quality and Exploration; and Chapter 8, Looking for Relationships Between Fields. The Data Understanding phase includes activities for getting familiar with the data as well as data collection and data quality. The four Data Understanding tasks are:

  • Collect initial data
  • Describe data
  • Explore data
  • Verify data quality

Data Preparation

The Data Preparation phase covers all activities to construct the final dataset (the data that will be fed into the modeling tool(s)) from the initial raw data. Data Preparation is often described as the most labor-intensive phase for the data analyst. It is terribly important that Data Preparation is done well, and a substantial amount of this book is dedicated to it. We cover cleaning, selecting, integrating, and constructing data in Chapter 5, Cleaning and Selecting Data; Chapter 6, Combining Data Files; and Chapter 7, Deriving New Fields, respectively. However, a book dedicated to the basics of data mining can really only start you on your journey when it comes to Data Preparation, since there are so many ways in which you can improve and prepare data. When you are ready for a more advanced treatment of this topic, there are two resources that will go into Data Preparation in much more depth, and both have extensive Modeler software examples: The IBM SPSS Modeler Cookbook (Packt Publishing) and Effective Data Preparation (Cambridge University Press).

The five Data Preparation tasks are:

  • Select data
  • Clean data
  • Construct data
  • Integrate data
  • Format data

Modeling

The Modeling phase is probably what you expect it to be: the phase where the modeling algorithms move to the forefront. In many ways, this is the easiest phase, as the algorithms do a lot of the work if you have done an excellent job on the prior phases and a good job translating the business problem into a data mining problem. Despite the fact that the algorithms are doing the heavy lifting in this phase, it is generally considered the most intimidating, and it is understandable why: there are an overwhelming number of algorithms to choose from. Even in a well-curated workbench such as Modeler, there are dozens of choices. Open source options such as R have hundreds of choices. While this book is not an algorithms guide, and it is impossible to offer a chapter on each algorithm, Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, should be very helpful in understanding, at a high level, what options are available in Modeler. Also, in Chapter 10, Decision Tree Models, we go through a thorough demonstration of one modeling technique, decision trees, to orient you to modeling in Modeler.

The four tasks in this phase are:

  • Select modeling technique
  • Generate test design
  • Build model
  • Assess model

Evaluation

At this stage in the project you have built a model (or models) that appears to be of high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model—to be certain it properly achieves the business objectives.

Evaluation is frequently confused with model assessment, the last task of the Modeling phase. The Assess model task is all about the data analysis perspective and includes metrics such as model accuracy. The authors of CRISP-DM considered calling this phase business evaluation because it has to be conducted in the language of the business, using the metrics of the business as indicators of success. Given the nature of this book, and its emphasis on the point-and-click operation of Modeler, there will be virtually no opportunity to practice this phase, but in real-world projects it is a critical phase.

The three tasks in this phase are:

  • Evaluate results
  • Review process
  • Determine next steps

Deployment

Creation of the model is generally not the end of the project. Depending on the requirements, the Deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. Given the software focus of this book and the spirit of sticking to the basics, we will only cover using models for the scoring of new data. Real-world deployment is much more complex, and a complex deployment can more than double the length of a project. Modeler's capabilities in this area go far beyond what we will be able to show in this book. The final chapter of this book, Chapter 11, Model Assessment and Scoring, briefly talks about some of these issues.

However, it is not unusual for the deployment team to be different than the modeling team, and the responsibility may fall to team members with more of an IT focus. The IBM software stack offers dedicated tools for complex deployment scenarios. IBM Collaboration and Deployment Services has such advanced features.

The four tasks in the Deployment phase are:

  • Plan deployment
  • Plan monitoring and maintenance
  • Produce final report
  • Review project
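With all six task lists now on the page, it can help to see them as a single checklist. The following is a minimal sketch in plain Python (not Modeler code) that simply records the phases and tasks exactly as listed in this chapter and confirms that they add up to the 24 tasks mentioned earlier. The names CRISP_DM and checklist are our own, chosen for illustration.

```python
# Phase and task names are copied from the lists in this chapter.
CRISP_DM = {
    "Business Understanding": [
        "Determine business objectives",
        "Assess situation",
        "Determine data mining goals",
        "Produce project plan",
    ],
    "Data Understanding": [
        "Collect initial data",
        "Describe data",
        "Explore data",
        "Verify data quality",
    ],
    "Data Preparation": [
        "Select data",
        "Clean data",
        "Construct data",
        "Integrate data",
        "Format data",
    ],
    "Modeling": [
        "Select modeling technique",
        "Generate test design",
        "Build model",
        "Assess model",
    ],
    "Evaluation": [
        "Evaluate results",
        "Review process",
        "Determine next steps",
    ],
    "Deployment": [
        "Plan deployment",
        "Plan monitoring and maintenance",
        "Produce final report",
        "Review project",
    ],
}


def checklist(phases):
    """Yield (phase number, phase name, task) in CRISP-DM order."""
    for number, (phase, tasks) in enumerate(phases.items(), start=1):
        for task in tasks:
            yield number, phase, task


total_tasks = sum(len(tasks) for tasks in CRISP_DM.values())

for number, phase, task in checklist(CRISP_DM):
    print(f"{number}. {phase}: {task}")
print(f"{len(CRISP_DM)} phases, {total_tasks} tasks")
```

A structure like this is only a memory aid; the CRISP-DM document itself remains the place to look for what each task involves.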

Learning more about CRISP-DM

Here are five great resources to learn more about CRISP-DM:

 

The data mining process (as a case study)


As Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, will illustrate, there are many different types of data mining projects. For example, you may wish to create customer segments based on products purchased or service usage, so that you can develop targeted advertising campaigns. Or you may want to determine where to better position products in your store, based on customer purchase patterns. Or you may want to predict which students will drop out of school, so that you can provide additional services before this happens.

In this book, we will be using a dataset where we are trying to predict which people have incomes above or below $50,000. We may be trying to do this because we know that people with incomes above $50,000 are much more likely to purchase our products, given that previous work found that income was the most important predictor regarding product purchase. The point is that regardless of the actual data that we are using, the principles that we will be showing apply to an endless variety of data mining problems, whether you are trying to determine which customers will purchase a product, or when you will need to replace an elevator, or how many hotel rooms will be booked on a given date, or what additional complications might occur during surgery, and so on.

As was mentioned previously, Modeler supports the entire data mining process. The figure shown next illustrates exactly how Modeler can be used to compartmentalize each aspect of the CRISP-DM process model:

In Chapter 2, The Basics of Using IBM SPSS Modeler, you will become familiar with the Modeler graphical user interface. In this chapter, we will be using screenshots to illustrate how Modeler represents various data mining activities. The following figures therefore provide only an overview of how different tasks will look within Modeler. For the moment, do not worry about how each image was created; you will see exactly how to create each of these in later chapters.

First and foremost, every data mining project will need to begin with well-defined business objectives. This is crucial for determining what you are trying to accomplish or learn from a project, and how to translate this into data mining goals. Once this is done, you will need to assess the current business situation and develop a project plan that is reasonable given the data and time constraints.

Once business and data mining objectives are well defined, you will need to collect the appropriate data. Chapter 3, Importing Data into Modeler, will focus on how to bring data into Modeler. Remember that data mining typically uses data that was collected during the normal course of doing business, so it is crucial that the data you are using can really address the business and data mining goals:

Once you have data, it is very important to describe it and assess its quality. Chapter 4, Data Quality and Exploration, will focus on how to assess data quality using the Data Audit node:
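To make the idea of a data quality check concrete, here is a minimal sketch in plain Python. It is not Modeler code and does not reproduce the Data Audit node; it only illustrates the kind of per-field summary (valid and missing counts, plus basic statistics) that such an audit produces. The records, field names, and the audit function are hypothetical.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 39, "hours_per_week": 40, "education": "Bachelors"},
    {"age": 50, "hours_per_week": None, "education": "Bachelors"},
    {"age": None, "hours_per_week": 40, "education": "HS-grad"},
    {"age": 53, "hours_per_week": 60, "education": None},
]


def audit(rows):
    """Return a per-field summary: valid count, missing count, basic stats."""
    fields = sorted({name for row in rows for name in row})
    report = {}
    for name in fields:
        values = [row.get(name) for row in rows]
        valid = [v for v in values if v is not None]
        summary = {"valid": len(valid), "missing": len(values) - len(valid)}
        if valid and all(isinstance(v, (int, float)) for v in valid):
            # Numeric field: report range and average.
            summary.update(min=min(valid), max=max(valid), mean=mean(valid))
        else:
            # Categorical field: report the number of distinct categories.
            summary["distinct"] = len(set(valid))
        report[name] = summary
    return report


audit_report = audit(records)
for field, summary in audit_report.items():
    print(field, summary)
```

Even on four rows, a summary like this shows at a glance which fields have missing values and whether the ranges look plausible.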

Once the Data Understanding phase has been completed, it is time to move on to the Data Preparation phase. The Data Preparation phase is by far the most time-consuming and creative part of a data mining project. This is because, as was mentioned previously, we are using data that was collected during the normal course of doing business. The data will therefore not be clean: it will have errors, it will include information that is not relevant, it will have to be restructured into an appropriate format, and you will need to create many new variables that extract important information. Due to the importance of this phase, we have devoted several chapters to addressing these issues. Chapter 5, Cleaning and Selecting Data, will focus on how to select the appropriate cases by using the Select node, and how to clean data by using the Distinct and Reclassify nodes:
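The three operations just named (selecting cases, removing duplicates, and reclassifying categories) can be sketched in a few lines of plain Python. This is an illustration of the concepts only, not of how the Select, Distinct, or Reclassify nodes are configured; the rows, field names, and recoding table are hypothetical.

```python
# Hypothetical raw rows with a duplicate record and inconsistent labels.
raw = [
    {"id": 1, "age": 39, "workclass": "Private"},
    {"id": 2, "age": 16, "workclass": "private"},
    {"id": 1, "age": 39, "workclass": "Private"},  # duplicate of id 1
    {"id": 3, "age": 45, "workclass": "Self-emp"},
    {"id": 4, "age": 52, "workclass": "self employed"},
]

# Select: keep only the cases of interest (here, adults).
selected = [row for row in raw if row["age"] >= 18]

# Distinct: drop repeated records, keeping the first occurrence of each id.
seen_ids = set()
distinct = []
for row in selected:
    if row["id"] not in seen_ids:
        seen_ids.add(row["id"])
        distinct.append(row)

# Reclassify: map inconsistent category labels onto a single spelling.
recode = {"private": "Private", "self employed": "Self-emp"}
cleaned = [
    {**row, "workclass": recode.get(row["workclass"], row["workclass"])}
    for row in distinct
]

for row in cleaned:
    print(row)
```

Note that the order of operations matters: selecting first means that the duplicate check and the recoding only run on the cases you intend to keep.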

Chapter 6, Combining Data Files will continue to focus on the Data Preparation phase by using both the Append and Merge nodes to integrate various data files:
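The difference between appending and merging is easy to see in a small example. The sketch below uses plain Python with hypothetical data and field names; it illustrates the two concepts rather than the Append and Merge nodes themselves. Appending stacks records from files that share the same fields, while merging adds fields by matching records on a key.

```python
# Append: stack records from two files that share the same fields.
january = [{"id": 1, "spend": 120}, {"id": 2, "spend": 80}]
february = [{"id": 3, "spend": 200}]
appended = january + february

# Merge: add fields from a second source by matching on a key (id).
demographics = {1: {"age": 39}, 3: {"age": 45}}
merged = [
    {**row, **demographics[row["id"]]}
    for row in appended
    if row["id"] in demographics  # keep only records found in both sources
]

print(len(appended), "records after append")
for row in merged:
    print(row)
```

Here the merge keeps only records present in both sources, so the customer with id 2 drops out. Whether unmatched records should be kept is a decision you will need to make for each merge.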

Finally, Chapter 7, Deriving New Fields, will focus on constructing additional fields by using the Derive node:
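Constructing a field simply means computing a new value from values that already exist in each record. As a minimal sketch in plain Python (the field names and the two derived fields are hypothetical, and this is not Derive node syntax):

```python
# Hypothetical input rows.
rows = [
    {"capital_gain": 2174, "capital_loss": 0, "hours_per_week": 40},
    {"capital_gain": 0, "capital_loss": 1902, "hours_per_week": 55},
]


def derive(row):
    """Return a copy of the row with two new fields added."""
    new_row = dict(row)
    # A numeric field computed from two existing fields.
    new_row["net_capital"] = row["capital_gain"] - row["capital_loss"]
    # A true/false flag computed from a condition.
    new_row["works_overtime"] = row["hours_per_week"] > 40
    return new_row


derived = [derive(row) for row in rows]
for row in derived:
    print(row)
```

The original rows are left unchanged, which makes it easy to compare the data before and after a derivation.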

At this point we will be ready to begin exploring relationships within the data. In Chapter 8, Looking for Relationships Between Fields, we will use the Distribution, Matrix, Histogram, Means, Plot, and Statistics nodes to uncover and understand simple relationships between variables:
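One of the simplest ways to look at the relationship between two categorical fields is a cross-tabulation with row percentages. The following is a rough sketch in plain Python with made-up observations; it illustrates the idea behind a two-way table and is not output from, or syntax for, any Modeler node.

```python
from collections import Counter

# Hypothetical (education, income) pairs.
observations = [
    ("Bachelors", ">50K"),
    ("Bachelors", ">50K"),
    ("Bachelors", "<=50K"),
    ("HS-grad", "<=50K"),
    ("HS-grad", "<=50K"),
    ("HS-grad", "<=50K"),
    ("HS-grad", ">50K"),
]

# Count each combination and each row category.
cell_counts = Counter(observations)
row_totals = Counter(education for education, _ in observations)

# Row percentages: within each education level, the share in each income band.
row_percent = {
    (education, income): 100 * count / row_totals[education]
    for (education, income), count in cell_counts.items()
}

for (education, income), percent in sorted(row_percent.items()):
    print(f"{education:10s} {income:6s} {percent:5.1f}%")
```

If the percentages differ noticeably from one row to the next, as they do in this toy example, the two fields are related and the row field may be a useful predictor.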

Once the Data Preparation phase has been completed, we will move on to the Modeling phase. Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, will introduce the various types of models available in Modeler and then provide an overview of the predictive models. It will also discuss how to select a modeling technique. Chapter 10, Decision Tree Models, will cover the theory behind decision tree models and focus specifically on how to build a CHAID model. We will also use a Partition node to generate a test design; this is extremely important because only through replication can we determine whether we have a verifiable pattern:
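The idea behind a test design is to set aside records that the model never sees while it is being built. A minimal sketch of a random training/testing split in plain Python follows; the function name, the 50/50 split, and the seed value are assumptions made for illustration and are not taken from the Partition node's settings.

```python
import random


def partition(rows, train_fraction=0.5, seed=12345):
    """Randomly assign each row to a training or a testing sample.

    A fixed seed makes the assignment repeatable from run to run.
    """
    rng = random.Random(seed)
    training, testing = [], []
    for row in rows:
        if rng.random() < train_fraction:
            training.append(row)
        else:
            testing.append(row)
    return training, testing


rows = list(range(1000))  # stand-in for 1,000 records
training, testing = partition(rows)
print(len(training), "training records,", len(testing), "testing records")
```

The model is built on the training sample only. If a pattern found there also holds in the testing sample, you have some evidence that it will replicate on new data.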

Chapter 11, Model Assessment and Scoring, is the final chapter in this book and it will provide readers with the opportunity to assess and compare models using the Analysis node. The Evaluation node will also be introduced as a way to evaluate model results:
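At its simplest, assessing a classification model means comparing predicted values with actual values. The sketch below, in plain Python with made-up values, computes overall accuracy and a table of actual versus predicted counts. It is meant only to illustrate the arithmetic and does not reproduce the output of the Analysis node.

```python
from collections import Counter

# Hypothetical actual and predicted income bands for six records.
actual = [">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"]
predicted = [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"]

# Count each (actual, predicted) combination.
matrix = Counter(zip(actual, predicted))

# Accuracy: the share of records where prediction and actual value agree.
correct = sum(count for (a, p), count in matrix.items() if a == p)
accuracy = correct / len(actual)

for (a, p), count in sorted(matrix.items()):
    print(f"actual {a:6s} predicted {p:6s} count {count}")
print(f"accuracy: {accuracy:.1%}")
```

The table is often more informative than the single accuracy figure, because it shows which kind of error the model makes more often.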

Finally, we will spend some time discussing how to score new data and export those results to another application using the Flat File node:
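Scoring means applying an existing model to records whose outcome is not yet known, and exporting means writing those records, with their predictions, to a file that another application can read. The sketch below shows the general shape of that step in plain Python. The score function is a hypothetical stand-in rule, not a CHAID model, and the field names are invented for illustration.

```python
import csv
import io


def score(row):
    """Stand-in scoring rule (hypothetical); a real project would apply
    the trained model here instead."""
    if row["education_years"] >= 13 and row["hours_per_week"] >= 40:
        return ">50K"
    return "<=50K"


# New records with no known outcome.
new_data = [
    {"id": 101, "education_years": 16, "hours_per_week": 45},
    {"id": 102, "education_years": 9, "hours_per_week": 40},
]

# Add a prediction field to every record.
scored = [{**row, "prediction": score(row)} for row in new_data]


def export(rows, handle):
    """Write rows as comma-delimited text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
export(scored, buffer)
flat_file_text = buffer.getvalue()
print(flat_file_text)
```

To write to disk instead, you could pass a file opened with open("scored.csv", "w", newline="") to export; the filename here is just an example.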

 

Summary


In this chapter, you were introduced to the notion of data mining and the CRISP-DM process model. You were also provided with an overview of the data mining process, along with previews of what to expect in the upcoming chapters.

In the next chapter, you will learn about the different components of the Modeler graphical user interface. You will also learn how to build streams. Finally, you will be introduced to various help options.

About the Authors

  • Jesus Salcedo

    Jesus Salcedo has a PhD in psychometrics from Fordham University. He is an independent statistical consultant and has been using SPSS products for over 20 years. He is a former SPSS Curriculum Team Lead and Senior Education Specialist who has written numerous SPSS training courses and trained thousands of users.

  • Keith McCormick

    Keith McCormick is an independent data miner, trainer, conference speaker, and author. He has been using statistics software tools since the early 90s, and has been conducting training since 1997. He has been data mining and using IBM SPSS Modeler since its arrival in North America in the late 90s. He is also an expert in other packages in IBM's SPSS software suite, including IBM SPSS Statistics, AMOS, and Text Mining. He blogs and reviews related books as well.
