IBM SPSS Modeler is an interactive data mining workbench composed of multiple tools and technologies to support the entire data mining process. In this first chapter, readers will be introduced to the concepts of data mining; CRISP-DM, a recipe for doing data mining the right way; and a case study outlining the data mining process. The chapter topics are as follows:
- Introduction to data mining
- CRISP-DM overview
- The data mining process (as a case study)
In this chapter, we will place IBM SPSS Modeler and its use in a broader context. Modeler was developed as a tool to perform data mining. Although the phrase predictive analytics is more common now, when Modeler was first developed in the 1990s, this type of analytics was almost universally called data mining. The use of the phrase data mining has evolved a bit since then to emphasize the exploratory aspect, especially in the context of big data and sometimes with a particular emphasis on the mining of private data that has been collected. This will not be our use of the term. Data mining can be defined in the following way:
Data mining is the search of data, accumulated during the normal course of doing business, in order to find and confirm the existence of previously unknown relationships that can produce positive and verifiable outcomes through the deployment of predictive models when applied to new data.
Several points are worth emphasizing:
- The data is not new
- The data that can solve the problem was not collected solely to perform data mining
- The data miner is not testing known relationships (neither hypotheses nor hunches) against the data
- The patterns must be verifiable
- The resulting models must be capable of something useful
- The resulting models must actually work when deployed on new data
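Since Modeler itself is point-and-click, a tiny Python sketch may help make these points concrete: a pattern found in accumulated data is verified on held-out records before being deployed to score new data. The data, the income threshold, and the scoring rule are all invented for illustration.

```python
# Historical data accumulated during the normal course of business.
historical = [
    {"income": 62000, "purchased": True},
    {"income": 58000, "purchased": True},
    {"income": 31000, "purchased": False},
    {"income": 27000, "purchased": False},
]

# The "discovered" pattern: customers earning over $50,000 tend to buy.
def model(record):
    return record["income"] > 50000

# Verify the pattern on records the search did not use.
holdout = [
    {"income": 71000, "purchased": True},
    {"income": 24000, "purchased": False},
]
accuracy = sum(model(r) == r["purchased"] for r in holdout) / len(holdout)

# Deploy: score genuinely new data, where the outcome is unknown.
new_customers = [{"income": 55000}, {"income": 40000}]
scores = [model(r) for r in new_customers]
```

The key discipline is the last step: the model only matters if it works on records it has never seen.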
In the late 1990s, a process was developed called the Cross Industry Standard Process for Data Mining (CRISP-DM). We will be drawing heavily from that tradition in this chapter, and CRISP-DM can be a powerful way to organize your work in Modeler. Our use of this process to organize the book's material is what prompts us to use the term data mining. It is worth noting that the team that first developed Modeler, originally called Clementine, and the team that wrote CRISP-DM have some members in common.
CRISP-DM is considered the de facto standard for conducting a data mining project. Starting with the Business Understanding phase and ending with the Deployment phase, this six-phase process has a total of 24 tasks. It is not enough to focus only on the highest level, the phases; it is well worth the effort to familiarize yourself with all 24 tasks. The diagram shown next illustrates the six phases of the CRISP-DM process model, and the following pages will discuss each of these phases:
The Business Understanding phase is focused on good problem definition and on ensuring that you are solving the business's problem. You must begin from a business perspective and business knowledge, and proceed by converting this knowledge into a data mining problem definition. You will not be performing the actual Business Understanding work in Modeler as such, but Modeler allows you to organize supporting material, such as Word documents and PowerPoint presentations, as part of a Modeler project file. You don't need to organize this material in a project file, but you do need to do a proper job at this phase. For more detailed information on each task within a phase, refer to the CRISP-DM document itself; it is free and readily available on the internet.
The four tasks in this phase are:
- Determine business objectives
- Assess situation
- Determine data mining goals
- Produce project plan
Modeler has numerous resources for exploring your data in preparation for the other phases. We will demonstrate a number of these in Chapter 3, Importing Data into Modeler; Chapter 4, Data Quality and Exploration; and Chapter 8, Looking for Relationships Between Fields. The Data Understanding phase includes activities for getting familiar with the data, as well as data collection and data quality assessment. The four Data Understanding tasks are:
- Collect initial data
- Describe data
- Explore data
- Verify data quality
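To make the last two tasks concrete, here is a minimal sketch, in plain Python rather than Modeler, of the kind of per-field summary a data audit produces: how many values are missing, and what the basic statistics look like. The field names and values are invented.

```python
# Summarize one field across a list of records: missing count plus
# min, max, and mean of the values that are present.
def audit(records, field):
    values = [r[field] for r in records if r.get(field) is not None]
    missing = len(records) - len(values)
    return {
        "missing": missing,
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }

records = [{"age": 34}, {"age": 51}, {"age": None}, {"age": 45}]
report = audit(records, "age")
```

A real audit would add distributions, outlier flags, and type checks per field, but the shape of the output is the same: one summary row per field.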
The Data Preparation phase covers all activities needed to construct the final dataset (the data that will be fed into the modeling tools) from the initial raw data. Data Preparation is often described as the most labor-intensive phase for the data analyst. It is terribly important that Data Preparation is done well, and a substantial amount of this book is dedicated to it. We cover cleaning and selecting, combining, and constructing data in Chapter 5, Cleaning and Selecting Data; Chapter 6, Combining Data Files; and Chapter 7, Deriving New Fields, respectively. However, a book dedicated to the basics of data mining can really only start you on your journey when it comes to Data Preparation, since there are so many ways in which you can improve and prepare data. When you are ready for a more advanced treatment of this topic, two resources go into Data Preparation in much more depth, and both have extensive Modeler software examples: The IBM SPSS Modeler Cookbook (Packt Publishing) and Effective Data Preparation (Cambridge University Press).
The five Data Preparation tasks are:
- Select data
- Clean data
- Construct data
- Integrate data
- Format data
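As a rough illustration of these five tasks, here is a sketch in plain Python standing in for Modeler's nodes; all field names, codes, and rules are invented.

```python
# Raw data accumulated during the normal course of business.
customers = [
    {"id": 1, "income": 62000, "state": "ny"},
    {"id": 2, "income": -1,    "state": "ca"},  # -1 encodes "unknown"
    {"id": 3, "income": 48000, "state": "tx"},
]
purchases = {1: 3, 3: 1}  # second data source: id -> purchase count

# Select: keep only usable records.
selected = [c for c in customers if c["income"] >= 0]

# Clean and Format: standardize the state code.
for c in selected:
    c["state"] = c["state"].upper()

# Construct: derive a new field from an existing one.
for c in selected:
    c["high_income"] = c["income"] > 50000

# Integrate: merge in purchase counts from the second source.
for c in selected:
    c["purchases"] = purchases.get(c["id"], 0)
```

Each loop here corresponds to work that a Modeler stream performs with a dedicated node, and a real project repeats these steps many times over.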
The Modeling phase is probably what you expect it to be: the phase where the modeling algorithms move to the forefront. In many ways, this is the easiest phase, as the algorithms do a lot of the work if you have done an excellent job in the prior phases and have translated the business problem into a data mining problem well. Despite the fact that the algorithms do the heavy lifting here, this phase is generally considered the most intimidating, and it is understandable why: there are an overwhelming number of algorithms to choose from. Even in a well-curated workbench such as Modeler, there are dozens of choices; open source options such as R have hundreds. While this book is not an algorithms guide, and it is impossible to offer a chapter on each algorithm, Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, should be very helpful in understanding, at a high level, what options are available in Modeler. Also, in Chapter 10, Decision Tree Models, we go through a thorough demonstration of one modeling technique, decision trees, to orient you to modeling in Modeler.
The four tasks in this phase are:
- Select modeling technique
- Generate test design
- Build model
- Assess model
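The four tasks can be sketched in plain Python; here a simple systematic holdout stands in for the test design, and a one-rule classifier stands in for a real algorithm, with data invented for illustration.

```python
# Labeled data: the outcome is True when x is at least 50.
data = [{"x": i, "y": i >= 50} for i in range(100)]

# Generate test design: hold out every third record for testing.
train = [r for i, r in enumerate(data) if i % 3 != 0]
test = [r for i, r in enumerate(data) if i % 3 == 0]

# Build model: learn the decision threshold from training records only.
threshold = min(r["x"] for r in train if r["y"])

# Assess model: accuracy on the held-out records.
accuracy = sum((r["x"] >= threshold) == r["y"] for r in test) / len(test)
```

The essential point survives even in a toy example: the threshold is learned from the training records alone, and quality is judged on records the learning step never touched.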
At this stage in the project, you have built a model (or models) that appears to be of high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate it more thoroughly, to be certain that it properly achieves the business objectives. Evaluation is frequently confused with model assessment, the last task of the Modeling phase. Assess model is all about the data analysis perspective and includes metrics such as model accuracy. The authors of CRISP-DM considered calling this phase business evaluation because it has to be conducted in the language of the business, using the metrics of the business as indicators of success. Given the nature of this book, and its emphasis on the point-and-click operation of Modeler, there will be virtually no opportunity to practice this phase, but in real-world projects it is a critical phase.
The three tasks in this phase are:
- Evaluate results
- Review process
- Determine next steps
Creation of the model is generally not the end of the project. Depending on the requirements, the Deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. Given the software focus of this book, and in the spirit of sticking to the basics, we will only cover using models to score new data. Real-world deployment is much more complex, and a complex deployment can more than double the length of a project. Modeler's capabilities in this area go far beyond what we will be able to show in this book. The final chapter of this book, Chapter 11, Model Assessment and Scoring, briefly discusses some of these issues.
However, it is not unusual for the deployment team to be different from the modeling team, and the responsibility may fall to team members with more of an IT focus. The IBM software stack offers dedicated tools for complex deployment scenarios, such as IBM Collaboration and Deployment Services.
The four tasks in the Deployment phase are:
- Plan deployment
- Plan monitoring and maintenance
- Produce final report
- Review project
Here are five great resources to learn more about CRISP-DM:
- The CRISP-DM document itself, found in various forms: https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
- Tom Khabaza's Nine Laws of Data Mining: http://khabaza.codimension.net/index_files/9laws.htm
- The IBM SPSS Modeler Cookbook (Packt Publishing): in particular, it has a more extensive introduction and an appendix of essays
- Data Mining For Dummies: There is no software instruction in this book, but it has an excellent chapter on CRISP-DM and also a chapter on the nine laws
- https://www.lynda.com/SPSS-tutorials/Essential-Elements-Predictive-Analytics-Data-Mining: This web-based course does not cover software tasks but discusses predictive analytics strategy, CRISP-DM, and the Nine Laws of Data Mining
As Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler will illustrate, there are many different types of data mining projects. For example, you may wish to create customer segments based on products purchased or service usage, so that you can develop targeted advertising campaigns. Or you may want to determine where to better position products in your store, based on customer purchase patterns. Or you may want to predict which students will drop out of school, so that you can provide additional services before this happens.
In this book, we will be using a dataset where we are trying to predict which people have incomes above or below $50,000. We may be trying to do this because we know that people with incomes above $50,000 are much more likely to purchase our products, given that previous work found that income was the most important predictor of product purchase. The point is that, regardless of the actual data that we are using, the principles that we will be showing apply to an infinite number of data mining problems, whether you are trying to determine which customers will purchase a product, when you will need to replace an elevator, how many hotel rooms will be booked on a given date, what additional complications might occur during surgery, and so on.
As was mentioned previously, Modeler supports the entire data mining process. The figure shown next illustrates exactly how Modeler can be used to compartmentalize each aspect of the CRISP-DM process model:
In Chapter 2, The Basics of Using IBM SPSS Modeler, you will become familiar with the Modeler graphical user interface. In this chapter, we will be using screenshots to illustrate how Modeler represents various data mining activities. The figures that follow simply provide an overview of how different tasks will look within Modeler, so for the moment do not worry about how each image was created; you will see exactly how to create each of these in later chapters.
First and foremost, every data mining project will need to begin with well-defined business objectives. This is crucial for determining what you are trying to accomplish or learn from a project, and how to translate this into data mining goals. Once this is done, you will need to assess the current business situation and develop a project plan that is reasonable given the data and time constraints.
Once business and data mining objectives are well defined, you will need to collect the appropriate data. Chapter 3, Importing Data into Modeler will focus on how to bring data into Modeler. Remember that data mining typically uses data that was collected during the normal course of doing business, therefore it is going to be crucial that the data you are using can really address the business and data mining goals:
Once you have data, it is very important to describe it and assess its quality. Chapter 4, Data Quality and Exploration will focus on how to assess data quality using the Data Audit node:
Once the Data Understanding phase has been completed, it is time to move on to the Data Preparation phase. The Data Preparation phase is by far the most time-consuming and creative part of a data mining project. This is because, as was mentioned previously, we are using data that was collected during the normal course of doing business; therefore the data will not be clean: it will have errors, it will include information that is not relevant, it will have to be restructured into an appropriate format, and you will need to create many new variables that extract important information. Thus, due to the importance of this phase, we have devoted several chapters to these issues. Chapter 5, Cleaning and Selecting Data will focus on how to select the appropriate cases, by using the Select node, and how to clean data:
Chapter 6, Combining Data Files will continue to focus on the Data Preparation phase by using both the Append and Merge nodes to integrate various data files:
Finally, Chapter 7, Deriving New Fields will focus on constructing additional fields by using the Derive node:
At this point, we will be ready to begin exploring relationships within the data. In Chapter 8, Looking for Relationships Between Fields, we will use the Statistics nodes to uncover and understand simple relationships between variables:
Once the Data Preparation phase has been completed, we will move on to the Modeling phase. Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler will introduce the various types of models available in Modeler and then provide an overview of the predictive models. It will also discuss how to select a modeling technique. Chapter 10, Decision Tree Models will cover the theory behind decision tree models and focus specifically on how to build a CHAID model. We will also use a Partition node to generate a test design; this is extremely important because only through replication can we determine whether we have a verifiable pattern:
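As a rough analogue of what a Partition node does (sketched in Python, with an invented 70/30 split and seed), each record is tagged as Training or Testing, so that model assessment can happen on records the algorithm never saw:

```python
import random

records = [{"id": i} for i in range(10)]

# A fixed seed makes the partition repeatable from run to run.
rng = random.Random(42)
for r in records:
    r["partition"] = "Training" if rng.random() < 0.7 else "Testing"

training = [r for r in records if r["partition"] == "Training"]
testing = [r for r in records if r["partition"] == "Testing"]
```

The seeded random draw is what makes the design replicable: rerunning the stream reproduces the same partition, so results can be verified rather than rediscovered.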
Chapter 11, Model Assessment and Scoring is the final chapter in this book, and it will provide readers with the opportunity to assess and compare models using the Analysis node. The Evaluation node will also be introduced as a way to evaluate model results:
Finally, we will spend some time discussing how to score new data and export those results to another application using the Flat File node:
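As a sketch of what this scoring-and-export step amounts to, here Python's csv module stands in for Modeler's delimited-text export; the scoring rule and field names are invented:

```python
import csv
import io

def model(record):  # toy scoring rule, not a real trained model
    return "T" if record["income"] > 50000 else "F"

new_data = [{"id": 1, "income": 62000}, {"id": 2, "income": 35000}]

# Score each record and write the result as delimited text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "income", "score"])
writer.writeheader()
for record in new_data:
    writer.writerow({**record, "score": model(record)})

flat_file = buffer.getvalue()
```

The output is an ordinary delimited text file with one scored row per input record, which is what makes it easy to hand off to another application.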
In this chapter, you were introduced to the notion of data mining and the CRISP-DM process model. You were also provided with an overview of the data mining process, along with previews of what to expect in the upcoming chapters.
In the next chapter, you will learn about the different components of the Modeler graphical user interface. You will also learn how to build streams. Finally, you will be introduced to various help options.