IBM SPSS Modeler – Pushing the Limits

by Keith McCormick | October 2013

In this article, Keith McCormick, co-author of IBM SPSS Modeler Cookbook, expresses pride that he and his fellow authors (Dean Abbott, Meta S. Brown, Tom Khabaza, and Scott R. Mutchler) have often captured the unexpected, the innovative, and the boundary-testing aspects of using Modeler every day. Even the reviewers played a critical role in this: two of them, Terry Taerum and Jesus Salcedo, improved and supplemented the recipes in the final days of the review process, and all of the reviewers helped make the collection more innovative. As Colin Shearer kindly observed in his Foreword:

“The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this article, you are learning from practitioners who have helped define the state of the art.”


Using the Feature Selection node creatively to remove or decapitate perfect predictors

In this recipe, we will identify perfect or near-perfect predictors in order to ensure that they do not contaminate our model. Perfect predictors earn their name by being correct 100 percent of the time, usually indicating circular logic rather than a prediction of value. This is a common and serious problem.

When this occurs, we have accidentally allowed information into the model that could not possibly have been known at the time of the prediction. Everyone 30 days late on their mortgage receives a late letter, but receiving a late letter is not a good predictor of lateness because the lateness caused the letter, not the other way around.

The rather colorful term decapitate is borrowed from the data miner Dorian Pyle. It is a reference to the fact that perfect predictors will be found at the top of any list of key drivers ("caput" means head in Latin). Therefore, to decapitate is to remove the variable at the top. Their status at the top of the list will be capitalized upon in this recipe.

The following table shows three time periods: the past, the present, and the future. It is important to remember that, when we are making predictions, we can use information from the past to predict the present or the future, but we cannot use information from the future, because it is not available at the time the prediction is made. This seems obvious, but it is common to see analysts use information that was gathered after the date for which predictions are made. As an example, if a company sends out a notice after a customer has churned, you cannot say that the notice is predictive of churning.

 

             | Contract Start (Past) | Expiration (Now)  | Outcome (Future) | Renewal Date (Future)
Joe          | January 1, 2010       | January 1, 2012   | Renewed          | January 2, 2012
Ann          | February 15, 2010     | February 15, 2012 | Out of Contract  | Null
Bill         | March 21, 2010        | March 21, 2012    | Churn            | NA
Jack         | April 5, 2010         | April 5, 2012     | Renewed          | April 9, 2012
New Customer | 24 Months Ago         | Today             | ???              | ???

Getting ready

We will start with a blank stream, and will be using the cup98lrn reduced vars2.txt data set.

How to do it...

To identify perfect or near-perfect predictors in order to ensure that they do not contaminate our model:

  1. Build a stream with a Source node, a Type node, and a Table node, then force instantiation by running the Table node.
  2. Force TARGET_B to be a flag and make it the target.
  3. Add a Feature Selection Modeling node and run it.

  4. Edit the resulting generated model and examine the results. In particular, focus on the top of the list.

  5. Review what you know about the top variables, and check to see if any could be related to the target by definition or could possibly be based on information that actually postdates the information in the target.
  6. Add a CHAID Modeling node, set it to run in Interactive mode, and run it.
  7. Examine the first branch, looking for any child node that might be perfectly predicted; that is, look for child nodes whose members are all found in one category (a quick programmatic version of this check is sketched just after this list).

  8. Continue steps 6 and 7 for the first several variables.

  9. Variables that are problematic (steps 5 and/or 7) need to be set to None in the Type node.
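
The check in step 7 can also be approximated outside Modeler. The following is a minimal pandas sketch, not part of the recipe's stream, that applies the same idea to the late-letter example from the introduction; the field names and values are invented for illustration, and in the recipe itself the target would be TARGET_B and the candidate would be whichever field you flagged in step 5.

# Minimal sketch (pandas), outside the Modeler stream: mimic the step 7 check
# on the late-letter example from the introduction. Field names and values
# are invented for illustration.
import pandas as pd

toy = pd.DataFrame({
    "late_letter_sent": ["yes", "yes", "no", "no", "yes", "no"],
    "is_30_days_late":  [1,     1,     0,    0,    1,     0],
})

def purity_by_category(df, candidate, target):
    """For each category of `candidate`, the share of its records that fall in
    the most common target class. A purity of 1.0 mirrors a CHAID child node
    whose members are all in one category -- a red flag for a leaky predictor."""
    counts = pd.crosstab(df[candidate], df[target])
    return (counts.max(axis=1) / counts.sum(axis=1)).sort_values(ascending=False)

# Both categories come out with purity 1.0: the lateness caused the letter,
# so the letter "predicts" lateness perfectly and should be decapitated.
print(purity_by_category(toy, "late_letter_sent", "is_30_days_late"))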

How it works...

Which variables need decapitation? The problem is information that, although it was known at the time that you extracted it, was not known at the time of decision. In this case, the time of decision is the moment when the potential donor decided to donate or not to donate. Was the donation amount, TARGET_D, known before the decision was made to donate? Clearly not. No information that dates after the information in the target variable can ever be used in a predictive model.

This recipe is built on the following foundation: variables with this problem will float up to the top of the Feature Selection results.

Such variables may not always be perfect predictors, but perfect predictors must always go. For example, you might find that, if a customer initially rejects or postpones a purchase, a follow-up sales call is scheduled in 90 days. These customers are recorded in the campaign data as having rejected the offer, and as a result most of them had a follow-up call 90 days after the campaign. Since a few of the follow-up calls might not have happened, the follow-up variable won't be a perfect predictor, but it still must go.
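
As a rough illustration of why such variables rise to the top, the sketch below ranks candidate fields by how accurately each one predicts the target on its own. It assumes the data is already in a pandas DataFrame named df with the flag target TARGET_B; it is a stand-in for the Feature Selection node's importance calculation, not a reimplementation of it, and it applies to categorical fields (continuous fields would need binning first).

# Rough illustration only: rank each candidate field by the accuracy of a
# one-field "majority class per category" rule. Leaky fields score at or near
# 1.0 and float to the top, much as they do in the Feature Selection results.
# Assumes a pandas DataFrame `df` containing categorical candidates and TARGET_B.
import pandas as pd

def single_field_accuracy(df, field, target="TARGET_B"):
    counts = pd.crosstab(df[field], df[target])
    return counts.max(axis=1).sum() / counts.values.sum()

def rank_candidates(df, target="TARGET_B"):
    candidates = [c for c in df.columns if c != target]
    scores = {f: single_field_accuracy(df, f, target) for f in candidates}
    return pd.Series(scores).sort_values(ascending=False)

# print(rank_candidates(df).head(10))   # candidates for decapitation sit at the top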

Note that variables such as RFA_2 and RFA_2A are both very recent information and highly predictive. Are they a problem? You can't be absolutely certain without knowing the data. Here, the information recorded in these variables was calculated just prior to the campaign. If the calculation had been made just after the campaign, they would have to go, and in that case the CHAID tree almost certainly would have shown evidence of perfect prediction.

There's more...

Sometimes a model has to have a lot of lead time; predicting today's weather is a different challenge from making next year's prediction for a farmer's almanac. When more lead time is desired, you could consider dropping all of the _2 series variables. What would the advantage be? What if you were buying advertising space and there was a 45-day delay before the advertisement appeared? If the _2 variables are gathered between your advertising deadline and your campaign, you might have to use information obtained in the _3 campaign.

Next-Best-Offer for large datasets

Association models have long been the basis for next-best-offer recommendation engines. Recommendation engines are widely used for presenting customers with cross-sell offers. For example, if a customer purchases a shirt, pants, and a belt, which shoes would he also be likely to buy? This type of analysis is often called market-basket analysis, as we are trying to understand which items customers purchase in the same basket/transaction.

Recommendations must be very granular (for example, at the product level) to be usable at the check-out register, website, and so on. For example, knowing that female customers purchase a wallet 63.9 percent of the time when they buy a purse is not directly actionable. However, knowing that customers who purchase a specific purse (for example, SKU 25343) also purchase a specific wallet (for example, SKU 98343) 51.8 percent of the time can be the basis for future recommendations.

Product-level recommendations require the analysis of massive data sets (that is, millions of rows). Usually, this data is in the form of sales transactions, where each line item (that is, each row of data) represents a single product. The line items are tied together by a shared transaction ID.

IBM SPSS Modeler association models support both tabular and transactional data. The tabular format requires each product to be represented as a column. As most product-level recommendation problems involve thousands of products, this format is not practical. The transactional format uses the transactional data directly and requires only two inputs: the transaction ID and the product/item.
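
To make the contrast concrete, here is a small pandas illustration of the two layouts; the column names follow the recipe (Transaction_ID, Product_Code), but the rows are invented.

# Illustration of the two layouts described above (invented rows).
import pandas as pd

transactional = pd.DataFrame({
    "Transaction_ID": [1001, 1001, 1001, 1002, 1002],
    "Product_Code":   ["SKU 25343", "SKU 98343", "SKU 11111", "SKU 25343", "SKU 77777"],
})

# The equivalent tabular layout needs one column per product -- workable here,
# but impractical when the catalogue runs to thousands of SKUs.
tabular = pd.crosstab(transactional["Transaction_ID"],
                      transactional["Product_Code"]) > 0
print(tabular)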

Getting ready

This example uses the files stransactions.sav and scoring.csv.

How to do it...

To recommend the next best offer for large datasets:

  1. Start with a new stream by navigating to File | New Stream.
  2. Go to File | Stream Properties from the IBM SPSS Modeler menu bar. On the Options tab change the Maximum members for nominal fields to 50000. Click on OK.
  3. Add a Statistics File source node to the upper left of the stream. Set the file field by navigating to transactions.sav. On the Types tab, change the Product_Code field to Nominal and click on the Read Values button. Click on OK.
  4. Add a CARMA Modeling node connected to the Statistics File source node in step 3. On the Fields tab, click on Use custom settings and check the Use transactional format check box. Select Transaction_ID as the ID field and Product_Code as the Content field.

  5. On the Model tab of the CARMA Modeling node, change the Minimum rule support (%) to 0.0 and the Minimum rule confidence (%) to 5.0. Click on the Run button to build the model. Double-click the generated model to ensure that you have approximately 40,000 rules.

  6. Add a Var File source node to the middle left of the stream. Set the file field by navigating to scoring.csv. On the Types tab, click on the Read Values button. Click on the Preview button to preview the data. Click on OK to dismiss all dialogs.

  7. Add a Sort node connected to the Var File node in step 6. Choose Transaction_ID and Line_Number (with Ascending sort) by clicking the down arrow on the right of the dialog. Click on OK.
  8. Connect the Sort node in step 7 to the generated model (replacing the current link).
  9. Add an Aggregate node connected to the generated model.
  10. Add a Merge node connected to the generated model. Connect the Aggregate node in step 9 to the Merge node. On the Merge tab, choose Keys as the Merge Method, select Transaction_ID, and click on the right arrow. Click on OK.
  11. Add a Select node connected to the Merge node in step 10. Set the condition to Record_Count = Line_Number. Click on OK. At this point, the stream should look as follows:

  12. Add a Table node connected to the Select node in step 11. Right-click on the Table node and click on Run to see the next-best-offer for the input data.

How it works...

In steps 1-5, we set up the CARMA model to use the transactional data (without needing to restructure the data). CARMA was selected over Apriori for its improved performance and stability with large data sets. For recommendation engines, the settings on the Model tab are somewhat arbitrary and are driven by the practical limitations of the number of rules generated. Lowering the thresholds for confidence and rule support generates more rules. Having more rules can have a negative impact on scoring performance but will result in more (albeit weaker) recommendations.

Rule Support: the percentage of transactions that contain the entire rule, that is, both the antecedents ("if" products) and the consequents ("then" products).

Confidence: of the transactions that contain all the antecedents ("if" products), the percentage that also contain the consequents ("then" products).
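
As a tiny worked example (plain Python, with invented baskets rather than the recipe's data), here is how the two measures are computed for a hypothetical rule purse -> wallet.

# Worked example on invented baskets: rule support and confidence for the
# hypothetical rule {purse} -> {wallet}.
baskets = [
    {"purse", "wallet", "belt"},
    {"purse", "wallet"},
    {"purse", "shoes"},
    {"shirt", "pants", "belt"},
]
antecedents, consequents = {"purse"}, {"wallet"}

with_rule = sum(1 for b in baskets if antecedents <= b and consequents <= b)
with_antecedents = sum(1 for b in baskets if antecedents <= b)

rule_support = with_rule / len(baskets)    # 2/4 = 50% of all transactions
confidence = with_rule / with_antecedents  # 2/3 = 66.7% of purse transactions

print(f"rule support = {rule_support:.1%}, confidence = {confidence:.1%}")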

In step 5, when we examine the model, we see the generated association rules with their corresponding rule support and confidence values.

In the remaining steps (7-12), we score a new transaction and generate three next-best offers based on the model containing the association rules. Since the model was built with transactional data, the scoring data must also be transactional. This means that each row is scored using the current row and the prior rows with the same transaction ID. The only row we generally care about is the last row for each transaction, where all the data has been presented to the model. To accomplish this, we count the number of rows for each transaction and select the line number that equals the total row count (that is, the last row for each transaction).
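
For readers who want to see this plumbing outside Modeler, the following pandas sketch reproduces the Sort/Aggregate/Merge/Select logic of steps 7 to 11 on invented data; the CARMA scoring itself happens inside the generated model and is not reproduced here.

# Sketch of the record-selection logic in steps 7-11 (pandas, invented data).
import pandas as pd

scoring = pd.DataFrame({
    "Transaction_ID": [5001, 5001, 5001, 5002, 5002],
    "Line_Number":    [1, 2, 3, 1, 2],
    "Product_Code":   ["SKU 25343", "SKU 98343", "SKU 11111", "SKU 77777", "SKU 25343"],
})

scoring = scoring.sort_values(["Transaction_ID", "Line_Number"])       # Sort node
counts = (scoring.groupby("Transaction_ID").size()
          .reset_index(name="Record_Count"))                           # Aggregate node
merged = scoring.merge(counts, on="Transaction_ID")                    # Merge node
last_rows = merged[merged["Line_Number"] == merged["Record_Count"]]    # Select node
print(last_rows)   # one fully presented row per transaction, ready to score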

Notice that the model returns 3 recommended products, each with a confidence, in order of decreasing confidence. A next-best-offer engine would present the customer with the best option first (or potentially all three options ordered by decreasing confidence). Note that, if there is no rule that applies to the transaction, nulls will be returned in some or all of the corresponding columns.

There's more...

In this recipe, you'll notice that we generate recommendations across the entire transactional data set. By using all transactions, we are creating generalized next-best-offer recommendations; however, we know that we can probably segment (that is, cluster) our customers into different behavioral groups (for example, fashion conscious, value shoppers, and so on). Partitioning the transactions by behavioral segment and generating separate models for each segment will result in rules that are more accurate and actionable for each group. The biggest challenge with this approach is that you will have to identify the customer segment for each customer before making recommendations (that is, before scoring). A unified approach would be to use the general recommendations for a customer until a customer segment can be assigned, and then switch to the segmented models.

Correcting a confusion matrix for an imbalanced target variable by incorporating priors

Classification models generate probabilities and a predicted class value. When there is a significant imbalance in the proportion of True values in the target variable, the confusion matrix seen in the Analysis node output will show that the model predicts the False value for every record, leading an analyst to conclude that the model is not effective and needs to be retrained. Most often, the conventional wisdom is to use a Balance node to balance the proportion of True and False values in the target variable, thus eliminating the problem in the confusion matrix.

However, in many cases, the classifier is working fine without the Balance node; it is the interpretation of the model that is biased. Each model generates a probability that the record belongs to the True class and the predicted class is derived from this value by applying a threshold of 0.5. Often, no record has a propensity that high, resulting in every predicted class value being assigned False.

In this recipe we learn how to adjust the predicted class for classification problems with imbalanced data by incorporating the prior probability of the target variable.

Getting ready

This recipe uses the datafile cup98lrn_reduced_vars3.sav and the stream Recipe – correct with priors.str.

How to do it...

To incorporate prior probabilities when there is an imbalanced target variable:

  1. Open the stream Recipe – correct with priors.str by navigating to File | Open Stream.
  2. Make sure the datafile points to the correct path to the datafile cup98lrn_reduced_vars3.sav.
  3. Open the generated model TARGET_B, and open the Settings tab. Note that Compute Raw Propensity is checked. Close the generated model.
  4. Duplicate the generated model by copying and pasting the node in the stream. Connect the duplicated model to the original generated model.
  5. Add a Type node to the stream and connect it to the generated model. Open the Type node and scroll to the bottom of the list. Note that the fields related to the two models have not yet been instantiated. Click on Read Values so that they are fully instantiated.

  6. Insert a Filler node and connect it to the Type node.
  7. Open the Filler node and, in the variable list, select $N1-TARGET_B. Inside the Condition section, type '$RP1-TARGET_B' >= TARGET_B_Mean. Click on OK to dismiss the Filler node (after exiting the Expression Builder).

  8. Insert an Analysis node to the stream. Open the Analysis node and click on the check box for Coincidence Matrices. Click on OK.
  9. Run the stream to the Analysis node. Notice that the coincidence matrix (confusion matrix) for $N-TARGET_B has no predictions with value = 1, but the coincidence matrix for the second model, the one adjusted by step 7 ($N1-TARGET_B), has more than 30 percent of the records labeled as value = 1.

How it works...

Classification algorithms do not generate categorical predictions; they generate probabilities, likelihoods, or confidences. For this data set, the target variable, TARGET_B, has two values: 1 and 0. The classifier output from any classification algorithm will be a number between 0 and 1. To convert the probability to a 1 or 0 label, the probability is thresholded, and the default in Modeler (and most predictive analytics software) is a threshold of 0.5. This recipe changes that default threshold to the prior probability.

The proportion of TARGET_B = 1 values in the data is 5.1 percent, making this the classic imbalanced target variable problem. One solution to this problem is to resample the data so that the proportion of 1s and 0s is equal, normally achieved through use of the Balance node in Modeler. One can create the Balance node by running a Distribution node for TARGET_B and using the Generate | Balance node (reduce) option. The justification for balancing the sample is that, if one doesn't do it, all the records will be classified with value = 0.

The reason for all the classification decisions having value 0 is not because the Neural Network isn't working properly. Consider the histogram of predictions from the Neural Network shown in the following screenshot. Notice that the maximum value of the predictions is less than 0.4, but the center of density is about 0.05. The actual shape of the histogram and the maximum predicted value depend on the Neural Network; some may have maximum values slightly above 0.5.

If the threshold for the classification decision is set to 0.5, since no neural network predicted confidence is greater than 0.5, all of the classification labels will be 0. However, if one sets the threshold to the TARGET_B prior probability, 0.051, many of the predictions will exceed that value and be labeled as 1. We can see the result of the new threshold by color-coding the histogram of the previous figure with the new class label, in the following screenshot.

This recipe used a Filler node to modify the existing predicted target value. The categorical prediction being changed, $N1-TARGET_B, comes from the copied Neural Network model. The $ fields are special field names that are used automatically by the Analysis node and Evaluation node. It is possible to construct one's own $ fields with a Derive node, but it is safer to modify the one that is already in the data.
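
The essence of the adjustment can be sketched in a few lines of Python on simulated propensities; the score distribution below is invented, chosen only to mimic a propensity histogram that tops out well below 0.5.

# Sketch on simulated scores: relabel a record as 1 when its propensity meets
# or exceeds the prior probability of the target, instead of the default 0.5.
import numpy as np

rng = np.random.default_rng(0)
prior = 0.051                              # proportion of TARGET_B = 1 in the data
propensity = rng.beta(1, 18, size=10_000)  # invented scores that top out well below 0.5

labels_default = propensity >= 0.5         # default cut-off: almost everything is 0
labels_adjusted = propensity >= prior      # prior as cut-off: many records become 1

print("predicted 1s at the 0.5 threshold:  ", int(labels_default.sum()))
print("predicted 1s at the prior threshold:", int(labels_adjusted.sum()))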

There's more...

The same procedure defined in this recipe works for other modeling algorithms as well, including logistic regression. Decision trees are a different matter. Consider the following screenshot. This result, showing that the C5 tree didn't split at all, is a consequence of the imbalanced target variable.

Rather than balancing the sample, there are other ways to get a tree built. For C&RT or Quest trees, go to the Build Options, select the Costs & Priors item, and select Equal for all classes to use equal priors. This option forces C&RT to treat the two classes mathematically as if their counts were equal. It is equivalent to running the Balance node to boost samples so that there are equal numbers of 0s and 1s. However, it does so without adding records to the data (which would slow down training); equal priors is purely a mathematical reweighting.

The C5 tree doesn't have the option of setting priors. An alternative, one that will work not only with C5 but also with C&RT, CHAID, and Quest trees, is to change the Misclassification Costs so that the cost of classifying a one as a zero is 20, approximately the ratio of the 95 percent 0s to 5 percent 1s.
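
The value 20 comes straight from the class proportions:

cost(classifying a 1 as a 0) ≈ P(TARGET_B = 0) / P(TARGET_B = 1) ≈ 0.95 / 0.05 = 19 ≈ 20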


Combining generated filters

When building a predictive model, if many data fields are available to use as inputs to the model, then reducing the number of inputs can lead to better, simpler and easier-to-use models. Fields or features can be selected in a number of ways: by using business and data knowledge, by analysis to select individual fields that have a relation to the predictive target, and by using other models to select features whose relevance is more multivariate in nature.

In a Modeler stream, selections of fields are usually represented by Filter nodes. If multiple selections from the same set of fields have been produced, for example by generating Filter nodes from different models, then it is useful to combine these filters. Filters can be combined in different ways; for example, if we wish to select only the fields that were selected by both models, then the filters are placed in sequence. If we wish to select all the fields that were selected by either model, then a different technique is required.

This recipe shows how to combine two Filter nodes, in this example each generated from a different model, to produce a new filter that selects all the fields that were selected in either of the original filters.

Getting ready

This recipe uses the datafile cup98LRN.txt and the stream file Combining_Generated_Filters.str.

How to do it...

To combine generated filters:

  1. Open the stream Combining_Generated_Filters.str by navigating to File | Open Stream.
  2. Edit the Type node; you can see the shape of the data by clicking on Preview in the edit dialogue. The Type node specifies 324 input fields and one target field for modeling. These modeling roles specified by the Type node will be used for all model building in this stream, and the models will be used to generate filter nodes.
  3. Run the Distribution node Target_B. In the raw data, the target field is mostly zeros, so a Balance node has been used to select a more balanced sample for modeling (shown in the following screenshot). This step also fills the cache on the Balance node so that the same sample will be used for all the models.

    Note that random selection by the Balance node means that the stream will not do exactly the same thing when run again; the models and therefore the generated filters may be slightly different, but the principles remain the same.

  4. Edit the Filter node TARGET_B T to the left of the stream. This Filter node was generated from the CHAID decision tree TARGET_B; as shown in the following screenshot, the filter selects 34 out of the available 325 fields, including the target variable.

  5. Edit the Filter node TARGET_B N to the left of the stream. This Filter node was generated from the Neural Network model TARGET_B; as shown in the following screenshot, the filter selects 21 out of the available 325 fields, including the target variable and the top 20 fields used by the Neural Network.

  6. Edit the Filter node TARGET_B T toggled in the branch with 3 Filter nodes to the right of the stream. This is a copy of the Filter node TARGET_B T in which all fields have been toggled; that is, fields that were on are switched off, and those that were off are switched on. It is important that this is done using the Filter options menu and the option Toggle All Fields. The Filter node dialogue is shown in the following screenshot; note that it is the inverse of the filter shown in the screenshot in step 4: instead of 34 fields output, 34 fields are filtered.

  7. Edit the Filter node TARGET_B N toggled in the branch with 3 Filter nodes to the right of the stream. This is a copy of the Filter node TARGET_B N that has been connected in sequence with TARGET_B T toggled and all fields have been toggled, as in step 6. Again it is important that this was done using the Filter options menu and the option Toggle All Fields. The Filter node dialogue is shown in the following screenshot. Again this is the inverse of the filter shown in the screenshot in step 5 but with a slightly less obvious relationship; it filters 18 fields instead of 21 because some of the relevant fields have already been filtered out by the previous node.

  8. Edit the Filter node New Filter on the far right of the stream. This is a new Filter node that has been connected in sequence with the two toggled Filter nodes and then has itself been toggled in the same way; again it is important that this is done using the Filter options menu. All the remaining fields are filtered out, but because it is a new Filter node, it holds no information about fields that were filtered out by the previous filters in the sequence.

  9. Edit the Filter node New Filter at the lowest edge of the stream; this is a copy of the New Filter node examined in step 8, but reconnected to the original set of fields. Viewed in this context, this node now makes sense: it outputs 52 fields, all of those output by the two generated filters. It does not output 55 fields (34 plus 21) because there is a slight overlap between the two generated filters.

How it works...

This technique can be viewed as performing Boolean logic on arrays or vectors of Boolean values represented by Filter nodes. Under this view, placing filters in sequence acts like a Boolean conjunction (AND), allowing a field to pass through (switched on or true) only if it is switched on in both of the original filters. Toggling a filter provides the equivalent of a Boolean negation or NOT. We want to construct a Boolean disjunction (OR), which will allow a field to pass through if it was true in either of the original filters. We use the equivalence:

p OR q = NOT( NOT(p) AND NOT(q) )

First we negate (toggle) each of the filters, then we conjoin (sequence) them, then we negate (toggle) the result in a new filter, producing the equivalent disjunction in the new filter.
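
The following is a compact Python model of the technique, treating each Filter node as a Boolean map over the same field names (True means the field passes); the field names and filter contents are illustrative, not taken from the stream.

# Toy model of the technique: toggling is NOT, placing filters in sequence is
# AND, and the final toggle turns the conjunction into the OR we want.
fields = ["AGE", "INCOME", "RFA_2", "LASTGIFT", "STATE"]   # illustrative names

chaid_filter  = {"AGE": True,  "INCOME": True,  "RFA_2": False, "LASTGIFT": False, "STATE": True}
neural_filter = {"AGE": False, "INCOME": True,  "RFA_2": True,  "LASTGIFT": False, "STATE": False}

def toggle(f):            # the Filter options menu's Toggle All Fields
    return {k: not v for k, v in f.items()}

def sequence(f, g):       # filters in sequence: a field survives only if both pass it
    return {k: f[k] and g[k] for k in f}

combined = toggle(sequence(toggle(chaid_filter), toggle(neural_filter)))   # p OR q
assert combined == {k: chaid_filter[k] or neural_filter[k] for k in fields}
print([k for k, v in combined.items() if v])   # fields selected by either model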

This recipe has emphasized the importance of using the option to Toggle All Fields from the Filter options menu. This is important because there is another, more obviously accessible operation in the Filter node dialogue, the button whose tooltip is Remove fields by default. Using this button appears to have the same effect as toggling, but the semantics of the operation are different; this has the undesired effect that the final Filter node does not output any fields, even when reconnected to the original set of fields.

There's more...

This technique could be used to combine more than two filters in exactly the same way, only using a longer sequence of toggled filters. In all cases, the result is produced by adding a new filter at the end of the sequence and toggling it.

The technique could also be generalized to situations where other Boolean operations are required. For example, we might use it to test whether one filter is a superset of another by treating this as a Boolean implication.
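
Continuing the toy Python model above, the superset test reads as the implication "b implies a" (equivalently, NOT b OR a) holding for every field; the function below is a sketch of that idea, not something the Filter node computes directly.

def is_superset(filter_a, filter_b):
    # filter_a selects a superset of filter_b's fields if (b implies a) for every field
    return all((not filter_b[k]) or filter_a[k] for k in filter_a)

# is_superset(chaid_filter, neural_filter) -> False for the illustrative filters above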

This technique was developed for a project in which a large number of fields represented topological features in organic molecules. Because the number of fields was too large to manipulate individual field selections by hand, it was necessary to find semi-automated ways to manipulate filter nodes in order to control field selection en masse.

Summary

In this article, we have had a taste of the IBM SPSS Modeler Cookbook. We have learned how to use the Feature Selection node creatively, how to generate next-best offers for large datasets, how to correct a confusion matrix for an imbalanced target variable by incorporating priors, and, lastly, how to combine generated filters.


About the Author


Keith McCormick

Keith McCormick is the Vice President and General Manager of QueBIT Consulting's Advanced Analytics team. He brings a wealth of consulting/training experience in statistics, predictive modeling and analytics, and data mining. For many years, he has worked in the SPSS community, first as an External Trainer and Consultant for SPSS Inc., then in a similar role with IBM, and now in his role with an award winning IBM partner. He possesses a BS in Computer Science and Psychology from Worcester Polytechnic Institute.

He has been using statistical software tools since the early '90s, and has been training since 1997. He has been doing data mining and using IBM SPSS Modeler since its arrival in North America in the late '90s. He is an expert in IBM's SPSS software suite, including IBM SPSS Statistics, IBM SPSS Modeler (formerly Clementine), AMOS, Text Mining, and Classification Trees. He is active as a moderator and participant in statistics groups online, including LinkedIn's Statistics and Analytics Consultants Group. He also blogs and reviews related books at KeithMcCormick.com. He enjoys hiking in out-of-the-way places, finding unusual souvenirs while traveling overseas, exotic foods, and old books.


