You're reading from Data Analysis with IBM SPSS Statistics

Product type: Book
Published in: Sep 2017
Publisher: Packt
ISBN-13: 9781787283817
Edition: 1st Edition
Authors (2):
Ken Stehlik-Barry

Kenneth Stehlik-Barry, PhD, joined SPSS as Manager of Training in 1980 after using SPSS for his own research for several years. Working with others at SPSS, including Anthony Babinec, he developed a series of courses related to the use of SPSS and taught these courses to numerous SPSS users. He also managed the technical support and statistics groups at SPSS. Along with Norman Nie, the founder of SPSS, and Jane Junn, a political scientist, he co-authored Education and Democratic Citizenship. Dr. Stehlik-Barry has used SPSS extensively to analyze data from SPSS and IBM customers to discover valuable patterns that can be used to address pertinent business issues. He received his PhD in Political Science from Northwestern University and currently teaches in the Master of Science in Predictive Analytics program there.

Anthony Babinec

Anthony J. Babinec joined SPSS as a Statistician in 1978 after assisting Norman Nie, SPSS founder, in a research methods class at the University of Chicago. Anthony developed SPSS courses and trained many SPSS users. He also wrote many examples found in SPSS documentation and worked in technical support. Anthony led a business development effort to find products implementing then-emerging new technologies such as CHAID decision trees and neural networks and helped SPSS customers successfully apply them. Anthony uses SPSS in consulting engagements and teaches IBM customers how to use its advanced features. He received his BA and MA in Sociology with a specialization in Advanced Statistics from the University of Chicago and teaches classes at the Institute for Statistics Education. He is on the Board of Directors of the Chicago Chapter of the American Statistical Association, where he has served in different positions including President.

Aggregating and Restructuring Data

There are many instances in which the data provided initially needs to be changed before analysis can begin. Chapter 7, Creating New Data Elements, described a variety of SPSS capabilities for creating new variables using the transformation commands, and Chapter 8, Adding and Matching Files, dealt with the capabilities available to match and add files. This chapter builds on those two chapters by introducing the use of aggregation to create summary variables by calculating statistics such as the mean, sum, minimum, and maximum across a set of cases in the data. This information can be used to add fields for analytical purposes, and the aggregated file itself can be used to conduct investigations using a different unit of analysis.

The key topics that will be addressed in this chapter are as follows:

  • Adding aggregated fields back...

Using aggregation to add fields to a file

A dataset contains information that is readily evident in the fields themselves, but it also holds useful content that is only implicit in the data. Often, it is important to place specific values in a broader context to make them more meaningful. Personal income, for example, can be an important predictor in many situations, but comparing someone's income with the average income in their area provides a more nuanced view of their relative economic situation. Similarly, a student's score on a reading test can be compared to the national average, but it is also useful to compare it with the scores of other students in their district or school. A student's score may be only slightly above the national norm, yet it may be one of the top scores in their school, and the student could benefit from being included in an advanced reading program.
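The income comparison described above can be sketched outside SPSS in a few lines of Python. This is only an illustration of the idea, not the book's SPSS procedure: the records, the field names (`area`, `income`, `area_mean_income`, `income_ratio`), and the helper `add_area_mean` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: each person has an area code and a personal income.
people = [
    {"id": 1, "area": "A", "income": 40000},
    {"id": 2, "area": "A", "income": 60000},
    {"id": 3, "area": "B", "income": 90000},
    {"id": 4, "area": "B", "income": 110000},
    {"id": 5, "area": "B", "income": 100000},
]


def add_area_mean(rows):
    """Return copies of the rows with the area mean and a relative-income ratio added."""
    # Step 1: aggregate, collecting incomes per area and averaging them.
    incomes_by_area = defaultdict(list)
    for row in rows:
        incomes_by_area[row["area"]].append(row["income"])
    area_mean = {area: mean(values) for area, values in incomes_by_area.items()}

    # Step 2: add the aggregated value back to every original row.
    enriched_rows = []
    for row in rows:
        m = area_mean[row["area"]]
        enriched_rows.append(
            {**row, "area_mean_income": m, "income_ratio": row["income"] / m}
        )
    return enriched_rows


enriched = add_area_mean(people)
```

In this sketch, person 3 has a higher income than person 2 in absolute terms, but person 2 sits above their area average while person 3 sits below theirs, which is the kind of context the aggregated field is meant to expose.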

For the...

Aggregating up one level

For the next set of Aggregate examples, a dataset from the repository maintained by UC Irvine will be used. It contains order information from a UK-based online retailer, with eight fields and 541,909 rows, and can be downloaded in Excel format at the following link:

https://archive.ics.uci.edu/ml/datasets/Online+Retail

Source: Dr. Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Dataset Information: This is a transnational dataset with data from 37 countries that contains all the transactions occurring between December 1, 2010 and December 9, 2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasions gifts. Many customers of the company are wholesalers.
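As a rough picture of what aggregating up one level means for data shaped like this, the Python sketch below rolls line items up to one row per invoice. It is an illustration under assumptions rather than the book's SPSS steps: the sample rows are invented, only a subset of the dataset's fields is used, and the meaning given here to `itemcost_total` (sum of quantity times unit price) and `numproducts` (number of line items) is a guess based on the field names.

```python
# Hypothetical line items using a subset of the dataset's fields.
line_items = [
    {"InvoiceNo": "536365", "CustomerID": 17850, "Quantity": 6, "UnitPrice": 2.5},
    {"InvoiceNo": "536365", "CustomerID": 17850, "Quantity": 2, "UnitPrice": 3.0},
    {"InvoiceNo": "536366", "CustomerID": 13047, "Quantity": 10, "UnitPrice": 1.5},
]


def aggregate_to_invoice(items):
    """Collapse line items into one summary row per invoice."""
    invoices = {}
    for item in items:
        key = item["InvoiceNo"]
        # Create the invoice row the first time this invoice number is seen.
        inv = invoices.setdefault(
            key,
            {
                "InvoiceNo": key,
                "CustomerID": item["CustomerID"],
                "itemcost_total": 0.0,  # assumed: sum of Quantity * UnitPrice
                "numproducts": 0,       # assumed: count of line items
            },
        )
        inv["itemcost_total"] += item["Quantity"] * item["UnitPrice"]
        inv["numproducts"] += 1
    return list(invoices.values())


invoice_level = aggregate_to_invoice(line_items)
```

The break variable here is `InvoiceNo`; choosing a different break variable changes the unit of analysis of the resulting file.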

To read the data into SPSS, you...

Second level aggregation

While this invoice-level file might be of some use for analytics in its own right, it is more likely to be valuable as a means of gaining better insight into customer behavior. To obtain such insights, more preparation needs to be done before aggregating up to the customer level. Sorting the invoice file by CustomerID and date makes it possible to calculate the number of days between purchases, which could inform a wide range of marketing and promotional decisions. It could also help identify customers who appear to have been lost, based on a lack of activity relative to their typical purchase pattern.
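The sort-then-compare logic described above can be sketched in Python as follows. This is a minimal illustration with invented invoices, not the SPSS procedure itself; the output field `daysincepurchase` borrows a name that appears in the chapter's syntax, and the helper `days_between_purchases` is hypothetical.

```python
from datetime import date

# Hypothetical invoice-level rows, deliberately out of order.
invoices = [
    {"InvoiceNo": "A3", "CustomerID": 2, "InvoiceDate": date(2011, 3, 1)},
    {"InvoiceNo": "A1", "CustomerID": 1, "InvoiceDate": date(2011, 1, 10)},
    {"InvoiceNo": "A2", "CustomerID": 1, "InvoiceDate": date(2011, 1, 25)},
    {"InvoiceNo": "A4", "CustomerID": 1, "InvoiceDate": date(2011, 3, 1)},
]


def days_between_purchases(rows):
    """Sort by customer and date, then compare each row with the one before it."""
    ordered = sorted(rows, key=lambda r: (r["CustomerID"], r["InvoiceDate"]))
    result = []
    prior = None
    for row in ordered:
        # Only compare with the prior row when it belongs to the same customer.
        same_customer = prior is not None and prior["CustomerID"] == row["CustomerID"]
        gap = (row["InvoiceDate"] - prior["InvoiceDate"]).days if same_customer else None
        result.append({**row, "daysincepurchase": gap})
        prior = row
    return result


with_gaps = days_between_purchases(invoices)
```

A customer's first invoice has no prior purchase, so its gap is left as `None` here; how such cases should be treated before aggregating to the customer level is a separate decision.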

Preparing aggregated data for further use

To calculate the number of days between purchases, the Shift...

Matching the aggregated file back to find specific records

Another use of this file is to match it back to the invoice-level file to find the invoice corresponding to the customer's largest purchase during the time covered by this data. Starting with the invoice-level file as the active dataset, the following dialog box shows how to use the customer-level file as a lookup table to get the information associated with the largest invoice:

CustomerID serves as the match key in this one-to-many match. The * and + symbols indicate which file was the source for each of the variables in the resulting file.

The SPSS syntax to perform this match is shown here along with the transformation commands used to identify the largest invoice for each customer:

STAR JOIN
/SELECT t0.InvoiceNo, t0.InvoiceDate, t0.itemcost_total, t0.numproducts, t0.priorinvoicedate,
t0.priorcustID, t0.daysincepurchase...
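For readers who want to see the lookup logic without the SPSS dialog, the following Python sketch builds a customer-level table of largest invoice totals and matches it back to the invoice-level rows. It is not a translation of the truncated syntax above; the sample rows and the names `max_invoice_total`, `is_largest`, and `flag_largest_invoice` are hypothetical.

```python
# Hypothetical invoice-level rows.
invoice_rows = [
    {"InvoiceNo": "A1", "CustomerID": 1, "itemcost_total": 120.0},
    {"InvoiceNo": "A2", "CustomerID": 1, "itemcost_total": 340.0},
    {"InvoiceNo": "A3", "CustomerID": 2, "itemcost_total": 75.0},
    {"InvoiceNo": "A4", "CustomerID": 1, "itemcost_total": 90.0},
]


def flag_largest_invoice(rows):
    """Match a customer-level maximum back to each invoice and flag the largest."""
    # Customer-level lookup table: largest invoice total per customer.
    largest = {}
    for row in rows:
        cid = row["CustomerID"]
        largest[cid] = max(largest.get(cid, row["itemcost_total"]), row["itemcost_total"])

    # One-to-many match keyed on CustomerID, followed by the comparison.
    matched = []
    for row in rows:
        top = largest[row["CustomerID"]]
        matched.append(
            {**row, "max_invoice_total": top, "is_largest": row["itemcost_total"] == top}
        )
    return matched


flagged = flag_largest_invoice(invoice_rows)
largest_invoices = [r["InvoiceNo"] for r in flagged if r["is_largest"]]
```

If a customer has two invoices tied for the largest total, this sketch flags both of them.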

Restructuring rows to columns

There are situations in which the way the data is structured needs to be altered before analytics can be conducted. This is related to the unit of analysis required for certain types of statistical comparisons. In most cases, the fact that the organization of the original data will not work is obvious but the way to address the problem is not immediately evident. SPSS Statistics includes a feature to restructure data to address the most common types of challenges.

There are three basic types of data restructuring that can be performed in SPSS using the restructure data wizard accessible via the Restructure choice on the Data menu. These choices can be seen along with a brief description in the following screenshot. By comparing the structure of the data you have with the structure necessary for the analysis you want to perform, the best approach to...
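To make the rows-to-columns idea concrete, the Python sketch below turns several rows per ID into one row per ID with one column per occasion. It only illustrates the change in structure and is not the wizard's output; the data, the column naming pattern (`score.1`, `score.2`), and the helper `rows_to_columns` are assumptions made for this example.

```python
# Hypothetical long-format data: one row per student per testing wave.
long_rows = [
    {"id": 1, "wave": 1, "score": 50},
    {"id": 1, "wave": 2, "score": 58},
    {"id": 2, "wave": 1, "score": 70},
    {"id": 2, "wave": 2, "score": 65},
]


def rows_to_columns(rows, id_key, index_key, value_key):
    """Consolidate multiple rows per ID into a single row with one column per index value."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[id_key], {id_key: row[id_key]})
        record[f"{value_key}.{row[index_key]}"] = row[value_key]
    return list(wide.values())


wide_rows = rows_to_columns(long_rows, "id", "wave", "score")

# With both waves on the same row, a within-student change is a simple subtraction.
changes = {r["id"]: r["score.2"] - r["score.1"] for r in wide_rows}
```

The final line shows why the structure matters: the change between waves cannot be computed from a single row in the long layout, but it is a one-line calculation once the values sit side by side.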

Summary

This chapter focused on the various ways aggregation can be utilized to extract implicit information from the data and make it available for use in constructing derived fields that have the potential to yield deeper analytical insights. Adding aggregated variables back to the original dataset is a simple but powerful technique that supports the creation of fields better tailored for predictive modeling. Examples of one- and two-level aggregation were used to show how new datasets can be created to allow modeling at a different unit of analysis. Leveraging the high-level aggregations to identify key records in the original data was also demonstrated.

Finally, the data restructuring capabilities in SPSS Statistics were introduced using a basic rows-to-columns consolidation example that illustrated how restructuring allows calculations that would not otherwise be possible. With these data...
