Clustering

Cluster analysis is a family of classification techniques for finding groups in data when neither the number of groups nor the assignment of objects to groups is known at the start. The object is typically a case (data row), although it can be a variable. This makes cluster analysis a type of unsupervised learning, meaning that the data consists of inputs with no target variable. Since you are not aiming to predict or explain a target variable, you cannot turn to the measures of model performance used in predictive modeling, such as classification accuracy or percent of variance explained.

Some researchers have contended that the idea of a cluster is ill-defined. However, most sources suggest that clusters are groupings of objects that can be understood in terms of internal cohesion (homogeneity) and external separation. Cluster analysis has been used in market research...

Overview of cluster analysis

Cluster analysis is generally done in a series of steps. Here are things to consider in a typical cluster analysis:

  • Objects to cluster: What are the objects? Typically, they should be representative of the cluster structure presumed to be present. Also, they should be randomly sampled if generalization to a population is required.
  • Variables to be used: The input variables are the basis on which clusters are formed. Popular clustering techniques assume that the variables are numeric in scale, although you might work with binary data or a mix of numeric and categorical data.
  • Missing values: Typically, you begin with a flat file of objects in rows and variables in columns. In the presence of missing data, you might either delete the case or impute the missing value, while special clustering methods might allow other ways of handling missing data.
  • Scale the data...
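
For the scaling step, a common approach in SPSS Statistics is to save standardized (z-score) versions of the inputs with DESCRIPTIVES before clustering, since variables measured on very different scales can dominate the distance calculations. The sketch below is minimal, and the variable names are placeholders rather than names from this chapter's data:

* Save z-score versions (Zvar1, Zvar2, Zvar3) of the clustering inputs.
* var1, var2, and var3 are placeholder names for your own variables.
DESCRIPTIVES VARIABLES=var1 var2 var3
/SAVE
/STATISTICS=MEAN STDDEV MIN MAX.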

Overview of SPSS Statistics cluster analysis procedures

SPSS Statistics offers three clustering procedures: CLUSTER, QUICK CLUSTER, and TWOSTEP CLUSTER.

CLUSTER produces hierarchical clusters of items based on distance measures of dissimilarity or similarity. The items being clustered are usually rows in the active dataset, and the distance measures are computed from the row values for the input variables. Hierarchical clustering produces a set of cluster solutions, from a starting situation where each case is its own cluster of size one to an ending situation where all cases are in one cluster. Case-to-case distance is unambiguous, but case-to-cluster and cluster-to-cluster distance can be defined in different ways, so there are multiple methods of agglomeration, which is the bringing together of objects or clusters.

This form of clustering is called hierarchical because cluster...

Hierarchical cluster analysis example

The example data is the USA violent crime data previously analyzed in the Principal components analysis section of Chapter 14, Principal Components and Factor Analysis. Recall that the data consists of state-level data for the 50 states of the USA plus the District of Columbia. The data came from the year 2014, the most recent year available on our source website. For a full description of the data, see Chapter 14, Principal Components and Factor Analysis.

The goal is to use the seven crime rate variables as inputs in a hierarchical cluster analysis. The variables are:

  • MurderandManslaughterRate
  • RevisedRapeRate
  • RobberyRate
  • AggravatedAssaultRate
  • BurglaryRate
  • Larceny_TheftRate
  • MotorVehicleTheftRate

The overall problem size is small. The data is complete; there is no missing data. We are primarily interested in description, and there is...
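
As a preview of the syntax involved, a hierarchical run on these inputs might look like the sketch below. The settings shown (Ward's method on squared Euclidean distances, plus saving a three-cluster membership variable) are illustrative assumptions, not necessarily the choices made in the analysis that follows:

* Hierarchical clustering of the states on the seven crime rate variables.
CLUSTER MurderandManslaughterRate RevisedRapeRate RobberyRate
  AggravatedAssaultRate BurglaryRate Larceny_TheftRate MotorVehicleTheftRate
/METHOD WARD
/MEASURE=SEUCLID
/PRINT SCHEDULE
/PLOT DENDROGRAM VICICLE
/SAVE CLUSTER(3).

The agglomeration schedule, dendrogram, and vertical icicle plot requested here are the usual aids for judging how many clusters to retain.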

K-means cluster analysis example

The example data includes 272 observations on two variables--eruption time in minutes and waiting time for the next eruption in minutes--for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. This data is available in many places, including the freeware R program.

An original source is Hardle, W. (1991) Smoothing Techniques with Implementation in S. New York: Springer.

One reason that this data is featured in examples is that charts reveal that the observations on each input are clearly bimodal. For this reason, we use it to illustrate K-means clustering with two clusters specified.
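
A K-means run on the geyser data might look like the following sketch. The variable names eruptions and waiting are assumptions about how the two columns are labeled in your file, and the iteration limit is illustrative:

* K-means with two clusters on the two geyser measurements.
QUICK CLUSTER eruptions waiting
/MISSING=LISTWISE
/CRITERIA=CLUSTER(2) MXITER(20) CONVERGE(0)
/METHOD=KMEANS(NOUPDATE)
/SAVE CLUSTER
/PRINT INITIAL ANOVA CLUSTER.

The /SAVE CLUSTER subcommand adds a cluster membership variable to the active dataset, which is handy for the profiling step.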

Our analysis proceeds as usual:

  • Descriptive analysis
  • Cluster analysis
  • Cluster profiling

Descriptive analysis

...

Twostep cluster analysis example

For this example, we return to the USA states violent crime data example. Recall that TWOSTEP CLUSTER offers an automatic method for selecting the number of clusters, as well as a Likelihood distance measure. We will run it to show some of the visuals in the model viewer output.

The approach here is to:

  1. First run TWOSTEP CLUSTER in automatic mode to identify a tentative number of clusters.
  2. Then run TWOSTEP CLUSTER again with a specified number of clusters.

Here is the SPSS code for the first run:

TWOSTEP CLUSTER
/CONTINUOUS VARIABLES=MurderR RRapeR RobberyR AssaultR BurglaryR LarcenyR VehicleTheftR
/DISTANCE Likelihood
/NUMCLUSTERS AUTO 15 BIC
/HANDLENOISE 0
/MEMALLOCATE 64
/CRITERIA INITTHRESHOLD(0) MXBRANCH(8) MXLEVEL(3)
/VIEWMODEL DISPLAY=YES
/PRINT IC COUNT SUMMARY.

Here are comments on the SPSS code:

  • In a step not shown, the variable names...
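
The second step listed earlier reruns TWOSTEP CLUSTER with a fixed number of clusters. A minimal sketch is shown below; the choice of three clusters is purely an assumption for illustration, since the actual number should be based on the first run's output:

* Second run: fix the number of clusters (3 here is only a placeholder).
TWOSTEP CLUSTER
/CONTINUOUS VARIABLES=MurderR RRapeR RobberyR AssaultR BurglaryR LarcenyR VehicleTheftR
/DISTANCE Likelihood
/NUMCLUSTERS FIXED(3)
/HANDLENOISE 0
/MEMALLOCATE 64
/CRITERIA INITTHRESHOLD(0) MXBRANCH(8) MXLEVEL(3)
/VIEWMODEL DISPLAY=YES
/PRINT COUNT SUMMARY.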

Summary

SPSS Statistics offers three procedures for cluster analysis.

The CLUSTER procedure performs hierarchical clustering. Hierarchical clustering starts with the casewise proximities matrix and combines cases and clusters into clusters using one of seven clustering methods. The agglomeration schedule, dendrogram, and icicle plots are aids to identifying the tentative number of clusters. Consider using CLUSTER when you are unsure of the number of clusters at the start and are willing to compute the proximities matrix.

The QUICK CLUSTER procedure performs K-means clustering, which requires specification of an explicit tentative number of clusters. K-means clustering avoids forming the proximities matrix, along with all the steps of agglomeration, and so it can be used on files with many cases. K-means clustering is not invariant to scaling and, furthermore, can impose a spherical structure...
