
You're reading from Data Analysis with IBM SPSS Statistics

Product type: Book
Published in: Sep 2017
Publisher: Packt
ISBN-13: 9781787283817
Edition: 1st
Authors (2):
Ken Stehlik-Barry

Kenneth Stehlik-Barry, PhD, joined SPSS as Manager of Training in 1980 after using SPSS for his own research for several years. Working with others at SPSS, including Anthony Babinec, he developed a series of courses related to the use of SPSS and taught these courses to numerous SPSS users. He also managed the technical support and statistics groups at SPSS. Along with Norman Nie, the founder of SPSS and Jane Junn, a political scientist, he co-authored Education and Democratic Citizenship. Dr. Stehlik-Barry has used SPSS extensively to analyze data from SPSS and IBM customers to discover valuable patterns that can be used to address pertinent business issues. He received his PhD in Political Science from Northwestern University and currently teaches in the Masters of Science in Predictive Analytics program there.

Anthony Babinec

Anthony J. Babinec joined SPSS as a Statistician in 1978 after assisting Norman Nie, SPSS founder, in a research methods class at the University of Chicago. Anthony developed SPSS courses and trained many SPSS users. He also wrote many examples found in SPSS documentation and worked in technical support. Anthony led a business development effort to find products implementing then-emerging new technologies such as CHAID decision trees and neural networks and helped SPSS customers successfully apply them. Anthony uses SPSS in consulting engagements and teaches IBM customers how to use its advanced features. He received his BA and MA in Sociology with a specialization in Advanced Statistics from the University of Chicago and teaches classes at the Institute for Statistics Education. He is on the Board of Directors of the Chicago Chapter of the American Statistical Association, where he has served in different positions including President.


Discriminant Analysis

Discriminant analysis is a statistical technique used in classification. In general, a classification problem features a categorical target variable with two or more known classes and one or more inputs to be used in the classification. Discriminant analysis assumes that the inputs are numeric (scale) variables, although practitioners often employ discriminant analysis when the inputs are a mixture of numeric and categorical variables. To use categorical variables as inputs in SPSS Statistics Discriminant, you must employ dummy variable coding. If your inputs are exclusively categorical, you might consider using logistic regression instead.
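Dummy variable coding turns a k-level categorical input into k-1 numeric indicator variables. As a minimal sketch (in Python rather than SPSS syntax, using a hypothetical three-level region variable, not one from this chapter's data):

```python
# Sketch: dummy (indicator) coding for a hypothetical categorical input
# "region" with three levels. One level is omitted as the reference
# category, so k levels become k-1 indicator variables.
regions = ["north", "south", "central", "south", "north"]
levels = sorted(set(regions))          # ['central', 'north', 'south']
reference = levels[0]                  # 'central' is the omitted reference level
dummies = [
    {f"is_{lvl}": int(r == lvl) for lvl in levels if lvl != reference}
    for r in regions
]
print(dummies[0])  # {'is_north': 1, 'is_south': 0}
```

A case from the reference level ("central") simply gets zeros on all the indicators.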

A classic example where discriminant analysis could be used is the oft-cited Fisher Iris data. A botanist approached the great statistician and geneticist R. A. Fisher with a classification problem. He had four...

Descriptive discriminant analysis

One purpose of discriminant analysis is description--finding a way to separate and characterize the three species in terms of differences on the classifying variables. In the Iris data, Fisher saw that size matters--members of a certain species tend to have larger values on dimensional measurements such as petal length and width and sepal length and width. In addition, there was another pattern--members of a certain species that otherwise had small values on three of the indicators had relatively large sepal widths. Taking both of these patterns into account, one can classify irises with great accuracy as well as understand what characterizes exemplars of each species.

In descriptive discriminant analysis, you would report and focus on summary statistics within groups such as means, standard...

Predictive discriminant analysis

A second purpose of discriminant analysis is prediction--developing equations such that, when you plug in the input values for a new individual or object, the equations classify it into one of the target classes.

In modern predictive analytics, discriminant analysis is one of a large number of techniques that could be used in classification. The reason that so many classification techniques exist is that no method dominates the others across all problems and data. Typically, in a project, you might try a number of approaches and compare and contrast their performance on the data. A statistical method such as discriminant analysis could be one of these methods. In the event that the data meet the assumptions of discriminant analysis, it should perform well. As discriminant analysis is an equation-based method, the...

Assumptions underlying discriminant analysis

When using discriminant analysis, you make the following assumptions:

  • Independence of the observations. This rules out correlated data such as multilevel data, repeated measures data, or matched pairs data.
  • Multivariate normality within groups. Strictly speaking, the presence of any categorical inputs can make this assumption untenable. Nonetheless, discriminant analysis can be robust to violations of this assumption.
  • Homogeneity of covariances across groups. You can assess this assumption using the Box's M test.
  • Absence of perfect multicollinearity. A given input cannot be perfectly predicted by a combination of other inputs also in the model.
  • The number of cases within each group must be larger than the number of input variables.
IBM SPSS Statistics gives you statistical and graphical tools to assess the normality assumption...
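The last assumption in the list is easy to check mechanically. A minimal Python sketch, using the Wine class sizes reported in the next section (59, 71, and 48 cases) and its 13 chemical inputs:

```python
# Check that every group contains more cases than there are input
# variables. Counts are the Wine data class sizes; 13 is the number
# of chemical attributes used as inputs.
group_sizes = {1: 59, 2: 71, 3: 48}
n_inputs = 13
ok = all(n > n_inputs for n in group_sizes.values())
print(ok)  # True -- each group exceeds the input count
```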

Example data

The data analyzed in this chapter is the Wine dataset found in the UC-Irvine Machine Learning repository. The data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 chemical components found in each of the three types of wine. There are 59, 71, and 48 instances respectively in the three classes. The class codes are 1, 2, and 3.

The attributes are as follows:

  • Alcohol
  • Malic acid
  • Ash
  • Alcalinity of ash
  • Magnesium
  • Total phenols
  • Flavanoids
  • Nonflavanoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline

In the context of classification, the task is to use the 13 attributes to classify each observation into one of the three wine types. Note that all 13 attributes are numeric.

Statistical and graphical summary of the data

There are many exploratory analyses that you can undertake at this point. Here, we show a simple table of means as well as a scatterplot matrix that reveals the group structure.

Here are the group means on the attributes:

These statistics are presented for descriptive purposes. You are looking for overt differences in the means across the three types. If an input's means vary across type, then this suggests that the variable might be a useful discriminator. On the other hand, do not make too much of apparent differences, as these are single-variable statistics. Discriminant analysis brings all of the inputs into the model at the same time, so it assesses each variable's impact in the presence of the other variables.
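A table of group means like the one above can be reproduced with a few lines of code. A minimal sketch, using made-up illustrative values rather than the actual Wine figures:

```python
# Sketch: per-group means for two attributes. The rows here are
# invented for illustration; the real values come from the Wine data.
from collections import defaultdict

rows = [
    {"type": 1, "alcohol": 14.2, "proline": 1065},
    {"type": 1, "alcohol": 13.2, "proline": 1050},
    {"type": 2, "alcohol": 12.4, "proline": 520},
    {"type": 3, "alcohol": 13.2, "proline": 630},
]
sums = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for row in rows:
    counts[row["type"]] += 1
    for attr in ("alcohol", "proline"):
        sums[row["type"]][attr] += row[attr]
means = {g: {a: sums[g][a] / counts[g] for a in sums[g]} for g in sums}
print(round(means[1]["alcohol"], 2))  # 13.7
```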

You might also try various charts. Here is a scatterplot matrix showing five of the attributes:

The...

Discriminant analysis setup - key decisions

You can run the Discriminant procedure either from the menus or via syntax. When running discriminant analysis, you must make several higher-level decisions about the analysis.

Priors

First, do you have any prior information about the relative sizes of the target variable classes in the population? In the absence of any knowledge of target class sizes, you can use equal prior probabilities, which is the default. Alternatively, you can set prior probabilities proportional to the target class sizes in the data, or you can specify your own list of target class prior probabilities, which must sum to 1. Prior probabilities are used during classification. For more...
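The three choices can be illustrated with the Wine class sizes reported later in the chapter (59, 71, and 48 cases). A minimal Python sketch:

```python
# Sketch: the three ways of setting priors, for the Wine class sizes.
# Whatever the choice, the list of probabilities must sum to 1.
sizes = [59, 71, 48]
n = sum(sizes)  # 178 cases in total

equal_priors = [1 / 3] * 3                    # the default: equal priors
proportional_priors = [s / n for s in sizes]  # proportional to class sizes
custom_priors = [0.5, 0.3, 0.2]               # user-specified, sums to 1

print([round(p, 3) for p in proportional_priors])  # [0.331, 0.399, 0.27]
```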

Examining the results

Running the syntax produces a lot of output. Here, we highlight and comment on some of the results.

Here is the Analysis Case Processing Summary:

The summary reports on cases missing for various reasons:

  • Missing or out-of-range group codes
  • At least one missing discriminating variable
  • Both missing or out-of-range group codes and at least one missing discriminating variable

In our analysis, the data is complete.

Here are the Tests of Equality of Group Means:

The standard statistical test for the equality of means across three or more groups is the F test. The table considers each variable one at a time. Inspection of the table shows that each variable is statistically significant, meaning that the means of each variable differ somewhere across the three wine types. A smaller Wilks' Lambda is associated with a larger F. Judging by...
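The F statistic in this table is the usual one-way ANOVA F: between-group variability divided by within-group variability, each on its degrees of freedom. A minimal sketch of the computation, with illustrative numbers rather than the actual Wine output:

```python
# Sketch: one-way ANOVA F for a single input across three groups.
# The values are invented for illustration, not taken from the Wine data.
groups = [
    [14.2, 13.9, 14.1],   # type 1
    [12.3, 12.6, 12.1],   # type 2
    [13.1, 13.4, 13.0],   # type 3
]
k = len(groups)                                # number of groups
n = sum(len(g) for g in groups)                # total cases
grand = sum(sum(g) for g in groups) / n        # grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_stat > 1)  # True -- a large F means the group means differ
```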

Scoring new observations

After you have developed and evaluated a model based on historical data, you can apply the model to new data in order to make predictions. In predictive analytics, this is called scoring. You score cases for which the outcome is not yet known. Your evaluation of the historical data gives you a sense of how the model is likely to perform in the new situation.

One way to implement scoring is to make use of the classification function coefficients. Here is the syntax in which the classification function coefficients are used in compute:

compute cf1=57.351*alcohol+.854*malic_acid+39.031*ash
-.662*ash_alcalinity+.502*magnesium-3.261*total_phenols
+3.579*flavanoids+39.626*nonflavanoid_phenols+1.243*proanthocyanins
-3.988*color_intensity+27.600*hue+22.527*dilution
+.021*proline-523.443.
compute cf2=52.373*alcohol+.134*malic_acid+28.029*ash
+.465*ash_alcalinity...
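The same scoring logic can be expressed compactly: each class has a linear classification function, and a case is assigned to the class whose function yields the largest score. A Python sketch, where the cf1 coefficients come from the COMPUTE statement above but the score and classify helpers are illustrative names of my own (the cf2 and cf3 coefficient dictionaries would be filled in from the Classification Function Coefficients table):

```python
# Coefficients for class 1, taken from the COMPUTE statement above.
cf1 = {"const": -523.443, "alcohol": 57.351, "malic_acid": 0.854,
       "ash": 39.031, "ash_alcalinity": -0.662, "magnesium": 0.502,
       "total_phenols": -3.261, "flavanoids": 3.579,
       "nonflavanoid_phenols": 39.626, "proanthocyanins": 1.243,
       "color_intensity": -3.988, "hue": 27.600, "dilution": 22.527,
       "proline": 0.021}

def score(case, coefs):
    """Linear classification function: constant plus sum of coef * value."""
    return coefs["const"] + sum(coefs[name] * value
                                for name, value in case.items())

def classify(case, functions):
    """Assign the case to the class whose function score is largest."""
    return max(functions, key=lambda cls: score(case, functions[cls]))
```

With cf2 and cf3 supplied, a call like classify(case, {1: cf1, 2: cf2, 3: cf3}) reproduces the compute-and-compare logic of the syntax in a single step.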

Summary

Discriminant analysis is a standard statistical approach to classification. Here are the takeaways from the presentation of discriminant analysis on the Wine data:

  • Discriminant analysis makes assumptions of multivariate normality within groups and homogeneity of covariance matrices across groups. You can use both the Discriminant procedure and IBM SPSS Statistics more generally to assess these assumptions.
  • As the analyst, you must make decisions regarding prior probabilities, whether to classify based on pooled or separate covariance matrices, and what dimensionality best represents the data.
  • The classification results table shows you overall classification accuracy and classification accuracy by class. You should assess accuracy not only on the training data, but also via leave-one-out analysis or cross-validation via the /SELECT subcommand.
  • The standardized canonical discriminant...
