Reader small image

You're reading from  Mastering Predictive Analytics with Python

Product typeBook
Published inAug 2016
Reading LevelIntermediate
Publisher
ISBN-139781785882715
Edition1st Edition
Languages
Right arrow
Author (1)
Joseph Babcock
Joseph Babcock
author image
Joseph Babcock

Joseph Babcock has spent more than a decade working with big data and AI in the e-commerce, digital streaming, and quantitative finance domains. Through his career he has worked on recommender systems, petabyte scale cloud data pipelines, A/B testing, causal inference, and time series analysis. He completed his PhD studies at Johns Hopkins University, applying machine learning to the field of drug discovery and genomics.
Read more about Joseph Babcock

Right arrow

Similarity and distance metrics


The first step in clustering any new dataset is to decide how to compare the similarity (or dissimilarity) between items. Sometimes the choice is dictated by what kinds of similarity we are most interested in trying to measure, in others it is restricted by the properties of the dataset. In the following sections we illustrate several kinds of distance for numerical, categorical, time series, and set-based data—while this list is not exhaustive, it should cover many of the common use cases you will encounter in business analysis. We will also cover normalizations that may be needed for different data types prior to running clustering algorithms.

Numerical distance metrics

Let us begin by exploring the data in the wine.data file. It contains a set of chemical measurements describing the properties of different kinds of wines, and the quality level (I-III) to which the wines are assigned (Forina, M., et al. PARVUS An Extendible Package for Data Exploration. Classification...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Mastering Predictive Analytics with Python
Published in: Aug 2016Publisher: ISBN-13: 9781785882715

Author (1)

author image
Joseph Babcock

Joseph Babcock has spent more than a decade working with big data and AI in the e-commerce, digital streaming, and quantitative finance domains. Through his career he has worked on recommender systems, petabyte scale cloud data pipelines, A/B testing, causal inference, and time series analysis. He completed his PhD studies at Johns Hopkins University, applying machine learning to the field of drug discovery and genomics.
Read more about Joseph Babcock