You're reading from  Mastering Numerical Computing with NumPy

Product type: Book
Published in: Jun 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788993357
Edition: 1st Edition
Authors (3):
Umit Mert Cakmak

Umit Mert Cakmak is a data scientist at IBM, where he excels at helping clients solve complex data science problems, from inception to delivery of deployable assets. His research spans multiple disciplines beyond his industry and he likes sharing his insights at conferences, universities, and meet-ups.

Tiago Antao

Tiago Antao is a bioinformatician currently working in the field of genomics. A former computer scientist, Tiago moved into computational biology with an MSc in Bioinformatics from the Faculty of Sciences at the University of Porto (Portugal) and a PhD on the spread of drug-resistant malaria from the Liverpool School of Tropical Medicine (UK). Postdoctoral, Tiago has worked with human datasets at the University of Cambridge (UK) and with mosquito whole genome sequencing data at the University of Oxford (UK), before helping to set up the bioinformatics infrastructure at the University of Montana. He currently works as a data engineer in the biotechnology field in Boston, MA. He is one of the co-authors of Biopython, a major bioinformatics package written in Python.

Mert Cuhadaroglu

Mert Cuhadaroglu is a BI Developer in EPAM, developing E2E analytics solutions for complex business problems in various industries, mostly investment banking, FMCG, media, communication, and pharma. He consistently uses advanced statistical models and ML algorithms to provide actionable insights. Throughout his career, he has worked in several other industries, such as banking and asset management. He continues his academic research in AI for trading algorithms.

Clustering Clients of a Wholesale Distributor Using NumPy

So far, you have seen NumPy in action across various use cases. This chapter covers a different type of analysis. Clustering is an unsupervised learning technique used for understanding and capturing the various formations in your dataset. Because you don't have labels to supervise your learning algorithm, visualization is often the key, which is why you will also see various visualization techniques.

In this chapter, we will cover the following topics:

  • Unsupervised learning and clustering
  • Hyperparameters
  • Extending a simple algorithm to cluster the clients of a wholesale distributor

Unsupervised learning and clustering

Let's quickly review supervised learning with an example. When you train machine learning algorithms, you can observe and direct the learning by providing labels. Consider the following dataset, where each row represents a customer and each column represents a different feature, such as Age, Gender, Income, Profession, Tenure, and City. Take a look at this table:

You may want to perform different kinds of analysis on this data. One of them is churn analysis, which predicts which customers are likely to leave. To do that, you need to label each customer based on their history to indicate whether they have left or stayed, as displayed in this table:


Your algorithm will learn the characteristics of customers based on their labels. The algorithm will learn the characteristics of customers who left or stayed, and, when...

Hyperparameters

A hyperparameter can be considered a high-level parameter that determines one of the various properties of a model, such as complexity, training behavior, or learning rate. Hyperparameters differ from model parameters in that they need to be set before training starts.

For example, the k in k-means or k-nearest-neighbors is a hyperparameter for these algorithms. The k in k-means denotes the number of clusters to be found, and the k in k-nearest-neighbors denotes the number of closest records to be used to make predictions.
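To make the role of k concrete, here is a minimal sketch of a one-dimensional k-nearest-neighbors classifier in NumPy. The function name knn_predict and the toy training values are made up for illustration and do not come from the chapter. Note that k is passed in by the caller rather than learned from the data, and that changing it can change the prediction.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Predict a label for x_new by majority vote among its k nearest neighbors."""
    distances = np.abs(X_train - x_new)       # 1-D distance to every training record
    nearest = np.argsort(distances)[:k]       # indexes of the k closest records
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]          # most frequent label among neighbors

# Toy, hand-made training data (illustrative only)
X_train = np.array([1, 2, 3, 5, 9, 10, 11], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1, 1])

# The same query point gets a different label depending on the hyperparameter k
pred_k1 = knn_predict(X_train, y_train, 4.2, k=1)
pred_k3 = knn_predict(X_train, y_train, 4.2, k=3)
print(pred_k1, pred_k3)
```

With k=1 only the single closest record (5) votes, while with k=3 the two records from the other class outvote it.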

Tuning hyperparameters is a crucial step in any machine learning project to improve predictive performance. There are different techniques for tuning, such as grid search, randomized search, and Bayesian optimization, but these techniques are beyond the scope of this chapter.

Let's have a quick look at the k-means algorithm's parameters...

The loss function

The loss function helps algorithms update model parameters during training by measuring the error, which is an indication of predictive performance. The loss function is usually denoted as follows:

Where L measures the difference between the prediction and the actual value. During the training process, this error is minimized. Different algorithms have different loss functions, and the number of iterations will depend on convergence conditions.

For example, the loss function for k-means is the sum of the squared distances between each point and its closest cluster mean, and k-means minimizes it as follows:
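In its standard textbook form, the k-means objective is usually written as J = Σ_j Σ_{x_i ∈ C_j} ||x_i − μ_j||², where C_j is the set of points assigned to cluster j and μ_j is its mean. The sketch below computes this quantity for one-dimensional data with NumPy. The function name kmeans_loss and the two sets of candidate centers are assumptions chosen here for illustration.

```python
import numpy as np

def kmeans_loss(points, centers):
    """Sum of squared distances from each point to its closest center (1-D data)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcasting builds an (n_points, n_centers) matrix of squared distances
    sq_dist = (points[:, None] - centers[None, :]) ** 2
    # Each point contributes only the distance to its nearest center
    return sq_dist.min(axis=1).sum()

data = [1, 2, 3, 2, 1, 3, 9, 8, 11, 12, 10, 11, 14, 25, 26, 24, 30, 22, 24, 27]

loss_good = kmeans_loss(data, [2, 11, 25])   # centers placed near the visible groups
loss_poor = kmeans_loss(data, [1, 22, 26])   # two centers crowded into one group
print(loss_good, loss_poor)
```

Centers that sit near the three visible groups give a much smaller loss than centers that crowd into one group, which is the signal k-means uses to improve its centers.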

You will see a detailed implementation in the following section.

Implementing our algorithm for a single variable

Let's implement the k-means algorithm for a single variable. You will start with a one-dimensional vector that has 20 records, as shown here:

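# Imports assumed for this snippet (plotly's offline plotting mode)
import plotly.graph_objs as go
from plotly.offline import plot

# One-dimensional dataset with 20 records
data = [1, 2, 3, 2, 1, 3, 9, 8, 11, 12, 10, 11, 14, 25, 26, 24, 30, 22, 24, 27]

# Draw every record on the line y = 0 so that the groups are visible
trace1 = go.Scatter(
    x=data,
    y=[0 for x in data],
    mode='markers',
    name='Data',
    marker=dict(
        size=12
    )
)

layout = go.Layout(
    title='1D vector',
)

traces = [trace1]

fig = go.Figure(data=traces, layout=layout)

plot(fig)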

This outputs the following plot:

Our aim is to find the 3 clusters that are visible in the data. To start implementing the k-means algorithm, you need to initialize the cluster centers by choosing records from the data at random, as shown here:

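import numpy as np

X = np.array(data)  # the 1D vector defined earlier, as a NumPy array
n_clusters = 3

# Pick n_clusters records at random to serve as the initial cluster centers
c_centers = np.random.choice(X, n_clusters)

print(c_centers)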

# [ 1 22 26...
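Once the centers are initialized, k-means alternates between assigning each record to its nearest center and moving each center to the mean of its assigned records. The following is a minimal sketch of that loop, not necessarily the implementation this chapter goes on to build. The function name kmeans_1d, the max_iter parameter, and the hand-picked initial centers [1, 10, 25] are assumptions made here so that the example is reproducible.

```python
import numpy as np

def kmeans_1d(X, c_centers, max_iter=100):
    """Minimal 1-D k-means: returns the final centers and a cluster label per record."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(c_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every record
        labels = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)
        # Update step: move each center to the mean of its assigned records
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if members.size > 0:          # keep the old center if a cluster is empty
                new_centers[j] = members.mean()
        # Stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

data = [1, 2, 3, 2, 1, 3, 9, 8, 11, 12, 10, 11, 14, 25, 26, 24, 30, 22, 24, 27]

# Hand-picked starting centers (an assumption; random starts can end elsewhere)
centers, labels = kmeans_1d(data, [1, 10, 25])
print(centers)
print(labels)
```

The result depends on the starting centers: a random initialization that places two centers in the same group can converge to a different, worse partition, which is why k-means is often run several times.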

Modifying our algorithm

Now that you understand the internals of k-means on a single variable, you can extend this implementation to multiple variables and apply it to a more realistic dataset.

The dataset used in this section is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wholesale+customers), and it contains the client information of a wholesale distributor. There are 440 customers with eight features. In the following list, the first six features are the annual spending on the corresponding product categories, the seventh feature shows the channel through which the products were bought, and the eighth feature shows the region:

  • FRESH
  • MILK
  • GROCERY
  • FROZEN
  • DETERGENTS_PAPER
  • DELICATESSEN
  • CHANNEL
  • REGION

First, download the dataset and read it as a NumPy array:

from numpy import genfromtxt
wholesales_data = genfromtxt('Wholesale customers data.csv', delimiter...
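Extending the single-variable algorithm to several variables mainly means replacing the absolute difference with a Euclidean distance computed across columns. The sketch below shows one way to do that with NumPy broadcasting. It is an illustration under stated assumptions rather than the chapter's own code: the function name kmeans and its parameters are hypothetical, and a small synthetic two-feature array stands in for the wholesale spending columns.

```python
import numpy as np

def kmeans(X, init_centers, max_iter=100):
    """Minimal k-means for an (n_samples, n_features) array."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    n_clusters = len(centers)
    for _ in range(max_iter):
        # (n_samples, n_clusters) matrix of squared Euclidean distances
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # New center = mean of assigned rows; keep the old one if a cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Synthetic stand-in data: two well-separated groups of 50 rows, 2 features each
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10, 10], scale=0.5, size=(50, 2)),
])

# One starting row taken from each group, chosen by hand for a reproducible demo
centers, labels = kmeans(X, X[[0, 50]])
print(centers)
```

The same function accepts any number of feature columns, so an array with one row per customer and one column per spending category could be passed in the same way.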

Summary

In this chapter, you have learned the basics of unsupervised learning and using the k-means algorithm for clustering.

There are many clustering algorithms that show different behavior. Visualization is key when it comes to unsupervised learning algorithms, and you have seen a couple of different ways to visualize and inspect your dataset.

In the next chapter, you will learn about other libraries that are commonly used with NumPy, such as SciPy, pandas, and scikit-learn. These are all important libraries in the practitioner's toolkit, and they complement one another. You will find yourself using these libraries together with NumPy, as each makes certain tasks easier; hence, it's important to know more about the Python data science stack.
