Clustering and Other Unsupervised Learning Methods

In this article by Ferran Garcia Pagans, author of the book Predictive Analytics Using Rattle and Qlik Sense, we will learn about the following:

  • Define machine learning
  • Introduce unsupervised and supervised methods
  • Focus on K-means, a classic machine learning algorithm, in detail

We'll create clusters of customers based on their annual money spent. This will give us a new insight. Being able to group our customers based on their annual money spent will allow us to see the profitability of each customer group and deliver more profitable marketing campaigns or create tailored discounts.

Finally, we'll see hierarchical clustering, different clustering methods, and association rules. Association rules are generally used for market basket analysis.

Machine learning – unsupervised and supervised learning

Machine Learning (ML) is a set of techniques and algorithms that gives computers the ability to learn. These techniques are generic and can be used in various fields. Data mining uses ML techniques to create insights and predictions from data.

In data mining, we usually divide ML methods into two main groups – supervisedlearning and unsupervisedlearning. A computer can learn with the help of a teacher (supervised learning) or can discover new knowledge without the assistance of a teacher (unsupervised learning).

In supervised learning, the learner is trained with a set of examples (dataset) that contains the right answer; we call it the training dataset. We call the dataset that contains the answers a labeled dataset, because each observation is labeled with its answer. In supervised learning, you are supervising the computer, giving it the right answers. For example, a bank can try to predict the borrower's chance of defaulting on credit loans based on the experience of past credit loans. The training dataset would contain data from past credit loans, including if the borrower was a defaulter or not.

In unsupervised learning, our dataset doesn't have the right answers and the learner tries to discover hidden patterns in the data. In this way, we call it unsupervised learning because we're not supervising the computer by giving it the right answers. A classic example is trying to create a classification of customers. The model tries to discover similarities between customers.

In some machine learning problems, we don't have a dataset that contains past observations. These datasets are not labeled with the correct answers and we call them unlabeled datasets.

In traditional data mining, the terms descriptive analytics and predictive analytics are used for unsupervised learning and supervised learning.

In unsupervised learning, there is no target variable. The objective of unsupervised learning or descriptive analytics is to discover the hidden structure of data. There are two main unsupervised learning techniques offered by Rattle:

  • Cluster analysis
  • Association analysis

Cluster analysis

Sometimes, we have a group of observations and we need to split it into a number of subsets of similar observations. Cluster analysis is a group of techniques that will help you to discover these similarities between observations.

Market segmentation is an example of cluster analysis. You can use cluster analysis when you have a lot of customers and you want to divide them into different market segments, but you don't know how to create these segments.

Sometimes, especially with a large amount of customers, we need some help to understand our data. Clustering can help us to create different customer groups based on their buying behavior.

In Rattle's Cluster tab, there are four cluster algorithms:

  • KMeans
  • EwKm
  • Hierarchical
  • BiCluster

The two most popular families of cluster algorithms are hierarchical clustering and centroid-based clustering:

Centroid-based clustering the using K-means algorithm

I'm going to use K-means as an example of this family because it is the most popular.

With this algorithm, a cluster is represented by a point or center called the centroid. In the initialization step of K-means, we need to create k number of centroids; usually, the centroids are initialized randomly. In the following diagram, the observations or objects are represented with a point and three centroids are represented with three colored stars:

After this initialization step, the algorithm enters into an iteration with two operations. The computer associates each object with the nearest centroid, creating k clusters. Now, the computer has to recalculate the centroids' position. The new position is the mean of each attribute of every cluster member.

This example is very simple, but in real life, when the algorithm associates the observations with the new centroids, some observations move from one cluster to the other.

The algorithm iterates by recalculating centroids and assigning observations to each cluster until some finalization condition is reached, as shown in this diagram:

The inputs of a K-means algorithm are the observations and the number of clusters, k. The final result of a K-means algorithm are k centroids that represent each cluster and the observations associated with each cluster.

The drawbacks of this technique are:

  • You need to know or decide the number of clusters, k.
  • The result of the algorithm has a big dependence on k.
  • The result of the algorithm depends on where the centroids are initialized.
  • There is no guarantee that the result is the optimum result. The algorithm can iterate around a local optimum.

In order to avoid a local optimum, you can run the algorithm many times, starting with different centroids' positions. To compare the different runs, you can use the cluster's distortion – the sum of the squared distances between each observation and its centroids.

Customer segmentation with K-means clustering

We're going to use the wholesale customer dataset we downloaded from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine. You can download the dataset from here – https://archive.ics.uci.edu/ml/datasets/Wholesale+customers#.

The dataset contains 440 customers (observations) of a wholesale distributor. It includes the annual spend in monetary units on six product categories – Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen. We've created a new field called Food that includes all categories except Detergents_Paper, as shown in the following screenshot:

Load the new dataset into Rattle and go to the Cluster tab. Remember that, in unsupervised learning, there is no target variable.

I want to create a segmentation based only on buying behavior; for this reason, I set Region and Channel to Ignore, as shown here:

In the following screenshot, you can see the options Rattle offers for K-means. The most important one is Number of clusters; as we've seen, the analyst has to decide the number of clusters before running K-means:

We have also seen that the initial position of the centroids can have some influence on the result of the algorithm. The position of the centroids is random, but we need to be able to reproduce the same experiment multiple times. When we're creating a model with K-means, we'll iteratively re-run the algorithm, tuning some options in order to improve the performance of the model. In this case, we need to be able to reproduce exactly the same experiment. Under the hood, R has a pseudo-random number generator based on a starting point called Seed. If you want to reproduce the exact same experiment, you need to re-run the algorithm using the same Seed.

Sometimes, the performance of K-means depends on the initial position of the centroids. For this reason, sometimes you need to able to re-run the model using a different initial position for the centroids. To run the model with different initial positions, you need to run with a different Seed.

After executing the model, Rattle will show some interesting information. The size of each cluster, the means of the variables in the dataset, the centroid's position, and the Within cluster sum of squares value. This measure, also called distortion, is the sum of the squared differences between each point and its centroid. It's a measure of the quality of the model.

Another interesting option is Runs; by using this option, Rattle will run the model the specified number of times and will choose the model with the best performance based on the Within cluster sum of squares value.

Deciding on the number of clusters can be difficult. To choose the number of clusters, we need a way to evaluate the performance of the algorithm. The sum of the squared distance between the observations and the associated centroid could be a performance measure. Each time we add a centroid to KMeans, the sum of the squared difference between the observations and the centroids decreases. The difference in this measure using a different number of centroids is the gain associated to the added centroids. Rattle provides an option to automate this test, called Iterative Clusters.

If you set the Number of clusters value to 10 and check the Iterate Clusters option, Rattle will run KMeans iteratively, starting with 3 clusters and finishing with 10 clusters. To compare each iteration, Rattle provides an iteration plot. In the iteration plot, the blue line shows the sum of the squared differences between each observation and its centroid. The red line shows the difference between the current sum of squared distances and the sum of the squared distance of the previous iteration. For example, for four clusters, the red line has a very low value; this is because the difference between the sum of the squared differences with three clusters and with four clusters is very small. In the following screenshot, the peak in the red line suggests that six clusters could be a good choice.

This is because there is an important drop in the Sum of WithinSS value at this point:

In this way, to finish my model, I only need to set the Number of clusters to 3, uncheck the Re-Scale checkbox, and click on the Execute button:

Finally, Rattle returns the six centroids of my clusters:

Now we have the six centroids and we want Rattle to associate each observation with a centroid. Go to the Evaluate tab, select the KMeans option, select the Training dataset, mark All in the report type, and click on the Execute button as shown in the following screenshot. This process will generate a CSV file with the original dataset and a new column called kmeans. The content of this attribute is a label (a number) representing the cluster associated with the observation (customer), as shown in the following screenshot:

After clicking on the Execute button, you will need to choose a folder to save the resulting file to and will have to type in a filename. The generated data inside the CSV file will look similar to the following screenshot:

In the previous screenshot, you can see ten lines of the resulting file; note that the last column is kmeans.

Preparing the data in Qlik Sense

Our objective is to create the data model, but using the new CSV file with the kmeans column.

We're going to update our application by replacing the customer data file with this new data file. Save the new file in the same folder as the original file, open the Qlik Sense application, and go to Data load editor.

There are two differences between the original file and this one. In the original file, we added a line to create a customer identifier called Customer_ID, and in this second file we have this field in the dataset. The second difference is that in this new file we have the kmeans column.

From Data load editor, go to the Wholesale customer data sheet, modify line 2, and add line 3. In line 2, we just load the content of Customer_ID, and in line 3, we load the content of the kmeans field and rename it to Cluster, as shown in the following screenshot. Finally, update the name of the file to be the new one and click on the Load data button:

When the data load process finishes, open the data model viewer to check your data model, as shown here:

Note that you have the same data model with a new field called Cluster.

Creating a customer segmentation sheet in Qlik Sense

Now we can add a sheet to the application. We'll add three charts to see our clusters and how our customers are distributed in our clusters. The first chart will describe the buying behavior of each cluster, as shown here:

The second chart will show all customers distributed in a scatter plot, and in the last chart we'll see the number of customers that belong to each cluster, as shown here:

I'll start with the chart to the bottom-right; it's a bar chart with Cluster as the dimension and Count([Customer_ID]) as the measure. This simple bar chart has something special – colors. Each customer's cluster has a special color code that we use in all charts. In this way, cluster 5 is blue in the three charts. To obtain this effect, we use this expression to define the color as color(fieldindex('Cluster', Cluster)), which is shown in the following screenshot:

You can find this color trick and more in this interesting blog by Rob Wunderlich – http://qlikviewcookbook.com/.

My second chart is the one at the top. I copied the previous chart and pasted it onto a free place. I kept the dimension but I changed the measure by using six new measures:

  • Avg([Detergents_Paper])
  • Avg([Delicassen])
  • Avg([Fresh])
  • Avg([Frozen])
  • Avg([Grocery])
  • Avg([Milk])

I placed my last chart at the bottom-left. I used a scatter plot to represent all of my 440 customers. I wanted to show the money spent by each customer on food and detergents, and its cluster. I used the y axis to show the money spent on detergents and the x axis for the money spent on food. Finally, I used colors to highlight the cluster. The dimension is Customer_Id and the measures are Delicassen+Fresh+Frozen+Grocery+Milk (or Food) and [Detergents_Paper]. As the final step, I reused the color expression from the earlier charts.

Now our first Qlik Sense application has two sheets – the original one is 100 percent Qlik Sense and helps us to understand our customers, channels, and regions. This new sheet uses clustering to give us a different point of view; this second sheet groups the customers by their similar buying behavior. All this information is useful to deliver better campaigns to our customers. Cluster 5 is our least profitable cluster, but is the biggest one with 227 customers. The main difference between cluster 5 and cluster 2 is the amount of money spent on fresh products. Can we deliver any offer to customers in cluster 5 to try to sell more fresh products?

Select retail customers and ask yourself, who are our best retail customers? To which cluster do they belong? Are they buying all our product categories?

Hierarchical clustering

Hierarchical clustering tries to group objects based on their similarity. To explain how this algorithm works, we're going to start with seven points (or observations) lying in a straight line:

We start by calculating the distance between each point. I'll come back later to the term distance; in this example, distance is the difference between two positions in the line. The points D and E are the ones with the smallest distance in between, so we group them in a cluster, as shown in this diagram:

Now, we substitute point D and point E for their mean (red point) and we look for the two points with the next smallest distance in between. In this second iteration, the closest points are B and C, as shown in this diagram:

We continue iterating until we've grouped all observations in the dataset, as shown here:

Note that, in this algorithm, we can decide on the number of clusters after running the algorithm. If we divide the dataset into two clusters, the first cluster is point G and the second cluster is A, B, C, D, E, and F. This gives the analyst the opportunity to see the big picture before deciding on the number of clusters.

The lowest level of clustering is a trivial one; in this example, seven clusters with one point in each one.

The chart I've created while explaining the algorithm is a basic form of a dendrogram. The dendrogram is a tree diagram used in Rattle and in other tools to illustrate the layout of the clusters produced by hierarchical clustering.

In the following screenshot, we can see the dendrogram created by Rattle for the wholesale customer dataset. In Rattle's dendrogram, the y axis represent all observations or customers in the dataset, and the x axis represents the distance between the clusters:

Association analysis

Association rules or association analysis is also an important topic in data mining. This is an unsupervised method, so we start with an unlabeled dataset. An unlabeled dataset is a dataset without a variable that gives us the right answer. Association analysis attempts to find relationships between different entities. The classic example of association rules is market basket analysis. This means using a database of transactions in a supermarket to find items that are bought together. For example, a person who buys potatoes and burgers usually buys beer. This insight could be used to optimize the supermarket layout.

Online stores are also a good example of association analysis. They usually suggest to you a new item based on the items you have bought. They analyze online transactions to find patterns in the buyer's behavior.

These algorithms assume all variables are categorical; they perform poorly with numeric variables. Association methods need a lot of time to be completed; they use a lot of CPU and memory. Remember that Rattle runs on R and the R engine loads all data into RAM memory.

Suppose we have a dataset such as the following:

Our objective is to discover items that are purchased together. We'll create rules and we'll represent these rules like this:

Chicken, Potatoes → Clothes

This rule means that when a customer buys Chicken and Potatoes, he tends to buy Clothes.

As we'll see, the output of the model will be a set of rules. We need a way to evaluate the quality or interest of a rule. There are different measures, but we'll use only a few of them. Rattle provides three measures:

  • Support
  • Confidence
  • Lift

Support indicates how often the rule appears in the whole dataset. In our dataset, the rule Chicken, Potatoes → Clothes has a support of 48.57 percent (3 occurrences / 7 transactions).

Confidence measures how strong rules or associations are between items. In this dataset, the rule Chicken, Potatoes → Clothes has a confidence of 1. The items Chicken and Potatoes appear three times in the dataset and the items Chicken, Potatoes, and Clothes appear three times in the dataset; and 3/3 = 1. A confidence close to 1 indicates a strong association.

In the following screenshot, I've highlighted the options on the Associate tab we have to choose from before executing an association method in Rattle:

The first option is the Baskets checkbox. Depending on the kind of input data, we'll decide whether or not to check this option. If the option is checked, such as in the preceding screenshot, Rattle needs an identification variable and a target variable. After this example, we'll try another example without this option.

The second option is the minimum Support value; by default, it is set to 0.1. Rattle will not return rules with a lower Support value than the one you have set in this text box. If you choose a higher value, Rattle will only return rules that appear many times in your dataset. If you choose a lower value, Rattle will return rules that appear in your dataset only a few times. Usually, if you set a high value for Support, the system will return only the obvious relationships. I suggest you start with a high Support value and execute the methods many times with a lower value in each execution. In this way, in each execution, new rules will appear that you can analyze.

The third parameter you have to set is Confidence. This parameter tells you how strong the rule is.

Finally, the length is the number of items that contains a rule. A rule like Beer è Chips has length of two. The default option for Min Length is 2. If you set this variable to 2, Rattle will return all rules with two or more items in it.

After executing the model, you can see the rules created by Rattle by clicking on the Show Rules button, as illustrated here:

Rattle provides a very simple dataset to test the association rules in a file called dvdtrans.csv. Test the dataset to learn about association rules.

Further learning

In this article, we introduced supervised and unsupervised learning, the two main subgroups of machine learning algorithms; if you want to learn more about machine learning, I suggest you complete a MOOC course called Machine Learning at Coursera:

https://www.coursera.org/learn/machine-learning

The acronym MOOC stands for Massive Open Online Course; these are courses open to participation via the Internet. These courses are generally free. Coursera is one of the leading platforms for MOOC courses.

Machine Learning is a great course designed and taught by Andrew Ng, Associate Professor at Stanford University; Chief Scientist at Baidu; and Chairman and Co-founder at Coursera. This course is really interesting.

A very interesting book is Machine Learning with R by Brett Lantz, Packt Publishing.

Summary

In this article, we were introduced to machine learning, and supervised and unsupervised methods. We focused on unsupervised methods and covered centroid-based clustering, hierarchical clustering, and association rules.

We used a simple dataset, but we saw how a clustering algorithm can complement a 100 percent Qlik Sense approach by adding more information.

Resources for Article:


Further resources on this subject:


You've been reading an excerpt of:

Predictive Analytics Using Rattle and Qlik Sense

Explore Title
comments powered by Disqus