Chapter 9: Letting Your Data Speak for Itself with Machine Learning

While making histograms, we got a glimpse of a technique that visualizes aggregates rather than data points directly. In other words, we visualized data about our data. We will take this concept several steps further in this chapter by using a machine learning technique to categorize, or cluster, our data. As you will see, even with a single technique there are numerous options and combinations of options that can be explored. This is where the value of interactive dashboards comes into play: it would be very tedious if users had to explore every single option by manually creating a chart for it.

This chapter is not an introduction to machine learning, nor does it assume any prior knowledge of it. We will explore a clustering technique called KMeans, using the sklearn machine learning package. This will help us in grouping our data...

Technical requirements

We will be exploring a few options from sklearn, as well as NumPy. Otherwise, we will be using the same tools we have been using. For visualization and building interactivity, Dash, JupyterDash, the Dash Core Component library, Dash HTML Components, Dash Bootstrap Components, Plotly, and Plotly Express will be used. For data manipulation and preparation, we will use pandas and NumPy. JupyterLab will be used for exploring and building independent functionality. Finally, sklearn will be used for building our machine learning models, as well as for preparing our data.

The code files of this chapter can be found on GitHub at https://github.com/PacktPublishing/Interactive-Dashboards-and-Data-Apps-with-Plotly-and-Dash/tree/master/chapter_09.

Check out the following video to see the Code in Action at https://bit.ly/3x8PAmt.

Understanding clustering

So, what exactly is clustering, and when might it be helpful? Let's start with a very simple example. Imagine we have a group of people for whom we want to make T-shirts. We will make a T-shirt for each of them, in whatever size we choose; the main restriction is that we can only make one size. The sizes are as follows: [1, 2, 3, 4, 5, 7, 9, 11]. Think about how you might tackle this problem. We will use the KMeans algorithm for that, so let's start right away, as follows:

  1. Import the required packages and models. NumPy will be imported as a package, but from sklearn we will import the only model that we will be using for now, as illustrated in the following code snippet:
    import numpy as np
    from sklearn.cluster import KMeans
  2. Create a dataset of sizes in the required format. Note that each observation (person's size) should be represented as a list, so we use the reshape method of NumPy arrays to get the data in the required format, as follows...
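
The code for step 2 is truncated in this excerpt; a minimal sketch of how the reshaping, plus a first fit, might look is shown here (the choice of two clusters is just for illustration, not a recommendation):

    import numpy as np
    from sklearn.cluster import KMeans

    # Each observation (a person's size) becomes its own row, which is the
    # two-dimensional format that sklearn expects.
    sizes = np.array([1, 2, 3, 4, 5, 7, 9, 11]).reshape(-1, 1)

    # Fit a model; n_clusters is the option we keep changing in this chapter.
    kmeans = KMeans(n_clusters=2, random_state=0)
    kmeans.fit(sizes)

    print(kmeans.cluster_centers_)  # the size each group of people would get
    print(kmeans.labels_)           # which cluster each person was assigned to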

Finding the optimal number of clusters

We will now see the options we have in choosing the optimal number of clusters and what that entails, but let's first take a look at the following screenshot to visualize how things progress from having one cluster to eight clusters:

Figure 9.3 – Data points and cluster centers for all possible cluster numbers

We can see the full spectrum of possible clusters and how they relate to data points. At the end, when we specified 8, we got the perfect solution, where every data point is a cluster center.

In reality, you might not want to go for the full solution, for two main reasons. Firstly, it is probably going to be prohibitive from a cost perspective. Imagine making 1,000 T-shirts with a few hundred sizes. Secondly, in practical situations, it usually wouldn't add much value to add more clusters after a certain fit has been achieved. Using our T-shirt example, imagine if we have two people with...
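
As a minimal sketch of how these trade-offs can be quantified (not the book's exact code), we can refit the model for every possible number of clusters on the same sizes array and record the inertia_ attribute, the total within-cluster sum of squared distances; watching how it drops as clusters are added is the idea behind the elbow technique covered later in the chapter:

    import numpy as np
    import plotly.express as px
    from sklearn.cluster import KMeans

    sizes = np.array([1, 2, 3, 4, 5, 7, 9, 11]).reshape(-1, 1)

    # inertia_ falls as clusters are added, reaching zero when every data
    # point is its own cluster center.
    inertias = [KMeans(n_clusters=k, random_state=0).fit(sizes).inertia_
                for k in range(1, 9)]

    fig = px.line(x=list(range(1, 9)), y=inertias,
                  labels={'x': 'Number of clusters', 'y': 'Inertia'})
    fig.show()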

Clustering countries by population

We will first explore this with an indicator we are familiar with (population), and then make it interactive. We will cluster countries into groups based on their population.

Let's start with a possible practical situation. Imagine you were asked to group countries by population into two groups, of high and low populations. How would you do that? Where would you draw the line(s), and how large does a population have to be to qualify as "high"? Imagine that you were then asked to group countries into three or four groups based on their population. How would you update your clusters?

We can easily see how KMeans clustering is ideal for that.

Let's now do the same exercise with KMeans using one dimension, and then combine that with our knowledge of mapping, as follows:

  1. Import pandas and open the poverty dataset, like this:
    import pandas as pd
    poverty = pd...
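
The remaining steps are truncated in this excerpt. As a hedged sketch of how the exercise might be wired together, the following clusters one year of population data and colors a choropleth map by cluster; the file name data/poverty.csv and the column names 'year', 'Population, total', 'Country Code', and 'Country Name' are assumptions and should be adjusted to the actual dataset:

    import pandas as pd
    import plotly.express as px
    from sklearn.cluster import KMeans

    # Hypothetical file and column names -- adjust to match the book's dataset.
    poverty = pd.read_csv('data/poverty.csv')

    # Keep one year and drop countries with a missing population value.
    df = poverty[poverty['year'].eq(2010)].dropna(subset=['Population, total'])

    # One-dimensional clustering: each country's population is its own row.
    population = df[['Population, total']].values
    kmeans = KMeans(n_clusters=3, random_state=0).fit(population)

    # Color countries on a map by the cluster they were assigned to.
    fig = px.choropleth(df,
                        locations='Country Code',
                        color=kmeans.labels_.astype(str),
                        hover_name='Country Name',
                        title='Countries clustered by population (k=3)')
    fig.show()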

Preparing data with scikit-learn

scikit-learn is one of the most widely used and comprehensive machine learning libraries in Python. It plays very well with the rest of the data-science ecosystem libraries, such as NumPy, pandas, and matplotlib. We will be using it for modeling our data and for some preprocessing as well.

We now have two issues to tackle first: missing values and scaling data. Let's look at simple examples of each, and then tackle them in our dataset, starting with missing values.

Handling missing values

Models need data, and they can't know what to do with a set of numbers containing missing values. In such cases (and there are many in our dataset), we need to make a decision on what to do with those missing values.

There are several options, and the right choice depends on the application as well as the nature of the data, but we won't get into those details. For simplicity, we will make a generic choice of replacing...
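
The paragraph above is cut off. One generic option (not necessarily the exact choice the book settles on) is to replace missing values with the column mean using scikit-learn's SimpleImputer; the second issue, scaling, can be handled with StandardScaler. A minimal sketch:

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # A tiny array with a missing value, standing in for an indicator column.
    data = np.array([[1.0], [2.0], [np.nan], [10.0]])

    # One common, generic choice: replace missing values with the column mean.
    imputed = SimpleImputer(strategy='mean').fit_transform(data)

    # Scaling puts indicators with very different ranges on a comparable
    # footing, which matters for a distance-based algorithm like KMeans.
    scaled = StandardScaler().fit_transform(imputed)
    print(scaled)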

Creating an interactive KMeans clustering app

Let's now put everything together and make an interactive clustering application using our dataset. We will give users the option to choose the year, as well as the indicator(s) that they want. They can also select the number of clusters and get a visual representation of those clusters, in the form of a colored choropleth map, based on the discovered clusters.

Please note that it is challenging to interpret such results with multiple indicators, because we will be handling more than one dimension. It can also be difficult if you are not an economist and don't know which indicators make sense to compare with which others, and so on.

The following screenshot shows what we will be working toward:

Figure 9.9 – An interactive KMeans clustering application

As you can see, this is a fairly rich application in terms of the combinations of options that it provides. As I also mentioned...
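
As a rough sketch of how such an app might be wired together (not the book's exact code), assuming Dash 2.x, the hypothetical file data/poverty.csv, and the column and indicator names shown below, which should all be adjusted to the actual dataset:

    import pandas as pd
    import plotly.express as px
    from dash import Dash, dcc, html
    from dash.dependencies import Input, Output
    from sklearn.cluster import KMeans
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # Hypothetical file, columns, and indicators -- adjust to the real dataset.
    poverty = pd.read_csv('data/poverty.csv')
    years = sorted(poverty['year'].dropna().unique().tolist())
    indicators = ['Population, total',
                  'GNI per capita, Atlas method (current US$)']

    app = Dash(__name__)
    app.layout = html.Div([
        dcc.Dropdown(id='year', value=years[-1],
                     options=[{'label': y, 'value': y} for y in years]),
        dcc.Dropdown(id='indicators', value=[indicators[0]], multi=True,
                     options=[{'label': i, 'value': i} for i in indicators]),
        dcc.Slider(id='n_clusters', min=2, max=10, step=1, value=3),
        dcc.Graph(id='clusters_map'),
    ])

    @app.callback(Output('clusters_map', 'figure'),
                  Input('year', 'value'),
                  Input('indicators', 'value'),
                  Input('n_clusters', 'value'))
    def cluster_countries(year, selected_indicators, n_clusters):
        df = poverty[poverty['year'].eq(year)]
        # Impute missing values and scale the chosen indicators before fitting.
        X = StandardScaler().fit_transform(
            SimpleImputer(strategy='mean').fit_transform(df[selected_indicators]))
        labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
        return px.choropleth(df, locations='Country Code',
                             color=labels.astype(str),
                             hover_name='Country Name',
                             title=f'Countries clustered into {n_clusters} groups')

    if __name__ == '__main__':
        app.run_server(debug=True)

The callback refits KMeans on every interaction, so changing the year, the selected indicators, or the number of clusters immediately redraws the map.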

Summary

We first got an idea of how clustering works. We built the simplest possible model for a tiny dataset, ran it a few times, and evaluated the performance and outcome for each number of clusters that we chose.

We then explored the elbow technique to evaluate different numbers of clusters and saw how we might discover the point of diminishing returns, where adding new clusters yields little improvement. With that knowledge, we used the same technique to cluster countries by a metric with which most of us are familiar and got firsthand experience of how it might work on real data.

After that, we planned an interactive KMeans app and explored two techniques for preparing data before running our model. We mainly explored imputing missing values and scaling data.

This gave us enough knowledge to get our data in a suitable format for us to create our interactive app, which we did at the end of the chapter.

Next, we will explore advanced features of Dash...
