
Matplotlib for Python Developers - Second Edition

Product type: Book
Published in: Apr 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788625173
Edition: 2nd Edition
Authors (3):
Aldrin Yim

Aldrin Yim is a PhD candidate and Markey Scholar in the Computation and System Biology program at Washington University, School of Medicine. His research focuses on applying big data analytics and machine learning approaches in studying neurological diseases and cancer. He is also the founding CEO of Codex Genetics Limited, which provides precision medicine solutions to patients and hospitals in Asia.

Claire Chung

Claire Chung is pursuing her PhD degree as a Bioinformatician at the Chinese University of Hong Kong. She enjoys using Python daily for work and lifehack. While passionate in science, her challenge-loving character motivates her to go beyond data analytics. She has participated in web development projects, as well as developed skills in graphic design and multilingual translation. She led the Campus Network Support Team in college, and shared her experience in data visualization in PyCon HK 2017.

Allen Yu

Allen Yu, PhD, is a Chevening Scholar, 2017-18, and an MSC student in computer science at the University of Oxford. He holds a PhD degree in Biochemistry from the Chinese University of Hong Kong, and he has used Python and Matplotlib extensively during his 10 years of bioinformatics experience.


Chapter 10. Integrating Data Visualization into the Workflow

We have now come to the concluding chapter of this book. Throughout the book, you have learned techniques to create and customize static and animated plots using real-world data in different formats scraped from the web. To wrap up, this chapter starts a mini-project that combines data analytics skills with the visualization techniques you've learned, and demonstrates how to integrate visualization into your current workflow.

In the era of big data, machine learning has become fundamental to easing analytic work by replacing huge amounts of manual curation with automatic prediction. Yet, before we start building models, Exploratory Data Analysis (EDA) is essential to get a good grasp of what the data is like. Constant review during the optimization process also helps improve our training strategy and results.

High-dimensional data typically requires special processing techniques to be...

Getting started


Recall the MNIST dataset we briefly touched upon in Chapter 4, Advanced Matplotlib. It contains 70,000 images of handwritten digits and is often used in data mining tutorials as Machine Learning 101. We will continue using a similar image dataset of handwritten digits for our project in this chapter.

You have almost certainly heard the popular keywords deep learning and machine learning before starting this book, which is why we chose this topic as our showcase. Detailed machine learning concepts, such as hyperparameter tuning to optimize performance, are beyond the scope of this book, so we will not go into them. Instead, we will cover the model training part in a cookbook style and focus on how visualization helps our workflow. If you are interested in the details of machine learning, we recommend exploring the many resources available online.

Visualizing sample images from the dataset


Data cleaning and EDA are indispensable components of data science. Before we begin analyzing our data, it is important to understand some basic properties of our input. The dataset we are using comprises standardized images with regular shapes and normalized pixel values. The features are simple, thin lines. Our goal is straightforward as well: to recognize digits from images. Yet, in real-world practice, problems are often more complicated, and the data we collect is raw and often much more heterogeneous. Before tackling such a problem, it is usually worth taking the time to sample a small amount of input data for inspection. Imagine training a model to recognize ramen, just to get you drooling ;). You would probably look at some images first to decide which features make a good input sample that exemplifies the presence of the bowl. Besides the initial preparatory phase, during model building taking out some of the mislabeled...
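As a rough illustration of this kind of quick inspection, the following sketch draws a small grid of sample images with their labels. It assumes scikit-learn's bundled 8x8 digits dataset (load_digits) as a stand-in for the chapter's data, and plot_sample_grid is a hypothetical helper name, not something taken from the book.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Assumption: scikit-learn's bundled 8x8 digits stand in for the chapter's dataset
digits = load_digits()


def plot_sample_grid(images, labels, n_rows=2, n_cols=5):
    """Show the first n_rows * n_cols images in a grid, titled with their labels."""
    fig, axes = plt.subplots(n_rows, n_cols, squeeze=False,
                             figsize=(1.5 * n_cols, 1.7 * n_rows))
    for ax, image, label in zip(axes.ravel(), images, labels):
        ax.imshow(image, cmap="gray_r")  # dark strokes on a light background
        ax.set_title(str(label))
        ax.axis("off")
    fig.tight_layout()
    return fig


fig = plot_sample_grid(digits.images, digits.target)
```

Call plt.show() in an interactive session, or fig.savefig(...) in a script, to view the grid.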

Exploring the data nature by the t-SNE method


After visualizing a few images and getting a glimpse of how the samples are distributed, we will go deeper into our EDA.

Each pixel comes with an intensity value, which makes 64 variables for each 8x8 image. The human brain is not good at intuitively perceiving dimensions higher than three. For high-dimensional data, we need more effective visual aids. 
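To make the count concrete, here is a minimal sketch that flattens each 8x8 image into a row of 64 values. It again assumes scikit-learn's load_digits dataset as a stand-in for the chapter's data.

```python
from sklearn.datasets import load_digits

# Assumption: scikit-learn's bundled 8x8 digits stand in for the chapter's dataset
digits = load_digits()

# Each sample is an 8x8 grid of pixel intensities
print(digits.images.shape)

# Flattening every grid gives one row of 8 * 8 = 64 variables per image
flat = digits.images.reshape(len(digits.images), -1)
print(flat.shape)
```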

Dimensionality reduction methods, such as the commonly used PCA and t-SNE, reduce the number of input variables under consideration, while retaining most of the useful information. As a result, the visualization of data becomes more intuitive.
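For comparison with t-SNE, a PCA projection of the 64 pixel variables onto two components might look like the following sketch. It assumes the load_digits dataset; the variable names and plot styling are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: scikit-learn's bundled 8x8 digits stand in for the chapter's dataset
digits = load_digits()

# Project the 64 pixel variables onto the first two principal components
pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)

# Fraction of the total variance kept by the two components
retained = pca.explained_variance_ratio_.sum()
print(retained)

# Scatter plot of the projection, colored by digit label
fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(projected[:, 0], projected[:, 1],
                    c=digits.target, cmap="tab10", s=8)
fig.colorbar(points, ax=ax, ticks=list(range(10)), label="Digit")
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
```

The printed ratio is a quick check on how much variance a two-component projection actually retains for a given dataset.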

In the following section, we will focus our discussion on the t-SNE method by using the scikit-learn library in Python.
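As a minimal sketch of what this can look like, the code below embeds a subset of the digits into two dimensions with scikit-learn's TSNE class. The dataset (load_digits), the 500-sample subset, and the parameter values are assumptions chosen to keep the run short, not the book's settings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Assumption: scikit-learn's bundled 8x8 digits stand in for the chapter's dataset
digits = load_digits()

# Use a subset so the embedding finishes quickly
X, y = digits.data[:500], digits.target[:500]

# Embed the 64-dimensional samples into two dimensions
tsne = TSNE(n_components=2, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

# Scatter plot of the embedding, colored by digit label
fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=10)
fig.colorbar(points, ax=ax, ticks=list(range(10)), label="Digit")
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
```

The axes of a t-SNE plot have no direct interpretation; what matters is which points end up close together.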

Understanding t-Distributed stochastic neighbor embedding 

The t-SNE method was proposed by van der Maaten and Hinton in 2008 in the publication Visualizing Data using t-SNE. It is a nonlinear dimensionality reduction method that aims to effectively visualize...

Creating a CNN to recognize digits


In the following section, we will use Keras, a Python library for neural networks that provides a high-level interface to TensorFlow. We do not intend to give a complete tutorial on Keras or convolutional neural networks (CNNs), but we want to show how we can use Matplotlib to visualize the loss function, accuracy, and outliers of the results.

Readers who are not familiar with machine learning should still be able to follow the logic of the rest of the chapter and, hopefully, understand why visualizing the loss function, accuracy, and outliers of the results is important in fine-tuning the CNN model.

Here is a snippet of code for the CNN; the most important part is the evaluation section after this!

# Import sklearn models for preprocessing input data
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelBinarizer

# Import the necessary Keras libraries
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten...
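The preprocessing imports above suggest splitting the data and one-hot encoding the labels before training. The following is a minimal sketch of that step only, assuming scikit-learn's load_digits dataset; the split ratio, random seed, and variable names are illustrative assumptions, not the book's values.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

# Assumption: scikit-learn's bundled 8x8 digits stand in for the chapter's dataset
digits = load_digits()

# Scale pixel intensities (0 to 16 in this dataset) to [0, 1]
# and add a trailing channel axis, as image layers usually expect
X = (digits.images / 16.0)[..., np.newaxis]

# Hold out part of the data for testing; the 80/20 split is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.2, random_state=42, stratify=digits.target)

# One-hot encode the labels: digit 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
binarizer = LabelBinarizer()
y_train_onehot = binarizer.fit_transform(y_train)
y_test_onehot = binarizer.transform(y_test)

print(X_train.shape, y_train_onehot.shape)
```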

Evaluating prediction results with visualizations


We specified callbacks that record the loss and accuracy for each epoch, and saved the result in the variable history. We can retrieve this data from the dictionary history.history. Let's check out the dictionary keys:

print(history.history.keys())

This will output dict_keys(['loss', 'acc']). In more recent Keras releases, the accuracy key is named 'accuracy' rather than 'acc'.

Next, we will plot the loss and accuracy over the epochs as line graphs:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# On Matplotlib 3.6 and later, this style is named 'seaborn-v0_8'
matplotlib.style.use('seaborn')

# Plot the training loss at each epoch
pd.DataFrame(history.history['loss']).plot(legend=False)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training loss across 100 epochs', fontsize=20, fontweight='bold')
plt.show()

# Plot the training accuracy at each epoch
pd.DataFrame(history.history['acc']).plot(legend=False)
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training accuracy across 100 epochs', fontsize=20, fontweight='bold')
plt.show()
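The two curves can also be placed side by side in one figure. The sketch below uses synthetic, made-up curves as a stand-in for history.history so that it runs without training a model; plot_history is a hypothetical helper, and the fallback between the 'acc' and 'accuracy' keys is an assumption about which Keras version produced the dictionary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for history.history: made-up curves, not real training results
epochs = np.arange(100)
history_dict = {
    "loss": np.exp(-epochs / 20.0),
    "acc": 1.0 - 0.9 * np.exp(-epochs / 20.0),
}


def plot_history(history_dict):
    """Plot loss and accuracy per epoch in two side-by-side panels."""
    # Older Keras versions use 'acc', newer ones 'accuracy'
    acc_key = "acc" if "acc" in history_dict else "accuracy"

    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_loss.plot(history_dict["loss"])
    ax_loss.set_xlabel("Epoch")
    ax_loss.set_ylabel("Loss")
    ax_loss.set_title("Training loss")

    ax_acc.plot(history_dict[acc_key])
    ax_acc.set_xlabel("Epoch")
    ax_acc.set_ylabel("Accuracy")
    ax_acc.set_title("Training accuracy")

    fig.tight_layout()
    return fig


fig = plot_history(history_dict)
```

With a real model, passing history.history instead of the synthetic dictionary should give the same layout.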

Upon training, we can...

Summary


Congratulations! You have now completed this chapter as well as the whole book. In this chapter, we integrated various data visualization techniques into an analytics project workflow, from the initial inspection and exploratory analysis of data to model building and evaluation. Give yourself a huge round of applause, and get ready to leap forward into the journey of data science!
