
You're reading from  Matplotlib 2.x By Example

Product type: Book
Published in: Aug 2017
Publisher: Packt
ISBN-13: 9781788295260
Edition: 1st Edition
Authors (3):
Allen Yu

Allen Yu, PhD, is a Chevening Scholar, 2017-18, and an MSC student in computer science at the University of Oxford. He holds a PhD degree in Biochemistry from the Chinese University of Hong Kong, and he has used Python and Matplotlib extensively during his 10 years of bioinformatics experience.

Claire Chung

Claire Chung is pursuing her PhD degree as a Bioinformatician at the Chinese University of Hong Kong. She enjoys using Python daily for work and lifehack. While passionate in science, her challenge-loving character motivates her to go beyond data analytics. She has participated in web development projects, as well as developed skills in graphic design and multilingual translation. She led the Campus Network Support Team in college, and shared her experience in data visualization in PyCon HK 2017.

Aldrin Yim

Aldrin Yim is a PhD candidate and Markey Scholar in the Computation and System Biology program at Washington University, School of Medicine. His research focuses on applying big data analytics and machine learning approaches in studying neurological diseases and cancer. He is also the founding CEO of Codex Genetics Limited, which provides precision medicine solutions to patients and hospitals in Asia.


Chapter 8. Exploratory Data Analytics and Infographics

Let the data speak for themselves.

This quote is well known among data scientists. However, capturing the hidden characteristics or features of big data is often not trivial, and some exploratory data analysis must be done before we can fully understand a dataset.

In this chapter, we aim to perform some exploratory data analysis on two datasets, using the techniques that we have discussed in previous chapters. Here is a brief outline of this chapter:

  • Visualizing categorical data
  • Visualizing geographical data
  • GeoPandas library
  • Working with images using the PIL library
  • Importing/transforming images
  • Multiple subplots
  • Heatmap
  • Survival graph

We assume that readers are now comfortable with the pandas DataFrame, as it is used heavily in this chapter.

Readers should also note that most exploratory data analyses actually involve a significant amount of statistics, including dimension reduction approaches such as PCA...
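
As a brief illustration of the dimension reduction mentioned above, here is a minimal sketch of PCA computed through a singular value decomposition in NumPy. It is only one of several ways to compute PCA, and it is not taken from the chapter's own code. The function name `pca` and the synthetic three-feature dataset are hypothetical, chosen so that the third feature is almost a copy of the first and two components capture nearly all of the variance.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    # Center each feature (column) at zero before decomposing
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    # Fraction of the total variance explained by each component
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, explained_ratio[:n_components]

# Synthetic data (an assumption for illustration only):
# 100 samples, 3 features, where the third is nearly a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])

scores, ratio = pca(X, n_components=2)
```

The two columns of `scores` can then be passed to a scatter plot to look for groupings that are hard to see in the original features.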

Visualizing population health information


This section combines geographical and population health information for the US. Since this is a tutorial on Python, we focus more on ways to visualize the data than on drawing solid conclusions from it. However, many of the findings below actually concur with the population health research and news reports that one may find online.

To begin, let us first download the following information:

  • Top 10 leading causes of death in the United States from 1999 to 2013 from Healthdata.gov
  • 2016 TIGER GeoDatabase from US Census Bureau
  • Survival data for various types of cancer from The Cancer Genome Atlas (TCGA) project (https://cancergenome.nih.gov/)

Since some of the information does not allow direct download through links, we have included the raw data in our code repository:

Survival data analysis on cancer


Since we've spent a significant amount of time discussing death rates, let us conclude this chapter with one final analysis of two cancer datasets. We have obtained de-identified clinical datasets for breast cancer and brain tumors from http://www.cbioportal.org/; our goal is to see what the overall survival outcome looks like, and whether the two cancers have statistically different survival outcomes. The datasets are explored here for research purposes only:

import pandas as pd

# The clinical datasets are in TSV format, so we pass sep='\t'
# to pd.read_csv() to construct each DataFrame.
# Appending ?raw=true asks GitHub for the raw file rather than the HTML page.
gbm_df = pd.read_csv('https://github.com/PacktPublishing/Matplotlib-2.x-By-Example/blob/master/gbm_tcga_clinical_data.tsv?raw=true', sep='\t')
# Keep primary tumor samples that have a recorded overall survival time
gbm_primary_df = (gbm_df[gbm_df['Sample Type'] == 'Primary Tumor']
                  .dropna(subset=['Overall Survival (Months)']))

brca_df = pd.read_csv('https://github.com/PacktPublishing/Matplotlib-2.x-By-Example/blob/master/brca_metabric_clinical_data...
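
One common way to look at overall survival is a Kaplan-Meier curve. The following is a minimal sketch of that estimator in plain Python, and not necessarily the approach used in the rest of this chapter. The function name `kaplan_meier` and the toy follow-up values are hypothetical and are not taken from the clinical datasets above. In practice, the durations would come from a column such as 'Overall Survival (Months)', and the event indicator from a survival-status column whose exact name is an assumption here.

```python
def kaplan_meier(durations, events):
    """Return (event_times, survival_probabilities) for right-censored data.

    durations: follow-up time for each patient (for example, in months)
    events: 1 if death was observed, 0 if the patient was censored
    """
    # Distinct times at which at least one death was observed
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = []
    s = 1.0
    for t in event_times:
        # Patients still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Deaths observed exactly at time t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        # Multiply by the conditional probability of surviving past t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, survival

# Tiny made-up cohort for illustration only (not real patient data)
months = [5, 8, 8, 12, 15, 20]
observed = [1, 1, 0, 1, 0, 1]
times, surv = kaplan_meier(months, observed)
```

The resulting pairs can be drawn as a step function, for example with Matplotlib's `plt.step(times, surv, where='post')`. Judging whether two such curves differ statistically would need a formal test such as the log-rank test, which this sketch does not cover.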

Summary


In this chapter, we explored different ways of performing exploratory data analysis, focusing specifically on population health information. With the code provided in this book, readers can combine more datasets and explore their hidden characteristics. For instance, one can explore whether illegal drug usage is correlated with suicide, or whether exercise is anti-correlated with heart disease across the USA. One key message is that readers should not mix up association and causality, a frequent mistake made even by experienced data scientists. Hopefully, by now, readers are getting more comfortable with data analysis using Python, and we, the authors, are looking forward to your contribution to the Python community.

Happy coding!
