You're reading from  Apache Superset Quick Start Guide

Product type: Book
Published in: Dec 2018
Reading level: Intermediate
ISBN-13: 9781788992244
Edition: 1st Edition
Author: Shashank Shekhar

Shashank Shekhar is a data analyst and open source enthusiast. He has contributed to Superset and pymc3 (the Python Bayesian machine learning library), and maintains several public repositories on machine learning and data analysis projects of his own on GitHub. He heads up the data science team at HyperTrack, where he designs and implements machine learning algorithms to obtain insights from movement data. Previously, he worked at Amino on claims data. He has worked as a data scientist in Silicon Valley for 5 years. His background is in systems engineering and optimization theory, and he carries that perspective when thinking about data science, biology, culture, and history.

Visualizing Data in a Column

Tabular data is everywhere, and for most analytics the answers lie in a few important columns. A table can have many columns, but some are more significant than others. Each column in a tabular dataset represents a distinct feature of the dataset. Once we have identified a column of interest, our goal in this chapter is to build visualizations in Superset that help us explore and interpret its data.

In this chapter, we will understand columnar data through distribution plots, a point-wise comparison with reference columns, and charts that are just one-line summaries:

  • Distribution: Histogram
  • Comparison: Distribution box plots for subsets of column values
  • Comparison: Compare distributions of columns with values belonging to different scales
  • Comparison: Compare metrics and distributions between subsets of column values
  • Summary...

Dataset

A favorite on my blogroll, https://austinrochford.com/, is a good place to discover intuitive explanations of Bayesian machine learning methods. In the December 29, 2017 blog post Quantifying Three Years of Reading (https://austinrochford.com/posts/2017-12-29-quantifying-reading.html), the blogger analyzes changes in their own reading log dataset. The reading log is available as a Google sheet at this link: https://docs.google.com/spreadsheets/d/1wNbJv1Zf4Oichj3-dEQXE_lXVCwuYQjaoyU1gGQQqk4/.

The reading log is a time series dataset, which the blogger frequently updates. I have taken a snapshot of the reading log and saved it in the chapter's GitHub directory, https://github.com/PacktPublishing/Superset-Quick-Start-Guide/Chapter04/. The dataset has been modified by the addition of a new column. You can run the Jupyter Notebook generate_dataset.ipynb to generate the...

Distribution – histogram

After uploading the file as a table, open it for visualization and select the Histogram option. Make sure that start_date is selected as Time Column. The Time window defined between Since and Until must be large enough to include all the books, because we do not want to do any Time window-specific analysis.

Each row in the dataset is a book, and page count is one of its important features. It is a numerical value, so let's begin with a distribution plot of page counts. It will give us a sense of how much the feature varies:

Data form for a histogram chart

The number of bins in a histogram limits the granularity of questions we can answer about the variance of the feature:

Distribution plot of page counts

Because we have set five bins, what is identifiable is that about 41-42 out of 93 books (approx. 44%-45%) have page counts of...
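To make the effect of the bin count concrete, here is a minimal sketch of equal-width binning in plain Python. The page counts are hypothetical values invented for illustration, not rows from the reading log, and the `histogram` helper is our own function rather than anything Superset exposes. Superset may choose bin edges differently.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bin
        else:
            # The maximum value belongs to the last bin, not a bin of its own.
            idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical page counts (assumption: not taken from the reading log).
pages = [100, 180, 220, 250, 260, 310, 320, 400, 480, 600]
edges, counts = histogram(pages, bins=5)

for left, right, n in zip(edges, edges[1:], counts):
    print(f"{left:.0f}-{right:.0f} pages: {n} books ({n / len(pages):.0%})")
```

With only five bins, every book between two edges is indistinguishable from its neighbors in the same bin. Raising `bins` splits those groups further, at the cost of noisier counts per bin.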

Comparison – relationship between feature values

Let's say we are curious about a trend where the time taken to read a book increases with respect to the page count.

Books often have a gripping effect on a reader once the reader finds them interesting. So, we cannot expect the number of days taken to read a book to grow proportionately with its page count, because books that the reader finds gripping will be read at a faster pace than others. In any dataset, there are samples that are noisy and hard to explain. In this dataset, we will find that some books with lower page counts take more days to finish than books with higher page counts.

It will be useful to look at the number of samples we have available for each group, defined by number of reading days. Select COUNT(*) as Metrics to plot the number_of_books read:

Defining the reading capacity for each group of...
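The COUNT(*) metric grouped by reading days amounts to counting rows per group. The sketch below shows that idea in plain Python. The rows and the `(title, pages, days)` layout are assumptions made up for illustration, not the actual schema or contents of the reading log.

```python
from collections import Counter

# Hypothetical (title, pages, days) rows; not real reading-log values.
books = [
    ("A", 250, 7),
    ("B", 400, 7),
    ("C", 120, 14),
    ("D", 600, 21),
    ("E", 310, 14),
    ("F", 180, 7),
]

# Equivalent in spirit to: SELECT days, COUNT(*) FROM books GROUP BY days
number_of_books = Counter(days for _, _, days in books)

for days, n in sorted(number_of_books.items()):
    print(f"{days} reading days: {n} book(s)")
```

Groups with only one or two samples deserve caution: any statistic computed for them later rests on very little data.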

Comparison – box plots for groups of feature values

The previous charts described the relationship between the days taken to finish reading a book and its page count. Now, we will try to understand the highest page counts in calendar months where a book was finished after x number of reading days. In the first chart, we plotted the number of samples available for each group of books completed in the same number of days. Many groups contain multiple samples. Here, we will plot a distribution over the samples in each group.

We can define a statistic to summarize the average page counts of books completed in the same calendar month in which a book was completed after x number of days.

We will make a box plot chart as follows:


Parameters to set box plot chart

The data that we are visualizing in this box plot is made using multiple group by operations, because the...
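A box plot draws a five-number summary for each group. As a rough sketch of what is being summarized, the snippet below groups hypothetical page counts by reading days and computes the minimum, quartiles, and maximum per group. The data is invented, and the quartile convention used here (`method="inclusive"` from Python's `statistics` module) is an assumption that may differ from what Superset or the underlying database computes.

```python
from collections import defaultdict
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical (days, pages) rows; not real reading-log values.
rows = [(7, 250), (7, 400), (7, 180), (7, 320),
        (14, 120), (14, 310), (14, 500)]

# First group by: collect the page counts for each number of reading days.
groups = defaultdict(list)
for days, pages in rows:
    groups[days].append(pages)

# quantiles() needs at least two data points, so skip single-sample groups.
summaries = {
    days: five_number_summary(values)
    for days, values in groups.items()
    if len(values) >= 2
}

for days, summary in sorted(summaries.items()):
    print(days, summary)
```

Each tuple corresponds to one box: the middle three numbers give the box and its median line, and the outer two give the whisker ends when whiskers are drawn at the minimum and maximum.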

Comparison – side-by-side visualization of two feature values

With a time series dataset, interactions can best be analyzed by plotting two features side by side on a shared Time axis. Let's say we are curious to ascertain how, on a monthly basis, the page count of books affects the number of books read that month. To do this, we will use Dual Axis Line Chart. To mark the books finished on the Time axis, select end_date as the Time Column and month as the Time Grain. We select the page count of the longest book read and the number of books read for each month as follows:

Setting the parameters for two feature values

The output for it is displayed as follows:

Side-by-side visualization of two feature values

It is noticeable that the two y axes have repeated values. The range of the left y axis value is 1 to 6, and the range of the right axis value is 100 to 900....
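The two series on the chart come from one monthly aggregation: the number of books finished and the page count of the longest book. The sketch below reproduces that aggregation in plain Python on invented rows. The dates and page counts are hypothetical, and the `(end_date, pages)` layout is an assumption for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (end_date, pages) rows; not real reading-log values.
books = [
    (date(2018, 1, 10), 250),
    (date(2018, 1, 28), 400),
    (date(2018, 2, 15), 120),
    (date(2018, 3, 3), 600),
    (date(2018, 3, 20), 310),
    (date(2018, 3, 30), 180),
]

# Bucket books by the calendar month in which they were finished
# (the role played by the month Time Grain on end_date).
monthly = defaultdict(list)
for end_date, pages in books:
    monthly[(end_date.year, end_date.month)].append(pages)

# One entry per month: (number of books read, pages in the longest book).
series = {month: (len(p), max(p)) for month, p in sorted(monthly.items())}

for (year, month), (n_books, longest) in series.items():
    print(f"{year}-{month:02d}: {n_books} books, longest {longest} pages")
```

Because the two values live on very different scales (a handful of books versus hundreds of pages), plotting them against a single y axis would flatten the book count into a near-horizontal line, which is why each series gets its own axis.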

Summary statistics – headline

Superset has a chart useful for dashboards called a headline. It is a chart that plots a single metric. Single numbers can answer key questions that we have about datasets.

In the process of analyzing the page count feature using charts in Superset, we looked at its distribution and its relationship with other features. One of the simplest questions one can ask is the average number of pages read per reading day.

We will plot the average value for the page count divided by days across all books to capture the answer to that question. In Superset, we can write Custom SQL code to calculate metrics:

Customizing SQL code to calculate a metric

After clicking on the Metric, select the Custom SQL tab to write AVG(pages/days) as the code:

Average number of pages read each reading day

Information in charts must be easy to understand. The precision...
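It is worth being precise about what AVG(pages/days) computes: the mean of per-book reading rates, which is not the same as total pages divided by total days. The sketch below compares the two on hypothetical rows (the numbers are invented, not taken from the reading log). It also shows the effect of integer division, which some databases apply when both columns are integers. Whether that applies here depends on the column types and the database backing the table, so treat it as an assumption to check.

```python
from statistics import mean

# Hypothetical (pages, days) rows; not real reading-log values.
books = [(250, 7), (400, 10), (120, 3), (600, 30)]

# What AVG(pages/days) expresses: the average of each book's pages per day.
avg_of_ratios = mean(pages / days for pages, days in books)

# A different statistic: all pages read divided by all reading days.
ratio_of_totals = sum(p for p, _ in books) / sum(d for _, d in books)

# If the database truncates integer division, each ratio loses its fraction
# before averaging, which biases the result downward.
avg_truncated = mean(pages // days for pages, days in books)

print(f"average of per-book rates: {avg_of_ratios:.2f}")
print(f"total pages / total days:  {ratio_of_totals:.2f}")
print(f"with integer division:     {avg_truncated:.2f}")
```

The average of ratios weights every book equally, so a short book read quickly counts as much as a long book read slowly. If truncation turns out to be an issue, a common workaround is to cast one operand to a floating-point type inside the Custom SQL expression.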

Summary

That completes our exploration of the page count feature. Together, these charts are enough to build a dashboard that anyone with questions about the page count feature of the dataset will find useful. In the next chapter, we will compile charts with a view to understanding the relationship between two different feature values instead of just focusing on one.
