You're reading from Apache Superset Quick Start Guide

Product type: Book

Published in: Dec 2018

Reading Level: Intermediate

Publisher

ISBN-13: 9781788992244

Edition: 1st Edition

Languages: Python

Tools: Apache Superset

Concepts: Business Intelligence

Author (1): Shashank Shekhar

Visualizing Data in a Column

Tabular data is everywhere, and for most analytics the answers lie in a few important columns. Tables can have many columns, but some are more significant than others. Each column in a tabular dataset represents a distinct feature of the dataset. Once we have identified a column of interest, our goal in this chapter is to make visualizations in Superset that help us explore and interpret its data.

In this chapter, we will understand columnar data through distribution plots, a point-wise comparison with reference columns, and charts that are just one-line summaries:

Distribution: Histogram
Comparison: Distribution box plots for subsets of column values
Comparison: Compare distributions of columns with values belonging to different scales
Comparison: Compare metrics and distributions between subsets of column values
Summary...

Dataset

A favorite on my blogroll, https://austinrochford.com/, is a good place to discover intuitive explanations of Bayesian machine learning methods. In the December 29, 2017 blog post Quantifying Three Years of Reading (https://austinrochford.com/posts/2017-12-29-quantifying-reading.html), the blogger analyzes changes in their own reading log dataset. The reading log is available as a Google sheet at this link: https://docs.google.com/spreadsheets/d/1wNbJv1Zf4Oichj3-dEQXE_lXVCwuYQjaoyU1gGQQqk4/.

The reading log is a time series dataset, which the blogger frequently updates. I have taken a snapshot of the reading log and saved it in the chapter's GitHub directory, https://github.com/PacktPublishing/Superset-Quick-Start-Guide/Chapter04/. The dataset has been modified by the addition of a new column. You can run the Jupyter Notebook generate_dataset.ipynb to generate the...

Distribution – histogram

After uploading the file as a table, open it for visualization and select the Histogram option. Make sure that start_date is selected as Time Column. The Time window defined between Since and Until must be large enough to include all the books, because we do not want to do any Time window-specific analysis.

Page count is an important feature in this dataset, where each row is a book. It is a numerical value, so let's begin with a distribution plot of page counts, which will give us a sense of the variance in the feature value:

Data form for a histogram chart

The number of bins in a histogram limits the granularity of questions we can answer about the variance of the feature:

Distribution plot of page counts

Because we have set five bins, what is identifiable is that about 41-42 out of 93 books (approx. 44%-45%) have page counts of...
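The effect of the bin count on granularity can also be checked outside Superset. The following is a minimal sketch in plain Python, assuming equal-width bins between the smallest and largest value; the `histogram` helper and the page counts are hypothetical illustrations, not the reading log data or Superset's own implementation.

```python
def histogram(values, bins):
    """Count values into `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1 when all values are equal, to avoid dividing by zero.
    width = (hi - lo) / bins or 1
    counts = [0] * bins
    for v in values:
        # The largest value is placed in the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical page counts, made up for illustration only.
page_counts = [120, 180, 210, 250, 260, 300, 320, 340, 410, 480, 620, 900]

for bins in (3, 6):
    edges, counts = histogram(page_counts, bins)
    print(bins, "bins:", counts)
```

With fewer bins, each bar covers a wider range of page counts, so questions about narrower ranges cannot be answered from the chart alone.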

Comparison – relationship between feature values

Let's say we are curious about a trend where the time taken to read a book increases with respect to the page count.

Books often have a gripping effect on a reader once they find them interesting. So, we cannot expect the number of days taken to read a book to grow in proportion to its number of pages, because books that the reader finds gripping will be read at a faster pace than others. In any dataset, there are samples that are noisy and hard to explain. In this dataset, we will find that some books with lower page counts take more days to finish than books with higher page counts.

It will be useful to look at the number of samples we have available for each group, defined by number of reading days. Select COUNT(*) as Metrics to plot the number_of_books read:

Defining the reading capacity for each group of...
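Selecting COUNT(*) as the metric and grouping by reading days amounts to counting rows per group. A rough sketch of that aggregation in plain Python follows; the column names pages and days are taken from the metric expression used later in this chapter, while the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical (pages, days) rows; the real values live in the reading log.
books = [(320, 14), (180, 7), (450, 14), (210, 7), (600, 30), (150, 7)]

# Rough equivalent of: SELECT days, COUNT(*) AS number_of_books ... GROUP BY days
number_of_books = Counter(days for _pages, days in books)

for days, count in sorted(number_of_books.items()):
    print(f"{days} reading days: {count} book(s)")
```

Groups with only one sample say little about a trend, which is why it helps to check the counts before reading too much into any single group.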

Comparison – box plots for groups of feature values

The previous charts described the relationship between days taken to finish reading a book and page count. Now, we will try to understand the highest page counts in calendar months, where a book was finished after x number of reading days. In the first chart, we plotted the number of samples we have for each group of books, which were completed in the same number of days. There are multiple samples in many groups. Here, we will plot a distribution for multiple samples in each group.

We can define a statistic to summarize the average page counts of books completed in the same calendar month as a book that was completed after x number of days.

We will make a box plot chart as follows:

Parameters to set box plot chart

The data that we are visualizing in this box plot is made using multiple group by operations, because the...
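To make the idea of a distribution per group concrete, here is a minimal sketch that computes the five numbers a box plot typically draws (minimum, first quartile, median, third quartile, maximum) for each group of reading days. The samples are hypothetical, the `box_stats` helper is our own, and the quartile method ("inclusive" from Python's statistics module) is an assumption; Superset's whisker options may use a different convention.

```python
from collections import defaultdict
from statistics import quantiles


def box_stats(values):
    """Return (min, Q1, median, Q3, max) for a list with at least two values."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical (days, pages) samples, made up for illustration only.
samples = [
    (7, 150), (7, 180), (7, 210), (7, 240), (7, 300),
    (14, 320), (14, 450), (14, 400),
]

# First group by: collect the page counts that share the same reading days.
groups = defaultdict(list)
for days, pages in samples:
    groups[days].append(pages)

# Second step: summarize each group's distribution.
summary = {days: box_stats(pages) for days, pages in groups.items()}

for days in sorted(summary):
    print(days, summary[days])
```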

Comparison – side-by-side visualization of two feature values

With a time series dataset, interactions can best be analyzed by plotting two features side by side on a shared Time axis. Let's say we want to see how, on a monthly basis, the page count of books affects the number of books read that month. To do this, we will use the Dual Axis Line Chart. To mark the books finished on the Time axis, select end_date as the Time Column and month as the Time Grain. We select the page count of the longest book read and the number of books read for each month as follows:

Setting the parameters for two feature values

The output for it is displayed as follows:

Side-by-side visualization of two feature values

It is noticeable that the two y axes have repeated values. The range of the left y axis value is 1 to 6, and the range of the right axis value is 100 to 900....
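The two series plotted here are both monthly aggregates over end_date: the count of books and the maximum page count. A minimal sketch of that aggregation in plain Python is shown next; the rows are hypothetical, and only the end_date column name comes from the chart settings above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (end_date, pages) rows, made up for illustration only.
rows = [
    (date(2017, 1, 9), 320),
    (date(2017, 1, 28), 180),
    (date(2017, 2, 14), 640),
    (date(2017, 3, 3), 210),
    (date(2017, 3, 19), 150),
    (date(2017, 3, 30), 410),
]

# Month time grain: bucket every book by the (year, month) it was finished in.
pages_by_month = defaultdict(list)
for end_date, pages in rows:
    pages_by_month[(end_date.year, end_date.month)].append(pages)

# One value per axis: number of books (left) and longest book (right).
per_month = {
    month: (len(pages), max(pages)) for month, pages in pages_by_month.items()
}

for month in sorted(per_month):
    books_read, longest = per_month[month]
    print(month, "books:", books_read, "longest:", longest)
```

Because the two metrics live on very different scales, each one needs its own y axis to stay readable on a shared Time axis.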

Summary statistics – headline

Superset has a chart useful for dashboards called a headline. It is a chart that plots a single metric. Single numbers can answer key questions that we have about datasets.

In the process of analyzing the page count feature using charts in Superset, we looked at its distribution and its relationship with other features. One of the simplest questions one can ask is how many pages are read, on average, per reading day.

We will plot the average value for the page count divided by days across all books to capture the answer to that question. In Superset, we can write Custom SQL code to calculate metrics:

Customizing SQL code to calculate a metric

After clicking on the Metric, select the Custom SQL tab to write AVG(pages/days) as the code:

Average number of pages read each reading day

Information in charts must be easy to understand. The precision...
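One detail worth checking with a metric such as AVG(pages/days) is how the database divides two integer columns. The sketch below uses Python's built-in sqlite3 module with a hypothetical reading_log table and made-up rows; it assumes both columns are stored as integers, in which case SQLite truncates each quotient before averaging. Whether this applies to your table depends on the database and the column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name and rows; pages and days match the metric expression.
conn.execute("CREATE TABLE reading_log (pages INTEGER, days INTEGER)")
conn.executemany(
    "INSERT INTO reading_log VALUES (?, ?)",
    [(300, 7), (100, 3), (250, 4)],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Integer operands: each pages / days quotient is truncated before averaging.
truncated = scalar("SELECT AVG(pages / days) FROM reading_log")

# Multiplying by 1.0 forces floating-point division for every row.
exact = scalar("SELECT AVG(pages * 1.0 / days) FROM reading_log")

# ROUND limits the precision of the number shown in the headline.
rounded = scalar("SELECT ROUND(AVG(pages * 1.0 / days), 1) FROM reading_log")

print(truncated, exact, rounded)
```

If the truncated and exact values differ on your data, an expression such as AVG(pages * 1.0 / days) in the Custom SQL tab is one way to keep the fractional part.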

Summary

That completes our exploration of the page count feature. Together, these charts are enough to build a dashboard that anyone with questions about the page count feature of the dataset will find useful. In the next chapter, we will compile charts with a view to understanding the relationship between two different feature values instead of just focusing on one.


Author (1)

Shashank Shekhar

Shashank Shekhar is a data analyst and open source enthusiast. He has contributed to Superset and pymc3 (the Python Bayesian machine learning library), and maintains several public repositories on machine learning and data analysis projects of his own on GitHub. He heads up the data science team at HyperTrack, where he designs and implements machine learning algorithms to obtain insights from movement data. Previously, he worked at Amino on claims data. He has worked as a data scientist in Silicon Valley for 5 years. His background is in systems engineering and optimization theory, and he carries that perspective when thinking about data science, biology, culture, and history.