You're reading from  Apache Superset Quick Start Guide

Product type: Book
Published in: Dec 2018
Reading level: Intermediate
ISBN-13: 9781788992244
Edition: 1st Edition
Author: Shashank Shekhar

Shashank Shekhar is a data analyst and open source enthusiast. He has contributed to Superset and pymc3 (the Python Bayesian machine learning library), and maintains several public repositories on machine learning and data analysis projects of his own on GitHub. He heads up the data science team at HyperTrack, where he designs and implements machine learning algorithms to obtain insights from movement data. Previously, he worked at Amino on claims data. He has worked as a data scientist in Silicon Valley for 5 years. His background is in systems engineering and optimization theory, and he carries that perspective when thinking about data science, biology, culture, and history.

Comparing Feature Values

Given a table with many columns, understanding the range and simple statistics of the values in each column often prompts curiosity about how different features affect one another. Relationships between features are commonly modeled with correlation measures. Formulating and computing correlations between features in a dataset can be a complex problem. Sometimes, joint distribution plots capture and visualize these relationships very well.
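As a concrete instance of a correlation measure, the following minimal sketch computes the Pearson correlation coefficient between two features using only the Python standard library. The function name `pearson` and the sample values are illustrative assumptions; they are not taken from the chapter's datasets or from Superset itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only (hypothetical features).
feature_a = [1.0, 2.0, 3.0, 4.0, 5.0]
feature_b = [2.1, 3.9, 6.2, 8.0, 9.8]
r = pearson(feature_a, feature_b)
print(f"Pearson r = {r:.3f}")
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests none. Pearson's coefficient only captures linear association, which is one reason joint plots remain useful.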

We can visualize multiple features for every row at once as points on a chart. The bubble chart in Superset can plot one feature on the y axis against a timeline on the x axis. A second feature is color-coded, and a third feature is reflected as the bubble size for a group of one or more rows in the dataset. In this chapter, we will make the following charts...

Dataset

We will be working with trading data on commodities in this chapter. The Federal Reserve Bank of St. Louis, United States, compiles data on commodities. Datasets are available at http://fred.stlouisfed.org. You can obtain time series data on the import values and import volumes of commodities traded by the United States. We will download data on bananas, olive oil, sugar, uranium, cotton, oranges, wheat, aluminium, iron, and corn.

Inside the chapter directory of the GitHub repository, you will find the generate_dataset.ipynb Jupyter Notebook. Just run the Notebook to download, transform, and generate the two CSV files we will upload. If you want to skip running the Notebook, the two CSV files, fsb_st_louis_commodities.csv and usda_oranges_and_bananas_data.csv, are also present in the repository, ready for upload.

The FSB data on commodity prices in fsb_st_louis_commodities...

Comparing multiple time series

The time series line chart is useful for visualizing the price trends for every type of commodity together. Using the first dataset that was uploaded, we will visualize prices of commodities over time on the x axis and see how they compare against each other, as follows:

Setting the parameters for the time series chart

Remember to clear the time thresholds in the Time section. Then, select feature as the Group by value, AVG(value) as Metrics, and render the graph:

The time series line chart for all values
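The Group by and Metrics settings above amount to averaging `value` per `feature` at each time step. A minimal sketch of that aggregation follows, assuming rows shaped as dictionaries. The `feature` and `value` names come from the chart settings, while the `date` column name and the sample rows are hypothetical. This illustrates the idea; it is not the query Superset generates.

```python
from collections import defaultdict


def average_by_group(rows, time_key="date", group_key="feature", value_key="value"):
    """Return {group: {time: mean value}}, mimicking Group by + AVG(value)."""
    # For each (group, time) cell keep a running [total, count] pair.
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for row in rows:
        cell = sums[row[group_key]][row[time_key]]
        cell[0] += row[value_key]
        cell[1] += 1
    return {
        group: {t: total / count for t, (total, count) in sorted(by_time.items())}
        for group, by_time in sums.items()
    }


# Hypothetical rows; the "date" column name is an assumption.
rows = [
    {"date": "2018-01", "feature": "bananas", "value": 1.0},
    {"date": "2018-01", "feature": "bananas", "value": 3.0},
    {"date": "2018-02", "feature": "bananas", "value": 2.0},
    {"date": "2018-01", "feature": "oranges", "value": 4.0},
]
series = average_by_group(rows)
```

Each key of `series` corresponds to one line on the chart, and each inner mapping gives that line's points in time order.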

The tooltip shows the y axis price values for each commodity type and the units used. Notice that the most expensive commodities have mostly non-overlapping price ranges. The data extends from January 1980 to June 2018. Below the expensive commodities, bananas and oranges have largely overlapping price ranges. It will be easier to compare...

Comparing two time series

Stacked charts are often useful for measuring the combined area covered and relative differences in y axis values for two or more series. We will use the time series stacked chart to compare the prices of oranges and bananas:

Setting parameters for the time series stacked chart

The Style section of the chart provides a stream style option. The width of each stream is proportional to the value in that category:

Time series stacked chart
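To make the idea of stacking concrete, the sketch below computes the lower and upper boundary of each band when series are stacked on a zero baseline, so that a band's width at each time step equals that series' value. The function name `stack` and the sample numbers are illustrative assumptions. A stream layout typically also shifts the baseline away from zero, which this sketch does not attempt.

```python
def stack(series):
    """Stack series on a zero baseline.

    series: {name: [values]} where all lists have the same length.
    Returns {name: [(lower, upper), ...]} with upper - lower == value.
    """
    lengths = {len(values) for values in series.values()}
    if len(lengths) > 1:
        raise ValueError("all series must have the same length")
    n = lengths.pop() if lengths else 0
    baseline = [0.0] * n
    bands = {}
    for name, values in series.items():
        # Each band sits on top of the running total of the bands below it.
        upper = [b + v for b, v in zip(baseline, values)]
        bands[name] = list(zip(baseline, upper))
        baseline = upper
    return bands


# Made-up prices for illustration only.
bands = stack({"bananas": [1.0, 2.0], "oranges": [3.0, 1.0]})
```

The upper boundary of the last band is the combined total, which is what the outline of a stacked chart traces.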

In the stacked chart, the increase in the price of both bananas and oranges is visualized through the increasing width of each stream. Since 2010, the color-coded streams show that oranges have had a relatively higher price variance than bananas. We can switch to the expand style and see whether, besides the higher price variation, oranges show a stronger upward trend in prices:

Changing the variation

After switching to expand...
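An expand-style layout is generally understood as rescaling the series so that they sum to 1 at every time step, which turns absolute prices into relative shares. The following minimal sketch shows that normalization under this assumption. The function name `expand` and the sample numbers are hypothetical, and the code is not taken from Superset.

```python
def expand(series):
    """Normalize series so that values at each time step sum to 1.

    series: {name: [values]} where all lists have the same length.
    Returns {name: [share, ...]}; a time step with zero total yields 0.0.
    """
    lengths = {len(values) for values in series.values()}
    if len(lengths) > 1:
        raise ValueError("all series must have the same length")
    n = lengths.pop() if lengths else 0
    totals = [sum(values[i] for values in series.values()) for i in range(n)]
    return {
        name: [v / t if t else 0.0 for v, t in zip(values, totals)]
        for name, values in series.items()
    }


# Made-up prices for illustration only.
shares = expand({"bananas": [1.0, 2.0], "oranges": [3.0, 2.0]})
```

In this form, a rising share for one series means its price grew faster than the other's, even if both prices rose in absolute terms.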

Summary

With two datasets, we were able to compare the prices of food commodities. We then took a closer look at the import prices of oranges and bananas in the United States. We made use of five chart types that gave us a better understanding of how banana prices relate to orange prices, although we did not attempt to quantify the relationship between banana and orange import prices. Still, we were able to see that they differ in a significant way.

In the next chapter, we will visualize relationships as graphs instead of coordinates on orthogonal axes. This will help us to visualize features in a dataset connected in a network.
