Apache Superset Quick Start Guide
Apache Superset Quick Start Guide: Develop interactive visualizations by creating user-friendly dashboards

By Shashank Shekhar
Book Dec 2018 188 pages 1st Edition

Product Details


Publication date : Dec 19, 2018
Length 188 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781788992244
Vendor :
Apache

Getting Started with Data Exploration

Apache Superset is a web platform for creating data visualizations and telling stories with data using dashboards. Packing visualizations in a dashboard is fun, and dashboards render updates to the visualizations in real time.

The best part is that Superset has a very interactive user experience. Programming knowledge is not required for using Superset.

Superset makes it easy to share and collaborate on data analytics work. It has user roles and permission management built into it as core components. This makes it a great choice for data analysis work collaboration between a cross functional team of data analysts, business professionals, and software engineers.

There are all sorts of charts to make on Superset. Many common analytical questions on data can be addressed using the charts, which are easy to use. In this book, we will do data exploration and analysis of different types of datasets. In the process, we will try to understand different aspects of Superset.

In this chapter, we will learn about the following:

  • Datasets
  • Installing Superset
  • Sharing Superset
  • Configuring Superset
  • Adding a database
  • Adding a table
  • Creating a chart
  • Uploading a CSV file
  • Configuring a table schema
  • Customizing the visualization
  • Making a dashboard

Datasets

We will be working on a variety of datasets in this book, and we will analyze their data. We will make many charts along the way. Here is how we will go about it:

  • Visualizing data distributions:
    • Headlines
    • Distributions
    • Comparisons
  • Finding trends in time series or multi-feature datasets:
    • Joint distributions with time series data
    • Joint distributions with a size feature
    • Joint distributions
  • Discovering hierarchical and graphical relationships between features:
    • Hierarchical maps
    • Path maps
  • Plotting features with location information on maps:
    • Heatmaps using Mapbox
    • 2D maps using Mapbox
    • 3D maps using MapGL
    • World map

Superset plugs into any SQL database that has a Python SQLAlchemy connector, such as PostgreSQL, MySQL, SQLite, MongoDB, and Snowflake. The data stored in any of the databases is fetched for making charts. Each database needs its own Python SQLAlchemy connector package, which its documentation usually names.

In this book, we will use Google BigQuery and PostgreSQL as our databases. Our datasets will be public tables from Google BigQuery and .csv files from a variety of web resources, which we will upload to PostgreSQL. The datasets cover topics such as Ethereum, globally traded commodities, airports, flight routes, and a reading list of books. Because the generating process for each of these datasets is different, it will be interesting to visualize and analyze them.

Hopefully, the experience that we will gain over the course of this book will help us in becoming effective at using Superset for data visualization and dashboarding.

Installing Superset

Let's get started by making a Superset web app server. We will cover security, user roles, and permissions for the web app in the next chapter.

Instead of a local machine, one can also choose to set up Superset in the cloud. This way, we can even share our Superset web app with authenticated users via an internet browser (for example, Firefox or Chrome).

We will be using Google Compute Engine (GCE) for the Superset server. You can use the link https://console.cloud.google.com and set up your account.

After you have set up your account, go to the URL https://console.cloud.google.com/apis/credentials/serviceaccountkey to download a file, `<project_id>.json`. Save this somewhere safe. This is the Google Cloud authorization JSON key file. We will copy the contents of this file to our GCE instance after we launch it. Superset uses the information in this file to authenticate itself to Google BigQuery.

GCE instances are very easy to configure and launch. Anyone with a Google account can use them. After logging in to your Google account, use this URL: https://console.cloud.google.com/compute/instances. Here, launch a g1-small (1 vCPU, 1.7 GB memory) instance with default settings. When we have to set up Superset for a large number of concurrent users (greater than five), we should choose higher compute power instances.

After launching, on the VM instances screen we can see our g1-small GCE instance is up and running:

GCE dashboard on Google Cloud Platform

Sharing Superset

We will need to share our Superset web app with others, and for that we will have to figure out the URL users can use to access it through their internet browsers.

The standard format of a web server URL is http://{address}:{port number}.

The default port for Superset is 8088. On a locally run Superset web app server, the address is localhost. Servers on internal networks are available on their internal IP address. Web apps on cloud services such as GCE or Amazon Elastic Compute have the machine's external IP as the address.

On GCE's VM instances screen, an external IP is displayed for each instance that is started. A new external IP is generated for every new instance. In the following screenshot, the external IP specified is 35.233.177.180. To share the server with registered users on the internet, we make a note of the external IP on our own screens:


The sidebar on Google Cloud Platform

To allow users to access the port, we need to go to VPC network | Firewall rules and Create a firewall rule that will open port 8088 for users. We can use the field values shown in the following screenshot for the rule:

Firewall rule setup

Now, we are ready to install Superset!

Before we proceed, use the ssh option to open a Terminal that is connected to the GCE instance while staying inside your browser. This is one of the many amazing features of GCE.

In the Terminal, we will run some commands to install the dependencies and configure Superset for our first dashboard:

# 1) Install os-level dependencies
sudo apt-get install build-essential libssl-dev libffi-dev python-dev python-pip libsasl2-dev libldap2-dev
# 2) Check for Python 2.7
python --version
# 3) Install pip
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
# 4) Install virtualenv
sudo pip install --upgrade virtualenv
# 5) Install virtualenvironment manager
sudo pip install virtualenvwrapper
source /usr/local/bin/virtualenvwrapper.sh
echo 'source /usr/local/bin/virtualenvwrapper.sh' >> ~/.bash_profile
# 6) Make virtual environment
mkvirtualenv supervenv
# 7) Install superset and virtualenv in the new virtual environment
(supervenv) pip install superset
(supervenv) pip install virtualenv virtualenvwrapper
# 8) Install database connector
(supervenv) pip install pybigquery
# 9) Create and open an authentication file for BigQuery
(supervenv) vim ~/.google_cdp_key.json
# 10) Copy and paste the contents of <project_id>.json key file to ~/.google_cdp_key.json
# 11) Load the new authentication file
(supervenv) echo 'export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.google_cdp_key.json"' >> ~/.bash_profile
(supervenv) source ~/.bash_profile

Configuring Superset

Superset uses the Flask-AppBuilder framework (fabmanager) to store and manage data for authentication, user permissions, and user roles in Superset.

After installing fabmanager in the Python virtual environment, we use the create-admin command in fabmanager and specify Superset as the app. The Flask-AppBuilder framework will create a metadata database using SQLite by default in the ~/.superset location:

# On the Terminal to setup FlaskAppBuilder for superset on GCE
# Create an admin user (you will be prompted to set username, first and last name before setting a password)
(supervenv) fabmanager create-admin --app superset

After creating the admin user for the Superset app, we have to run the following commands to create tables and update columns in the metadata database:

# Initialize the database
(supervenv) superset db upgrade

# Creates default roles and permissions
(supervenv) superset init

We can do a sanity check to verify that the metadata database has been created in the expected location. For this, we install sqlite3 to query the SQLite metadata database:

# Install sqlite3
(supervenv) sudo apt-get install sqlite3
# Navigate to the home directory
(supervenv) cd ~/.superset
# Verify database is created
(supervenv) sqlite3
sqlite> .open superset.db
sqlite> .tables
ab_permission            annotation_layer  logs
ab_permission_view       clusters          metrics
ab_permission_view_role  columns           query
ab_register_user         css_templates     saved_query
ab_role                  dashboard_slices  slice_user
ab_user                  dashboard_user    slices
ab_user_role             dashboards        sql_metrics
ab_view_menu             datasources       table_columns
access_request           dbs               tables
alembic_version          favstar           url
annotation               keyvalue
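
The same sanity check can be scripted. The following is a minimal sketch, assuming Python 3 and the default metadata location mentioned above; `list_tables` is a hypothetical helper name, not part of Superset.

```python
import os
import sqlite3
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database file, sorted."""
    if not os.path.exists(db_path):
        # sqlite3.connect would silently create an empty file, so fail early.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example call (path assumed from the text above):
# list_tables(os.path.expanduser("~/.superset/superset.db"))
```

An empty list, or a `FileNotFoundError`, would suggest that the upgrade and init commands did not run against the expected location.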

Finally, let's start the Superset web server:

# run superset webserver
(supervenv) superset runserver

Go to http://<your_machines_external_ip>:8088 in your Chrome or Firefox web browser. The external IP I used is the one specified for the GCE instance I am using. Open the web app in your browser and log in with the admin credentials you entered when using the create-admin command on fabmanager.
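
If the page does not load, it can help to check whether the port is reachable at all before suspecting Superset itself. The sketch below assumes Python 3 on the machine you are browsing from; `superset_url` and `is_port_open` are hypothetical helper names.

```python
import socket


def superset_url(address: str, port: int = 8088) -> str:
    """Build the base URL in the http://{address}:{port number} format."""
    return f"http://{address}:{port}"


def is_port_open(address: str, port: int = 8088, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: the firewall rule or the
        # web server is the likely culprit.
        return False
```

A `False` result for the external IP, while the same check succeeds for `localhost` on the server, usually points at the firewall rule instead of the web server.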

After the login screen, you will see the welcome screen of your Superset web app:

Dashboards list

Adding a database

The navigation bar lists all the features. The Sources section is where you will create and maintain database integrations and configure table schemas to use as sources of data.

Any SQL database that has a SQLAlchemy connector such as PostgreSQL, MySQL, SQLite, MongoDB, and Snowflake can work with Superset.

Depending on the databases that we connect to Superset, the corresponding SQLAlchemy connectors have to be installed:

| Database | PyPI package |
| --- | --- |
| MySQL | mysqlclient |
| PostgreSQL | psycopg2 |
| Presto | pyhive |
| Hive | pyhive |
| Oracle | cx_oracle |
| SQLite | Included in Superset |
| Snowflake | snowflake-sqlalchemy |
| Redshift | sqlalchemy-redshift |
| MS SQL | pymssql |
| Impala | impyla |
| Spark SQL | pyhive |
| Greenplum | psycopg2 |
| Athena | PyAthenaJDBC>1.0.9 |
| Vertica | sqlalchemy-vertica-python |
| ClickHouse | sqlalchemy-clickhouse |
| Kylin | kylinpy |
| BigQuery | pybigquery |

It is recommended that you use a database that supports the creation of views. When columns from more than one table have to be fetched for visualization, views of those joins can be created in the database and visualized on Superset, because table joins are not supported in Superset.
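
To make the idea of a view concrete, here is a small sketch using Python 3 and an in-memory SQLite database. The `airports` and `routes` tables, their columns, and the view name are invented for illustration; in practice you would create the view in your own database with its own SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE airports (code TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE routes (origin TEXT, destination TEXT, flights INTEGER);
    INSERT INTO airports VALUES ('AMS', 'Amsterdam'), ('DEL', 'Delhi');
    INSERT INTO routes VALUES ('AMS', 'DEL', 7), ('DEL', 'AMS', 5);

    -- The view hides the join, so a charting tool sees one flat table.
    CREATE VIEW routes_with_cities AS
    SELECT r.origin, a.city AS origin_city, r.destination, r.flights
    FROM routes AS r
    JOIN airports AS a ON a.code = r.origin;
    """
)

# Query the view exactly as if it were a table.
rows = conn.execute(
    "SELECT origin_city, destination, flights "
    "FROM routes_with_cities ORDER BY origin"
).fetchall()
```

Once such a view exists, it can be added in Superset by name in the same way as a table.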

SQL query execution for fetching data and rendering visualizations is done at the database level, and Superset only fetches results afterwards. A database whose query execution engine scales with your data will keep your dashboards closer to real time.

In this book, we will work with public datasets available in Google BigQuery. We have already installed a connector for BigQuery in our installation routine, using the pip install pybigquery command. We have set up authentication for BigQuery using a key file. You should verify this by confirming that the environment variable points to the valid key file:

echo $GOOGLE_APPLICATION_CREDENTIALS
# It should return
> /home/<your user name>/.google_cdp_key.json
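
The echo command only shows that the variable is set. A slightly stronger check, sketched below under the assumption of Python 3, also confirms that the file parses as JSON and carries the fields a service account key normally has. The field list and the helper name `check_key_file` are assumptions made for this example.

```python
import json
import os

# Fields assumed to be present in a service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> str:
    """Return the project_id from the key file named by env_var, or raise."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    with open(path, encoding="utf-8") as fh:
        key = json.load(fh)  # raises ValueError if the paste was truncated
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    return key["project_id"]
```

A JSON decoding error here often means the key contents were only partly pasted into the file.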

Now, let's add BigQuery as a database in three steps:

  1. Select the Databases option from the drop-down list and create (+) your first database
  2. Set Database to superset-bigquery and SQLAlchemy URI to bigquery://
  3. Save the database

You can verify the database connection by clicking on the Test Connection button; it should return Seems OK! as follows:

Seems OK! dialog box is generated when test connection to database is successful

Adding a table

We will add the questions table from the Stack Overflow public dataset at https://cloud.google.com/bigquery/public-data/stackoverflow in three steps:

  1. Select the Tables option from the drop-down list, and create your first table
  2. Set values in Database to superset-bigquery and Table Name to bigquery-public-data.stackoverflow.posts_questions
  3. Save the table:
Select the database and insert the table name identifier in the form

Creating a visualization

That was smooth! You were able to add your first database and table to Superset. Now, it's time for the fun part, which is visualizing and analyzing the data. In Table, we will find the bigquery-public-data.stackoverflow.posts_questions listed as follows:

List Tables shows all available tables that can be used to make charts

When you click on it, it will take you to the chart UI:

Options available to modify the data visualized in the chart

Here, we will make a time series plot of the number of questions posted by year. In the Data tab, the Time section is used to restrict data by a temporal column value. We do not want to restrict data for the time series plot. We can clear the Since field.

In order to add axis labels to the line chart, select the Style tab and add descriptions in the X Axis Label and Y Axis Label fields:

Style form for the chart

Set year as Time Grain and COUNT(*) as the Metrics. Finally, hit Run Query:

Line chart showing total number of questions posted on Stack Overflow from 2008-2018
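
Conceptually, a year Time Grain with a COUNT(*) metric is a group-by on the year of the timestamp column. The sketch below reproduces that aggregation on a few invented rows using Python 3 and SQLite. The `creation_date` column name is an assumption, and the SQL that Superset actually sends to BigQuery will differ in dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_questions (id INTEGER, creation_date TEXT)")
conn.executemany(
    "INSERT INTO posts_questions VALUES (?, ?)",
    [(1, "2008-09-01"), (2, "2009-01-15"), (3, "2009-06-30"), (4, "2010-03-02")],
)

# Truncate each timestamp to its year, then count the rows in each group.
questions_per_year = conn.execute(
    """
    SELECT strftime('%Y', creation_date) AS year, COUNT(*) AS questions
    FROM posts_questions
    GROUP BY year
    ORDER BY year
    """
).fetchall()
```

Each resulting row corresponds to one point on the line chart.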

We have our first visualization! We can see how the number of questions grew quickly from 2008-2013. Now, Save the visualization, so that we can add it to our dashboard later:

A name and dashboard can be assigned to the chart on the Save form

Uploading a CSV

In many types of analytical work, data is available in CSV or Excel files and not in a database. You can use the Upload a CSV feature to upload CSVs as tables in Superset, without parent database integration.

We will get some real data to test this. Let's download the Ethereum transaction history from http://etherscan.io and create a new table:

curl "https://etherscan.io/chart/tx?output=csv" > /tmp/eth_txn.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 35279 0 35279 0 0 98k 0 --:--:-- --:--:-- --:--:-- 98k

# create a sqlite database to store the csv
cd ~/.superset
# this will create a sqlite database, quit after it opens the console
sqlite3 upload_csv.db

Edit Database details form
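
The database form asks for a SQLAlchemy URI, and this excerpt does not show the value that was entered. As a hedged sketch, SQLAlchemy's SQLite URIs take the form `sqlite:///` followed by the file path, so an absolute path yields four slashes in total. `sqlite_uri` is a hypothetical helper, written for Python 3 on a POSIX system.

```python
import os


def sqlite_uri(db_path: str) -> str:
    """Build a SQLAlchemy-style URI for a SQLite file from its path."""
    absolute = os.path.abspath(os.path.expanduser(db_path))
    return "sqlite:///" + absolute


# Example call (path assumed from the commands above):
# sqlite_uri("~/.superset/upload_csv.db")
```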

Once you have created the upload_csv database integration, make sure you select it when you are uploading the .csv file, as shown in the following screenshot:

Load CSV form
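
For comparison, here is roughly what loading a .csv file into a SQLite table involves when done by hand. This is a sketch in Python 3, not the code behind the upload form. The function name `load_csv` and the default table name `eth_txn` are made up for the example, and every column is stored without a declared type.

```python
import csv
import sqlite3


def load_csv(csv_path: str, db_path: str, table: str = "eth_txn") -> int:
    """Create `table` from the CSV header and insert every row; return row count."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = [row for row in reader if row]  # skip blank lines
    # Quote identifiers so headers such as "Date(UTC)" are valid column names.
    columns = ", ".join('"' + name.replace('"', '""') + '"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    quoted_table = '"' + table.replace('"', '""') + '"'
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"CREATE TABLE IF NOT EXISTS {quoted_table} ({columns})")
        conn.executemany(f"INSERT INTO {quoted_table} VALUES ({placeholders})", rows)
        conn.commit()
    finally:
        conn.close()
    return len(rows)
```

The upload form additionally lets you choose the target database and table name through the UI, which is why selecting upload_csv there matters.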

Configuring the table schema

The List Columns tab on the Edit Table form lets you configure the Column attributes:

Edit column properties after adding a new table

Customizing the visualization

The Ethereum dataset has a Date (UTC) column, a UnixTimestamp column, and a Value column representing the total transaction volume in USD on that date. Let's chart the Value column in the latest Ethereum transaction history data:

Data form for the Ethereum transaction volume chart

The Data form calculates the sum of transactions grouped by dates. There is only one value over which the SUM(Value) aggregate function is computed:

Table chart with total value of Ethereum over the years

The sum of transaction values, grouped by the Date column value and sorted in descending order, shows that the busiest days in the Ethereum network are also the most recent.
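
The table chart described here amounts to a SUM grouped by date and sorted in descending order. A minimal Python 3 sketch with SQLite follows. The three rows and their values are invented purely to show the shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE eth_txn ("Date(UTC)" TEXT, "Value" REAL)')
conn.executemany(
    "INSERT INTO eth_txn VALUES (?, ?)",
    [("1/1/2018", 100.0), ("1/2/2018", 250.0), ("1/3/2018", 175.0)],
)

# One row per date, so SUM() is computed over a single value in each group.
busiest_days = conn.execute(
    """
    SELECT "Date(UTC)" AS day, SUM("Value") AS total
    FROM eth_txn
    GROUP BY "Date(UTC)"
    ORDER BY total DESC
    """
).fetchall()
```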

Making a dashboard

Making a dashboard in Superset is quick and easy. Just go to Dashboards and create a new dashboard. In the form, fill in the Title and a string value in the Slug field, which will be used to create the dashboard's URL, and hit Save:

Edit Dashboard form

Open the dashboard and select the Edit Dashboard option. Because we have two seemingly unrelated datasets, we can use the Tabs dashboard component to see them one at a time:

Insert components list when editing dashboard

Once you have added a Tabs component, insert the two charts you just made using the Your charts & filters option:

Dashboard for Chapter 1: Getting Started

The dashboard URL syntax is http://{address}:{port number}/superset/dashboard/getting-started. Replace the address and port number variables with the appropriate values, and you can use this link to open or share your dashboard.
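
The substitution can be written as a small helper. This is a sketch for Python 3 and `dashboard_url` is a hypothetical name. The slug is percent-encoded in case it contains characters that are not URL-safe.

```python
from urllib.parse import quote


def dashboard_url(address: str, slug: str, port: int = 8088) -> str:
    """Build http://{address}:{port number}/superset/dashboard/{slug}."""
    return f"http://{address}:{port}/superset/dashboard/{quote(slug, safe='')}"
```

For example, `dashboard_url("localhost", "getting-started")` gives the link for a locally run server on the default port.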

Our dashboard is ready and live for users with accounts on the web server. In the next chapter, we will learn about user roles. After that, you will be able to get your favorite collaborators to register. With them on board, you can start collaborating on charts and dashboards for your data analysis projects.

Summary

That must have felt productive, since we were able to create our dashboard from nothing in Superset.

Before we summarize what we have just finished in this chapter, it is important that we discuss when Superset might not be the right visualization tool for a data analysis project.

Visualization of data requires data aggregation. Data aggregation is a function of one or more column values in tables. A group by operation is applied on a particular column to create groups of observations, which are then replaced with the summary statistics defined by the data aggregation function. Superset provides many data aggregation functions; however, it has limited usability when hierarchical data aggregation is required for visualizations.

Hierarchical data aggregation is the process of taking a large number of rows in a table and displaying summaries of partitions and their sub-partitions. This is not an option in Superset for most of the visualizations.
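
To illustrate what a summary of partitions and sub-partitions looks like, the following Python 3 sketch totals invented values by year and, within each year, by month. The data and the `hierarchical_sum` name are assumptions made for the example.

```python
from collections import defaultdict

# (year, month, value) observations; the numbers are illustrative only.
observations = [
    (2017, 11, 10.0),
    (2017, 12, 20.0),
    (2018, 1, 5.0),
    (2018, 1, 7.0),
    (2018, 2, 3.0),
]


def hierarchical_sum(rows):
    """Return {year: {"total": float, "months": {month: float}}}."""
    summary = {}
    for year, month, value in rows:
        node = summary.setdefault(year, {"total": 0.0, "months": defaultdict(float)})
        node["total"] += value          # summary of the partition (year)
        node["months"][month] += value  # summary of the sub-partition (month)
    return summary


summary = hierarchical_sum(observations)
```

A flat group-by returns only one of these levels at a time, which is the limitation described above.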

Also, Superset has limited customization options on the design and formatting of visualizations. It supports changes in color schemes and axis label formatting. Individuals or teams who want to tinker and optimize the visual representation of their data will find Superset very limited for their needs.

Finally, it's time to summarize our achievements. We have been able to install Superset, add a database, create a dashboard, and share it with users. We are now ready to add additional databases and tables, and create new visualizations and dashboards. Exploring data and telling data stories with Superset dashboards is one of your skill sets now!

Key benefits

  • Work with Apache Superset's rich set of data visualizations
  • Create interactive dashboards and data storytelling
  • Easily explore data

Description

Apache Superset is a modern, open source, enterprise-ready business intelligence (BI) web application. With the help of this book, you will see how Superset integrates with popular databases like Postgres, Google BigQuery, Snowflake, and MySQL. You will learn to create real time data visualizations and dashboards on modern web browsers for your organization using Superset. First, we look at the fundamentals of Superset, and then get it up and running. You'll go through the requisite installation, configuration, and deployment. Then, we will discuss different columnar data types, analytics, and the visualizations available. You'll also see the security tools available to the administrator to keep your data safe. You will learn how to visualize relationships as graphs instead of coordinates on plain orthogonal axes. This will help you when you upload your own entity relationship dataset and analyze the dataset in new, different ways. You will also see how to analyze geographical regions by working with location data. Finally, we cover a set of tutorials on dashboard designs frequently used by analysts, business intelligence professionals, and developers.

What you will learn

  • Get to grips with the fundamentals of data exploration using Superset
  • Set up a working instance of Superset on cloud services like Google Compute Engine
  • Integrate Superset with SQL databases
  • Build dashboards with Superset
  • Calculate statistics in Superset for numerical, categorical, or text data
  • Understand visualization techniques, filtering, and grouping by aggregation
  • Manage user roles and permissions in Superset
  • Work with SQL Lab

Table of Contents

10 Chapters
  • Preface
  • Getting Started with Data Exploration
  • Configuring Superset and Using SQL Lab
  • User Authentication and Permissions
  • Visualizing Data in a Column
  • Comparing Feature Values
  • Drawing Connections between Entity Columns
  • Mapping Data That Has Location Information
  • Building Dashboards
  • Other Books You May Enjoy

