Apache Superset Quick Start Guide

Apache Superset Quick Start Guide: Develop interactive visualizations by creating user-friendly dashboards

By Shashank Shekhar
Book Dec 2018 188 pages 1st Edition
eBook
Can$33.99 Can$22.99
Print
Can$41.99


Product Details


Publication date : Dec 19, 2018
Length : 188 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781788992244
Vendor : Apache

Apache Superset Quick Start Guide

Getting Started with Data Exploration

Apache Superset is a web platform for creating data visualizations and telling stories with data using dashboards. Packing visualizations into a dashboard is fun, and dashboards render updates to their visualizations in real time.

The best part is that Superset has a very interactive user experience. Programming knowledge is not required for using Superset.

Superset makes it easy to share and collaborate on data analytics work. It has user roles and permission management built in as core components. This makes it a great choice for collaborative data analysis across a cross-functional team of data analysts, business professionals, and software engineers.

There are all sorts of charts to make in Superset, and many common analytical questions about data can be addressed with them. In this book, we will explore and analyze different types of datasets. In the process, we will try to understand different aspects of Superset.

In this chapter, we will learn about the following:

  • Datasets
  • Installing Superset
  • Sharing Superset
  • Configuring Superset
  • Adding a database
  • Adding a table
  • Creating a chart
  • Uploading a CSV file
  • Configuring a table schema
  • Customizing the visualization
  • Making a dashboard

Datasets

We will be working on a variety of datasets in this book, and we will analyze their data. We will make many charts along the way. Here is how we will go about it:

  • Visualizing data distributions:
    • Headlines
    • Distributions
    • Comparisons
  • Finding trends in time series or multi-feature datasets:
    • Joint distributions with time series data
    • Joint distributions with a size feature
    • Joint distributions
  • Discovering hierarchical and graphical relationships between features:
    • Hierarchical maps
    • Path maps
  • Plotting features with location information on maps:
    • Heatmaps using Mapbox
    • 2D maps using Mapbox
    • 3D maps using MapGL
    • World map

Superset plugs into any SQL database that has a Python SQLAlchemy connector, such as PostgreSQL, MySQL, SQLite, and Snowflake. The data stored in the database is fetched when charts are rendered. The documentation for most databases lists the Python SQLAlchemy connector they require.
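The connector determines the SQLAlchemy URI you will later paste into Superset's database form. Here is a quick sketch of what those URIs look like; the users, hosts, and paths are placeholders, and the parsing uses only the standard library rather than SQLAlchemy itself:

```python
from urllib.parse import urlsplit

# Typical SQLAlchemy URIs that Superset accepts; credentials and hosts
# below are placeholders, not real endpoints.
uris = [
    "postgresql://analyst:secret@db-host:5432/analytics",
    "mysql://analyst:secret@db-host:3306/analytics",
    "sqlite:////home/analyst/.superset/upload_csv.db",
    "bigquery://",  # BigQuery authenticates via a key file, not the URI
]

for uri in uris:
    parts = urlsplit(uri)  # the scheme identifies the database dialect
    print(parts.scheme, parts.hostname, parts.port, parts.path)
```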

In this book, we will use Google BigQuery and PostgreSQL as our databases. Our datasets will be public tables from Google BigQuery and .csv files from a variety of web resources, which we will upload to PostgreSQL. The datasets cover topics such as Ethereum, globally traded commodities, airports, flight routes, and a reading list of books. Because the generating process for each of these datasets is different, it will be interesting to visualize and analyze them.

Hopefully, the experience that we gain over the course of this book will help us become effective at using Superset for data visualization and dashboarding.

Installing Superset

Let's get started by making a Superset web app server. We will cover security, user roles, and permissions for the web app in the next chapter.

Instead of a local machine, one can also choose to set up Superset in the cloud. This way, we can even share our Superset web app with authenticated users via an internet browser (for example, Firefox or Chrome).

We will be using Google Compute Engine (GCE) for the Superset server. You can use the link https://console.cloud.google.com and set up your account.

After you have set up your account, go to the URL https://console.cloud.google.com/apis/credentials/serviceaccountkey to download a file, `<project_id>.json`. Save this somewhere safe. This is the Google Cloud authorization JSON key file. We will copy the contents of this file to our GCE instance after we launch it. Superset uses the information in this file to authenticate itself to Google BigQuery.

GCE instances are very easy to configure and launch, and anyone with a Google account can use them. After logging in to your Google account, use this URL: https://console.cloud.google.com/compute/instances. Here, launch a g1-small (1 vCPU, 1.7 GB memory) instance with default settings. When setting up Superset for a large number of concurrent users (more than five), choose an instance with higher compute power.

After launching, on the VM instances screen we can see our g1-small GCE instance is up and running:

GCE dashboard on Google Cloud Platform

Sharing Superset

We will need to share our Superset web app with others, and for that we will have to figure out the URL users can use to access it through their internet browsers.

The standard format of a web server URL is http://{address}:{port number}.

The default port for Superset is 8088. On a locally run Superset web app server, the address is localhost. Servers on internal networks are available on their internal IP address. Web apps on cloud services such as GCE or Amazon Elastic Compute Cloud (EC2) have the machine's external IP as the address.
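A tiny helper makes the URL rule concrete (the function name here is ours, purely illustrative):

```python
def superset_url(address: str, port: int = 8088) -> str:
    """Build the base URL for a Superset server; 8088 is Superset's default port."""
    return f"http://{address}:{port}"

print(superset_url("localhost"))        # a locally run server
print(superset_url("35.233.177.180"))   # a GCE external IP, as in the screenshot below
```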

On GCE's VM instances screen, an external IP is displayed for each instance that is started. A new external IP is generated for every new instance. In the following screenshot, the external IP is 35.233.177.180. To share the server with registered users on the internet, make a note of the external IP shown on your own screen:


The sidebar on Google Cloud Platform

To allow users to access the port, we need to go to VPC network | Firewall rules and Create a firewall rule that will open port 8088 for users. We can use the field values shown in the following screenshot for the rule:

Firewall rule setup

Now, we are ready to install Superset!

Before we proceed, use the ssh option to open a Terminal that is connected to the GCE instance while staying inside your browser. This is one of the many amazing features of GCE.

In the Terminal, we will run some commands to install the dependencies and configure Superset for our first dashboard:

# 1) Install os-level dependencies
sudo apt-get install build-essential libssl-dev libffi-dev python-dev python-pip libsasl2-dev libldap2-dev
# 2) Check for Python 2.7
python --version
# 3) Install pip
wget https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
# 4) Install virtualenv
sudo pip install --upgrade virtualenv
# 5) Install virtualenvironment manager
sudo pip install virtualenvwrapper
source /usr/local/bin/virtualenvwrapper.sh
echo 'source /usr/local/bin/virtualenvwrapper.sh' >> ~/.bash_profile
# 6) Make virtual environment
mkvirtualenv supervenv
# 7) Install superset and virtualenv in the new virtual environment
(supervenv) pip install superset
(supervenv) pip install virtualenv virtualenvwrapper
# 8) Install database connector
(supervenv) pip install pybigquery
# 9) Create and open an authentication file for BigQuery
(supervenv) vim ~/.google_cdp_key.json
# 10) Copy and paste the contents of <project_id>.json key file to ~/.google_cdp_key.json
# 11) Load the new authentication file
(supervenv) echo 'export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.google_cdp_key.json"' >> ~/.bash_profile
(supervenv) source ~/.bash_profile

Configuring Superset

Superset uses the Flask-AppBuilder framework (via its fabmanager command-line tool) to store and manage data for authentication, user permissions, and user roles.

After installing fabmanager in the Python virtual environment, we use the create-admin command in fabmanager and specify Superset as the app. The Flask-AppBuilder framework will create a metadata database using SQLite by default in the ~/.superset location:

# On the Terminal to setup FlaskAppBuilder for superset on GCE
# Create an admin user (you will be prompted for a username, first and last name, and a password)
(supervenv) fabmanager create-admin --app superset

After creating the admin user for the Superset app, we have to run the following commands to create tables and update columns in the metadata database:

# Initialize the database
(supervenv) superset db upgrade

# Creates default roles and permissions
(supervenv) superset init

We can do a sanity check to verify that the metadata database has been created in the expected location. For this, we install sqlite3 to query the SQLite metadata database:

# Install sqlite3
(supervenv) sudo apt-get install sqlite3
# Navigate to the home directory
(supervenv) cd ~/.superset
# Verify database is created
(supervenv) sqlite3 superset.db
sqlite> .tables
ab_permission annotation_layer logs
ab_permission_view clusters metrics
ab_permission_view_role columns query
ab_register_user css_templates saved_query
ab_role dashboard_slices slice_user
ab_user dashboard_user slices
ab_user_role dashboards sql_metrics
ab_view_menu datasources table_columns
access_request dbs tables
alembic_version favstar url
annotation keyvalue

Finally, let's start the Superset web server:

# run superset webserver
(supervenv) superset runserver

Go to http://<your_machines_external_ip>:8088 in your Chrome or Firefox web browser, substituting the external IP of your GCE instance. Open the web app and log in with the admin credentials you entered when running fabmanager's create-admin command.

After the login screen, you will see the welcome screen of your Superset web app:

Dashboards list

Adding a database

The navigation bar lists all the features. The Sources section is where you will create and maintain database integrations and configure table schemas to use as sources of data.

Any SQL database that has a SQLAlchemy connector, such as PostgreSQL, MySQL, SQLite, or Snowflake, can work with Superset.

Depending on the databases that we connect to Superset, the corresponding SQLAlchemy connectors have to be installed:

Database     PyPI package
MySQL        mysqlclient
PostgreSQL   psycopg2
Presto       pyhive
Hive         pyhive
Oracle       cx_oracle
SQLite       (included in Superset)
Snowflake    snowflake-sqlalchemy
Redshift     sqlalchemy-redshift
MS SQL       pymssql
Impala       impyla
Spark SQL    pyhive
Greenplum    psycopg2
Athena       PyAthenaJDBC>1.0.9
Vertica      sqlalchemy-vertica-python
ClickHouse   sqlalchemy-clickhouse
Kylin        kylinpy
BigQuery     pybigquery

It is recommended that you use a database that supports the creation of views. Superset does not support table joins, so when columns from more than one table are needed for a visualization, create a view of the join in the database and visualize that view in Superset.
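Here is that workflow in miniature, using SQLite and toy tables (the table and view names are invented for the example): the view performs the join once at the database level, and Superset would then query the view like any other table.

```python
import sqlite3

# Two toy tables joined once, at the database level, through a view.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Shashank Shekhar');
    INSERT INTO books VALUES (1, 1, 'Apache Superset Quick Start Guide');
    CREATE VIEW books_with_authors AS
        SELECT b.title, a.name AS author
        FROM books b JOIN authors a ON a.id = b.author_id;
""")

# Superset would query the view as if it were a plain table.
rows = con.execute("SELECT title, author FROM books_with_authors").fetchall()
print(rows)  # [('Apache Superset Quick Start Guide', 'Shashank Shekhar')]
```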

SQL queries for fetching data are executed at the database level; Superset only fetches the results and renders the visualization. A database with a query execution engine that scales with your data will make your dashboards feel closer to real time.

In this book, we will work with public datasets available in Google BigQuery. We have already installed a connector for BigQuery in our installation routine, using the pip install pybigquery command, and we have set up authentication for BigQuery using a key file. You can verify this by confirming that the environment variable points to a valid key file:

echo $GOOGLE_APPLICATION_CREDENTIALS
# It should return
> /home/<your user name>/.google_cdp_key.json

Now, let's add BigQuery as a database in three steps:

  1. Select the Databases option from the drop-down list and create (+) your first database
  2. Set Database to superset-bigquery and SQLAlchemy URI to bigquery://
  3. Save the database

You can verify the database connection by clicking on the Test Connection button; it should return Seems OK! as follows:

Seems OK! dialog box is generated when test connection to database is successful

Adding a table

We will add the questions table from the Stack Overflow public dataset at https://cloud.google.com/bigquery/public-data/stackoverflow in three steps:

  1. Select the Tables option from the drop-down list, and create your first table
  2. Set values in Database to superset-bigquery and Table Name to bigquery-public-data.stackoverflow.posts_questions
  3. Save the table:
Select the database and insert the table name identifier in the form

Creating a visualization

That was smooth! You were able to add your first database and table to Superset. Now, it's time for the fun part: visualizing and analyzing the data. In the tables list, we will find bigquery-public-data.stackoverflow.posts_questions listed as follows:

List Tables shows all available tables that can be used to make charts

When you click on it, it will take you to the chart UI:

Options available to modify the data visualized in the chart

Here, we will make a time series plot of the number of questions posted by year. In the Data tab, the Time section is used to restrict data by a temporal column value. We do not want to restrict data for the time series plot, so we can clear the Since field.
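Under the hood, a year time grain with a COUNT(*) metric boils down to a group-by on the truncated date. A miniature stand-in using SQLite (the creation_date column name mirrors the BigQuery table, but the rows here are toy data):

```python
import sqlite3

# Toy stand-in for bigquery-public-data.stackoverflow.posts_questions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posts_questions (creation_date TEXT)")
con.executemany("INSERT INTO posts_questions VALUES (?)",
                [("2008-09-15",), ("2008-11-02",), ("2013-01-20",)])

# Year time grain + COUNT(*) metric, expressed as plain SQL.
rows = con.execute("""
    SELECT strftime('%Y', creation_date) AS year, COUNT(*) AS n
    FROM posts_questions
    GROUP BY year
    ORDER BY year
""").fetchall()
print(rows)  # [('2008', 2), ('2013', 1)]
```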

In order to add axis labels to the line chart, select the Style tab and add descriptions in the X Axis Label and Y Axis Label fields:

Style form for the chart

Set year as Time Grain and COUNT(*) as the Metrics. Finally, hit Run Query:

Line chart showing total number of questions posted on Stack Overflow from 2008-2018

We have our first visualization! We can see how the number of questions grew quickly from 2008 to 2013. Now, Save the visualization, so that we can add it to our dashboard later:

A name and dashboard can be assigned to the chart on the Save form

Uploading a CSV

In many types of analytical work, data is available in CSV or Excel files and not in a database. You can use the Upload a CSV feature to upload CSVs as tables in Superset, without parent database integration.

We will get some real data to test this. Let's download the Ethereum transaction history from http://etherscan.io and create a new table:

curl https://etherscan.io/chart/tx?output=csv > /tmp/eth_txn.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 35279    0 35279    0     0    98k      0 --:--:-- --:--:-- --:--:--   98k

# create a sqlite database to store the csv
cd ~/.superset
# this will create a sqlite database; quit after it opens the console
sqlite3 upload_csv.db
Edit Database details form

Once you have created the upload_csv database integration, make sure you select it when you are uploading the .csv file, as shown in the following screenshot:

Load CSV form

Configuring the table schema

The List Columns tab on the Edit Table form lets you configure the Column attributes:

Edit column properties after adding a new table

Customizing the visualization

The Ethereum dataset has a Date (UTC) column, a UnixTimestamp column, and a Value column representing the total transaction volume in USD on that date. Let's chart the Value column in the latest Ethereum transaction history data:

Data form for the Ethereum transaction volume chart

The Data form calculates the sum of transactions grouped by date. Value is the only column over which the SUM(Value) aggregate function is computed:

Table chart with total value of Ethereum over the years

The sum of transaction values, grouped by the Date column value and sorted in descending order, shows that the busiest days in the Ethereum network are also the most recent.
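The same aggregation can be sketched in a few lines of plain Python over a CSV with the columns described above (the rows here are made-up values in the shape of eth_txn.csv, not real Etherscan data):

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the shape of eth_txn.csv: Date(UTC), UnixTimestamp, Value.
data = io.StringIO(
    "Date(UTC),UnixTimestamp,Value\n"
    "12/1/2018,1543622400,500000\n"
    "12/1/2018,1543640000,250000\n"
    "11/30/2018,1543536000,100000\n"
)

totals = defaultdict(int)
for row in csv.DictReader(data):
    totals[row["Date(UTC)"]] += int(row["Value"])  # SUM(Value) grouped by date

# Sort descending by total, as the chart does.
for date, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(date, total)
```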

Making a dashboard

Making a dashboard in Superset is quick and easy. Just go to Dashboards and create a new dashboard. In the form, fill in the Title and a string value in the Slug field, which will be used to create the dashboard's URL, and hit Save:

Edit Dashboard form

Open the dashboard and select the Edit Dashboard option. Because we have two seemingly unrelated datasets, we can use the Tabs dashboard component to see them one at a time:

Insert components list when editing dashboard

Once you have added a Tabs component, insert the two charts you just made using the Your charts & filters option:

Dashboard for Chapter 1: Getting Started

The dashboard URL syntax is http://{address}:{port number}/superset/dashboard/getting-started. Replace the address and port number variables with the appropriate values, and you can use this link to open or share your dashboard.

Our dashboard is ready and live for users with accounts on the web server. In the next chapter, we will learn about user roles. After that, you will be able to get your favorite collaborators to register. With them on board, you can start collaborating on charts and dashboards for your data analysis projects.

Summary

That must have felt productive, since we were able to create our dashboard from nothing in Superset.

Before we summarize what we have just finished in this chapter, it is important that we discuss when Superset might not be the right visualization tool for a data analysis project.

Visualization of data requires data aggregation. Data aggregation is a function of one or more column values in tables. A group by operation is applied on a particular column to create groups of observations, which are then replaced with the summary statistics defined by the data aggregation function. Superset provides many data aggregation functions; however, it has limited usability when hierarchical data aggregation is required for visualizations.

Hierarchical data aggregation is the process of taking a large number of rows in a table and displaying summaries of partitions and their sub-partitions. This is not an option in Superset for most visualizations.
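To make the limitation concrete, here is the kind of two-level summary (partition totals plus sub-partition breakdowns) that is awkward to express in most Superset charts, sketched with toy sales rows:

```python
from collections import defaultdict

# Toy rows: (region, country, amount) -- partition by region,
# sub-partition by country.
rows = [("EU", "DE", 10), ("EU", "FR", 5), ("NA", "US", 20), ("NA", "CA", 5)]

tree = defaultdict(lambda: defaultdict(int))
for region, country, amount in rows:
    tree[region][country] += amount

# Each region's total alongside its per-country breakdown.
for region, countries in sorted(tree.items()):
    print(region, sum(countries.values()), dict(countries))
```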

Also, Superset has limited customization options on the design and formatting of visualizations. It supports changes in color schemes and axis label formatting. Individuals or teams who want to tinker and optimize the visual representation of their data will find Superset very limited for their needs.

Finally, it's time to summarize our achievements. We have been able to install Superset, add a database, create a dashboard, and share it with users. We are now ready to add additional databases and tables, and create new visualizations and dashboards. Exploring data and telling data stories with Superset dashboards is one of your skill sets now!


Key benefits

  • Work with Apache Superset's rich set of data visualizations
  • Create interactive dashboards and tell stories with data
  • Easily explore data

Description

Apache Superset is a modern, open source, enterprise-ready business intelligence (BI) web application. With the help of this book, you will see how Superset integrates with popular databases like PostgreSQL, Google BigQuery, Snowflake, and MySQL. You will learn to create real-time data visualizations and dashboards on modern web browsers for your organization using Superset. First, we look at the fundamentals of Superset and get it up and running, covering the requisite installation, configuration, and deployment. Then, we discuss different columnar data types, analytics, and the visualizations available. You'll also see the security tools available to the administrator to keep your data safe. You will learn how to visualize relationships as graphs instead of coordinates on plain orthogonal axes, which will help you upload your own entity relationship dataset and analyze it in new, different ways. You will also see how to analyze geographical regions by working with location data. Finally, we cover a set of tutorials on dashboard designs frequently used by analysts, business intelligence professionals, and developers.

What you will learn

  • Get to grips with the fundamentals of data exploration using Superset
  • Set up a working instance of Superset on cloud services like Google Compute Engine
  • Integrate Superset with SQL databases
  • Build dashboards with Superset
  • Calculate statistics in Superset for numerical, categorical, or text data
  • Understand visualization techniques, filtering, and grouping by aggregation
  • Manage user roles and permissions in Superset
  • Work with SQL Lab


Table of Contents

10 Chapters
Preface
Getting Started with Data Exploration
Configuring Superset and Using SQL Lab
User Authentication and Permissions
Visualizing Data in a Column
Comparing Feature Values
Drawing Connections between Entity Columns
Mapping Data That Has Location Information
Building Dashboards
Other Books You May Enjoy

