Apache Superset Quick Start Guide

Product type: Book
Published in: Dec 2018
ISBN-13: 9781788992244
Pages: 188
Edition: 1st Edition
Author: Shashank Shekhar

Table of Contents (10 chapters)

  • Preface
  • 1. Getting Started with Data Exploration
  • 2. Configuring Superset and Using SQL Lab
  • 3. User Authentication and Permissions
  • 4. Visualizing Data in a Column
  • 5. Comparing Feature Values
  • 6. Drawing Connections between Entity Columns
  • 7. Mapping Data That Has Location Information
  • 8. Building Dashboards
  • 9. Other Books You May Enjoy

Configuring Superset and Using SQL Lab

Superset has a flexible software architecture, so a Superset setup can be tailored to many different production environment needs. The production environment at Airbnb runs Superset inside Kubernetes and serves 600+ daily users, rendering over 100,000 charts every day.

At the same time, the default settings are sufficient for most users. When launching our first dashboard on a Google Compute Engine (GCE) instance, we did not have to change any of the default parameters.

In this chapter, we will learn about the following:

  • Setting the web server
  • Metadata database
  • Web server
  • Setting up an NGINX reverse proxy
  • Setting up HTTPS or SSL certification
  • Flask-AppBuilder permissions
  • Securing session data
  • Caching queries
  • Mapbox access token
  • Long-running queries
  • Upgrading Superset
  • Main configuration file
  • SQL Lab
...

Setting the web server

Start the Superset web server with this command:

superset runserver

Superset loads the configuration from a superset_config.py Python file. This file must be present in the path stored in the SUPERSET_CONFIG_PATH environment variable. The configuration variables present in this config file will override their default values. Superset uses the default values for variables not defined in the file.

So to configure the application, we need to create a Python file. After creating the Python file, we need to update SUPERSET_CONFIG_PATH to include the file path.
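
The override behavior described above can be pictured with a short sketch. This is not Superset's actual loader; DEFAULTS, load_config, and the sample values are hypothetical names used only for illustration. The sketch starts from a dictionary of defaults and replaces every upper-case variable that is defined in the file named by SUPERSET_CONFIG_PATH:

```python
import importlib.util
import os

# Illustrative defaults only; these are not Superset's real default values.
DEFAULTS = {"ROW_LIMIT": 50000, "SECRET_KEY": "default-key"}


def load_config(defaults, env_var="SUPERSET_CONFIG_PATH"):
    """Return the defaults, overridden by the config file named in env_var."""
    config = dict(defaults)
    path = os.environ.get(env_var)
    if path and os.path.isfile(path):
        # Execute the config file as an ordinary Python module.
        spec = importlib.util.spec_from_file_location("superset_config", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Only upper-case names are treated as configuration variables.
        for name in dir(module):
            if name.isupper():
                config[name] = getattr(module, name)
    return config
```

With SUPERSET_CONFIG_PATH unset, load_config(DEFAULTS) simply returns a copy of the defaults, which mirrors the fallback behavior for variables that are not defined in the file.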

On your GCE instance, run the following commands:

shashank@superset:~$ touch $HOME/.superset/superset_config.py
shashank@superset:~$ echo 'export SUPERSET_CONFIG_PATH=$HOME/.superset/superset_config.py' >> ~/.bash_profile
shashank@superset:~$ source ~/.bash_profile

Those are the last commands...

Creating the metadata database

The SQLALCHEMY_DATABASE_URI variable value is picked up by the Flask-AppBuilder manager to create the metadata database for the web app. The metadata database is persisted in ~/.superset/superset.db by default. This can be verified by running sqlite3 in the directory and listing the tables in the database:

shashank@superset:~/.superset$ sqlite3
SQLite version 3.16.2 2017-01-06 16:32:41
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .open superset.db
sqlite> .tables
ab_permission            annotation_layer  logs
ab_permission_view       clusters          metrics
ab_permission_view_role  columns           query
ab_register_user         css_templates     saved_query
ab_role                  dashboard_slices  slice_user
ab_user                  dashboard_user    slices
ab_user_role             dashboards        sql_metrics
ab_view_menu             datasources...
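
The same check can be scripted. The following is a minimal sketch that uses Python's built-in sqlite3 module; the helper name list_tables is hypothetical, and the path assumes the default location mentioned above:

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the sorted table names stored in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Default location of the metadata database; adjust if yours differs.
    path = os.path.expanduser("~/.superset/superset.db")
    if os.path.exists(path):
        for table in list_tables(path):
            print(table)
```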

Migrating data from SQLite to PostgreSQL

Before we move forward, all tables must be migrated from the SQLite metadata database to the newly set up PostgreSQL database.

We will use sequel, an open source database toolkit available as a Ruby gem, which handles migrations from SQLite to PostgreSQL well.

We will install OS dependencies and gem dependencies along with the sequel Ruby gem:

sudo apt-get install ruby-dev libpq-dev libsqlite3-dev
sudo gem install pg sqlite3
sudo gem install sequel

After installing sequel, the migration is as simple as running the following command. Make sure the path to the sqlite3 database is set correctly:

sequel -C sqlite:///home/shashank/.superset/superset.db postgresql://superset:superset@localhost/superset...
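
One way to confirm that a copy like this is complete is to compare row counts per table in the source and target databases. The sketch below is an assumption-laden illustration rather than part of the sequel tool: row_counts and find_mismatches are hypothetical helpers that work with any DB-API 2.0 connection, and two in-memory SQLite databases stand in for the real source and target:

```python
import sqlite3


def row_counts(conn, tables):
    """Count rows per table through any DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names come from our own trusted list, never from user input.
        cur.execute('SELECT COUNT(*) FROM "{}"'.format(table))
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


def find_mismatches(source_counts, target_counts):
    """Return {table: (source, target)} for tables whose counts differ."""
    return {
        table: (count, target_counts.get(table))
        for table, count in source_counts.items()
        if target_counts.get(table) != count
    }


if __name__ == "__main__":
    # Stand-ins for the SQLite source and the PostgreSQL target.
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for conn in (source, target):
        conn.execute("CREATE TABLE dashboards (id INTEGER)")
    source.execute("INSERT INTO dashboards VALUES (1)")
    tables = ["dashboards"]
    print(find_mismatches(row_counts(source, tables), row_counts(target, tables)))
```

In a real check, the target would be a PostgreSQL connection, for example one opened with the psycopg2 driver, and an empty result from find_mismatches would indicate that the counts agree.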

Web server

We can integrate Superset with many web server options, such as Gunicorn, NGINX, and Apache HTTP, depending on our runtime requirements.

Web servers handle HTTP or HTTPS requests. A Superset web server typically processes a large number of such requests to render charts. Each request generates an I/O-bound database query in Superset. This query is not CPU-bound because the query execution happens at the database level and the result is returned to Superset by the database query execution engine. Requests to a Superset web server almost always require a dynamic output and not a static resource as a response.

Gunicorn is a Python WSGI HTTP server. WSGI is a Python application interface based on the Python Enhancement Proposal (PEP) 333 standard. It specifies how Python applications interface with a web server. Gunicorn is the recommended web server for deploying a Superset...
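
To make the WSGI interface concrete, here is a minimal sketch of a WSGI application. It is unrelated to Superset's own code; the function name app and the response text are illustrative. A WSGI server calls the application with the request environment and a start_response callable, and the application returns an iterable of bytes:

```python
def app(environ, start_response):
    """A minimal WSGI application: takes the request environ, returns bytes."""
    path = environ.get("PATH_INFO", "/")
    body = "Hello from {}".format(path).encode("utf-8")
    # The server supplies start_response; we pass the status line and headers.
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    # The response body is an iterable of byte strings.
    return [body]
```

Saved in a file such as hello.py (a hypothetical name), this application could be served with gunicorn hello:app, which is the same module:callable form used for Superset.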

Setting up an NGINX reverse proxy

We are going to set up NGINX as a reverse proxy server that retrieves resources from the Gunicorn web server on behalf of clients. NGINX has many features and is one of the most widely used proxy servers. We will use it primarily to forward connections to our Superset web server when someone enters a registered domain name, or the instance's external IP address, in their web browser.

We will set up an SSL certificate for the NGINX proxy server so that connections to our web app are always encrypted. Popular browsers, such as Chrome and Firefox, show a warning if a web page is not served with a valid SSL certificate. No worries, we will get the certificate!

We will first install NGINX on our GCE instance, which runs Ubuntu:

# Install
sudo apt-get update
sudo apt-get install nginx 

The NGINX service is now installed...

Setting up HTTPS or SSL certification

We will be using Let's Encrypt (https://letsencrypt.org/), a free, automated, and open certificate authority managed by the non-profit Internet Security Research Group (ISRG).

Secure Sockets Layer (SSL) is a secure transport layer that can be used with any protocol. HTTPS, which is HTTP over SSL, is a common instance of it, and it is what we will implement for our Superset web server.

Like most other things, configuring SSL has OS-level dependencies. First, we will install certbot, the free, automated client for obtaining certificates. Let's Encrypt needs to verify that we control our site first. It does this by performing some checks (which it calls challenges) under http://<url>/.well-known:

# Install certbot
sudo add-apt-repository ppa:certbot/certbot
sudo apt-get update
sudo apt-get install certbot
# Create .well-known directory
cd /var/www/html
sudo mkdir .well-known

We also need to update the superset.conf file in the...

Flask-AppBuilder permissions

Superset uses the Flask-AppBuilder framework to store metadata required for permissions in Superset. Every time a Flask-AppBuilder app is initialized, permissions and views are automatically created for the Admin role. When multiple concurrent workers are started by Gunicorn, they might lead to contention and race conditions between the workers trying to write to one metadata database table.

The automatic updating of permissions in the metadata database can be disabled by setting the SUPERSET_UPDATE_PERMS environment variable to 0. It is set to 1 (enabled) by default:

export SUPERSET_UPDATE_PERMS=1
superset init
# Make sure superset init is called before Superset starts with a new metadata database
export SUPERSET_UPDATE_PERMS=0
gunicorn -w 10 … superset:app
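
The worker settings can also be kept in a Gunicorn configuration file, which is itself a Python file. The following is a sketch under assumptions: the file name gunicorn_config.py and the bind address are hypothetical, and it relies on Gunicorn's workers, bind, and raw_env settings:

```python
# gunicorn_config.py (hypothetical file name)

# Ten worker processes, matching the -w 10 flag.
workers = 10

# Assumed bind address; use whatever address your reverse proxy forwards to.
bind = "127.0.0.1:8088"

# Set the variable in the server's environment so that the workers
# do not try to update permissions when they start.
raw_env = ["SUPERSET_UPDATE_PERMS=0"]
```

With this file in place, the server could be started with gunicorn -c gunicorn_config.py superset:app. As before, superset init should already have been run with permission updates enabled.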

Securing session data

Session data that is exchanged between the Superset web server and a browser client or internet bot is cryptographically signed using the SECRET_KEY parameter value in the superset_config.py file. The signature is computed with a keyed one-way hashing algorithm. Since the secret is never included with the data the web server sends to a browser client or internet bot, neither can tamper with session data without invalidating the signature.

Set its value to a random string of at least ten characters; a longer string is harder to guess:

SECRET_KEY = 'AdLcixY34P' # random string
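
Rather than typing a string by hand, the key can be generated. This is a minimal sketch using the secrets module from Python's standard library; the helper name generate_secret_key and the choice of 32 random bytes are assumptions, not requirements of Superset:

```python
import secrets


def generate_secret_key(num_bytes=32):
    """Return a URL-safe random string built from num_bytes of OS randomness."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed value into superset_config.py as SECRET_KEY.
    print(generate_secret_key())
```

The key should be generated once and kept stable, because changing it invalidates existing sessions.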

Caching queries

Superset uses Flask-Cache for cache management and Flask-Cache provides support for many backend implementations that fit different use cases.

Redis is the recommended cache backend for Superset. But if you do not expect many users to use your Superset installation, then FileSystemCache is a good alternative to a Redis server.

The following are some of the cache implementations that are available, with a description and their configuration variables:

CACHE_TYPE: Description and configuration

  • simple: Uses a local Python dictionary to store results. This is not really safe when using multiple workers on the web server.
  • filesystem: Uses the filesystem to store cached values. The CACHE_DIR variable is the directory path used by FileSystemCache.
  • memcached: Uses a memcached server to store values. Requires the pylibmc Python package installed in the...
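
As an illustration of the filesystem option, a cache configuration in superset_config.py might look like the following sketch. The directory path and the timeout value are hypothetical choices, not recommendations from the original text:

```python
# Cache query results on the local filesystem.
CACHE_CONFIG = {
    "CACHE_TYPE": "filesystem",
    # Hypothetical directory; it must be writable by the web server process.
    "CACHE_DIR": "/tmp/superset_cache",
    # Assumed lifetime of a cached value, in seconds.
    "CACHE_DEFAULT_TIMEOUT": 300,
}
```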

Mapbox access token

The MAPBOX_API_KEY variable needs to be defined because we will use Mapbox visualizations in Superset charts. We need to get a Mapbox access token using the guidelines available here: https://www.mapbox.com/help/how-access-tokens-work/.

After you have obtained it, set the MAPBOX_API_KEY variable to the valid access token value.

Long-running queries

Database queries that are initiated by Superset to render charts must complete within the lifetime of HTTP/HTTPS requests. Some long-running database queries can cause a request timeout if they exceed the maximum duration of a request. But it is possible to configure Superset to handle long-running queries properly using a Celery distributed queue, and transfer the responsibility of query handling to Celery workers.

In large databases, it is common for queries to run for minutes or even hours, while web request timeouts are typically 30 to 60 seconds. Therefore, we need to configure this asynchronous query execution backend for Superset.

We need to ensure that the worker and the Superset server both have the same values for common configuration variables.

Redis is the recommended message queue for submitting new queries to Celery workers...
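
A Celery configuration for superset_config.py could be sketched as follows. It follows the pattern shown in the Superset documentation from this book's era, so treat the setting names as assumptions to check against your installed version; the Redis URLs assume a local Redis server on its default port:

```python
class CeleryConfig(object):
    # Redis acts as the message broker that queues new queries.
    BROKER_URL = "redis://localhost:6379/0"
    # Module that defines the SQL Lab query task run by the workers.
    CELERY_IMPORTS = ("superset.sql_lab",)
    # Where the workers store task state and results.
    CELERY_RESULT_BACKEND = "redis://localhost:6379/0"


# Superset reads its Celery settings from this variable.
CELERY_CONFIG = CeleryConfig
```

Both the web server and the Celery workers should load the same configuration file so that they agree on the broker and the result backend.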

Main configuration file

So, we have completed configuring Superset. Let's take a look at the complete Superset configuration file:

# Superset Configuration file
# add file superset_config.py to PYTHONPATH for usage
import os

# Metadata database
SQLALCHEMY_DATABASE_URI = "postgresql+psycopg2://superset:superset@localhost/superset"

# Securing Session data
SECRET_KEY = 'AdLcixY34P' # random string

# Caching Queries
CACHE_CONFIG = {
    # Specify the cache type
    'CACHE_TYPE': 'redis',
    'CACHE_REDIS_URL': 'redis://localhost:6379/0',
    # The key prefix for the cache values stored on the server
    'CACHE_KEY_PREFIX': 'superset_results'
}

# Set this API key to enable Mapbox visualizations
# (falls back to a placeholder when the environment variable is not set)
MAPBOX_API_KEY = os.environ.get('MAPBOX_API_KEY', 'mapbox-api-key')

# Long running query handling using Celery workers
class
...

SQL Lab

SQL Lab is a powerful SQL IDE inside Superset. It works with any database that has a SQLAlchemy Python connector and is great for data exploration. It can query any data source in Superset, including the metadata database.

It is a solid playground in which we can slice and dice a dataset in many ways until it reaches the form we need to visualize in order to answer the analytical question behind a chart.

First, we need to enable SQL Lab use on the superset-bigquery data source. We will explore and visualize the data in the table using SQL queries.

After clicking on the Sources | Databases option on the navigation bar, select the Edit record option for the superset-bigquery data source:

The overview chart of the list of databases

Then, make sure the following three options are enabled. Allow Run Sync should be enabled by default. We are doing this...

Summary

We saw that the Superset web server can be configured for our runtime environment needs using the superset_config.py file. We looked at the configuration parameters that make Superset secure and scalable, and at the trade-offs involved.

SQL Lab provides an opportunity to experiment with result sets before plotting. It is an excellent tool for exploring datasets and developing charts.

In this chapter, we replaced the SQLite metadata database with PostgreSQL and configured the web app to use it. So that the web app can handle many concurrent users, we deployed it on a Gunicorn server. The setup now includes the following components:

  • PostgreSQL metadata database
  • Gunicorn
  • NGINX
  • HTTPS certification
  • Securing session data
  • Redis caching system
  • Celery for long-running queries
  • Mapbox access token

Nicely done! We have been able to make dashboards, use SQL Lab, and understand the...
