You're reading from Snowflake Cookbook

Product typeBook

Published inFeb 2021

Reading LevelBeginner

PublisherPackt

ISBN-139781800560611

Edition1st Edition

Languages

Python

Tools

Cloud Foundry

Concepts

Data Science

Authors (2):

Hamid Mahmood Qureshi

Hammad Sharif

View More author details

Managing external tables and stages

An important aspect of ETL applications is managing the loading of data. This recipe introduces you to managing incoming data by creating a stage and querying that data for loading into native Snowflake tables. The process is very different from traditional data warehouses as it mixes concepts from modern big data systems. Details around ETL will not be covered here but are deferred till later chapters to explain how an ETL pipeline can be managed.

Getting ready

The following example requires SnowSQL to run the different steps. Apart from that, you will need to have access to an AWS S3 bucket where data can be placed and made available as files.

How to do it…

The following steps start with the creation of a stage, which is used to temporarily store data before it can be copied into Snowflake:

Let's first create a stage. A stage is a logical concept or an abstraction of a filesystem location that is external or internal to Snowflake. In this case, an external stage has been used. The location can be managed in one of the object stores supported by the underlying cloud storage. In the case of AWS, S3 is used for this purpose. This recipe uses S3. The following statement creates a stage named sfuser_ext_stage. The stage should be accessible to Snowflake:
```
CREATE OR REPLACE STAGE sfuser_ext_stage
URL='s3://snowflake-cookbook/Chapter02/r4/';
```
The response should say that a stage has been created successfully, as shown:
Figure 2.21 – Stage area created
Let's now do a listing on the SFUSER_EXT_STAGE stage pointing to the snowflake-cookbook S3 bucket:
```
LIST@SFUSER_EXT_STAGE;
```
This statement should generate the following output in the web UI:
Figure 2.22 – Listing the stage to the S3 bucket
We can see that there are two types of files in the preceding listing: csv and parquet. In the case of csv, the electronic-card-transactions-may-2020-headless.csv file is a header-less version of the electronic-card-transactions-may-2020.csv file. There is a parquet format file as well called userdata1.parquet. We shall create external tables on both files. An external table is different from usual database tables because unlike tables that point to data inside a database, external tables provide a view on top of files stored in a stage.
These are read-only tables that maintain metadata that's helpful in interpreting the contents of a file, which could be formatted as parquet, csv, and so on.
Let's now look at how the parquet file can be loaded into an external table. We shall be creating an external table called ext_tbl_userdata1. The creation of the table would require a location from which data can be read into the table. It would also require a file format. In this case, the file type is parquet:
```
create or replace external table ext_tbl_userdata1
with location = @sfuser_ext_stage                                                                      
file_format = (type = parquet);
```
Let's query the newly created external table. This would show each row of the result set as a JSON document. Within each row, you should be able to see different columns with their respective values as key-value pairs:
```
select * from ext_tbl_userdata1;
```
The following screenshot is only showing some of the key-value pairs due to size constraints:
Figure 2.23 – Output of the query showing key-value pairs
Similarly, by pointing to a different location, the CSV file can be loaded into another external table. An ext_card_data table is created that has the location pointing to the stage. In this case, the file is located in a …/ch2/r4/csv subfolder. This gets us to the folder where the file is located. file_format, in this case, is providing the information that the file is a CSV, and finally, a file filter is provided to constrain the search to CSVs with headless in their names:
```
create or replace external table ext_card_data
with location = @sfuser_ext_stage/csv
file_format = (type = csv)
pattern = '.*headless[.]csv';
```
Let's query the new external table:
```
select * from ext_card_data; 
```
The following screenshot of the resulting records shows some of the rows generated by the query. Please note the difference here. There are no meaningful column names in this case, unlike the previous case of the Parquet file:
Figure 2.24 – Output of the select statement
As we have observed, an external table always ends up having data in JSON format at the end of a copy process. This step shows how some meaningful names can be given to the dummy or automatically created columns in the JSON document and how it can be flattened to generate column-oriented rows. The following query shows how aliases can be created for the automatically created column names. Please note that the query only selects two columns, c3 and c2, and creates the card_sum and period aliases, respectively. Moreover, to use the columns effectively, casting has been carried out for each column value:
```
select top 5 value:c3::float as card_sum,
value:c2::string as period 
from ext_card_data;
```
The result of the query is shown:
Figure 2.25 – Output of the query selecting two columns
Now that we have loaded tables and we can see how this data can be used in queries, we can drop the tables. This would end the life of the external tables that had been created to run queries of raw data – a typical purge-on-load pattern is applicable:
```
drop table ext_card_data;
drop table ext_tbl_userdata1; 
```

How it works…

The stage created in step 1 can be thought of as a reference to a storage location. It is treated as a read-only location that can only be accessed using the appropriate access rights. The S3 bucket in step 1 is a public bucket and does not need credentials to be accessed. In later chapters, when we look at staging in more detail, we shall start delving into securing the staging locations.

Step 2 is dereferencing the specified S3 bucket to list all the files that are available along with the last modified dates.

In step 3, when the Parquet file is loaded into the external table, the table rows have all the field names captured, as shown in step 4. But in step 5, when the CSV-formatted file is loaded into the external table, there are dummy column names created by Snowflake, as can be seen in step 6. This is because a Parquet file has metadata stored inside the file, while a CSV file does not have that metadata embedded in it. This is a major difference and would usually require additional steps as shown in step 7 to generate meaningful column names, plus the casting of data types.

There's more…

In this recipe, we did not look at how data could be loaded into a stage. It is possible to load data into a stage from a local filesystem. This method will be discussed in the later chapters.

Looking at the metadata limitations for external tables, it can be argued that Avro files can be the best format for staging data as Avro files can specify field names and data types as well. We shall look into an example in a later chapter where we discuss ETL processing.

Please note that the last modified dates for each file in a stage can be a useful mechanism to trigger updating data in an external table and can be used for running the ETL process.

You have been reading a chapter from

Snowflake Cookbook

Published in: Feb 2021Publisher: PacktISBN-13: 9781800560611

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (2)

Hamid Mahmood Qureshi

Hamid Qureshi is a senior cloud and data warehouse professional with almost two decades of total experience, having architected, designed, and led the implementation of several data warehouse and business intelligence solutions. He has extensive experience and certifications across various data analytics platforms, ranging from Teradata, Oracle, and Hadoop to modern, cloud-based tools such as Snowflake. Having worked extensively with traditional technologies, combined with his knowledge of modern platforms, he has accumulated substantial practical expertise in data warehousing and analytics in Snowflake, which he has subsequently captured in his publications.
Read more about Hamid Mahmood Qureshi

Hammad Sharif

Hammad Sharif is an experienced data architect with more than a decade of experience in the information domain, covering governance, warehousing, data lakes, streaming data, and machine learning. He has worked with a leading data warehouse vendor for a decade as part of a professional services organization, advising customers in telco, retail, life sciences, and financial industries located in Asia, Europe, and Australia during presales and post-sales implementation cycles. Hammad holds an MSc. in computer science and has published conference papers in the domains of machine learning, sensor networks, software engineering, and remote sensing.
Read more about Hammad Sharif

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages