Search icon
Subscription
0
Cart icon
Close icon
You have no products in your basket yet
Save more on your purchases!
Savings automatically calculated. No voucher code required
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Amazon Redshift Cookbook

You're reading from  Amazon Redshift Cookbook

Product type Book
Published in Jul 2021
Publisher Packt
ISBN-13 9781800569683
Pages 384 pages
Edition 1st Edition
Languages
Authors (3):
Shruti Worlikar Shruti Worlikar
Profile icon Shruti Worlikar
Thiyagarajan Arumugam Thiyagarajan Arumugam
Profile icon Thiyagarajan Arumugam
Harshida Patel Harshida Patel
Profile icon Harshida Patel
View More author details

Table of Contents (13) Chapters

Preface 1. Chapter 1: Getting Started with Amazon Redshift 2. Chapter 2: Data Management 3. Chapter 3: Loading and Unloading Data 4. Chapter 4: Data Pipelines 5. Chapter 5: Scalable Data Orchestration for Automation 6. Chapter 6: Data Authorization and Security 7. Chapter 7: Performance Optimization 8. Chapter 8: Cost Optimization 9. Chapter 9: Lake House Architecture 10. Chapter 10: Extending Redshift's Capabilities 11. Other Books You May Enjoy Appendix

Managing tables

In Amazon Redshift, you can create a collection of tables within a schema with related entities and attributes. Working backward from your business requirements, you can use different modeling techniques to create tables in Amazon Redshift. You can choose a star or snowflake schema by using Normalized, Denormalized, or Data Vault data modeling techniques.

In this recipe, we will create tables in the finance schema, insert data into those tables and cover the key concepts to leverage the massively parallel processing (MPP) and columnar architecture.

Getting ready

To complete this recipe you will need a SQL client, or you can use the Amazon Redshift query editor.

How to do it…

Let's explore how to create tables in Amazon Redshift.

  1. Let's create a customer table in the finance schema with customer_number, first_name, last_name, and date_of_birth related attributes:
    CREATE TABLE finance.customer 
    (
      customer_number   INTEGER,
      first_name        VARCHAR(50),
      last_name         VARCHAR(50),
      date_of_birth     DATE
    );

    Note

    The key ingredient when creating a customer table is to define columns and their corresponding data types. Amazon Redshift supports data types such as numeric, character, date, datetime with time zone, boolean, geometry, HyperLogLog, and super.

  2. We will now insert 10 records into the customer table using a multi-value insert statement:
    insert into finance.customer values
    (1, 'foo', 'bar', '1980-01-01'),
    (2, 'john', 'smith', '1990-12-01'),
     (3, 'spock', 'spock', '1970-12-01'),
     (4, 'scotty', 'scotty', '1975-02-01'),
     (5, 'seven', 'of nine', '1990-04-01'),
     (6, 'kathryn', 'janeway', '1995-07-01'),
     (7, 'tuvok', 'tuvok', '1960-06-10'),
     (8, 'john', 'smith', '1965-12-01'),
     (9, 'The Doctor', 'The Doctor', '1979-12-01'),
     (10, 'B Elana', 'Torres', '2000-08-01');
  3. You can now review the information on the customer table using the svv_table_info system view. Execute the following query:
    select "schema", table_id, "table", encoded, diststyle, sortkey1,  size, tbl_rows
    from svv_Table_info
    where "table" = 'customer'
    and "schema" = 'finance';

    This is the expected output:

    schema table_id table encoded diststyle sortkey1 size tbl_rows
    finance 167482 customer Y AUTO(ALL) AUTO(SORTKEY) 14 10

    Table_id is the object ID and the number of records in the table is 10 rows. The encoded column indicates the table is compressed. Amazon Redshift stores columns in 1 megabyte (MB) immutable blocks. The size of the table is 14 MB. Let's dive into the terminology and concept of diststyle and sortkey. The customer table is created with default sort key of AUTO, where Amazon Redshift handles the distribution style of the table on the computer nodes.

    • diststyle is a table property that dictates how that table's data is distributed throughout the cluster.
    • KEY: The value is hashed, and the same value goes to same location (slice) on the compute node.
    • ALL: The full table data goes to the first slice of every compute node.
    • EVEN: Round-robin across all the compute nodes.
    • AUTO: When the table is small, it starts with an AUTO style, and when it becomes larger in size, Amazon Redshift converts it to an EVEN style.

Further information about distribution styles can be found at the following link:

https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html

  1. Let's run a query against the customer table to list customers who were born before 1980:
    select *
    from finance.customer
    where extract(year from date_of_birth) < 1980;
  2. You can also create a copy of the permanent table using create table as (CTAS). Let's execute the following query to create another table for a customer born in 1980:
    create table finance.customer_dob_1980 as 
    select *
    from finance.customer
    where extract(year from date_of_birth) = 1980 ;
  3. You can also create temporary tables—for example, to generate IDs in a data loading operation. The temporary tables can only be queried during the current session and are automatically dropped when the session ends. The temporary tables are created in the session-specific schema and are not visible to any other user. You can use a create temporary table command to do this. Execute the following three queries in single session:
    create temporary table #customer(custid integer IDENTITY(1,1), customer_number integer);
    insert into #customer (customer_number) values(1);
    select * from #customer;

    This is the expected output:

    custid  customer_number
    1 1
  4. Reconnect to the Amazon Redshift cluster using the SQL client. Reconnecting will create a new session. Now, try to execute the following query against the #customer temporary table. You will get an ERROR: 42P01: relation "#customer" does not exist error message as the temporary tables are only visible to the current session:
    select * from #customer;

How it works…

When you create a table in Amazon Redshift, it stores the data on disk, column by column, on 1 MB blocks. Amazon Redshift by default compresses the columns, which reduces the storage footprint and the input/output (I/O) when you execute a query against the table. Amazon Redshift provides different distribution styles to spread the data across all the compute nodes, to leverage the MPP architecture for your workload. The metadata and the table summary information can be queried using the catalog table and summary view.

Amazon Redshift stores metadata about the customer table. You can query the pg_table_def catalog table to retrieve this information. You can execute the following query to view the table/column structure:

select * from pg_table_def where schemaname = 'finance';. 

Important note

When data is inserted into a table, Amazon Redshift automatically builds, in memory, the metadata of the minimum and maximum values of each block. This metadata, known as a zone map, is accessed before a disk scan in order to identify which blocks are relevant to a query. Amazon Redshift does not have indexes; it does, however, have sort keys. Sort key columns govern how data is physically sorted for a table on disk and can be used as a lever to improve query performance. Sort keys will be covered in depth in the performance-tuning best practices chapter.

You have been reading a chapter from
Amazon Redshift Cookbook
Published in: Jul 2021 Publisher: Packt ISBN-13: 9781800569683
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}