You're reading from Data Engineering with dbt

Product type Book

Published in Jun 2023

Publisher Packt

ISBN-13 9781803246284

Pages 578 pages

Edition 1st Edition

Languages

Concepts

Data Streaming

Author (1):

Roberto Zagni

Table of Contents (21) Chapters

Preface

1. Part 1: The Foundations of Data Engineering

2. Chapter 1: The Basics of SQL to Transform Data

3. Chapter 2: Setting Up Your dbt Cloud Development Environment

4. Chapter 3: Data Modeling for Data Engineering

5. Chapter 4: Analytics Engineering as the New Core of Data Engineering

6. Chapter 5: Transforming Data with dbt

7. Part 2: Agile Data Engineering with dbt

8. Chapter 6: Writing Maintainable Code

9. Chapter 7: Working with Dimensional Data

10. Chapter 8: Delivering Consistency in Your Data

11. Chapter 9: Delivering Reliability in Your Data

12. Chapter 10: Agile Development

13. Chapter 11: Team Collaboration

14. Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms

15. Chapter 12: Deployment, Execution, and Documentation Automation

16. Chapter 13: Moving Beyond the Basics

17. Chapter 14: Enhancing Software Quality

18. Chapter 15: Patterns for Frequent Use Cases

19. Index

Why subscribe?

20. Other Books You May Enjoy

Advanced – introducing window functions

A window function is a SQL function that operates on a set of rows, called a window of rows.

These functions are also called analytic functions, as they are mostly used in analytic queries.

A window is a set of correlated rows, containing any amount from 0 to all the rows resulting from a query.

Generally, the windows divide the resulting rows into disjointed partitions.

Some window functions can also operate on a window frame, which is a dynamic subset of the rows defined by the window, based on the current row.

You can define a cumulative window frame, taking the rows of the window before or after the current row, or a sliding window frame, going from some rows before the current row to some rows after the current row.

Window functions have a lot in common with aggregated functions, the ones that use GROUP BY, as they both divide the results of a query into groups, that is, partitions, and then calculate the function on the rows in each group.

The big difference is that with GROUP BY, you get only one row per group, while with window functions, all the rows are returned, with the function calculated on the desired subset of rows.

Let’s show some examples to make the concept more real.

Assume we have a table with orders, and we would like to know how much each order contributes to the total by day or month. We can calculate the percentage with the following formula:

monthly_pct = order_total / monthly_total * 100

Using group by, you could calculate the order total for each day or month, in separate queries, and then you would need to join those results with the original table to be able to apply the formula.

Window functions are here to help, as you can calculate the daily or monthly measures in a single step by defining a window that puts the rows together in different ways, as in the following query:

SELECT
  O_ORDERKEY,
  O_CUSTKEY,
  O_ORDERDATE,
  O_TOTALPRICE,
  avg(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_avg,
  sum(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_total,
  sum(O_TOTALPRICE) over(partition by
          DATE_TRUNC(MONTH, O_ORDERDATE)) as monthly_total,
  O_TOTALPRICE / daily_avg * 100 as avg_pct,
  O_TOTALPRICE / daily_total * 100 as day_pct,
  O_TOTALPRICE / monthly_total * 100 as month_pct
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
QUALIFY row_number() over(partition by O_ORDERDATE
                          order by O_TOTALPRICE DESC) <= 5
order by O_ORDERDATE, O_TOTALPRICE desc;

This example uses window functions to calculate totals and averages over different periods and also to limit the returned rows to the five biggest orders for each day.

This query is pretty efficient, executing in a few seconds even after processing million of rows. By changing the schema from TPCH_SF1 to TPCH_SF10, TPCH_SF100, or TPCH_SF1000, you can try out bigger tables in the Snowflake sample data, to see how it works with bigger datasets.

Window definition

A window is defined using the OVER clause, which specifies how to partition and order the rows of the incoming data to apply the window function.

To use a window function, you use this syntax:

<function_name> ([<arguments>])
    OVER ( [PARTITION BY <part_expr>] [ORDER BY <order_expr>])

The presence of the OVER clause specifies that the function is applied to the windows defined inside the OVER clause. Of course, the function can also take its own arguments.

The PARTITION BY clause defines how the rows are partitioned and <part_expr> is a comma-separated list of columns or expressions calculated on the query data, in a similar way to what we have seen for the group by expressions and shown in the previous example.

The ORDER BY clause specifies the ordering of the rows to be used to calculate the function and it is not visible outside of the window function. The order by clause and <order_expr> work pretty much as they do in the SELECT statement to order the results of the query.

At the syntax level, both the PARTITION BY and ORDER BY clauses are optional, even if some functions require or forbid one of the two. Look at each function’s documentation for more details.

The ORDER BY clause has an effect on the order-sensitive window function, which comes in two types:

Rank-related functions: These functions calculate values based on the rank of the row, which is the order of the row in the window. They are used to make explicit, that is, crystallized into a value that can be used later, one of many possible orderings of a row in a window.

One example is the ordering of teams in a championship based on points or medals in a sport: you can have different styles of ordering based on what is decided when you have the same value.

As an example, if two teams (A and B) have the same score, after another team (C) with an higher score, you can say that A and B are both second or you can say that one is second and the other one third.

Window frame functions: These functions work on a window frame, which is a dynamic subset of rows of a window. As we saw previously, they can use cumulative or sliding window frames. The next section will delve into these.

Window frame definition

Some functions can use a dynamic window frame instead of a fixed window.

You specify the window frame inside the OVER clause, after the ORDER BY clause of the window:

<function_name> ([<arguments>])
    OVER ( [PARTITION BY <part_expr>]
           ORDER BY <order_expr>
           <cumulative_frame_def> | <sliding_frame_def>
         )

A cumulative window frame is a subset of the window that is either growing or shrinking based on the current row inside the window. The frame contains the rows of the window either before or after the current row. The syntax of the cumulative frame definition is the two following lines:

  {ROWS | RANGE} BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
| {ROWS | RANGE} BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

The first expression takes all the rows in the partition or all the rows in the same order range up to the current one, while the second takes from the current row to the end of the partition or range. A better description of ROWS and RANGE follows.

A sliding window frame is a subset of rows that extends from a number of N rows before the current row to a number of M rows after the current row. One or both sides can be fixed to the start or end of the partition.

The following are the three possible forms of the syntax:

  ROWS BETWEEN <N> { PRECEDING | FOLLOWING }
       AND <M> { PRECEDING | FOLLOWING }
| ROWS BETWEEN UNBOUNDED PRECEDING AND <M> { PRECEDING | FOLLOWING }
| ROWS BETWEEN <N> { PRECEDING | FOLLOWING } AND UNBOUNDED FOLLOWING

The first syntax is when you want to create a frame that extends from N rows before to M rows after the current rows, while the other two are the cases when one of the two extremities is fixed to be the start or end of the window.

ROWS/RANGE in a cumulative window definition

When specifying a cumulative window frame, we can use either ROWS or RANGE, but what is the difference?

ROWS: Indicates that the fixed side of the frame extends to the limit of the partition: from the start of the partition to the current row or from the current row to the end of the partition.
RANGE: Indicates that the fixed side of the frame extends to the previous/following rows that have the same value for the ORDER BY value. The frame operates on the preceding/following rows that are in the same position in the ordering. As an example, you can think of stepping through many teams that have the same number of points in a championship list.

Important note

Providing extensive information on window functions is beyond the scope of this introduction to SQL used in analytical operations. It is also one area where SQL is not always the same in all databases. We suggest searching for “window functions” in the documentation of your database.

You're reading from Data Engineering with dbt

Table of Contents (21) Chapters

Advanced – introducing window functions

Window definition

Window frame definition

ROWS/RANGE in a cumulative window definition

Authors (1)

Personalised recommendations for you