Free eBook - Learn T-SQL Querying

4.5 (2 reviews total)
By Pedro Lopes , Pam Lahoud
  • A new free eBook every day on the latest in tech
  • 30 permanently free eBooks from our core tech library
  1. Anatomy of a Query

About this book

Transact-SQL (T-SQL) is Microsoft's proprietary extension to the SQL language that is used with Microsoft SQL Server and Azure SQL Database. This book will be a useful guide to learning the art of writing efficient T-SQL code in modern SQL Server versions as well as the Azure SQL Database.

The book will get you started with query processing fundamentals to help you write powerful, performant T-SQL queries. You will then focus on query execution plans and leverage them for troubleshooting. In later chapters, you will explain how to identify various T-SQL patterns and anti-patterns. This will help you analyze execution plans to gain insights into current performance, and determine whether or not a query is scalable. You will also build diagnostic queries using dynamic management views (DMVs) and dynamic management functions (DMFs) to address various challenges in T-SQL execution. Next, you will work with the built-in tools of SQL Server to shorten the time taken to address query performance and scalability issues. In the concluding chapters, the book will guide you through implementing various features, such as Extended Events, Query Store, and Query Tuning Assistant, using hands-on examples.

By the end of this book, you will have developed the skills to determine query performance bottlenecks, avoid pitfalls, and discover the anti-patterns in use.

Foreword by Conor Cunningham, Partner Architect – SQL Server and Azure SQL – Microsoft

Publication date:
May 2019
Publisher
Packt
Pages
484
ISBN
9781789348811

 

Anatomy of a Query

Transact-SQL, or T-SQL, as it has come to be commonly known, is the language that is used to communicate with Microsoft SQL Server. Any actions a user wishes to perform in a server, such as retrieving or modifying data in a database, creating objects, changing server configurations, and so on, are all done via a T-SQL command.

In this chapter, we will be introduced to the typical components of a T-SQL statement, including the logical order with which SQL Server processes a statement. This is essential for introducing the reader to why certain query writing patterns work best and to provide a fundamental reference for better understanding the other chapters.

There are four main groups of T-SQL statements that we can have in a Relational Database Management System (RDBMS) like SQL Server:

  • Data Control Language statements, also known as DCL, are used to handle control access to a database or parts of the database. T-SQL commands such as GRANT and REVOKE are used to change permissions on objects (known as securables), or to add users to SQL Server.
  • Transactional Control Language statements, also known as TCL, are used to control transactions in SQL Server with T-SQL commands such as BEGIN TRANSACTION, COMMIT TRANSACTION, or ROLLBACK.
  • Data Definition Language statements, also known as DDL, are used to create, change, or delete the database and any objects contained within such as tables or indexes. Examples of DDL include CREATE, ALTER, CREATE OR ALTER, or DROP T-SQL commands.
  • Data Manipulation Language statements, also known as DML, can be distilled into 4 logical operations on a database:
    • Retrieving data via the SELECT statement.
    • Updating and Inserting data, also known as UPSERTs, via the UPDATE and INSERT statements.
    • Deleting data via the DELETE statement.
    • There is also a MERGE statement. This is a conditional structure that combines UPDATEs, INSERTs and/or DELETEs into a single statement, which together with SELECTs, make up the fundamental DML operations available in SQL Server.

While all these types of statements must be parsed and validated by the Database Engine before execution, with very few exceptions only DML statements are optimized. This means that the way DML statements are constructed can have an impact on their resulting performance, so care must be taken to write them efficiently. For this reason, we will focus on DML statements throughout the course of this book.

In this chapter we will cover the following topics: 

  • Building blocks of a T-SQL statement
  • Logical statement-processing flow
 

Building blocks of a T-SQL statement

When writing a T-SQL statement, the following three actions are required:

  1. Express the intended operation, such as reading or changing data
  2. Provide a target or source list of affected tables
  3. Provide a condition that filters the affected records

The intended operation is determined by the presence of the following clauses:

  • The SELECT clause lists columns or expressions that will be displayed in the result set
  • The DELETE, INSERT, or UPDATE clauses state the target table or view for these logical operations

As for the affected tables and filters, they are determined by the following clauses:

  • The FROM clause lists the source tables, views and/or sub-queries that contain the data to be queried
  • The WHERE clause states one or more conditions that will serve to filter the result set to the desired rows

The preceding clauses determine which data will be manipulated. The formatting of the results can be further modified by adding any of the following parts:

  • The ORDER BY clause defines the order in which the rows will be returned
  • The GROUP BY clause aggregates rows together based on the criteria provided (typically combined with aggregate functions in the SELECT clause)
  • The HAVING clause applies a predicate to the results (different than the WHERE clause, which applies a predicate to the source rows)

SELECT

The SELECT clause defines the columns and expressions that will be returned in the results and is the only element that is required to form a valid T-SQL data retrieval statement. Elements in the SELECT statement can be as simple as a single constant value, or as complex as a full T-SQL sub-query, but generally it is a comma-separated list of columns from tables and views in a database.

The following query will return a single row with a single column:

 SELECT 1;

In the following screenshot we can see the result:

The SELECT clause can also be used to format the results by providing column aliases or using expressions to modify the data. Aliases are created with the optional keyword AS, followed by the intended column name to be displayed in the result set:

SELECT Name AS ProductName, LEFT(ProductNumber, 2) AS ProductCode, ISNULL(color, 'No Color') AS Color [...]

Note that, in the results, any row that has a value for Color will display that value, whereas any row that has a null color will display No Color instead:

DISTINCT

DISTINCT specifies that repeated rows in the result set are collapsed into a single row.

SELECT DISTINCT Name AS ProductName, LEFT(ProductNumber, 2) AS ProductCode, ISNULL(color, 'No Color') AS Color [...]

TOP

The TOP clause specifies that from the applicable rows, the results set only produces a predetermined number of rows, set in percentage or absolute number.

SELECT TOP 25 Name AS ProductName, LEFT(ProductNumber, 2) AS ProductCode, ISNULL(color, 'No Color') AS Color [...]

FROM

The FROM clause specifies the tables or views used in the SELECT, DELETE, and UPDATE statements. It is required unless a SELECT list contains only constants, variables or arithmetic expressions, or an UPDATE clause does not contain references to other tables other than the target (for example, a join is not needed).

It can be a single table, a derived table (a table created from a nested sub-query), a Table-Valued Function (TVF), or it can be several tables and/or views joined together.

If the desired result set contains data from more than one table or view, joins can be used to link rows from one table to another. There are essentially the following three types of logical joins that are expressed when writing a query:

  • Inner joins
  • Outer joins (left, right, and full)
  • Cross joins

In specific scenarios, SQL Server may infer the intended type of logical join operation. However, it is best to always explicitly state the type of required logical join operation the reader intends to implement.

When specifying multiple tables and/or views in the FROM clause, it is helpful to create aliases for these objects, which can be used within the query, such as in the SELECT clause. This is particularly useful when different objects have columns with the same names. In this case, we must disambiguate the columns throughout the query, and an alias allows us to do this without having to specify the entire object name. So, as is the case for column aliases, a table, view, or function alias is created with the AS keyword.

INNER JOIN

Inner joins compare the rows from two tables based on conditions specified in the query. Typically, this type of join would be used to intersect rows that have the same value in a specific column or set of columns. The only rows that would be returned are the ones that have matching rows in both tables, as represented in black in the following diagram:

For example, the AdventureWorks sample database has a Product table that contains the ProductID and Name columns and a ProductInventory table that contains the ProductID and Quantity columns. To write a query that returns the product name and the quantity together, an inner join can be used to combine rows from the Product table with rows from the ProductInventory table based on matching values in the ProductID column. In this case, only products that have rows in both tables will be returned. The query would look like the following:

SELECT Name AS ProductName, Quantity
FROM Production.Product
INNER JOIN Production.ProductInventory ON Product.ProductID = ProductInventory.ProductID;

OUTER JOIN

Outer joins are used to return all the data in one table, plus any matching rows in the other table. In the left outer join, the entire left table is returned along with any matching rows from the right table. If there is no matching row on the right, null values will be returned for these columns:

Building on the preceding example, there are some rows in the Product table that are not currently in inventory; therefore, there are no rows with these product IDs in the ProductInventory table. To return all the products, whether they have matching rows in the ProductInventory table or not, a left outer join can be used.

In this case, rows with no inventory will return NULL for the Quantity column:

SELECT Name AS ProductName, Quantity
FROM Production.Product
LEFT OUTER JOIN Production.ProductInventory ON Product.ProductID = ProductInventory.ProductID;

So, following on from what was covered previously in the SELECT clause section, NULL can be replaced by zeros in the results, if desired, by using an expression:

SELECT Name AS ProductName, ISNULL(Quantity, 0) AS Quantity
FROM Production.Product
LEFT OUTER JOIN Production.ProductInventory ON Product.ProductID = ProductInventory.ProductID;

In a right outer join, all the rows from the right table are returned along with any matching rows from the left table (and NULL for the left columns if no match exists):

If there are products in the ProductInventory table that are not in the Products table for some reason, a right outer join would return all the quantities whether they have a corresponding name or not, and a NULL value for the Name column if no matching row exists in the Products table:

SELECT Name AS ProductName, Quantity
FROM Production.Product
RIGHT OUTER JOIN Production.ProductInventory ON Product.ProductID = ProductInventory.ProductID;

In this case, it might be good to display the ProductID column if the Name column is null:

SELECT ISNULL(Name, ProductInventory.ProductID) AS ProductName, Quantity
FROM Production.Product
RIGHT OUTER JOIN Production.ProductInventory ON Product.ProductID = ProductInventory.ProductID;

For full outer joins, all rows are returned from both tables; if rows are matched, they are combined into a single row in the results:

Using a full outer join, all the rows from both the Product and the ProductInventory tables will be returned. The Name column will be NULL for rows that appear only in the ProductInventory table, and the Quantity column will be NULL for rows that appear only in the Product table:

SELECT ISNULL(Name, ProductInventory.ProductID) AS ProductName, ISNULL(Quantity, 0) AS Quantity
FROM Production.Product
FULL OUTER JOIN Production.ProductInventory ON Product.ProductID = ProductInventory.ProductID;
In all the preceding examples, SQL Server can recognize the intended type of join is outer, even if the OUTER clause is not present. For example, writing LEFT JOIN instead of LEFT OUTER JOIN.

CROSS JOIN

Cross joins are also called Cartesian products. In a cross join, every row in the left table is returned, and each of these rows is combined with all the rows from the right table. If the left table has 10 rows and the right table 100 rows, then the cross join produces 1000 rows.

APPLY

APPLY is like a cross join in the type of result set that it produces, but usable only with functions. In a cross join, both inputs (left and right) are tables or views that already exist in the database, with a fixed definition. However, APPLY is used in scenarios where a join cannot be used. In APPLY, one of the inputs (the right) is not physically materialized in the database because its output is dependent on input parameters, such as in the case of a table-valued function (TVF).

For example, the AdventureWorks sample database has a SalesPerson table that contains the BusinessEntityID and SalesYTD columns, and a ufnGetContactInformation TVF that returns the FirstName, LastName, and JobTitle columns. The TVF creates a runtime abstraction for columns that exist in multiple underlying tables, like building a table on-the-fly. To write a query that returns the year-to-date (YTD) sales per sales person, together with their name and job title, a cross apply can be used to return all rows from the SalesPerson table, and each of those rows is combined with the rows coming from the ufnGetContactInformation TVF.

The query would look like the following code block:

SELECT SP.SalesYTD, P.FirstName, P.LastName, P.JobTitle
FROM Sales.SalesPerson AS SP
CROSS APPLY dbo.ufnGetContactInformation (SP.BusinessEntityID) AS P;

In the following screenshot, the results of the ufnGetContactInformation function are displayed alongside the SalesYTD column, just as if they came from another table using a simple inner or outer join:

However, the following query produces an error (ID 4104) because a join cannot be used directly with a TVF:

SELECT SP.SalesYTD, P.FirstName, P.LastName, P.JobTitle
FROM Sales.SalesPerson AS SP
CROSS JOIN dbo.ufnGetContactInformation (SP.BusinessEntityID) AS P;

WHERE

The WHERE clause specifies the search condition that determines whether a row should be returned in the result set. Rows will be returned only if the entire WHERE clause evaluates to TRUE. Each condition within the WHERE clause is referred to as a predicate. There is no limit to the number of predicates that can appear in a WHERE clause, and predicates are combined using the AND, OR, and NOT logical operators.

For example, the AdventureWorks sample database has a Product table that contains the Name and ProductID columns, a ProductInventory table that contains the Quantity, LocationID, and ProductID columns, and a Location table that contains the LocationID and Name columns. A query that returns the current product inventory per location, for the entire Touring line of products would look like the following code block:

SELECT P.Name AS ProductName, [PI].Quantity, L.Name AS LocationName
FROM Production.Product AS P
INNER JOIN Production.ProductInventory AS [PI] ON P.ProductID = [PI].ProductID
INNER JOIN Production.Location AS L ON [PI].LocationID = L.LocationID
WHERE P.Name LIKE 'Touring%';

The following screenshot shows that all the ProductName values in the result set begin with the word Touring:

ORDER BY

The ORDER BY clause orders the results. The default order is ascending, meaning that the ASC keyword can be omitted. To produce the result set in descending order, DESC must be set explicitly. Building on the same example from the WHERE section, the following code block is used to explicitly present results in descending order of product name and location name:

SELECT P.Name AS ProductName, [PI].Quantity, L.Name AS LocationName
FROM Production.Product AS P
INNER JOIN Production.ProductInventory AS [PI] ON P.ProductID = [PI].ProductID
INNER JOIN Production.Location AS L ON [PI].LocationID = L.LocationID
WHERE P.Name LIKE 'Touring%'
ORDER BY P.Name DESC, L.Name DESC;

The following screenshot shows the results in the specified order:

GROUP BY

GROUP BY aggregates the results on the required column names or expressions. Building on the same example from the ORDER BY section, we want to know the overall product quantity per product name and location, from the following code snippet. The Quantity column is using the aggregate function SUM. Therefore, the remaining columns need to be contained in the aggregation GROUP BY clause:

SELECT P.Name AS ProductName, SUM([PI].Quantity) AS Total_Quantity, L.Name AS LocationName
FROM Production.Product AS P
INNER JOIN Production.ProductInventory AS [PI] ON P.ProductID = [PI].ProductID
INNER JOIN Production.Location AS L ON [PI].LocationID = L.LocationID
WHERE P.Name LIKE 'Touring%'
GROUP BY P.Name, L.Name
ORDER BY P.Name DESC, L.Name DESC;

The following screenshot shows the results with one row per set as defined by the GROUP BY clause:

Aggregations can be further specified by using the following keywords:

  • ROLLUP: Specifies the creation of subtotals and totals for the required column names or expressions.
  • CUBE: Specifies the creation of subtotals and totals for all combinations of columns in the GROUP BY clause.
  • GROUPING SETS: Allows the use of multiple GROUP BY clauses, such as using ROLLUP and CUBE together.

HAVING

HAVING further filters the result based on values in the results, rather than the actual data. A HAVING clause only applies to columns that are included in the GROUP BY clause or in an aggregate function. Building on the same example used in the WHERE, ORDER BY, and GROUP BY sections, here we want to additionally know which locations carry an inventory of over 100 items per product. For that, after the GROUP BY clause, the query has a HAVING clause over the aggregate function, where its result is greater than 100:

SELECT P.Name AS ProductName, SUM([PI].Quantity) AS Total_Quantity, L.Name AS LocationName
FROM Production.Product AS P
INNER JOIN Production.ProductInventory AS [PI] ON P.ProductID = [PI].ProductID
INNER JOIN Production.Location AS L ON [PI].LocationID = L.LocationID
WHERE P.Name LIKE 'Touring%'
GROUP BY P.Name, L.Name
HAVING SUM([PI].Quantity) > 100
ORDER BY P.Name DESC, L.Name DESC;

The following screenshot shows the results as containing only rows with an aggregate Total_Quantity greater than 100:

 

Logical statement processing flow

When writing T-SQL, it is important to be familiar with the order in which the SQL Server Database Engine interprets queries, to later create an execution plan. This helps anticipate possible performance issues from poorly written queries. However,  it also helps to understand cases of unintended results. The following steps outline the process that the database engine follows to process a T-SQL statement:

  1. Parse the query for correctness; in other words, validate the syntax.
  2. Build a structure that represents the logic of the query as expressed by the developer—a query tree, also called a sequence tree.
  3. Process all the source and target objects stated in the FROM clause (tables, views, TVFs), together with the intended logical operation (JOIN, APPLY) to perform on those objects.
  4. Apply whatever pre-filters are defined in the WHERE clause that can reduce the number of incoming rows from those objects.
  5. Apply any aggregation defined in GROUP BY, followed by any filters that can only be applied to the aggregations.
  6. Keep only the required columns for the output, and account for any limits stated in a TOP or DISTINCT clause.
  7. Order the resulting row set as specified by the ORDER BY clause, and make the result set available for the client.
Keep in mind that even though TOP is processed before the ORDER BY clause, during execution the entire result set is sorted before the TOP clause is applied.

It becomes clearer now that properly defining how tables are joined (the logical join type) is important to any scalable T-SQL query, namely by carefully planning on which columns the tables are joined. For example, in an inner join, these join arguments are the first level of data filtering that can be enforced, because only the rows that represent the intersection of two tables are eligible for subsequent operations.

Then, it also makes sense to filter out rows from the result set using a WHERE clause, before applying any post-filtering conditions that apply to sub-groupings using a HAVING clause. This is because SQL Server evaluates a WHERE clause before a HAVING clause, and it can limit the row count earlier in the execution phase, translating into reduced I/O and memory requirements, and also reduced CPU usage when applying the post-filter to the group.

The following diagram summarizes the logical statement processing flow for the building blocks discussed in this chapter:

 

Summary

To understand how to write solid, performant T-SQL, users should become familiar with how SQL Server runs T-SQL syntax to deliver the intended result sets in a scalable fashion. In this chapter, we covered the basic building blocks that make up a T-SQL statement, as well as how SQL Server interprets those blocks to begin the process of executing the user's query. The concepts introduced in this chapter will be used throughout the remaining sections of the book to explain most patterns and anti-patterns, as well as mitigation strategies.

In the next chapter, we will build on our knowledge of how SQL Server processes T-SQL statements to understand how the Query Processor optimizes, caches, and ultimately executes the query.

About the Authors

  • Pedro Lopes

    Pedro Lopes is a Program Manager in the Database Systems group, based in Redmond, WA, USA. He has over 19 years of industry experience and has been with Microsoft for 9 years. He is currently responsible for program management of Database Engine features for in-market and vNext versions of SQL Server, with a special focus on the Relational Engine. He has extensive experience with query performance troubleshooting and is a regular speaker at numerous conferences such as SQLBits, PASS Summit, SQLIntersection, Microsoft Ignite, and Microsoft Build. He blogs about SQL on the SQL Server Team blog. He has authored several tools in the Tiger toolbox on GitHub: AdaptiveIndexDefrag maintenance solution, BPCheck, and usp_WhatsUp.

    Browse publications by this author
  • Pam Lahoud

    Pam Lahoud is a Program Manager in the Database Systems group, based in Redmond, WA, USA. She has been with Microsoft for 13 years and is currently responsible for program management of Database Engine features for in-market and vNext versions of SQL Server, with a special focus on the Storage Engine area. She is passionate about SQL Server performance and has focused on performance tuning and optimization, particularly from the developer's perspective, throughout her career. She is a SQL Server 2008 Microsoft Certified Master (MCM) with over 20 years of experience working with SQL Server.

    Browse publications by this author

Latest Reviews

(2 reviews total)
i could find what i was looking for, to support my developer collegues.
The content is excellent. These authors know their subject and communicate it well.

Recommended For You

SQL for Data Analytics

Take your first steps to become a fully qualified data analyst by learning how to explore large relational datasets.

By Upom Malik and 2 more
Networking Fundamentals

Become well-versed with basic networking concepts such as routing, switching, and subnetting, and prepare for the Microsoft 98-366 exam

By Gordon Davies
Hands-On Software Architecture with C# 8 and .NET Core 3

Design scalable and high-performance enterprise applications using the latest features of C# 8 and .NET Core 3

By Francesco Abbruzzese and 1 more
Mastering Visual Studio 2019 - Second Edition

Explore the new features in Visual Studio 2019 and build apps using WPF, .NET Core, TypeScript, and Azure

By Kunal Chowdhury