There is a maxim that comes from the world of Python programming: “There should be one - and preferably only one - obvious way to do it.” (See The Zen of Python) in the references. While that is a good goal for any language, it is a difficult goal to achieve. T-SQL is no exception here!
Consider a simple problem: You have a customer transaction table with dated rows. You are asked to produce a list of the top transactions per day. “Top” could be defined as the one with the most items, the highest value, the most important customer or a variety of other criteria. However, you define it, at some point, you will want the maximum (or minimum) value of one of the columns. Then you will want to output the entire row (or rows, in the case of a tie) matching that value. Sounds simple, right? Well, it actually is simple, but the number of ways to do it may surprise you!
This article will look at the various ways to construct such a query and analyze each one for its performance characteristics. The goal is not so much about picking a winner as it is to learn how to analyze queries using a tool like ApexSQL Plan.
Let’s consider a simple transaction table. In all likelihood yours will be more complicated than this one, but it will suffice for the examples and analysis that follows.
CREATE TABLE [dbo].[Transactions] ( [TransactionID] [bigint] IDENTITY(1, 1) NOT NULL ,[CustomerID] [bigint] NOT NULL ,[Transactiontype] [tinyint] NOT NULL ,[TransactionTime] [datetime] NOT NULL ,[TransactionAmount] [money] NOT NULL ) ON [PRIMARY] CREATE CLUSTERED INDEX [IX_Transactions_TransactionID] ON [dbo].[Transactions] ([TransactionID] ASC)
I’ll use ApexSQL Generate to populate this table with 100,000 rows, choosing random values except for the identity column, which is assigned by SQL Server.
For this exercise, let’s assume that the problem is stated like this:
“For each day, select the row containing the greatest value in the TransactionAmount column. Now that means that we need to find the greatest amount for each day and use that to get to the rest of the columns. Selecting the greatest amount for each day is a simple query with GROUP BY:
SELECT CAST(t.TransactionTime AS DATE) AS DateOfTransaction ,MAX(TransactionAmount) AS MaxTransactionAmount GROUP BY CAST(t.TransactionTime AS DATE) ORDER BY CAST(t.TransactionTime AS DATE);
This [Grab your reader’s attention with a great quote from the document or use this space to emphasize a key point. To place this text box anywhere on the page, just drag it.] works just fine, but I’m troubled by something. Three times in this query we have the expression:
CAST(t.TransactionTime AS DATE)
The way this is written, I need the expression in at least two places, the SELECT statement and the HAVING clause. I could change the ORDER BY to use the column alias DateOfTransaction, since column aliases are available after the SELECT portion of a query, but I’d really like to get rid of the repetition altogether. I can use CROSS APPLY for that:
SELECT dt.DateOfTransaction ,MAX(TransactionAmount) AS MaxTransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(DateOfTransaction) GROUP BY dt.DateOfTransaction ORDER BY dt.DateOfTransaction;
Now, I bet you’re wondering how this change affects the performance of the query. To answer briefly, not at all! To prove it, let’s fire up ApexSQL Plan.
If we just look at the estimated plan for each query, we can see that they are the same:
The CROSS APPLY version is a little longer to type but eases maintenance since the original column name is now referenced only once. The query is DRY-er.
“Well,” you say, “an estimated plan is all well and good, but what happens when you run it?” Good question! Let’s do that!
The I/O reads are identical, so I only pasted one copy. If you dig into the execution plan XML you won’t see any effect of the CROSS APPLY either, other than in the StatementText attribute. Since the CROSS APPLY only references columns in the current row in the input buffer, it has no effect other than making the query easier to read. Note that if the CROSS APPLY referenced another table, this would not be the case; it would function like a JOIN and the execution plan, CPU usage and I/O counts would reflect that.
Many developers coming from a non-declarative language such as Java or C or Python might think in terms of writing a loop. In T-SQL, that would be a cursor. We can use the query above in just such a loop to produce the desired results:
-- Get maximum amount by day DECLARE MaxAmtCur CURSOR LOCAL FORWARD_ONLY FAST_FORWARD READ_ONLY FOR SELECT dt.DateOfTransaction ,MAX(TransactionAmount) AS MaxTransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(DateOfTransaction) GROUP BY dt.DateOfTransaction ORDER BY dt.DateOfTransaction; OPEN MaxAmtCur -- Create a table variable to hold the results DECLARE @DateOfTransaction DATE ,@MaxTransactionAmount MONEY DECLARE @Results TABLE ( [TransactionID] [BIGINT] ,[CustomerID] [BIGINT] NOT NULL ,[TransactionType] [TINYINT] NOT NULL ,[TransactionTime] [DATETIME] NOT NULL ,[TransactionAmount] [MONEY] NOT NULL ); -- Loop through the maximum amounts to find matching transactions WHILE 1 = 1 BEGIN FETCH NEXT FROM MaxAmtCur INTO @DateOfTransaction ,@MaxTransactionAmount; IF @@FETCH_STATUS <> 0 BREAK; -- Insert matching rows into the results table INSERT INTO @Results ( TransactionID ,CustomerID ,TransactionType ,TransactionTime ,TransactionAmount ) SELECT t.* FROM dbo.Transactions t WHERE CAST(t.TransactionTime AS DATE) = @DateOfTransaction AND t.TransactionAmount = @MaxTransactionAmount; END;
This script inserts the results in to a table variable. This lets me focus on the query and ignore the intermediate results. As for this one, well it certainly works but I didn’t have the patience to wait for it to complete. Here’s why: After the initial query to find the maximum amounts by day, the loop inserts one row at a time (RBAR = Row By Agonizing Row – an acronym created by Jeff Moden, long-time SQL Server MVP.) Now, the table of maximum values by day has over 2100 rows. How does the INSERT perform inside the loop? The plan diagram will show us!
It does a clustered index scan! Recall that there are 100,000 rows in the transactions table. That means 2100 iterations * 100000 rows to read all the rows to complete the job. I’ll let you do the math. The result is not pretty.
There is an ABC lesson here: Anything But Cursors! If you can do it any other way, it will perform better than a cursor-based approach. Keep in mind that a relational database is based on relational algebra, which combines set theory with predicate calculus. Think in sets, not rows!
Here is a join-based query:
SELECT TransactionID ,CustomerID ,TransactionType ,TransactionTime ,TransactionAmount FROM dbo.Transactions t INNER JOIN ( SELECT dt.DateOfTransaction ,MAX(TransactionAmount) AS MaxTransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(DateOfTransaction) GROUP BY dt.DateOfTransaction ) t1 ON CAST(t.TransactionTime AS DATE) = t1.DateOfTransaction AND t.TransactionAmount = t1.MaxTransactionAmount ORDER BY t1.DateOfTransaction
This is an enormous improvement! According to ApexSQL Plan, the query above runs in
The execution plan is a little complicated though:
And has two clustered index scans of the transactions table. What about the I/Os?
Quite respectable, really!
There are a few different alternatives to using an inner join. Perhaps you wondered if I could use CROSS APPLY? Indeed:
SELECT TransactionID ,CustomerID ,TransactionType ,TransactionTime ,TransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(DateOfTransaction) CROSS APPLY ( SELECT dt.DateOfTransaction ,MAX(TransactionAmount) AS MaxTransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(DateOfTransaction) GROUP BY dt.DateOfTransaction ) t1 WHERE CAST(t.TransactionTime AS DATE) = t1.DateOfTransaction AND t.TransactionAmount = t1.MaxTransactionAmount ORDER BY t1.DateOfTransaction;
This query produces the same plan and statistics as the INNER JOIN variant, so I won’t waste space here showing it. Changing the join operation to a CROSS APPLY (and the ON clause to a WHERE Clause) changes nothing on the inside. Note that in this case, the second CROSS APPLY performs the aggregating sub query and is not an intra-row operation as is the first one.
A more interesting alternative would be this:
WITH MaxAmts ( TransactionDate ,TransactionAmount ) AS ( SELECT dt.DateOfTransaction ,MAX(TransactionAmount) AS MaxTransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(DateOfTransaction) GROUP BY dt.DateOfTransaction ) SELECT t.TransactionID ,t.CustomerID ,t.TransactionType ,t.TransactionTime ,t.TransactionAmount FROM dbo.Transactions t WHERE EXISTS ( SELECT 1 FROM MaxAmts m WHERE m.TransactionDate = CAST(t.TransactionTime AS DATE) AND m.TransactionAmount = t.TransactionAmount )
The Common Table Expression (CTE) used in this example doesn’t change the plan. Recall that, except for the recursive variety, a CTE is just a way to write a sub query before the main query instead of after it. Used in this way, CTEs can make for more readable code. If you thought the correlated sub query in the WHERE EXISTS clause would change the plan, you’d be mistaken! The plan and performance characteristics are identical to the previous two queries. So, which one do I prefer to use? In general, when given a choice, I like queries that are easy to read. While I admit that that is subjective, for me it generally means opting for fewer subqueries and shallower levels of parentheses and indentation. With that in mind, I like this, hybrid version best:
WITH MaxAmts ( TransactionDate ,TransactionAmount ) AS ( SELECT dt.DateOfTransaction ,MAX(TransactionAmount) AS MaxTransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(TransactionDate) GROUP BY dt.TransactionAmount ) SELECT t.TransactionID ,t.CustomerID ,t.TransactionType ,t.TransactionTime ,t.TransactionAmount FROM dbo.Transactions t INNER JOIN MaxAmts m ON CAST(t.TransactionTime AS DATE) = m.TransactionDate AND t.TransactionAmount = m.TransactionAmount ORDER BY m.TransactionDate
This has the same execution plan and run characteristics as the first three and is – from my perspective at least – the easiest one to read.
There is one other thing these three variants have in common: All report a missing index.
I’ll come back to the problem of the missing index a bit later. For now, let’s continue with a different approach:
Starting with SQL Server 2005, Microsoft began introducing window functions. These are not Microsoft’s invention but rather their implementation of the ANSI/ISO Standard SQL:2003 which was later extended in ANSI/ISO Standard SQL:2008
Since we want to find the maximum/minimum value of some column, the MAX() OVER() and MIN() OVER() functions come to mind. Let’s start there!
SELECT t.TransactionID ,t.CustomerID ,t.TransactionType ,t.TransactionTime ,t.TransactionAmount FROM ( SELECT * ,MAX(TransactionAmount) OVER (PARTITION BY dt.DateOfTransaction) AS MaxTransactionAmount FROM dbo.Transactions t CROSS APPLY ( SELECT CAST(t.TransactionTime AS DATE) ) dt(DateOfTransaction) ) t WHERE t.TransactionAmount = t.MaxTransactionAmount ORDER BY t.TransactionTime;
This query produces a complicated plan:
© ALL RIGHTS RESERVED. Feedback Terms of Use Privacy Cookie Preference Center