Of late, I penned down an article – SQL SERVER – Interesting Observation of ON Clause on LEFT JOIN – How ON Clause Effects Resultset in LEFT JOIN – which received a very intriguing comment from one of my regular blog readers Craig. According to him this phenomenon happens due to Logical Query Processing. His comment instigated a question in my mind. I have put forth this question to all my readers at the end of the article. Let me first give you an introduction to Logical Query Processing Phase.
What actually sets SQL Server apart from other programming languages is the way SQL Server processes its code. Generally, most programming languages process statement from top to bottom. By contrast, SQL Server processes them in a unique order which is known as Logical Query Processing Phase. These phases generate a series of virtual tables with each virtual table feeding into the next phase (virtual tables not viewable). These phases and their orders are given as follows:
1. FROM
2. ON
3. OUTER
4. WHERE
5. GROUP BY
6. CUBE | ROLLUP
7. HAVING
8. SELECT
9. DISTINCT
10 ORDER BY
11. TOP
As OUTER join is applied subsequent to ON clause, all rows eliminated by the ON clause will still be included by the OUTER join as described in the article SQL SERVER – Interesting Observation of ON Clause on LEFT JOIN – How ON Clause Effects Resultset in LEFT JOIN.
However, I am perplexed about the last two, ORDER BY and TOP. According to some people TOP comes first in logical query processing phase while others suggest that ORDER BY comes first. Now, here I’ve laid down my questions for you all to think about:
1) What is the correct answer for order query processing phase – ORDER BY or TOP?
2) How can we create an example to verify query processing phase for ORDER BY and TOP?
I will soon publish the answers I receive to the above questions on this blog, with due credit given to my readers.
Reference : Pinal Dave (https://blog.sqlauthority.com)
38 Comments. Leave new
Hi Pinal,
1) In My knowledge ORDER BY clause comes first and then TOP.
2) Example:
/* Create a Table to Play with */
CREATE TABLE MyTable(ID INT)
/* Insert Some random Data */
INSERT INTO MyTable(ID) VALUES(8)
INSERT INTO MyTable(ID) VALUES(9)
INSERT INTO MyTable(ID) VALUES(2)
INSERT INTO MyTable(ID) VALUES(1)
INSERT INTO MyTable(ID) VALUES(5)
INSERT INTO MyTable(ID) VALUES(3)
INSERT INTO MyTable(ID) VALUES(4)
INSERT INTO MyTable(ID) VALUES(10)
INSERT INTO MyTable(ID) VALUES(6)
INSERT INTO MyTable(ID) VALUES(7)
/* See the records */
SELECT * FROM MyTable
/*
If the TOP is applied first and then ORDER BY clause then we must see top 5 records in acsending order, something like 1,2,5,8,9
Otherwise, all records first gets into order and then top 5 will be displayed from that ordered list, like 1,2,3,4,5
Lets verify….
*/
SELECT TOP 5 * FROM MyTable ORDER BY 1
/* Clean Up */
DROP TABLE MyTable
-Vinit
In order to process the sql statement first it sort the records and then take top records from of it. for Example –
CREATE TABLE TEST(ID INT);
INSERT INTO TEST VALUES(1);
INSERT INTO TEST VALUES(2);
INSERT INTO TEST VALUES(5);
INSERT INTO TEST VALUES(6);
INSERT INTO TEST VALUES(3);
INSERT INTO TEST VALUES(4);
SELECT TOP 5 * FROM TEST
ID
1
2
5
6
3
SELECT TOP 5 * FROM TEST ORDER BY ID DESC
ID
6
5
4
3
2
SELECT TOP 5 * FROM TEST ORDER BY ID
ID
1
2
3
4
5
Integer- table containing digits 1- 1 000 000
select top 10 * from integer order by 1 desc returns
100000
99999
99998
99997
99996
99995
99994
99993
99992
99991
this proves order by is the first in simple words.
Also, if you take a look at query plan, table\index scan is ordered first. Then top is applied.
…..
…..
ooops, sorry, XML tags are not shown in my prev post
Top
|—IndexScan Ordered=”true”
The help file for TOP states “If the query includes an ORDER BY clause, the first expression rows, or expression percent of rows, ordered by the ORDER BY clause are returned. If the query has no ORDER BY clause, the order of the rows is arbitrary.”
Example:
First for an example, we want to get the query to read the TABLE in a different order than the ORDER BY. To do that, the data should be stored opposite the ORDER BY. That can be accomplished via a CLUSTERED INDEX:
CREATE TABLE #Data
(
Id_Clustered TINYINT,
Id_Ordered TINYINT,
CONSTRAINT Data_Id_PK PRIMARY KEY CLUSTERED (Id_Clustered)
);
INSERT INTO #Data(Id_Clustered, Id_Ordered)
SELECT 1, 2 UNION ALL
SELECT 2, 1;
Id_Clustered is the COLUMN the TABLE will be CLUSTERED on, whereas Id_Ordered will be in the ORDER BY of the query. The values in the two records are opposites. This will force a query without an ORDER BY and with an ORDER BY to be different:
SELECT
Id_Clustered,
Id_Ordered
FROM
#Data;
Yeilds:
1,2
2,1
Because that is how it is CLUSTERED. But, if we add an ORDER BY:
SELECT
Id_Clustered,
Id_Ordered
FROM
#Data
ORDER BY
Id_Ordered;
The result is reversed to:
2,1
1,2
because it is following the ORDER BY statement.
Add a TOP 1 for the test to see which one gets pulled:
SELECT TOP 1
Id_Clustered,
Id_Ordered
FROM
#Data
ORDER BY
Id_Ordered;
Which shows:
2,1
This is the same as the ORDER BY results.
The Query plan shows the Clustered Index Scan first and then a Sort (Top N Sort). It specifically reordered the results to follow the ORDER BY.
This example shows that ORDER BY is handled first, and then TOP.
Hi Brian Tkatch,
Very nice!
Regards,
Pinal
Brian,
You are answering like Imran, who used to answer really nice.
Hi Pinal –
Those examples definitely help show that order by happens first. The query plan will definitely show it as well.
As for a simple way to think about it, just think about what happens with each operator.
The ORDER BY is ordering your results, it is forcing them to come back in a certain way (contrary to relational theory :) ). The TOP operator is telling SQL Server to restrict the rows returned to a certain number of rows (either explicitly or by percentage). If TOP happened first that would mean you would just be ordering by the number of rows specified in the top. If you were to query a table and ask for the TOP 1 ORDERED BY a certain value that would be silly for a couple reasons:
1.) Silly to order by one row
2.) Let’s say you were querying a column that was not the clustered key, the TOP isn’t going to guarantee you which row you get (as described above) so not doing the ORDER BY first means you are not going to be getting meaningful results.
Good post and always helpful to remind folks the order of logical query processing (physical processing doesn’t always follow that order but it helps people think of ways to better limit their results earlier and improve performance.
Itzik’s book INSIDE SQL Server 2005 T-SQL Querying describes each of these steps in detail (though I believe that is actually Lubar’s chapter if memory serves correct)
Here, take a look at Itzik Ben-Gan’s flow chart of Logical Query Processing (PDF download):
“the breakpoint will not currently be hit.unable to bind sql breakpoint at this time”.
Can anybody help me to solve this error while try to debug
SQL CLR procedure in VS2005.
Thank you guys for this wonderful contribution. I am still not clear what is first ORDER BY or TOP.
Any more ideas?
Regards,
Pinal
Isn’t TOP part of your SELECT statement, so it runs in that phase, before ORDER BY runs…
So looking at what it does, how it works and looking at Itzik’s poster in the link above and his other books that logical processing is described in. I would say the answer is “Yes”
It would appear as though to do the top it has to know the Order By clause and it needs to understand there is an Order By, and then filter only the top number/percent of rows according to that.
The actual ordering for display or passing on to the next query happens after the TOP filter is applied (even though it looked at the order by).
One important thing to remember is that we are talking about the logical query processing here. The physical order of events is not necessarily (not, actually) the same always.
Nice post…
Left Join is Quite Simple !
but turns out to be very dangerous
if used incorrectly that is the query may take hours to execute and the final result contains lakhs of unwanted rows.
Simple way to implement join
on tables all having one field in common
say ID,i.e. ID field links all tables to each other
Sample simple query could be used for n number of tables and n number of joins
select
a.ID,
a.field2,
a.field3,
b.field1,
b.field2,
c.field1,
c.field2,
d.field1,
d.field2,
d.field3,
d.field4
from
(
(select * from table1) as a
left join
(select * from table2) as b
on a.ID=b.ID
left join
(select * from table3) as c
on a.ID=c.ID
left join
(select * from table4) as d
on a.ID=d.ID
)
I suggest never use left join !
If have to , create index on field like ID in above sample
on all tables to reduce query time
The logical processing order that you repeat above from the March blog comments isn’t correct in all cases, despite its being repeated in books, classes, and blogs. It is useful for explaining to those new to SQL how one might envision the processing being accomplished, however, it is neither specified by any ANSI SQL standard, nor is it guaranteed by Microsoft, nor is it accurate with respect to Microsoft’s current or prior implementations. No ANSI SQL standard requires a specific processing order, but instead leaves that to the implementation. The implementation’s are therefore free to use that flexibility to optimize certain scenarios as they see fit and as long as the syntax and results conform to the standard. Even the ORDER BY isn’t as cut and dry as one might suspect, since the implementation can use the information it knows to be true about the data to best effect. As a result, it would be unwise to declare statements in such a way as to depend on a specific processing order.
In the case of ORDER BY and TOP, Brian Tkatch gave you the only answer that you can hang your hat on; that is, that in the absence of an ORDER BY clause, TOP will return arbitrary rows. Therefore, you can consider that ORDER BY precedes TOP when ORDER BY is specified, however, that may not be what actually happens procedurally, just what apparently happens when the results come back. I can conceive of times where an optimizer could know enough about the data to select the TOP rows before even making a JOIN. Like gravity, Microsoft’s implementation is a black box to us: we all can make use of gravity everyday without having to know how it works. In this case, we are better off not knowing how it works, because we don’t want to be implementation-specific when we write our SQL. I can produce sample code that shows these differences if you’re interested.
P.S. It would be easy for someone to read Brian Tkatch’s comment and think that an ORDER BY is not required in the case where you pull results from a single table with a clustered index and the cluster key order is the order you desire. The fact is that result ordering is never guaranteed in the absence of an ORDER BY clause. On a related note, rows are not physically order by cluster key. The clustered index is a structure that is built from the data pages.
Excellent note Bob. I appreciate it.
Kind Regards,
Pinal
@Bob
Can you give an example where there is a CLUSTERED INDEX, no ORDER BY, and the query uses the CLUSTERED INDEX but the ordering is otherwise?
Brian,
>> Can you give an example …uses the CLUSTERED INDEX but the ordering is otherwise?
Something like this:
CREATE TABLE A (Col varchar(900) NOT NULL, Data varchar(4) NULL
, CONSTRAINT [PK_A] PRIMARY KEY CLUSTERED ([Col] ASC)
)
GO
INSERT INTO A (Col, Data) VALUES (‘2kakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnl1akfvbsaib83y34nlnmhsojhg’, ‘B’)
INSERT INTO A (Col, Data) VALUES (‘1kakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnlkakfvbsaib83y34nlnmhsojhg’, ‘A’)
GO
CREATE NONCLUSTERED INDEX [NCIX_Data_DESC] ON [dbo].[A] ([Data] DESC)
GO
SELECT TOP 2 * FROM A
The point is that when the BOL says that TOP returns arbitrary rows in the absence of ORDER BY, they are describing the expected behavior and not trying to give us insight into the implementation. The implementation, confronted with a clustered index without an ORDER BY, may typically return an ordered set; but we are not supposed to perceive it as an ordered set because there may be at least one codepath through the optimizer that doesn’t order as we would expect.
pinaldave seems to have a knack for finding those code paths :)
I can also give an example where step #2 ON is evaluated after step #8 SELECT or STEP #4 WHERE.
Don’t rely on implementation whenever and wherever you can get away with it.
Bob,
Its too late, I just read your conversation, but it might help some other reader. In your example, if you see, you created a non-clustered index. And the priority is, IF THE ONLY SELECTABLE COLUMN(s) IS AVAILABLE IN AN INDEX, and Order by is required, then instead of selecting from table, Sql Server prefers reading from Index. As it generates less I/O.
Create a table with two columns, A and B, Add a non-clustered Index on A, Now select
SELECT A
WHERE A BETWEEN (2500 AND 3000)
Now Create the same table with a Clustered Index and select
SELECT A
WHERE A BETWEEN (2500 AND 3000)
Compare the I/O of two queries. Query two will have much higher I/O than Query 1. Now change the query to
SELECT A, B
WHERE A BETWEEN (2500 AND 3000)
This time I/O will remain same for clustered Index, but much higher for non-clustered Index.
This depicts that Sql Server uses non-clustered Index if operations do not require any other column.
Since you created only one column. Add one more column, and select both of them and then tell the difference.
I hope this will clear the understanding (though after 4 years)
Brian,
I just reread the challenge. To have the query use the clustered index and not return cluster key ordered rows?
No, I don’t think I’ll attempt that.
But the query above shows that just having a clustered key present doesn’t mean that the TOP results will use that key. In the case above the Non-clustered index is used even w/o an ORDER BY.
@Bob
“just having a clustered key present doesn’t mean that the TOP results will use that key.”
That’s something the help file states.
But if the CLUSTERED INDEX is used, that is “probably” the order. According to Denny Cherry “If you don’t have an ORDER BY, then the SQL Server will probably sort by the clustered index and use it for filtering. If there is no clustered index it would then fall back to using the table itself and return the data in the order it comes across it.” From:
So, technically it doesn’t have to use it. And sure, you probably shouldn’t rely on it. But the fact remains, it is used. And blaming bugs in the code on “techncially”s is silly. Unless it really is the case, but my gut feeling is otherwise.
I just attended Itzik’s session on this subject.
One of the best!
He also discussed that when TOP is there ORDER BY is executed along with it. If it does not exist, ORDER BY is executed separate.
More detail I will post soon.
Great session.
Kind Regards,
Pinal
Hi Pinal , In which sequence Where clause is executed ?
thanks for this best post… can you please tell me which comes first in this two things
the increment(Identity) is incremented first
or the constraints are checked first
As I know the answer that increments comes first but I need a reason for it..
because if the increment is done first and then constraints are checked if constraints are not matched then still the identity is increased which is wrong according to me..
but why these happens please reply as soon as possible