SQL SERVER – Stream Aggregate and Hash Aggregate

February 17, 2020

SQL Server Performance tuning is an amazing subject, it never stops to amaze me. Every single day, I take only one Comprehensive Database Performance Health Check and keep the rest of the time open for either learning new stuff or helping people with On Demand (50 Minutes). Recently during the on-demand consultation, I was asked a very interesting question about Stream Aggregate and Hash Aggregate. Let us discuss the same topic today.

SQL SERVER - Stream Aggregate and Hash Aggregate Stream-Aggregate-800x182

Stream Aggregate and Hash Aggregate

Senior DBA recently asked me what is my preference between stream Aggregate and Hash Aggregate and how do I react when I see them in the execution plan.

My answer to this question is very simple and it is – SQL Server knows the BEST in most of the cases. Let us understand my answer a bit more in detail now. Before we continue our discussion, let us understand the definition of the two of the physical aggregate operators.

Stream Aggregate

The stream aggregate is a physical operator to aggregate the data and it expects input rowset must be sorted in order of the grouping columns. When this operator receives the input rowset in the sorted format, it usually performs very efficiently.

Hash Aggregate

The hash aggregate is a physical operator to aggregate the data and it works by creating a hash table and hash buckets like hash join to aggregate the data. This operator is very effective and efficient when the input data rowset is not in the sorted format.

Real World Demonstration

Now let us see a quick demonstration. In the following demonstration, we will see a query that has both the aggregate operators – Stream and Hash. We will try to force our query to use only one kind of aggregate operation and see how it performs.

Here is the query which we will use for our demonstration. Please do not forget to enable the execution plan and also include the command to see statistics IO for our queries. The query is running on the sample database WideWorldImporters.

SET STATISTICS IO ON
GO
USE WideWorldImporters
GO
SELECT [CustomerID], COUNT([OrderID]) totOID
FROM [Sales].[Invoices] o
GROUP BY [CustomerID]
GO

Let us see the statistics IO for the query.

Table ‘Worktable’. Scan count 0, logical reads 0
Table ‘Workfile’. Scan count 0, logical reads 0
Table ‘Invoices’. Scan count 9, logical reads 11994

Let us see the execution plan. You will notice that there are both the operators’ Stream and Hash in this simple query.

SQL SERVER - Stream Aggregate and Hash Aggregate Stream Aggregate1

Now let us for a query hint of Stream Aggregate.

SELECT [CustomerID], COUNT([OrderID]) totOID
FROM [Sales].[Invoices] o
GROUP BY [CustomerID]
OPTION (ORDER GROUP)
GO

Let us see the statistics IO for the query. You will notice the page IO reads are the same as the earlier query and there is no difference at all.

Table ‘Worktable’. Scan count 0, logical reads 0
Table ‘Invoices’. Scan count 9, logical reads 11994

Let us see the execution plan. You will notice that there are both the aggregation are using stream aggregate operators.

SQL SERVER - Stream Aggregate and Hash Aggregate Stream Aggregate2

Now let us for a query hint of Hash Aggregate.

SELECT [CustomerID], COUNT([OrderID]) totOID
FROM [Sales].[Invoices] o
GROUP BY [CustomerID]
OPTION (HASH GROUP)
GO

Let us see the statistics IO for the query. You will notice the page IO reads are the same as the earlier query and there is no difference at all.

Table ‘Worktable’. Scan count 0, logical reads 0
Table ‘Workfile’. Scan count 0, logical reads 0
Table ‘Invoices’. Scan count 9, logical reads 11994

Let us see the execution plan. You will notice that there are both the aggregation are using hash aggregate operators.

SQL SERVER - Stream Aggregate and Hash Aggregate Stream Aggregate3

So essentially, there is no difference in the Page IO read when you look at the queries. I have tested this query under extreme workload tests and the query most of the time is giving us almost the same performance as well.

This tells me that SQL Server had made already the best possible decision when it originally built the execution plan. It had taken the input rowset and output required and build the best possible execution plan.

Summary

While in this example, when I used the query hint, the performance was not much different in both the cases, it is very much possible that if the data changes, your performance start changing when you have used query hints. However, wherever SQL Server Engine has built the execution plan, based on the statistics, it can easily change the aggregate operators and tune the plan accordingly.

Instead of forcing query hints, I would leave the decision of using the aggregate operation to the SQL Server Optimization engine.

Sneak Peak

During the consultation with my client, I explained to him that he should not worry about the aggregate operations but rather focus on how we can work on reducing the Statistics IO. We will talk about this in future blog post post.

Reference: Pinal Dave (https://blog.sqlauthority.com)