Selecting a random record or a sample of random rows from an SQL Server table is a common requirement for many applications. However, unlike other databases, SQL Server has no built-in function to directly return randomized results. This post will explore various techniques to retrieve random rows in SQL Server and analyze their performance and randomness characteristics in depth.
Native Methods in SQL Server
Using NEWID()
The simplest method is to use the NEWID() function to generate a unique GUID for each row and select the top rows after ordering by it:
SELECT TOP 10 * FROM table ORDER BY NEWID();
However, this method performs terribly and does not scale for large tables. Some major downsides are:
- NEWID() will table scan the entire table, assigning a unique ID to every row. This is highly inefficient for large tables.
- Sorting the millions of rows consumes huge temporary disk space.
- The sorting process also takes a long time to complete.
The query cost increases exponentially with table size when using ORDER BY NEWID().
Leveraging TABLESAMPLE
SQL Server provides the TABLESAMPLE clause to retrieve a statistical sample of rows:
SELECT * FROM table TABLESAMPLE (10 PERCENT);
While this sounds like an efficient way to get random rows, there are some caveats:
- TABLESAMPLE does not sample random rows. It selects random 8KB data pages and returns all rows in those pages.
- If the table has a clustered index ordered by a column like ID, the sampled rows could belong to a particular range of IDs, skewing the results.
- TABLESAMPLE applies before WHERE, JOIN, or other filters. So, the sampled rows need to be further filtered.
Overall, TABLESAMPLE should be used carefully for generating random samples from a table.
Alternatives for Random Row Selection
Randomly Filtering Rows with CHECKSUM
A simple way to fetch random rows is to filter based on a random condition using the CHECKSUM function:
SELECT * FROM table WHERE ABS(CHECKSUM(NEWID())) % 100 < 10; -- 10% random sample
This performs very well as CHEKSUM is optimized in SQL Server and requires just an index/table scan. We can easily tune the randomized sample size by changing the filter percentage. Sometimes, this query does not return any results; in that case, just re-run it.
Some tips on using this method effectively:
- Ensure an index on the fields being checked to avoid a full table scan.
- Use NEWID() or CAST(Column AS VARBINARY) within CHECKSUM for better distribution. CHECKSUM(Column) alone can be biased.
- Use multiple CHECKSUM conditions with ORs and different filter percentages to reduce duplicates in larger samples.
Hybrid Approach
For a single random row, we can combine TABLESAMPLE and NEWID() ordering:
SELECT TOP 1 * FROM table TABLESAMPLE (100 ROWS) ORDER BY NEWID();
This minimized the rows sorted by limiting NEWID() ordering to the sampled subset. With a large enough sample, randomness is preserved while improving performance.
Conclusion
SQL Server provides multiple alternatives for retrieving random rows with different performance implications. Using CHECKSUM-based filtering or picking indexed column values works best for most cases. The TABLESAMPLE and NEWID() hybrid approach also executes quickly for a single random row. If you use any other random-row techniques, just let me know via LinkedIn.
Reference: Pinal Dave (https://blog.sqlauthority.com)