
Native Methods in SQL Server
Using NEWID()
The simplest method is to use the NEWID() function to generate a unique GUID for each row and select the top rows after ordering by it:
SELECT TOP 10 * FROM table ORDER BY NEWID();
However, this method performs terribly and does not scale for large tables. Some major downsides are:
- NEWID() will table scan the entire table, assigning a unique ID to every row. This is highly inefficient for large tables.
- Sorting the millions of rows consumes huge temporary disk space.
- The sorting process also takes a long time to complete.
The query cost increases exponentially with table size when using ORDER BY NEWID().
Leveraging TABLESAMPLE
SQL Server provides the TABLESAMPLE clause to retrieve a statistical sample of rows:
SELECT * FROM table TABLESAMPLE (10 PERCENT);
While this sounds like an efficient way to get random rows, there are some caveats:
- TABLESAMPLE does not sample random rows. It selects random 8KB data pages and returns all rows in those pages.
- If the table has a clustered index ordered by a column like ID, the sampled rows could belong to a particular range of IDs, skewing the results.
- TABLESAMPLE applies before WHERE, JOIN, or other filters. So, the sampled rows need to be further filtered.
Overall, TABLESAMPLE should be used carefully for generating random samples from a table.
Alternatives for Random Row Selection
Randomly Filtering Rows with CHECKSUM
A simple way to fetch random rows is to filter based on a random condition using the CHECKSUM function:
SELECT * FROM table WHERE ABS(CHECKSUM(NEWID())) % 100 < 10; -- 10% random sample
This performs very well as CHEKSUM is optimized in SQL Server and requires just an index/table scan. We can easily tune the randomized sample size by changing the filter percentage. Sometimes, this query does not return any results; in that case, just re-run it.
Some tips on using this method effectively:
- Ensure an index on the fields being checked to avoid a full table scan.
- Use NEWID() or CAST(Column AS VARBINARY) within CHECKSUM for better distribution. CHECKSUM(Column) alone can be biased.
- Use multiple CHECKSUM conditions with ORs and different filter percentages to reduce duplicates in larger samples.
Hybrid Approach
For a single random row, we can combine TABLESAMPLE and NEWID() ordering:
SELECT TOP 1 * FROM table TABLESAMPLE (100 ROWS) ORDER BY NEWID();
This minimized the rows sorted by limiting NEWID() ordering to the sampled subset. With a large enough sample, randomness is preserved while improving performance.
Conclusion
SQL Server provides multiple alternatives for retrieving random rows with different performance implications. Using CHECKSUM-based filtering or picking indexed column values works best for most cases. The TABLESAMPLE and NEWID() hybrid approach also executes quickly for a single random row. If you use any other random-row techniques, just let me know via LinkedIn.
Reference: Pinal Dave (https://blog.sqlauthority.com)