[Note from Pinal]: This is a 53rdth episode of Notes from the Field series. Everyday I get 100s of emails and most of the emails have a similar request. Everyone wants to get maximum performance, but they want to make the least amount of changes in their code. Well, though both of them are contradictory requests, it is possible in most of the cases if you know the technology inside like Linchpin People do. Here in this blog post, my close friend Stuart Ainsworth explains a cool trick, which I just learned today after so many years of experience. Wow, Stuart – thanks for this amazing note from the fields – I learned something new and there will be so many who will enjoy this post.
In this episode of the Notes from the Field series database expert Stuart Ainsworth explains Using Bitwise And (&) Instead of a Junction Table.
In the simplest version of a fruit basket, you’d have two database objects: the basket, and the assortment of fruit. Baskets can use different combinations of fruit, and samples of fruit may appear in more than one basket, like so:
Basket 1: Apples
Basket 2: Apples, Bananas
Basket 3: Grapes, Apples
Basket 4: Strawberries, Bananas
The traditional method of modeling this relationship would be to use a junction table, as illustrated below.
However, my client had 500,000 baskets, and roughly 50 different fruits to choose from. Assuming that every basket had at least 10 different fruits, the junction table would have at least 5,000,000 rows of data. Even though the junction table was well indexed and strongly typed, my client’s design was suffering from slow read times. The client needed an alternative. Enter the bitwise AND (&).
Setting Up a Demo
Let’s set up a demo that illustrates both the junction table method and the bitwise AND alternative. First, you’ll create the following three tables and populate them (using table valued constructors):
- Baskets, which includes a column for use with the Bitwise AND
- FruitID, which is set up for use with the Bitwise AND
- FruitBaskets, which is a junction table
Note that primary and foreign key references are not included for the simplicity of the demo. You’ll also be adding an extra column to the Baskets table to use for the Bitwise join. Finally, note that the ID column of the Fruit table mirrors the decimal values of the binary bit positions (e.g., 1, 2, 4, 8, 16, 32, 64, 128).
CREATE TABLE Baskets
(
BasketID INT
, BasketName VARCHAR(100)
, FruitBitHash BIGINT
)
CREATE TABLE Fruit
(
FruitID BIGINT
, FruitName VARCHAR(20)
)
CREATE TABLE FruitBaskets
(
BasketID INT
, FruitID BIGINT
)
GO
INSERT INTO Fruit
( FruitID, FruitName)
VALUES ( 1, 'Apples'),
( 2, 'Bananas'),
( 4, 'Grapes'),
( 8, 'Strawberries')
GO
INSERT INTO dbo.Baskets
( BasketID, BasketName, FruitBitHash)
VALUES ( 1, 'Apples', 1),
( 2, 'Apples, Bananas', 1 + 2),
( 3, 'Grapes, Apples', 1 + 4),
( 4, 'Strawberries, Bananas', 8 + 2)
GO
INSERT INTO dbo.FruitBaskets
( BasketID, FruitID)
VALUES ( 1, 1),
( 2, 1 ),
( 2, 2 ),
( 3, 1 ),
( 3, 4 ),
( 4, 8 ),
( 4, 2 )
GO
Now that you’ve got your tables set up, let’s run a couple of queries. First, you’ll use a junction table (the traditional, normalized model), and then you’ll use the Bitwise AND (&). In both cases, youy’re looking for baskets that contain apples:
/*Select the fruitbaskets containing Apples using the junction table*/
SELECT BasketID, BasketName
FROM dbo.Baskets b
WHERE EXISTS (SELECT *
FROM dbo.FruitBaskets fb
JOIN dbo.Fruit f ON fb.FruitID = f.FruitID
WHERE b.BasketID = fb.BasketID
AND f.FruitName = 'Apples')
GO
/*Select the fruitbaskets containing Apples using the bithash*/
SELECT BasketID, BasketName
FROM dbo.Baskets b
WHERE EXISTS (SELECT *
FROM dbo.Fruit f
WHERE b.FruitBitHash & f.FruitID <>0
AND f.FruitName = 'Apples')
GO
If you run this demo, you’ll see that you get the exact same results from the two queries. However, the first query would need to read data from 3 tables, and the second query only needs 2. If the junction table is very large, the traditional method can be significantly slower than the second method.
But how does it work? An excellent explanation can be found here, but the short answer is that when you’re using the Bitwise AND (&) to compare two different integers, any value other than 0 that is returned from that comparison means that those integers share a common base. The magic happens with this line of code:
WHERE b.FruitBitHash & f.FruitID <>0
So, why don’t we do this all the time?
There’s an old expression, “If all you have is a hammer, then everything looks like a nail.” Different tools are best suited for different problems. The limitations of using the Bitwise method to remove a junction table include:
- Violation of relational integrity: The surrogate IDs in the lookup table (e.g., the Fruit table) have to have a specific order and meaning. If you make a mistake when setting up the key values, you can get wrong answers.
- A limited number of bitwise values can be stored in a bigint: In SQL Server, a bigint is 8 bytes, which means that there are 64 bits. When using a single bithash column, you can only have one value per bit. (Note that you can work around this by using multiple columns, but that gets complicated.)
The benefit of the Bitwise AND method is reduced disk I\O because it eliminates a large junction table. In this case, you did notice increased CPU usage using the Bitwise method, but the increase in performance was significant. However, on faster hardware, a junction table would probably have worked as well and still maintained relational integrity. For now, Bitwise AND is a useful tool for a very specific type of problem.
If you want to get started with SQL Server with the help of experts, read more over at Fix Your SQL Server.
Reference: Pinal Dave (https://blog.sqlauthority.com)
1 Comment. Leave new
Hi,
When the values are limited then I have use Enum. Basic idea is same in both.
So in my proc i will write only,
select * from Baskets
and get the value of FruitBitHash from enum in my c# code.
So in case of limited and static value using enum give far better performance.
But for unlimited and dynamic values like in above example we cannot use enum so this is better.
Thanks for sharing your idea .