I think the title is clear what I am going to write in your post.
This is age old problem and I want to compile the list stating advantages and disadvantages of using GUID and INT as a Primary Key or Clustered Index or Both (the usual case).
Let me start a list by suggesting one advantage and one disadvantage in each case.
- Numeric values (and specifically integers) are better for performance when used in joins, indexes and conditions.
- Numeric values are easier to understand for application users if they are displayed.
- If your table is large, it is quite possible it will run out of it and after some numeric value there will be no additional identity to use.
- Unique across the server.
- String values are not as optimal as integer values for performance when used in joins, indexes and conditions.
- More storage space is required than INT.
Please note that I am looking to create list of all the generic comparisons. There can be special cases where the stated information is incorrect, feel free to comment on the same.
Please leave your opinion and advice in comment section. I will combine a final list and update this blog after a week. By listing your name in post, I will also give due credit.
Reference: Pinal Dave (https://blog.sqlauthority.com)
No James… they are not “globally unique”. The term “Globally Unique” is a misnomer and it is an error for Books Online to say “A GUID is a unique binary number”. Type 4 GUIDs are nothing more than random numbers with a huge domain and Microsoft currently uses TYPE 4 GUIDs.
Please read the MSDN article at the following URL…
… where it states in the REMARKS section, and I quote…
“Such an identifier has a very low probability of being duplicated.”
Microsoft uses “TYPE 4” Guids which no longer contain the MAC address of the computer like “TYPE 1” Guids did. To understand the differences between TYPE 1 and TYPE 4 Guids, please see the following URL…
The bottom line is that GUID’s currently being used in SQL Server are NOT guaranteed to be globally unique. Although the likelyhood of duplication is very low because of the very large domain, Type 4 GUIDs (which is what Microsoft currently uses) aren’t even guaranteed to be unique on a single machine.
“Such an identifier has a very low probability of being duplicated.”
Do you have any idea how low the probability is?
“equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.”
You have a much greater chance of a stray sub-atomic particle flipping bits in your RAM. If you’re not programming around that potential problem, then for all intents and purposes GUIDs (even type 4) are unique.
True enough but because TYPE 4 GUIDs are NOT guaranteed to be unique and DO stand a chance, no matter how remote it may be, of being duplicated, you run into the same problem as if you used an IDENTITY column… you need some central repository that all global entities in your enterprise have to check and see if the number has ever been duplicated in your system. They cannot operate asynchronously with a **guarantee** of unique TYPE 4 GUIDS… and Microsoft currently uses TYPE 4 GUIDs (at least in 2k5, they do).
The chance is so mind-bogglingly tiny that there’s no point in being worried about it, you can treat them as unique. You would need to generate a billion GUIDs per second for 100 years to have a 50% chance of creating a duplicate.
That’s no worse than the chance of a neutrino from space hitting your hard drive platter and causing an IDENTITY column to create a duplicate.
Please see the links I posted two posts above. Books Online is incorrect.
From your own link: A GUID is a 128-bit integer (16 bytes) that can be used across all computers and networks wherever a unique identifier is required. Such an identifier has a very low probability of being duplicated.
notice across all computers and networks.
Very low probability means that you have a better chance of being hit by lightening 1000 feet under water than having a duplicate.
As for needing a check. You don’t need to check. You need to handle a primary key constraint error and regen a new one when you have that error fired at you, and then save it again.
Since this will happen about once in the next 1000 years this isn’t a high cost and you don’t have to round trip.
Same is true for synchronization. Which is why MS uses them in their own replication. And by making them the primary key, it makes it even easier to fail fast instead of worrying about dups.
I also said that the probablity was remote. The problem is, it’s still a possibility and with that, the TYPE 4 GUIDs are not what they say they are. They are NOT globally unique and regardless of how remote it may be, duplicates can occur.
Yes, I agree that if everyone in your enterprise dumps their data to the same table, then a PK or UNIQUE index will catch it. However, there are a lot of folks who use GUIDs as the PK for many or all of their tables and duplication could occur with the dupes in separate tables where there’s no chance of a PK or UNIQUE index checking on it. If your systems are built under the assumption that every GUID is guaranteed to be unique across all tables and all servers for your enterprise across the world and you “hit the lottery” with one of those extremely rare dupes and you don’t know that GUIDs can be duplicated, I’m thinking that it’ll take you more than a couple of minutes trying to figure out what the heck happened.
The listed advantage of GUIDs (as they apply to SQL Server) being unique across the world is false.
GUIDS are not strings and JOIN at the exact same performance level as INTS on anything better than an 1970 8088 chipset.
Jeff Moden declared that GUIDS are not unique, Well until someone has identified a collision where the GUIDS where system derived and not manually set, I would disagree. There are 6 x 10 ^ 28th Stars in the sky.
There 1 x 10 ^ 128th possbile GUIDS
Identifying a Column as a Primary Key adds the unique Constraint. I have collapsed 10 Terabytes of disc space and failed to create a duplicate GUID.
Maybe a GUID non-clustered primary key in a distributed environment, but I’d need a convincing case made. Really only a natural key should be the PK.
As for clustering on a GUID, that is totally wrong. The disadventages are: page splits, bloated rowid, fragmented indexes, and poorly performing bookmark lookups.
The advantages would be what?
Randy Pitkin wrote:
>>Jeff Moden declared that GUIDS are not unique, Well until someone has identified a collision where the GUIDS where system derived and not manually set, I would disagree. There are 6 x 10 ^ 28th Stars in the sky.
There 1 x 10 ^ 128th possbile GUIDS
True enough. To make it even more awsome sounding, if a mile were just 1 GUID, then every possible GUID lined up in a row would be wider than 17 Quadrillion Milky Way Galaxies. Further, since the GUIDs on a single machine are pseudo random (sequence implied), that makes them unlikely to ever repeat until they’ve all been used.
However, the fact that they are pseudo random also means that, in theory, two machines could generate the same GUID. It’s good that you use UNIQUE indexes on your GUIDs ;-) especially if you use sequential GUIDs.
The bottom line is that Microsoft has stated that the GUIDs aren’t unique and that they could repeat and you shouldn’t assume a “guarantee” of uniqueness. I’ll see if I can find the URL where I saw that later on tonight.
Shifting gears, I agree with what Eric Russell stated above. Unless you happen to have an affinity for page splits, you should never use a GUID as a clustered index.
I would say it depends on the solution that works best for your environment. I have seen both INTs and GUIDs working and not working in production environments. So it would be unfair for either camp to say which is the best of the lot.
I just posted my thoughts on this topic here:
I have created a table with 2 columns (i.e col1 ,col2)
and inserted 2 rows with null values in second column (i.e Col2)
after that i fired below qry
select disinct col2 from
as a result above query returned only single row instead of returning 2 rows,how can it be possible as the two nulls will not be same.
—–Sql Scripts ——————
create table #tmd(col1 int,col2 int)
insert into #tmd(col1)values(1)
insert into #tmd(col1)values(2)
select distinct col2 from #tmd
I’m interested to know if DB’s like SQL Server have been optimised for indexing and joins on GUID’s as opposed to other alphanumeric ID’s? In other words, does GUID processing behave like any other alphanumerics, or are they a special case? And if so, is this only the case for SQL Server?
GUIDs aren’t stored as alphanumeric values, they’re treated internally as 128-bit integers.
@Randy Pitkin: your numbers are wrong. Being 128 bits in size, it’s a maximum of 1*2^128, not 1*10^128. Then, since the GUID version requires some bits (6 I think) to be static, this reduces the range to 2^122, which is still a fair number of possibilities with over 5*10^36 different GUIDs.
I would say, the best candidate for a clustered index is the INT and the worst is the GUID;
First because of sorting, it is much easier to sort and INT than a string
Second indexes are stored in a B Tree structure, which makes the size of the index not only affect the storage, but also affect the length of the B Tree and accordingly affect the performance (the shorter your tree is the faster you reach your node).
It is extremely dangerous to have a bad performance clustered index, a bad index is worst than no index , because the cost of maintaining a bad index , and the cost of its “updates” and “inserts” will certainly overlook the little or no performance gain on the “select” and because all the non clustered indexes will be build using the clustered index that means they will have bad performance as well.
However if I needed to have the GUID as the primary key, I would do that but I would create another unique key for the clustered index (the default in MSSQL is to have the clustered index as the primary key but it is not necessary) , why ? because eventually most of the search on my table will not be by ID but rather by another logical (business) Key , meaning I will depend on non clustered indexes to gain high search performance , and my non clustered indexes will be attached to the clustered index , which needs to be small, unique and incremental to gain the best performance.
GUID is not a string…
@Natasha, I agree that a B tree will be slower as it get bigger, but if you 10 item in it, it does not matter if the time of the item is a GUID, or an int, it still 10 items. so the fact that a int is smaller the a GUID does not help the case when it comes to B Tree.
It’s worth thinking about the sort of algorithm that creats Guids.
A baseline of static machine specific data is assembled at machine start-up: machine name, BIOS manufacturer and version, MAC address, CPUID, processor string, mainboard manufacturer, windows version…. you name it.
At OS logon, append some more data such as user name, domain membership, account creation and last login dates.
At the moment of creating a Guid, append some extra time-specific information: The current date and time. The number of milliseconds elapsed since the last reboot. The number of keystrokes appended since the last reboot. That sort of thing.
Now take this entire data set and perform something like a 128-bit MD5 hashing algorithm upon it. Voila! you have created a Guid. The computational expense of all this is why it takes an order of magnitude longer to create a new Guid than a new integer (but we’re still microseconds on a modern PC).
The likelihood of duplicate Guid creation is far more dependent upon the hashing algorithm used. Different hashes have different probabilities of collision. MD5 is pretty good, but it can happen in theory. In the parlance, it is “computationally infeasible”.
Consequently: Guids are fine as a surrogate primary key ***provided you then remove the clustered index created by default on this key***. Being a primary key means that the DBMS will enforce uniqueness, so don’t lose sleep over the astronomically minute chance of duplicate key generation.
Of course, you should only be doing this in the situation where no suitable natural key exists, or maybe if a natural key is so wide (e.g. a big compound key) that it would occupy more than ~900 bytes and is thus unusable on the Sybase architecture. And if an int or a bigint is inappropriate (replication issues being one possibility).
All relational database engines are optimized to work with indexes on columns whose value increments after insertions (such as with in integer) – this is knows as a ‘natural progression.’
Relation database engines are NOT optimized to work with indexes on columns whose value is assigned a random value upon insertion (such as with a GUID, or any other alphanumeric string) – this is known as an ‘unnatural progression’.
If you are implementing a solution that uses a relational database engine to store your data, whenever you are choosing a data type for either an index column or for a table’s primary key ALWAYS choose an integer (of the appropriate size).
Choosing a GUID or some other alphanumeric data type for an index column or for a primary key is now and has always been a rookie’s mistake.
a GUID is not alphanumeric. Your rookie is showing. :-)
If you’d bothered to read the rest of the thread what you just wrote was already covered.
Yes, random GUIDs are bad. However sequential ones are not, and are only double the side of int64. Thus there is very little negative effect in using them. Either use newsequentialid or create your own function that does the same and you’re golden.
Recently, I was working on GUID. And GUID seems a big horror for me. That’s becuase I wasnt able to convert string value to GUID. Everytime stucked with the same error Conversion failed while converting GUID to string. So what’s the solution now?
A couple of comments. First, a googling of the performance hit of sequencial GUID in a DB turns up a typical answer in the 5-10% range. In other works operations that take 1 unit of type with INT take approximately 1.05 to 1.1 unit of time with GUIDs. This is obvisouly significant, but not necessarily a show stopper. You should construct your own benchmark and find out what the effect is for your own data. The same is true for the additional storage. Yes consuming four time the storage for the idenitfier is worth noting. On a table with several (I think the number was 8) keys this is even more significant. On the other hand if the table has a few string columns (i.e. nvarchar(150)), then the storage of the index has become a fairly small percentage of the overall table size.
There is another advantage of GUIDs that has not been mentioned here. They solve the cross table refence bug. In this bug a peice of code gets an ID from table A and then erroneously uses it as a foriegn key in table B. The bug is that the foriegn key column in Table B references Table C. So let’s say that there is an invoice and an invoice detail table. There is also a customer table. It is possible to write code that incorrectly uses the customer id value in the invoice details table as a reference to the invoice that detail row should be associated with. Using sequenctial integers for ids this will almost always work. Using GUID this will almost always fail.
BTW, Please comment here when you create the new Post with all these thoughts consolidated. I found this quite informative, though somewhat laborious to read.
Yes, I will have a new post out soon.
I am using this article to create my benchmarks:
You can view the English translation here:
I changed the GUID table to use NEWSEQUENCIALID rather than NEWID.
I am seeing similar difference in the insertions (note that Sprecher is a little faster than the machine they used. )
Still the ratio is about 2 to 1 on many sequential inserts.
The clustering is less problematic by far:
table_name index_id index_type_desc index_depth avg_fragmentation_in_percent page_count
TestGuid 1 CLUSTERED INDEX 2 14.8148148148148 27
TestGuid 2 NONCLUSTERED INDEX 2 33.3333333333333 9
TestInt 1 CLUSTERED INDEX 2 9.52380952380952 21
TestInt 2 NONCLUSTERED INDEX 2 33.3333333333333 3
The 300% increase is troubling, but not nearly so troubling as the 1000% increase of the original article.
Similarly the space usage is not as dire:
index_name type_desc space_used_in_kb
PK__TestGuid__3213E83F59063A47 CLUSTERED 216.0
PK__TestInt__3213E83F5DCAEF64 CLUSTERED 168.0
uix__TestInt__fk NONCLUSTERED 24.0
uix__TestGuid__fk NONCLUSTERED 72.0
While he was getting 2 to 1 cluster and 4 to 1 not clustered, my numbers are more like plus 30% and 3 to 1. So space is an issue, but performance is likely not.
In addition I created a set of single row selects from both tables. The set included every row value (I created 2048 rows in both tables). The results are:
As we see the performance is not quite 2 to one, but remember that this represents 2048 individual selects. The performance difference per select would be difficult to measure.
The result is that though there is a difference it is not particularly dire for my use case. Perhaps for yours it is. As with most things it depends on what you plan to do with it.