SQL SERVER – GUID vs INT – Your Opinion

I think the title is clear what I am going to write in your post.

This is age old problem and I want to compile the list stating advantages and disadvantages of using GUID and INT as a Primary Key or Clustered Index or Both (the usual case).

Let me start a list by suggesting one advantage and one disadvantage in each case.

INT

Advantage:

  1. Numeric values (and specifically integers) are better for performance when used in joins, indexes and conditions.
  2. Numeric values are easier to understand for application users if they are displayed.

Disadvantage:

  1. If your table is large, it is quite possible it will run out of it and after some numeric value there will be no additional identity to use.

GUID

Advantage:

  1. Unique across the server.

Disadvantage:

  1. String values are not as optimal as integer values for performance when used in joins, indexes and conditions.
  2. More storage space is required than INT.

Please note that I am looking to create list of all the generic comparisons. There can be special cases where the stated information is incorrect, feel free to comment on the same.

Please leave your opinion and advice in comment section. I will combine a final list and update this blog after a week. By listing your name in post, I will also give due credit.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

About these ads

104 thoughts on “SQL SERVER – GUID vs INT – Your Opinion

  1. I like GUIDs, the fact that they can be shared across diferent instances makes it easy to move things back and forth with simple scripts.

    Also I like that I can put just “details.asp?id=(guid)” without the fear of the user just iterating or trying different ids to get records they should not access… not a big security, but enought for most simple apps, more if WS are used from the JS, I feel a little safer with just no effort

    Like

    • I have a problem with this sentence about the GUID: “String values are not as optimal as integer values for performance when used in joins, indexes and conditions.”
      My problem is: GUID’s are not stored or prosessed as string. The GUID or UUID is usually just a 128-bit Integer that is displayed (for human readability) as a Hexadecimal string. It’s identical to how IPv4 adresses are actually just 32 bit that are shown a decimal values with points (so please don’t store them as Varchar(15), use 1 INT or 4 tinyint’s) or IPv6 uses a 128-bit Integer. Or a single char/nchar 8 and 16 bit integers repsectively.
      The only real difference between GUID and a normal 128-bit Integer, is how the next number is generated – randomly, rather than sequentially.

      Also, regarding the Integers disadvantage:
      If a 32 bit singed Integer is to small, you can easily use a 64-bit integer instead. If you have a 64-bit system, there won’t be any performance issues. It will still be faster than guid/uuid, as they only need 8 bit in on this disk (and thus when beign read or written).but it’s much harder to run out of numbers than with 32-bit (9,223,372,036,854,775,807 – when it is singed).

      Like

  2. GUID

    Disadvantages:
    1) Clustered primary key with a sort will drag your table and database to its knees as more rows get entered. Since this is the default PK added by Enterprise Manager, this is dangerous.

    2) Not easily translated to enums in programming languages. For types, and type tables.

    Advantage: Easier integration with replication.

    Like

  3. Personally, I hate the use of Guids, performance is a big one but by far it is storage. One of our applications has a table that has 14,000 rows but 8 columns are guids because of joins, so you can imagine it eats up a lot of space because of the relationships.

    Guid’s take up a lot of space so I would avoid them where ever we can. If Int is too small, use bigint, still less space that Guid.

    Harry Yeh
    CEO / CTO

    http://www.ushopfast.com

    [2 links removed by admin to pass the comment test]

    Like

  4. Just to add one more point to the ‘advantages’ of GUID –

    When using a GUID column a new id can be generated at the client side.

    With INT, a client application typically will sand a ‘save’ request to the server, the server will save the record and send back the identity key generated by the insert. So the client application will know the new ID only after the record is saved.

    With GUID, the client application can generate a new value and can send it to the server. It does not need to wait till the SAVE function returns to know what is the ID.

    Like

  5. My primary involvement with database programming is desktop application development, and in that regard, I vastly prefer int. It’s much easier to interact with programmatically. I typically write code on the frontend to determine IDs on INSERT, or use a stored procedure which determines and selects a usable identifier, so that’s not an issue for me personally.

    Like

    • @George, do you have a concurrent enviroment? If you gen the Id(INT) on the client you have to garantee that no one else will get to the database first, if someone does, you loose the number.

      Like

  6. I’ve run out of int values several times. The next step up is a bigint, which at 8 bytes is still half the size of a guid.

    I would almost never use a guid as the clustered index key for a few reasons. First, 16 bytes is a big key. All nonclustered indexes will be based on the clustered index key which translates to more store and less efficiency. The second reason is fragmentation. Because guids are random, they will get inserted throughout the table increasing page splits and fragmentation.

    As Jacob said, guids are truly useful when you need to use them across systems, or generate them outside of the database. Even then, I was use the guid as a nonclustered key, and still cluster on an identity, or a datetime value.

    Like

  7. I think GUId as Primary Key will have performance issues. I feel it is not advisable to have GUId to be used as primary key because it will have overhead of both maintenance and performance while retrieving data.

    In order to address the issue of running out of values when using INT or BIGINT data types, we can think of various options based on the nature of data that we are storing in the table. I am sure even if we have a composite Primary key, it will have less overhead than GUId.

    Like

  8. Because guid’s are not consecutive, they can appear anywhere in the results. They are much harder to debug with, because a guid is much harder to enter into a query than an int.

    I like the idea of starting with an int and then moving on to a bigint myself (if it runs out of numbers, that is, which rarely happens). There’s no real benefit of generating a guid on the client, as you need to insert the data at some point and you can return the ID at that point.

    A guid primary key also gets added to every index that you create on that table, so every index becomes unnecessarily larger.

    If you really need a guid in a row, make it a secondary key.

    Like

    • Actually, I hate to use the guid to be the primary key either. but as the scenario as you mentioned that the client generate the guide which can be used immediately in you client side. you don’t need to wait for the response from server what the ID is.

      Like

  9. Is GUID really handled as string in joins and internally by the DB? That doesn’t make sense, at least to me.

    GUID is 128 bit wide value. It is usually presented as string like “{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}” which looks like it’s a biiiiiig value but you could just as well store it with four INTs or with two BIGINTs if you like. You could even stored it as BINARY(16).

    I am really amazed if Sql Server uses and stores them as strings.

    But still, if Sql Server stores them as integer there comes some performance hit because the value is larger than what the CPU understands so the value can’t be compared on a single pass. Probably, I don’t know how memory comparison works on that level.

    But anyway. I have two situations where GUIDs are used in a clever, but maybe not in so efficient way.

    In one software GUID was used as user ID and it was split into pieces where first 4 bytes told in what country person worked, I don’t remember what the next 4 was, but last 8 bytes was a running number and each country/the-thing-I-don’t-recall domain.

    Another software had on offline mode where user could create and modify records on the field without connection to main computer and sync changed back to central DB when in office. Naturally when you create new record you need a new ID (PK) for it. So they came up with an idea that every time the client was connected to the server it got a range of new keys which it could use offline. Keys were GUIDs where first 8 bytes told the user’s ID and last 8 bytes were running number reserved for the user alone. This way they never had to deal with duplicate keys.

    Like

    • @Marko Awesome man, I am actually gonna check about how it is stored in the DB, because it makes sense as you said. Take a look a the NonSQL with MongoDB, they use something like you are saying to distinguish the origin of information, if I am not mistaken

      Like

        • Thanks for the link!

          So it’s 16 byte binary value (16 consecutive bytes of memory). In machine language everything is memory so there’s no difference if it’s BINARY(16) or two BIGINTs or even four INTs.

          C-style pointer magic follows to demostrate that comparing 16 bytes of memory is fast. The following takes 3 to 5 CPU cycles if I calculated correctly.

          // Here are the GUIDs
          byte *guid1 = byte[16];
          byte *guid2 = byte[16];

          // The actual comparison operation
          ((*(int64*)guid1 == *(int64*)guid2) && (*(int64*)(guid1 + 8) == *(int64*)(guid2 + 8)));

          I love pointers and pointer arithmetic even though I’m happy I don’t have to use them in C# anymore :)

          Like

  10. Hi Pinal,

    GUID will create more fragmentation and it also occupies more disk space than Int.

    However it is useful in case it can be used domain wide, and newsequentialid() is used to generate guid.

    So, personally I would not suggest using GUID as PK unless it

    Like

  11. GUIDs performance is an issue when your db starts getting big.

    INT has a few flavours to pick from so you can more selectively choose your data type based on it’s application. If I have a table which will hold reference data, let’s say AccountStatus – you might discover during the requirements gathering that there are only 5 or 6 statuses available, TINYINT would be sufficient for this ID field.

    If you have a system liable to hold huge volumes of data you would probably assign BIGINT in the design phase, not INT.

    Given the performance advantages in joining etc I would go with a variation on INT dependant on exactly what the requirements are for the table in question.

    Like

  12. Many have already mentioned GUID will have space and performance issues when joined on the Key.

    Another interesting point is that the clustered index on GUID will also affect all the Non Clustered Indexes on the table by unnecessarily making the length of the Key more than normal.

    Like

  13. Not all TABLEs need an Id. For the ones that do, decide if it needs uniqueness on the server or not. In most cases it isn;t needed.

    Don’t let possible performance benefits change design before you know it actually is an issue.

    Like

  14. Are you excluding BIGINT from this discussion? The range of an integer identity column can be doubled by seeding it at the maximum negative range. This gives BIGINT a range of 18,446,744,073,709,500,000 rows, large enough for most applications. A BIGINT takes 8 bytes of storage while a GUID takes 16 bytes. This provides you a 50% increase in memory devoted to indexes which in turn reduces IO necessary to support index lookups.
    Negative values negate your second advantage because large negative numbers are quite impossible for us humans to deal with.
    A corresponting range for INT is 4,294,967,294 rows stored in 4 bytes giving 75% more index memory space.
    And for SMALLIT the range is 65,534 rows stored in 2 bytes and a whole lot more index values in memory.

    Like

  15. GUID

    Advantages

    – Consolidation – You have Customer Table in 5 different Databases and you want to make a Datawarehouse – no problem – the records can keep their keys. With INTs, you are going to need new keys which is going to get messy with your FKs

    – Easier to Generate Multiples in Multi-User environment – If you need to insert multiples items at once allong with Child Records of those items, very easy. With INTs you may need to reserve the keys or need to lock do prevent 2 users attempting to use the same PK

    Disadvantages

    – Along with being harder for Users to understand if they are shown – Harder for DBAs to work with even when they are not shown to Users

    – No gurantee of uniqueness (this one may be contraversial) – Although the chances of duplication are very very low, I wouldn’t feel comfortable if a Hospital or Bank was using them, would you???

    Like

    • Guids are guaranteed unique because they are generated from server specific characteristics such as MAC address, server time, disk serial number, randomization etc

      As the MAC address is unique, no 2 servers will create the same GUID and as no 2 GUIDS can be created in the same clock tick, No 2 guids can be created on the same server. Even if you set the internal clock back, the added randomization, coupled with the clock tick, a disk id etc, the chances of creating the exact same id in the same clock tick is effectively impossible.

      As for their format, they are nothing more than a collection of 8 & 16 byte ints, as if it were a 16 byte BigInt with groups of its bits populated from different sources.

      Like

  16. INT:

    Advantages:

    – You can remember the numbers, are faster to type, faster to understad, faster for joins, clean code.
    – Them give an idea of order of insertion that is important to get a fast idea if a record is new or old
    – For a customer you can use the INT key as the customer code, with GUI mmmm not a good idea to tell a custer you are the customer {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx} :P
    – Like others said you can use BIGINT that is big enough for all purpouse I can’t imagine an scenario that it dont work.

    We hate long numbers like GUID and we never used them.

    Che

    Like

    • * Rememeber
      1.879.658.256

      Ok can you remember this

      *Order
      1.879.658.256
      1.879.558.257

      Can you quickly access the order of this numbers? By the way should you use a time stamp/Date Time for that purpose

      *Exposing to customers
      you are customer 1.879.658.256, is that good for him? how about you customer Marcos(marcos@hotmail.com) isn’t that better and personal.

      So does big int solve all your problems, guess not.

      Like

    • INT primary keys (especially identity seeds) being human readable are also a disadvantage as queries that should be date based, often use the identity seed based on the assumption that a lower ID was created earlier, which may not be true. Also, presenting a primary key as a reference code to a user, may have negative implications when master table references need to be reordered, which would require changing the primary key of the master and updating all tables where the key was a foreign key. It’s much better to have a key where a developer can not make this mistake and is required to add a reference column and a unique index to the table, that would only require the master table be updated, should such a scenario occur.

      Joins are no easier using ints

      As for insertion order, you can create sequential GUIDS

      See above about why using the primary key as a Customer code is a terrible idea.

      Here’s one real question you should ask when deciding on GUIDS: Will I want to transfer the data in this record, back and forth to other systems? If so, GUIDS are great because you can insert the row into the other system without changing the key or worrying about sending a BigInt to a system with INT as the PK. You don’t even need to see if the other system has a duplicate key when inserting because unlike identity seeds, GUIDs are unique across all tables and systems.

      The time I would not consider a GUID is when I have local lookup tables that are never intended to be exported.

      Like

  17. I believe you should use both:

    Use a BIGINT as the primary key – clustered etc, for join’s, fast retrievals etc. i.e. this is optimised for retrieval and user convenience.

    The have a GUID as your unique marker for the record, this is what you use when porting data across servers/applications etc. This would never be shown to users, but is stamped to all records.

    This gives you the best of both worlds, you avoid the fragmentation and performance hit of using GUIDs as a primary key, but retain its benefit when your moving data / sharing it between systems since you can independently identify the record.

    Some may argue that the extra storage cost is excessive, but I’d say spending a few dollars more on your SAN is cheaper in the long run than trying to do two different things with the same data column.

    Like

  18. I’ve had several apps migrate/merge from local/regional usage to our worldwide enterprise level, GUIDs saved us.

    Sure, if you have a few tables that are migrating, INT or BIGINT aren’t much to handle in the migration routines you have to write. But start merging huge customer and law enforcement databases – it’s best not to have to manage that and just use GUIDs from the outset.

    Like

    • Yes, thats the real reason you should use GUIDs, you may be able to handle with begint, by portioning the numbers, so branch 1 takes number from 1 to 2 billion, branch 2 goes from 2 billion and 1 to 4 billion, and so on. But then branch 1 run out of numbers, then what you do? Take the next available series, and now branch 1 has 1 to 2 billion and 6 to 8 billion, that works, but can get quite confusing and hard to manage.
      If you that not real case, just remember the very way you got to this post… An IP address, have you ever though about it. 000.000.000.000 to 255.255.255.255 that a very large amount of number to pick from, but we are running out of them, and guess what IPv6 is a GUID, why?, one would wonder. Why?

      Like

  19. GUIDs have a sordid past, and I think every DBA approaches GUIDs with trepidation. I have only seen one situation where a GUID was appropriate – a large data warehouse where data had to be stored on multiple tables and where ‘absolute’ uniqueness was a requirement. Almost every time a developer or data modeler approaches me with a table using GUIDs a few minutes of conversation results in use of INT or BIGINT.

    The uniquness of GUID’s seems to be a popular topic. It seems very difficult (and nearly statistically impossible) to duplicate a GUID unless you are trying very hard to do so. Since there are no large scale studies proving that GUIDs are bullet-proof, I agree with Denis the Theif would never rely on them in a Bank-like setting with ridiculously high transaction rates.

    One serious disadvantages of GUIDs comes from privacy issues from V1 GUIDs (where the MAC address was used as a base in creating the GUID). Another is that current-generation GUIDs can be somewhat predicted which makes them unsuitable for security uses and the like.

    Like

  20. As far as I remeber CPUs can utilize paired registers, so now we have 64-bit CPUs which can handle 128bit values easily, without performance hits. RAM is cheap as well as disk space.
    We are going to interconnect many databases and analyze records from different sources, replication and dispersed databases are more and more popular.
    So, why should we reflect if the PK contains 64 or 128 bits? How many extra columns your tables have, which occupy more unused space than those extra 64 bits?

    From my point of view, the only important factor is the data processing speed. This is something we must consider as our databases grow, grow and grow and will never stop to do so.

    Have anyone compared GUID:INT indexes & joins efficiency?
    Does DB engines checks for duplicates in GUID PK rows?

    Like

    • Comparing values that are larger than a CPU’s register is always tricky and has to be performed inside full fences as it is a non-atomic operation.
      Assume a 64bit variable on a 32bit system. Comparing it is a 2 step process that can be suspended in-between during regular thread preemption. So thread A is comparing 64bit variable to some other value and is suspended in the middle ater the first half (as only 32 bits fit into CPU’s register at once) while some other thread that becomes alive after the first one gets asleep updates the same variable’s value resulting in inconsistent comparison. This must not happen so a full fence is deployed which introduces performance hit somewhere around 10 – 40ns on 2010-era desktop. The same happens with GUIDs unless 128bit computers and servers become standard.

      Like

  21. You don’t have to use one or the other exclusively as some people seem to think.

    It is ok to use your PK as an INT and then put a GUID for accountGuid to use in data warehouses. The important thing is to optimize for the use you have in your local database. If you need to get that data across a warehouse then you use the accountGuid field. And you don’t fragment your PK index by using a GUID.

    Of course there is also the sequential GUID that SQL Server uses for Sync Framework if you want to ensure they are always in order, but that just seems like an even worse idea to me.

    Like

  22. GUIDs are invaluable in several instances:

    1. Replication. You have a choice. You can either just use it as your PK or you can have PK AND a GUID and waste even more space. Might as well just get it over with.
    2. Multi-linking. I.e. ItemID which is referencing multiple tables such as in accounting. Yes, you can do it with ints, but good luck with that one. I’ve seen the results of ints in these cases, and man what a disaster. And Replication is even worse.
    3. Datawarehousing as stated before. Really, is anyone not concerned about the day that they need sharding?
    4. Sync from multiple sources: Imagine if Outlook stored a GUID that we could use and then query for all changes since the last record? Or IMAP? Or ActiveSync? Or any other source where you need to get last changes on records by ID. GUID is the only way to go, otherwise you end up with the mess that is Outlook’s communication with Exchange server.

    The downsides are four:

    1. Take up more room. But as Harry said, if you’re worried about space, then bigint is your way to go. That’s 64 bits, instead of 128 for a GUID. Double the space for an guid, who cares when drives are so large? None of our clients are even over the 20 gig mark and that’s with email attachments and documents as filestreams included in a backup. IT’S NOT A BIG DEAL. (i.e. I just did a filestream conversion for a client with SQL Server 2008 R2 and he had over 50 million emails in his database for his company. No one ever deletes any email. They have it set in security not to allow it. That’s 10 years of emails in our system, and it still flies for every one of them on the network, with replication and automatic backup happening. The indexes total < 1 gig total in their database, compared to 19 gigs of real data. I'll take that any day.
    2. Join speed: I don't see this one at all. We tried it just to see, and we're talking ms on hundreds of thousands of records… at least with SQL Server 2008.
    3. Clustered Keys killing speed: This has a solution. Use a sequential ID. SQL Server 2008 supports it natively if badly, and you can write your own routine that does even better.
    4. Possible collision: This is made worse when using a sequential guid because you're taking away part of the randomness. But we're still talking, when using a robust routine based on date/time of having to insert more than 100,000 million rows in a second to have a statistically plausible chance of a collision. And really, if your database engine doesn't have the ability to handle these errors, then you're doing something wrong. We wrote a wrapper for all of our calls to handle for this possibility. It's never been called, because it would tell us about it, but it's there, and no database worth its salt will allow you to enter duplicate PKs so you don't have to worry about corruption, just error which is easy to handle.

    Being hard to read… come on! If you're that bad with SQL that this is an issue, you need to be in another line of work.

    twk: No it won't let you insert dupes if you have your PK set correct. And as I said, ms on hundreds of thousands of rows, and if you're querying that much at once, you're probably doing it wrong anyhow.

    Wil: GUIDs in SQL Server do not use the network mac only. You cannot predict them. MS paid a lot of money to test if there was any possibility of it, and no one has ever been able to create a prediction algorithm.

    There really is no excuse not to use GUIDs and in today's world of data interop, you'd be a really bad citizen not to expose a GUID for data sharing. So the question is do you want to use 192 bytes or 128? (Bigint + Guid or just Guid). I prefer 128, and I'll put the speed of our database design against anyone's.

    Like

  23. Great Comments everybody. I will compile a blog post with all the notes and will publish it soon. This are great thoughts. I will link to all of you who have given good comments.

    Excellent. Keep your ideas flowing it is going to help someone who needs help.

    Kind Regards,
    Pinal

    Like

    • Hi james ,
      Really amazing explanation i m up with you use only guid thats the better way to perform n savin DB from Warehousing disasters….i m up wid you

      Kamesh shah

      Like

  24. Someone mentioned fragmentation. This brings a question: Is it possible to GUID to get fragmented in multiple pages?

    If it is then yes, there could be problem with fragmentation if the pages are spread all over the hard drive. Otherwise I don’t see it as a problem.

    MSDN “The SQL Server database page size is 8 KB.”

    Event with header fields, 16 bytes of GUID fits quite nicely there. And it’s not a big deal for hard drive and I/O to read 16 consecutive bytes from disk. It might even read those bytes to the disk buffer anyway.

    twk: Thank you for the info about CPU. It’s been a while since I’ve dealth with such low level stuff that I don’t recall it anymore.

    But to be pedantic, which I very much like to be … ;)

    Even if CPU can compare two registers at once, you have to load two values to registers instead of (with INT or BIGINT) just one. But it’s just one CPU cycle more and as James Hancock said, it’s not a big deal.

    Like

  25. We have used both together. Means table have int id as primary key but unique is for GUID. Now the question must be why both at the same time. So ans is that we use the GUID for integration between three systems. Means we publish GUID trough webservice while communicating with other system. As this is unique so we can use.

    I hope i have clear.

    Like

  26. Fragmentation: Yes, unless you use sequentialid or create your own incremental implimetnation of GUIDs you will get fragmentation in your indexes badly because it’s random. Also clustering won’t work because its’ random so it can’t do an incremental lookup.

    But this is easy to fix, just not obvious because MS doesn’t really provide guidance on this.

    Like

  27. So far the biggest concerns i see for Guid are storage and performance. Does any one actually have the stats that can prove that Guid is performance intensive?

    I am an app developer and i find the Guid invaluable for my apps which keep changing continuously(including the DB) and keep getting updated/altered.

    Like

  28. Guid has one great advantage. I can create a unique Id without making a call to the database. This comes in handy when I am inserting one parent with several children.

    Like

  29. “Guid has one great advantage. I can create a unique Id without making a call to the database. This comes in handy when I am inserting one parent with several children.”

    …And as long as you use the same function in your language to generate the sequential GUID, you are more than making up for any speed difference in the process…

    Like

  30. int/guid choice depend from data in database. For small/medium very distributed database GUID will be best choice. For large tables int/bigint is preferred.

    Unfortunately, big clustered PK included in all other indexes and 16-bytes record pointer will included in all our indexes on table if we will use GUID type as PK.

    Now, I use bigint type in distributed database with large ranges and don’t have any troubles with PK generation.

    Like

  31. I’ve got a product in my datacenter that makes use of guids… Sometimes multiple per row and they’re indexed which leaves us with unbearable performance issues it seems when loading new data.

    Like

  32. Wow! What a horrible misconception a lot of people have about GUID’s and, from what I’ve seen on this thread so far, only the poster known as “Dennis the Thief” picked up on it.

    For all of those relying on GUID’s to be UNIQUE… especially you poor buggers with extremely massive data warehouses… surprise, surprise, surprise…

    The GUID’s in SQL Server are “Type 4″ GUIDs and you can verify this by looking at the 15th character of any SQL Server GUID you can put your hands on. You should certainly GOOGLE for what being a “Type 4″ GUID means but let me save you a little trouble and repeat what Dennis stated in his post….

    Although it is highly unlikely that any given GUID will ever be repeated on a single server, GUIDS ARE **NOT**, I repeat, **NOT** GUARANTEED TO BE UNIQUE!!!

    GUID’s are nothing more than random numbers with a huge domain.

    If something in one of your tables has to be guaranteed to be absolutely unique, THEN DON’T USE GUIDs BECAUSE THEY ARE NOT UNIQUE.

    –Jeff Moden

    Like

  33. And they’re globally unique, not just to the computer. Sequential ids are only unique on that computer, but again, you shouldn’t be using this routine because you have to do a get to get the last inserted record as well on every call. You should be creating a function that calculates them the same way both server side (as a default) and as a function in your programming language so that you never need to do a second round trip.

    Like

  34. @James Hancock

    No James… they are not “globally unique”. The term “Globally Unique” is a misnomer and it is an error for Books Online to say “A GUID is a unique binary number”. Type 4 GUIDs are nothing more than random numbers with a huge domain and Microsoft currently uses TYPE 4 GUIDs.

    Please read the MSDN article at the following URL…

    http://msdn.microsoft.com/en-us/library/system.guid.aspx

    … where it states in the REMARKS section, and I quote…

    “Such an identifier has a very low probability of being duplicated.”

    Microsoft uses “TYPE 4″ Guids which no longer contain the MAC address of the computer like “TYPE 1″ Guids did. To understand the differences between TYPE 1 and TYPE 4 Guids, please see the following URL…

    http://en.wikipedia.org/wiki/Universally_Unique_Identifier

    The bottom line is that GUID’s currently being used in SQL Server are NOT guaranteed to be globally unique. Although the likelyhood of duplication is very low because of the very large domain, Type 4 GUIDs (which is what Microsoft currently uses) aren’t even guaranteed to be unique on a single machine.

    Like

    • “Such an identifier has a very low probability of being duplicated.”

      Do you have any idea how low the probability is?

      “equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.”

      http://en.wikipedia.org/wiki/Universally_Unique_Identifier

      You have a much greater chance of a stray sub-atomic particle flipping bits in your RAM. If you’re not programming around that potential problem, then for all intents and purposes GUIDs (even type 4) are unique.

      Like

  35. @Rob Boek,

    True enough but because TYPE 4 GUIDs are NOT guaranteed to be unique and DO stand a chance, no matter how remote it may be, of being duplicated, you run into the same problem as if you used an IDENTITY column… you need some central repository that all global entities in your enterprise have to check and see if the number has ever been duplicated in your system. They cannot operate asynchronously with a **guarantee** of unique TYPE 4 GUIDS… and Microsoft currently uses TYPE 4 GUIDs (at least in 2k5, they do).

    Like

    • The chance is so mind-bogglingly tiny that there’s no point in being worried about it, you can treat them as unique. You would need to generate a billion GUIDs per second for 100 years to have a 50% chance of creating a duplicate.

      That’s no worse than the chance of a neutrino from space hitting your hard drive platter and causing an IDENTITY column to create a duplicate.

      Like

  36. Jeff:

    From your own link: A GUID is a 128-bit integer (16 bytes) that can be used across all computers and networks wherever a unique identifier is required. Such an identifier has a very low probability of being duplicated.

    notice across all computers and networks.

    Very low probability means that you have a better chance of being hit by lightening 1000 feet under water than having a duplicate.

    As for needing a check. You don’t need to check. You need to handle a primary key constraint error and regen a new one when you have that error fired at you, and then save it again.

    Since this will happen about once in the next 1000 years this isn’t a high cost and you don’t have to round trip.

    Same is true for synchronization. Which is why MS uses them in their own replication. And by making them the primary key, it makes it even easier to fail fast instead of worrying about dups.

    Like

  37. James:

    I also said that the probablity was remote. The problem is, it’s still a possibility and with that, the TYPE 4 GUIDs are not what they say they are. They are NOT globally unique and regardless of how remote it may be, duplicates can occur.

    Yes, I agree that if everyone in your enterprise dumps their data to the same table, then a PK or UNIQUE index will catch it. However, there are a lot of folks who use GUIDs as the PK for many or all of their tables and duplication could occur with the dupes in separate tables where there’s no chance of a PK or UNIQUE index checking on it. If your systems are built under the assumption that every GUID is guaranteed to be unique across all tables and all servers for your enterprise across the world and you “hit the lottery” with one of those extremely rare dupes and you don’t know that GUIDs can be duplicated, I’m thinking that it’ll take you more than a couple of minutes trying to figure out what the heck happened.

    The listed advantage of GUIDs (as they apply to SQL Server) being unique across the world is false.

    Like

  38. GUIDS are not strings and JOIN at the exact same performance level as INTS on anything better than an 1970 8088 chipset.

    Jeff Moden declared that GUIDS are not unique, Well until someone has identified a collision where the GUIDS where system derived and not manually set, I would disagree. There are 6 x 10 ^ 28th Stars in the sky.
    There 1 x 10 ^ 128th possbile GUIDS

    Identifying a Column as a Primary Key adds the unique Constraint. I have collapsed 10 Terabytes of disc space and failed to create a duplicate GUID.

    Like

  39. Maybe a GUID non-clustered primary key in a distributed environment, but I’d need a convincing case made. Really only a natural key should be the PK.
    As for clustering on a GUID, that is totally wrong. The disadventages are: page splits, bloated rowid, fragmented indexes, and poorly performing bookmark lookups.
    The advantages would be what?

    Like

  40. Randy Pitkin wrote:
    >>Jeff Moden declared that GUIDS are not unique, Well until someone has identified a collision where the GUIDS where system derived and not manually set, I would disagree. There are 6 x 10 ^ 28th Stars in the sky.
    There 1 x 10 ^ 128th possbile GUIDS

    True enough. To make it even more awsome sounding, if a mile were just 1 GUID, then every possible GUID lined up in a row would be wider than 17 Quadrillion Milky Way Galaxies. Further, since the GUIDs on a single machine are pseudo random (sequence implied), that makes them unlikely to ever repeat until they’ve all been used.

    However, the fact that they are pseudo random also means that, in theory, two machines could generate the same GUID. It’s good that you use UNIQUE indexes on your GUIDs ;-) especially if you use sequential GUIDs.

    The bottom line is that Microsoft has stated that the GUIDs aren’t unique and that they could repeat and you shouldn’t assume a “guarantee” of uniqueness. I’ll see if I can find the URL where I saw that later on tonight.

    Shifting gears, I agree with what Eric Russell stated above. Unless you happen to have an affinity for page splits, you should never use a GUID as a clustered index.

    Like

  41. I have created a table with 2 columns (i.e col1 ,col2)

    and inserted 2 rows with null values in second column (i.e Col2)

    after that i fired below qry

    select disinct col2 from

    as a result above query returned only single row instead of returning 2 rows,how can it be possible as the two nulls will not be same.

    —–Sql Scripts ——————

    create table #tmd(col1 int,col2 int)

    insert into #tmd(col1)values(1)

    insert into #tmd(col1)values(2)

    select distinct col2 from #tmd

    Like

  42. I’m interested to know if DB’s like SQL Server have been optimised for indexing and joins on GUID’s as opposed to other alphanumeric ID’s? In other words, does GUID processing behave like any other alphanumerics, or are they a special case? And if so, is this only the case for SQL Server?

    Like

  43. @Randy Pitkin: your numbers are wrong. Being 128 bits in size, it’s a maximum of 1*2^128, not 1*10^128. Then, since the GUID version requires some bits (6 I think) to be static, this reduces the range to 2^122, which is still a fair number of possibilities with over 5*10^36 different GUIDs.

    Like

  44. I would say, the best candidate for a clustered index is the INT and the worst is the GUID;
    First because of sorting, it is much easier to sort and INT than a string
    Second indexes are stored in a B Tree structure, which makes the size of the index not only affect the storage, but also affect the length of the B Tree and accordingly affect the performance (the shorter your tree is the faster you reach your node).
    It is extremely dangerous to have a bad performance clustered index, a bad index is worst than no index , because the cost of maintaining a bad index , and the cost of its “updates” and “inserts” will certainly overlook the little or no performance gain on the “select” and because all the non clustered indexes will be build using the clustered index that means they will have bad performance as well.
    However if I needed to have the GUID as the primary key, I would do that but I would create another unique key for the clustered index (the default in MSSQL is to have the clustered index as the primary key but it is not necessary) , why ? because eventually most of the search on my table will not be by ID but rather by another logical (business) Key , meaning I will depend on non clustered indexes to gain high search performance , and my non clustered indexes will be attached to the clustered index , which needs to be small, unique and incremental to gain the best performance.

    Like

  45. It’s worth thinking about the sort of algorithm that creats Guids.

    A baseline of static machine specific data is assembled at machine start-up: machine name, BIOS manufacturer and version, MAC address, CPUID, processor string, mainboard manufacturer, windows version…. you name it.

    At OS logon, append some more data such as user name, domain membership, account creation and last login dates.

    At the moment of creating a Guid, append some extra time-specific information: The current date and time. The number of milliseconds elapsed since the last reboot. The number of keystrokes appended since the last reboot. That sort of thing.

    Now take this entire data set and perform something like a 128-bit MD5 hashing algorithm upon it. Voila! you have created a Guid. The computational expense of all this is why it takes an order of magnitude longer to create a new Guid than a new integer (but we’re still microseconds on a modern PC).

    The likelihood of duplicate Guid creation is far more dependent upon the hashing algorithm used. Different hashes have different probabilities of collision. MD5 is pretty good, but it can happen in theory. In the parlance, it is “computationally infeasible”.

    Consequently: Guids are fine as a surrogate primary key ***provided you then remove the clustered index created by default on this key***. Being a primary key means that the DBMS will enforce uniqueness, so don’t lose sleep over the astronomically minute chance of duplicate key generation.

    Of course, you should only be doing this in the situation where no suitable natural key exists, or maybe if a natural key is so wide (e.g. a big compound key) that it would occupy more than ~900 bytes and is thus unusable on the Sybase architecture. And if an int or a bigint is inappropriate (replication issues being one possibility).

    Like

  46. All relational database engines are optimized to work with indexes on columns whose value increments after insertions (such as with in integer) – this is knows as a ‘natural progression.’

    Relation database engines are NOT optimized to work with indexes on columns whose value is assigned a random value upon insertion (such as with a GUID, or any other alphanumeric string) – this is known as an ‘unnatural progression’.

    If you are implementing a solution that uses a relational database engine to store your data, whenever you are choosing a data type for either an index column or for a table’s primary key ALWAYS choose an integer (of the appropriate size).

    Choosing a GUID or some other alphanumeric data type for an index column or for a primary key is now and has always been a rookie’s mistake.

    Like

  47. Chuck:

    If you’d bothered to read the rest of the thread what you just wrote was already covered.

    Yes, random GUIDs are bad. However sequential ones are not, and are only double the side of int64. Thus there is very little negative effect in using them. Either use newsequentialid or create your own function that does the same and you’re golden.

    Like

  48. Hi,
    Recently, I was working on GUID. And GUID seems a big horror for me. That’s becuase I wasnt able to convert string value to GUID. Everytime stucked with the same error Conversion failed while converting GUID to string. So what’s the solution now?

    Like

  49. A couple of comments. First, a googling of the performance hit of sequencial GUID in a DB turns up a typical answer in the 5-10% range. In other works operations that take 1 unit of type with INT take approximately 1.05 to 1.1 unit of time with GUIDs. This is obvisouly significant, but not necessarily a show stopper. You should construct your own benchmark and find out what the effect is for your own data. The same is true for the additional storage. Yes consuming four time the storage for the idenitfier is worth noting. On a table with several (I think the number was 8) keys this is even more significant. On the other hand if the table has a few string columns (i.e. nvarchar(150)), then the storage of the index has become a fairly small percentage of the overall table size.
    There is another advantage of GUIDs that has not been mentioned here. They solve the cross table refence bug. In this bug a peice of code gets an ID from table A and then erroneously uses it as a foriegn key in table B. The bug is that the foriegn key column in Table B references Table C. So let’s say that there is an invoice and an invoice detail table. There is also a customer table. It is possible to write code that incorrectly uses the customer id value in the invoice details table as a reference to the invoice that detail row should be associated with. Using sequenctial integers for ids this will almost always work. Using GUID this will almost always fail.

    Like

  50. I am using this article to create my benchmarks:

    http://community.ugiss.org/blogs/dmauri/archive/2008/59/23/utilizzo-di-guid-pro-e-contro.aspx

    You can view the English translation here:

    http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fcommunity.ugiss.org%2Fblogs%2Fdmauri%2Farchive%2F2008%2F59%2F23%2Futilizzo-di-guid-pro-e-contro.aspx&sl=it&tl=en&swap=1

    I changed the GUID table to use NEWSEQUENCIALID rather than NEWID.

    I am seeing similar difference in the insertions (note that Sprecher is a little faster than the machine they used. )
    INT 12ms
    GUID 23ms
    Still the ratio is about 2 to 1 on many sequential inserts.

    The clustering is less problematic by far:

    table_name index_id index_type_desc index_depth avg_fragmentation_in_percent page_count
    TestGuid 1 CLUSTERED INDEX 2 14.8148148148148 27
    TestGuid 2 NONCLUSTERED INDEX 2 33.3333333333333 9
    TestInt 1 CLUSTERED INDEX 2 9.52380952380952 21
    TestInt 2 NONCLUSTERED INDEX 2 33.3333333333333 3

    The 300% increase is troubling, but not nearly so troubling as the 1000% increase of the original article.

    Similarly the space usage is not as dire:
    index_name type_desc space_used_in_kb
    PK__TestGuid__3213E83F59063A47 CLUSTERED 216.0
    PK__TestInt__3213E83F5DCAEF64 CLUSTERED 168.0
    uix__TestInt__fk NONCLUSTERED 24.0
    uix__TestGuid__fk NONCLUSTERED 72.0

    While he was getting 2 to 1 cluster and 4 to 1 not clustered, my numbers are more like plus 30% and 3 to 1. So space is an issue, but performance is likely not.

    In addition I created a set of single row selects from both tables. The set included every row value (I created 2048 rows in both tables). The results are:
    INT 29ms
    GUID 52ms
    As we see the performance is not quite 2 to one, but remember that this represents 2048 individual selects. The performance difference per select would be difficult to measure.

    The result is that though there is a difference it is not particularly dire for my use case. Perhaps for yours it is. As with most things it depends on what you plan to do with it.

    Pat O

    Like

  51. convert GUID to int:

    Sure you can, but it’s better with long to get the most value because a guid uses 16 bytes:

    byte[] gb = Guid.NewGuid().ToByteArray();

    int i = BitConverter.ToInt32(gb,0);

    long l = BitConverter.ToInt64(gb,0);

    .

    Like

  52. My Thoughts – GUID has its place. I prefer it used for Authentication of User only. Outside of this I like to have the DB provide INT / BIGINT sequential ID’s for each new entry. I work with web apps that theoretically will have lots of users and lots of look ups over time. Indexing on the transactions is very important. As for the users, they need to be authenticated to do anything and then again authorized at each level. GUID gives flexibility here for he user, but I fear using it outside of the “User Authentication” table. The nice thing about Integer based types in the database is that as they are created the numbers can increase by one. This means that if I have one million rows of transactions I do not need to return them all. Chances are the lower numbers are old and outdated. This also helps with faster indexing and Joins.

    Authentication – Consider GUID
    Other Tables – Lean toward Integer (TINYINT / SMALLINT / INT / BIGINT as needed).

    Like

  53. Is re-indexing a problem anymore? It would seem to me that INT would have a disadvantage should an event ever occur that caused re-indexing to be necessary.

    Like

  54. Using integer ranges instead of GUIDs:

    Note this only works if you have control over every SQL Server to install ranges!!

    An alternative to GUIDs is to use int/bigint with ranges on each server in a distrobuted environment. A word of warning here though, identities do not work, they always take max(id)+1, you have to write a sequence solution or use sequences in SQL Server 2012. I tried using identities and the first time you sych the data between 2 servers the identies start to clash, or you range constraints raise errors.

    The solution I used on SQL Server 2008 was as follows:

    1. create a table to hold your sequences {name, currentValue, MaxValue, Increment}

    For each PK configure a named sequence (note current value should be initialised to the min of the range or if data exists the table max +1 within the range) The name of the sequence is just the pk name, I found that the easiest way and most meaningful.

    I then create a stored procedure with the following signature

    GetNextSequenceValue @sequenceName varchar(128), @value int OUT

    In the SP I simply do a lock for update select to get the currect value of the named sequence + 1, check if it is in range, update the sequnce table and set the out value.

    I also created a scalar function

    GetCurrentSequenceValue( @sequenceName varchar(128))

    This simply gives you the current value of a named sequence

    I then had to set the default value of the int PK columns in the table to 0 as we will never use a value of 0, it is just a flag to the trigger to get get the next sequence value.

    Then I added an instead of insert trigger to the table, where if the PK value is 0 I get the next sequence value using the SP and then do the real insert. This means that if the PK is not defined in the initial insert, the sequence value is used, if it is defined, the defined value is used, this allows sync to write the PK ids.

    Note: no range constraints can be used on the tables, this will cause sync from other ranges to fail, the SP must control the range using the definition from the sequence table.

    This solves all issues with guids and integers, except when you have no control over the sql servers are in your system, and configuring the ranges is not posible.

    I suggest ranges as follows:

    Central server: -2147483647 to -1 (use all negative values gives you the same range as a positive integer)

    Client servers:
    First Client: 1 to 99999999
    Second Client 100000001 to 199999999

    Last client 9900000001 to 2147483647

    You can always use negatives on clients if the central server makes more sense to have positive IDs.

    Like

  55. Well a GUID can give me a strong unique id in the world/cosmos for many many years to come. Statistically, it would take trillions of years to get a duplicate if you would generate 1000s per second.

    I should be able to use this uniqueness in many apps.

    Like

  56. Pingback: SQL SERVER – Weekly Series – Memory Lane – #027 | SQL Server Journey with SQL Authority

  57. Kindly let me know what would be the recommended option for primary key, given the following scenario.
    1. Table contains a UUID column and a ID column which is a decimal
    (28,0)
    2. UUID generated is stored in 8 bytes using binary columns
    3. Not a distributed environment
    4. No of rows will be maximum tens of thousands

    Is it true that modern recommendations for table design suggest usage of UUID instead of sequences for security reasons?

    Like

  58. HI friends,

    as we learned we can use unique identifier column same as identity column in sql server table and in that case the values will be globally unique i,e across tables,databases,servers,computers.
    but here i am little bit confused.
    i am agree the values will be globally unique across tables,databases since both belong to same machine but how it will be unique across servers and computers.how it interprets the unique values across different servers and different computers if both are not know each other i mean not connected each other.

    Please suggest me on the same..

    Thank you.

    Like

    • Ashish,

      The algorithm for generating GUIDs usually takes details of the computer in order to make GUIDs which won’t clash with other computers, such as the ethernet id which is supposed to be unique to each ethernet card. There is still a tiny chance that two machines could make the same GUID but the odds are so small it’s hard to comprehend. You’d need to generate 1 billion GUIDs per second for 100 years before you’d have a 50% chance of a single collision.

      Like

    • Actually there is no guarantee that two GUIDs generated on different computers will not collide (be the same).
      It’s just so goddamn unlikely, that it will propably not happen in your lifetime:
      “GUIDs generated from random numbers sometimes contain 6 fixed bits saying they are random and 122 random bits; the total number of unique such GUIDs is 2122 or 5.3×1036. This number is so large that the probability of the same number being generated randomly twice is negligible;[…]Assuming uniform probability for simplicity, the probability of one duplicate would be about 50% if every person on earth owned 600 million GUIDs.”
      128-bit integers have a really big amount of numbers (over 4 billions squared).

      Like

  59. HI Friends,

    Many Thanks for valuable time as well as suggestion.

    But I would like to say something on the same,

    As Adam Said about Ethernet ID of machine, can we say it physical address of machine?
    As we all know every machine have its own physical address and it is unique for every machine so whenever we will be trying to create GUID’S on machine it will be different or unique across tables, machines, servers……………………..

    Because for creating GUID’s values, SQL server Engine will be taking reference of physical address of machine.

    We can see physical address by CMD prompt by command

    C:\ipconfig /all

    Also one more thing I would like to say.
    Every computer have MAC(media access control) address ,it is a number of 12 characters(present in 6 pair of two characters each) used to identify NIC(network interface card) installed on machine.
    I would like to say the GUIDs would be unique across machines or servers only when , NIC card and there installation is on source machines. otherwise we can’t say the GUIDs will be unique.
    And the MAC address identify as physical address of machine.

    Like

    • The MAC adress, aka physical adress, aka Ethernet Adress (http://en.wikipedia.org/wiki/MAC_address) should be unique between different cards. Note that I said should, not is.
      Every manufacturer get’s thier own “Mac Ranges” and should ask for new ranges when it exhausted one. That did not prevent some very stupid executive from saying “Gee, why not use the same MAC adress for cards sold in the Asia, US and Europ markets. They’ll never mix anyway.”
      Plus with virtual machines and virtual network devices you can actually choose a MAC adress to be used. Or rather you have to generate a different, random one for each device.

      MAC adress is not a reliable way of identifiying hardware or a computer and never was supposed too. At tops it can reliably identify a Network Device in any physical/virtual network.

      Like

      • Hi Friends,

        Thanks for the clarification but I am little bit confuse here.
        How we can say that the production company can produce 2 NIC(network interface card) with same MAC number.
        if they can then machine will refuse the connection because of collision whenever we will be trying to connect both in same network.

        It should be unique across the world and it is.

        So I don’t think we have 2 machine with same MAC number.

        Please correct me if I am wrong.

        Like

        • “How we can say that the production company can produce 2 NIC(network interface card) with same MAC number.”
          The same way I can write the same number twice. 5. 5.

          Yes, they should not have. Indeed everybody with half a brain can figure that out. And afaik they stopped that by now from complaints.
          And they were at least smart enough to send NIC with same MAC adresses into different markets.
          Furthermore, most NIC are not even conencted to the same network. The original plan was that every computer has a public IP adress (no routers) and IPv6 goes into that direction again. But right now most NIC are connected to a router. The space in wich MAC adresses can collide ends with a router. Wich MAC adresses computer in the internet have is irrelevant for the local network and vice versa. (Same goes for local wired network, local wireless network and internet – all three distinct layer 2 Networks).

          But the MAC adress can still be changed, overridden, and could be duplicated. It is not a reliable way to identify a computer. At least not past a local network.

          Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s