SQL SERVER – A Common Design Problem – Should the Primary Key Always be a Clustered Index

November 23, 2009

In SQL Server, whenever we create any key, a Primary Key automatically creates clustered index on the same. I like this feature and I use this feature every now and then.

The question is does the change of any column as Primary Key should also create a Clustered Index? Moreover, is there any case, where one would not do the same?

One of the recent conversations I had with one SQL Expert is with regard to the SSN number. The discussion was that SSN numbers are always unique and never repeated and hence are the best candidates for primary key. Additionally SSN numbers contains dashes (-), which make the datatype of the SSN numbers as String (VARCHAR or NVARCHAR). A clustered index on an integer usually performs better over a clustered index on an integer and makes the DBA to choose Primary Key of datatype Integer. At one point in our conversation, we discussed that if SSN number should be a Unique Constraint and if there should be another Identity Column as the Primary Key.

Some of the questions from our interesting discussion are as follows:

Would you have the datatype of your Primary Key as string?
Would you treat SSN as string datatype or remove the dashes and change it into an integer?
Do you have a real life example, where your primary key is not a clustered index?
What are the best practices for SSN number to store in database and obtain optimal performance?

I will post an interesting answer discussing this subject in a separate post with due credit.

Reference: Pinal Dave (https://blog.sqlauthority.com)

SQL Constraint and Keys

SQL SERVER – Remove Bookmark Key Lookup – 4 Different Ideas

SQL SERVER – Interesting Observation – TOP 100 PERCENT and ORDER BY

24 Comments. Leave new

Anil Das
November 23, 2009 8:44 am
In the USA, SSN’s are composed of three parts. The first three digits are the Area Number, the next two digits are the Group Number, and the final four digits are the Serial Number. ) I would store a SSN as a set of smallint, tinyint, smallint columns. Doing so would provide a natural constraint to numbers which a varchar would lack. (I would add further declarative constraints as BETWEEN 0 AND 999 and BETWEEN 0 AND 99 and BETWEEN 0 AND 9999.) Now you have a five byte key with good contraints. If most queries are based primarily on SSN, you could make them the compound PK.
Reply
Kathleen MacMahon
November 23, 2009 5:13 pm
When the data schema for the system I am working on today was incepted 9 years ago, all of the primary keys (which were GUIDs) were assigned as clustered (by default).
As the databases grew in size, due to fragmenting indexes, it became more apparant that this was a practice that was overused in many areas. Especially in tables where there is a large amount of insert/deletes.
We have since gone back and re-evaluated some of the key tables to correct for this.
One example would be in the invoicing module, where an invoice may have 1-30,000+ invoice details, the clustered index was moved to the foriegn key invoice relationship from the invoice detail primary key. This drastically improved both load and write performance.
Reply
Jay Hunnicutt
November 23, 2009 5:17 pm
Pinal,
I would not recommend using the SSN for a primary key. SSN’s are difficult to work with partly because of there structure but more importantly because they are by nature Personal Identifying Information (PII) which must be protected. Any time SSN’s are stored they should be encrypted until such time as the information is required and even then the transmission should be encrypted. If you convert the SSN to an integer then when you require the information you will have to restore the data structure, including adding leading zeros, or store the information twice. I would give each person a unique identifier and store the SSN, encrypted, as part of the person’s demographic data along with DOB and other information that typically does not change. I work in the health caer industry and protecting PII is a constant concern.
R/S
Jay
Reply
BEL8490
November 23, 2009 6:46 pm
A clustered index on an integer usually performs better over a clustered index on an integer and …….
Please clarify (or Rectify)
Reply
Brian Tkatch
November 23, 2009 7:50 pm
@Anil Das
From the linked document “This is an archival or historical document and may not reflect current policies or procedures”
@Pinal
Should SSN be the PK? That matters. If the purpose of the TABLE is to list SSNs, such as the agency itself may do, SSN should indeed be the PK, because each number is listed only once and it is the point of the TABLE.
If, however, the purpose of the TABLE is customers at a Tax preparation business, the point of the TABLE is the User and the SSN is an attribute. Which makes it a bad candidate for the PK, as the attribute may change.
Also, SSNs are reused over time. So, if this is an SSN TABLE that by definition lists each SSN once (as opposed to an SSN TABLE that lists all editions) SSN will be unique, because the person is the attribute, and that is what changes. But, if a firm has historical information on tax preparation they may well have the same SSN for two different people if it is kept long enough.
Reply
- Paul Nelson
  April 2, 2015 1:16 am
  SSNs are never reused for any reason (excepting fraud of course). See FAQ at SSA.gov
  Reply
ramdas
November 23, 2009 7:59 pm
Hi Pinal,
This is an interesting subject, I have run into this issue during design time. Would like to know how this issue is resolved.
Thank you
Reply
Buck Woody
November 23, 2009 9:24 pm
It’s also important to note that not everyone has a SSN, and that they can in fact (and were for some time) be re-used. Add that to the privacy issue, and I would not recommend an SSN to be a PK.
Reply
SQLAJ
November 23, 2009 9:38 pm
In the US and hopefully all over the world organizations are suppose to be moving away from using SSNs as indemnification of individuals.
Look at Wikipedia
“Its primary purpose is to track individuals for taxation purposes.”
“…although rare errors occur where duplicates do exist.”
Never a good ideas to use SSNs as primary key. Use in addition to other identifiers starting with numeric, int, number fields (depending on flavor of RDBMS)
Just my $.02 worth
Reply
Malathi Mahadevan
November 23, 2009 9:40 pm
There are many problems with using SSN as primary key as Jay pointed out. It is sensitive data, and the idea that it is unique is well..how things have to be. There are duplicate SSNs for various reasons. Generally speaking any information supplied by outside sources including State agency is a poor candidate for a primary key. That said primary key and clustered index are conceptually two different things. A Primary key is a design concern, how the data is identified in a table. A clustered index is a performance concern and whatever key makes the table perform best is an ideal clustered index.
Reply
Peter
November 25, 2009 3:43 pm
From a conceptual POV I tend to view clusterd indexes as a physical order of information on disk. Even when this isn’t really exactly true the end result is a good enough approximation for reasoning about design.
If the structure of the primary key diverges from the usage pattern of the table its a safe bet you may want to *think* about making them different.
For example a queue table where you are always either adding records to the bottom or pulling from the top. Assume the tables primary key effectivly contains random values.
In this case if you make the primary key a clustered index it will fragment the index and reduce throughput.
If instead a sequential identifier was used as the clustered index there would be much less fragmentation and higher throughput. More importantly when bulk push/pulling records to and from the table the clustered index will be used directly to resolve the search range leading to less “exploring” in terms of locks and non-sequential I/O.
SQLServer resolves update operations to the clustered index so one should generally not take lightly the idea of separating the primary key from the clustered index! If your updating rows much more often than adding or removing records theres a good chance thinking about making the primary key different will not pay off.
Reply
Nakul Vachhrajani
November 25, 2009 10:38 pm
Hello, Pinal!
Here’s my two cents on the questions for the discussion:
Would you have the datatype of your Primary Key as string?
NAV: No
Would you treat SSN as string datatype or remove the dashes and change it into an integer?
NAV: I would say it depends on the use (going by your famous “It depends” clause :) ). If the purpose is simply to give a unique number, then we should convert to an integer.
Do you have a real life example, where your primary key is not a clustered index?
NAV: No.
What are the best practices for SSN number to store in database and obtain optimal performance?
NAV: I would say, when storing SSN, split it up into three parts – 3:3:4.These 3 columns can then be made a composite primary key.
Have a great rest-of-the-day!
Thanks,
Nakul.
Reply
Jack
December 2, 2009 1:26 am
SSN are not numbers, are not unique and never as a primary key!
There are SSN that have an ‘A’ as the first character. These are used as a way for alien residents to pay their taxes.
SSN are being recycled, also there many many many cases where SSN are being shared. Detecting and tracking this issue helps all clean up the issue.
SSN are not unique even to a person. You can request a new SSN under special cases like identity thief and protective services.
Please clean this article up to help SQL community to learn the right way to design databases with proper data typing!
Reply
Marko Parkkola
December 23, 2009 2:10 pm
I’m little late to the subject but wanted to point out couple of real problems in using SSN as PK.
First. In Finland there is a rule how you can distinguish man from woman when lookin at SSN. Now there’s a possibility that someone goes and changes his/her sex so the SSN changes too. So that would mean that you’d have to change PK field in DB and … well, I don’t want to even start thinking about that.
Second. SSN goes in the form of 010101-XXXY. That means that person is born January the first 1901. X is a latter or digit which forms unique combination with the birthdate. Y is a hash of it all. Now if person is born in 2001 SSN goes 010101A1234. Or if person is born in 1981 it goes 010101+1234. Because of this you can’t convert SSN into integer or you would have to figure out some kind of mapping so that 0=+, 1=1, 2=A and so on.
To put it in short: Never use SSN as PK. Or any other information that has the slightest possibility that it would change.
But I’d like to read more about nonclustered PK. I came up with the following example but I’m not sure if this is sound… I like to keep PK fields hidden as much as possible so that only the lowest level of application knows about it and never show it to the user. This is why I usually add fields like OrderNumber which is then displayed in the UI.
create table [Order]
(
id bigint not null,
OrderNumber bigint not null identity(1,1),
ReceiveDate datetime not null,
DeliveryDate datetime,
constraint PK_Order_Id primary key nonclustered (id)
)
create unique clustered index IX_Order_OrderNumber on [Order]
(
OrderNumber
)
Reply
Tom
July 15, 2010 8:27 pm
How to entered the SSN number in SQL Server 2005? it is having any special datatype.
E.g. SSN No. 111-22-3333
Reply
- Imran Mohammed
  July 16, 2010 6:21 am
  @Tom.
  Datatype would be : Varchar(11), but you can have a check constraint that says first 3 Characters should be Numbers and 4th character should be hipen and next two letters should be Numbers and then followed by a hipen and then next 4 characters should be Numbers.
  Check out Books Online, under topic Check Constraints, I believe Books Online has one of these examples.
  ~ Peace.
  Reply
Prachi
March 25, 2011 9:27 am
Hi Pinal,
I have some trouble with-
I want to make my email_Id column a unique key,So
1. Is it possible to set unique key to a column containing alphanumeric values?
2. I want the pwd column to be encrypted with some character like ‘*’ or ‘$’.So even if I open the table in SQL environment the password should be encrypted like ********
3. How to initialized an auto_increment value with some value like 4014 instead of default value 1? or any other way to generate random values?
Reply
- madhivanan
  April 1, 2011 12:05 pm
  1 yes
  2 Serach for encryption
  3 Set the start value to 4014
  ex
  create table(i int identity(4014,1)….
  Reply
Carlos
October 20, 2012 9:25 am
Hello guys. Great post. Please, let me point out that in the early 1930’s in US the SSN number was shared between couples. Some of these people are still alive and these marriages still share their SSN. That’s why the US Health Department created the HCFA Number which is the SSN with a suffix that distinguishes between husband and wife. Not that it was the primary key, but it was supposed to be unique, I you had worked with elder Healthcare systems you know what I’m talking about,
Reply
Rikhil
February 27, 2013 10:45 am
Hello sir ,
My opinion regarding the questions are…..
Q: Would you have the datatype of your Primary Key as string?
A:No. SQL server create clustured index by default on primary key if we not specify externally. and data in table physically stored based on clustured index so its not good to have primary key on ssn (string data type) because record in table is sort according to primary key (clustured Index).
Q:Would you treat SSN as string datatype or remove the dashes and change it into an integer?
A: Its also not a good idea to do this. we should specify non clustured unique index on it. Its enough no need to convert it into integer.
Q: Do you have a real life example, where your primary key is not a clustered index?
A:no. we specify our primary key as identity(1,1) and it create clustured index by default.
Q: What are the best practices for SSN number to store in database and obtain optimal performance?
A: for SSN it should be non clustured unique Index.
Reply
Ken
March 31, 2013 12:07 am
Very interesting blog as always. I don’t see a problem using a SSN as a PK for United States SSNs, but casting them as int or tinyint might cause a problem. Valid US SSNs can begin with a zero. While this would not present a problem in the table itself, it may present problems in ETL operations with other tables where leading zeros (or dashes between the 3rd and 4th and 5th and 6th character) are preserved. Best to specify SSN as a non clustered unique index in my opinion.
Reply
Rajesh
January 30, 2019 11:53 am
Q:when i have a new table with no constraints except a column A with no null constraint, i created a clustered index on a column B and then make the column A as primary key. Here the A is a primary and non clustered by default.
But now my table sorting is based on primary key column A but not the clustered indexed column B why?
Reply