This article has been submitted by Marko Parkkola, Data systems designer at Saarionen Oy, Finland. Marko is excellent developer and always thinking at next level. You can read his earlier comment which created very interesting discussion here: SQL SERVER- IF EXISTS(Select null from table) vs IF EXISTS(Select 1 from table). I must express my special thanks to Marko for sending this best practice for Enumerations in Relational Database. He has really wrote excellent piece here and welcome comments here.
Enumerations in Relational Database
This is a subject which is very basic thing in relational databases but often not very well understood and sometimes badly implemented. There are of course many ways to do this but I concentrate only two cases, one which is “the right way” and one which is definitely wrong way.
The concept
Let’s say we have table Person in our database. Person has properties/fields like Firstname, Lastname, Birthday and so on. Then there’s a field that tells person’s marital status and let’s name it the same way; MaritalStatus.
Now MaritalStatus is an enumeration. In C# I would definitely make it an enumeration with values likes Single, InRelationship, Married, Divorced. Now here comes the problem, SQL doesn’t have enumerations.
The wrong way
This is, in my opinion, absolutely the wrong way to do this. It has one upside though; you’ll see the enumeration’s description instantly when you do simple SELECT query and you don’t have to deal with mysterious values. There’s plenty of downsides too and one would be database fragmentation.
Consider this (I’ve left all indexes and constraints out of the query on purpose).
CREATE TABLE [dbo].[Person] (
[Firstname] NVARCHAR(100),
[Lastname] NVARCHAR(100),
[Birthday] datetime,
[MaritalStatus] NVARCHAR(10)
)
You have nvarchar(20) field in the table that tells the marital status. Obvious problem with this is that what if you create a new value which doesn’t fit into 20 characters? You’ll have to come and alter the table. There are other problems also but I’ll leave those for the reader to think about.
The correct way
Here’s how I’ve done this in many projects. This model still has one problem but it can be alleviated in the application layer or with CHECK constraints if you like.
First I will create a namespace table which tells the name of the enumeration. I will add one row to it too. I’ll write all the indexes and constraints here too.
CREATE TABLE [CodeNamespace] (
[Id] INT IDENTITY(1, 1),
[Name] NVARCHAR(100) NOT NULL,
CONSTRAINT [PK_CodeNamespace] PRIMARY KEY ([Id]),
CONSTRAINT [IXQ_CodeNamespace_Name] UNIQUE NONCLUSTERED ([Name])
)
GO
INSERT INTO [CodeNamespace] SELECT 'MaritalStatus'
GO
Then I create a table that holds the actual values and which reference to namespace table in order to group the values under different namespaces. I’ll add couple of rows here too.
CREATE TABLE [CodeValue] (
[CodeNamespaceId] INT NOT NULL,
[Value] INT NOT NULL,
[Description] NVARCHAR(100) NOT NULL,
[OrderBy] INT,
CONSTRAINT [PK_CodeValue] PRIMARY KEY CLUSTERED ([CodeNamespaceId], [Value]),
CONSTRAINT [FK_CodeValue_CodeNamespace] FOREIGN KEY ([CodeNamespaceId]) REFERENCES [CodeNamespace] ([Id])
)
GO
-- 1 is the 'MaritalStatus' namespace
INSERT INTO [CodeValue] SELECT 1, 1, 'Single', 1
INSERT INTO [CodeValue] SELECT 1, 2, 'In relationship', 2
INSERT INTO [CodeValue] SELECT 1, 3, 'Married', 3
INSERT INTO [CodeValue] SELECT 1, 4, 'Divorced', 4
GO
Now there’s four columns in CodeValue table. CodeNamespaceId tells under which namespace values belongs to. Value tells the enumeration value which is used in Person table (I’ll show how this is done below). Description tells what the value means. You can use this, for example, column in UI’s combo box. OrderBy tells if the values needs to be ordered in some way when displayed in the UI.
And here’s the Person table again now with correct columns. I’ll add one row here to show how enumerations are to be used.
CREATE TABLE [dbo].[Person] (
[Firstname] NVARCHAR(100),
[Lastname] NVARCHAR(100),
[Birthday] datetime,
[MaritalStatus] INT
)
GO
INSERT INTO [Person] SELECT 'Marko', 'Parkkola', '1977-03-04', 3
GO
Now I said earlier that there is one problem with this. MaritalStatus column doesn’t have any database enforced relationship to the CodeValue table so you can enter any value you like into this field. I’ve solved this problem in the application layer by selecting all the values from the CodeValue table and put them into a combobox / dropdownlist (with Value field as value and Description as text) so the end user can’t enter any illegal values; and of course I’ll check the entered value in data access layer also.
I said in the “The wrong way” section that there is one benefit to it. In fact, you can have the same benefit here by using a simple view, which I schema bound so you can even index it if you like.
CREATE VIEW [dbo].[Person_v] WITH SCHEMABINDING
AS
SELECT p.[Firstname], p.[Lastname], p.[BirthDay], c.[Description] MaritalStatus
FROM [dbo].[Person] p
JOIN [dbo].[CodeValue] c ON p.[MaritalStatus] = c.[Value] JOIN [dbo].[CodeNamespace] n ON n.[Id] = c.[CodeNamespaceId] AND n.[Name] = 'MaritalStatus'
GO
-- Select from View
SELECT *
FROM [dbo].[Person_v]
GO
This is excellent write up byMarko Parkkola. Do you have this kind of design setup at your organization? Let us know your opinion.
Reference: Pinal Dave (https://blog.sqlauthority.com)
42 Comments. Leave new
Sound Very Fancy, From what I have understood Marko is talking here all about Normalization.
Nice handy Information.
Thanks for Sharing.
Imran.
Sounds like an extremely complicated solution to a simple problem. Would it not be better to have a MaritalStatus table and a single column referencing to that table?
totally agree
But this is re-usable, you can add multiple lookups to these tables and create views to access the data! very clean solution, 3years late but props!
A very different way of looking at the problem in hand, at the same time my initial thoughts were along the lines of John.
Thank you
I have implemented solutions using a single code table and it has worked well. However, thanks to reading some compelling arguments against using a single table, I now make a separate table for each enumeration.
The most important reason for this is that a databases primary job is to ensure data integrity. If you have all of the codes in one table, there is still a chance that marital status may reference a code that is not appropriate. For instance, if you have the marital status codes and gender codes in a single table, and a mistake is made in data entry or in the application code, you may end up with Marital Status pointing to a code record that returns “Female”.
Therefore, while the “Wrong Way” above is obviously wrong, and the “Correct Way” is better, I believe the best solution, in terms of data quality is to have each enumeration in a separate table (e.g. MaritalStatus, CustomerType, Gender, etc).
I totally agree with Jake M and John McLusky. A relational database should have relations and not a table of unrelated data that ‘magically’ fits if you know about the association. A separate table for each enumeration is the ‘Right’ way.
or having an enum in code… problem solved
I think you have created master table MaritalStatusType
and used this in drop down and saved in person table.
Can you able to give any other example.
Hi all and thanks for your comments.
@John, Ramdas, Jake
There is one solid reason to have single code table and not to split it into separate tables. Maintainability.
In some cases I add special field to CodeNamespace table IsUserEditable which tells, well, if user is able to edit it. And by user I mean application administrator, not the DBA necessarily.
Now with this approach it is rather easy to make administration application/form for this. All you need to do is scan two tables, group them in UI and provide couple of controls to alter the values. I could do this with two datagrids in .NET plus couple of buttons.
With multiple tables you need to write multiple queries or use some other little more complicated method to do this.
Then the data integrity problem. I do like it that DB is able to maintain data integrity but in this case I would leave to the application. Like I said, if you provide dropdownlist for the user, which you populate from the code table (and, of course, you will check the user input in any case) then there is no way you could end up writing bogus values to the table, even when there is no foreign key to prevent it.
@Tobi
If you write enum in code you come to maintenance problems. When I release my project to customer I really don’t want to make another release when ever there is a need for a new enumeration value, even if it brings me revenues. You could use some XML mapping and construct enums dynamically in your code but then there is XML file to maintain.
@Imran
I don’t think this is normalization per se, but yes, I think you could say it’s about normalization :)
Hi Marko,
Thanks for the detailed explanation, i agree with your views on data integrity, which is very crucial.
I would agree with Marko, it soon becomes a maintenance nightmare having a table for each list value combination, is it really worth having a table to hold 3 values, in my experience its not.
In addition lots of applications I have worked upon have user defined lists and list values, as such these need to be accessible to those end users, having a single table and a single UI simplifies this process,
Oh, I forgot one thing. With this approach it is easy to group enumerations if it’s needed (very rarely though). Just add field for this to CodeNamespace table:
ParentNamespace INT REFERENCES CodeNamespace(CodeNamespaceId)
And now you can build nice tree view into the administration form.
There is also other more complex groupings which can be easily achieved with this like populating dropdownlist B based on the selection on dropdownlist A.
Plus there is all kinds of positive things like I once wrote a generic ASP.NET template engine which used reflection and metadata to populate search fields which was then used by ObjectDataSource to get search results and so on. I could have done it other way too but this approach was the easiest I could think of.
Enum types do not belong in the database, as they do not change. A database is a persistence mechanism and really should be avoided as an application integration platform as it couples you to the schema and always leads to maintainability issues. Since a real enum cannot be changed without recompiling the application, simply serializing the integer values should suffice.
Now, if by enum you mean variable lists that can be edited, that’s fine, but that is a totally different concept as you cannot just bind that to an enum in code.
Keeping a copy of a real enum in a “lookup” table of any sorts is duplication that should be avoided.
“Now, if by enum you mean variable lists that can be edited”
Heh. That’s exactly what I meant. Thank you for clearing this up, since my English is not so good :)
Simple problem, explained very well – congratulations. I’ve seen even worse table definitions than the one described in ‘The wrong way’.
And one more thing came to mind: localization.
If we add extra column to CodeValue table (and add this new column to PK):
ALTER TABLE CodeValue ADD LanguageID INT
We can localize all the values simply by (1=English, 2=Finnish):
INSERT INTO CodeValue SELECT 1, 1, ‘Single’, 1, 1
INSERT INTO CodeValue SELECT 1, 1, ‘Naimaton’, 1, 2
INSERT INTO CodeValue SELECT 1, 2, ‘In relationship’, 2, 1
INSERT INTO CodeValue SELECT 1, 2, ‘Suhteessa’, 2, 2
etc.
Now we can easily fetch localized values. I’d probably create a function for this:
CREATE FUNCTION GetValues(@NamespaceID INT, @LanguageID INT)
RETURNS TABLE
AS
RETURN (
SELECT Value, Description
FROM CodeValues
WHERE CodeNamespaceId = @NamespaceID AND LanguageID = @LanguageID
ORDER BY OrderBy
)
Every year a newbie re-discovers this design flaw. It is so common, it has a name “One True Lookup Table” (OTLT) .
The first hint that this is the wrong way to program should be “code_value” — an absurd name for a data element. It can be a “_code” or a “_value”, but it cannot be just a vague generic “code_value” by definition.
The fact that he even mentions using this for the User Interface shows that he is not yet thinking in RDBMS — he mindset is still in procedural code!
I have a whole section in one of my books on this nightmare.
I’m sorry but was you intention to write something constructive or just advertise your book? I’ll gladly take constructive critisism but I don’t like to be called by names.
The problem boils down to this being a compromising solution that is forsaking RDBMS principle in order to make OOP easier. Most database professionals consider data integrity sacred, so a compromise you make that makes formal foreign key constraints impossible is seen as going in the wrong direction. A quick google of OTLT returns a host of reasons as to why this can be a bad approach. Joe Celko’s posting, which he didn’t list here, wasn’t in the top three, but was linked by one of those.
You use Localization as a feature of something easy to add. I consider that indicative of the biggest problem: what happens when something grows beyond being a mere enumeration? What if some of those other Code Namespaces in there aren’t things that are localized? What if one of these enumerations expands into something much bigger that needs a lot more meta-data? Do you add columns to the generic table that aren’t used by everything else, or break it out into a separate table?
Hi,
Thank you. That was a good and constructive comment. I will surely consider the points you make and try to Google around a bit.
I’ve seen solutions to those points you made about data integrity and non-localized values. Allthough I don’t really like them they do seem to work. In one project my DBA colleague wrote stored procs to handle these situations but I think it just lead to even more complex design.
And I must clear up a bit why I brought GUI into this picture. I’m usually the one implementing the system from ground up, form DB to GUI. I try to make things as generic as possible so I can reuse my code. This saves time a lot and in the end, hopefully, reduces the amount of bugs. I do realise now, after thinking this a bit, that this approach doesn’t necessarily work well when there’s whole department working on the same project. At least I need to publish strict interfaces to protect my code & data.
On my project I’ve solved the problem of “the database enforced relationship to the CodeValue table” by adding the calculated column to the data table. In your case it would look like this:
ALTER TABLE [dbo].[Person]
ADD [CodeNamespaceId] AS (1)
ALTER TABLE [dbo].[Person]
ADD CONSTRAINT [FK_Person_CodeValue]
FOREIGN KEY ([CodeNamespaceId], [MaritalStatus]) REFERENCES [CodeValue]
([CodeNamespaceId], [Value])
Hope this could help
Yes, that should work. Thanks for the tip.
If you have multiple codevalues in a single table you just have to think about column and constraint naming a bit.
I am using a similar approach, but I have a few additions.
1) My enum table also allows user defined enumerated values which definitely takes this issue out of the static list scenario and must be kept in the DB not in code. I added an additional column to my table named ‘client’ where (null or 0) means that the enum entry applies universally, but if a client has a unique value for an enumeration he can add it. Like Marko, these values are then used to populate UI choices and in reports. A given client sees the set of universal values + any unique values that have been defined. For example: Sales Regions, where the universal regions may be (Northern, Southern, Eastern, and Western) and assigned code 1-4 respectfully. If a specific user wants to use the add “International” it can be added as a unique client entry with a code of 5. Also if a client wants to call the ‘Western’ region ‘Western US’ the table allows for the overload using the same code (4), but with the unque client ID. In this way any business logic can still apply to code (4), but the user preferred term is displayed in the program and on reports.
2) In addition to the order column that indicates the order in which the list should be displayed. I added an ‘rank’ column to be used if the enum value is used in calculations. For example: Grades, school grades (A..F) may have codes of 1-5, but a rank can be assigned A-90, B-80, etc. Since I keep the rank separate from the order and the value, enum entries can have the the same rank without messing up any listings. This is also useful as clients add new enum entries.
3) Lastly, I added some additional logic and codes to accomidate linked sub-enum lists. These can be displayed as trees or cascaded combo box. This is tricky, but is very useful. For example: (again my Sales Regions), I can define ‘Southern CA’,’Northern CA’, and ‘Pacific NW’ as sub-enum lists associated with the Sales Region ‘Western’. Of course these can be overridden or augmented by the user as well.
I keep all of these definitions out of my ‘code’ which helps maintain consistancy across client platforms (fat, thin, web, mobile, ect.) and pushing out new releases.
I think some of you are confusing and enumeration with a taxonomy.
An enumeration is a list fixed to a release is relational in nature. It is convenient to store these values as numerics for various reasons.
A taxonomy is more complicated and often involves business rules. Taxonomies may have multiple layers and can become quite robust.
I use a enumeration structure for enumerations and build my enumeration library from that structure. Although it is true, that the code does not need this value defined in the database, but for the data to be meaningful to the user, I need to be able to present it to the user in reports. Hence I store this value in the database so I can link to it and generated reports.
In brief my struct look like:
Table: Enumeration (int Enumeration_ID, char namespace)
Table: EnumerationValue (FK Enumeration_ID, PK int EnumerationValue_ID, char enumName, char displayValue)
(Please note the ID is a Primary Key, so you cannot reuse numbers)
Example:
Enumeration (1, ‘MaritalStatus’)
Enumeration (2, ‘Gender’)
Enumeration Value (1, 1, ‘Married’, ‘Married’)
Enumeration Value (1, 2, ‘Single’, ‘Single’)
Enumeration Value (1, 3, ‘SingleParent’, ‘Single Parent’)
Enumeration Value (1, 4, ‘Divorced’, ‘Divorced’)
Enumeration Value (2, 5, ‘Male’, ‘Male’)
Enumeration Value (2, 6, ‘Female’, ‘Female’)
Enumeration Value (2, 7, ‘NotSpecified’, ‘Not Specified’)
This is similar to the authors example with the exception that the enumeration value must be unique in the database, but allows for a simplified join. (Please note: I deal with terabyte size datasets and the propose method costs way too much.
SELECT p.FirstName, p.LastName, E.DisplayValue
FROM
Person p INNER JOIN Enumeration_Value E
ON p.MaritalStatus = E.Enumeration_Value_ID
Hi Marko,
Thanks for the article. When I did a google search regarding this Lookup design almost 90% of the results returned this approach to be a bad practice. I really got confused and unable to make a conclusion on this approach. Based on this article (in the below link) could you please justify the points given under the list of disadvantages so that everyone in the forum gets a clarity.
Thanks
Kris
Hi,
Sorry I’ve forgotten this article totally. I’ll try to answer some of points in the article but I really don’t understand them all. I mean I don’t understand the text. There’s some words I have to even run through translator but I get to learn more English at least!
First I’ll admit that there an issue with referential integrity. You’ll have to handle it with some other way. I would handle it in the application code. (All the DBA purists jump in their chairs now:)
I don’t understand the type issue. In my example all the code columns are INT typed. There’s no need for using other types.
I think “mediocre constraints” refers to situation where you have multiple lookup tables. I’m not really fond of those because of maintainability issues. Instead of one table you have n+1 tables to look after.
I don’t understand the issue with NULLs. Just define the column non-nullable and that’s it. If you need NULL value, insert a special row for this into the lookup table. In fact, this is much more simpler because you don’t have to work around NULL values in application code. (purists grind their teeths:)
Also I didn’t understand the rigidity section. Something to do with hierarchies? I think I covered this in some of my comment above. Sidenote: Just couple of days ago I figured out a nice way to add versioning to the lookup table.
Hidden complexity? If you think your Person table to hold entities, you can think that MaritalStatus is an attribute of that entity. Person table just holds the numeric expression of the attribute defined in the lookup table. It’s like key-value pair. Many times I don’t need to fetch the textual expression of the attribute at all but when I do, I’ll just join the lookup table to the query. I don’t know how one can use UNION with lookup tables.
I haven’t thought about physical issues with large lookup tables. I’ve seen lookup table with couple of thousand rows in it and it have worked perfectly well. Hundreds of thousands of rows sounds really massive!
One reason to use eumeration is to avoid hard coding values all over the place. e.g. extending above example, if you have as Stored Procedure that takes parameter of @MaritalStatus. If this parameter is used in more than one SP, then it would be convenient to have a enumeration like feature in Database.
In my mind, this is a clear shortcoming of SQL Server. User-defined types should permit enumerations, and they should be used for things such as mentioned above which do not change or change very rarely. Sex: Male/Female/Other for example.
But, given that we don’t have that I would only say that relying on your application to enforce data integrity overlooks the fact that stored procedures may be coded to break your data integrity. Your approach is certainly not “wrong”, but for me personally it would be preferable to have a foreign key to a lookup table specifically for the thing you think of as an enum.
Just my $0.02
I have created a medical application for a number of hospitals.
The Patient Master Index table (which contains information about the patient) and the Patient Episode table (which contains information about each visit by a patient to the hospital) between them have about 100 data items which could be handled by enumerations (eg Sex (Yes please or No thanks), Marital Status, Title (eg Mr/Mrs/Ms/Prof/Dr) and many, many others.
The application was written many years ago in using an ISAM database and I didn’t want to allow free-form data entry (because it’s too easy to make spelling errors etc), and I didn’t want to hard code the possibilities into my program code (because the end user needed to be able to edit some of these lookup lists) and I didn’t want to store the whole description (eg married, widowed, divorced, single, de-facto) because it wastes too much space so I chose to store a short alpha code (eg M / W / D / S / F) in the database so that I could usually understand what the code was likely to mean without having to resort to a lookup table.
However, for data validation in the program a Miscellaneous Types table (one row for each type of code) and a Miscellaneous Codes table (one row for each code for each type of code) were used so that the user had to select from a predetermined list and the items in the list were selected using the longer description of the code and then the code stored in the database.
Some codes were marked as SYSTEM and couldn’t be edited but for other codes the administrator could amend the list if required. It only required one maintenance screen to edit ALL of the codes and it was the simplest and most flexible way to do this.
Due to the way the ISAM database worked a single character code (1 byte) used less space than a numeric code (2 bytes) although some of my codes were 2 or 3 characters long as well.
While I can see why some experts are saying – one lookup table per type of code – but that is a nightmare to setup and maintain when you have hundreds of different types of codes and surely ease of use is a primary consideration.