Earlier this week I asked a question where I asked how to Swap Values of the column without using CASE Statement. Read here: A Puzzle – Swap Value of Column Without Case Statement,there were more than 50 solutions proposed in the comment. There were many creative solutions. I have mentioned my personal favorite (different ones) here: Solution of Puzzle – Swap Value of Column Without Case Statement.
However, I received lots of questions regarding one of the Solution by SIJIN KUMAR V P. He has used the function SOUNDEX in his solution. The request was to explain how SOUNDEX and DIFFERENCE works. Well, there are pretty decent documentations provided over here SOUNDEX function and DIFFERENCE over on MSDN and if I attempt to explain this function I will end up writing the same details which are available on MSDN. Instead of writing theory, we will try to learn this function by using a couple of simple puzzles. You try to solve the puzzles using the MSDN and see if you can learn something very quickly.
In simple words – SOUNDEX converts an alphanumeric string to a four-character code to find similar-sounding words or names. The first character of the code is the first character of character_expression and the second through fourth characters of the code are numbers that represent the letters in the expression. Vowels incharacter_expression are ignored unless they are the first letter of the string. DIFFERENCE function returns an integer value. The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.
Learning Puzzle 1:
Now let us run following four queries and observe its output.
SELECT SOUNDEX('SQLAuthority') SdxValue
SELECT SOUNDEX('SLTR') SdxValue
SELECT SOUNDEX('SaLaTaRa') SdxValue
SELECT SOUNDEX('SaLaTaRaM') SdxValue
When you look at the result set all the four values are same.
The reason for all the values to be same is as for SQL Server SOUNDEX function all the four strings are similarly sounding string.
Learning Puzzle 2:
Now let us run following five queries and observe its output.
SELECT DIFFERENCE (SOUNDEX('SLTR'),SOUNDEX('SQLAuthority'))
SELECT DIFFERENCE (SOUNDEX('TH'),SOUNDEX('SQLAuthority'))
SELECT DIFFERENCE ('SQLAuthority',SOUNDEX('SQLAuthority'))
SELECT DIFFERENCE ('SLTR',SOUNDEX('SQLAuthority'))
SELECT DIFFERENCE ('SLTR','SQLAuthority')
When you look at the result set you will get the result in the ranges from 1 to 4.
Here is how it works if your result is 0 which means absolutely not relevant to each other and if your result is 1 which means the results are relevant to each other.
Have you ever used above two functions in your business need or on production server? If yes, would you please leave a comment with use cases. I believe it will be beneficial to everyone.
Reference: Pinal Dave (https://blog.sqlauthority.com)
11 Comments. Leave new
great job buddy nice post
In my machine,value of SELECT SOUNDEX(‘SQLAuthority’) SdxValue is S243 and rest are S436.WHy ?
I think we can use SOUNDEX and DIFFERENCE in search engine.There can be value in database which “Like operator” ignore,in that scenerio we can use SOUNDEX and DIFFERENCE.
For example, if i search “Lebron Jams” and the database has values “Lebron James”, “Lebron Jim”, “Michael James”, etc , the results that system returns should be ordered by similarity between strings, i.e, the results should be “Lebron James”, “Lebron Jim” and “Michael James”.
It is hard to implement this by using sql’s ‘like’.
I just do a test and found that value of SELECT SOUNDEX(‘Michael James’) in SQL SERVER DB is M240, but why the value in ORACLE DB is M242
Like Kumar stated, I have the SOUNDEX() function running on a production website. It is used in a public search function for people’s names. Works very well and was a breeze to implement once I knew of its existence. Would have been a big hassle to implement otherwise.
Nice, your posts are very informative. Kudos to the your course on Pluralsight. Looking forward for more videos,posts and insights on database design patterns
also see this link https://www.archives.gov/research/census/soundex.html
Dear Pinal,
there are invalid information about comparing soundex value in your article.
first, try to run these sql statement
SELECT
SOUNDEX(‘joanharimau’),
SOUNDEX(‘jhonsayang’),
DIFFERENCE(SOUNDEX(‘joanharimau’),SOUNDEX(‘jhonsayang’)),
DIFFERENCE(‘joanharimau’,’jhonsayang’)
the value will return
filed1 | field2 | field3 | field4
J565 J525 4 3
DIFFERENCE can only compare character expression. not soundex value.
and the return value of DIFFERENCE is also incorrect.
The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.
https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms188753(v=sql.105)
https://support.microsoft.com/en-us/help/100365
Thank you for your kind note. I will test everything and make changes as required.
Pinal,
I’m trying to understand how the DIFFERENCE function works. I am seeing strange results.
DECLARE @t VARCHAR(20), @s VARCHAR(20)
SET @t = ‘SDRF’
SET @s = ‘FRDS’
SELECT SOUNDEX(@t), SOUNDEX(@s),DIFFERENCE(@t,@s)
Gives:
S361, F632, 1 Yet, neither soundex value has a common digit in sequence.
DECLARE @t VARCHAR(20), @s VARCHAR(20)
SET @t = ‘SDRF’
SET @s = ‘SFRD’
SELECT SOUNDEX(@t), SOUNDEX(@s),DIFFERENCE(@t,@s)
Gives:
S361, S163, 2 Which makes sense, since they have the S and 6 in common.
DECLARE @t VARCHAR(20), @s VARCHAR(20)
SET @t = ‘SDRF’
SET @s = ‘SFFD’
SELECT SOUNDEX(@t), SOUNDEX(@s),DIFFERENCE(@t,@s)
Gives:
S361, S130, 2 First digit in common, but no others. ????
DECLARE @t VARCHAR(20), @s VARCHAR(20)
SET @t = ‘HELPER’
SET @s = ‘QUIZ’
SELECT SOUNDEX(@t), SOUNDEX(@s),DIFFERENCE(@t,@s)
Gives:
H416, Q200, 0 Which is what one would expect.
I’m confused by the less than consistent results. I’m sure I’m missing some key piece in the puzzle. Do you understand how this function works?
Now I am also confused.
I find that DIFFERENCE is a very “unfamiliar” function to SQL writers as it gives what feels like approximate answers in some way.. To me it has been a very powerful tool for “washing” internal registers against public data.. It is a great tool for separating “similar” Fields that can be updated automatically from totally wrong information that needs manual revision :) More of a practical approach…