SQL SERVER – Delete Duplicate Rows

June 23, 2009

I had previously penned down two popular snippets regarding deleting duplicate rows and counting duplicate rows. Today, we will examine another very quick code snippet where we will delete duplicate rows using CTE and ROW_NUMBER() feature of SQL Server 2005 and SQL Server 2008.

This method is improved over the earlier method as it not only uses CTE and ROW_NUMBER, but also demonstrates the power of CTE with DELETE statement. We will have a comprehensive discussion about it later in this article. For now, let us first create a sample table from which we will delete records.

/*Create Table with 7 entries - 3 are duplicate entries */
CREATE TABLE DuplicateRcordTable (Col1 INT, Col2 INT)
INSERT INTO DuplicateRcordTable
SELECT 1, 1
UNION ALL
SELECT 1, 1 --duplicate
UNION ALL
SELECT 1, 1 --duplicate
UNION ALL
SELECT 1, 2
UNION ALL
SELECT 1, 2 --duplicate
UNION ALL
SELECT 1, 3
UNION ALL
SELECT 1, 4
GO

The above table has total 7 records, out of which 3 are duplicate records. Once the duplicates are removed we will have only 4 records left.

/* It should give you 7 rows */
SELECT *
FROM DuplicateRcordTable
GO

SQL SERVER - Delete Duplicate Rows dup1

The most interesting part of this is yet to come. We will use CTE that will re-generate the same table with additional column, which is row number. In our case, we have Col1 and Col2 and both the columns qualify as duplicate rows. It may be a different set of rows for each different query like this. Another point to note here is that once CTE is created DELETE statement can be run on it. We will put a condition here – when we receive more than one rows of record, we will remove the row which is not the first one. When DELETE command is executed over CTE it in fact deletes from the base table used in CTE.

/* Delete Duplicate records */
WITH CTE (COl1,Col2, DuplicateCount)
AS
(
SELECT COl1,Col2,
ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO

It is apparent that after delete command has been run, we will have only 4 records, which is almost the same result which we would have got with DISTINCT, with this resultset. If we had more than 2 columns and we had to run unique on only two columns, our distinct might have not worked here . In this case, we would have to use above the mentioned method.

/* It should give you Distinct 4 records */
SELECT *
FROM DuplicateRcordTable
GO

SQL SERVER - Delete Duplicate Rows dup2

This method is a breeze and we can use this for SQL Server version 2005 and the later versions.

Watch the view to see the above concept in action:

Reference : Pinal Dave (https://blog.sqlauthority.com)

Duplicate Records, SQL Function, SQL Scripts, SQL Server

SQL SERVER – Interesting Observation – Using sqlcmd From SSMS Query Editor

SQL SERVER – Connecting to Server Using Windows Authentication by SQLCMD

146 Comments. Leave new

Montee
June 23, 2011 12:08 am
What if I have columns from different Tables, How would I be able to perform a CTE on them to avoid the row duplicates
Reply
Montee
June 23, 2011 1:47 am
I got an error: View or function ‘repAuthPatStaff’ is not updatable because the modification affects multiple base tables.
when I tried to apply the CTE delete to multiple tables
Reply
Penchal Reddy.U
June 28, 2011 4:04 pm
very nice article…..great job….
Thanks,
U.Penchal Reddy.
Reply
Sivakumar
June 28, 2011 6:26 pm
Nice Answer Pinaldave. This will help to lot of ppl who is looking for good and better solution in database.
Reply
Tarak
August 1, 2011 11:53 pm
Hi Pinal,
Could you give me help on the following?
1. How many parameters that CTE can accepts?
2. Shall we able to move duplicate data into another table and How to delete duplicate records simultaneously using CTE.? Let me explain my actual scenario. I have a table called Test.
Test has the following data
Col1 Col2 Col3
—— —— ——
101 abc xyz
102 aaa bbb
101 abc xyz
101 abc xyz
103 xxx yyy
101 abc xyz
101 abc xyz
101 abc xyz
101 abc xyz
101 abc xyz
:::::::::::::::::::::::::
I need to store the duplicate records in the table Test_duplicate. Could you tell me how to achieve it with CTE.
My require output is
Test:
Col1 Col2 Col3
—— —– —–
101 abc xyz
102 aaa bbb
103 xxx yyy
Test_duplicate:
Col1 Col2 Col3
—– —– —–
101 abc xyz
101 abc xyz
101 abc xyz
101 abc xyz
101 abc xyz
101 abc xyz
101 abc xyz
::::::::::::::::::::::::
Thanks in advance.
Reply
Me
September 15, 2011 6:08 pm
If you’re using 2008, why not use the “undocumented” feature of physloc? Substitute your table for tablename and substitute the column which contains the duplicate values for columnwithdups. Works rather nicely if I do say so myself.
delete
from dbo.tablename
where tablename.%%physloc%%
NOT IN (Select min(b.%%physloc%%)
from dbo.tablename b
group by b.columnwithdups);
Reply
Alireza
October 2, 2011 4:29 pm
hi, Thanks for the article, it was so useful.
but i have another problem, similar to this:
My table has 10 rows, let say R1 to R10
R1 is the primary key .
i had some data while is filled in R1 to R5, and some else to R10.
The problem is: the system by mistake inserted duplicate data in R2-R5 (R1 is PK assuem autonumber) and in one of these duplicates R6 and R7 are also filled.
I want to remove the duplicates, which their R6,R7 are filled.
Consider that the R6,R7,… may be filled for other data, and they are ok!
by this method i could find them and remove the duplicates, but i cannot decide to remove which one! I mean I can not force to remove the duplicates which has R6,R7.
Please advice.
Reply
Alireza
October 3, 2011 12:15 pm
Thank you so much.
I am providing the working script for you to clarify the case.
Thanks and Regards,
Reply
Alireza
October 3, 2011 12:48 pm
Ok, here Iam.
these are the scripts (SQL Server 2005):
Table creation:
USE [MyDB]
GO
/****** Object: Table [dbo].[tBLtEST] Script Date: 10/03/2011 21:48:04 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[tBLtEST](
[R1] [varchar](16) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[R2] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R3] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R4] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R5] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R6] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R7] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R8] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R9] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[R10] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
CONSTRAINT [PK_tBLtEST] PRIMARY KEY CLUSTERED
(
[R1] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
———————————————————————————
Data Insertion section:
INSERT INTO dbo.tBLtEST VALUES
(‘1′,’R2′,’R3′,’R4′,’R5’,NULL,NULL,NULL,NULL,NULL)–NORMAL RECORD
INSERT INTO dbo.tBLtEST VALUES
(‘A’,’RR’,’RR3′,’RG4′,’RG5′,’RH6′,’RH7′,’RTJ’,NULL,NULL)–NORMAL RECORD
INSERT INTO dbo.tBLtEST VALUES
(‘B’,’RB2′,’RB3′,’RB4′,’RB5′,NULL,’KJL’,’LKJU’,’BBNJ’,’TEST’)–NORMAL RECORD
INSERT INTO dbo.tBLtEST VALUES
(‘C’,’R2′,’R3′,’R4′,’R5′,’R6′,’R7′,NULL,NULL,NULL)–DUPLCATE RECORD WITH THE FIRST RECORD
INSERT INTO dbo.tBLtEST VALUES
(‘D’,’DR2′,’DR3′,’DR4′,’DR5′,NULL,NULL,NULL,NULL,NULL)
INSERT INTO dbo.tBLtEST VALUES
(‘E’,’DR2′,’DR3′,’DR4′,’DR5′,’KK’,’NN’,NULL,NULL,NULL)–DUPLICATE WITH ABOVE RECORD (pk=D)
INSERT INTO dbo.tBLtEST VALUES
(‘F’,’RF2′,’RF3′,’FR4′,’FR5′,’FFFF’,’OOO’,NULL,’MMM’,’KKL’)–NORMAL RECORD
INSERT INTO dbo.tBLtEST VALUES
(‘G’,’RGG2′,’GGGR3′,’RGGG4′,’GGGR5′,NULL,’GGG’,’GHH’,NULL,NULL)–NORMAL RECORD
———————————————————-
what i need is to remove the duplicates, but always remove the ones which the
R6 and R7 field are filled.
here in this sample i want to be sure that the Records C and E are removed.
hope it is clear now and you can guid me for that.
PS: please consider that I have Huge amount of data in the DB.
BRs,
Reply
Alireza
October 3, 2011 1:17 pm
I think I found it.
WITH CTE (R2,R3,R4,R5, DuplicateCount)
AS
(
SELECT R2,R3,R4,R5,
ROW_NUMBER() OVER(PARTITION BY R2,R3,R4,R5 ORDER BY R6,R7) AS DuplicateCount
FROM tBLtEST
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO
Can you just check this and make me sure it is correct?
Reply
ChrisCarson
October 27, 2011 3:32 am
This is a nifty use of a CTE. My question is, how does using the CTE compare in terms of performance to using a SELECT DISTINCT?
The tables I am looking at text import tables. They are all very wide with upwards of 50 columns, all of the columns are varchar(255).
( I didn’t create the table but I have to maintain it! )
Duplicates would only exists if all 50 of the columns were identical.
The current query is a SELECT DISTINCT, but I wonder whether or not a CTE would be more efficient.
Maybe if I added a computed column that’s a hash of all the columns, and looked for duplicates based on that column?
Reply
Developer Mike
November 10, 2011 12:37 am
as always.. u r my savior.
Reply
madhan
November 16, 2011 1:38 pm
hi sir i want a query in table i want to show all records without top 3 records ?
any sqlquery please help me…madan
Reply
- Brian Tkatch
  November 16, 2011 7:57 pm
  @madhan
  What do you mean by “top 3”
  SELECT…FROM TABLE
  EXCEPT
  SELECT TOP(3) …FROM TABLE ORDER BY…
  Reply
ramprasad
November 29, 2011 1:00 am
Pinal you are doing a great job man.
Reply
Phani
December 9, 2011 5:54 pm
Not Bad
Reply
ajay babu
January 27, 2012 4:47 pm
This is for finding the duplicate rows:
duplicate rows=case 1-case2
SELECT Personentityid,COUNT(NAMESNO) AS DUPLICATE INTO DBA_DB.dbo.duplicatevalues FROM DBA_DB.dbo.SpellingVariation
GROUP BY Personentityid
case 1:
select sum(duplicate) from DBA_DB.dbo.duplicatevalues where duplicate>1
Case 2:
select Personentityid from DBA_DB.dbo.duplicatevalues where duplicate>1
Reply
David Hardwick
February 7, 2012 11:01 pm
We have to insert 30,000 rows every 10 minutes…we do that for about an hour. What is the best way to prevent duplicates?
a) Use a batch approach like the methods described in this article to remove duplicates
b) Use a real-time approach by adding a unique index to prevent duplicates when you insert?
Thanks,
David
Reply
ftj21
February 14, 2012 2:38 am
Man!!….that was very cool, thank dude
Reply
Balu
February 16, 2012 4:49 am
Dude I have to thank you.. have been using your posts very often.. felt compelled to leave a thank you note.. u r doing a gud service to ur fellow professionals….
Reply
suresh
February 20, 2012 11:46 am
hi to all,
Here i am having a case, Can we avoid duplicates at time while insertion done at two tables and if the query is re-run then also it should not insert the same data
Reply