SQL SERVER – Delete Duplicate Rows

I had previously penned down two popular snippets regarding deleting duplicate rows and counting duplicate rows. Today, we will examine another very quick code snippet where we will delete duplicate rows using CTE and ROW_NUMBER() feature of SQL Server 2005 and SQL Server 2008.

This method is improved over the earlier method as it not only uses CTE and ROW_NUMBER, but also demonstrates the power of CTE with DELETE statement. We will have a comprehensive discussion about it later in this article. For now, let us first create a sample table from which we will delete records.

/*Create Table with 7 entries - 3 are duplicate entries */
CREATE TABLE DuplicateRcordTable (Col1 INT, Col2 INT)
INSERT INTO DuplicateRcordTable
SELECT 1, 1
UNION ALL
SELECT 1, 1 --duplicate
UNION ALL
SELECT 1, 1 --duplicate
UNION ALL
SELECT 1, 2
UNION ALL
SELECT 1, 2 --duplicate
UNION ALL
SELECT 1, 3
UNION ALL
SELECT 1, 4
GO

The above table has total 7 records, out of which 3 are duplicate records. Once the duplicates are removed we will have only 4 records left.

/* It should give you 7 rows */
SELECT *
FROM DuplicateRcordTable
GO

SQL SERVER - Delete Duplicate Rows dup1

The most interesting part of this is yet to come. We will use CTE that will re-generate the same table with additional column, which is row number. In our case, we have Col1 and Col2 and both the columns qualify as duplicate rows. It may be a different set of rows for each different query like this. Another point to note here is that once CTE is created DELETE statement can be run on it. We will put a condition here – when we receive more than one rows of record, we will remove the row which is not the first one. When DELETE command is executed over CTE it in fact deletes from the base table used in CTE.

/* Delete Duplicate records */
WITH CTE (COl1,Col2, DuplicateCount)
AS
(
SELECT COl1,Col2,
ROW_NUMBER() OVER(PARTITION BY COl1,Col2 ORDER BY Col1) AS DuplicateCount
FROM DuplicateRcordTable
)
DELETE
FROM CTE
WHERE DuplicateCount > 1
GO

It is apparent that after delete command has been run, we will have only 4 records, which is almost the same result which we would have got with DISTINCT, with this resultset. If we had more than 2 columns and we had to run unique on only two columns, our distinct might have not worked here . In this case, we would have to use above the mentioned method.

/* It should give you Distinct 4 records */
SELECT *
FROM DuplicateRcordTable
GO

SQL SERVER - Delete Duplicate Rows dup2

This method is a breeze and we can use this for SQL Server version 2005 and the later versions.

Watch the view to see the above concept in action:

Reference : Pinal Dave (https://blog.sqlauthority.com)

Duplicate Records, SQL Function, SQL Scripts, SQL Server
Previous Post
SQL SERVER – Interesting Observation – Using sqlcmd From SSMS Query Editor
Next Post
SQL SERVER – Connecting to Server Using Windows Authentication by SQLCMD

Related Posts

146 Comments. Leave new

  • What if I have columns from different Tables, How would I be able to perform a CTE on them to avoid the row duplicates

    Reply
  • I got an error: View or function ‘repAuthPatStaff’ is not updatable because the modification affects multiple base tables.
    when I tried to apply the CTE delete to multiple tables

    Reply
  • Penchal Reddy.U
    June 28, 2011 4:04 pm

    very nice article…..great job….

    Thanks,
    U.Penchal Reddy.

    Reply
  • Nice Answer Pinaldave. This will help to lot of ppl who is looking for good and better solution in database.

    Reply
  • Hi Pinal,

    Could you give me help on the following?
    1. How many parameters that CTE can accepts?
    2. Shall we able to move duplicate data into another table and How to delete duplicate records simultaneously using CTE.? Let me explain my actual scenario. I have a table called Test.

    Test has the following data
    Col1 Col2 Col3
    —— —— ——
    101 abc xyz
    102 aaa bbb
    101 abc xyz
    101 abc xyz
    103 xxx yyy
    101 abc xyz
    101 abc xyz
    101 abc xyz
    101 abc xyz
    101 abc xyz
    :::::::::::::::::::::::::

    I need to store the duplicate records in the table Test_duplicate. Could you tell me how to achieve it with CTE.

    My require output is

    Test:

    Col1 Col2 Col3
    —— —– —–
    101 abc xyz
    102 aaa bbb
    103 xxx yyy

    Test_duplicate:
    Col1 Col2 Col3
    —– —– —–
    101 abc xyz
    101 abc xyz
    101 abc xyz
    101 abc xyz
    101 abc xyz
    101 abc xyz
    101 abc xyz
    ::::::::::::::::::::::::

    Thanks in advance.

    Reply
  • If you’re using 2008, why not use the “undocumented” feature of physloc? Substitute your table for tablename and substitute the column which contains the duplicate values for columnwithdups. Works rather nicely if I do say so myself.

    delete
    from dbo.tablename
    where tablename.%%physloc%%
    NOT IN (Select min(b.%%physloc%%)
    from dbo.tablename b
    group by b.columnwithdups);

    Reply
  • hi, Thanks for the article, it was so useful.
    but i have another problem, similar to this:
    My table has 10 rows, let say R1 to R10
    R1 is the primary key .
    i had some data while is filled in R1 to R5, and some else to R10.

    The problem is: the system by mistake inserted duplicate data in R2-R5 (R1 is PK assuem autonumber) and in one of these duplicates R6 and R7 are also filled.
    I want to remove the duplicates, which their R6,R7 are filled.
    Consider that the R6,R7,… may be filled for other data, and they are ok!
    by this method i could find them and remove the duplicates, but i cannot decide to remove which one! I mean I can not force to remove the duplicates which has R6,R7.
    Please advice.

    Reply
  • Thank you so much.
    I am providing the working script for you to clarify the case.
    Thanks and Regards,

    Reply
  • Ok, here Iam.
    these are the scripts (SQL Server 2005):
    Table creation:
    USE [MyDB]
    GO
    /****** Object: Table [dbo].[tBLtEST] Script Date: 10/03/2011 21:48:04 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO
    CREATE TABLE [dbo].[tBLtEST](
    [R1] [varchar](16) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [R2] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R3] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R4] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R5] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R6] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R7] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R8] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R9] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [R10] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    CONSTRAINT [PK_tBLtEST] PRIMARY KEY CLUSTERED
    (
    [R1] ASC
    )WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
    ) ON [PRIMARY]

    GO
    SET ANSI_PADDING OFF
    ———————————————————————————

    Data Insertion section:

    INSERT INTO dbo.tBLtEST VALUES
    (‘1′,’R2′,’R3′,’R4′,’R5’,NULL,NULL,NULL,NULL,NULL)–NORMAL RECORD
    INSERT INTO dbo.tBLtEST VALUES
    (‘A’,’RR’,’RR3′,’RG4′,’RG5′,’RH6′,’RH7′,’RTJ’,NULL,NULL)–NORMAL RECORD
    INSERT INTO dbo.tBLtEST VALUES
    (‘B’,’RB2′,’RB3′,’RB4′,’RB5′,NULL,’KJL’,’LKJU’,’BBNJ’,’TEST’)–NORMAL RECORD
    INSERT INTO dbo.tBLtEST VALUES
    (‘C’,’R2′,’R3′,’R4′,’R5′,’R6′,’R7′,NULL,NULL,NULL)–DUPLCATE RECORD WITH THE FIRST RECORD
    INSERT INTO dbo.tBLtEST VALUES
    (‘D’,’DR2′,’DR3′,’DR4′,’DR5′,NULL,NULL,NULL,NULL,NULL)
    INSERT INTO dbo.tBLtEST VALUES
    (‘E’,’DR2′,’DR3′,’DR4′,’DR5′,’KK’,’NN’,NULL,NULL,NULL)–DUPLICATE WITH ABOVE RECORD (pk=D)
    INSERT INTO dbo.tBLtEST VALUES
    (‘F’,’RF2′,’RF3′,’FR4′,’FR5′,’FFFF’,’OOO’,NULL,’MMM’,’KKL’)–NORMAL RECORD
    INSERT INTO dbo.tBLtEST VALUES
    (‘G’,’RGG2′,’GGGR3′,’RGGG4′,’GGGR5′,NULL,’GGG’,’GHH’,NULL,NULL)–NORMAL RECORD

    ———————————————————-

    what i need is to remove the duplicates, but always remove the ones which the
    R6 and R7 field are filled.
    here in this sample i want to be sure that the Records C and E are removed.
    hope it is clear now and you can guid me for that.
    PS: please consider that I have Huge amount of data in the DB.
    BRs,

    Reply
  • I think I found it.
    WITH CTE (R2,R3,R4,R5, DuplicateCount)
    AS
    (
    SELECT R2,R3,R4,R5,
    ROW_NUMBER() OVER(PARTITION BY R2,R3,R4,R5 ORDER BY R6,R7) AS DuplicateCount
    FROM tBLtEST
    )
    DELETE
    FROM CTE
    WHERE DuplicateCount > 1
    GO

    Can you just check this and make me sure it is correct?

    Reply
  • This is a nifty use of a CTE. My question is, how does using the CTE compare in terms of performance to using a SELECT DISTINCT?

    The tables I am looking at text import tables. They are all very wide with upwards of 50 columns, all of the columns are varchar(255).

    ( I didn’t create the table but I have to maintain it! )

    Duplicates would only exists if all 50 of the columns were identical.

    The current query is a SELECT DISTINCT, but I wonder whether or not a CTE would be more efficient.

    Maybe if I added a computed column that’s a hash of all the columns, and looked for duplicates based on that column?

    Reply
  • as always.. u r my savior.

    Reply
  • hi sir i want a query in table i want to show all records without top 3 records ?

    any sqlquery please help me…madan

    Reply
    • @madhan

      What do you mean by “top 3”
      SELECT…FROM TABLE
      EXCEPT
      SELECT TOP(3) …FROM TABLE ORDER BY…

      Reply
  • Pinal you are doing a great job man.

    Reply
  • Not Bad

    Reply
  • This is for finding the duplicate rows:

    duplicate rows=case 1-case2

    SELECT Personentityid,COUNT(NAMESNO) AS DUPLICATE INTO DBA_DB.dbo.duplicatevalues FROM DBA_DB.dbo.SpellingVariation
    GROUP BY Personentityid

    case 1:
    select sum(duplicate) from DBA_DB.dbo.duplicatevalues where duplicate>1

    Case 2:
    select Personentityid from DBA_DB.dbo.duplicatevalues where duplicate>1

    Reply
  • We have to insert 30,000 rows every 10 minutes…we do that for about an hour. What is the best way to prevent duplicates?

    a) Use a batch approach like the methods described in this article to remove duplicates
    b) Use a real-time approach by adding a unique index to prevent duplicates when you insert?

    Thanks,
    David

    Reply
  • Man!!….that was very cool, thank dude

    Reply
  • Dude I have to thank you.. have been using your posts very often.. felt compelled to leave a thank you note.. u r doing a gud service to ur fellow professionals….

    Reply
  • hi to all,

    Here i am having a case, Can we avoid duplicates at time while insertion done at two tables and if the query is re-run then also it should not insert the same data

    Reply

Leave a Reply