Thursday, 29 September 2011

Removing Duplicate Records

 

So you have duplicate records found within tables with unique ID’s

Here are three methods you may want to consider in selecting/removing those duplicates.

 


Method 1: The unMAX method – just the earliest duplicates

Max will display the largest value of a selected column, in this case your identifier. As a result non duplicated rows will be obtained…  If you negate that result (as shown) you will have only duplicated records:

 

   1:  DELETE
   2:  FROM MyTable
   3:  WHERE ID NOT IN (
   4:       SELECT MAX(ID)
   5:       FROM MyTable
   6:       GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3
   7:  )

 

*Note* since your negating everything but the largest ID, only the smaller ID’s will be removed


Method 2: The unMATCH method – find all duplicates

Here you compare a table with itself matching all desired duplicated information, and un-matching all equivalent Id’s.

   1:  SELECT M1.DuplicateColumn1, M1.DuplicateColumn2, M1.DuplicateColumn3
   2:  FROM MyTable M1
   3:  INNER JOIN MyTable M2
   4:  ON ISNULL(M1.DuplicateColumn1, '') = ISNULL(M2.DuplicateColumn1, '')
   5:  AND ISNULL(M1.DuplicateColumn2, '') = ISNULL(M2.DuplicateColumn2, '')
   6:  AND ISNULL(M1.DuplicateColumn3, '') = ISNULL(M2.DuplicateColumn3, '')
   7:  AND M1.ID <> M2.ID
   8:  ORDER BY M1.DuplicateColumn1, M1.DuplicateColumn2, M1.DuplicateColumn3

*Note* IsNull prevents comparing null values with blank values. A measure of safety 


Method 3: The ID comparison – just the latest duplicates

Literally a combination of method 1 and Method 2. More straight forward, and with more control. Few more lines in your select statement of course!

   1:  DELETE FROM MyTable
   2:  WHERE ID IN (
   3:       SELECT M1.ID
   4:       FROM MyTable M1
   5:       INNER JOIN MyTable M2
   6:       ON ISNULL(M1.DuplicateColumn1, '') = ISNULL(M2.DuplicateColumn1, '')
   7:       AND ISNULL(M1.DuplicateColumn2, '') = ISNULL(M2.DuplicateColumn2, '')
   8:       AND ISNULL(M1.DuplicateColumn3, '') = ISNULL(M2.DuplicateColumn3, '')
   9:       AND M1.ID > M2.ID
  10:  )


FYI

My sources for the above are Pinal Dave and Gordon Bell, thank you!

If you have anything to add please comment.

 

Technorati Tags: ,,

 


Twitter

Page Hits