Finding duplicates with RANK
If you’ve made the mistake of forgetting a primary key on your table, it can be frustrating to find a way to delete all of the duplicate rows without deleting the initial instances.
delete from $table_name where rowid in ( select "rowid" from (select "rowid", rank_n from (select rank() over (partition by $primary_key order by rowid) rank_n, rowid as "rowid" from $table_name where $primary_key in (select $primary_key from $table_name group by $all_columns having count(*) > 1 ) ) ) where rank_n > 1 )
This query selects all of the ‘extra’ rowids and removes them. It is especially designed for limiting the query scans to only those records which have duplicates, which is useful if there’s only a subset of the table that you are dealing with. If you want to improve its efficiency for a table with a high percentage of duplicates, simply remove the inside where clause. In a sample test with 1% duplicates and 233 000 rows, the query took 22 seconds without the subquery, and 18 seconds with it.
* NOTE: This query may not work in Oracle versions before 8i.
A quick way to create duplicates:
create table $temp_table as select * from $table_name sample(10);
insert into $table_name select * from $temp_table;
Alternative methods
GROUP BY method
DELETE FROM $table_name
WHERE rowid not in
(SELECT MIN(rowid)
FROM $table_name
GROUP BY $all_columns;
Join with self method
SELECT
rowid
FROM
$table_name A
WHERE
rowid >
(SELECT min(rowid) FROM $table_name B
WHERE
A.$column1 = B.$column1
and A.$column2 = B.$column2
......
);
Comparison
In the test case, I used a table with 233 000 rows created with a 1% duplication factor. You’ll notice that I interrupted the execution of the ‘join with self’ because I didn’t have the patience to let it run to completion.
The following is a TKPROF output of the queries:
TKPROF: Release 8.1.7.4.0 - Production on Tue May 23 14:47:43 2006 (c) Copyright 2000 Oracle Corporation. All rights reserved. Trace file: flpdt_ora_28183.trc Sort options: default ******************************************************************************** select "rowid" from (select "rowid", rank_n from (select rank() over (partition by token_id order by rowid) rank_n, rowid as "rowid" from $schema.$table_name where token_id in (select token_id from $schema.$table_name group by tournament_id, nickname, bank, place, prize_id, userid, token_id having count(*) > :"SYS_B_0" ) ) ) where rank_n > :"SYS_B_1" call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 3 0.00 0.01 0 0 0 0 Execute 3 0.00 0.00 0 0 0 0 Fetch 315 29.98 73.11 14539 7578 94 4662 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 321 29.98 73.12 14539 7578 94 4662 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: SYS Rows Row Source Operation ------- --------------------------------------------------- 0 VIEW 0 WINDOW SORT 0 MERGE JOIN 230731 SORT JOIN 230730 TABLE ACCESS FULL $table_name 0 SORT JOIN 0 VIEW VW_NSO_1 0 SORT UNIQUE 0 FILTER 230731 SORT GROUP BY 230730 TABLE ACCESS FULL $table_name ******************************************************************************** select "rowid" from (select "rowid", rank_n from (select rank() over (partition by token_id order by rowid) rank_n, rowid as "rowid" from $schema.$table_name ) ) where rank_n > :"SYS_B_0" call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.03 0 0 0 0 Execute 1 0.00 0.00 0 0 0 0 Fetch 157 7.67 15.47 2635 1257 15 2337 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 159 7.67 15.50 2635 1257 15 2337 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: SYS ******************************************************************************** SELECT MIN(rowid) FROM $schema.$table_name GROUP BY tournament_id, nickname, bank, place, prize_id, userid, token_id having count(*) > :"SYS_B_0" call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 3 0.00 0.00 0 0 0 0 Execute 3 0.00 0.00 0 0 0 0 Fetch 315 16.74 41.81 7832 3897 53 4662 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 321 16.74 41.81 7832 3897 53 4662 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: SYS Rows Row Source Operation ------- --------------------------------------------------- 0 FILTER 230731 SORT GROUP BY 230730 TABLE ACCESS FULL $table_name ******************************************************************************** SELECT rowid FROM $schema.$table_name A WHERE rowid > (SELECT min(rowid) FROM $schema.$table_name B WHERE B.tournament_id = A.tournament_id and B.nickname = A.nickname and B.bank = A.bank and B.place = A.place and B.prize_id = A.prize_id and B.userid = A.userid and B.token_id = A.token_id ) call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 3 0.02 0.01 0 0 0 0 Execute 3 0.00 0.00 0 0 0 0 Fetch 3 153.83 213.12 933 436118 2104 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 9 153.85 213.13 933 436118 2104 0 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: SYS Rows Row Source Operation ------- --------------------------------------------------- 0 FILTER 30 TABLE ACCESS FULL $table_name 58 SORT AGGREGATE 3 TABLE ACCESS FULL $table_name
5 Comments. Leave new
Why not simply
DELETE table_name
WHERE rowid IN
( SELECT LEAD(rowid) OVER (PARTITION BY pk ORDER BY whatever)
FROM table_name );
I like using the LEAD function, and it works well for removing duplicates, but you need to be sure that every row has exactly one duplicate. In my test scenario, it performed poorly because I had only a subset of duplicates. In such systems, you would need to narrow the results to only problem records and also factor in the possibility of one record having multiple duplicates.
Hi,
Is it possible to delete alternate rows from a table?
If yes, can some one send me the query?
Thanks
Abhijeet.
Select alteranate rows
select * from Table where rowid%2=0;
[…] April 13, 2010 by urisnotaword https://www.pythian.com/news/209/oracle-deleting-duplicate-rows-efficiently/ […]