oracle - Finding duplicate values in table (and getting their pk's) -
i've problem selecting values i've simplefied samples below. basically, have table this:
create table sample_table ( pk_id number, business_id number ) now business_id's in table duplicates, , need know pk's of records.
let's assume (further) build , fill table this:
alter table sample_table add ( constraint sample_table_pk primary key (pk_id)); create sequence sample_sequence; create trigger sample_trigger before insert on sample_table each row begin :new.pk_id := sample_sequence.nextval; end; insert sample_table (business_id) values (1000); insert sample_table (business_id) values (1001); insert sample_table (business_id) values (1002); insert sample_table (business_id) values (1003); insert sample_table (business_id) values (1003); insert sample_table (business_id) values (1004); now figuring out business_id's duplicate easy:
select business_id, count (business_id) sample_table group business_id having count (business_id) > 1; but not want business_id's, want pk_id's.
i can them using above query subquery:
select * sample_table business_id in ( select business_id sample_table group business_id having count (business_id) > 1); or using count ( * ) on partition subquery factoring
with q (select business_id, count ( * ) on (partition business_id) totalcount sample_table) select * q q.totalcount > 1 but both of them make query pretty slow (the 1 sample work ok, when work production data of 500.000 rows, performance not great) wondering if there's nicer way this.
as stands table , pk index, first query:
select * sample_table business_id in ( select business_id sample_table group business_id having count (business_id) > 1); will need full table scan evaluate subquery , main query need full scan given list of business_ids found (the pk index won't of use this.) you'll see plan this:
-----------------------------------------------... | id | operation | name | ... -----------------------------------------------... | 0 | select statement | | ... |* 1 | hash join right semi | | ... | 2 | view | vw_nso_1 | ... |* 3 | filter | | ... | 4 | hash group | | ... | 5 | table access full| sample_table | ... | 6 | table access full | sample_table | ... -----------------------------------------------... predicate information (identified operation id): --------------------------------------------------- 1 - access("business_id"="business_id") 3 - filter(count(*)>1) throw unique index on business_id , pk_id (in order) , should able forgo 2nd table scan , use index duplicate business_ids. (the first table scan unavoidable has check of rows possible duplication.) composite index, oracle can business_id , grab pk_id @ same time without having jump on table.
-------------------------------------------------... | id | operation | name |... -------------------------------------------------... | 0 | select statement | |... | 1 | nested loops | |... | 2 | view | vw_nso_1 |... |* 3 | filter | |... | 4 | hash group | |... | 5 | table access full| sample_table |... |* 6 | index range scan | business_id_idx |... -------------------------------------------------... predicate information (identified operation id): --------------------------------------------------- 3 - filter(count(*)>1) 6 - access("business_id"="business_id") this should work pretty if duplicates exception. if, worst case scenario, business_ids duplicates, index lookups ugly.
you try bit funkier this:
select business_id, listagg(pk_id) within group (order pk_id) sample_table group business_id having count(*) > 1 now 1 full table scan, pk_ids glued on same line.
Comments
Post a Comment