MySQL: remove duplicates from a big database quickly
Problem description
I've got a big (over a million rows) MySQL database messed up by duplicates. I think somewhere between 1/4 and 1/2 of the whole database consists of them.
I need to get rid of them quickly (I mean in terms of query execution time).
Here's how it looks:
id (index) | text1 | text2 | text3
The text1 & text2 combination should be unique;
if there are any duplicates, only one row per combination should remain, preferably one with text3 NOT NULL. Example:
1 | abc | def | NULL
2 | abc | def | ghi
3 | abc | def | jkl
4 | aaa | bbb | NULL
5 | aaa | bbb | NULL
...becomes:
1 | abc | def | ghi #(doesn't really matter whether id:2 or id:3 survives)
2 | aaa | bbb | NULL #(if there's no NOT NULL text3, NULL will do)
New ids could be anything; they do not depend on the old table ids.
I've tried things like:
CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;
Or SELECT DISTINCT and other variations.
While they work on small databases, query execution time on mine is just huge (it never actually finished, even after more than 20 minutes).
Is there any faster way to do that? Please help me solve this problem.
I believe this will do it, using ON DUPLICATE KEY UPDATE + IFNULL():
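-- Empty copy of the table, with uniqueness enforced by an index.
create table tmp like yourtable;
alter table tmp add unique (text1, text2);
-- Copy every row; on a (text1, text2) collision, keep the stored text3 unless it is NULL.
insert into tmp select * from yourtable
    on duplicate key update text3=ifnull(text3, values(text3));
-- Swap the tables in one statement, then drop the old data.
rename table yourtable to deleteme, tmp to yourtable;
drop table deleteme;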
This should be much faster than anything that requires GROUP BY, DISTINCT, a subquery, or even ORDER BY. It doesn't even require a filesort, which would kill performance on a large temporary table. It still requires a full scan over the original table, but there's no avoiding that.
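To try the idea on the example rows without a MySQL server, here is a small sketch using Python's built-in sqlite3 module. It is not the MySQL statement itself: SQLite has no ON DUPLICATE KEY UPDATE, so the closest equivalent, its ON CONFLICT ... DO UPDATE upsert (assuming SQLite 3.24 or newer), is swapped in, with COALESCE playing the role of IFNULL. The `dedupe` helper, the column types, and the sample data are assumptions taken from the question's example.

```python
import sqlite3

# Sample rows from the question: (id, text1, text2, text3).
SAMPLE_ROWS = [
    (1, "abc", "def", None),
    (2, "abc", "def", "ghi"),
    (3, "abc", "def", "jkl"),
    (4, "aaa", "bbb", None),
    (5, "aaa", "bbb", None),
]


def dedupe(rows):
    """Copy rows into a table with UNIQUE(text1, text2), merging text3 on collision."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE my_tbl (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT)"
    )
    con.executemany("INSERT INTO my_tbl VALUES (?, ?, ?, ?)", rows)
    # Hypothetical stand-in for "create table tmp like ..." plus the unique index.
    con.execute(
        "CREATE TABLE tmp (id INTEGER PRIMARY KEY, text1 TEXT, text2 TEXT, text3 TEXT,"
        " UNIQUE (text1, text2))"
    )
    # One pass over my_tbl; each collision is resolved through the unique index.
    # "WHERE 1" avoids SQLite's parsing ambiguity between a join's ON and ON CONFLICT.
    con.execute(
        """
        INSERT INTO tmp SELECT id, text1, text2, text3 FROM my_tbl WHERE 1
        ON CONFLICT (text1, text2) DO UPDATE
        SET text3 = COALESCE(text3, excluded.text3)
        """
    )
    result = con.execute("SELECT text1, text2, text3 FROM tmp ORDER BY id").fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in dedupe(SAMPLE_ROWS):
        print(row)
```

Each (text1, text2) pair ends up with exactly one row. Which non-NULL text3 survives depends on the order in which the source rows are scanned, which matches the question's remark that either id 2 or id 3 may win.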