MySQL: remove duplicates from a big database quickly


Problem Description

I've got a big (> 1 million rows) MySQL database messed up by duplicates. I think they could make up from 1/4 to 1/2 of the whole db. I need to get rid of them quickly (I mean query execution time). Here's how it looks:

id (index) | text1 | text2 | text3

The text1 & text2 combination should be unique; if there are any duplicates, only one combination, with text3 NOT NULL, should remain. Example:

1 | abc | def | NULL
2 | abc | def | ghi
3 | abc | def | jkl
4 | aaa | bbb | NULL
5 | aaa | bbb | NULL

...becomes:

1 | abc | def | ghi   #(it doesn't really matter whether id:2 or id:3 survives)
2 | aaa | bbb | NULL  #(if there is no NOT NULL text3, NULL will do)

The new ids could be anything; they do not depend on the old table's ids.

I've tried things like:

CREATE TABLE tmp SELECT text1, text2, text3
FROM my_tbl
GROUP BY text1, text2;
DROP TABLE my_tbl;
ALTER TABLE tmp RENAME TO my_tbl;

Or SELECT DISTINCT and other variations.

While these work on small databases, query execution time on mine is just huge (it never got to the end, actually; > 20 min).

Is there any faster way to do that? Please help me solve this problem.

Solution

I believe this will do it, using ON DUPLICATE KEY + IFNULL():

-- empty copy with the same structure as the original table
create table tmp like yourtable;

-- the unique key on (text1, text2) is what triggers the ON DUPLICATE KEY handling below
alter table tmp add unique (text1, text2);

-- copy rows over; when a (text1, text2) pair repeats, a non-NULL text3 wins if one exists
insert into tmp select * from yourtable
    on duplicate key update text3 = ifnull(text3, values(text3));

-- swap the tables, then drop the old one
rename table yourtable to deleteme, tmp to yourtable;

drop table deleteme;

This should be much faster than anything that requires GROUP BY, DISTINCT, a subquery, or even ORDER BY. It doesn't even require a filesort, which would kill performance on a large temporary table. It will still require a full scan over the original table, but there's no avoiding that.
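As a quick sanity check after the swap (a minimal sketch, reusing the yourtable / text1 / text2 names from the answer), the following query should return no rows if the de-duplication worked:

-- list any (text1, text2) pair that still appears more than once
select text1, text2, count(*) as cnt
from yourtable
group by text1, text2
having cnt > 1;

Since this runs a full GROUP BY over the whole table, treat it as a one-off verification rather than something to run routinely on a table this size.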

