Removing duplicates from MySQL on a column that's too big to INDEX

Question

I'm trying to remove duplicate rows from a table with millions of rows. The field I want to check for duplicates is too long (it stores URLs) to put a UNIQUE index on. Is there any way to remove the duplicates quickly?

The recommended method for removing duplicates:

DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.id>t2.id AND t1.name=t2.name;

It never seems to finish, though I suppose it might just need a lot of time.

One idea I've heard here is to create an MD5 hash column for indexing and comparison. Is this the recommended route? If so, should I truncate this column for space/speed considerations?

Recommended answer

I may not have thought this through, but it's worth a try:


  • Create a column, md5url, and fill it with the MD5 of the URL (UPDATE table1 SET md5url = MD5(url)).
  • Create a (non-unique) index on md5url; an MD5 hash should be short enough to index.
  • Change your statement to:

DELETE t1 
FROM table1 AS t1 
     JOIN table1 AS t2 
          ON t1.md5url = t2.md5url 
             AND t1.name=t2.name 
             AND t1.id>t2.id;
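
Before running that DELETE, the first two steps could be sketched roughly as below. This is only an illustration under assumptions: the long column is taken to be called url, as in the UPDATE above (the question's own statement calls it name), and idx_md5url is a made-up index name. MD5() in MySQL returns a 32-character hexadecimal string, which is why CHAR(32) is used here.

```sql
-- Step 1: add the hash column and fill it.
-- Assumption: the long column is named `url`; adjust to your schema.
ALTER TABLE table1 ADD COLUMN md5url CHAR(32);
UPDATE table1 SET md5url = MD5(url);

-- Step 2: a plain (non-unique) index, so the self-join can look up
-- candidate duplicates by hash instead of comparing full URLs.
-- `idx_md5url` is a hypothetical name.
CREATE INDEX idx_md5url ON table1 (md5url);
```

On a table with millions of rows the UPDATE and the index build will themselves take a while, so it may be worth trying this on a copy first.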


That way, the JOIN condition works primarily on an index. Only when the indexed column, md5url, matches do we actually compare the URL itself, because with enough URLs it becomes possible for two different URLs to share the same MD5. The third condition is clear: it ensures that, of two identical rows, only the one with the higher id is deleted.
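
On the space/speed part of the question: since the join still compares the full URL, the hash only has to narrow down candidates, so a shorter stored value would cost extra comparisons rather than correctness. One option that keeps all 128 bits while halving the storage is to keep the raw bytes instead of the hex text. This is a sketch, not something the answer proposes; md5url_bin and idx_md5url_bin are hypothetical names, and url is again assumed to be the long column.

```sql
-- Hypothetical column and index names; not part of the original answer.
-- UNHEX() turns the 32 hex characters from MD5() into 16 raw bytes.
ALTER TABLE table1 ADD COLUMN md5url_bin BINARY(16);
UPDATE table1 SET md5url_bin = UNHEX(MD5(url));
CREATE INDEX idx_md5url_bin ON table1 (md5url_bin);
```

The DELETE would then join on md5url_bin in place of md5url, with the other two conditions unchanged.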

I would love to hear whether this works - it makes perfect sense in my head right now ;-)
